My CHI. My Future. Out-of-School Program Visualization & Modeling Project

Sep 11, 2024

—

Interactive data visualization and machine learning modeling project using My CHI. My Future. out-of-school program initiative data

Check out my GitHub repository for more information and source code of the project!

Original collaborative GitHub repository (unedited)

Project Overview

My CHI. My Future. (MCMF) is a City of Chicago youth initiative, launched in May 2020, that connects young people ages 0 to 24 with out-of-school programs. It aims to build a city-wide community network in which nearly 350 organizations and the city's youth can easily discover and cultivate passions and talents, and in which young people can strengthen and center their identity and voice.

The goal of this project is to visualize the equity of opportunity for the City of Chicago’s youth to have access to and participate in local out-of-school programs across the city’s diverse neighborhoods and communities.

From the data we have, we want to understand how equitable the youth opportunity landscape is across geography by age and category. Do youth in different communities/regions have the same access to similar types of programs?

Since online programs technically serve all geographic regions of Chicago, we will separately analyze online programs and compare the results with those from in-person programs. However, we do acknowledge that the engagement and experiences vary between the different program types.

We also aim to use machine learning modeling and time series analysis to assist the MCMF initiative in accurately and efficiently categorizing their out-of-school programs and forecasting program count.

Data

My CHI. My Future. has over 350 providers, including the Park District and Public Library, that each offer thousands of community programs. As a result, the database is large, with 56 variables and over 80,000 observations.

Providers are required to specify information about their opportunities, which include the program name, description, organization name, minimum age, maximum age, meeting type, start and end date, and program price. Other variables also include category, capacity, URL, free food available, etc.

The data organizes these opportunities into 4 types:

Programs: lasting five days to a year or recurring
Events: four days or less
Jobs: work and career, paid opportunity
Resources: longer than 1 year
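
The duration rules above can be sketched as a small function. This is only an illustration: the function name, the inclusive day count, and the order in which the rules are checked (jobs first, then recurring opportunities) are my assumptions, not MCMF's documented logic.

```python
from datetime import date


def opportunity_type(start: date, end: date,
                     is_job: bool = False, recurring: bool = False) -> str:
    """Assign one of the four opportunity types from the duration rules above.

    The precedence of the checks and the inclusive day count are assumptions.
    """
    if is_job:
        return "Job"
    if recurring:
        return "Program"
    days = (end - start).days + 1  # assumption: start and end day both count
    if days <= 4:
        return "Event"
    if days <= 365:
        return "Program"
    return "Resource"
```

For example, a three-day workshop would come out as an Event, while a ten-week course would come out as a Program.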

There are also 22 categories that the opportunities are divided into. For our visualization and analysis, we combined them into 4 general categories:

Academic
Community Service
Leisure & Arts
Professional Skill Building

We acquired the geospatial data for Chicago communities from the City of Chicago’s Open Data Portal:
https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6

Exploratory Data Analysis (EDA)

Data Wrangling

The data required extensive cleaning and wrangling. We removed many unhelpful variables and variables with extreme missingness. We also dropped observations with a minimum age above 24 (outside our target audience), face-to-face observations without address and coordinate data (impossible to map), and non-Chicago observations. We then created new variables: a general category variable, minimum and maximum grade variables derived from minimum and maximum age, and a priority community factor variable. We used geospatial data from the City of Chicago's data portal to impute missing community values and to enable mapping. Finally, we removed duplicate observations.
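
The row filters described above might look roughly like the following pandas sketch. Every column name and the "face_to_face" label are hypothetical placeholders, since the real schema is not shown here. I also assume that an in-person row is dropped only when it has neither an address nor coordinates.

```python
import pandas as pd


def clean_programs(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row filters described above. All column names are hypothetical."""
    # Drop opportunities whose minimum age is above the 0-24 target range.
    df = df[~(df["min_age"] > 24)]

    # In-person rows need an address or coordinates to be placed on a map.
    in_person = df["meeting_type"] == "face_to_face"
    has_location = df["address"].notna() | (
        df["latitude"].notna() & df["longitude"].notna()
    )
    df = df[~in_person | has_location]

    # Keep Chicago rows only (assumes a free-text city column).
    df = df[df["city"].str.strip().str.lower().eq("chicago")]

    # Remove exact duplicate observations.
    return df.drop_duplicates().reset_index(drop=True)
```

Variable creation and the spatial imputation of community values are left out of this sketch because they depend on the actual schema and boundary files.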

Visualizations

The following are some of the visualizations I created during EDA.

Interactive Visualization App

Link: https://2r7ec0-emma-yu.shinyapps.io/mcmf_map/

The interactive visual map allows the user to click through a variety of options to see how MCMF programs are distributed across Chicago’s communities. The following images show a few ways users can play around with the map.

Selecting program category

Selecting grade level range

Highlighting MCMF priority communities

Text Classification Model

Techniques

The following are techniques I implemented in my modeling to improve performance:

Text pre-processing
- Text cleaning
- Stop word removal
- Stemming
- Frequent token removal
- TF-IDF vectorization
Grid search via stratified k-fold cross validation
Model stacking
- Logistic regression
- Multinomial Naive-Bayes
- Random forest
- Support vector machine (SVM)
- K-nearest neighbors
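
To show how these pieces can fit together, here is a minimal scikit-learn sketch of a TF-IDF pipeline with a stacked classifier, tuned by grid search over stratified folds. It is not the project's actual code. The hyperparameters, the parameter grid, the logistic regression meta-learner, and the function name build_search are all assumptions, and stemming is left out because it needs an extra library such as NLTK.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_search(n_splits: int = 5) -> GridSearchCV:
    """Grid search over a TF-IDF + stacking pipeline (illustrative settings only)."""
    base_learners = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", LinearSVC()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
    pipeline = Pipeline([
        # lowercase and stop_words handle basic cleaning;
        # max_df drops tokens that appear in more than 90% of documents.
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", max_df=0.9)),
        ("stack", StackingClassifier(
            estimators=base_learners,
            final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
            cv=n_splits,
        )),
    ])
    param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)]}  # hypothetical grid
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return GridSearchCV(pipeline, param_grid, cv=folds, scoring="f1_macro")
```

Calling build_search().fit(descriptions, categories) would then tune and refit the whole pipeline, so the vectorizer is always fit only on the training folds.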

Results

1. Evaluation Using Testing Data

Metrics	Values
Accuracy	0.885
F1-score	0.733
Precision	0.811
Recall	0.680
ROC AUC	0.945
Hamming Loss	0.115

All of the performance metrics fall in the acceptable-to-good range, indicating that the model is adequate at categorizing MCMF programs from their descriptions. Recall is the lowest metric, most likely because of class imbalance in the data.
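
As a rough guide to how metrics like these are computed, the sketch below uses scikit-learn on a made-up set of labels. Macro averaging is my assumption, since the averaging method used for the table is not stated. Note that for single-label classification Hamming loss is simply 1 minus accuracy, which matches the table (0.885 + 0.115 = 1).

```python
from sklearn.metrics import (
    accuracy_score, f1_score, hamming_loss, precision_score, recall_score,
)


def summarize(y_true, y_pred) -> dict:
    """Classification metrics with macro averaging (an assumed choice)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "hamming_loss": hamming_loss(y_true, y_pred),
    }


# Toy, imbalanced labels: the smallest class is always mistaken for the largest.
y_true = ["Leisure & Arts"] * 6 + ["Academic"] * 2 + ["Community Service"] * 2
y_pred = ["Leisure & Arts"] * 6 + ["Academic", "Leisure & Arts"] + ["Leisure & Arts"] * 2
scores = summarize(y_true, y_pred)
```

In this toy case accuracy stays fairly high while macro recall drops, because each class counts equally in the macro average no matter how small it is.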

2. Evaluation Using Updated Data from 2024

Metrics	Values
Accuracy	0.848
F1-score	0.639
Precision	0.739
Recall	0.595
ROC AUC	0.901
Hamming Loss	0.152

3. Evaluation Using Manually-Made Data

I hand-wrote descriptions of imaginary after-school programs to see whether the model could correctly predict their categories. The True column shows the category I assigned to each program and the Predicted column shows the category assigned by the model.

True	Predicted
Professional Skill Building	Professional Skill Building
Leisure and Arts	Leisure and Arts
Community Service	Leisure and Arts
Academics	Academics
Professional Skill Building	Professional Skill Building
Leisure and Arts	Leisure and Arts
Community Service	Leisure and Arts
Academics	Academics

The model correctly classified six of the eight hand-written descriptions, so it handles completely new data reasonably well, but it assigned both Community Service programs to Leisure and Arts. This could be because of the imbalance in the training dataset: Community Service was the smallest category in the training data while Leisure and Arts was the largest.

Next Steps

To improve performance, I could implement text augmentation (synonym replacement, random insertion/deletion, text paraphrasing, etc.). I could also experiment with a hybrid of random over/undersampling and class weighting. I tried both techniques separately and saw a decrease in performance (most likely due to overfitting), but I think an appropriate combination of the two could improve performance. Increasing the number of TF-IDF features has been shown to increase performance, and if I had the computing power, that is another strategy I would experiment with. However, I would have to be careful not to overfit the training data.
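
One possible shape for the hybrid idea is sketched below: oversample the small classes only part of the way toward the largest class, then let class weights cover the remaining gap. This is a sketch of an untested idea, not something I ran on the MCMF data. The function names and the target_ratio parameter are hypothetical. The weight formula is the usual inverse-frequency heuristic, n_samples / (n_classes * class_count).

```python
import numpy as np


def partial_oversample(X, y, target_ratio=0.5, seed=0):
    """Randomly oversample each class up to target_ratio * (largest class size)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = int(target_ratio * counts.max())
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        if count < target:
            # Sample with replacement to top the class up to the target size.
            extra = rng.choice(idx, size=target - count, replace=True)
            idx = np.concatenate([idx, extra])
        keep.append(idx)
    order = np.concatenate(keep)
    return [X[i] for i in order], y[order]


def balanced_weights(y) -> dict:
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    return {str(c): len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
```

The resampling should only ever be applied to the training folds, otherwise duplicated rows leak into validation and inflate the scores.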

Time Series Forecasting Model

Time Series Data

The MCMF program count time series consisted of daily program counts from 1/1/2020 to 3/15/2023, with the training/testing cutoff at 12/2/2021. The plot shows a complex time series with significant dips that appear to be seasonal. The dips occur approximately every 2-5 months, during student holiday seasons (spring break, start of summer break, end of summer break, winter holidays). There is also a general growth trend from 2020 to the last quarter of 2021. The program count peaks at slightly over 2000 programs and then shows a decreasing trend. The data also seems to have a moderate amount of volatility even without the dips.

The next figure shows the time series plot of the MCMF program count. I used a 60-40 ratio to split the training and testing sets.
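
The 60-40 split is chronological, and the cutoff date can be reproduced from the date range alone. The short sketch below assumes one observation per calendar day and that the cutoff date is the last training day.

```python
from datetime import date, timedelta

start, end = date(2020, 1, 1), date(2023, 3, 15)

# One entry per calendar day, both endpoints included.
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# First 60% of the days for training, the rest for testing (no shuffling).
n_train = len(days) * 60 // 100
train_days, test_days = days[:n_train], days[n_train:]

print(len(days), train_days[-1], test_days[0])
# prints: 1170 2021-12-02 2021-12-03
```

Keeping the split in time order matters here, because shuffling would let the model see future counts during training.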

Models Tested

The following models were tested for program count forecasting:

Autoregressive Integrated Moving Average (ARIMA) model
Seasonal Autoregressive Integrated Moving Average (SARIMA) model
Autoregressive Fractionally Integrated Moving Average (ARFIMA) model

The following model was tested for volatility modeling/forecasting:

Generalized Autoregressive Conditional Heteroscedastic (GARCH) model

I ultimately chose ARFIMA to model the time series. Although the Augmented Dickey-Fuller test rejected the unit-root null hypothesis, which suggests the data is stationary, the ACF plot did not display a clear cutoff. Autocorrelation that persists across many lags can indicate long-range dependence, and ARFIMA is designed for that kind of data.
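
What sets ARFIMA apart from ARIMA is that the differencing order d can be a fraction. The operator (1 - B)^d expands into an infinite series of lag weights that decay slowly, which is how the model represents long memory. The NumPy sketch below computes those weights and applies them to a series. It only illustrates the idea and is not the fitting routine I used; the function names and the truncation length are my own choices.

```python
import numpy as np


def frac_diff_weights(d: float, n_weights: int) -> np.ndarray:
    """Coefficients of (1 - B)^d: w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k."""
    w = np.empty(n_weights)
    w[0] = 1.0
    for k in range(1, n_weights):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w


def frac_diff(x, d: float, n_weights: int = 100) -> np.ndarray:
    """Fractionally difference x using a truncated weight series."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, min(n_weights, len(x)))
    out = np.empty(len(x))
    for t in range(len(x)):
        k = min(t + 1, len(w))               # number of lags available at time t
        out[t] = np.dot(w[:k], x[t::-1][:k])  # sum_k w_k * x_{t-k}
    return out
```

With d = 1 the weights collapse to 1, -1, 0, 0, ..., which is ordinary first differencing; with a d between 0 and 0.5 many lags keep a small nonzero weight.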

Results

Performance Metrics

Metrics	Values
Mean Absolute Error (MAE)	343.336
Root Mean Squared Error (RMSE)	402.349
Mean Percentage Error (MPE)	-32.841
Mean Absolute Percentage Error (MAPE)	60.986

The performance metrics indicate that the model forecasts program count poorly: a MAPE of about 61% means the forecasts are off by more than half of the actual value on average. However, this is expected from running a complex time series through a model designed for simpler data.
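
For reference, the four metrics can be computed as in the NumPy sketch below. The error is defined here as actual minus forecast, which is an assumption about the sign convention. Under that convention a negative MPE means the forecasts were too high on average.

```python
import numpy as np


def forecast_errors(actual, forecast) -> dict:
    """MAE, RMSE, MPE and MAPE, with error defined as actual - forecast."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = actual - forecast
    pct = 100.0 * err / actual  # undefined wherever actual == 0
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MPE": float(np.mean(pct)),
        "MAPE": float(np.mean(np.abs(pct))),
    }
```

MPE lets positive and negative errors cancel, so it measures bias, while MAPE takes absolute values and measures the typical size of the miss.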

ARFIMA Forecast

The following plot shows the forecasted data from the ARFIMA model compared to the testing data:

Although the performance metrics were poor, the plot shows that the forecast follows the general decreasing trend of the testing data.

ARFIMA-GARCH Forecast

The following plot shows the ARFIMA forecast along with the volatility forecast made by the GARCH model:

GARCH Variance Bounds

I also plotted the conditional variance bounds against the training data:
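
For readers unfamiliar with GARCH, the sketch below shows the conditional variance recursion of a GARCH(1,1) model and one way to turn it into bounds around a mean path. The (1,1) order, the parameter values a caller would pass in, the initialisation with the sample variance, and the 1.96 multiplier are all assumptions for illustration; they are not taken from my fitted model.

```python
import numpy as np


def garch11_variance(resid, omega: float, alpha: float, beta: float) -> np.ndarray:
    """sigma2[t] = omega + alpha * resid[t-1]**2 + beta * sigma2[t-1]."""
    resid = np.asarray(resid, dtype=float)
    sigma2 = np.empty(len(resid))
    sigma2[0] = resid.var()  # assumed starting value: sample variance
    for t in range(1, len(resid)):
        sigma2[t] = omega + alpha * resid[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2


def variance_bounds(mean_path, sigma2, z: float = 1.96):
    """Lower and upper bounds: mean +/- z * conditional standard deviation."""
    mean_path = np.asarray(mean_path, dtype=float)
    half_width = z * np.sqrt(sigma2)
    return mean_path - half_width, mean_path + half_width
```

A large residual at one step raises the variance at the next step, so the bounds widen after sudden dips and narrow again during calm periods.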

Next Steps

The time series data for MCMF programs was very complex, with extreme fluctuations that aren't captured by simple seasonal differencing, which contributed to the inadequate performance. Despite the limitations imposed by the nature of the data, the ARFIMA model was able to forecast the general trend of the program count. To create a model that can capture the complexity of this data, I could try training a deep learning model such as an LSTM. I could also take the opposite approach and try a simpler forecasting method such as exponential smoothing.
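
As a reference point for the simpler approach, simple exponential smoothing can be written in a few lines. This is a generic sketch with no trend or seasonal component, and the smoothing factor alpha would need to be tuned on the training data.

```python
def simple_exponential_smoothing(series, alpha: float) -> list:
    """Return the smoothed level after each observation.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}, starting from level_0 = x_0.
    The last level is the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = float(series[0])
    levels = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels
```

Because this version produces a flat forecast, it would not capture the holiday dips on its own; a seasonal variant such as Holt-Winters would be the natural next step.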

Check out my GitHub repository for more information and source code of the project!
