Data Analysis · Machine Learning

NYC Street Safety Score Prediction

For my Pratt capstone, I engineered a new geospatial dataset by joining two large NYC Open Data sources: 1.7 million DOT streetlight and traffic signal service requests and 2 million NYPD Motor Vehicle Collision records. Linking them required a custom spatial join that rounds latitude/longitude coordinates to a shared key, combined with a temporal window that matches each collision to the infrastructure requests open at the same intersection when it occurred. The result was a dataset of 26,271 rows and 123 features.
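The join described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`latitude`, `longitude`, `opened`, `closed`, `crash_time`), the function name, and the 4-decimal rounding precision are all assumptions chosen for the example.

```python
import pandas as pd


def match_collisions_to_requests(requests, collisions, decimals=4):
    """Pair each collision with service requests that were open at the
    same rounded coordinate when the collision happened.

    All column names here are hypothetical placeholders.
    """
    req = requests.copy()
    col = collisions.copy()
    scale = 10 ** decimals

    # Build integer location keys so the join never compares raw floats.
    for df in (req, col):
        df["lat_key"] = (df["latitude"] * scale).round().astype("int64")
        df["lon_key"] = (df["longitude"] * scale).round().astype("int64")

    # Spatial step: equal rounded coordinates count as the same location.
    merged = req.merge(col, on=["lat_key", "lon_key"], suffixes=("_req", "_col"))

    # Temporal step: keep collisions that fall inside the request's open window.
    in_window = (merged["crash_time"] >= merged["opened"]) & (
        merged["crash_time"] <= merged["closed"]
    )
    return merged.loc[in_window].reset_index(drop=True)
```

Rounding to four decimal places groups points within roughly 10 m of each other, which is one plausible way to approximate "same intersection"; a different precision changes how aggressively nearby points are merged.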

I designed a weighted heuristic score that rates each intersection's safety risk on a 1–10 scale by aggregating collision severity over the active period of each service request at the matched location. Because most locations had zero incidents during any given request period, the raw score distribution was severely right-skewed. I applied a Box-Cox transformation and then binned the result into deciles to produce a balanced training target.
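A rough sketch of the scoring step described above, under several assumptions: the severity weights and column names (`killed`, `injured`, `collisions`) are invented for illustration, the raw score is shifted by 1 because Box-Cox requires strictly positive input, and ties among zero-incident rows are broken by row order so that the ten bins come out equal in size.

```python
import pandas as pd
from scipy import stats

# Hypothetical severity weights; the project's actual weights are not stated.
SEVERITY_WEIGHTS = {"killed": 10.0, "injured": 3.0, "collisions": 1.0}


def safety_deciles(df):
    """Return a 1-10 decile score per row plus the fitted Box-Cox lambda."""
    # Weighted sum of severity counts for each matched request.
    raw = sum(df[col] * weight for col, weight in SEVERITY_WEIGHTS.items())

    # Box-Cox needs positive values, so shift the zero-heavy score by 1
    # (an assumption of this sketch).
    transformed, lam = stats.boxcox(raw.to_numpy(dtype=float) + 1.0)

    # Rank first so the many tied zero scores do not produce duplicate
    # bin edges; ties are split by row order.
    ranks = pd.Series(transformed, index=df.index).rank(method="first")
    score = pd.qcut(ranks, q=10, labels=False) + 1
    return score.astype(int), lam
```

Because Box-Cox is monotonic, rank-based bins are the same with or without it. In a sketch like this the transform matters mainly when the transformed value is also inspected or reused directly.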

The feature pipeline used a scikit-learn ColumnTransformer combining MinMaxScaler, StandardScaler, and OneHotEncoder to handle mixed numeric and categorical features. A baseline Keras neural network achieved 94% validation accuracy (F1-macro 0.79). I then ran 30 hyperparameter trials with Keras Tuner's Hyperband algorithm, arriving at an optimized architecture with BatchNormalization and Dropout(0.3) that reached 96.7% validation accuracy (F1-macro 0.82) with early stopping at epoch 44.
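The preprocessing stage above might look something like the following. It is a sketch only: which features went to which scaler is not stated in the write-up, so the column groupings and names below (`latitude`, `days_open`, `borough`, and so on) and the function name are hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler


def build_preprocessor():
    """Assemble a mixed-type feature transformer (column names are assumed)."""
    return ColumnTransformer(
        transformers=[
            # Bounded numeric features rescaled to the [0, 1] range.
            ("bounded", MinMaxScaler(), ["latitude", "longitude"]),
            # Unbounded numeric features centered and scaled to unit variance.
            ("standard", StandardScaler(), ["days_open", "collision_count"]),
            # Categorical features expanded to indicator columns; categories
            # unseen during fitting are encoded as all zeros.
            (
                "categorical",
                OneHotEncoder(handle_unknown="ignore"),
                ["borough", "request_type"],
            ),
        ],
        # Always return a dense array, which is convenient as Keras input.
        sparse_threshold=0.0,
    )
```

Fitting the transformer on the training split only and reusing it on the validation split keeps scaling statistics from leaking across the two.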

The full code, notebook, and trained model are available on GitHub.

View Project ↗