San Francisco Crime

Motivation

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

Overview

From Sunset to SOMA, and Marina to Excelsior, this dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

Approach

We will apply a full data science development life cycle composed of the following steps:

  • Data Wrangling to perform all the actions needed to clean the dataset.
  • Feature Engineering to create additional variables from the existing ones.
  • Data Normalization and Data Transformation to prepare the dataset for the learning algorithms.
  • Training / Testing data creation to evaluate the performance of our model.

Data Wrangling

Loading the data

# Core imports
import pandas as pd
import numpy as np

# Yeast imports
from yeast import Recipe
from yeast.steps import *
from yeast.transformers import *
from yeast.selectors import *

# Machine Learning imports
import xgboost as xgb

# Load the raw training and testing sets
train = pd.read_csv('../../data/sf_train.csv')
test = pd.read_csv('../../data/sf_test.csv')
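Before writing the cleaning recipe, it helps to glance at the raw data. A minimal sketch using plain pandas on the train DataFrame loaded above (the raw column names follow the downloaded files; the recipe below normalizes them to snake_case):

# Quick look at the raw training data before cleaning
print(train.shape)          # number of incident reports and raw columns
print(train.dtypes)         # the date column arrives as a plain object/string
print(train.isna().sum())   # missing values per column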

The cleaning recipe

recipe = Recipe([
    # Normalize all column names
    CleanColumnNamesStep('snake'),
    # This dataset contains 2323 duplicates that we should remove only on the training set
    DropDuplicateRowsStep(role='training'),
    # Some geolocation points are misplaced:
    # flag the outlying coordinates as missing and impute them with the mean
    MutateStep({
        'x': MapValues({-120.5: np.nan}),
        'y': MapValues({90: np.nan})
    }),
    MeanImputeStep(['x', 'y']),
    # Extract some features from the date:
    CastStep({'dates': 'datetime'}),
    MutateStep({
        'year': DateYear('dates'),
        'quarter': DateQuarter('dates'),
        'month': DateMonth('dates'),
        'week': DateWeek('dates'),
        'day': DateDay('dates'),
        'hour': DateHour('dates'),
        'minute': DateMinute('dates'),
        'dow': DateDayOfWeek('dates'),
        'doy': DateDayOfYear('dates')
    }),
    # Calculate the tenure: days(date - min(date)):
    MutateStep({
        'tenure': lambda df: (df['dates'] - df['dates'].min()).dt.days
    }),
    # Is it on a block?
    MutateStep({
        'is_block': StrContains('block', column='address', case=False)
    }),
    # Drop columns that are no longer needed
    DropColumnsStep(['dates', 'day_of_week']),
    # Cast the numerical features
    CastStep({
        'is_block': 'integer'  # True and False to 1 and 0
    }),
    # Convert the category (target) into a numerical feature:
    OrdinalEncoderStep('category', role='training'),
    # Keep only numerical features
    SelectStep(AllNumeric()),
]).prepare(train)
# Apply the prepared recipe to both datasets
baked_train = recipe.bake(train, role="training")
baked_test  = recipe.bake(test, role="testing")
# Inspect five random rows of the baked training set (transposed)
baked_train.sample(5).T
           304544   158399   244432   514937    16967
category       25       16       21       16        7
x        -122.391 -122.468 -122.422 -122.403 -122.427
y          37.734   37.717  37.7416  37.7982  37.7692
year         2011     2013     2012     2008     2015
quarter         1        2        1        1        1
month           3        4        1        2        2
week           10       14        4        6        8
day             8        6       28        9       20
hour           18       18        2        0       10
minute          0       30       41       15       43
dow             1        5        5        5        4
doy            67       96       28       40       51
tenure       2983     3743     3309     1860     4428
is_block        0        1        1        1        1
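As a quick sanity check that the wrangling behaved as intended (misplaced coordinates imputed, nothing left missing), something along these lines can be run on the baked training set; the coordinate bounds below are rough values for San Francisco chosen for illustration, not part of the recipe:

# Imputed coordinates should now sit inside a rough San Francisco bounding box
assert baked_train['x'].between(-123, -122).all()
assert baked_train['y'].between(37, 38).all()
# No missing values should be left in any feature
assert baked_train.isna().sum().sum() == 0
# Count of exact duplicate rows in the raw training data (dropped by the recipe)
print(train.duplicated().sum())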

Training and Validation using XGBoost

# All the baked columns except the target are used as features
features = [column for column in baked_train.columns if column != 'category']
dtrain = xgb.DMatrix(
    baked_train[features].values,
    label=baked_train['category'],
    feature_names=features
)
params = {
    'max_depth': 3,                  # shallow trees
    'eta': 0.3,                      # learning rate
    'objective': 'multi:softprob',   # predict a probability for each class
    'num_class': 39,                 # 39 crime categories
    'eval_metric': 'mlogloss'        # multiclass log loss
}

history = xgb.cv(
    params=params, dtrain=dtrain, nfold=5, seed=42,
    num_boost_round=15, stratified=True, verbose_eval=False
)

history.tail()
    train-mlogloss-mean  train-mlogloss-std  test-mlogloss-mean  test-mlogloss-std
10             2.482818            0.000661            2.485002           0.002041
11             2.467268            0.000718            2.469610           0.002145
12             2.454027            0.000752            2.456544           0.002004
13             2.442648            0.001062            2.445381           0.001881
14             2.433441            0.001282            2.436348           0.001993

After 15 boosting rounds, the model reaches a mean training mlogloss of 2.433441 and a mean 5-fold cross-validation (test) mlogloss of 2.436348, compared with a benchmark of 2.49136.
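For reference, mlogloss (multiclass log loss) is the average negative log of the probability the model assigns to the true class, so lower is better; a uniform guess over 39 classes would score around ln(39) ≈ 3.66. A small standalone sketch of the metric with toy numbers, not tied to this dataset:

import numpy as np

def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(y_proba, eps, 1 - eps)
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# Toy example with 3 classes: a confident correct prediction is rewarded,
# a near-uniform one scores close to ln(3) ≈ 1.10
y_true = np.array([0, 2])
y_proba = np.array([
    [0.80, 0.15, 0.05],
    [0.34, 0.33, 0.33],
])
print(multiclass_log_loss(y_true, y_proba))  # ≈ 0.666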

Feature Importance

A benefit of using gradient boosting is that, after the boosted trees are constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model.

# Fit a single model on the full training matrix
model = xgb.train(
    params=params, dtrain=dtrain, num_boost_round=15
)
# get_score() defaults to the 'weight' importance:
# the number of times each feature is used to split the data across all trees
for feature, importance in model.get_score().items():
    print(f'{feature}: {importance}')
hour: 598
x: 649
y: 836
is_block: 383
dow: 35
minute: 626
tenure: 393
day: 51
doy: 61
year: 32
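Location (y, x) and time of day dominate the split counts above. xgboost also provides a plotting helper for these scores; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# 'weight' counts how often a feature is used to split, matching get_score() above
xgb.plot_importance(model, importance_type='weight', max_num_features=10)
plt.tight_layout()
plt.show()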