San Francisco Crime

Motivation

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

Overview

From Sunset to SOMA, and Marina to Excelsior, this dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

Approach

We will apply a full data science development life cycle composed of the following steps:

  • Data Wrangling to perform all the actions needed to clean the dataset.
  • Feature Engineering to create additional variables from the existing ones.
  • Data Normalization and Data Transformation to prepare the dataset for the learning algorithms.
  • Training / Testing data creation to evaluate the performance of our model.

Data Wrangling

Loading the data

# Core imports
import pandas as pd
import numpy as np

# Yeast imports
from yeast import Recipe
from yeast.steps import *
from yeast.transformers import *
from yeast.selectors import *

# Machine Learning imports
import xgboost as xgb

# Load the raw training and testing sets
train = pd.read_csv('../../data/sf_train.csv')
test = pd.read_csv('../../data/sf_test.csv')
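Before writing the cleaning recipe, it helps to glance at the raw data. A minimal sketch using plain pandas on the train DataFrame loaded above (the raw column names follow the downloaded files; the recipe below normalizes them to snake_case):

# Quick look at the raw training data before cleaning
print(train.shape)          # number of incident reports and raw columns
print(train.dtypes)         # the date column arrives as a plain object/string
print(train.isna().sum())   # missing values per column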

The cleaning recipe

recipe = Recipe([
    # Normalize all column names
    CleanColumnNamesStep('snake'),
    # This dataset contains 2323 duplicates that we should remove only on the training set
    DropDuplicateRowsStep(role='training'),
    # Some geolocation points are misplaced:
    # flag the outlying coordinates as missing and impute them with the mean
    MutateStep({
        'x': MapValues({-120.5: np.nan}),
        'y': MapValues({90: np.nan})
    }),
    MeanImputeStep(['x', 'y']),
    # Extract some features from the date:
    CastStep({'dates': 'datetime'}),
    MutateStep({
        'year': DateYear('dates'),
        'quarter': DateQuarter('dates'),
        'month': DateMonth('dates'),
        'week': DateWeek('dates'),
        'day': DateDay('dates'),
        'hour': DateHour('dates'),
        'minute': DateMinute('dates'),
        'dow': DateDayOfWeek('dates'),
        'doy': DateDayOfYear('dates')
    }),
    # Calculate the tenure: days(date - min(date)):
    MutateStep({
        'tenure': lambda df: (df['dates'] - df['dates'].min()).dt.days
    }),
    # Is it on a block?
    MutateStep({
        'is_block': StrContains('block', column='address', case=False)
    }),
    # Drop columns that are no longer needed
    DropColumnsStep(['dates', 'day_of_week']),
    # Cast the numerical features
    CastStep({
        'is_block': 'integer'  # True and False to 1 and 0
    }),
    # Convert the category (target) into a numerical feature:
    OrdinalEncoderStep('category', role='training'),
    # Keep only numerical features
    SelectStep(AllNumeric()),
]).prepare(train)
# Apply the prepared recipe to both datasets
baked_train = recipe.bake(train, role="training")
baked_test  = recipe.bake(test, role="testing")
# Inspect five random rows of the baked training set (transposed)
baked_train.sample(5).T
           304544   158399   244432   514937    16967
category       25       16       21       16        7
x        -122.391 -122.468 -122.422 -122.403 -122.427
y          37.734   37.717  37.7416  37.7982  37.7692
year         2011     2013     2012     2008     2015
quarter         1        2        1        1        1
month           3        4        1        2        2
week           10       14        4        6        8
day             8        6       28        9       20
hour           18       18        2        0       10
minute          0       30       41       15       43
dow             1        5        5        5        4
doy            67       96       28       40       51
tenure       2983     3743     3309     1860     4428
is_block        0        1        1        1        1
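As a quick sanity check that the wrangling behaved as intended (misplaced coordinates imputed, nothing left missing), something along these lines can be run on the baked training set; the coordinate bounds below are rough values for San Francisco chosen for illustration, not part of the recipe:

# Imputed coordinates should now sit inside a rough San Francisco bounding box
assert baked_train['x'].between(-123, -122).all()
assert baked_train['y'].between(37, 38).all()
# No missing values should be left in any feature
assert baked_train.isna().sum().sum() == 0
# Count of exact duplicate rows in the raw training data (dropped by the recipe)
print(train.duplicated().sum())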

Training and Validation using XGBoost

# All the baked columns except the target are used as features
features = [column for column in baked_train.columns if column != 'category']
dtrain = xgb.DMatrix(
    baked_train[features].values,
    label=baked_train['category'],
    feature_names=features
)
params = {
    'max_depth': 3,                  # shallow trees
    'eta': 0.3,                      # learning rate
    'objective': 'multi:softprob',   # predict a probability for each class
    'num_class': 39,                 # 39 crime categories
    'eval_metric': 'mlogloss'        # multiclass log loss
}

history = xgb.cv(
    params=params, dtrain=dtrain, nfold=5, seed=42,
    num_boost_round=15, stratified=True, verbose_eval=False
)

history.tail()
    train-mlogloss-mean  train-mlogloss-std  test-mlogloss-mean  test-mlogloss-std
10             2.482818            0.000661            2.485002           0.002041
11             2.467268            0.000718            2.469610           0.002145
12             2.454027            0.000752            2.456544           0.002004
13             2.442648            0.001062            2.445381           0.001881
14             2.433441            0.001282            2.436348           0.001993

After 15 boosting rounds, the model reaches a mean training mlogloss of 2.433441 and a mean 5-fold cross-validation (test) mlogloss of 2.436348, compared with a benchmark of 2.49136.
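For reference, mlogloss (multiclass log loss) is the average negative log of the probability the model assigns to the true class, so lower is better; a uniform guess over 39 classes would score around ln(39) ≈ 3.66. A small standalone sketch of the metric with toy numbers, not tied to this dataset:

import numpy as np

def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(y_proba, eps, 1 - eps)
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# Toy example with 3 classes: a confident correct prediction is rewarded,
# a near-uniform one scores close to ln(3) ≈ 1.10
y_true = np.array([0, 2])
y_proba = np.array([
    [0.80, 0.15, 0.05],
    [0.34, 0.33, 0.33],
])
print(multiclass_log_loss(y_true, y_proba))  # ≈ 0.666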

Feature Importance

A benefit of using gradient boosting is that, after the boosted trees are constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model.

# Fit a single model on the full training matrix
model = xgb.train(
    params=params, dtrain=dtrain, num_boost_round=15
)
# get_score() defaults to the 'weight' importance:
# the number of times each feature is used to split the data across all trees
for feature, importance in model.get_score().items():
    print(f'{feature}: {importance}')
hour: 598
x: 649
y: 836
is_block: 383
dow: 35
minute: 626
tenure: 393
day: 51
doy: 61
year: 32
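Location (y, x) and time of day dominate the split counts above. xgboost also provides a plotting helper for these scores; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# 'weight' counts how often a feature is used to split, matching get_score() above
xgb.plot_importance(model, importance_type='weight', max_num_features=10)
plt.tight_layout()
plt.show()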