San Francisco Crime¶
Motivation¶
From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
Overview¶
From Sunset to SOMA, and Marina to Excelsior, this dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
Approach¶
We will apply a full Data Science Development life cycle composed of the following steps (a minimal sketch of the resulting pipeline follows this list):
- Data Wrangling to perform all the actions needed to clean the dataset.
- Feature Engineering to create additional variables from the existing ones.
- Data Normalization and Data Transformation to prepare the dataset for the learning algorithms.
- Training / Testing data creation to evaluate the performance of our model.
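All of these steps are expressed as a single yeast Recipe that is prepared on the training data and then baked on any dataset. The following is a minimal, illustrative sketch of that prepare/bake pattern using toy data and only two of the steps (the toy_train / toy_test frames are placeholders; the real recipe is defined in the Data Wrangling section):
import pandas as pd
from yeast import Recipe
from yeast.steps import CleanColumnNamesStep, DropDuplicateRowsStep
# Toy data only used to illustrate the pattern
toy_train = pd.DataFrame({'Some Column': [1, 1, 2], 'Category': ['A', 'A', 'B']})
toy_test = pd.DataFrame({'Some Column': [3], 'Category': ['A']})
sketch = Recipe([
    CleanColumnNamesStep('snake'),           # normalize column names
    DropDuplicateRowsStep(role='training')   # applied only when baking with role='training'
]).prepare(toy_train)                        # learn any parameters from the training data
clean_train = sketch.bake(toy_train, role='training')
clean_test = sketch.bake(toy_test, role='testing')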
Data Wrangling¶
Loading the data¶
# Core imports
import pandas as pd
import numpy as np
# Yeast imports
from yeast import Recipe
from yeast.steps import *
from yeast.transformers import *
from yeast.selectors import *
# Machine Learning imports
import xgboost as xgb
train = pd.read_csv('../../data/sf_train.csv')
test = pd.read_csv('../../data/sf_test.csv')
The cleaning recipe¶
recipe = Recipe([
    # Normalize all column names
    CleanColumnNamesStep('snake'),
    # This dataset contains 2323 duplicates that we should remove, but only from the training set
    DropDuplicateRowsStep(role='training'),
    # Some geolocation points are misplaced:
    # we will replace the outlying coordinates with the average coordinates
    MutateStep({
        'x': MapValues({-120.5: np.nan}),
        'y': MapValues({90: np.nan})
    }),
    MeanImputeStep(['x', 'y']),
    # Extract some features from the date:
    CastStep({'dates': 'datetime'}),
    MutateStep({
        'year': DateYear('dates'),
        'quarter': DateQuarter('dates'),
        'month': DateMonth('dates'),
        'week': DateWeek('dates'),
        'day': DateDay('dates'),
        'hour': DateHour('dates'),
        'minute': DateMinute('dates'),
        'dow': DateDayOfWeek('dates'),
        'doy': DateDayOfYear('dates')
    }),
    # Calculate the tenure: days(date - min(date)):
    MutateStep({
        'tenure': lambda df: (df['dates'] - df['dates'].min()).apply(lambda x: x.days)
    }),
    # Is the address on a block?
    MutateStep({
        'is_block': StrContains('block', column='address', case=False)
    }),
    # Drop irrelevant columns
    DropColumnsStep(['dates', 'day_of_week']),
    # Cast the numerical features
    CastStep({
        'is_block': 'integer'  # True and False to 1 and 0
    }),
    # Convert the category (target) into a numerical feature:
    OrdinalEncoderStep('category', role='training'),
    # Keep only numerical features
    SelectStep(AllNumeric()),
]).prepare(train)
baked_train = recipe.bake(train, role="training")
baked_test = recipe.bake(test, role="testing")
baked_train.sample(5).head().T
| | 304544 | 158399 | 244432 | 514937 | 16967 |
|---|---|---|---|---|---|
| category | 25 | 16 | 21 | 16 | 7 |
| x | -122.391 | -122.468 | -122.422 | -122.403 | -122.427 |
| y | 37.734 | 37.717 | 37.7416 | 37.7982 | 37.7692 |
| year | 2011 | 2013 | 2012 | 2008 | 2015 |
| quarter | 1 | 2 | 1 | 1 | 1 |
| month | 3 | 4 | 1 | 2 | 2 |
| week | 10 | 14 | 4 | 6 | 8 |
| day | 8 | 6 | 28 | 9 | 20 |
| hour | 18 | 18 | 2 | 0 | 10 |
| minute | 0 | 30 | 41 | 15 | 43 |
| dow | 1 | 5 | 5 | 5 | 4 |
| doy | 67 | 96 | 28 | 40 | 51 |
| tenure | 2983 | 3743 | 3309 | 1860 | 4428 |
| is_block | 0 | 1 | 1 | 1 | 1 |
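Because the final SelectStep(AllNumeric()) keeps only numeric columns, a quick sanity check on the baked frame can catch anything that slipped through. This is a small sketch using pandas' select_dtypes and is not part of the original pipeline:
# Sanity check: everything fed to XGBoost should be numeric after baking
non_numeric = baked_train.select_dtypes(exclude='number').columns
assert len(non_numeric) == 0, f'Unexpected non-numeric columns: {list(non_numeric)}'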
Training and Validation using XGBoost¶
features = list(set(baked_train.columns) - set(['category']))
dtrain = xgb.DMatrix(
    baked_train[features].values,
    label=baked_train['category'],
    feature_names=features
)
params = {
    'max_depth': 3,
    'eta': 0.3,
    'objective': 'multi:softprob',
    'num_class': 39,
    'eval_metric': 'mlogloss'
}
history = xgb.cv(
    params=params, dtrain=dtrain, nfold=5, seed=42,
    num_boost_round=15, stratified=True, verbose_eval=False
)
history.tail()
| | train-mlogloss-mean | train-mlogloss-std | test-mlogloss-mean | test-mlogloss-std |
|---|---|---|---|---|
| 10 | 2.482818 | 0.000661 | 2.485002 | 0.002041 |
| 11 | 2.467268 | 0.000718 | 2.469610 | 0.002145 |
| 12 | 2.454027 | 0.000752 | 2.456544 | 0.002004 |
| 13 | 2.442648 | 0.001062 | 2.445381 | 0.001881 |
| 14 | 2.433441 | 0.001282 | 2.436348 | 0.001993 |
After 15 boosting rounds, the model achieved a mean 5-fold cross-validation mlogloss of 2.436348 on the held-out folds (2.433441 on the training folds), while 2.49136 was the benchmark.
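Because xgb.cv returns a pandas DataFrame, the best round and its validation score can be read off directly. This is a small convenience sketch, not part of the original notebook:
# Locate the boosting round with the lowest mean validation mlogloss
best_round = history['test-mlogloss-mean'].idxmin()
best_score = history.loc[best_round, 'test-mlogloss-mean']
print(f'Best round: {best_round + 1}, validation mlogloss: {best_score:.6f}')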
Feature Importance¶
A benefit of using gradient boosting is that, once the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model.
model = xgb.train(
    params=params, dtrain=dtrain, num_boost_round=15
)
for feature, importance in model.get_score().items():
    print(f'{feature}: {importance}')
hour: 598
x: 649
y: 836
is_block: 383
dow: 35
minute: 626
tenure: 393
day: 51
doy: 61
year: 32
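The scores above come from get_score's default importance type ('weight', the number of times a feature is used to split). A gain-based view, or xgboost's built-in importance plot, can complement it. The following is a sketch that assumes matplotlib is available and is not part of the original notebook:
import matplotlib.pyplot as plt
# Importance measured by the average gain of the splits that use each feature
for feature, gain in model.get_score(importance_type='gain').items():
    print(f'{feature}: {gain:.2f}')
# Built-in horizontal bar chart of the same information
xgb.plot_importance(model, importance_type='gain', show_values=False)
plt.tight_layout()
plt.show()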