Final report: Analysis of crimes in Chicago

Topic and Motivation

The selected data consists of reports of crimes (excluding murders) that occurred in the city of Chicago, United States, from 2001 to 2017. The data is available at https://www.kaggle.com/currie32/crimes-in-chicago and was extracted from the Chicago Police Department's "CLEAR" (Citizen Law Enforcement Analysis and Reporting) system.

We chose this dataset because it is well suited to classification tasks. It is also large enough to apply those tasks over more than one time period, which lets us check whether crime behavior in Chicago has changed over the years. Another reason is that the records are geolocated, so they can be explored in a very visual way.

Hypotheses and Objective

Our initial assumptions, made before working with the data, are the following.

  1. There are relationships between the conditions under which a crime occurs and the crime itself.

  2. The hot spots where crimes occur vary according to these relationships.

  3. These patterns are not static over time.

Based on these hypotheses, the objective of the project is to carry out classification tasks that capture these differences between types of crime. We then aim to build descriptive and predictive models on top of these classifications, in order to detect possible relationships between crimes and to predict their occurrence.

Initial questions to guide the data exploration

The following questions serve as a baseline to guide the exploratory data analysis:

  1. How has crime in Chicago changed over the years?

  2. Are certain types of crime more likely to occur in certain areas of the city, or at certain times of day?

  3. Is there any correlation between the different types of crime?

  4. Which crimes are predominant in Chicago, and are they related to each other?

  5. What relationship exists between the type of crime and the type of place where it is committed?

Development:

Imports, connection to Google Drive and data loading.

In [0]:
%matplotlib inline
# INSTALLS
!pip install pandas
!pip install -U -q PyDrive
!pip install scipy
!pip install xlrd
!pip install folium==0.6.0
!pip install geopandas
!pip install descartes
!pip install mlxtend


# Data handling and plotting
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import folium 
from folium import plugins
import geopandas as gpd
import descartes

#from pandas import ExcelWriter
#from pandas import ExcelFile

# Imports needed to read the data from Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
In [0]:
# File download
## We must authenticate as users
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

local_download_path = os.path.expanduser('~/data_drive')
# Create the download folder; do nothing if it already exists
os.makedirs(local_download_path, exist_ok=True)

folder_id = "1jr3rsT4w1XNJJyR4UUPeocUyyAgtDN7e" # id of the Google Drive folder

# Download every file in the folder
file_list = drive.ListFile(
    {'q': "'{}' in parents".format(folder_id)}).GetList()

for f in file_list:
    # Create each file locally and download its content
    print('title: %s, id: %s' % (f['title'], f['id']))
    fname = os.path.join(local_download_path, f['title'])
    print('downloading to {}'.format(fname))
    f_ = drive.CreateFile({'id': f['id']})
    f_.GetContentFile(fname)
    
In [0]:
# Path to the data (the folder the files were downloaded to)
data_path = os.path.expanduser('~/data_drive')

data2001_path = os.path.join(data_path, "Chicago_Crimes_2001_to_2004.csv")
data2005_path = os.path.join(data_path, "Chicago_Crimes_2005_to_2007.csv")
data2008_path = os.path.join(data_path, "Chicago_Crimes_2008_to_2011.csv")
data2012_path = os.path.join(data_path, "Chicago_Crimes_2012_to_2017.csv")

# Create the data frames
# error_bad_lines=False skips malformed rows (newer pandas versions use on_bad_lines='skip' instead)
data2001 = pd.read_csv(data2001_path, error_bad_lines=False)
data2005 = pd.read_csv(data2005_path, error_bad_lines=False)
data2008 = pd.read_csv(data2008_path, error_bad_lines=False)
data2012 = pd.read_csv(data2012_path, error_bad_lines=False)

# Concatenate the four data frames
dataconcat = pd.concat([data2012, data2008, data2005, data2001], ignore_index=False, axis=0)

del data2001
del data2005
del data2008
del data2012

Data formatting

We start by checking the size of the dataset before modifying it. We then remove rows whose combination of ID and case number is duplicated, and drop the columns that are not used in the later analysis.

In [0]:
print('Tamaño inicial del dataset: ', dataconcat.shape)
# Remove duplicates
dataconcat.drop_duplicates(subset=['ID', 'Case Number'], inplace=True)

# Drop unneeded columns
dataconcat.drop(['Unnamed: 0', 'Case Number','Updated On','FBI Code','Beat','X Coordinate','Y Coordinate','Latitude','Longitude'], inplace=True, axis=1)
# Rows with missing values are kept at this point; dropna() is applied later, before clustering

print('Tamaño final luego de eliminar duplicados y columnas innecesarias: ', dataconcat.shape)
#dataconcat.head()
Tamaño inicial del dataset:  (7941282, 23)
Tamaño final luego de eliminar duplicados y columnas innecesarias:  (6170812, 14)
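
As a small illustration of how `drop_duplicates` treats the `subset` argument, the sketch below uses a made-up four-row table (the values are hypothetical, not taken from the dataset). With both columns in `subset`, a row is dropped only when the ID and the case number repeat together.

```python
import pandas as pd

# Hypothetical toy rows, only to illustrate the behavior
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Case Number": ["A", "A", "A", "B"],
})

# A row is dropped only if the (ID, Case Number) pair repeats
by_pair = toy.drop_duplicates(subset=["ID", "Case Number"])

# Dropping on each column separately is stricter: the repeated case number "A" also goes
by_either = toy.drop_duplicates(subset=["ID"]).drop_duplicates(subset=["Case Number"])

print(len(toy), len(by_pair), len(by_either))
```

Which of the two behaviors is wanted depends on whether a repeated case number with a different ID should count as the same incident.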

The data types are fixed to make the columns easier to work with.

In [0]:
# Fix the column data types
data=dataconcat
data['ID'] = data['ID'].astype('category')
data['Primary Type'] = data['Primary Type'].astype('category')
data.IUCR = data.IUCR.astype('category')
# Location is a string such as "(41.864073157, -87.706818608)", i.e. (latitude, longitude),
# so coorX ends up holding the latitude and coorY the longitude
data['coorX'] = data['Location'].str.split(',').str[0]
data['coorX']= data['coorX'].str.lstrip('(')
data['coorY'] = data['Location'].str.split(',').str[1]
data['coorY']= data['coorY'].str.rstrip(')')
data['coorX'] = pd.to_numeric(data['coorX'])
data['coorY'] = pd.to_numeric(data['coorY'])

data['District'] = data['District'].astype('category')
data['Community Area'] = data['Community Area'].astype('category')
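
To make the coordinate parsing easier to follow, here is a minimal sketch on a single value. The helper name `parse_location` is hypothetical; the sample string is the `Location` value of the first row shown in `data.head()`. Note that the first component is the latitude and the second the longitude.

```python
def parse_location(text):
    """Split a '(lat, lon)' string into two floats."""
    lat_str, lon_str = text.strip().strip("()").split(",")
    return float(lat_str), float(lon_str)

lat, lon = parse_location("(41.864073157, -87.706818608)")
print(lat, lon)
```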
In [0]:
# very slow
import time
start = time.time()
data['Date']=pd.to_datetime(data['Date'])
finish = time.time()
print(finish - start)
1501.078206539154
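
The conversion above took about 1,500 seconds. One option that may help is to pass an explicit `format` to `pd.to_datetime`, so that pandas does not have to infer the format. The sketch below assumes the raw `Date` strings look like `05/03/2016 11:40:00 PM`; that format is an assumption and should be checked against the CSV files before use.

```python
import pandas as pd

# Hypothetical sample strings in the assumed month/day/year 12-hour format
raw_dates = pd.Series(["05/03/2016 11:40:00 PM", "01/01/2001 12:00:00 AM"])

# An explicit format avoids per-value format inference
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %I:%M:%S %p")

print(parsed.dt.hour.tolist())
```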
In [0]:
data.head() 
Out[0]:
ID Date Block IUCR Primary Type Description Location Description Arrest Domestic District Ward Community Area Year Location coorX coorY
0 10508693 2016-05-03 23:40:00 013XX S SAWYER AVE 0486 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT True True 10.0 24.0 29.0 2016.0 (41.864073157, -87.706818608) 41.864073 -87.706819
1 10508695 2016-05-03 21:40:00 061XX S DREXEL AVE 0486 BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE False True 3.0 20.0 42.0 2016.0 (41.782921527, -87.60436317) 41.782922 -87.604363
2 10508697 2016-05-03 23:31:00 053XX W CHICAGO AVE 0470 PUBLIC PEACE VIOLATION RECKLESS CONDUCT STREET False False 15.0 37.0 25.0 2016.0 (41.894908283, -87.758371958) 41.894908 -87.758372
3 10508698 2016-05-03 22:10:00 049XX W FULTON ST 0460 BATTERY SIMPLE SIDEWALK False False 15.0 28.0 25.0 2016.0 (41.885686845, -87.749515983) 41.885687 -87.749516
4 10508699 2016-05-03 22:00:00 003XX N LOTUS AVE 0820 THEFT $500 AND UNDER RESIDENCE False True 15.0 28.0 25.0 2016.0 (41.886297242, -87.761750709) 41.886297 -87.761751

Exploratory analysis

With the data in the desired format, we begin the exploratory analysis. We start by inspecting the data type of each column.

In [0]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6170812 entries, 0 to 1923514
Data columns (total 16 columns):
ID                      category
Date                    datetime64[ns]
Block                   object
IUCR                    category
Primary Type            category
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
District                category
Ward                    float64
Community Area          category
Year                    float64
Location                object
coorX                   float64
coorY                   float64
dtypes: bool(2), category(5), datetime64[ns](1), float64(4), object(4)
memory usage: 742.6+ MB

Which crimes have been committed most often in Chicago according to the data?

In [0]:
most_common_crimes = data.groupby('Primary Type').size().reset_index(name='count').sort_values(by=['count'], ascending=False)
most_common_crimes.plot(x='Primary Type',kind='barh', figsize=(8,10))
plt.gca().invert_yaxis()
plt.ylabel('Tipos de crimenes')
plt.xlabel('Frecuencia')
plt.title('Crimenes más comunes')
# the most common crimes
Out[0]:
Text(0.5,1,'Crimenes más comunes')
In [0]:
print ("Los 20 crimenes más comunes")
most_common_crimes = data.groupby(['Description','Primary Type']).size().reset_index(name='count').sort_values(by=['count'], ascending=False)
most_common_crimes_top20=most_common_crimes[0:20]
most_common_crimes_top20.plot(x=['Primary Type','Description'],kind='barh', figsize=(8,10))
plt.gca().invert_yaxis()
plt.ylabel('Descripcion (Tipo, Subtipo)')
plt.xlabel('Frecuencia')
Los 20 crimenes más comunes
Out[0]:
Text(0.5,0,'Frecuencia')

Among the general categories, most crimes are concentrated in a small fraction of all the categories in the dataset, the most frequent being "theft" (stealing that involves no contact with the victim). The second chart breaks the crimes down further by their secondary description, which specifies the type of crime in more detail.

In which places in the city do crimes mostly occur?

In [0]:
lugar_comunes=data.groupby('Location Description').size().reset_index(name='count').sort_values(by=['count'], ascending=False).head(10)
lugar_comunes.index = range(1,len(lugar_comunes)+1)
print( "Tipos de lugares mas comunes")
lugar_comunes
Tipos de lugares mas comunes
Out[0]:
Location Description count
1 STREET 1637391
2 RESIDENCE 1043854
3 APARTMENT 627874
4 SIDEWALK 618938
5 OTHER 232737
6 PARKING LOT/GARAGE(NON.RESID.) 176895
7 ALLEY 139580
8 SCHOOL, PUBLIC, BUILDING 133258
9 RESIDENCE-GARAGE 122290
10 RESIDENCE PORCH/HALLWAY 108129

How has the number of crimes evolved over time?

In [0]:
print("Gráfico, cantidad de crimenes por año")
data.groupby('Year').size().reset_index(name='count').plot(x='Year', y='count')
plt.xlabel('Año')
plt.ylabel('Frecuencia')
plt.title('Crimenes por año')
Gráfico, cantidad de crimenes por año
Out[0]:
Text(0.5,1,'Crimenes por año')

At first glance, this chart shows a sustained decline in the number of crimes per year. The count for 2004 is notably low compared with the adjacent years. From 2015 to 2016 the number of crimes increased rather than decreased. The data for 2017 is incomplete, which explains the final drop in the line.

Does the frequency of crimes change throughout the day?

In [0]:
data.groupby(data['Date'].dt.hour).size().plot()
plt.ylabel('Frecuencia')
plt.xlabel('Hora del día')
plt.title('Frecuencia de crímenes por hora de ocurrencia')
plt.show()

The chart shows clear differences in crime frequency depending on the time of day. Most crimes are concentrated in the daytime, and the frequency drops significantly in the early morning hours.

Do all crimes behave the same way?

Having obtained the overall behavior of crimes by time of day, we now check whether this behavior holds when crimes are split by type. Below is the same chart as before for the 8 most frequent crime types in Chicago.

In [0]:
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(12,16))
axes[0,0].set_ylabel('crimenes tipo THEFT')
axes[0,0].set_xlabel('hora del día')
axes[0,1].set_ylabel('crimenes tipo BATTERY')
axes[0,1].set_xlabel('hora del día')
axes[1,0].set_ylabel('crimenes tipo CRIMINAL DAMAGE')
axes[1,0].set_xlabel('hora del día')
axes[1,1].set_ylabel('crimenes tipo NARCOTICS')
axes[1,1].set_xlabel('hora del día')
axes[2,0].set_ylabel('crimenes tipo OTHER OFFENSE')
axes[2,0].set_xlabel('hora del día')
axes[2,1].set_ylabel('crimenes tipo ASSAULT')
axes[2,1].set_xlabel('hora del día')
axes[3,0].set_ylabel('crimenes tipo BURGLARY')
axes[3,0].set_xlabel('hora del día')
axes[3,1].set_ylabel('crimenes tipo MOTOR VEHICLE THEFT')
axes[3,1].set_xlabel('hora del día')

grouped_data = data.groupby([data['Date'].map(lambda x: x.hour), data['Primary Type']]).size().reset_index(name='count').rename(index=str, columns={'Date': 'Hour'})
theft_data = grouped_data[grouped_data['Primary Type'] == 'THEFT']
theft_data.plot(x='Hour', y='count', ax=axes[0,0])
batt_data = grouped_data[grouped_data['Primary Type'] == 'BATTERY']
batt_data.plot(x='Hour', y='count', ax=axes[0,1])
cd_data = grouped_data[grouped_data['Primary Type'] == 'CRIMINAL DAMAGE']
cd_data.plot(x='Hour', y='count', ax=axes[1,0])
narc_data = grouped_data[grouped_data['Primary Type'] == 'NARCOTICS']
narc_data.plot(x='Hour', y='count', ax=axes[1,1])
oo_data = grouped_data[grouped_data['Primary Type'] == 'OTHER OFFENSE']
oo_data.plot(x='Hour', y='count', ax=axes[2,0])
assault_data = grouped_data[grouped_data['Primary Type'] == 'ASSAULT']
assault_data.plot(x='Hour', y='count', ax=axes[2,1])
burg_data = grouped_data[grouped_data['Primary Type'] == 'BURGLARY']
burg_data.plot(x='Hour', y='count', ax=axes[3,0])
mvt_data = grouped_data[grouped_data['Primary Type'] == 'MOTOR VEHICLE THEFT']
mvt_data.plot(x='Hour', y='count', ax=axes[3,1])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb3f2360b8>

For the crime types plotted, the drop in frequency during the early morning holds, but the behavior afterwards differs between types: some peak around 10:00 to 12:00 (such as Burglary), and others peak around 20:00 to 22:00 (such as Motor Vehicle Theft).

Distribution of crimes on the map of Chicago

The following visualizations aim to give a better understanding of how crimes are distributed across the city of Chicago, using different representations on the city map.

In [0]:
conteo = data['Location'].value_counts()
df_conteo = pd.DataFrame({"Location" : conteo.index, "ValueCount":conteo})
df_conteo.index = range(len(df_conteo))
df_conteo['coorX'] = df_conteo['Location'].str.split(',').str[0]
df_conteo['coorX']= df_conteo['coorX'].str.replace('(', '')
df_conteo['coorY'] = df_conteo['Location'].str.split(',').str[1]
df_conteo['coorY']= df_conteo['coorY'].str.replace(')', '')
df_conteo = df_conteo.drop(columns=['Location'], axis = 1)

chicago_map_crime = folium.Map(location=[41.895140898, -87.624255632],
                        zoom_start=13,
                        tiles="Stamen Terrain")

for i in range(1000):
    # coorX holds the latitude and coorY the longitude (first and second value of Location)
    lat = float(df_conteo['coorX'].iloc[i])
    long = float(df_conteo['coorY'].iloc[i])
    radius = df_conteo['ValueCount'].iloc[i] / 100

    if df_conteo['ValueCount'].iloc[i] > 1000:
        color = "#FF4500"
    else:
        color = "#008080"

    popup_text = """Latitude : {}<br>
                Longitude : {}<br>
                Criminal Incidents : {}<br>"""
    popup_text = popup_text.format(lat,
                               long,
                               df_conteo['ValueCount'].iloc[i]
                               )
    folium.CircleMarker(location = [lat, long], popup= popup_text,radius = radius, color = color, fill = True).add_to(chicago_map_crime)

print("1000 Puntos con más crímenes en Chicago")
chicago_map_crime
1000 Puntos con más crímenes en Chicago
Out[0]:
In [0]:
from folium import plugins
heat_map= folium.Map(location=[41.895140898, -87.624255632],
                        zoom_start=11.5,
                        tiles="Stamen Terrain")
# coorX holds the latitude and coorY the longitude; rows without coordinates are skipped
location=[]
for i in range(50000):
  lat = float(data['coorX'].iloc[i])
  long = float(data['coorY'].iloc[i])
  if np.isfinite(lat) and np.isfinite(long):
    location.append((lat, long))

print ("Mapa de calor de los primeros 50.000 puntos")  
heat_map.add_child(plugins.HeatMap(location, radius=15))
heat_map
Mapa de calor de los primeros 50.000 puntos
Out[0]:

From the maps we can conclude that a large share of Chicago's crimes occur at O'Hare airport, near Millennium Park, around the "Hometown" area, in the surroundings of Garfield Park and, above all, along the central shoreline area.

Map by district

The dataset includes several geographic divisions used in the city of Chicago (the Block, District and Ward variables). We arbitrarily chose to plot the number of crimes per district, which yields the map below. In addition, only crimes that occurred in 2016 are plotted in this case.

In [0]:
#Number of crimes per city district

chicago_map_crime = folium.Map(location=[41.895140898, -87.624255632], zoom_start=10, tiles="Stamen Terrain")

distritos_path = os.path.join(data_path, "Boundaries - Police Districts (current).geojson")
#distritos = gpd.read_file(distritos_path)

conteodist=data.loc[:,['Primary Type','District','Year']]
#conteodist=data[['Primary Type','District','Year']]
conteodist['District'] = conteodist['District'].astype('int')
conteodist['District'] = conteodist['District'].astype('str')
conteodist=conteodist.loc[conteodist['Year'] == 2016]
conteodist=conteodist.groupby('District').size().reset_index(name='count').sort_values(by=['count'], ascending=False)
#conteodist['Count'] = conteodist['Count'].astype('str')
#conteodist=conteodist.drop(index=0)

chicago_map_crime.choropleth(geo_data = distritos_path, name ='distritos', 
                             data = conteodist, columns = ['District', 'count'], 
                             key_on = 'feature.properties.dist_num', fill_color = 'YlOrRd',
                             legend_name = 'Cantidad de crímenes',threshold_scale=[0,100, 3000, 8000, 12000, 15000, 18000])
folium.LayerControl().add_to(chicago_map_crime)
chicago_map_crime

#conteodist
#highlight = True
Out[0]:

For 2016, there are large variations between districts: some have between 0 and 3 thousand crimes, while others exceed 12 thousand.

Milestone 2

Outliers

Next we look for possible outliers in the coordinates where crimes took place, with the goal of removing data points that are very different from the rest.

In [0]:
data['coorX'] = data['coorX'].astype('float')
data['coorY'] = data['coorY'].astype('float')

#box plots of the crime coordinates
green_diamond = dict(markerfacecolor='g', marker='D')  # marker style that makes the outliers easy to see
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,6))
boxp1 = data.boxplot(column='coorX',showfliers=True, flierprops=green_diamond, ax=ax1)
boxp2 = data.boxplot(column='coorY',showfliers=True, flierprops=green_diamond, ax=ax2)
#boxp3 = data.boxplot(column='coorX',showfliers=False, ax=ax3)
#boxp4 = data.boxplot(column='coorY',showfliers=False, ax=ax4)
plt.tight_layout() 

The plots show that there is one value in both the latitude (coorX) and the longitude (coorY) that lies far from the rest of the data, so we decided to remove these values from the dataset. The main reason is to keep these points from distorting a possible later normalization of the coordinates, and from affecting the cluster-selection methods used in the rest of the project. The distribution of the data without the most extreme outliers is shown below.

In [0]:
#Remove outliers
data = data[data.coorX > 37]
data = data[data.coorY > -91]
green_diamond = dict(markerfacecolor='g', marker='D')  # marker style that makes the outliers easy to see
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,6))
boxp1 = data.boxplot(column='coorX',showfliers=True, flierprops=green_diamond, ax=ax1)
boxp2 = data.boxplot(column='coorY',showfliers=True, flierprops=green_diamond, ax=ax2)
plt.tight_layout() 

The remaining values flagged as outliers are kept, because they are not as distant as those seen in the previous step and because it is no longer clear whether anything would be gained by removing them.
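
For reference, the points a box plot draws as fliers are those lying beyond the whiskers, which by default in matplotlib extend 1.5 times the interquartile range (IQR) from the quartiles. The sketch below computes those fences in plain Python for a handful of made-up latitude values (hypothetical, chosen only so that one of them falls below 37, like the value removed above).

```python
import statistics

# Hypothetical latitude values; 36.6 plays the role of the extreme point
latitudes = [36.6, 41.7, 41.8, 41.85, 41.9, 42.0]

# Quartiles with linear interpolation between data points
q1, _, q3 = statistics.quantiles(latitudes, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences would be drawn as fliers
outliers = [v for v in latitudes if v < lower_fence or v > upper_fence]
print(outliers)
```

The fixed thresholds used in the cell above (37 and -91) are a simpler choice that removes only the most extreme values and keeps the milder fliers.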

Clustering

In [0]:
#imports needed for this part of the project

import sklearn
from sklearn.cluster import KMeans, DBSCAN
from sklearn.model_selection import cross_val_score
from sklearn import metrics, preprocessing
from folium import plugins

K-Means

Our first approach to clustering crimes in Chicago uses K-means, since it minimizes the within-cluster SSE (sum of squared errors) and is therefore a natural fit for grouping points by geographic proximity.

The intuition behind using K-means is to find significant differences in the behavior of crime in Chicago across the geographic areas of the city.

In [0]:
relevant_data = data[data['Primary Type'] != 'OTHER OFFENSE'].dropna()
relevant_crimes = relevant_data.groupby('Primary Type').size().reset_index(name='count').sort_values(by=['count'], ascending=False)[:15]
relevant_crimes = relevant_crimes['Primary Type']
relevant_data = relevant_data[relevant_data['Primary Type'].isin(relevant_crimes)]
relevant_data['Primary Type'] = relevant_data['Primary Type'].cat.remove_unused_categories()

Coordinates only

In [0]:
data_sample = relevant_data.sample(frac=0.1).reset_index(drop=True)

Choosing K

The first approach with K-means uses only the coordinates to define the clusters. The first step is to choose the ideal K by inspecting how the sum of squared errors varies with this value. In this case the ideal K is 5.

In [0]:
sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_sample[['coorX',  'coorY']])
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
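
To make the quantity on the vertical axis concrete: `inertia_` is the sum of squared distances from each sample to its closest cluster center. The sketch below computes that sum by hand for four made-up points and two hand-picked sets of centroids (all values are hypothetical), showing how the SSE drops when a second, well-placed centroid is added.

```python
def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

# Two tight groups of hypothetical points, far apart from each other
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

sse_k1 = sse(points, [(5.0, 1.0)])                # a single centroid at the overall mean
sse_k2 = sse(points, [(0.0, 1.0), (10.0, 1.0)])   # one centroid per group

print(sse_k1, sse_k2)
```

The "elbow" in the curve is the value of K after which adding more centroids stops producing drops of this kind.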

Plotting

Next we run the algorithm and plot on a map up to the first 5,000 points of each cluster. The map shows that K-means does split the city into 5 zones of similar overall density.

In [0]:
# let's try k = 5
kmeans = KMeans(n_clusters=5, max_iter=1000).fit(data_sample[['coorX',  'coorY']])
data_sample["cluster"] = kmeans.labels_
data_sample['coorX'] = pd.to_numeric(data_sample['coorX'])
data_sample['coorY'] = pd.to_numeric(data_sample['coorY'])

c0 = data_sample[data_sample['cluster'] == 0]
c1 = data_sample[data_sample['cluster'] == 1]
c2 = data_sample[data_sample['cluster'] == 2]
c3 = data_sample[data_sample['cluster'] == 3]
c4 = data_sample[data_sample['cluster'] == 4]
In [0]:
hm0= folium.Map(location=[41.8, -87.6],
                        zoom_start=10.5,
                        tiles="Stamen Terrain")

# One heat-map layer per cluster, each drawn in a single color
cluster_colors = ['blue', 'yellow', 'red', 'green', 'purple']
for cluster_df, color in zip([c0, c1, c2, c3, c4], cluster_colors):
  # Up to the first 5,000 points of the cluster; coorX is the latitude, coorY the longitude
  points = cluster_df[['coorX', 'coorY']].head(5000).dropna()
  location = points.values.tolist()
  hm0.add_child(plugins.HeatMap(location, radius=15,
                                gradient={.4: color, .65: color, 1: color}))
hm0
Out[0]:

Distribution of crime types by cluster

As shown below, with the exception of cluster number 4, there is no obvious difference between the clusters at first glance, because the distribution of crime types is (in general) similar.

Therefore, in this case K-means gives us no new information about some areas being more crime-prone than others, beyond the fact that cluster 4 has fewer crimes than the rest.

In [0]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(100,60))
data_sample[data_sample['cluster'] ==0].groupby(['Primary Type']).size().reset_index(name='count').sort_values(by=['Primary Type'], ascending=False).plot(x='Primary Type',kind='barh', figsize=(15,15),label=1 ,ax=axes[0,0])
data_sample[data_sample['cluster'] ==1].groupby(['Primary Type']).size().reset_index(name='count').sort_values(by=['Primary Type'], ascending=False).plot(x='Primary Type',kind='barh', figsize=(15,15),label=2 ,ax=axes[0,1])
data_sample[data_sample['cluster'] ==2].groupby(['Primary Type']).size().reset_index(name='count').sort_values(by=['Primary Type'], ascending=False).plot(x='Primary Type',kind='barh', figsize=(15,15),label=3 ,ax=axes[1,0])
data_sample[data_sample['cluster'] ==3].groupby(['Primary Type']).size().reset_index(name='count').sort_values(by=['Primary Type'], ascending=False).plot(x='Primary Type',kind='barh', figsize=(15,15),label=4 ,ax=axes[1,1])
data_sample[data_sample['cluster'] ==4].groupby(['Primary Type']).size().reset_index(name='count').sort_values(by=['Primary Type'], ascending=False).plot(x='Primary Type',kind='barh', figsize=(15,15),label=5 ,ax=axes[2,0])
plt.tight_layout() 

Coordinates and crime type

Next, K-means is applied again, this time also including the variable for the type of crime committed.

In [0]:
# take a sample of the data and format it for clustering

data_sample = relevant_data.sample(frac=0.1)
data_sample['Date'] = pd.to_datetime(data_sample['Date'])
data_sample['Hour'] = [d.hour for d in data_sample['Date']]
data_sample['Month'] = [d.month for d in data_sample['Date']]
In [0]:
data_to_cluster = data_sample.drop(['ID', 'Date', 'Block', 'IUCR', 'Description',
                                'Location Description', 'Arrest', 'Domestic',
                                'District', 'Ward', 'Community Area', 'Location',
                                'Year', 'Month', 'Hour'], axis=1)
type_weight = 1
data_to_cluster = pd.get_dummies(data_to_cluster, columns=["Primary Type"])
column_names = data_to_cluster.columns.values
coor_data = data_to_cluster[['coorX', 'coorY']].values
type_data = data_to_cluster.drop(['coorX', 'coorY'], axis=1).values
min_max_scaler = preprocessing.MinMaxScaler()
coor_data = min_max_scaler.fit_transform(coor_data)
data_to_cluster = np.concatenate((coor_data, type_data * type_weight), axis=1)
data_to_cluster = pd.DataFrame(data_to_cluster, columns=column_names)
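
As a reminder of what the scaling step does, `MinMaxScaler` with its default range maps each column to [0, 1] using (x - min) / (max - min). The sketch below applies the same formula in plain Python to three made-up latitude values (hypothetical, not taken from the dataset).

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

latitudes = [41.5, 41.75, 42.0]  # hypothetical values
scaled = min_max_scale(latitudes)
print(scaled)
```

After this step the coordinates share the 0 to 1 range of the one-hot crime-type columns, which is what makes `type_weight` meaningful as a relative weight.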

Choosing K

This time, from K=15 onwards the decrease in the sum of squared errors, if any, is imperceptible, so we tried this value first. However, with K=15 we noticed that K-means merely placed each crime type in its own cluster, so we decided to use K=30 to try to reveal other kinds of grouping. The map and table below show the result for one of the clusters obtained with this last K (showing only one avoids filling the report with heat maps).

In [0]:
sse = {}
for k in range(1, 30):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_to_cluster)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
In [0]:
k = 30
kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_to_cluster)
data_sample["cluster"] = kmeans.labels_

clusters = []
for i in range(k):
  clusters.append(data_sample[data_sample['cluster'] == i])
In [0]:
hm= folium.Map(location=[41.8, -87.6],
                        zoom_start=10.5,
                        tiles="Stamen Terrain")
index = 4
n_points = min(clusters[index].shape[0], 10000)
location=[]
for i in range(n_points):
  lat = clusters[index]['coorY'].iloc[i]
  long = clusters[index]['coorX'].iloc[i]
  loc=(long, lat)
  if np.isfinite(lat):
    location.append(loc)
hm.add_child(plugins.HeatMap(location, radius=15))
hm
Out[0]:

As the following table shows, K-means still outputs clusters characterized by a single crime type. This happens because K-means is a partitional clustering method that assigns each point to its nearest centroid, and in this setup the one-hot crime-type columns tend to contribute most of the distance between points. It is therefore not the best technique if the goal is to discover patterns in where crimes appear.

In [0]:
clusters[3].groupby('Primary Type').size().reset_index(name='count').sort_values(by=['count'], ascending=False)
Out[0]:
Primary Type count
4 CRIMINAL DAMAGE 33243
0 ASSAULT 0
1 BATTERY 0
2 BURGLARY 0
3 CRIM SEXUAL ASSAULT 0
5 CRIMINAL TRESPASS 0
6 DECEPTIVE PRACTICE 0
7 MOTOR VEHICLE THEFT 0
8 NARCOTICS 0
9 OFFENSE INVOLVING CHILDREN 0
10 PROSTITUTION 0
11 PUBLIC PEACE VIOLATION 0
12 ROBBERY 0
13 THEFT 0
14 WEAPONS VIOLATION 0
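
One way to see why the clusters end up containing a single crime type is to compare squared distances between feature vectors. The sketch below is only an illustration with hypothetical points: coordinates are assumed to be already scaled to [0, 1], there are two crime types, and `type_weight` mirrors the variable of the same name in the clustering cell. The helper names are made up for this example.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def feature_vector(x, y, crime_index, n_types, type_weight):
    """Scaled coordinates followed by a weighted one-hot crime type."""
    one_hot = [0.0] * n_types
    one_hot[crime_index] = 1.0 * type_weight
    return [x, y] + one_hot

def compare(type_weight):
    a = feature_vector(0.1, 0.1, 0, 2, type_weight)  # type 0, one corner of the city
    b = feature_vector(0.9, 0.9, 0, 2, type_weight)  # type 0, opposite corner
    c = feature_vector(0.1, 0.1, 1, 2, type_weight)  # type 1, same place as a
    return squared_distance(a, b), squared_distance(a, c)

far_same_type, near_other_type = compare(type_weight=1)
print(far_same_type, near_other_type)
```

With `type_weight = 1`, two crimes of different types at the same spot are further apart than two crimes of the same type at opposite corners, so K-means groups by type first. Lowering `type_weight` (for example to 0.5) reverses that ordering in this toy case, which suggests it is a parameter worth exploring.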

DBSCAN

In what follows we use the DBSCAN clustering technique, which is density-based. This lets us distinguish areas with a high density of crime from those where crimes have occurred only sporadically, since, as seen in the EDA, crimes have been committed in practically the whole city.

Another advantage of a density-based clustering method such as DBSCAN is that we can filter the data and obtain different clusters. As shown below, taking the data for one specific crime type at different times of day yields different clusters, which gives us clues about how that crime behaves throughout the day.

Type and hour fixed, clustering coordinates

Next we work exclusively with narcotics-related crimes, applying DBSCAN to different periods of the day (day, morning and night).

In [0]:
data_sample = relevant_data.sample(frac=0.05)
data_sample['Date'] = pd.to_datetime(data_sample['Date'])
data_sample['Hour'] = [d.hour for d in data_sample['Date']]
data_sample['Month'] = [d.month for d in data_sample['Date']]
In [0]:
narcotics = data_sample[data_sample['Primary Type'] == 'NARCOTICS']
# .copy() makes each subset an independent frame, so a 'cluster' column can be added to it later
narcotics_day = narcotics[(narcotics['Hour'] > 9) & (narcotics['Hour'] <= 17)].copy()
narcotics_night = narcotics[(narcotics['Hour'] > 17) | (narcotics['Hour'] <= 1)].copy()
narcotics_morning = narcotics[(narcotics['Hour'] > 1) & (narcotics['Hour'] <= 9)].copy()
narcotics_night.shape
Out[0]:
(14798, 18)
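
The three masks above split the 24 hours into non-overlapping periods. The sketch below restates the same boundaries as a plain function (the name `period_of_day` is hypothetical) so the ranges can be checked hour by hour.

```python
def period_of_day(hour):
    """Map an hour (0-23) to the period used in the masks above."""
    if 9 < hour <= 17:
        return "day"
    if hour > 17 or hour <= 1:
        return "night"
    return "morning"  # remaining case: 1 < hour <= 9

for h in (0, 1, 2, 9, 10, 17, 18, 23):
    print(h, period_of_day(h))
```

With these boundaries, "night" covers 18:00 to 01:59, "morning" 02:00 to 09:59 and "day" 10:00 to 17:59, so each period spans eight hours.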

Running the clustering

We now run DBSCAN on each of the subsets of narcotics-related crimes, and then plot heat maps with the points belonging to each of the clusters.

In [0]:
# Density-based clustering of the night-time narcotics reports
db = DBSCAN(eps=0.007, min_samples=100).fit(narcotics_night[['coorX', 'coorY']])
# Number of clusters found, not counting the noise label (-1)
k = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
narcotics_night['cluster'] = db.labels_
# One DataFrame per cluster
clusters_night = [narcotics_night[narcotics_night['cluster'] == i] for i in range(k)]
k
Out[0]:
11
In [0]:
hm= folium.Map(location=[41.8, -87.6],
                        zoom_start=10.5,
                        tiles="Stamen Terrain")
for cluster in clusters_night:
  n_points = min(cluster.shape[0], 10000)
  location=[]
  for i in range(n_points):
    lat = cluster['coorY'].iloc[i]
    long = cluster['coorX'].iloc[i]
    loc=(long, lat)
    if np.isfinite(lat):
      location.append(loc)
  hm.add_child(plugins.HeatMap(location, radius=15))
hm
Out[0]:
In [0]:
db = DBSCAN(eps=0.008, min_samples=100).fit(narcotics_day[['coorX', 'coorY']])
k = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
narcotics_day['cluster'] = db.labels_
clusters_day = []
for i in range(k):
  clusters_day.append(narcotics_day[narcotics_day['cluster'] == i])
k
In [0]:
hm= folium.Map(location=[41.8, -87.6],
                        zoom_start=10.5,
                        tiles="Stamen Terrain")
for cluster in clusters_day:
  n_points = min(cluster.shape[0], 10000)
  location=[]
  for i in range(n_points):
    lat = cluster['coorY'].iloc[i]
    long = cluster['coorX'].iloc[i]
    loc=(long, lat)
    if np.isfinite(lat):
      location.append(loc)
  hm.add_child(plugins.HeatMap(location, radius=15))
hm
Out[0]:
In [0]:
db = DBSCAN(eps=0.008, min_samples=100).fit(narcotics_morning[['coorX', 'coorY']])
k = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
narcotics_morning['cluster'] = db.labels_
clusters_morning = []
for i in range(k):
  clusters_morning.append(narcotics_morning[narcotics_morning['cluster'] == i])
k
In [0]:
hm= folium.Map(location=[41.8, -87.6],
                        zoom_start=10.5,
                        tiles="Stamen Terrain")
for cluster in clusters_morning:
  n_points = min(cluster.shape[0], 10000)
  location=[]
  for i in range(n_points):
    lat = cluster['coorY'].iloc[i]
    long = cluster['coorX'].iloc[i]
    loc=(long, lat)
    if np.isfinite(lat):
      location.append(loc)
  hm.add_child(plugins.HeatMap(location, radius=15))
hm
Out[0]:

Interpreting DBSCAN

Using DBSCAN revealed several areas of Chicago where narcotics-related crimes are concentrated. It also shows that there is one area in particular where this type of crime occurs at all hours of the day.

This suggests that DBSCAN is the most appropriate technique for this kind of data: because it determines clusters from density, it is not sensitive to noise.
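
One way to back up the observation about a zone that is active at all hours would be to compare cluster centroids across the three periods. The sketch below does this on made-up coordinates. The helpers `centroid` and `persistent_zones`, the tolerance and the sample points are all hypothetical and are not part of the notebook.

```python
from math import dist

def centroid(points):
    """Mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def persistent_zones(periods, tol):
    """Centroids of the first period that have a centroid within tol in every other period.

    periods is a list (one entry per period) of lists of clusters,
    where each cluster is a list of (x, y) points.
    """
    first, *rest = [[centroid(c) for c in clusters] for clusters in periods]
    return [c for c in first
            if all(any(dist(c, other) <= tol for other in cents) for cents in rest)]

# Made-up clusters for three periods (two points per cluster, for brevity)
night = [[(41.880, -87.720), (41.882, -87.722)],
         [(41.750, -87.600), (41.752, -87.602)]]
day = [[(41.881, -87.721), (41.883, -87.719)]]
morning = [[(41.879, -87.723), (41.881, -87.721)],
           [(41.700, -87.650), (41.702, -87.652)]]

zones = persistent_zones([night, day, morning], tol=0.008)
print(zones)
```

With the notebook's variables, each cluster DataFrame in `clusters_night`, `clusters_day` and `clusters_morning` could be converted to a list of points with `list(zip(c['coorX'], c['coorY']))` before calling the helper.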

Milestone 3

DBSCAN, continued

Given the results of our first approach with DBSCAN, we decided to keep looking for clusters with this method.

Our goal now is to see how crime clusters evolve over the years. The initial exploratory analysis already showed that crime decreases over time; we now want to see how that decrease behaves, and in particular whether or not it is uniform across the city. To do so, we repeat the clustering method on the data for 2001, 2006, 2011 and 2015, specifically for the crime types Narcotics and Assault, since both are among the 6 most frequent according to the earlier exploratory analysis.

In [0]:
def plot_crime_year_clusters(crime, year, data, eps=0.008, min_samples=200):
  # Optionally restrict the data to one crime type and/or one year
  if crime:
    data = data[data['Primary Type'] == crime]
  if year:
    data = data[data['Year'] == year]
  # Copy so that adding the 'cluster' column does not touch the caller's DataFrame
  data = data.copy()
  db = DBSCAN(eps=eps, min_samples=min_samples).fit(data[['coorX', 'coorY']])
  # Number of clusters found, not counting the noise label (-1)
  k = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
  data['cluster'] = db.labels_
  clusters = [data[data['cluster'] == i] for i in range(k)]

  hm= folium.Map(location=[41.8, -87.6],
                          zoom_start=10.5,
                          tiles="Stamen Terrain")
  # One heat-map layer per cluster, using at most 10000 points from each
  for cluster in clusters:
    n_points = min(cluster.shape[0], 10000)
    location=[]
    for i in range(n_points):
      lat = cluster['coorY'].iloc[i]
      long = cluster['coorX'].iloc[i]
      loc=(long, lat)
      if np.isfinite(lat):
        location.append(loc)
    hm.add_child(plugins.HeatMap(location, radius=15))
  return hm
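
The `eps` values passed to DBSCAN in this notebook (0.007 and 0.008) are expressed in the units of `coorX` and `coorY`. Assuming these columns hold longitude and latitude in decimal degrees (an assumption: the notebook does not show how they were derived), the rough conversion below gives a sense of the neighbourhood radius in kilometres. The helper `eps_in_km` and the spherical-Earth constant are illustrative only.

```python
from math import cos, radians

# Approximate length of one degree on a spherical Earth:
# equatorial circumference (about 40,075 km) divided by 360
KM_PER_DEGREE = 40075.0 / 360.0

def eps_in_km(eps_deg, latitude_deg=41.8):
    """Approximate north-south and east-west extent of eps_deg degrees near a latitude."""
    north_south = eps_deg * KM_PER_DEGREE
    # Meridians converge towards the poles, so a degree of longitude is shorter
    east_west = eps_deg * KM_PER_DEGREE * cos(radians(latitude_deg))
    return north_south, east_west

ns, ew = eps_in_km(0.008)
print(f"{ns:.2f} km north-south, {ew:.2f} km east-west")
```

Under this assumption, a Euclidean `eps` in degrees does not describe a circle on the ground: near Chicago it reaches noticeably further north-south than east-west.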

Assault

We start with assault, with the aim of seeing how this type of crime decreases.

Assault in 2001

In [0]:
plot_crime_year_clusters('ASSAULT', 2001, data, eps=0.008, min_samples=150)
Out[0]:

For 2001, the clusters found are large both in density and in the area of the city they cover, and they are present across almost the entire city.

Assault in 2006

In [0]:
plot_crime_year_clusters('ASSAULT', 2006, data, eps=0.008, min_samples=150)
Out[0]:

There is a slight decrease in the number of crimes in the clusters; the largest decreases are at certain points along the coast that no longer appear in the selection.

Assault in 2011

In [0]:
plot_crime_year_clusters('ASSAULT', 2011, data, eps=0.008, min_samples=150)
Out[0]:

The decrease in crime is much more noticeable from 2006 to 2011: the two high-density hot spots in the interior of the city have shrunk markedly, as has the northern part of the coast, where only one point with a high concentration of this type of crime remains.

Assault in 2015

In [0]:
plot_crime_year_clusters('ASSAULT', 2015, data, eps=0.008, min_samples=150)
Out[0]:

In this last map, the number of crimes in each cluster has decreased again, but only slightly compared with the 2006-2011 change. All of the high-density areas seen in 2011 persist in 2015, and the coast has not been completely cleared of this type of crime.

Narcotics

Below are the results of the same analysis for narcotics crimes, for the same years.

Narcotics in 2001

In [0]:
plot_crime_year_clusters('NARCOTICS', 2001, data)
Out[0]:

The concentration and the area covered by the narcotics clusters in 2001 are also quite high, with a slightly greater concentration in the south of the city than in the assault case for the same year.

Narcotics in 2006

In [0]:
plot_crime_year_clusters('NARCOTICS', 2006, data)
Out[0]:

Mainly, the clusters shrink slightly, and some areas in the north of Chicago already disappear.

Narcotics in 2011

In [0]:
plot_crime_year_clusters('NARCOTICS', 2011, data)
Out[0]:

The greatest effort or effectiveness in reducing crime appears to be concentrated along the entire coast and in the southernmost part of the city. The clusters seem to shrink "from the outside in" rather than uniformly across the whole city.

Narcotics in 2015

In [0]:
plot_crime_year_clusters('NARCOTICS', 2015, data)
Out[0]:

Practically the entire coast, except for a couple of places, no longer appears as a cluster, and the same is true of the whole southern part of Chicago. Narcotics shows a much larger decrease than assault in both the density and the area covered by the high-concentration zones. The largest decrease also occurred between 2011 and 2015, rather than between 2006 and 2011 as for assault. This may indicate that crime-reduction efforts have not been uniform across crime types, but have instead rotated their focus among the types of crime the city seeks to reduce.
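
The year-to-year comparisons above are visual. They could be complemented with a few numbers computed from the DBSCAN labels of each year, as in the sketch below. The helper `cluster_summary` and the label lists are made up for illustration; in the notebook, the labels would come from `db.labels_`, which `plot_crime_year_clusters` does not currently return.

```python
from collections import Counter

def cluster_summary(labels):
    """Summarise DBSCAN labels: number of clusters and how many points are clustered."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)          # -1 marks noise points
    clustered = sum(counts.values())
    total = clustered + noise
    return {
        "n_clusters": len(counts),
        "clustered_points": clustered,
        "noise_points": noise,
        "clustered_share": clustered / total if total else 0.0,
    }

# Made-up labels for two years, only to show the shape of the output
labels_early = [0, 0, 0, 1, 1, -1]
labels_late = [0, 0, -1, -1, -1, -1]
summary_early = cluster_summary(labels_early)
summary_late = cluster_summary(labels_late)
print(summary_early)
print(summary_late)
```

Comparing `clustered_points` and `n_clusters` across 2001, 2006, 2011 and 2015 would give a rough numerical counterpart to the maps, keeping in mind that the values depend on the chosen `eps` and `min_samples`.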

Association rules

The DBSCAN analyses were very useful for seeing graphically how the hot spots of specific crimes (here narcotics and assault) change over the years.

However, we believe there may still be relationships that this method did not find, so we decided to work with the data in a different way, now using association rules, with the aim of finding new relationships in the available data. Specifically, we want to find relationships among location, time, type of crime and the description of the place where the crime occurs, this time using aggregated data.

The data were grouped as follows:

  • The months of the year are grouped into seasons (spring, winter, summer, fall)
  • The hour at which the crime occurs is divided into 3 groups: day (9 am to 5 pm), night (5 pm to 1 am) and early morning (1 am to 9 am).

To work with association rules, the data had to be transformed. In this process, all columns not used in later steps were removed and each row of the dataset became a "transaction". Every possible value of the chosen columns in the original dataset became an "item" that a transaction may contain. A transaction is therefore a subset of the itemset formed by the union of all these possible values.
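
As a small illustration of this transformation, the sketch below turns three records (taken from the first rows of the notebook's data, restricted to three columns) into transactions and then into a one-hot matrix. The helpers `to_transaction` and `one_hot` are hypothetical stand-ins written with the standard library. The notebook itself uses `pd.get_dummies`, whose default `column_value` naming is mimicked here.

```python
def to_transaction(row, columns):
    """Turn one record into a set of 'column_value' items."""
    return frozenset(f"{col}_{row[col]}" for col in columns)

def one_hot(transactions):
    """Return the sorted item list and one boolean row per transaction."""
    items = sorted(set().union(*transactions))
    return items, [[item in t for item in items] for t in transactions]

columns = ["Day period", "District", "Primary Type"]
rows = [
    {"Day period": "night", "District": "10.0", "Primary Type": "BATTERY"},
    {"Day period": "night", "District": "3.0", "Primary Type": "BATTERY"},
    {"Day period": "night", "District": "15.0", "Primary Type": "PUBLIC PEACE VIOLATION"},
]

transactions = [to_transaction(r, columns) for r in rows]
items, matrix = one_hot(transactions)
for item in items:
    print(item)
```

Each transaction contains exactly one item per original column, so every row of the boolean matrix has as many true entries as there are columns.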

In [0]:
import calendar
from mlxtend.frequent_patterns import apriori, association_rules

# Note: data_sample is the same object as data, so the columns added below also appear in data
data_sample = data
data_sample['Date'] = pd.to_datetime(data_sample['Date'])
data_sample['Hour'] = data_sample['Date'].dt.hour
# Month as a three-letter abbreviation ('Jan', 'Feb', ...)
data_sample['Month'] = data_sample['Date'].dt.month.apply(lambda x: calendar.month_abbr[x])

def day_period(x):
  # Map an hour (0-23) to 'day' (10-17), 'night' (18-1) or 'morning' (2-9)
  if x > 9 and x <= 17:
    return 'day'
  elif x > 17 or x <= 1:
    return 'night'
  else:
    return 'morning'

def season(x):
  # Map a month abbreviation to its season
  if x in ['Dec', 'Jan', 'Feb']:
    return 'winter'
  elif x in ['Mar', 'Apr', 'May']:
    return 'spring'
  elif x in ['Jun', 'Jul', 'Aug']:
    return 'summer'
  elif x in ['Sep', 'Oct', 'Nov']:
    return 'fall'

data_sample['Day period'] = data_sample['Hour'].apply(day_period)
data_sample['Season'] = data_sample['Month'].apply(season)
# Drop the columns that are not used in the rest of the analysis
input_data = data_sample.drop(['ID', 'Date', 'Block', 'IUCR', 'Location', 'Description',
                                'coorX', 'coorY', 'Ward', 'Community Area'], axis=1)
input_data['Year'] = pd.to_numeric(input_data['Year'], downcast='integer')

# Keep only the columns that can become items, in a fixed order
input_data = input_data[['Year', 'Month', 'Day period', 'District','Location Description','Primary Type','Hour','Season']]

input_data.head()
Out[0]:
   Year  Month  Day period  District  Location Description  Primary Type            Hour  Season
0  2016  May    night       10.0      APARTMENT             BATTERY                 23    spring
1  2016  May    night       3.0       RESIDENCE             BATTERY                 21    spring
2  2016  May    night       15.0      STREET                PUBLIC PEACE VIOLATION  23    spring
3  2016  May    night       15.0      SIDEWALK              BATTERY                 22    spring
4  2016  May    night       15.0      RESIDENCE             THEFT                   22    spring
In [0]:
delitos=['Primary Type_ASSAULT', #0
 'Primary Type_BATTERY',  #1
 'Primary Type_BURGLARY', #2
 'Primary Type_CRIM SEXUAL ASSAULT', #3
 'Primary Type_CRIMINAL DAMAGE', #4
 'Primary Type_CRIMINAL TRESPASS', #5
 'Primary Type_DECEPTIVE PRACTICE', #6
 'Primary Type_LIQUOR LAW VIOLATION', #7
 'Primary Type_MOTOR VEHICLE THEFT', #8
 'Primary Type_NARCOTICS', #9
 'Primary Type_OTHER OFFENSE', #10
 'Primary Type_PUBLIC PEACE VIOLATION', #11
 'Primary Type_ROBBERY', #12
 'Primary Type_SEX OFFENSE', #13
 'Primary Type_THEFT', #14
 'Primary Type_WEAPONS VIOLATION'] #15

def reglas_anuales(año, delito):
  # The casts are done on input_data itself, so later cells also see
  # 'Hour' and 'District' as strings
  input_data["Hour"] = input_data["Hour"].astype('str')
  input_data["District"] = input_data["District"].astype('str')
  # Transactions for the requested year: day period, district and crime type
  sample = input_data[input_data['Year'] == año]
  sample = sample[['Day period', 'District', 'Primary Type']]
  basket_set = pd.get_dummies(sample)
  frequent_itemsets = apriori(basket_set, min_support=0.01, use_colnames=True)
  rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
  rules['length'] = rules['antecedents'].apply(lambda x: len(x))
  # Keep only the rules whose consequent is exactly the requested crime
  reglas_utiles = rules[rules['consequents'] == {delito}].copy()
  reglas_utiles = reglas_utiles.sort_values(['lift','antecedents'], ascending=[0,1])
  # Record the year so that rules from different years can be stacked
  reglas_utiles['Año'] = año
  print("Rules for year " + str(año) + ", crime: " + delito)
  return reglas_utiles

años=[2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]

# For every crime type, stack the rules obtained for each year
dicc={}
for i in delitos:
  dicc[i] = pd.concat([reglas_anuales(j, i) for j in años])

# dicc maps each crime type to the DataFrame with its association rules

Annual rules: {District} -> {Crime}

The annual rules for each type of crime are presented below. The support of the rules is low (~0.015), which is due to how fragmented crime is across the city of Chicago. Even so, the same district antecedents recur across different years.

In [0]:
#To see other crimes, change the index into the delitos list.
dicc[delitos[1]].sort_values(['lift','antecedents'], ascending=[0,1])
Out[0]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction length Año
106 (District_7.0) (Primary Type_BATTERY) 0.054928 0.192714 0.015754 0.286818 1.488311 0.005169 1.131950 1 2001
110 (District_7.0) (Primary Type_BATTERY) 0.058907 0.194568 0.016991 0.288441 1.482468 0.005530 1.131926 1 2002
106 (District_5.0) (Primary Type_BATTERY) 0.041930 0.186287 0.011117 0.265138 1.423273 0.003306 1.107300 1 2003
110 (District_7.0) (Primary Type_BATTERY) 0.058582 0.186287 0.015380 0.262538 1.409318 0.004467 1.103396 1 2003
104 (District_7.0) (Primary Type_BATTERY) 0.061021 0.185776 0.015939 0.261208 1.406035 0.004603 1.102102 1 2004
106 (District_5.0) (Primary Type_BATTERY) 0.040217 0.194568 0.010846 0.269687 1.386083 0.003021 1.102859 1 2002
106 (District_7.0) (Primary Type_BATTERY) 0.061350 0.180837 0.015267 0.248848 1.376088 0.004172 1.090542 1 2006
100 (District_7.0) (Primary Type_BATTERY) 0.060270 0.185894 0.015403 0.255559 1.374755 0.004199 1.093580 1 2005
102 (District_5.0) (Primary Type_BATTERY) 0.041137 0.192714 0.010848 0.263706 1.368381 0.002920 1.096418 1 2001
100 (District_5.0) (Primary Type_BATTERY) 0.043258 0.185776 0.010933 0.252733 1.360413 0.002896 1.089601 1 2004
106 (District_7.0) (Primary Type_BATTERY) 0.062891 0.182608 0.015529 0.246925 1.352217 0.004045 1.085407 1 2007
104 (District_7.0) (Primary Type_BATTERY) 0.064442 0.179662 0.015345 0.238122 1.325389 0.003767 1.076731 1 2008
100 (District_7.0) (Primary Type_BATTERY) 0.059402 0.177019 0.013825 0.232740 1.314773 0.003310 1.072623 1 2013
106 (District_7.0) (Primary Type_BATTERY) 0.056873 0.180685 0.013503 0.237421 1.314001 0.003227 1.074399 1 2014
110 (District_7.0) (Primary Type_BATTERY) 0.052031 0.193757 0.013149 0.252715 1.304287 0.003068 1.078896 1 2016
82 (District_10.0) (Primary Type_BATTERY) 0.047152 0.193757 0.011852 0.251350 1.297243 0.002716 1.076929 1 2016
92 (District_15.0) (Primary Type_BATTERY) 0.042428 0.193757 0.010618 0.250258 1.291605 0.002397 1.075360 1 2016
114 (District_7.0) (Primary Type_BATTERY) 0.060520 0.176607 0.013680 0.226034 1.279867 0.002991 1.063862 1 2012
106 (District_7.0) (Primary Type_BATTERY) 0.059663 0.175548 0.013346 0.223698 1.274287 0.002873 1.062026 1 2009
114 (District_7.0) (Primary Type_BATTERY) 0.061101 0.176819 0.013744 0.224946 1.272177 0.002941 1.062094 1 2010
108 (District_7.0) (Primary Type_BATTERY) 0.060576 0.172265 0.013177 0.217522 1.262719 0.002741 1.057838 1 2011
92 (District_10.0) (Primary Type_BATTERY) 0.045200 0.176607 0.010005 0.221348 1.253335 0.002022 1.057459 1 2012
100 (District_3.0) (Primary Type_BATTERY) 0.049531 0.186990 0.011575 0.233680 1.249697 0.002313 1.060929 1 2015
84 (District_10.0) (Primary Type_BATTERY) 0.044639 0.186990 0.010420 0.233422 1.248317 0.002073 1.060571 1 2015
96 (District_5.0) (Primary Type_BATTERY) 0.044574 0.185894 0.010335 0.231862 1.247280 0.002049 1.059843 1 2005
98 (District_3.0) (Primary Type_BATTERY) 0.050051 0.192714 0.012010 0.239955 1.245136 0.002364 1.062156 1 2001
102 (District_3.0) (Primary Type_BATTERY) 0.046042 0.193757 0.011068 0.240384 1.240643 0.002147 1.061382 1 2016
102 (District_3.0) (Primary Type_BATTERY) 0.048394 0.194568 0.011648 0.240686 1.237026 0.002232 1.060736 1 2002
102 (District_3.0) (Primary Type_BATTERY) 0.049555 0.186287 0.011372 0.229481 1.231865 0.002140 1.056058 1 2003
106 (District_7.0) (Primary Type_BATTERY) 0.059262 0.186990 0.013553 0.228696 1.223039 0.002472 1.054072 1 2015
... ... ... ... ... ... ... ... ... ... ... ...
38 (Day period_morning) (Primary Type_BATTERY) 0.194024 0.182608 0.037752 0.194573 1.065522 0.002321 1.014855 1 2007
94 (District_11.0) (Primary Type_BATTERY) 0.065659 0.176607 0.012282 0.187051 1.059132 0.000686 1.012846 1 2012
70 (Day period_night) (Primary Type_BATTERY) 0.392621 0.193757 0.080566 0.205200 1.059056 0.004493 1.014397 1 2016
104 (District_6.0) (Primary Type_BATTERY) 0.052635 0.192714 0.010724 0.203738 1.057203 0.000580 1.013845 1 2001
88 (District_11.0) (Primary Type_BATTERY) 0.063159 0.180837 0.012071 0.191116 1.056841 0.000649 1.012708 1 2006
78 (Day period_night) (Primary Type_BATTERY) 0.401654 0.176607 0.074860 0.186380 1.055333 0.003925 1.012011 1 2012
42 (Day period_morning) (Primary Type_BATTERY) 0.185054 0.186287 0.036226 0.195760 1.050851 0.001753 1.011779 1 2003
68 (Day period_night) (Primary Type_BATTERY) 0.391481 0.177019 0.072809 0.185985 1.050650 0.003510 1.011015 1 2013
34 (Day period_morning) (Primary Type_BATTERY) 0.189518 0.185776 0.036953 0.194983 1.049554 0.001745 1.011436 1 2004
72 (Day period_night) (Primary Type_BATTERY) 0.397993 0.180685 0.075435 0.189538 1.048996 0.003523 1.010923 1 2014
74 (Day period_night) (Primary Type_BATTERY) 0.424296 0.194568 0.086469 0.203794 1.047416 0.003914 1.011587 1 2002
34 (Day period_morning) (Primary Type_BATTERY) 0.188124 0.185894 0.036595 0.194527 1.046441 0.001624 1.010718 1 2005
72 (Day period_night) (Primary Type_BATTERY) 0.398741 0.186990 0.077642 0.194718 1.041328 0.003081 1.009597 1 2015
116 (District_9.0) (Primary Type_BATTERY) 0.050760 0.194568 0.010244 0.201805 1.037198 0.000367 1.009067 1 2002
74 (Day period_night) (Primary Type_BATTERY) 0.401728 0.172265 0.071749 0.178600 1.036777 0.002545 1.007713 1 2011
30 (Day period_morning) (Primary Type_BATTERY) 0.182883 0.192714 0.036510 0.199638 1.035926 0.001266 1.008650 1 2001
98 (District_25.0) (Primary Type_BATTERY) 0.056866 0.180685 0.010641 0.187129 1.035660 0.000366 1.007927 1 2014
34 (Day period_morning) (Primary Type_BATTERY) 0.187824 0.194568 0.037750 0.200985 1.032980 0.001205 1.008031 1 2002
80 (Day period_night) (Primary Type_BATTERY) 0.409488 0.176819 0.074767 0.182587 1.032617 0.002362 1.007056 1 2010
98 (District_25.0) (Primary Type_BATTERY) 0.057087 0.186990 0.011009 0.192839 1.031283 0.000334 1.007247 1 2015
72 (Day period_night) (Primary Type_BATTERY) 0.414328 0.175548 0.074579 0.180000 1.025360 0.001845 1.005429 1 2009
74 (Day period_night) (Primary Type_BATTERY) 0.423854 0.186287 0.080816 0.190670 1.023528 0.001858 1.005416 1 2003
70 (Day period_night) (Primary Type_BATTERY) 0.421067 0.185776 0.080052 0.190118 1.023368 0.001828 1.005360 1 2004
68 (Day period_night) (Primary Type_BATTERY) 0.420906 0.185894 0.079907 0.189846 1.021255 0.001663 1.004877 1 2005
72 (Day period_night) (Primary Type_BATTERY) 0.423246 0.179662 0.077251 0.182521 1.015914 0.001210 1.003497 1 2008
72 (Day period_night) (Primary Type_BATTERY) 0.423987 0.182608 0.078600 0.185383 1.015196 0.001177 1.003406 1 2007
86 (District_11.0) (Primary Type_BATTERY) 0.073300 0.186990 0.013876 0.189308 1.012400 0.000170 1.002860 1 2015
100 (District_25.0) (Primary Type_BATTERY) 0.054586 0.193757 0.010670 0.195465 1.008814 0.000093 1.002123 1 2016
96 (District_25.0) (Primary Type_BATTERY) 0.056453 0.182608 0.010377 0.183810 1.006586 0.000068 1.001473 1 2007
76 (Day period_night) (Primary Type_BATTERY) 0.420606 0.180837 0.076504 0.181890 1.005823 0.000443 1.001287 1 2006

129 rows × 11 columns
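
For readers unfamiliar with the columns of this table, the sketch below recomputes confidence, lift, leverage and conviction for the first row (District_7.0 -> Primary Type_BATTERY, 2001) from its three support values, using the usual definitions of these metrics. The helper `rule_metrics` is illustrative and is not part of mlxtend. Small differences in the last decimals are expected because the inputs are already rounded.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Metrics of a rule A -> C from the supports of A, C and A-and-C together."""
    confidence = support_ac / support_a            # P(C | A)
    lift = confidence / support_c                  # > 1: A and C co-occur more than if independent
    leverage = support_ac - support_a * support_c  # observed minus expected joint support
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Supports taken from the first row of the table above
m = rule_metrics(support_a=0.054928, support_c=0.192714, support_ac=0.015754)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A lift of about 1.49 means that, in 2001, battery was roughly 1.5 times as frequent among crimes reported in District 7 as among crimes overall, even though the rule itself covers only about 1.6% of all records.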

In this case we analyse the co-occurrence of the type of crime, the time at which it occurs and the place where it is committed. Here the rules obtained are fairly intuitive, redundant and/or useless, since we get rules such as:

  1. {Season_winter, Day period_night} -> {Location Sidewalk}
  2. {Location Sidewalk, Day period_day} -> {Narcotics}
  3. {Day period_day, Narcotics} -> {Location Sidewalk}

This shows the importance of filtering rules to discard those that provide no valuable information.
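
One possible filter, sketched below under the assumption that only rules predicting a crime type are of interest, keeps the rules whose consequent consists solely of `Primary Type_` items. The helper `predicts_crime` is hypothetical, and the rules are written as plain dictionaries rather than as the mlxtend DataFrame. Their item sets and lift values are copied from the notebook's results.

```python
def predicts_crime(rule, prefix="Primary Type_"):
    """True when every item in the consequent is a crime type."""
    return all(item.startswith(prefix) for item in rule["consequents"])

rules = [
    {"antecedents": frozenset({"Location Description_SIDEWALK", "Day period_day"}),
     "consequents": frozenset({"Primary Type_NARCOTICS"}), "lift": 3.616906},
    {"antecedents": frozenset({"Primary Type_NARCOTICS", "Day period_day"}),
     "consequents": frozenset({"Location Description_SIDEWALK"}), "lift": 3.359794},
    {"antecedents": frozenset({"Location Description_SIDEWALK", "Day period_night"}),
     "consequents": frozenset({"Primary Type_NARCOTICS"}), "lift": 3.216473},
    {"antecedents": frozenset({"Season_summer", "Day period_night"}),
     "consequents": frozenset({"Location Description_SIDEWALK"}), "lift": 1.288687},
]

# Keep crime-predicting rules, strongest association first
useful = sorted((r for r in rules if predicts_crime(r)),
                key=lambda r: r["lift"], reverse=True)
for r in useful:
    print(sorted(r["antecedents"]), "->", sorted(r["consequents"]), r["lift"])
```

This drops both the mirrored rule (narcotics and day predicting sidewalk) and the rule that relates only season, time and place, neither of which says anything about which crime to expect.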

In [0]:
new_basket = pd.get_dummies(input_data[['Season', 'Day period', 'District', 'Location Description', 'Primary Type']])
new_itemsets = apriori(new_basket, min_support=0.01, use_colnames=True)
new_rules = association_rules(new_itemsets, metric="lift", min_threshold=1)
new_rules["antecedent_len"] = new_rules["antecedents"].apply(lambda x: len(x))
new_rules[new_rules['antecedent_len'] >= 2].sort_values(['lift'], ascending=[0])
Out[0]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len
602 (Location Description_SIDEWALK, Day period_day) (Primary Type_NARCOTICS) 0.040259 0.109059 0.015880 0.394458 3.616906 0.011490 1.471310 2
601 (Primary Type_NARCOTICS, Day period_day) (Location Description_SIDEWALK) 0.046908 0.100762 0.015880 0.338541 3.359794 0.011154 1.359475 2
650 (Location Description_SIDEWALK, Day period_night) (Primary Type_NARCOTICS) 0.048317 0.109059 0.016949 0.350787 3.216473 0.011680 1.372339 2
664 (Primary Type_MOTOR VEHICLE THEFT, Day period_night) (Location Description_STREET) 0.022396 0.266018 0.018679 0.834036 3.135259 0.012722 4.422545 2
649 (Primary Type_NARCOTICS, Day period_night) (Location Description_SIDEWALK) 0.053776 0.100762 0.016949 0.315178 3.127934 0.011530 1.313097 2
663 (Location Description_STREET, Day period_night) (Primary Type_MOTOR VEHICLE THEFT) 0.131018 0.047122 0.018679 0.142571 3.025578 0.012506 1.111320 2
638 (Location Description_RESIDENCE, Day period_night) (Primary Type_OTHER OFFENSE) 0.062321 0.061915 0.010961 0.175883 2.840722 0.007103 1.138292 2
594 (Location Description_RESIDENCE, Day period_day) (Primary Type_OTHER OFFENSE) 0.067358 0.061915 0.011777 0.174843 2.823923 0.007607 1.136857 2
639 (Primary Type_OTHER OFFENSE, Day period_night) (Location Description_RESIDENCE) 0.023314 0.168030 0.010961 0.470158 2.798053 0.007044 1.570222 2
611 (Primary Type_MOTOR VEHICLE THEFT, Day period_day) (Location Description_STREET) 0.013782 0.266018 0.010135 0.735363 2.764333 0.006468 2.773543 2
596 (Day period_day, Primary Type_OTHER OFFENSE) (Location Description_RESIDENCE) 0.026022 0.168030 0.011777 0.452577 2.693421 0.007405 1.519792 2
610 (Location Description_STREET, Day period_day) (Primary Type_MOTOR VEHICLE THEFT) 0.084811 0.047122 0.010135 0.119495 2.535871 0.006138 1.082195 2
626 (Location Description_APARTMENT, Day period_night) (Primary Type_BATTERY) 0.039986 0.183289 0.016306 0.407795 2.224879 0.008977 1.379102 2
628 (Primary Type_BATTERY, Day period_night) (Location Description_APARTMENT) 0.078525 0.101638 0.016306 0.207655 2.043080 0.008325 1.133801 2
644 (Primary Type_BATTERY, Day period_night) (Location Description_SIDEWALK) 0.078525 0.100762 0.013512 0.172070 1.707684 0.005599 1.086128 2
615 (Location Description_STREET, Day period_day) (Primary Type_NARCOTICS) 0.084811 0.109059 0.015258 0.179904 1.649597 0.006008 1.086386 2
586 (Location Description_APARTMENT, Day period_day) (Primary Type_BATTERY) 0.036354 0.183289 0.010506 0.288982 1.576654 0.003842 1.148652 2
587 (Day period_day, Primary Type_BATTERY) (Location Description_APARTMENT) 0.067043 0.101638 0.010506 0.156701 1.541752 0.003692 1.065294 2
658 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Location Description_STREET) 0.055561 0.266018 0.022555 0.405947 1.526011 0.007775 1.235549 2
642 (Location Description_SIDEWALK, Day period_night) (Primary Type_BATTERY) 0.048317 0.183289 0.013512 0.279649 1.525730 0.004656 1.133768 2
657 (Location Description_STREET, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.131018 0.115668 0.022555 0.172150 1.488313 0.007400 1.068227 2
670 (Location Description_STREET, Day period_night) (Primary Type_NARCOTICS) 0.131018 0.109059 0.020982 0.160149 1.468454 0.006694 1.060831 2
669 (Primary Type_NARCOTICS, Day period_night) (Location Description_STREET) 0.053776 0.266018 0.020982 0.390182 1.466749 0.006677 1.203608 2
632 (Location Description_RESIDENCE, Day period_night) (Primary Type_BATTERY) 0.062321 0.183289 0.016238 0.260556 1.421562 0.004815 1.104494 2
385 (Season_fall, Location Description_STREET) (Primary Type_CRIMINAL DAMAGE) 0.069081 0.115668 0.011210 0.162268 1.402878 0.003219 1.055626 2
452 (Season_spring, Location Description_STREET) (Primary Type_CRIMINAL DAMAGE) 0.064608 0.115668 0.010424 0.161343 1.394881 0.002951 1.054462 2
384 (Season_fall, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.030324 0.266018 0.011210 0.369662 1.389613 0.003143 1.164426 2
676 (Primary Type_THEFT, Day period_night) (Location Description_STREET) 0.074132 0.266018 0.027351 0.368947 1.386922 0.007630 1.163106 2
536 (Season_summer, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.032388 0.266018 0.011796 0.364219 1.369149 0.003181 1.154456 2
453 (Season_spring, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.028984 0.266018 0.010424 0.359651 1.351979 0.002714 1.146222 2
534 (Location Description_STREET, Season_summer) (Primary Type_CRIMINAL DAMAGE) 0.075542 0.115668 0.011796 0.156154 1.350025 0.003058 1.047979 2
580 (Season_winter, Day period_night) (Primary Type_NARCOTICS) 0.088656 0.109059 0.012947 0.146034 1.339034 0.003278 1.043298 2
656 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.041876 0.414109 0.022555 0.538612 1.300652 0.005214 1.269844 2
668 (Primary Type_NARCOTICS, Location Description_STREET) (Day period_night) 0.039064 0.414109 0.020982 0.537130 1.297075 0.004806 1.265780 2
499 (Season_summer, Day period_night) (Location Description_SIDEWALK) 0.119740 0.100762 0.015548 0.129851 1.288687 0.003483 1.033430 2
528 (Season_summer, Location Description_RESIDENCE) (Primary Type_BATTERY) 0.045076 0.183289 0.010577 0.234642 1.280178 0.002315 1.067097 2
448 (Season_spring, Day period_night) (Primary Type_NARCOTICS) 0.100382 0.109059 0.013707 0.136551 1.252078 0.002760 1.031839 2
474 (Season_summer, Day period_day) (Primary Type_THEFT) 0.103882 0.207835 0.026683 0.256855 1.235862 0.005092 1.065964 2
634 (Primary Type_BATTERY, Day period_night) (Location Description_RESIDENCE) 0.078525 0.168030 0.016238 0.206788 1.230661 0.003043 1.048862 2
522 (Primary Type_NARCOTICS, Season_summer) (Day period_night) 0.027675 0.414109 0.014017 0.506480 1.223061 0.002556 1.187168 2
614 (Primary Type_NARCOTICS, Day period_day) (Location Description_STREET) 0.046908 0.266018 0.015258 0.325273 1.222745 0.002780 1.087820 2
365 (Season_fall, Day period_night) (Location Description_STREET) 0.105330 0.266018 0.033987 0.322673 1.212974 0.005967 1.083645 2
504 (Location Description_STREET, Season_summer) (Day period_night) 0.075542 0.414109 0.037922 0.501992 1.212223 0.006639 1.176470 2
373 (Season_fall, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.105330 0.115668 0.014739 0.139931 1.209769 0.002556 1.028211 2
498 (Season_summer, Location Description_SIDEWALK) (Day period_night) 0.031150 0.414109 0.015548 0.499143 1.205343 0.002649 1.169777 2
662 (Location Description_STREET, Primary Type_MOTOR VEHICLE THEFT) (Day period_night) 0.037456 0.414109 0.018679 0.498695 1.204261 0.003168 1.168732 2
441 (Season_spring, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.100382 0.115668 0.013872 0.138193 1.194737 0.002261 1.026137 2
506 (Season_summer, Day period_night) (Location Description_STREET) 0.119740 0.266018 0.037922 0.316699 1.190517 0.006069 1.074171 2
563 (Primary Type_THEFT, Season_winter) (Day period_day) 0.045069 0.393950 0.021120 0.468608 1.189512 0.003365 1.140496 2
643 (Location Description_SIDEWALK, Primary Type_BATTERY) (Day period_night) 0.027436 0.414109 0.013512 0.492476 1.189242 0.002150 1.154410 2
529 (Season_summer, Primary Type_BATTERY) (Location Description_RESIDENCE) 0.052962 0.168030 0.010577 0.199702 1.188488 0.001677 1.039575 2
359 (Season_fall, Day period_night) (Location Description_SIDEWALK) 0.105330 0.100762 0.012611 0.119731 1.188252 0.001998 1.021549 2
364 (Season_fall, Location Description_STREET) (Day period_night) 0.069081 0.414109 0.033987 0.491991 1.188072 0.005380 1.153308 2
431 (Season_spring, Day period_night) (Location Description_STREET) 0.100382 0.266018 0.031722 0.316013 1.187936 0.005019 1.073093 2
446 (Primary Type_NARCOTICS, Season_spring) (Day period_night) 0.027871 0.414109 0.013707 0.491815 1.187646 0.002166 1.152909 2
430 (Season_spring, Location Description_STREET) (Day period_night) 0.064608 0.414109 0.031722 0.490991 1.185658 0.004967 1.151044 2
478 (Season_summer, Location Description_RESIDENCE) (Day period_morning) 0.045076 0.191942 0.010257 0.227552 1.185529 0.001605 1.046101 2
578 (Primary Type_NARCOTICS, Season_winter) (Day period_night) 0.026374 0.414109 0.012947 0.490898 1.185434 0.002025 1.150834 2
648 (Primary Type_NARCOTICS, Location Description_SIDEWALK) (Day period_night) 0.034590 0.414109 0.016949 0.489994 1.183250 0.002625 1.148793 2
349 (Season_fall, Day period_day) (Primary Type_THEFT) 0.101283 0.207835 0.024867 0.245522 1.181329 0.003817 1.049950 2
591 (Day period_day, Primary Type_BATTERY) (Location Description_RESIDENCE) 0.067043 0.168030 0.013297 0.198337 1.180366 0.002032 1.037805 2
420 (Season_spring, Primary Type_THEFT) (Day period_day) 0.048253 0.393950 0.022395 0.464115 1.178107 0.003386 1.130933 2
516 (Season_summer, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.032388 0.414109 0.015748 0.486227 1.174153 0.002336 1.140370 2
372 (Season_fall, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.030324 0.414109 0.014739 0.486051 1.173728 0.002182 1.139979 2
532 (Location Description_STREET, Primary Type_BATTERY) (Season_summer) 0.033025 0.277510 0.010752 0.325579 1.173216 0.001587 1.071275 2
378 (Primary Type_NARCOTICS, Season_fall) (Day period_night) 0.027140 0.414109 0.013105 0.482862 1.166026 0.001866 1.132949 2
607 (Primary Type_CRIMINAL DAMAGE, Day period_day) (Location Description_STREET) 0.034699 0.266018 0.010762 0.310140 1.165858 0.001531 1.063957 2
600 (Primary Type_NARCOTICS, Location Description_SIDEWALK) (Day period_day) 0.034590 0.393950 0.015880 0.459101 1.165379 0.002254 1.120449 2
558 (Day period_day, Season_winter) (Primary Type_NARCOTICS) 0.090260 0.109059 0.011469 0.127066 1.165110 0.001625 1.020628 2
570 (Location Description_STREET, Season_winter) (Day period_night) 0.056787 0.414109 0.027387 0.482274 1.164608 0.003871 1.131664 2
571 (Season_winter, Day period_night) (Location Description_STREET) 0.088656 0.266018 0.027387 0.308911 1.161239 0.003803 1.062065 2
500 (Location Description_SIDEWALK, Day period_night) (Season_summer) 0.048317 0.277510 0.015548 0.321801 1.159603 0.002140 1.065307 2
622 (Primary Type_THEFT, Day period_morning) (Location Description_STREET) 0.038639 0.266018 0.011893 0.307790 1.157026 0.001614 1.060346 2
440 (Season_spring, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.028984 0.414109 0.013872 0.478615 1.155771 0.001870 1.123721 2
348 (Season_fall, Primary Type_THEFT) (Day period_day) 0.054689 0.393950 0.024867 0.454699 1.154206 0.003322 1.111405 2
358 (Season_fall, Location Description_SIDEWALK) (Day period_night) 0.026458 0.414109 0.012611 0.476651 1.151028 0.001655 1.119504 2
512 (Primary Type_BATTERY, Day period_night) (Season_summer) 0.078525 0.277510 0.024951 0.317745 1.144988 0.003159 1.058974 2
426 (Season_spring, Location Description_SIDEWALK) (Day period_night) 0.024260 0.414109 0.011500 0.474021 1.144677 0.001453 1.113906 2
674 (Location Description_STREET, Primary Type_THEFT) (Day period_night) 0.057853 0.414109 0.027351 0.472767 1.141650 0.003394 1.111258 2
379 (Season_fall, Day period_night) (Primary Type_NARCOTICS) 0.105330 0.109059 0.013105 0.124418 1.140830 0.001618 1.017541 2
621 (Location Description_STREET, Day period_morning) (Primary Type_THEFT) 0.050189 0.207835 0.011893 0.236955 1.140113 0.001462 1.038163 2
511 (Season_summer, Primary Type_BATTERY) (Day period_night) 0.052962 0.414109 0.024951 0.471106 1.137638 0.003019 1.107767 2
517 (Season_summer, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.119740 0.115668 0.015748 0.131517 1.137022 0.001898 1.018249 2
427 (Season_spring, Day period_night) (Location Description_SIDEWALK) 0.100382 0.100762 0.011500 0.114560 1.136929 0.001385 1.015582 2
510 (Season_summer, Day period_night) (Primary Type_BATTERY) 0.119740 0.183289 0.024951 0.208374 1.136866 0.003004 1.031689 2
479 (Season_summer, Day period_morning) (Location Description_RESIDENCE) 0.053887 0.168030 0.010257 0.190345 1.132801 0.001202 1.027561 2
472 (Primary Type_THEFT, Season_summer) (Day period_day) 0.059824 0.393950 0.026683 0.446020 1.132174 0.003115 1.093992 2
574 (Primary Type_CRIMINAL DAMAGE, Season_winter) (Day period_night) 0.023972 0.414109 0.011202 0.467284 1.128408 0.001275 1.099818 2
564 (Day period_day, Season_winter) (Primary Type_THEFT) 0.090260 0.207835 0.021120 0.233985 1.125823 0.002360 1.034138 2
416 (Season_spring, Day period_day) (Primary Type_NARCOTICS) 0.098524 0.109059 0.012075 0.122557 1.123761 0.001330 1.015382 2
391 (Season_fall, Location Description_STREET) (Primary Type_THEFT) 0.069081 0.207835 0.016078 0.232742 1.119839 0.001721 1.032462 2
556 (Primary Type_NARCOTICS, Day period_day) (Season_winter) 0.046908 0.221053 0.011469 0.244499 1.106066 0.001100 1.031034 2
344 (Primary Type_NARCOTICS, Season_fall) (Day period_day) 0.027140 0.393950 0.011819 0.435473 1.105402 0.001127 1.073554 2
390 (Season_fall, Primary Type_THEFT) (Location Description_STREET) 0.054689 0.266018 0.016078 0.293987 1.105137 0.001530 1.039615 2
557 (Primary Type_NARCOTICS, Season_winter) (Day period_day) 0.026374 0.393950 0.011469 0.434863 1.103853 0.001079 1.072394 2
627 (Location Description_APARTMENT, Primary Type_BATTERY) (Day period_night) 0.035678 0.414109 0.016306 0.457028 1.103643 0.001531 1.079045 2
488 (Season_summer, Day period_morning) (Primary Type_BATTERY) 0.053887 0.183289 0.010876 0.201834 1.101180 0.000999 1.023235 2
414 (Primary Type_NARCOTICS, Season_spring) (Day period_day) 0.027871 0.393950 0.012075 0.433241 1.099737 0.001095 1.069327 2
411 (Day period_day, Primary Type_BATTERY) (Season_spring) 0.067043 0.245450 0.018080 0.269679 1.098713 0.001624 1.033176 2
606 (Location Description_STREET, Day period_day) (Primary Type_CRIMINAL DAMAGE) 0.084811 0.115668 0.010762 0.126888 1.097000 0.000952 1.012850 2
421 (Season_spring, Day period_day) (Primary Type_THEFT) 0.098524 0.207835 0.022395 0.227303 1.093672 0.001918 1.025195 2
464 (Season_summer, Day period_day) (Location Description_SIDEWALK) 0.103882 0.100762 0.011441 0.110133 1.093001 0.000973 1.010531 2
575 (Season_winter, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.088656 0.115668 0.011202 0.126351 1.092363 0.000947 1.012229 2
579 (Primary Type_NARCOTICS, Day period_night) (Season_winter) 0.053776 0.221053 0.012947 0.240757 1.089137 0.001060 1.025952 2
654 (Location Description_STREET, Primary Type_BATTERY) (Day period_night) 0.033025 0.414109 0.014858 0.449889 1.086404 0.001182 1.065043 2
542 (Location Description_STREET, Season_summer) (Primary Type_THEFT) 0.075542 0.207835 0.017055 0.225770 1.086294 0.001355 1.023165 2
392 (Location Description_STREET, Primary Type_THEFT) (Season_fall) 0.057853 0.255987 0.016078 0.277913 1.085650 0.001268 1.030364 2
590 (Location Description_RESIDENCE, Day period_day) (Primary Type_BATTERY) 0.067358 0.183289 0.013297 0.197411 1.077051 0.000951 1.017596 2
546 (Location Description_RESIDENCE, Day period_day) (Season_winter) 0.067358 0.221053 0.016017 0.237792 1.075725 0.001128 1.021962 2
523 (Season_summer, Day period_night) (Primary Type_NARCOTICS) 0.119740 0.109059 0.014017 0.117059 1.073346 0.000958 1.009060 2
541 (Primary Type_THEFT, Season_summer) (Location Description_STREET) 0.059824 0.266018 0.017055 0.285089 1.071690 0.001141 1.026676 2
620 (Location Description_STREET, Primary Type_THEFT) (Day period_morning) 0.057853 0.191942 0.011893 0.205567 1.070987 0.000788 1.017151 2
345 (Season_fall, Day period_day) (Primary Type_NARCOTICS) 0.101283 0.109059 0.011819 0.116691 1.069972 0.000773 1.008639 2
489 (Season_summer, Primary Type_BATTERY) (Day period_morning) 0.052962 0.191942 0.010876 0.205358 1.069896 0.000711 1.016883 2
526 (Primary Type_THEFT, Day period_night) (Season_summer) 0.074132 0.277510 0.021918 0.295660 1.065402 0.001345 1.025768 2
540 (Location Description_STREET, Primary Type_THEFT) (Season_summer) 0.057853 0.277510 0.017055 0.294805 1.062323 0.001001 1.024525 2
568 (Location Description_RESIDENCE, Day period_night) (Season_winter) 0.062321 0.221053 0.014631 0.234775 1.062075 0.000855 1.017932 2
468 (Primary Type_NARCOTICS, Season_summer) (Day period_day) 0.027675 0.393950 0.011546 0.417190 1.058992 0.000643 1.039876 2
483 (Location Description_STREET, Day period_morning) (Season_summer) 0.050189 0.277510 0.014749 0.293874 1.058968 0.000821 1.023175 2
434 (Season_spring, Day period_night) (Primary Type_BATTERY) 0.100382 0.183289 0.019474 0.193999 1.058437 0.001075 1.013289 2
548 (Day period_day, Season_winter) (Location Description_RESIDENCE) 0.090260 0.168030 0.016017 0.177456 1.056094 0.000851 1.011459 2
618 (Location Description_STREET, Day period_day) (Primary Type_THEFT) 0.084811 0.207835 0.018609 0.219418 1.055732 0.000982 1.014839 2
402 (Season_spring, Location Description_SIDEWALK) (Day period_day) 0.024260 0.393950 0.010076 0.415317 1.054238 0.000518 1.036545 2
415 (Primary Type_NARCOTICS, Day period_day) (Season_spring) 0.046908 0.245450 0.012075 0.257413 1.048741 0.000561 1.016111 2
494 (Primary Type_THEFT, Day period_morning) (Season_summer) 0.038639 0.277510 0.011224 0.290475 1.046720 0.000501 1.018273 2
386 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_fall) 0.041876 0.255987 0.011210 0.267687 1.045705 0.000490 1.015977 2
505 (Location Description_STREET, Day period_night) (Season_summer) 0.131018 0.277510 0.037922 0.289440 1.042989 0.001563 1.016789 2
490 (Day period_morning, Primary Type_BATTERY) (Season_summer) 0.037721 0.277510 0.010876 0.288336 1.039011 0.000408 1.015212 2
447 (Primary Type_NARCOTICS, Day period_night) (Season_spring) 0.053776 0.245450 0.013707 0.254897 1.038489 0.000508 1.012679 2
337 (Season_fall, Day period_day) (Location Description_SIDEWALK) 0.101283 0.100762 0.010578 0.104445 1.036545 0.000373 1.004112 2
374 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_fall) 0.055561 0.255987 0.014739 0.265276 1.036285 0.000516 1.012642 2
382 (Primary Type_THEFT, Day period_night) (Season_fall) 0.074132 0.255987 0.019650 0.265070 1.035480 0.000673 1.012358 2
595 (Location Description_RESIDENCE, Primary Type_OTHER OFFENSE) (Day period_day) 0.028932 0.393950 0.011777 0.407067 1.033297 0.000380 1.022123 2
484 (Season_summer, Day period_morning) (Location Description_STREET) 0.053887 0.266018 0.014749 0.273707 1.028904 0.000414 1.010587 2
356 (Primary Type_THEFT, Day period_morning) (Season_fall) 0.038639 0.255987 0.010172 0.263261 1.028412 0.000281 1.009872 2
547 (Location Description_RESIDENCE, Season_winter) (Day period_day) 0.039568 0.393950 0.016017 0.404801 1.027545 0.000429 1.018231 2
338 (Location Description_SIDEWALK, Day period_day) (Season_fall) 0.040259 0.255987 0.010578 0.262764 1.026471 0.000273 1.009191 2
465 (Location Description_SIDEWALK, Day period_day) (Season_summer) 0.040259 0.277510 0.011441 0.284186 1.024056 0.000269 1.009326 2
461 (Season_summer, Day period_day) (Location Description_RESIDENCE) 0.103882 0.168030 0.017866 0.171983 1.023526 0.000411 1.004774 2
350 (Primary Type_THEFT, Day period_day) (Season_fall) 0.095064 0.255987 0.024867 0.261583 1.021859 0.000532 1.007578 2
633 (Location Description_RESIDENCE, Primary Type_BATTERY) (Day period_night) 0.038376 0.414109 0.016238 0.423129 1.021781 0.000346 1.015636 2
518 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_summer) 0.055561 0.277510 0.015748 0.283436 1.021353 0.000329 1.008270 2
334 (Season_fall, Location Description_RESIDENCE) (Day period_day) 0.041846 0.393950 0.016810 0.401698 1.019668 0.000324 1.012950 2
404 (Location Description_SIDEWALK, Day period_day) (Season_spring) 0.040259 0.245450 0.010076 0.250271 1.019644 0.000194 1.006431 2
360 (Location Description_SIDEWALK, Day period_night) (Season_fall) 0.048317 0.255987 0.012611 0.261011 1.019623 0.000243 1.006798 2
342 (Location Description_STREET, Day period_day) (Season_fall) 0.084811 0.255987 0.022134 0.260976 1.019487 0.000423 1.006750 2
469 (Season_summer, Day period_day) (Primary Type_NARCOTICS) 0.103882 0.109059 0.011546 0.111141 1.019084 0.000216 1.002342 2
370 (Season_fall, Primary Type_BATTERY) (Day period_night) 0.045165 0.414109 0.019060 0.422011 1.019082 0.000357 1.013672 2
396 (Season_spring, Location Description_RESIDENCE) (Day period_day) 0.041540 0.393950 0.016665 0.401179 1.018350 0.000300 1.012072 2
442 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_spring) 0.055561 0.245450 0.013872 0.249674 1.017210 0.000235 1.005630 2
482 (Location Description_STREET, Season_summer) (Day period_morning) 0.075542 0.191942 0.014749 0.195245 1.017209 0.000250 1.004105 2
535 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_summer) 0.041876 0.277510 0.011796 0.281698 1.015093 0.000175 1.005831 2
403 (Season_spring, Day period_day) (Location Description_SIDEWALK) 0.098524 0.100762 0.010076 0.102265 1.014915 0.000148 1.001674 2
336 (Season_fall, Location Description_SIDEWALK) (Day period_day) 0.026458 0.393950 0.010578 0.399822 1.014907 0.000155 1.009785 2
454 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_spring) 0.041876 0.245450 0.010424 0.248929 1.014175 0.000146 1.004632 2
366 (Location Description_STREET, Day period_night) (Season_fall) 0.131018 0.255987 0.033987 0.259408 1.013362 0.000448 1.004619 2
473 (Primary Type_THEFT, Day period_day) (Season_summer) 0.095064 0.277510 0.026683 0.280681 1.011427 0.000301 1.004408 2
554 (Day period_day, Primary Type_BATTERY) (Season_winter) 0.067043 0.221053 0.014985 0.223513 1.011129 0.000165 1.003168 2
436 (Primary Type_BATTERY, Day period_night) (Season_spring) 0.078525 0.245450 0.019474 0.247999 1.010387 0.000200 1.003390 2
354 (Location Description_STREET, Day period_morning) (Season_fall) 0.050189 0.255987 0.012960 0.258221 1.008725 0.000112 1.003011 2
398 (Location Description_RESIDENCE, Day period_day) (Season_spring) 0.067358 0.245450 0.016665 0.247410 1.007987 0.000132 1.002605 2
397 (Season_spring, Day period_day) (Location Description_RESIDENCE) 0.098524 0.168030 0.016665 0.169147 1.006643 0.000110 1.001344 2
408 (Location Description_STREET, Day period_day) (Season_spring) 0.084811 0.245450 0.020944 0.246948 1.006106 0.000127 1.001990 2
460 (Season_summer, Location Description_RESIDENCE) (Day period_day) 0.045076 0.393950 0.017866 0.396355 1.006106 0.000108 1.003985 2
552 (Location Description_STREET, Day period_day) (Season_winter) 0.084811 0.221053 0.018862 0.222402 1.006100 0.000114 1.001734 2
424 (Location Description_RESIDENCE, Day period_night) (Season_spring) 0.062321 0.245450 0.015382 0.246821 1.005585 0.000085 1.001820 2
562 (Primary Type_THEFT, Day period_day) (Season_winter) 0.095064 0.221053 0.021120 0.222160 1.005009 0.000105 1.001424 2
584 (Location Description_STREET, Season_winter) (Primary Type_THEFT) 0.056787 0.207835 0.011857 0.208802 1.004652 0.000055 1.001222 2
675 (Location Description_STREET, Day period_night) (Primary Type_THEFT) 0.131018 0.207835 0.027351 0.208756 1.004433 0.000121 1.001164 2
495 (Season_summer, Day period_morning) (Primary Type_THEFT) 0.053887 0.207835 0.011224 0.208279 1.002137 0.000024 1.000561 2
458 (Season_spring, Primary Type_THEFT) (Location Description_STREET) 0.048253 0.266018 0.012862 0.266557 1.002024 0.000026 1.000734 2
410 (Season_spring, Day period_day) (Primary Type_BATTERY) 0.098524 0.183289 0.018080 0.183510 1.001208 0.000022 1.000271 2
435 (Season_spring, Primary Type_BATTERY) (Day period_night) 0.046985 0.414109 0.019474 0.414476 1.000888 0.000017 1.000628 2
In [0]:
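pd.set_option('display.max_colwidth', None)  # show full cell contents; -1 is deprecated in recent pandas
pd.set_option('display.max_rows', 500)
# rules with at least two antecedents, sorted by descending lift
new_rules[new_rules['antecedent_len'] >= 2].sort_values('lift', ascending=False)[['antecedents', 'consequents', 'support', 'confidence', 'lift']]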
Out[0]:
antecedents consequents support confidence lift
602 (Location Description_SIDEWALK, Day period_day) (Primary Type_NARCOTICS) 0.015880 0.394458 3.616906
601 (Primary Type_NARCOTICS, Day period_day) (Location Description_SIDEWALK) 0.015880 0.338541 3.359794
650 (Location Description_SIDEWALK, Day period_night) (Primary Type_NARCOTICS) 0.016949 0.350787 3.216473
664 (Primary Type_MOTOR VEHICLE THEFT, Day period_night) (Location Description_STREET) 0.018679 0.834036 3.135259
649 (Primary Type_NARCOTICS, Day period_night) (Location Description_SIDEWALK) 0.016949 0.315178 3.127934
663 (Location Description_STREET, Day period_night) (Primary Type_MOTOR VEHICLE THEFT) 0.018679 0.142571 3.025578
638 (Location Description_RESIDENCE, Day period_night) (Primary Type_OTHER OFFENSE) 0.010961 0.175883 2.840722
594 (Location Description_RESIDENCE, Day period_day) (Primary Type_OTHER OFFENSE) 0.011777 0.174843 2.823923
639 (Primary Type_OTHER OFFENSE, Day period_night) (Location Description_RESIDENCE) 0.010961 0.470158 2.798053
611 (Primary Type_MOTOR VEHICLE THEFT, Day period_day) (Location Description_STREET) 0.010135 0.735363 2.764333
596 (Day period_day, Primary Type_OTHER OFFENSE) (Location Description_RESIDENCE) 0.011777 0.452577 2.693421
610 (Location Description_STREET, Day period_day) (Primary Type_MOTOR VEHICLE THEFT) 0.010135 0.119495 2.535871
626 (Location Description_APARTMENT, Day period_night) (Primary Type_BATTERY) 0.016306 0.407795 2.224879
628 (Primary Type_BATTERY, Day period_night) (Location Description_APARTMENT) 0.016306 0.207655 2.043080
644 (Primary Type_BATTERY, Day period_night) (Location Description_SIDEWALK) 0.013512 0.172070 1.707684
615 (Location Description_STREET, Day period_day) (Primary Type_NARCOTICS) 0.015258 0.179904 1.649597
586 (Location Description_APARTMENT, Day period_day) (Primary Type_BATTERY) 0.010506 0.288982 1.576654
587 (Day period_day, Primary Type_BATTERY) (Location Description_APARTMENT) 0.010506 0.156701 1.541752
658 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Location Description_STREET) 0.022555 0.405947 1.526011
642 (Location Description_SIDEWALK, Day period_night) (Primary Type_BATTERY) 0.013512 0.279649 1.525730
657 (Location Description_STREET, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.022555 0.172150 1.488313
670 (Location Description_STREET, Day period_night) (Primary Type_NARCOTICS) 0.020982 0.160149 1.468454
669 (Primary Type_NARCOTICS, Day period_night) (Location Description_STREET) 0.020982 0.390182 1.466749
632 (Location Description_RESIDENCE, Day period_night) (Primary Type_BATTERY) 0.016238 0.260556 1.421562
385 (Season_fall, Location Description_STREET) (Primary Type_CRIMINAL DAMAGE) 0.011210 0.162268 1.402878
452 (Season_spring, Location Description_STREET) (Primary Type_CRIMINAL DAMAGE) 0.010424 0.161343 1.394881
384 (Season_fall, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.011210 0.369662 1.389613
676 (Primary Type_THEFT, Day period_night) (Location Description_STREET) 0.027351 0.368947 1.386922
536 (Season_summer, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.011796 0.364219 1.369149
453 (Season_spring, Primary Type_CRIMINAL DAMAGE) (Location Description_STREET) 0.010424 0.359651 1.351979
534 (Location Description_STREET, Season_summer) (Primary Type_CRIMINAL DAMAGE) 0.011796 0.156154 1.350025
580 (Season_winter, Day period_night) (Primary Type_NARCOTICS) 0.012947 0.146034 1.339034
656 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.022555 0.538612 1.300652
668 (Primary Type_NARCOTICS, Location Description_STREET) (Day period_night) 0.020982 0.537130 1.297075
499 (Season_summer, Day period_night) (Location Description_SIDEWALK) 0.015548 0.129851 1.288687
528 (Season_summer, Location Description_RESIDENCE) (Primary Type_BATTERY) 0.010577 0.234642 1.280178
448 (Season_spring, Day period_night) (Primary Type_NARCOTICS) 0.013707 0.136551 1.252078
474 (Season_summer, Day period_day) (Primary Type_THEFT) 0.026683 0.256855 1.235862
634 (Primary Type_BATTERY, Day period_night) (Location Description_RESIDENCE) 0.016238 0.206788 1.230661
522 (Primary Type_NARCOTICS, Season_summer) (Day period_night) 0.014017 0.506480 1.223061
614 (Primary Type_NARCOTICS, Day period_day) (Location Description_STREET) 0.015258 0.325273 1.222745
365 (Season_fall, Day period_night) (Location Description_STREET) 0.033987 0.322673 1.212974
504 (Location Description_STREET, Season_summer) (Day period_night) 0.037922 0.501992 1.212223
373 (Season_fall, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.014739 0.139931 1.209769
498 (Season_summer, Location Description_SIDEWALK) (Day period_night) 0.015548 0.499143 1.205343
662 (Location Description_STREET, Primary Type_MOTOR VEHICLE THEFT) (Day period_night) 0.018679 0.498695 1.204261
441 (Season_spring, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.013872 0.138193 1.194737
506 (Season_summer, Day period_night) (Location Description_STREET) 0.037922 0.316699 1.190517
563 (Primary Type_THEFT, Season_winter) (Day period_day) 0.021120 0.468608 1.189512
643 (Location Description_SIDEWALK, Primary Type_BATTERY) (Day period_night) 0.013512 0.492476 1.189242
529 (Season_summer, Primary Type_BATTERY) (Location Description_RESIDENCE) 0.010577 0.199702 1.188488
359 (Season_fall, Day period_night) (Location Description_SIDEWALK) 0.012611 0.119731 1.188252
364 (Season_fall, Location Description_STREET) (Day period_night) 0.033987 0.491991 1.188072
431 (Season_spring, Day period_night) (Location Description_STREET) 0.031722 0.316013 1.187936
446 (Primary Type_NARCOTICS, Season_spring) (Day period_night) 0.013707 0.491815 1.187646
430 (Season_spring, Location Description_STREET) (Day period_night) 0.031722 0.490991 1.185658
478 (Season_summer, Location Description_RESIDENCE) (Day period_morning) 0.010257 0.227552 1.185529
578 (Primary Type_NARCOTICS, Season_winter) (Day period_night) 0.012947 0.490898 1.185434
648 (Primary Type_NARCOTICS, Location Description_SIDEWALK) (Day period_night) 0.016949 0.489994 1.183250
349 (Season_fall, Day period_day) (Primary Type_THEFT) 0.024867 0.245522 1.181329
591 (Day period_day, Primary Type_BATTERY) (Location Description_RESIDENCE) 0.013297 0.198337 1.180366
420 (Season_spring, Primary Type_THEFT) (Day period_day) 0.022395 0.464115 1.178107
516 (Season_summer, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.015748 0.486227 1.174153
372 (Season_fall, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.014739 0.486051 1.173728
532 (Location Description_STREET, Primary Type_BATTERY) (Season_summer) 0.010752 0.325579 1.173216
378 (Primary Type_NARCOTICS, Season_fall) (Day period_night) 0.013105 0.482862 1.166026
607 (Primary Type_CRIMINAL DAMAGE, Day period_day) (Location Description_STREET) 0.010762 0.310140 1.165858
600 (Primary Type_NARCOTICS, Location Description_SIDEWALK) (Day period_day) 0.015880 0.459101 1.165379
558 (Day period_day, Season_winter) (Primary Type_NARCOTICS) 0.011469 0.127066 1.165110
570 (Location Description_STREET, Season_winter) (Day period_night) 0.027387 0.482274 1.164608
571 (Season_winter, Day period_night) (Location Description_STREET) 0.027387 0.308911 1.161239
500 (Location Description_SIDEWALK, Day period_night) (Season_summer) 0.015548 0.321801 1.159603
622 (Primary Type_THEFT, Day period_morning) (Location Description_STREET) 0.011893 0.307790 1.157026
440 (Season_spring, Primary Type_CRIMINAL DAMAGE) (Day period_night) 0.013872 0.478615 1.155771
348 (Season_fall, Primary Type_THEFT) (Day period_day) 0.024867 0.454699 1.154206
358 (Season_fall, Location Description_SIDEWALK) (Day period_night) 0.012611 0.476651 1.151028
512 (Primary Type_BATTERY, Day period_night) (Season_summer) 0.024951 0.317745 1.144988
426 (Season_spring, Location Description_SIDEWALK) (Day period_night) 0.011500 0.474021 1.144677
674 (Location Description_STREET, Primary Type_THEFT) (Day period_night) 0.027351 0.472767 1.141650
379 (Season_fall, Day period_night) (Primary Type_NARCOTICS) 0.013105 0.124418 1.140830
621 (Location Description_STREET, Day period_morning) (Primary Type_THEFT) 0.011893 0.236955 1.140113
511 (Season_summer, Primary Type_BATTERY) (Day period_night) 0.024951 0.471106 1.137638
517 (Season_summer, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.015748 0.131517 1.137022
427 (Season_spring, Day period_night) (Location Description_SIDEWALK) 0.011500 0.114560 1.136929
510 (Season_summer, Day period_night) (Primary Type_BATTERY) 0.024951 0.208374 1.136866
479 (Season_summer, Day period_morning) (Location Description_RESIDENCE) 0.010257 0.190345 1.132801
472 (Primary Type_THEFT, Season_summer) (Day period_day) 0.026683 0.446020 1.132174
574 (Primary Type_CRIMINAL DAMAGE, Season_winter) (Day period_night) 0.011202 0.467284 1.128408
564 (Day period_day, Season_winter) (Primary Type_THEFT) 0.021120 0.233985 1.125823
416 (Season_spring, Day period_day) (Primary Type_NARCOTICS) 0.012075 0.122557 1.123761
391 (Season_fall, Location Description_STREET) (Primary Type_THEFT) 0.016078 0.232742 1.119839
556 (Primary Type_NARCOTICS, Day period_day) (Season_winter) 0.011469 0.244499 1.106066
344 (Primary Type_NARCOTICS, Season_fall) (Day period_day) 0.011819 0.435473 1.105402
390 (Season_fall, Primary Type_THEFT) (Location Description_STREET) 0.016078 0.293987 1.105137
557 (Primary Type_NARCOTICS, Season_winter) (Day period_day) 0.011469 0.434863 1.103853
627 (Location Description_APARTMENT, Primary Type_BATTERY) (Day period_night) 0.016306 0.457028 1.103643
488 (Season_summer, Day period_morning) (Primary Type_BATTERY) 0.010876 0.201834 1.101180
414 (Primary Type_NARCOTICS, Season_spring) (Day period_day) 0.012075 0.433241 1.099737
411 (Day period_day, Primary Type_BATTERY) (Season_spring) 0.018080 0.269679 1.098713
606 (Location Description_STREET, Day period_day) (Primary Type_CRIMINAL DAMAGE) 0.010762 0.126888 1.097000
421 (Season_spring, Day period_day) (Primary Type_THEFT) 0.022395 0.227303 1.093672
464 (Season_summer, Day period_day) (Location Description_SIDEWALK) 0.011441 0.110133 1.093001
575 (Season_winter, Day period_night) (Primary Type_CRIMINAL DAMAGE) 0.011202 0.126351 1.092363
579 (Primary Type_NARCOTICS, Day period_night) (Season_winter) 0.012947 0.240757 1.089137
654 (Location Description_STREET, Primary Type_BATTERY) (Day period_night) 0.014858 0.449889 1.086404
542 (Location Description_STREET, Season_summer) (Primary Type_THEFT) 0.017055 0.225770 1.086294
392 (Location Description_STREET, Primary Type_THEFT) (Season_fall) 0.016078 0.277913 1.085650
590 (Location Description_RESIDENCE, Day period_day) (Primary Type_BATTERY) 0.013297 0.197411 1.077051
546 (Location Description_RESIDENCE, Day period_day) (Season_winter) 0.016017 0.237792 1.075725
523 (Season_summer, Day period_night) (Primary Type_NARCOTICS) 0.014017 0.117059 1.073346
541 (Primary Type_THEFT, Season_summer) (Location Description_STREET) 0.017055 0.285089 1.071690
620 (Location Description_STREET, Primary Type_THEFT) (Day period_morning) 0.011893 0.205567 1.070987
345 (Season_fall, Day period_day) (Primary Type_NARCOTICS) 0.011819 0.116691 1.069972
489 (Season_summer, Primary Type_BATTERY) (Day period_morning) 0.010876 0.205358 1.069896
526 (Primary Type_THEFT, Day period_night) (Season_summer) 0.021918 0.295660 1.065402
540 (Location Description_STREET, Primary Type_THEFT) (Season_summer) 0.017055 0.294805 1.062323
568 (Location Description_RESIDENCE, Day period_night) (Season_winter) 0.014631 0.234775 1.062075
468 (Primary Type_NARCOTICS, Season_summer) (Day period_day) 0.011546 0.417190 1.058992
483 (Location Description_STREET, Day period_morning) (Season_summer) 0.014749 0.293874 1.058968
434 (Season_spring, Day period_night) (Primary Type_BATTERY) 0.019474 0.193999 1.058437
548 (Day period_day, Season_winter) (Location Description_RESIDENCE) 0.016017 0.177456 1.056094
618 (Location Description_STREET, Day period_day) (Primary Type_THEFT) 0.018609 0.219418 1.055732
402 (Season_spring, Location Description_SIDEWALK) (Day period_day) 0.010076 0.415317 1.054238
415 (Primary Type_NARCOTICS, Day period_day) (Season_spring) 0.012075 0.257413 1.048741
494 (Primary Type_THEFT, Day period_morning) (Season_summer) 0.011224 0.290475 1.046720
386 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_fall) 0.011210 0.267687 1.045705
505 (Location Description_STREET, Day period_night) (Season_summer) 0.037922 0.289440 1.042989
490 (Day period_morning, Primary Type_BATTERY) (Season_summer) 0.010876 0.288336 1.039011
447 (Primary Type_NARCOTICS, Day period_night) (Season_spring) 0.013707 0.254897 1.038489
337 (Season_fall, Day period_day) (Location Description_SIDEWALK) 0.010578 0.104445 1.036545
374 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_fall) 0.014739 0.265276 1.036285
382 (Primary Type_THEFT, Day period_night) (Season_fall) 0.019650 0.265070 1.035480
595 (Location Description_RESIDENCE, Primary Type_OTHER OFFENSE) (Day period_day) 0.011777 0.407067 1.033297
484 (Season_summer, Day period_morning) (Location Description_STREET) 0.014749 0.273707 1.028904
356 (Primary Type_THEFT, Day period_morning) (Season_fall) 0.010172 0.263261 1.028412
547 (Location Description_RESIDENCE, Season_winter) (Day period_day) 0.016017 0.404801 1.027545
338 (Location Description_SIDEWALK, Day period_day) (Season_fall) 0.010578 0.262764 1.026471
465 (Location Description_SIDEWALK, Day period_day) (Season_summer) 0.011441 0.284186 1.024056
461 (Season_summer, Day period_day) (Location Description_RESIDENCE) 0.017866 0.171983 1.023526
350 (Primary Type_THEFT, Day period_day) (Season_fall) 0.024867 0.261583 1.021859
633 (Location Description_RESIDENCE, Primary Type_BATTERY) (Day period_night) 0.016238 0.423129 1.021781
518 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_summer) 0.015748 0.283436 1.021353
334 (Season_fall, Location Description_RESIDENCE) (Day period_day) 0.016810 0.401698 1.019668
404 (Location Description_SIDEWALK, Day period_day) (Season_spring) 0.010076 0.250271 1.019644
360 (Location Description_SIDEWALK, Day period_night) (Season_fall) 0.012611 0.261011 1.019623
342 (Location Description_STREET, Day period_day) (Season_fall) 0.022134 0.260976 1.019487
469 (Season_summer, Day period_day) (Primary Type_NARCOTICS) 0.011546 0.111141 1.019084
370 (Season_fall, Primary Type_BATTERY) (Day period_night) 0.019060 0.422011 1.019082
396 (Season_spring, Location Description_RESIDENCE) (Day period_day) 0.016665 0.401179 1.018350
442 (Primary Type_CRIMINAL DAMAGE, Day period_night) (Season_spring) 0.013872 0.249674 1.017210
482 (Location Description_STREET, Season_summer) (Day period_morning) 0.014749 0.195245 1.017209
535 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_summer) 0.011796 0.281698 1.015093
403 (Season_spring, Day period_day) (Location Description_SIDEWALK) 0.010076 0.102265 1.014915
336 (Season_fall, Location Description_SIDEWALK) (Day period_day) 0.010578 0.399822 1.014907
454 (Location Description_STREET, Primary Type_CRIMINAL DAMAGE) (Season_spring) 0.010424 0.248929 1.014175
366 (Location Description_STREET, Day period_night) (Season_fall) 0.033987 0.259408 1.013362
473 (Primary Type_THEFT, Day period_day) (Season_summer) 0.026683 0.280681 1.011427
554 (Day period_day, Primary Type_BATTERY) (Season_winter) 0.014985 0.223513 1.011129
436 (Primary Type_BATTERY, Day period_night) (Season_spring) 0.019474 0.247999 1.010387
354 (Location Description_STREET, Day period_morning) (Season_fall) 0.012960 0.258221 1.008725
398 (Location Description_RESIDENCE, Day period_day) (Season_spring) 0.016665 0.247410 1.007987
397 (Season_spring, Day period_day) (Location Description_RESIDENCE) 0.016665 0.169147 1.006643
408 (Location Description_STREET, Day period_day) (Season_spring) 0.020944 0.246948 1.006106
460 (Season_summer, Location Description_RESIDENCE) (Day period_day) 0.017866 0.396355 1.006106
552 (Location Description_STREET, Day period_day) (Season_winter) 0.018862 0.222402 1.006100
424 (Location Description_RESIDENCE, Day period_night) (Season_spring) 0.015382 0.246821 1.005585
562 (Primary Type_THEFT, Day period_day) (Season_winter) 0.021120 0.222160 1.005009
584 (Location Description_STREET, Season_winter) (Primary Type_THEFT) 0.011857 0.208802 1.004652
675 (Location Description_STREET, Day period_night) (Primary Type_THEFT) 0.027351 0.208756 1.004433
495 (Season_summer, Day period_morning) (Primary Type_THEFT) 0.011224 0.208279 1.002137
458 (Season_spring, Primary Type_THEFT) (Location Description_STREET) 0.012862 0.266557 1.002024
410 (Season_spring, Day period_day) (Primary Type_BATTERY) 0.018080 0.183510 1.001208
435 (Season_spring, Primary Type_BATTERY) (Day period_night) 0.019474 0.414476 1.000888
In [0]:
# representative two-antecedent rules, picked by index label
new_rules.loc[[602, 664, 626, 658, 504, 420], ['antecedents', 'consequents', 'support', 'confidence', 'lift']]
Out[0]:
     antecedents                                           consequents                    support   confidence  lift
602  (Location Description_SIDEWALK, Day period_day)       (Primary Type_NARCOTICS)       0.015880  0.394458    3.616906
664  (Primary Type_MOTOR VEHICLE THEFT, Day period_night)  (Location Description_STREET)  0.018679  0.834036    3.135259
626  (Location Description_APARTMENT, Day period_night)    (Primary Type_BATTERY)         0.016306  0.407795    2.224879
658  (Primary Type_CRIMINAL DAMAGE, Day period_night)      (Location Description_STREET)  0.022555  0.405947    1.526011
504  (Location Description_STREET, Season_summer)          (Day period_night)             0.037922  0.501992    1.212223
420  (Season_spring, Primary Type_THEFT)                   (Day period_day)               0.022395  0.464115    1.178107
In [0]:
# top 20 rules by lift; includes examples of rules with a single antecedent
new_rules.sort_values('lift', ascending=False).head(20)
Out[0]:
     antecedents                                           consequents                                           antecedent support  consequent support  support   confidence  lift      leverage  conviction  antecedent_len
317  (Primary Type_BURGLARY)                               (Location Description_RESIDENCE-GARAGE)               0.058909            0.019942            0.010350  0.175689    8.809850  0.009175  1.188941    1
316  (Location Description_RESIDENCE-GARAGE)               (Primary Type_BURGLARY)                               0.019942            0.058909            0.010350  0.518978    8.809850  0.009175  1.956441    1
300  (Primary Type_THEFT)                                  (Location Description_DEPARTMENT STORE)               0.207835            0.012128            0.010155  0.048859    4.028656  0.007634  1.038618    1
301  (Location Description_DEPARTMENT STORE)               (Primary Type_THEFT)                                  0.012128            0.207835            0.010155  0.837296    4.028656  0.007634  4.868747    1
602  (Location Description_SIDEWALK, Day period_day)       (Primary Type_NARCOTICS)                              0.040259            0.109059            0.015880  0.394458    3.616906  0.011490  1.471310    2
603  (Primary Type_NARCOTICS)                              (Location Description_SIDEWALK, Day period_day)       0.109059            0.040259            0.015880  0.145611    3.616906  0.011490  1.123308    1
604  (Location Description_SIDEWALK)                       (Primary Type_NARCOTICS, Day period_day)              0.100762            0.046908            0.015880  0.157602    3.359794  0.011154  1.131403    1
601  (Primary Type_NARCOTICS, Day period_day)              (Location Description_SIDEWALK)                       0.046908            0.100762            0.015880  0.338541    3.359794  0.011154  1.359475    2
650  (Location Description_SIDEWALK, Day period_night)     (Primary Type_NARCOTICS)                              0.048317            0.109059            0.016949  0.350787    3.216473  0.011680  1.372339    2
651  (Primary Type_NARCOTICS)                              (Location Description_SIDEWALK, Day period_night)     0.109059            0.048317            0.016949  0.155410    3.216473  0.011680  1.126799    1
321  (Location Description_SIDEWALK)                       (Primary Type_NARCOTICS)                              0.100762            0.109059            0.034590  0.343284    3.147674  0.023601  1.356660    1
320  (Primary Type_NARCOTICS)                              (Location Description_SIDEWALK)                       0.109059            0.100762            0.034590  0.317167    3.147674  0.023601  1.316922    1
664  (Primary Type_MOTOR VEHICLE THEFT, Day period_night)  (Location Description_STREET)                         0.022396            0.266018            0.018679  0.834036    3.135259  0.012722  4.422545    2
665  (Location Description_STREET)                         (Primary Type_MOTOR VEHICLE THEFT, Day period_night)  0.266018            0.022396            0.018679  0.070218    3.135259  0.012722  1.051433    1
649  (Primary Type_NARCOTICS, Day period_night)            (Location Description_SIDEWALK)                       0.053776            0.100762            0.016949  0.315178    3.127934  0.011530  1.313097    2
652  (Location Description_SIDEWALK)                       (Primary Type_NARCOTICS, Day period_night)            0.100762            0.053776            0.016949  0.168207    3.127934  0.011530  1.137572    1
322  (Location Description_SIDEWALK)                       (Primary Type_ROBBERY)                                0.100762            0.037931            0.011749  0.116603    3.074075  0.007927  1.089056    1
323  (Primary Type_ROBBERY)                                (Location Description_SIDEWALK)                       0.037931            0.100762            0.011749  0.309751    3.074075  0.007927  1.302773    1
663  (Location Description_STREET, Day period_night)       (Primary Type_MOTOR VEHICLE THEFT)                    0.131018            0.047122            0.018679  0.142571    3.025578  0.012506  1.111320    2
666  (Primary Type_MOTOR VEHICLE THEFT)                    (Location Description_STREET, Day period_night)       0.047122            0.131018            0.018679  0.396405    3.025578  0.012506  1.439676    1
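
The metric columns in the output above are linked by the standard association-rule definitions, so any row can be checked by hand from its three support values. The following is a minimal sketch, not part of the original notebook: `rule_metrics` is a hypothetical helper name, and the inputs are the rounded values printed for rule 664, so the results should match the table only up to rounding.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive confidence, lift, leverage and conviction from three supports.

    Hypothetical helper for illustration; it applies the usual textbook
    definitions of the association-rule metrics.
    """
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence < 1:
        conviction = (1 - consequent_support) / (1 - confidence)
    else:
        conviction = float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rule 664: (MOTOR VEHICLE THEFT, night) -> (STREET), values copied from the table.
metrics = rule_metrics(
    antecedent_support=0.022396,
    consequent_support=0.266018,
    support=0.018679,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent, which is why the rules are ranked by this column.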

Final Conclusions

After completing this data mining project, we draw the following conclusions:

  1. DBSCAN is better suited than K-means for detecting hotspots in geolocated data. Being density-based, it does not require fixing the number of clusters in advance, it can recover clusters of arbitrary shape, and it labels isolated points as noise instead of forcing them into a cluster.
  2. The DBSCAN analysis suggests that the decline in crime in Chicago is mainly due to the reduction of crime hotspots in certain areas of the city, while hotspots persist in other areas.
  3. Association rules are feasible for this kind of data if the attributes of each crime are treated as the items of a transaction. However, they also produce many useless rules, so the rules must be filtered by an interest measure and the redundant ones removed.
  4. Although no rules were obtained that yield novel knowledge about crime behavior, the association rules served to verify the DBSCAN results: certain districts persist over the years as antecedents of certain crime types (consequents), while other districts appear as consequents in earlier years but not in more recent ones.
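
The need to filter by interest and drop redundant rules can be illustrated with the rules listed earlier: rules 602, 603, 604 and 601 all come from the same itemset and differ only in how it is split into antecedent and consequent. The sketch below shows one possible pruning strategy, which is an assumption and not necessarily the filter applied in this project: keep only rules above a lift threshold and, for each itemset, the single rule with the highest lift (ties broken by confidence). The name `prune_rules` and the `min_lift` parameter are hypothetical.

```python
# A few rules copied from the output above:
# (rule id, antecedents, consequents, confidence, lift)
rules = [
    (602, {"Location Description_SIDEWALK", "Day period_day"}, {"Primary Type_NARCOTICS"}, 0.394458, 3.616906),
    (603, {"Primary Type_NARCOTICS"}, {"Location Description_SIDEWALK", "Day period_day"}, 0.145611, 3.616906),
    (604, {"Location Description_SIDEWALK"}, {"Primary Type_NARCOTICS", "Day period_day"}, 0.157602, 3.359794),
    (601, {"Primary Type_NARCOTICS", "Day period_day"}, {"Location Description_SIDEWALK"}, 0.338541, 3.359794),
    (664, {"Primary Type_MOTOR VEHICLE THEFT", "Day period_night"}, {"Location Description_STREET"}, 0.834036, 3.135259),
    (665, {"Location Description_STREET"}, {"Primary Type_MOTOR VEHICLE THEFT", "Day period_night"}, 0.070218, 3.135259),
    (663, {"Location Description_STREET", "Day period_night"}, {"Primary Type_MOTOR VEHICLE THEFT"}, 0.142571, 3.025578),
    (666, {"Primary Type_MOTOR VEHICLE THEFT"}, {"Location Description_STREET", "Day period_night"}, 0.396405, 3.025578),
]


def prune_rules(rules, min_lift=1.0):
    """Keep one rule per itemset: highest lift, then highest confidence."""
    best = {}
    for rule in rules:
        _, antecedents, consequents, confidence, lift = rule
        if lift <= min_lift:
            continue  # not more interesting than independence
        itemset = frozenset(antecedents) | frozenset(consequents)
        current = best.get(itemset)
        if current is None or (lift, confidence) > (current[4], current[3]):
            best[itemset] = rule
    # Strongest rules first.
    return sorted(best.values(), key=lambda r: r[4], reverse=True)


kept = prune_rules(rules)
print([rule_id for rule_id, *_ in kept])
```

Ranking by lift and then by confidence favors the direction of each rule that is most useful for prediction, but other criteria (for example a minimum support or conviction) would be equally defensible.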

From these conclusions we derive the following suggestions:

  1. Research recent public policies in the city of Chicago in order to cross-reference that information with the results obtained.
  2. It would be useful for the Chicago police to understand the behavior, occurrence, and prevalence of crimes so that they can target their efforts. For example, to curb motor vehicle theft they should focus on the streets at night: in the rules above, night-time motor vehicle theft occurs on the street with a confidence of 0.83 (rule 664, lift 3.14).
  3. For future work, analyze support, confidence, and lift for specific crimes within a specific district in order to understand how relevant each crime is there. For these cases we also suggest using the year as an antecedent variable, so that the variation of the metrics over time shows how these crimes evolve.
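
The last suggestion could be approached roughly as follows. This is a sketch under stated assumptions, not an analysis that was carried out: the records are hypothetical toy data (the district codes and counts are invented for illustration), `yearly_rule_metrics` is a hypothetical helper, and the rule evaluated for each year is (District, Year) -> (Primary Type).

```python
from collections import Counter

# Hypothetical toy records (year, district, primary type); NOT real data.
records = [
    (2005, "011", "NARCOTICS"),
    (2005, "011", "NARCOTICS"),
    (2005, "011", "THEFT"),
    (2005, "018", "THEFT"),
    (2015, "011", "NARCOTICS"),
    (2015, "011", "THEFT"),
    (2015, "018", "THEFT"),
    (2015, "018", "THEFT"),
]


def yearly_rule_metrics(records, district, crime):
    """Support, confidence and lift of (district, year) -> crime, per year."""
    n = len(records)
    consequent_support = sum(1 for _, _, t in records if t == crime) / n
    if consequent_support == 0:
        return {}  # the crime never occurs, so lift is undefined

    antecedent_count = Counter()  # records in the district, per year
    joint_count = Counter()       # records in the district with the crime, per year
    for year, d, t in records:
        if d == district:
            antecedent_count[year] += 1
            if t == crime:
                joint_count[year] += 1

    result = {}
    for year in sorted(antecedent_count):
        confidence = joint_count[year] / antecedent_count[year]
        result[year] = {
            "support": joint_count[year] / n,
            "confidence": confidence,
            "lift": confidence / consequent_support,
        }
    return result


by_year = yearly_rule_metrics(records, district="011", crime="NARCOTICS")
for year, m in by_year.items():
    print(year, {k: round(v, 3) for k, v in m.items()})
```

Comparing the per-year values would show whether a crime type is becoming more or less characteristic of a district, which is the evolution the suggestion refers to.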