Classifiers

Import and Filter

All columns of the dataset are aggregated per account using functions defined in clean_import.
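For orientation, a minimal sketch of what such a per-account aggregation can look like with pandas; the aggregation spec and call below are only illustrative, since the real logic (including the linear fits and the election-distance score) lives in clean_import.get_accounts:

import pandas as pd

# Illustrative aggregation spec; the actual column set produced by
# clean_import.get_accounts is shown in the Out[5] table further down.
agg_spec = {
    'followers': ['min', 'max'],
    'following': ['min', 'max'],
    'updates':   ['min', 'max'],
    'content':   ['count'],
}
# account_data = data.groupby('author').agg(agg_spec)  # yields MultiIndex columns per account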

In [54]:
from importlib import reload
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

import clean_import
reload(clean_import)

import clasificadores
reload(clasificadores)
from clasificadores import ClassifierGenerator

import visualization
reload(visualization)
from visualization import *

import warnings
warnings.filterwarnings("ignore")
In [2]:
data_dir = "../datos/"
data = clean_import.import_dataset(data_dir)
data = clean_import.filter_lang(data, 'English')
In [3]:
col = data.groupby('author').agg({'content' : 'count'})['content']
print('There are {} accounts in total'.format(col.size))


@interact
def graph_hist_count(percentil=(1, 100, 5), nmax=(0, 10000, 100)):
    percentil_val = np.percentile(col, percentil)
    print('Percentile {} corresponds to accounts with fewer than {} tweets'.format(percentil, percentil_val))
    print('There are {1} accounts with fewer than {0} tweets, out of a total of {2}'.format(percentil_val,
                                                                                             col[col < percentil_val].size,
                                                                                             col.size))
    col_cut = col[col < nmax]
    plt.figure(figsize=(10, 5))
    plt.hist(col_cut, bins=50)
    plt.title('Histogram for accounts with fewer than {} tweets'.format(nmax))
    plt.show()
There are 2161 accounts in total
In [4]:
reload(clean_import)
account_data = clean_import.get_accounts(data, 1, 2)

del data
The size of data is 224550 MB
In [5]:
col = account_data['followers']['linear_slope']
percentil_val = np.percentile(col,90)
print('There are {1} accounts with popularity above {0}, out of a total of {2}'.format(percentil_val,
                                                                                        col[col > percentil_val].size,
                                                                                        col.size))
account_data.head()
There are 216 accounts with popularity above 0.00013259122337470848, out of a total of 2161
Out[5]:
Columns (two header levels), indexed by author:
  followers, following, updates: min | max | delta_int | linear_slope
  region, language, account_type, account_category: first
  publish_date: delta_date | election_distance_score
  content: aggregate_content | count
10_GOP 0 10465 478.248175 2.976215e-03 0 1074 49.081561 1.006333e-04 1 352 16.040622 0.000106 Unknown English Right RightTroll 21.881944 0.055058 "We have a sitting Democrat US Senator on tria... 372
1D_NICOLE_ 40 53 0.159936 -4.578379e-07 48 59 0.135330 -4.552007e-07 352 395 0.529018 0.000002 United States English Koch Fearmonger 81.282639 0.013539 #FoodPoisoning is not a joke! #Walmart #KochFa... 41
1ERIK_LEE 74 74 0.000000 4.629564e-09 239 239 0.000000 2.923761e-09 330 336 8640.000000 0.100000 United States English Right RightTroll 0.000694 0.003712 Why is someone even against the #petition? I'l... 2
1LORENAFAVA1 1 103 1.906146 1.165116e-05 128 416 5.382060 1.800171e-05 24 3693 68.565199 0.000740 Italy English Italian NonEnglish 53.511111 0.044342 Come vedere Juventus-Milan in streaming o in t... 62
2NDHALFONION 1 1 0.000000 -7.867717e-12 22 22 0.000000 1.358883e-10 9 11 720.000000 0.007693 United States English Right RightTroll 0.002778 0.016187 '@HalfOnionInABag Follow the other half an oni... 3
  • The linear_slope columns are the slopes of a linear fit of the variable against the date. They measure growth, and let us quantify how much the account has grown in followers given its tweeting and follower history.
  • The content column (aggregate_content) concatenates all of the account's tweets into a single document.
  • The election_distance_score column is the square root of the sum of the squared inverses of the distance, in days, between each of the account's tweets and election day, November 8, 2016. The larger this value, the more active the account was around election day (see the sketch below).
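A minimal sketch of that score, assuming it is computed per account from the tweet dates (the helper below is hypothetical and only mirrors the description; the actual computation lives in clean_import):

import numpy as np
import pandas as pd

def election_distance_score(publish_dates, election_day='2016-11-08'):
    # Square root of the sum of squared inverse distances (in days) to election day.
    dates = pd.to_datetime(pd.Series(publish_dates))
    days = (dates - pd.Timestamp(election_day)).dt.days.abs().astype(float)
    days = days.clip(lower=1.0)  # avoid division by zero for tweets on election day
    return float(np.sqrt((1.0 / days ** 2).sum()))

# Example: a tweet one day before the election dominates the score
# election_distance_score(['2016-11-07', '2016-12-01'])  # ~1.001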

Set up

The n-grams between ngrammin and ngrammax are analyzed, and the classes are split at the percentiles given in percentiles.

In [6]:
# Configuration
percentiles = np.array([50, 90]) # Percentiles defining the class boundaries
account_data_filtered = account_data.dropna(subset=[('followers', 'linear_slope')])
class_col = account_data_filtered['followers']['linear_slope'].values
ngrammin = 1
ngrammax = 3
training_frac = .2
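A minimal sketch of how these percentile-based class labels can be derived; this is only our reading of what set_classes does in clasificadores, shown here for reference:

# Class 0: below the 50th percentile of the followers growth slope,
# class 1: between the 50th and 90th, class 2: above the 90th.
cutoffs = np.percentile(class_col, percentiles)  # [p50, p90]
class_labels = np.digitize(class_col, cutoffs)   # values in {0, 1, 2}
# np.bincount(class_labels) -> roughly [1080, 864, 217], as in the reports below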
In [7]:
# Encode country and account category as integer IDs
countries = account_data_filtered['region']['first'].unique()
country_ids = np.arange(countries.size)
country_dict = dict(zip(country_ids, countries))
country_dict_inv = dict(zip(countries, country_ids))

country_id_col = account_data_filtered.loc[:, ('region','first')].apply(
                                            lambda country : country_dict_inv[country])

account_categories = account_data_filtered['account_category']['first'].unique()
account_categories_ids = np.arange(account_categories.size)
account_categories_dict = dict(zip(account_categories_ids, account_categories))
account_categories_dict_inv = dict(zip(account_categories, account_categories_ids))

account_category_id_col = account_data_filtered.loc[:, ('account_category','first')].apply(
                                            lambda category : account_categories_dict_inv[category])

election_distance_col = account_data_filtered.loc[:, ('publish_date','election_distance_score')]
In [8]:
"""
Configura los distintos classifiergen que se usaran, para clasificacion numerica y de texto,
y para regresión numérica y de texto. Asi hacemos el preprocesamiento solo una vez
"""

# Set up text classification
classifiergen_text  = ClassifierGenerator()
classifiergen_text.set_classes(class_col, percentiles)

## Add attributes
classifiergen_text.add_attrib(account_data_filtered['content']['aggregate_content'], name='content')

## Preprocess
classifiergen_text.preprocess_data(frac=training_frac, ngrammin=ngrammin, ngrammax=ngrammax)


# Set up numeric classification
classifiergen_numeric = ClassifierGenerator()
classifiergen_numeric.set_classes(class_col, percentiles)

## Add attributes
classifiergen_numeric.add_attrib(account_data_filtered['updates']['max'], name='Maximum updates')
classifiergen_numeric.add_attrib(account_data_filtered['following']['max'], name='Maximum following')
classifiergen_numeric.add_attrib(country_id_col, name='Country ID')
classifiergen_numeric.add_attrib(election_distance_col, name='Election Activity Distance')
#classifiergen_numeric.add_attrib(account_category_id_col, name='Category_id')


# Preprocess
classifiergen_numeric.preprocess_data(frac=training_frac)
In [9]:
# Set up text regression
regressiongen_text = ClassifierGenerator()
regressiongen_text.set_values(class_col)

## Add attributes
regressiongen_text.add_attrib(account_data_filtered['content']['aggregate_content'], name='content')

## Preprocess
regressiongen_text.preprocess_data(frac=training_frac, ngrammin=ngrammin, ngrammax=ngrammax)



# Set up numeric regression
regressiongen_numeric = ClassifierGenerator()
regressiongen_numeric.set_values(class_col)

## Add attributes
regressiongen_numeric.add_attrib(account_data_filtered['updates']['max'], name='Maximum updates')
regressiongen_numeric.add_attrib(account_data_filtered['following']['max'], name='Maximum following')
regressiongen_numeric.add_attrib(country_id_col, name='Country ID')
regressiongen_numeric.add_attrib(election_distance_col, name='Election Activity Distance')
#regressiongen_numeric.add_attrib(account_category_id_col, name='Category_id')

## Preprocess
regressiongen_numeric.preprocess_data(frac=training_frac)

Classification

Classification with Decision Tree

Numeric decision tree

The classifier performs somewhat better with the default parameters, but it is much more compact with the parameters found by grid search. The F1-score for class 2 (very high popularity) is about 0.5, which means it produces proportionally more false positives and false negatives than for the other classes.
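As a quick sanity check on that number, the F1-score is the harmonic mean of precision and recall; plugging in the class-2 values from the report below (precision 0.57, recall 0.45):

p, r = 0.57, 0.45           # class 2 precision and recall from the report below
print(2 * p * r / (p + r))  # ~0.50; the report shows 0.51 from the unrounded values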

In [51]:
from sklearn.tree import DecisionTreeClassifier

classifiergen_numeric.set_classifier(DecisionTreeClassifier())
classifiergen_numeric.train()
In [55]:
classifiergen_numeric.classifier_report()
graph_tree(classifiergen_numeric, 'decision_num_tree')
graph_2D(classifiergen_numeric, plot_decision_regions, 'decision_num_surf')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7297921478060047
              precision    recall  f1-score   support

           0       0.81      0.78      0.80       216
           1       0.67      0.73      0.70       173
           2       0.57      0.45      0.51        44

   micro avg       0.73      0.73      0.73       433
   macro avg       0.68      0.66      0.67       433
weighted avg       0.73      0.73      0.73       433

[Decision tree graph (decision_num_tree): root split on Election Activity Distance <= 0.001168, with further splits on Election Activity Distance, Maximum following, Maximum updates and Country ID.]

Use cross-validation with k=5 to find the best parameters

In [47]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {  
    'max_depth': [10, 50, 100, 500, 1000, None],
    'min_samples_split' : np.arange(2, 10),
    'min_samples_leaf' : [.1, 0.3, 0.5],
    'criterion': ['gini', 'entropy']
}
classifiergen_numeric.set_classifier(GridSearchCV(estimator=DecisionTreeClassifier(),
                                                  param_grid=param_grid, scoring='accuracy',
                                                  cv=5, n_jobs=-1, refit=True))
classifiergen_numeric.train()
In [48]:
classifiergen_numeric.classifier = classifiergen_numeric.classifier.best_estimator_
classifiergen_numeric.classifier_report()
graph_tree(classifiergen_numeric, 'decision_num_tree_cross')
graph_2D(classifiergen_numeric, plot_decision_regions, 'decision_num_surf_cross')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7182448036951501
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       216
           1       0.66      0.64      0.65       173
           2       0.60      0.55      0.57        44

   micro avg       0.72      0.72      0.72       433
   macro avg       0.68      0.67      0.67       433
weighted avg       0.71      0.72      0.72       433

[Decision tree graph (decision_num_tree_cross): pruned entropy tree; root split on Election Activity Distance <= 0.001168, with further splits on Maximum following, Maximum updates and Election Activity Distance.]

Text decision tree

In [56]:
from sklearn.tree import DecisionTreeClassifier

classifiergen_text.set_classifier(DecisionTreeClassifier())
classifiergen_text.train()
In [57]:
classifiergen_text.classifier_report()
graph_tree(classifiergen_text, 'decision_text')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7090069284064665
              precision    recall  f1-score   support

           0       0.75      0.79      0.77       216
           1       0.66      0.68      0.67       173
           2       0.63      0.43      0.51        44

   micro avg       0.71      0.71      0.71       433
   macro avg       0.68      0.63      0.65       433
weighted avg       0.71      0.71      0.71       433

[Decision tree graph (decision_text): root split on the term 'using', with further splits on n-grams such as 'book', 'merkel', 'amp', 'another huge', 'kitten' and 'new'.]

Using cross-validation to find the best parameter set

In [16]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {  
    'max_depth': [50, 100, 1000, None],
    'min_samples_split' : np.linspace(0.1, 1.0, 2, endpoint=True),
    'min_samples_leaf' : [.1, .25, .5],
    'criterion': ['gini', 'entropy']
}
classifiergen_text.set_classifier(GridSearchCV(estimator=DecisionTreeClassifier(),
                                                  param_grid=param_grid, scoring='accuracy',
                                                  cv=5, n_jobs=-1, refit=True))
classifiergen_text.train()
In [17]:
classifiergen_text.classifier = classifiergen_text.classifier.best_estimator_
classifiergen_text.classifier_report()
graph_tree(classifiergen_text, 'decision_cross_text')
#graph_2D(classifiergen_numeric, plot_decision_regions)
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.6558891454965358
              precision    recall  f1-score   support

           0       0.69      0.87      0.77       216
           1       0.60      0.56      0.58       173
           2       0.00      0.00      0.00        44

   micro avg       0.66      0.66      0.66       433
   macro avg       0.43      0.48      0.45       433
weighted avg       0.58      0.66      0.61       433

[Decision tree graph (decision_cross_text): shallow entropy tree; root split on the term 'texas', with further splits on 'come', 'via', 'new' and 'people'.]

Classification with a forest of decision trees

The forest of decision trees yields better metrics than a single decision tree.

Numeric

The most important feature, with roughly 42% of the importance assigned by the classifier, is the distance to election day, while the region turns out not to be very relevant. The recall and F1-score for class 2 are among the highest obtained.

In [58]:
from sklearn.ensemble import RandomForestClassifier

classifiergen_numeric.set_classifier(RandomForestClassifier(n_estimators=100))
classifiergen_numeric.train()
In [59]:
classifiergen_numeric.classifier_report()
graph_2D(classifiergen_numeric, plot_decision_regions, 'random_fortest_num_surf')
print('The feature importances are', np.array(classifiergen_numeric.classifier.feature_importances_))
print('for the features', classifiergen_numeric.feature_names)
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7736720554272517
              precision    recall  f1-score   support

           0       0.83      0.83      0.83       216
           1       0.71      0.76      0.74       173
           2       0.73      0.55      0.62        44

   micro avg       0.77      0.77      0.77       433
   macro avg       0.76      0.71      0.73       433
weighted avg       0.77      0.77      0.77       433

The feature importances are [0.19984121 0.31288534 0.06592858 0.42134486]
for the features ['Maximum updates' 'Maximum following' 'Country ID'
 'Election Activity Distance']

Text

The feature importances returned by the classifier can be used to build a word cloud that represents the relevance of each word (a sketch of the idea follows the word_cloud call below).

In [20]:
from sklearn.ensemble import RandomForestClassifier

classifiergen_text.set_classifier(RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1313))
classifiergen_text.train()
In [21]:
classifiergen_text.classifier_report()
#graph_2D(classifiergen_text, plot_decision_regions)
importances = np.array(classifiergen_text.classifier.feature_importances_)
names = np.array(classifiergen_text.feature_names)
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7413394919168591
              precision    recall  f1-score   support

           0       0.73      0.94      0.82       216
           1       0.75      0.61      0.67       173
           2       0.86      0.27      0.41        44

   micro avg       0.74      0.74      0.74       433
   macro avg       0.78      0.61      0.64       433
weighted avg       0.75      0.74      0.72       433

In [22]:
word_cloud(names, importances, ngrammin, ngrammax)
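The visualization.word_cloud helper itself is not shown in this notebook; a minimal sketch of the same idea, assuming the third-party wordcloud package, would be:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Map each n-gram to its random-forest importance and render the cloud.
freqs = dict(zip(names, importances))
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(freqs)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()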

Classification with supervised k-nearest neighbors

It does not work with the text data; it gives better results than the single decision trees but not better than the forest. It has a higher rate of false negatives for the high and very high popularity classes.

Numeric

In [23]:
from sklearn.neighbors import KNeighborsClassifier

param_grid = {  
    'algorithm' : ['ball_tree', 'kd_tree'],
    'p' : [1, 2, 5],
    'n_neighbors' : [3, 5, 7, 9, 12]
}
classifiergen_numeric.set_classifier(GridSearchCV(estimator=KNeighborsClassifier(),
                                                  param_grid=param_grid, scoring='accuracy',
                                                  cv=5, n_jobs=-1, refit=True))
classifiergen_numeric.train()
In [24]:
classifiergen_numeric.classifier_report()
graph_2D(classifiergen_numeric, plot_decision_regions, 'kneigh_num_surf')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7321016166281755
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       216
           1       0.71      0.65      0.68       173
           2       0.72      0.41      0.52        44

   micro avg       0.73      0.73      0.73       433
   macro avg       0.73      0.64      0.67       433
weighted avg       0.73      0.73      0.72       433

Classification with Naive Bayes

Naive Bayes is not appropriate for the numeric classification. It does not show greater predictive power than the other methods used.

Classification with BernoulliNB

In [25]:
from sklearn.naive_bayes import BernoulliNB

classifiergen_text.set_classifier(BernoulliNB())
classifiergen_text.train()
In [26]:
classifiergen_text.classifier_report()
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.6420323325635104
              precision    recall  f1-score   support

           0       0.60      0.98      0.75       216
           1       0.84      0.30      0.44       173
           2       0.70      0.32      0.44        44

   micro avg       0.64      0.64      0.64       433
   macro avg       0.71      0.53      0.54       433
weighted avg       0.71      0.64      0.59       433

Classification with MultinomialNB

It falls short here: class 2 is never predicted, so its recall and f1-score drop to zero and its precision is undefined.

In [27]:
from sklearn.naive_bayes import MultinomialNB

classifiergen_text.set_classifier(MultinomialNB())
classifiergen_text.train()
In [28]:
classifiergen_text.classifier_report()
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.605080831408776
              precision    recall  f1-score   support

           0       0.85      0.50      0.63       216
           1       0.50      0.90      0.65       173
           2       0.00      0.00      0.00        44

   micro avg       0.61      0.61      0.61       433
   macro avg       0.45      0.46      0.42       433
weighted avg       0.63      0.61      0.57       433

Classification with ComplementNB

The same happens as in the previous case.

In [29]:
from sklearn.naive_bayes import ComplementNB

classifiergen_text.set_classifier(ComplementNB())
classifiergen_text.train()
In [30]:
classifiergen_text.classifier_report()
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.6004618937644342
              precision    recall  f1-score   support

           0       0.84      0.50      0.62       216
           1       0.50      0.88      0.64       173
           2       0.00      0.00      0.00        44

   micro avg       0.60      0.60      0.60       433
   macro avg       0.45      0.46      0.42       433
weighted avg       0.62      0.60      0.57       433

Classification using SVC

Non-linear SVC is useful for text classification; linear SVC is general purpose.

Kernel: linear

It does not give better results than Naive Bayes; the recall and f1-score for the very high popularity class are low compared with the other methods.

In [31]:
from sklearn.svm import LinearSVC

classifiergen_numeric.set_classifier(LinearSVC(C=.5, max_iter=10000))
classifiergen_numeric.train()
In [32]:
classifiergen_numeric.classifier_report()
graph_2D(classifiergen_numeric, plot_decision_regions, 'LinearSVC_num_surf')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.6073903002309469
              precision    recall  f1-score   support

           0       0.58      0.99      0.73       216
           1       0.80      0.25      0.38       173
           2       0.75      0.14      0.23        44

   micro avg       0.61      0.61      0.61       433
   macro avg       0.71      0.46      0.45       433
weighted avg       0.68      0.61      0.54       433

LinearSVC

Linear SVC turns out to be one of the best classifiers, although its recall and f1-score do not surpass those of some of the other methods used.

In [33]:
from sklearn.svm import LinearSVC

classifiergen_text.set_classifier(LinearSVC(C=.5, max_iter=10000))
classifiergen_text.train()
In [34]:
classifiergen_text.classifier_report()
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.7713625866050808
              precision    recall  f1-score   support

           0       0.76      0.91      0.83       216
           1       0.79      0.69      0.73       173
           2       0.78      0.41      0.54        44

   micro avg       0.77      0.77      0.77       433
   macro avg       0.78      0.67      0.70       433
weighted avg       0.77      0.77      0.76       433

Kernel: rbf

Its performance is closer to that of Naive Bayes, for both the numeric and the text classification.

In [35]:
from sklearn.svm import SVC

classifiergen_numeric.set_classifier(SVC(gamma=10.0, C=.5, cache_size=1200, kernel='rbf'))
classifiergen_numeric.train()
In [36]:
classifiergen_numeric.classifier_report()
graph_2D(classifiergen_numeric, plot_decision_regions, 'SVC_rbf_num_surf')
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.6189376443418014
              precision    recall  f1-score   support

           0       0.59      0.96      0.73       216
           1       0.75      0.34      0.47       173
           2       0.67      0.05      0.09        44

   micro avg       0.62      0.62      0.62       433
   macro avg       0.67      0.45      0.43       433
weighted avg       0.66      0.62      0.56       433

In [37]:
from sklearn.svm import SVC

classifiergen_text.set_classifier(SVC(gamma=10.0, C=.5, cache_size=1200, kernel='rbf'))
classifiergen_text.train()
In [38]:
classifiergen_text.classifier_report()
The dataset has 2161 elements
There are 3 classes
There are [1080  864  217] instances of classes [0 1 2], respectively
Accuracy:  0.5150115473441108
              precision    recall  f1-score   support

           0       0.51      1.00      0.67       216
           1       1.00      0.01      0.01       173
           2       1.00      0.14      0.24        44

   micro avg       0.52      0.52      0.52       433
   macro avg       0.84      0.38      0.31       433
weighted avg       0.75      0.52      0.36       433

Regression

None of the regression methods used yielded a positive R^2 value, so all of them turned out worse than simply predicting the mean for every value.
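For context, sklearn defines R^2 as 1 - SS_res / SS_tot, so a score of 0 corresponds to always predicting the mean of the targets and any negative score is strictly worse than that baseline. A tiny illustration with made-up numbers:

from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([0.1, 0.2, 0.3, 0.4])
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0: the mean baseline
print(r2_score(y_true, np.array([1.0, -1.0, 2.0, -2.0])))     # negative: worse than the mean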

SVR with linear kernel

In [39]:
from sklearn.svm import SVR

regressiongen_numeric.set_classifier(SVR(gamma=1.0, C=.5, cache_size=1200, kernel='linear'))
regressiongen_numeric.train()
In [62]:
regressiongen_numeric.regression_report()
The dataset has 2161 elements
The R^2 coefficient is -1487.7963272656639
The mean squared error is 6.711167170995799e-05
The explained variance score is -2.220446049250313e-16

General observations

The best classifiers were the forests of decision trees for both the numeric and the text classification, supervised k-nearest neighbors for the numeric classification, and LinearSVC for the text classification; the regression did not yield good results. Even so, all of the classifiers used performed better than random choice (accuracy > 0.33 for three classes).

In general, the F1-score for the very high popularity class is relatively low (< 0.5 for most classifiers). However, the numeric classification with the forest of decision trees gave the best F1-score for that class among all the classifiers, so if the goal is to predict whether an account's popularity will land in the highest percentiles, that would be the most useful classifier.