YouTube is a web platform for sharing videos on topics ranging from entertainment to education. It started as a simple web application launched in 2005 by Chad Hurley, Steve Chen and Jawed Karim, and has since become one of the most widely used social networks in the world.
Today the platform has more than 1.9 billion monthly active users; it is estimated that 400 hours of video are uploaded every minute and that, in turn, users watch around 1 billion hours of video per day.
The platform automatically recommends videos, personalized for each user. Considering that more than 70% of the time spent on YouTube goes to watching what the platform recommends, the recommendation algorithm is of fundamental importance for content creators who want to gain popularity and views, publicize their work and interests, and often build a career on YouTube.
The YouTube algorithm, which seeks to maximize watch time, is extremely complex, changes constantly, and can be considered impossible to understand in full, especially for anyone who does not work at Google.
Using data mining concepts and algorithms it is possible to study the data generated by users and find patterns that determine the most viewed and most commented content on the platform; this information is valuable for content creators.
The dataset used is called “Trending YouTube Video Statistics” and is publicly available on Kaggle.
It contains daily records of trending YouTube videos, separated by country, together with statistics focused on their content. The dataset lists 200 trending videos per day, with each region's data in a separate file; each file includes the video title, the channel, the publication date, the tags, the number of views, and the numbers of likes and dislikes.
For each record we can obtain, among other fields, the video id, title, channel, publication and trending dates, category, tags, description, view count, like and dislike counts, comment count, and flags indicating whether comments or ratings are disabled.
As initial hypotheses, we believe that:
We can identify the characteristics that make a video popular on YouTube. We can measure this with the number of views and likes a video obtains.
We can identify the characteristics that make a video controversial on YouTube. We can measure this with the relation between the dislikes and likes a video obtains (a small sketch of both proxies is shown right after this list).
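As a quick illustration, a minimal pandas sketch of the two proxy metrics, assuming the USvideos.csv file that is also used in the exploration below:
import pandas as pd
import numpy as np

videos = pd.read_csv('USvideos.csv')
# popularity proxy: raw view and like counts
print(videos[['views', 'likes']].describe())
# controversy proxy: dislikes relative to likes (guard against division by zero)
controversy = videos['dislikes'] / videos['likes'].replace(0, np.nan)
print(controversy.describe())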
The data for each region is in a different language, including Russian, English, Spanish, French and Japanese, among others. To simplify the information shown in milestone 1, we decided to use the data from the United States. The idea of the analysis is to build a content predictor that selects the most interesting topics within the data under study.
library(gdata)
library(purrr)
library(ggplot2)
library(jsonlite)
data_dir <- ""
us_videos <- read.csv(paste(data_dir, "USvideos.csv", sep=""))
us_videos$trending_date <- as.Date(us_videos$trending_date, format = "%y.%d.%m")
us_videos$title <- as.character(us_videos$title)
us_category <- fromJSON("US_category_id.json")
us_category <- data.frame(id=us_category$items$id, name=us_category$items$snippet$title)
us_category$name <- as.character(us_category$name)
category_name <- function(id) {
return(us_category[us_category$id == id, "name"])
}
us_videos$category <- map_chr(us_videos$category_id, category_name)
Number of null values
print(sum(is.na(us_videos)))
## [1] 0
Number of distinct channels
print(length(unique(us_videos$channel_title)))
## [1] 2207
Number of distinct videos (some videos are trending for more than one day)
print(length(unique(us_videos$video_id)))
## [1] 6351
Trending date range
print(min(us_videos$trending_date))
## [1] "2017-11-14"
print(max(us_videos$trending_date))
## [1] "2018-06-14"
print(max(us_videos$trending_date) - min(us_videos$trending_date))
## Time difference of 212 days
Number of distinct categories
print(length(unique(us_videos$category)))
## [1] 16
ggplot(data.frame(Category=us_videos$category)) +
geom_bar(aes(x=reorder(Category, Category, FUN=length)), stat="count") +
coord_flip() +
ggtitle("Histograma de categorías") +
xlab("Categorías") +
ylab("Número de videos")
t <- unique(us_videos[,c("video_id", "trending_date")])
t <- as.data.frame(table(t["video_id"]))
t <- as.data.frame(table(t[2]))
t$Var1 <- as.numeric(t$Var1)
ggplot(data.frame(t)) +
geom_bar(aes(x = Var1, y = Freq), stat="identity") +
ggtitle("Frecuencia de videos por la cantidad de días trending") +
xlab("Cantidad de días") +
ylab("Frecuencia de videos")
Top 10 channels with the most trending days
t <- unique(us_videos[,c("trending_date", "channel_title")])
t <- as.data.frame(table(t["channel_title"]))
t <- t[order(t$Freq, decreasing=TRUE),]
t[1:10,]
## Var1 Freq
## 635 ESPN 202
## 1936 The Tonight Show Starring Jimmy Fallon 197
## 1387 Netflix 193
## 1955 TheEllenShow 192
## 2118 Vox 192
## 1904 The Late Show with Stephen Colbert 187
## 975 Jimmy Kimmel Live 185
## 1102 Late Night with Seth Meyers 183
## 1687 Screen Junkies 182
## 1373 NBA 181
Distribution of channels by number of trending days
t <- data.frame(table(t[2]))
t$Var1 <- as.numeric(t$Var1)
ggplot(t) +
geom_bar(aes(x = Var1, y = Freq), stat="identity") +
ggtitle("Frecuencia de canales por la cantidad de días trending") +
xlab("Cantidad de días") +
ylab("Frecuencia de canales")
Percentage of videos with a fully uppercase word (at least 4 characters long) in the title
has_upper <- function(line) {
v <- strsplit(line, " ")[[1]]
for (s in v) {
if (nchar(s) > 3 && s == toupper(s)) {
return(1)
}
}
return(0)
}
t <- map(unique(us_videos$title), has_upper)
print(Reduce("+", t))
## [1] 1318
print(Reduce("+", t) / length(t))
## [1] 0.2041828
Percentage of videos with ratings/comments disabled
t <- us_videos[us_videos$comments_disabled == "True" | us_videos$ratings_disabled == "True",]
t <- unique(t$video_id)
print(length(t))
## [1] 122
print(length(t) / length(unique(us_videos$video_id)))
## [1] 0.01920957
Other exploratory visualizations were made with other tools and are presented as images below.
A new dataframe is created with the columns that are useful to us and that do not need preprocessing.
More features are added to it further below.
import pandas as pd
import numpy as np
import io
path_to_file = 'USvideos.xlsx'
dataset = pd.read_excel(path_to_file)
print(dataset.columns)
print(dataset.shape) #(40949, 16)
clean_columns = [
'category_id',
'comment_count',
#'comments_disabled',
#'ratings_disabled',
#'views',
]
dataset = dataset[dataset['ratings_disabled']==False]
dataset = dataset[dataset['likes'].notnull()]
dataset = dataset[dataset['dislikes'].notnull()]
clean_df = dataset[clean_columns]
print(clean_df.shape) #(40780, 2) after filtering
clean_df.head()
clean_df is cleaned and features are added to it. The whole section needs to be run; the Bag of Words part does not.
import re
from urllib.parse import urlparse
url_re = r"(http|https|ftp|ftps)://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?"
def delete_urls(text):
return re.sub(url_re, "", text)
# returns a list with the urls found in a text
def urls_in_text(text):
return [m.group(0) for m in re.finditer(url_re, text)]
# ['http://www.cwi.nl:80/%7Eguido/Python.html',] >> ['www.cwi.nl:80',]
# only works when the urls start with http:// , https://
def urls2netloc(urls):
res = []
for u in urls:
res.append(urlparse(u).netloc) #print(u,urlparse(u))
return res
# replaces the urls in a text, keeping only their netloc
def replace_urls(text):
urls = urls_in_text(text)
netlocs = urls2netloc(urls)
# print(urls) # print(netlocs)
for u, n in zip(urls, netlocs):
text = text.replace(u, n)
return text
# example = '''
# blabllablbl http://www.dasd.com/asd \n
# blab llablblblabllablbl blabllablbl https://xdasd.net/asdasdasds
# https://www.youtub.com/123123
# '''
# replace_urls(example)
The following functions return the cleaned string.
def clean_description(de):
clean_str = de.replace('\n', ' \n ')
clean_str = replace_urls(clean_str)
return clean_str
def clean_title(ti):
# nothing to clean?
return ti
# the format is changed so that it works with the BoW
# string uno"|"str2"|"... >> string_uno str2 ...
def clean_tags(ta):
clean_str = ta.replace(' ', '_')
clean_str = clean_str.replace('"|"', ' ')
return clean_str
def urls_count(text):
return len(urls_in_text(text))
def get_ratio(a, b):
if b == 0 :
return -1
return a/b
def uppercase_count(text):
return sum(1 for c in text if c.isupper())
def spaces_count(text):
return sum(1 for c in text if c.isspace())
def numbers_count(text):
return sum(1 for c in text if c.isdigit())
def words_count(text):
return sum(1 for c in text if c.isalpha())
cleaned = {
'desc': [],
'title': [],
'tags': [],
}
new = {
'desc': {
'url_cnt': [],
'question_cnt': [],
'exclamation_cnt': [],
'spaces_cnt': [],
'numbers_cnt': [],
'words_cnt': [],
'uppercase_ratio': [],
'len': [],
},
'title': {
'question_cnt': [],
'exclamation_cnt': [],
'spaces_cnt': [],
'numbers_cnt': [],
'words_cnt': [],
#'uppercase_cnt': [],
'uppercase_ratio': [],
'len': [],
},
'tags': {
'cnt': [],
},
}
iterator = zip(
dataset['description'],
dataset['title'],
dataset['tags'])
for de, ti, ta in iterator:
if type(de) is float:
de = ""
if type(ti) is float:
ti = ""
if type(ta) is float:
ta = ""
cleaned_de = clean_description(de)
cleaned_ti = clean_title(ti)
cleaned_ta = clean_tags(ta)
cleaned['desc'].append(cleaned_de)
cleaned['title'].append(cleaned_ti)
cleaned['tags'].append(cleaned_ta)
new['desc']['url_cnt'].append(urls_count(de))
new['desc']['question_cnt'].append(cleaned_de.count('?'))
new['desc']['exclamation_cnt'].append(cleaned_de.count('!'))
new['desc']['spaces_cnt'].append(spaces_count(cleaned_de))
new['desc']['numbers_cnt'].append(numbers_count(cleaned_de))
new['desc']['words_cnt'].append(words_count(cleaned_de))
new['desc']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_de),
len(cleaned_de)))
new['desc']['len'].append(len(de))
new['title']['question_cnt'].append(cleaned_ti.count('?'))
new['title']['exclamation_cnt'].append(cleaned_ti.count('!'))
new['title']['spaces_cnt'].append(spaces_count(cleaned_ti))
new['title']['numbers_cnt'].append(numbers_count(cleaned_ti))
new['title']['words_cnt'].append(words_count(cleaned_ti))
#new['title']['uppercase_cnt'].append(uppercase_count(cleaned_ti))
new['title']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_ti),
len(cleaned_ti)))
new['title']['len'].append(len(cleaned_ti))
new['tags']['cnt'].append(cleaned_ta.count(' '))
'''
clean_df['description'] = cleaned['desc']
clean_df['title'] = cleaned['title']
clean_df['tags'] = cleaned['tags']
'''
clean_df['desc_url_cnt'] = new['desc']['url_cnt']
clean_df['desc_question_cnt'] = new['desc']['question_cnt']
clean_df['desc_exclamation_cnt'] = new['desc']['exclamation_cnt']
clean_df['desc_spaces_cnt'] = new['desc']['spaces_cnt']
clean_df['desc_numbers_cnt'] = new['desc']['numbers_cnt']
clean_df['desc_words_cnt'] = new['desc']['words_cnt']
clean_df['desc_uppercase_ratio'] = new['desc']['uppercase_ratio']
clean_df['desc_len'] = new['desc']['len']
clean_df['title_question_cnt'] = new['title']['question_cnt']
clean_df['title_exclamation_cnt'] = new['title']['exclamation_cnt']
clean_df['title_spaces_cnt'] = new['title']['spaces_cnt']
clean_df['title_numbers_cnt'] = new['title']['numbers_cnt']
clean_df['title_words_cnt'] = new['title']['words_cnt']
#clean_df['title_uppercase_cnt'] = new['title']['uppercase_cnt']
clean_df['title_uppercase_ratio'] = new['title']['uppercase_ratio']
clean_df['title_len'] = new['title']['len']
clean_df['tags_cnt'] = new['tags']['cnt']
print(clean_df.shape) # (40780, 18)
clean_df.head()
This transforms strings into a unique identifier, useful for the channel name, but it was not used in the analysis since this value is not of interest for making the predictions.
"""
from sklearn.preprocessing import LabelEncoder
y2 = dataset['channel_title']
lb = LabelEncoder()
clean_df = clean_df.assign(channel_title = lb.fit_transform(y2))
"""
## 2-gram (bigram) Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
token_pattern=r'\b\w+\b',
stop_words ='english',
min_df=0.1)
Xtitle_2 = bigram_vectorizer.fit_transform(cleaned['title'])
print(Xtitle_2.shape) # == (40780, 381751) == videos vs bigrams
'''
# total occurrences per bigram
title_bigram_total = Xtitle_2.sum(axis=0)
bigram_total.shape == (1, 381751)
# total bigrams per video
title_video_total = Xtitle_2.sum(axis=1)
bigram_total.shape == (40780, 1)
'''
title_feature_names= bigram_vectorizer.get_feature_names()
# print(len(title_feature_names)) #381751
TitleBOW = pd.DataFrame(Xtitle_2.toarray()).fillna(0)
TitleBOW.head()
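To make the Bag of Words representation concrete, a small sketch of the same vectorizer applied to a few made-up titles (illustrative only, not taken from the dataset):
from sklearn.feature_extraction.text import CountVectorizer

toy_titles = ["cat video goes viral", "viral cat compilation", "new cat video"]
toy_vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b')
X_toy = toy_vec.fit_transform(toy_titles)
print(toy_vec.get_feature_names())  # unigrams and bigrams, e.g. 'cat video', 'viral cat'
print(X_toy.toarray())              # one row per title, one column per n-gram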
These categories are split arbitrarily; they could be divided in a better way.
def clean_nans(df):
res = df
for c in df.columns:
cnt = len(res[c])
res = res[res[c].notnull()]
if(cnt != len(res)):
print(c,'=',cnt - len(res[c]), 'nulls')
return res
target_columns = [
'likes',
'dislikes',
'views',
]
target_df = dataset[target_columns]
#target_df['likes_ratio'] = (target_df['likes']+0)/(target_df['dislikes']+0)
likes_ratio = []
for l,d in zip(target_df['likes'], target_df['dislikes']):
likes_ratio.append(get_ratio(l,d))
target_df = target_df.assign(likes_ratio = likes_ratio)
bins = [-np.inf, 0.5, 1.5, 15, 50.0, 100.0, 200.0, np.inf]
labels=['<0.5','<1.5','<15','<50','<100','<200','>200']
target_df['likes_ratio_category'] = pd.cut(target_df['likes_ratio'], bins=bins, labels=labels)
bins = [-np.inf, 500000, 1000000, 10000000, 50000000, np.inf]
labels=['0.5M','1M','10M','50M','>50M']
target_df['views_category'] = pd.cut(target_df['views'], bins=bins, labels=labels)
target_df.head()
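Since the bins are hand-picked, it is worth checking how many videos end up in each class; a quick check:
# class sizes for both hand-picked binnings
print(target_df['likes_ratio_category'].value_counts().sort_index())
print(target_df['views_category'].value_counts().sort_index())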
X = clean_nans(clean_df)
Y = clean_nans(target_df)
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB # naive bayes
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, precision_score
def run_classifier(clf, X_train, X_test, y_train, y_test, num_tests=100):
metrics = {'f1-score': [], 'precision': [], 'recall': []}
for _ in range(num_tests):
clf.fit(X_train, y_train) ## train with X_train and labels y_train
predictions = clf.predict(X_test) ## predict on new data (the X_test test set)
metrics['y_pred'] = predictions
metrics['y_prob'] = clf.predict_proba(X_test)[:,1]
metrics['f1-score'].append(f1_score(y_test, predictions, average='weighted'))
metrics['recall'].append(recall_score(y_test, predictions, average='weighted'))
metrics['precision'].append(precision_score(y_test, predictions, average='weighted'))
return metrics
classifiers = [
("Base Dummy", DummyClassifier(strategy='stratified')),
("Decision Tree", DecisionTreeClassifier()),
("Gaussian Naive Bayes", GaussianNB()),
("KNN", KNeighborsClassifier(n_neighbors=5)),
]
target_classes = [
('Likes/Dislikes ratio Category', 'likes_ratio_category'),
('Views Category', 'views_category'),
]
X = clean_df
for tname, col in target_classes:
print('Classifying: {}'.format(tname))
y = target_df[col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.60)
results = {}
for name, clf in classifiers:
metrics = run_classifier(clf, X_train, X_test, y_train, y_test) # run_classifier is defined in the previous block
results[name] = metrics
print("----------------")
print("Resultados para clasificador: ",name)
print("Precision promedio:",np.array(metrics['precision']).mean())
print("Recall promedio:",np.array(metrics['recall']).mean())
print("F1-score promedio:",np.array(metrics['f1-score']).mean())
print("----------------\n\n")
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
target_classes = [
('Likes/Dislikes ratio Category', 'likes_ratio_category'),
('Views Category', 'views_category'),
]
X = clean_df
for name, col in target_classes:
print('Classifying: {}'.format(name))
y = target_df[col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy en test set:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
import warnings
warnings.filterwarnings('ignore')
clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt'
]
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB # naive bayes
from sklearn.neighbors import KNeighborsClassifier
cols_groups =[
( "original",
[
'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt'
]
),
( "sin tags",
[
'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len'
]),
( "sin descripcion",
[
'category_id', 'comment_count',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt'
]
),
( "sin titulo",
[
'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'tags_cnt',
]
)
]
classifiers = [
("Base Dummy", DummyClassifier(strategy='stratified')),
("Decision Tree", DecisionTreeClassifier()),
("Gaussian Naive Bayes", GaussianNB()),
("KNN", KNeighborsClassifier(n_neighbors=5)),
]
target_classes = [
('Likes/Dislikes ratio Category', 'likes_ratio_category'),
('Views Category', 'views_category'),
]
print('classifier, target, cols_description, precision, recall, f1-score')
for tname, col in target_classes:
for cname, clf in classifiers:
for gdescription, group in cols_groups:
X = clean_df[group]
y = target_df[col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)
metrics = run_classifier(clf, X_train, X_test, y_train, y_test) # run_classifier is defined in a previous block
print("{}, {}, {}, {}, {}, {}, ".format(
cname,
tname,
gdescription,
np.array(metrics['precision']).mean(),
np.array(metrics['recall']).mean(),
np.array(metrics['f1-score']).mean()
)
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
classifiers = [
("Base Dummy", DummyClassifier(strategy='stratified')),
("Decision Tree", DecisionTreeClassifier()),
("Gaussian Naive Bayes", GaussianNB()),
("KNN", KNeighborsClassifier(n_neighbors=5)),
]
target_classes = [
('Likes/Dislikes ratio Category', 'likes_ratio_category'),
('Views Category', 'views_category'),
]
X = TitleBOW
print('classifier, target, precision, recall, f1-score')
for tname, col in target_classes:
for cname, clf in classifiers:
y = target_df[col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)
metrics = run_classifier(clf, X_train, X_test, y_train, y_test)
print("{}, {}, {}, {}, {}, ".format(
cname,
tname,
np.array(metrics['precision']).mean(),
np.array(metrics['recall']).mean(),
np.array(metrics['f1-score']).mean()
)
)
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize
columns = ['likes', 'dislikes', 'views', 'comment_count']
X = dataset[columns]
# Correlation matrix
cor = X.corr()
sns.heatmap(cor, square=True, vmin=0)
# Correlation matrix
cor = clean_df.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
# Correlation matrix
cor = target_df.corr()
sns.heatmap(cor, square=True, vmin=0)
# Correlation matrix
clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt']
target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
'views_category']
pd_total =pd.concat([clean_df, target_df], axis=1)
cor = pd_total.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
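Besides the heatmap, listing the features most correlated with the targets can be easier to read; a quick sketch on the same combined correlation matrix:
# features most (positively) correlated with each target
not_features = ['views', 'likes', 'dislikes', 'likes_ratio']
print(cor['views'].drop(not_features).sort_values(ascending=False).head(10))
print(cor['likes_ratio'].drop(not_features).sort_values(ascending=False).head(10))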
# small
clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt']
target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
'views_category']
cols = ['category_id', 'comment_count', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'tags_cnt',
'likes_ratio','views'
]
pd_total =pd.concat([clean_df, target_df], axis=1)
small_df = pd_total[cols]
cor = small_df.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="ticks", color_codes=True)
cols = [
'category_id', 'comment_count', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'tags_cnt',
'likes_ratio','views'
]
pd_total =pd.concat([clean_df, target_df], axis=1)
small_df = pd_total[cols]
g = sns.pairplot(small_df) # the kind="reg" parameter adds a regression line
plt.show()
# Clustering with a sample of 3000
from mpl_toolkits.mplot3d import Axes3D
def cluster_data(data, sample_size, seed, cluster_size):
plt.figure(figsize=(15, 10))
data_sample = data.sample(sample_size, random_state=seed)
normalized = data_sample / data_sample.max()
dend = shc.dendrogram(shc.linkage(normalized, method='ward'))
plt.show()
cluster = AgglomerativeClustering(n_clusters=cluster_size, affinity='euclidean', linkage='ward')
cluster.fit_predict(normalized)
fig = plt.figure(figsize=(10, 7))
ax = Axes3D(fig)
ax.scatter(data_sample.values[:,0],
data_sample.values[:,1],
data_sample.values[:,2],
c=cluster.labels_,
cmap='rainbow')
attr = list(data_sample)
ax.set_xlabel(attr[0])
ax.set_ylabel(attr[1])
ax.set_zlabel(attr[2])
plt.show()
data_sample['cluster'] = cluster.labels_
for i in range(cluster_size):
c = data_sample[data_sample['cluster'] == i]
print('cluster ' + str(i) + ':')
print('cluster size: ' + str(len(c)))
for j in attr:
print('avg ' + j + ': ' + str(c[j].mean()))
print()
cluster_data(X, 3000, 15, 4)
# Clustering of unique videos
Y = dataset.copy()
Y['trending_date'] = pd.to_datetime(Y['trending_date'], format='%y.%d.%m')
Y = Y.sort_values(by='trending_date')
Y = Y.drop_duplicates(subset=['video_id'], keep='last')
Y = Y[columns]
cluster_data(Y, 5000, 17, 4)
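The number of clusters (4) was fixed by hand; a hedged sketch of how a few values could be compared with the silhouette score, reusing the same sampling and normalization as cluster_data:
from sklearn.metrics import silhouette_score

sample = X.sample(3000, random_state=15)
normalized = sample / sample.max()
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='ward').fit_predict(normalized)
    print(k, silhouette_score(normalized, labels))  # higher is better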
The same analysis is run for the datasets of all the countries.
import pandas as pd
import numpy as np
import io
FILE = 'USvideos'
# FILE = 'JPvideos'
# FILE = 'MXvideos'
path_to_file = FILE+'.xlsx'
dataset = pd.read_excel(path_to_file)
print(dataset.columns)
print(dataset.shape) #(40949, 16)
clean_columns = [
'category_id',
#'comment_count',
#'comments_disabled',
#'ratings_disabled',
#'views',
]
dataset = dataset[dataset['ratings_disabled']==False]
dataset = dataset[dataset['likes'].notnull()]
dataset = dataset[dataset['dislikes'].notnull()]
clean_df = dataset[clean_columns]
print(clean_df.shape) #(40780, 1) after filtering
clean_df.head()
clean_df is cleaned and features are added to it.
import re
from urllib.parse import urlparse
url_re = r"(http|https|ftp|ftps)://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?"
def delete_urls(text):
return re.sub(url_re, "", text)
# returns a list with the urls found in a text
def urls_in_text(text):
return [m.group(0) for m in re.finditer(url_re, text)]
# ['http://www.cwi.nl:80/%7Eguido/Python.html',] >> ['www.cwi.nl:80',]
# only works when the urls start with http:// , https://
def urls2netloc(urls):
res = []
for u in urls:
res.append(urlparse(u).netloc) #print(u,urlparse(u))
return res
# replaces the urls in a text, keeping only their netloc
def replace_urls(text):
urls = urls_in_text(text)
netlocs = urls2netloc(urls)
# print(urls) # print(netlocs)
for u, n in zip(urls, netlocs):
text = text.replace(u, n)
return text
# example = '''
# blabllablbl http://www.dasd.com/asd \n
# blab llablblblabllablbl blabllablbl https://xdasd.net/asdasdasds
# https://www.youtub.com/123123
# '''
# replace_urls(example)
The following functions return the cleaned string.
def clean_description(de):
clean_str = de.replace('\n', ' \n ')
clean_str = replace_urls(clean_str)
return clean_str
def clean_title(ti):
return ti
# the format is changed so that it works with the BoW
# string uno"|"str2"|"... >> string_uno str2 ...
def clean_tags(ta):
clean_str = ta.replace(' ', '_')
clean_str = clean_str.replace('"|"', ' ')
return clean_str
def urls_count(text):
return len(urls_in_text(text))
def get_ratio(a, b):
if b == 0 :
return -1
return a/b
def uppercase_count(text):
return sum(1 for c in text if c.isupper())
def spaces_count(text):
return sum(1 for c in text if c.isspace())
def numbers_count(text):
return sum(1 for c in text if c.isdigit())
def words_count(text):
return sum(1 for c in text if c.isalpha())
cleaned = {
'desc': [],
'title': [],
'tags': [],
}
new = {
'desc': {
'url_cnt': [],
'question_cnt': [],
'exclamation_cnt': [],
'spaces_cnt': [],
'numbers_cnt': [],
'words_cnt': [],
'uppercase_ratio': [],
'len': [],
},
'title': {
'question_cnt': [],
'exclamation_cnt': [],
'spaces_cnt': [],
'numbers_cnt': [],
'words_cnt': [],
#'uppercase_cnt': [],
'uppercase_ratio': [],
'len': [],
},
'tags': {
'cnt': [],
},
}
iterator = zip(
dataset['description'],
dataset['title'],
dataset['tags'])
for de, ti, ta in iterator:
if type(de) is float:
de = ""
if type(ti) is float:
ti = ""
if type(ta) is float:
ta = ""
cleaned_de = clean_description(de)
cleaned_ti = clean_title(ti)
cleaned_ta = clean_tags(ta)
cleaned['desc'].append(cleaned_de)
cleaned['title'].append(cleaned_ti)
cleaned['tags'].append(cleaned_ta)
new['desc']['url_cnt'].append(urls_count(de))
new['desc']['question_cnt'].append(cleaned_de.count('?'))
new['desc']['exclamation_cnt'].append(cleaned_de.count('!'))
new['desc']['spaces_cnt'].append(spaces_count(cleaned_de))
new['desc']['numbers_cnt'].append(numbers_count(cleaned_de))
new['desc']['words_cnt'].append(words_count(cleaned_de))
new['desc']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_de),
len(cleaned_de)))
new['desc']['len'].append(len(de))
new['title']['question_cnt'].append(cleaned_ti.count('?'))
new['title']['exclamation_cnt'].append(cleaned_ti.count('!'))
new['title']['spaces_cnt'].append(spaces_count(cleaned_ti))
new['title']['numbers_cnt'].append(numbers_count(cleaned_ti))
new['title']['words_cnt'].append(words_count(cleaned_ti))
#new['title']['uppercase_cnt'].append(uppercase_count(cleaned_ti))
new['title']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_ti),
len(cleaned_ti)))
new['title']['len'].append(len(cleaned_ti))
new['tags']['cnt'].append(cleaned_ta.count(' '))
'''
clean_df['description'] = cleaned['desc']
clean_df['title'] = cleaned['title']
clean_df['tags'] = cleaned['tags']
'''
clean_df['desc_url_cnt'] = new['desc']['url_cnt']
clean_df['desc_question_cnt'] = new['desc']['question_cnt']
clean_df['desc_exclamation_cnt'] = new['desc']['exclamation_cnt']
clean_df['desc_spaces_cnt'] = new['desc']['spaces_cnt']
clean_df['desc_numbers_cnt'] = new['desc']['numbers_cnt']
clean_df['desc_words_cnt'] = new['desc']['words_cnt']
clean_df['desc_uppercase_ratio'] = new['desc']['uppercase_ratio']
clean_df['desc_len'] = new['desc']['len']
clean_df['title_question_cnt'] = new['title']['question_cnt']
clean_df['title_exclamation_cnt'] = new['title']['exclamation_cnt']
clean_df['title_spaces_cnt'] = new['title']['spaces_cnt']
clean_df['title_numbers_cnt'] = new['title']['numbers_cnt']
clean_df['title_words_cnt'] = new['title']['words_cnt']
#clean_df['title_uppercase_cnt'] = new['title']['uppercase_cnt']
clean_df['title_uppercase_ratio'] = new['title']['uppercase_ratio']
clean_df['title_len'] = new['title']['len']
clean_df['tags_cnt'] = new['tags']['cnt']
print(clean_df.shape) # (40780, 17)
clean_df.head()
These categories are split arbitrarily; they could be divided in a better way.
def clean_nans(df):
res = df
for c in df.columns:
cnt = len(res[c])
res = res[res[c].notnull()]
if(cnt != len(res)):
print(c,'=',cnt - len(res[c]), 'nulls')
return res
target_columns = [
'likes',
'dislikes',
'views',
]
target_df = dataset[target_columns]
#target_df['likes_ratio'] = (target_df['likes']+0)/(target_df['dislikes']+0)
likes_ratio = []
for l,d in zip(target_df['likes'], target_df['dislikes']):
likes_ratio.append(get_ratio(l,d))
target_df = target_df.assign(likes_ratio = likes_ratio)
# bins = [-np.inf, 0.5, 1.5, 15, 50.0, 100.0, 200.0, np.inf]
# labels=['<0.5','<1.5','<15','<50','<100','<200','>200']
# target_df['likes_ratio_category'] = pd.cut(target_df['likes_ratio'], bins=bins, labels=labels)
# bins = [-np.inf, 500000, 1000000, 10000000, 50000000, np.inf]
# labels=['0.5M','1M','10M','50M','>50M']
# target_df['views_category'] = pd.cut(target_df['views'], bins=bins, labels=labels)
n_class = 6
labels = [i for i in range(n_class)]
target_df['likes_ratio_category'] = pd.qcut(target_df['likes_ratio'], n_class, labels=labels)
target_df['views_category'], q = pd.qcut(target_df['views'], n_class, labels=labels, retbins=True)
print(q)
target_df.head()
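Unlike the hand-picked bins of the first notebook, qcut should yield roughly balanced classes; a quick sanity check:
# each quantile-based class should contain a similar number of videos
print(target_df['views_category'].value_counts().sort_index())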
#X = clean_nans(clean_df)
#Y = clean_nans(target_df)
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB # naive bayes
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score
def run_classifier(clf, X_train, X_test, y_train, y_test, num_tests=100):
metrics = {'f1-score': [], 'precision': [], 'recall': [], 'accuracy': []}
for _ in range(num_tests):
clf.fit(X_train, y_train) ## train with X_train and labels y_train
predictions = clf.predict(X_test) ## predict on new data (the X_test test set)
metrics['y_pred'] = predictions
metrics['y_prob'] = clf.predict_proba(X_test)[:,1]
metrics['f1-score'].append(f1_score(y_test, predictions, average='weighted'))
metrics['recall'].append(recall_score(y_test, predictions, average='weighted'))
metrics['precision'].append(precision_score(y_test, predictions, average='weighted'))
metrics['accuracy'].append(accuracy_score(y_test, predictions,))
return metrics
classifiers = [
("Base Dummy", DummyClassifier(strategy='stratified')),
("Decision Tree", DecisionTreeClassifier()),
("Gaussian Naive Bayes", GaussianNB()),
("KNN", KNeighborsClassifier(n_neighbors=5)),
]
target_classes = [
('Likes/Dislikes ratio Category', 'likes_ratio_category'),
('Views Category', 'views_category'),
]
X = clean_df
for tname, col in target_classes:
y = target_df[col]
results = {}
for cname, clf in classifiers:
scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']
cv_results = cross_validate(clf, X, y, cv = 10, scoring = scoring, return_train_score= True)
print("Cross validation, {}, {}, {}, {}, {}, ".format(
cname,
tname,
np.mean(cv_results['test_precision_macro']),
np.mean(cv_results['test_recall_macro']),
np.mean(cv_results['test_f1_macro']),
np.mean(cv_results['test_accuracy']),
)
)
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize
columns = ['likes', 'dislikes', 'views', 'comment_count']
X = dataset[columns]
# Correlation matrix
clean_df.columns == ['category_id', 'desc_url_cnt', 'desc_question_cnt',
'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
'title_len', 'tags_cnt']
target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
'views_category']
pd_total =pd.concat([clean_df, target_df], axis=1)
cor = pd_total.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
cols = [('Likes ratio', 'likes_ratio'), ('Views', 'views')]
regressors = [('Dummy', DummyRegressor()),
('Linear', LinearRegression()),
('Random forest', RandomForestRegressor(n_estimators=10))]
X = clean_df
for tname, col in cols:
for rname, model in regressors:
y = target_df[col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("{}, {}, {}, {}, {}, {},".format(
rname,
tname,
metrics.mean_absolute_error(y_test, y_pred),
metrics.mean_squared_error(y_test, y_pred),
np.sqrt(metrics.mean_squared_error(y_test, y_pred)),
model.score(X_test, y_test))
)
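For reference, the last value printed above (model.score) is the coefficient of determination R^2 for regressors; a quick sketch of the equivalent explicit call, using the variables left by the last iteration of the loop:
from sklearn.metrics import r2_score

# for regressors, model.score(X_test, y_test) is the R^2 of the predictions on X_test
print(r2_score(y_test, y_pred), model.score(X_test, y_test))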
import matplotlib.pyplot as plt
for tname, col in cols:
y = target_df[col]
order = [i for i in range(len(y))]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
model = RandomForestRegressor(n_estimators=10)
model.fit(X_train, y_train)
def getKey(i):
return y.iloc[i]
order.sort(key=getKey)
_y = []
_x = []
for i in range(len(y)):
_y.append(y.iloc[order[i]])
_x.append(X.iloc[order[i]])
plt.plot(model.predict(_x))
plt.plot(_y)
plt.show()
plt.plot(model.predict(_x[:len(_x) - 2000]))
plt.plot(_y[:len(_y) - 2000])
plt.show()