IMDB Sentiment Analysis (Lasso)

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 28, 2024

1 Set up

Display system information for reproducibility.

import IPython
print(IPython.sys_info())
{'commit_hash': '8b1204b6c',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/opt/venv/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.21.0',
 'os_name': 'posix',
 'platform': 'Linux-6.6.12-linuxkit-aarch64-with-glibc2.35',
 'sys_executable': '/opt/venv/bin/python',
 'sys_platform': 'linux',
 'sys_version': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]'}
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.34     fastmap_1.1.1     xfun_0.42         Matrix_1.6-1.1   
 [5] lattice_0.21-9    reticulate_1.35.0 knitr_1.45        htmltools_0.5.7  
 [9] png_0.1-8         rmarkdown_2.25    cli_3.6.2         grid_4.3.2       
[13] compiler_4.3.2    rstudioapi_0.15.0 tools_4.3.2       evaluate_0.23    
[17] Rcpp_1.0.12       yaml_2.3.8        rlang_1.1.3       jsonlite_1.8.8   
[21] htmlwidgets_1.6.4

Load libraries.

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

# Load Tensorflow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
library(keras)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5      ✔ rsample      1.2.0 
✔ dials        1.2.1      ✔ tune         1.1.2 
✔ infer        1.0.6      ✔ workflows    1.1.4 
✔ modeldata    1.3.0      ✔ workflowsets 1.0.1 
✔ parsnip      1.2.0      ✔ yardstick    1.3.0 
✔ recipes      1.0.10     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()        masks purrr::discard()
✖ dplyr::filter()          masks stats::filter()
✖ recipes::fixed()         masks stringr::fixed()
✖ yardstick::get_weights() masks keras::get_weights()
✖ dplyr::lag()             masks stats::lag()
✖ yardstick::spec()        masks readr::spec()
✖ recipes::step()          masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-8
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var

2 Prepare data

From documentation:

Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.

Retrieve IMDB data. We restrict to the 10,000 most frequently-used words and tokens.

max_features = 10000

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words = max_features)

Data dimensions.

x_train.shape
(25000,)
y_train.shape
(25000,)
x_test.shape
(25000,)
y_test.shape
(25000,)
# Indices of the first 12 words of the first training document
x_train[0][0:11]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458]

To decode the reviews:

# Use the default parameters to keras.datasets.imdb.load_data
start_char = 1
oov_char = 2
index_from = 3

# Retrieve the word index file mapping words to indices
word_index = keras.datasets.imdb.get_word_index()

# Reverse the word index to obtain a dict mapping indices to words
# And add `index_from` to indices to sync with `x_train`
inverted_word_index = dict(
    (i + index_from, word) for (word, i) in word_index.items()
)

# Update `inverted_word_index` to include `start_char` and `oov_char`
inverted_word_index[start_char] = "[START]"
inverted_word_index[oov_char] = "[OOV]"
# Decode the first sequence in the dataset
" ".join(inverted_word_index[i] for i in x_train[0])
"[START] this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert [OOV] is an amazing actor and now the same being director [OOV] father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for [OOV] and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also [OOV] to the two little boy's that played the [OOV] of norman and paul they were just brilliant children are often left out of the [OOV] list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
max_features <- 10000

imdb <- dataset_imdb(num_words = max_features)
x_train <- imdb$train$x
y_train <- imdb$train$y
x_test <- imdb$test$x
y_test <- imdb$test$y

Data dimensions.

# Training set
length(x_train)
[1] 25000
table(y_train)
y_train
    0     1 
12500 12500 
# Test set
length(x_test)
[1] 25000
table(y_test)
y_test
    0     1 
12500 12500 
# Indices of the first 12 words of the first training document
x_train[[1]][1:12]
 [1]    1   14   22   16   43  530  973 1622 1385   65  458 4468

Function for decoding the IMDB reviews:

word_index <- dataset_imdb_word_index()

decode_review <- function(text, word_index) {
  word <- names(word_index)
  idx <- unlist(word_index, use.names = FALSE)
  word <- c("<PAD>", "<START>", "<UNK>", "<UNUSED>", word)
  idx <- c(0:3, idx + 3)
  words <- word[match(text, idx, 2)]
  paste(words, collapse = " ")
}

decode_review(x_train[[1]], word_index)
[1] "<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

Create the bag of words matrices.

from scipy import sparse

def one_hot(sequences, dimension):
  seqlen = [len(sequences[i]) for i in range(len(sequences))]
  n = len(seqlen)
  rowind = np.repeat(range(n), seqlen)
  colind = np.concatenate(sequences)
  vals = np.ones(len(rowind))
  return sparse.coo_matrix((vals, (rowind, colind)), shape = (n, dimension)).tocsc()

# Train
x_train_1h = one_hot(x_train, 10000)
x_train_1h.shape
(25000, 10000)
# Sparsity of train set
x_train_1h.count_nonzero() / np.prod(x_train_1h.shape)
0.013169872
# Test
x_test_1h = one_hot(x_test, 10000)
x_test_1h.shape
(25000, 10000)
# Sparsity of test set
x_test_1h.count_nonzero() / np.prod(x_test_1h.shape)
0.012874312
library(Matrix)

one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep(1:n, seqlen)
  colind <- unlist(sequences)
  sparseMatrix(
    i = rowind,
    j = colind,
    dims = c(n, dimension)
  )
}

# Train
x_train_1h <- one_hot(x_train, 10000)
dim(x_train_1h)
[1] 25000 10000
# Proportion of nonzeros
nnzero(x_train_1h) / (25000 * 10000)
[1] 0.01316987
# Test
x_test_1h <- one_hot(x_test, 10000)
dim(x_test_1h)
[1] 25000 10000
# Proportion of nonzeros
nnzero(x_test_1h) / (25000 * 10000)
[1] 0.01287431

3 Lasso

3.1 Training Lasso

We use logistic regression model in scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logit_mod = LogisticRegression(
  penalty = 'l1',
  solver = 'liblinear',
  warm_start = True
  )

pipe = Pipeline(steps = [
  ("model", logit_mod)
])
pipe
Pipeline(steps=[('model',
                 LogisticRegression(penalty='l1', solver='liblinear',
                                    warm_start=True))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Set up tuning grid for \(C = 1/\alpha\).

C_grid = np.logspace(start = -3, stop = 3, base = 10, num = 100)

tuned_parameters = {"model__C": C_grid}

Cross-validation:

from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 10
search = GridSearchCV(
  pipe, 
  tuned_parameters, 
  cv = n_folds, 
  scoring = "roc_auc",
  # Refit the best model on the whole data set
  refit = True,
  # Adjust n_jobs according to hardware
  n_jobs = 8
  )

search.fit(x_train_1h, y_train)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model',
                                        LogisticRegression(penalty='l1',
                                                           solver='liblinear',
                                                           warm_start=True))]),
             n_jobs=8,
             param_grid={'model__C': array([1.00000000e-03, 1.14975700e-03, 1.32194115e-03, 1.51991108e-03,
       1.74752840e-03, 2.00923300e-03, 2.31012970e-03, 2.65608778e-03,
       3.05385551e-03, 3.51119173e-03, 4.03701726e-03, 4.64158883e-03,
       5.3366...
       4.03701726e+01, 4.64158883e+01, 5.33669923e+01, 6.13590727e+01,
       7.05480231e+01, 8.11130831e+01, 9.32603347e+01, 1.07226722e+02,
       1.23284674e+02, 1.41747416e+02, 1.62975083e+02, 1.87381742e+02,
       2.15443469e+02, 2.47707636e+02, 2.84803587e+02, 3.27454916e+02,
       3.76493581e+02, 4.32876128e+02, 4.97702356e+02, 5.72236766e+02,
       6.57933225e+02, 7.56463328e+02, 8.69749003e+02, 1.00000000e+03])},
             scoring='roc_auc')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
cvres <- cv.glmnet(
  x_train_1h,
  y_train,
  nfolds = 10,
  family = "binomial",
  # loss to use for CV
  type.measure = "auc",
  # pass through to glmnet
  standardize = FALSE,
)

Visualize CV result:

cvres = pd.DataFrame(
  {
  "C": np.array(search.cv_results_["param_model__C"]),
  "aucroc": search.cv_results_["mean_test_score"]
  }
)

plt.figure()
sns.relplot(
  kind = "line",
  data = cvres,
  x = "C",
  y = "aucroc"
  ).set(
    xscale = "log",
    xlabel = "C",
    ylabel = "CV AUC"
);
plt.show()

plot(cvres)

Which words have the largest effect sizes in the lasso penalized logistic regression?

search.best_estimator_['model']
LogisticRegression(C=0.26560877829466867, penalty='l1', solver='liblinear',
                   warm_start=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
betahat = search.best_estimator_['model'].coef_[0]
ix = np.argsort(abs(betahat))[::-1]

pd.DataFrame({
  'word': [inverted_word_index[i + 1 + index_from] for i in ix[range(10)]],
  'beta': betahat[ix[range(10)]]
})
           word      beta
0  entertaining -1.757320
1     developed -1.641642
2     structure  1.614960
3      surprise -1.556532
4           fun -1.513878
5          knew  1.304278
6            40  1.277720
7             v -1.268717
8      remember -1.214594
9        angles -1.204265
betahat_sorted <- sort(
  as.vector(abs(coef(cvres))), 
  decreasing = TRUE, 
  index.return = TRUE
  )

tibble(
  word = str_split(decode_review(betahat_sorted$ix[1:10], word_index), ' ')[[1]],
  beta = coef(cvres)[betahat_sorted$ix[1:10]]
)
# A tibble: 10 × 2
   word        beta
   <chr>      <dbl>
 1 cinema     -1.67
 2 girl       -1.61
 3 knows       1.40
 4 premise    -1.37
 5 loves      -1.34
 6 video      -1.26
 7 opposite   -1.24
 8 australian  1.23
 9 necessary   1.16
10 nearly     -1.11

3.2 Testing Lasso

Test AUC:

from sklearn.metrics import accuracy_score, roc_auc_score

roc_auc_score(
  y_test, 
  search.best_estimator_.predict_proba(x_test_1h)[:, 1]
  )
0.9482191232000001

Test accuracy:

accuracy_score(
  y_test,
  search.best_estimator_.predict(x_test_1h)
  )
0.8814

Test AUC:

auc(y_test, predict(cvres, x_test_1h, type = "response"))
Setting levels: control = 0, case = 1
Warning in roc.default(response, predictor, auc = TRUE, ...): Deprecated use a
matrix as predictor. Unexpected results may be produced, please pass a numeric
vector.
Setting direction: controls < cases
Area under the curve: 0.9492

Test accuracy:

sum(y_test == predict(cvres, x_test_1h, type = "class")) / length(y_test)
[1] 0.88132