The Office: Who is who?

Author

Charlotte Hadley

Published

December 16, 2019

Abstract

Training a regularised logistic regression on dialogue from The Office, following Julia Silge's tidy text classification workflow, to tell Jim's lines from Dwight's.

I’m going to basically copy Julia Silge’s work from here: https://juliasilge.com/blog/tidy-text-classification/

Code
library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library("tidytext")
library("schrute")

I’d like to tidy up the data a little bit:

Code
theoffice_characters <- theoffice %>%
  select(season:text) %>%
  mutate(season = as.numeric(season),
         episode = as.numeric(episode))
theoffice_characters
# A tibble: 55,130 × 7
   season episode episode_name director   writer                 character text 
    <dbl>   <dbl> <chr>        <chr>      <chr>                  <chr>     <chr>
 1      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   All …
 2      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Jim       Oh, …
 3      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   So y…
 4      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Jim       Actu…
 5      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   All …
 6      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   Yes,…
 7      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   I've…
 8      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Pam       Well…
 9      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Michael   If y…
10      1       1 Pilot        Ken Kwapis Ricky Gervais;Stephen… Pam       What?
# … with 55,120 more rows

In Julia’s article she adds a document column without really highlighting why… I think it’s actually crucial: it’s the ID that lets us match rows of the sparse matrix back to speakers later. Let’s do that:

Code
theoffice_characters <- theoffice_characters %>%
  mutate(document = row_number())

I want to see how good Jim and Dwight’s impressions of one another are; these appear in “Product Recall” (Season 3, Episode 21).

Let’s use every episode before this as our training set:

Code
theoffice_before_product_recall <- theoffice_characters %>%
  filter(season < 3 | (season == 3 & episode < 21))
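
A quick sanity check that the cut-off behaves as intended (not in the original post):

Code
theoffice_before_product_recall %>%
  group_by(season) %>%
  summarise(last_episode = max(episode))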

For the moment we only care about Jim and Dwight, so let’s take the training set, extract the unigrams for each speaker using unnest_tokens(), and throw away the stop words:

Code
jim_dwight_unigrams_before_product_recall <- theoffice_before_product_recall %>%
  filter(character %in% c("Dwight", "Jim")) %>%
  select(character, text, document) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords())
Joining, by = "word"
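
If unnest_tokens() is unfamiliar: it takes a column of text and returns one lower-cased, punctuation-stripped token per row. A toy example:

Code
tibble(text = "Bears. Beets. Battlestar Galactica.") %>%
  unnest_tokens(word, text)
# four rows: bears, beets, battlestar, galactica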

I’m going to duplicate one of Julia’s charts so we can compare the most common words used by Dwight and Jim.

Code
jim_dwight_unigrams_before_product_recall %>%
  count(character, word, sort = TRUE) %>%
  group_by(character) %>%
  top_n(20) %>%
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, character), n,
    fill = character
  )) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~character, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = NULL, y = "Word count",
    title = "Most frequent words after removing stop words",
    subtitle = "Jim and Dwight's most frequent words are strikingly similar"
  )
Selecting by n

These lists are very similar to one another. That’s because the stop words in the {tidytext} package are collated from written prose, not from spoken dialogue.
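
We can check: the verbal tics that dominate dialogue don’t appear in the default snowball list, so a filter like this should come back empty (a quick check, not in the original post):

Code
get_stopwords() %>%
  filter(word %in% c("yeah", "um", "uh", "okay"))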

We can look at bigrams instead, to hunt for the filler words that the standard stop word lists miss:

Code
theoffice_characters %>%
  filter(character %in% c("Dwight", "Jim")) %>%
  select(character, text, document) %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 2) %>%
  separate(ngrams, into = c("word_1", "word_2"), sep = " ") %>%
  filter(!word_1 %in% stop_words$word,
         !word_2 %in% stop_words$word) %>%
  count(character, word_1, word_2, sort = TRUE)
# A tibble: 8,636 × 4
   character word_1    word_2       n
   <chr>     <chr>     <chr>    <int>
 1 Jim       <NA>      <NA>       936
 2 Dwight    <NA>      <NA>       737
 3 Dwight    dwight    schrute     50
 4 Dwight    hey       hey         42
 5 Dwight    ha        ha          37
 6 Dwight    dunder    mifflin     36
 7 Dwight    regional  manager     33
 8 Dwight    wait      wait        24
 9 Dwight    assistant regional    20
10 Dwight    la        la          17
# … with 8,626 more rows
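
The NA rows at the top are lines too short to form a bigram, which unnest_tokens() returns as NA; they could be dropped before separating (a small tweak, not in the original):

Code
theoffice_characters %>%
  filter(character %in% c("Dwight", "Jim")) %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngrams)) %>% # drop lines with fewer than two words
  separate(ngrams, into = c("word_1", "word_2"), sep = " ")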

The top of the table is dominated by repetition (“hey hey”, “wait wait”) and verbal tics rather than anything distinctive, so we need to create ourselves a set of filler words:

Code
filler_words <- tibble(
  word = c("yeah", "oh", "well", "like", "uh", "okay", "just", "um")
)

Now we can rebuild our unigrams with the filler words removed as well:

Code
jim_dwight_unigrams_before_product_recall <- theoffice_before_product_recall %>%
  filter(character %in% c("Dwight", "Jim")) %>%
  select(character, text, document) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  anti_join(filler_words)
Joining, by = "word"
Joining, by = "word"
Code
jim_dwight_unigrams_before_product_recall %>%
  count(character, word, sort = TRUE) %>%
  group_by(character) %>%
  top_n(20) %>%
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, character), n,
    fill = character
  )) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~character, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = NULL, y = "Word count",
    title = "Most frequent words after removing stop and filler words",
    subtitle = "Stop words and our custom filler words removed"
  )
Selecting by n

Building our model

The {rsample} package is designed for splitting data into training and test sets.

Code
library("rsample")

But we’re going to be unfair and split our data as follows:

  • Training: the episodes before the impressions
  • Test: the episode of the impressions
Code
tidy_training_data <- theoffice_before_product_recall %>%
  select(character, text, document) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  anti_join(filler_words)
Joining, by = "word"
Joining, by = "word"
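
The test set is the Product Recall episode itself. That step never appears in the post, but it would look something like this (tidy_test_data is my name for it, not the author’s):

Code
theoffice_product_recall <- theoffice_characters %>%
  filter(season == 3, episode == 21)

tidy_test_data <- theoffice_product_recall %>%
  select(character, text, document) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  anti_join(filler_words)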

Transform into a sparse matrix:

Code
theoffice_sparse_words <- tidy_training_data %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

Now we build a tibble to store our response variable, with rows in the same order as the sparse matrix:

Code
document_rownames <- as.integer(rownames(theoffice_sparse_words))

theoffice_rejoined_product_recall <- tibble(document = document_rownames) %>%
  left_join(theoffice_before_product_recall %>%
              select(document, character))
Joining, by = "document"
Code
dim(theoffice_rejoined_product_recall)
[1] 10850     2
Code
dim(theoffice_sparse_words)
[1] 10850  7932
Code
library("glmnet")
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-4
Code
library("doMC")
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators
Loading required package: parallel
Code
registerDoMC(cores = 8)

is_jim <- theoffice_rejoined_product_recall$character == "Jim"
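# lasso logistic regression with cross-validated lambda: which lines sound like Jim?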
model <- cv.glmnet(theoffice_sparse_words, is_jim,
  family = "binomial",
  parallel = TRUE, keep = TRUE
)
Code
library(broom)

coefs <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)
Code
coefs %>%
  group_by(estimate > 0) %>%
  top_n(10, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL, title = "Words that most push a line towards Jim (or Dwight)")
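
The post stops before the payoff, but Julia’s workflow would finish by scoring each line of Product Recall against the fitted coefficients. A minimal sketch, assuming the tidy_test_data tokens sketched earlier:

Code
intercept <- coefs %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

classifications <- tidy_test_data %>%
  inner_join(coefs, by = c("word" = "term")) %>%
  group_by(document) %>%
  summarise(score = sum(estimate)) %>%
  mutate(probability = plogis(intercept + score))

A line from Jim’s Dwight impression that comes out with a low probability of “Jim” would mean the impression landed.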


Citation

BibTeX citation:
@online{hadley2019,
  author = {Charlotte Hadley},
  title = {The {Office:} {Who} Is Who?},
  date = {2019-12-16},
  url = {https://visibledata.co.uk/posts/2019-12-16_the-office-who-is-who},
  langid = {en},
  abstract = {Training a regularised logistic regression on dialogue from The Office, following Julia Silge's tidy text classification workflow, to tell Jim's lines from Dwight's.}
}
For attribution, please cite this work as:
Charlotte Hadley. 2019. “The Office: Who Is Who?” December 16, 2019. https://visibledata.co.uk/posts/2019-12-16_the-office-who-is-who.