Ted Talk Recommender (Part 1): Cleaning Text with NLTK

This post is about the Ted Talk recommender I made using Natural Language Processing (NLP) and topic modeling. The recommender lets you enter keywords from the title of a talk and returns the URLs of 5 Ted Talks that are similar to yours. I'm planning to host my little Flask app soon so that you too can use the Ted Talk Recommender! The GitHub repository is here.

The data for this project consisted of transcripts from Ted and TedX talks. Thanks to Rounak Banik and his web scraping, I was able to obtain 2467 transcripts from 355 different Ted and TedX events from 2001-2017. I downloaded this corpus from Kaggle, along with metadata about every talk. Here's a sample of the transcript from the most popular (highest-viewed) Ted Talk, 'Do Schools Kill Creativity?' by Sir Ken Robinson.

    Good morning. How are you?(Laughter)It's been great, hasn't it? 
    I've been blown away by the whole thing. In fact, I'm leaving.
    (Laughter)There have been three themes running through the conference 
    which are relevant to what I want to talk about. One is the extraordinary 
    evidence of human creativity
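
If you want to follow along, loading the corpus looks roughly like this (a minimal sketch; ted_main.csv and transcripts.csv are the file names used in the Kaggle dataset, the 'transcript' column name is assumed, and docs is my name for the list of transcript strings):

    import pandas as pd

    # metadata about each talk (title, views, etc.) and the raw transcripts
    metadata = pd.read_csv('ted_main.csv')
    transcripts = pd.read_csv('transcripts.csv')

    # one string per talk
    docs = transcripts['transcript'].tolist()
    print(len(docs))   # 2467 transcripts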

The first thing I saw when looking at these transcripts was that there were a lot of parentheticals for various non-speech sounds, for example (Laughter), (Applause), or (Music). There were even some cute little musical notes when the lyrics of a performance were transcribed:

    someone like him ♫♫ He was tall and strong 

I decided that I wanted to look only at the words the speaker actually said, and to remove the words in parentheses. That said, it would be interesting to collect these non-speech events and keep a count of them in the main matrix; things like laughter, applause, or multimedia (present/not present) could be useful for making recommendations or calculating the popularity of a talk.
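
For example, a quick way to tally those events before stripping them out might look like this (just a sketch, not part of the final pipeline):

    import re
    from collections import Counter

    # count every parenthetical tag, e.g. (Laughter), (Applause), (Music)
    event_counts = Counter()
    for text in docs:
        event_counts.update(tag.strip().lower() for tag in re.findall(r'\(([^)]*)\)', text))

    print(event_counts.most_common(5))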

Lucky for me, all of the parentheses contained these non-speech sounds, and any of the speaker's words that needed a parenthetical were put in brackets instead, so I could remove the parentheses with a simple regular expression. Thank you, Ted transcribers, for making my life a little easier!!!

    clean_parens = re.sub(r'\([^)]*\)', ' ', text)   # drop anything inside parentheses

Cleaning Text with NLTK

Four important steps for cleaning the text and getting it into a format that we can analyze:

1. Tokenize
2. Remove stop words/punctuation
3. Lemmatize
4. Vectorize

NLTK (Natural Language ToolKit) is a Python library for NLP. I found it very easy to use and highly effective.

  • Tokenize - This is the process of splitting up each document (talk) into words. There are a few tokenizers in NLTK, and wordpunct_tokenize was my favorite because it splits off the punctuation as separate tokens as well.

    from nltk.tokenize import wordpunct_tokenize

    # tokenize each talk, then peek at the first few tokens of one talk
    doc_words2 = [wordpunct_tokenize(doc) for doc in docs]
    print('\n'.join(doc_words2[1][:7]))
    
      OUTPUT: 
      Good 
      morning 
      . 
      How 
      are 
      you 
      ?
  • Remove stop words/punctuation - The musical notes (♫) were easy to remove by adding them to my stop words. Stop words are the words that don't give us much information (e.g., the, and, it, she, as); we want to remove these from our text, along with the punctuation.

    We can do this by importing NLTK's list of stop words and then adding to it. It took many iterations of cleaning to figure out which words to add to my stop words; I ended up adding a lot of words and stray bits that weren't getting picked up, but this is a sample of my list.

    from nltk.corpus import stopwords

    # start from NLTK's English stop words, then add punctuation and filler words
    stop = stopwords.words('english')
    stop += ['.', "'", 'ok', 'okay', 'yeah', 'ya', 'stuff', '?']
  • Lemmatize - In this step, we reduce each word to its root form. I chose the lemmatizer over the stemmer because it is more conservative and changes the ending to the appropriate one (e.g., children --> child, capacities --> capacity). This comes at the expense of missing a few obvious ones (starting, unpredictability); there's a quick lemmatizer-versus-stemmer comparison after the code below.

from nltk.stem import WordNetLemmatizer

lemmizer = WordNetLemmatizer()
clean_words = []

# loop over the tokens of one talk
for word in doc_words2[1]:

    # remove stop words
    if word.lower() not in stop:
        low_word = lemmizer.lemmatize(word)

        # another shot at removing stop words after lemmatizing
        if low_word.lower() not in stop:
            clean_words.append(low_word.lower())
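
To show why I preferred the lemmatizer, here's a quick side-by-side with NLTK's Porter stemmer (just for comparison; it's not part of the cleaning pipeline):

from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# the lemmatizer keeps real words (children -> child, capacities -> capacity)
# but leaves words like 'starting' alone; the stemmer chops endings more aggressively
for w in ['children', 'capacities', 'starting', 'unpredictability']:
    print(w, '->', lemmizer.lemmatize(w), '|', stemmer.stem(w))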

Now we have squeaky clean text! Here's the same excerpt that I showed you at the top of the post.

    good morning great blown away whole thing fact leaving three
    theme running conference relevant want talk one extraordinary
    evidence human creativity

As you can see, it no longer makes a ton of sense, but it will still be very informative once we process these words over the whole corpus of talks.
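
The later snippets work on cleaned_talks, which is just this same cleaning applied to every transcript. Here's a rough sketch of how that could look (clean_talk is my name for the helper; it reuses the stop list and lemmatizer from above):

import re
from nltk.tokenize import wordpunct_tokenize

def clean_talk(text):
    # strip parentheticals like (Laughter), then tokenize, drop stop words, and lemmatize
    text = re.sub(r'\([^)]*\)', ' ', text)
    clean_words = []
    for word in wordpunct_tokenize(text):
        if word.lower() not in stop:
            low_word = lemmizer.lemmatize(word)
            if low_word.lower() not in stop:
                clean_words.append(low_word.lower())
    return ' '.join(clean_words)

cleaned_talks = [clean_talk(doc) for doc in docs]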

Let's look at some of the n-grams. These are just pairs (or triplets) of words that show up together. They tell us something about our corpus, but also guide us in our next step of vectorization. Here's what we get for the top 30 most frequent bi-grams.

from collections import Counter

from nltk.util import ngrams
from textblob import TextBlob

# count the most common word pairs across all of the cleaned talks
counter = Counter()

n = 2
for doc in cleaned_talks:
    words = TextBlob(doc).words
    bigrams = ngrams(words, n)
    counter += Counter(bigrams)

for phrase, count in counter.most_common(30):
    print('%20s %i' % (" ".join(phrase), count))

OUTPUT:
year ago 2074
little bit 1607
year old 1365
united states 1103
one thing 1041
around world 938
new york 894
can not 877
first time 751
every day 692
many people 656
last year 604
every single 573
one day 559
10 year 541

The tri-grams were not very informative or useful aside from "new york city" and "000 year ago", which already get picked up in the bi-grams.

OUTPUT:
new york city 236
000 year ago 135
new york times 123
10 year ago 118
every single day 109
million year ago 109
people around world 101
two year ago 100
world war ii 99
one two three 97
couple year ago 96
20 year ago 83
five year old 78
talk little bit 71
spend lot time 71
every single one 69
three year ago 69
six year old 69
sub saharan africa 68
  • Vectorize - This is the important step of turning our words into numbers. The method that gave me the best results was scikit-learn's CountVectorizer. It counts the number of times each word appears in each document, so you end up with each word (and n-gram) as a column and each document (talk) as a row, and each cell holds the frequency of that word in that document. As you can imagine, there will be a large number of zeros in this matrix; we call this a sparse matrix.
    from sklearn.feature_extraction.text import CountVectorizer

    c_vectorizer = CountVectorizer(ngram_range=(1, 3),
                                   stop_words='english',
                                   max_df=0.6,
                                   max_features=10000)

    # call `fit` to build the vocabulary
    c_vectorizer.fit(cleaned_talks)

    # finally, call `transform` to convert text to a bag of words
    c_x = c_vectorizer.transform(cleaned_talks)
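
As a quick sanity check, we can look at the shape of the resulting matrix and a few of the learned terms (the exact output will depend on your corpus and parameters):

    # one row per talk, one column per word or n-gram, stored as a scipy sparse matrix
    print(c_x.shape)
    print(type(c_x))

    # a few of the learned vocabulary terms
    # (on scikit-learn versions before 1.0, use get_feature_names() instead)
    print(c_vectorizer.get_feature_names_out()[:10])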

Now we are ready for topic modeling! Stay tuned for next week's post on Latent Dirichlet Allocation and visualization with t-SNE!