This post is about the TED Talk recommender I built using natural language processing (NLP) and topic modeling. The recommender lets you enter keywords from the title of a talk and returns the URLs of 5 TED Talks that are similar to it. I'm planning to host my little Flask app soon so that you too can use the TED Talk Recommender! The GitHub repository is here.
The data for this project consisted of transcripts from TED and TEDx talks. Thanks to Rounak Banik and his web scraping, I was able to obtain 2467 transcripts from 355 different TED and TEDx events from 2001-2017. I downloaded this corpus from Kaggle, along with metadata about every talk. Here's a sample of the transcript from the most popular (most viewed) TED Talk, 'Do Schools Kill Creativity?' by Sir Ken Robinson.
Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving. (Laughter)There have been three themes running through the conference which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity
The first thing I noticed when looking at these transcripts was that there were a lot of parentheticals for various non-speech sounds, for example (Laughter), (applause), or (Music). There were even some cute little notes when the lyrics of a performance were transcribed:
someone like him ♫♫ He was tall and strong
I decided to look only at the words the speaker said and to remove the words in parentheses. That said, it would be interesting to collect these non-speech events and keep a count of them in the main matrix, especially for things like laughter, applause, or multimedia (present/not present), for use in making recommendations or calculating the popularity of a talk.
Lucky for me, all of the parentheses contained these non-speech sounds, and any of the speaker's words that needed parentheses were in brackets, so I just removed the parentheticals with a simple regular expression. Thank you, TED transcribers, for making my life a little easier!!!
import re
clean_parens = re.sub(r'\([^)]*\)', ' ', text)
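To see what this pattern does, here is a minimal sketch on a made-up snippet. The sample string and the `strip_parentheticals` helper are my own illustration and not part of the project code, and the whitespace cleanup at the end is an assumed extra step.

```python
import re

def strip_parentheticals(text):
    """Remove (Laughter)-style notes, then collapse leftover whitespace."""
    # Same pattern as above: an opening paren, anything except ')', a closing paren.
    no_parens = re.sub(r'\([^)]*\)', ' ', text)
    # Assumed extra step: squeeze runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', no_parens).strip()

sample = "How are you?(Laughter)It's been great, hasn't it? [Bracketed words] stay."
cleaned = strip_parentheticals(sample)
print(cleaned)
```

Replacing each match with a space rather than an empty string matters here: the transcripts often have no space around the parenthetical, so deleting it outright would glue "you?" to "It's".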
Cleaning Text with NLTK
Four important steps for cleaning the text and getting it into a format that we can analyze:
1. Tokenize
2. Lemmatize
3. Remove stop words/punctuation
4. Vectorize
NLTK (Natural Language Toolkit) is a Python library for NLP. I found it very easy to use and highly effective.
Tokenize - This is the process of splitting up the document (talk) into words. There are a few tokenizers in NLTK, and wordpunct_tokenize was my favorite because it splits off the punctuation as separate tokens.
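from nltk.tokenize import wordpunct_tokenize

# tokenize every talk; docs maps each fileid to its transcript text
doc_words2 = [wordpunct_tokenize(docs[fileid]) for fileid in fileids]
# print the first few tokens of one talk (here, whichever talk comes first)
print(' '.join(doc_words2[0][:7]))

OUTPUT:
Good morning . How are you ?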
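If you are curious what this tokenizer is doing under the hood, NLTK's WordPunctTokenizer is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. Here is a rough standard-library sketch of the same idea; `simple_wordpunct` is a name I made up for illustration, and the sample sentence is invented.

```python
import re

# Pattern behind NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORDPUNCT.findall(text)

tokens = simple_wordpunct("Good morning. How are you? It's been great.")
print(tokens)
```

Notice that a contraction like "It's" is split into three tokens (the word, the apostrophe, and the trailing "s"), which is why stray apostrophes and single letters turn up as candidates for the stop word list.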
The music notes were easy to remove by adding them to my stop words. Stop words are the words that don't give us much information (e.g., the, and, it, she, as), along with the punctuation. We want to remove these from our text, too.
We can do this by importing NLTK's list of stop words and then adding to it. I went through many iterations of cleaning to figure out which words to add. I added a lot of words and little things that weren't getting picked up; this is just a sample of my list.
from nltk.corpus import stopwords

stop = stopwords.words('english')
stop += ['.', " \'", 'ok', 'okay', 'yeah', 'ya', 'stuff', '?']
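Here is a small sketch of how the filtering itself works, using a tiny hand-written stop list so it runs without downloading anything. The word lists and tokens are invented for illustration; NLTK's English list is much longer. Converting the list to a set is my own suggestion, not something the project code does.

```python
# Tiny stand-in for NLTK's English stop word list (illustration only).
base_stop = ['the', 'and', 'it', 'she', 'as', 'how', 'are', 'you']
# Custom additions: punctuation, filler words, and the music note symbol.
custom_stop = ['.', '?', "'", 'ok', 'okay', 'yeah', 'ya', 'stuff', '♫']

# A set makes each membership check constant time instead of a list scan.
stop_set = set(base_stop + custom_stop)

tokens = ['Good', 'morning', '.', 'How', 'are', 'you', '?', '♫', 'Okay', 'stuff']
# Compare in lowercase so 'How' and 'Okay' match their lowercase entries.
kept = [tok for tok in tokens if tok.lower() not in stop_set]
print(kept)
```

With thousands of transcripts and a stop list that keeps growing, the set lookup can make a noticeable difference in cleaning time.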
Lemmatize - In this step, we reduce each word to its root form. I chose the lemmatizer over the stemmer because it is more conservative and changes the ending to the appropriate one (e.g., children-->child, capacities-->capacity). This comes at the expense of missing a few obvious ones (starting, unpredictability).
from nltk.stem import WordNetLemmatizer

lemmizer = WordNetLemmatizer()
clean_words = []
# docwords2 holds the tokens of a single talk
for word in docwords2:
    # remove stop words
    if word.lower() not in stop:
        low_word = lemmizer.lemmatize(word)
        # another shot at removing stop words
        if low_word.lower() not in stop:
            clean_words.append(low_word.lower())
Now we have squeaky clean text! Here's the same excerpt that I showed you at the top of the post.
good morning great blown away whole thing fact leaving three theme running conference relevant want talk one extraordinary evidence human creativity
As you can see, it no longer makes a ton of sense, but it will still be very informative once we process these words over the whole corpus of talks.
Let's look at some of the n-grams. These are just pairs (or triplets) of words that show up together. They tell us something about our corpus, and also guide our next step, vectorization. Here's what we get for the top 30 most frequent bi-grams.
from collections import Counter
from nltk.util import ngrams    # assumed source of ngrams()
from textblob import TextBlob   # assumed source of TextBlob

counter = Counter()
n = 2
for doc in cleaned_talks:
    words = TextBlob(doc).words
    bigrams = ngrams(words, n)
    counter += Counter(bigrams)

for phrase, count in counter.most_common(30):
    print('%20s %i' % (" ".join(phrase), count))

OUTPUT:
year ago 2074
little bit 1607
year old 1365
united states 1103
one thing 1041
around world 938
new york 894
can not 877
first time 751
every day 692
many people 656
last year 604
every single 573
one day 559
10 year 541
The tri-grams were not very informative or useful aside from "new york city" and "000 year ago", which get picked up in the bi-grams (as "new york" and "year ago").
OUTPUT:
new york city 236
000 year ago 135
new york times 123
10 year ago 118
every single day 109
million year ago 109
people around world 101
two year ago 100
world war ii 99
one two three 97
couple year ago 96
20 year ago 83
five year old 78
talk little bit 71
spend lot time 71
every single one 69
three year ago 69
six year old 69
sub saharan africa 68
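If you want to try n-gram counting without TextBlob or NLTK, the same idea can be sketched with the standard library alone. The `count_ngrams` helper and the two toy "talks" below are made up for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def count_ngrams(docs, n):
    """Count n-grams across whitespace-tokenized documents."""
    counter = Counter()
    for doc in docs:
        words = doc.split()
        # Zipping n staggered views of the word list yields consecutive n-tuples.
        counter.update(zip(*(words[i:] for i in range(n))))
    return counter

toy_talks = [
    "two year ago new york city",
    "new york times two year ago",
]
bigrams = count_ngrams(toy_talks, 2)
trigrams = count_ngrams(toy_talks, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(1))
```

Even on this toy input you can see the overlap described above: every frequent tri-gram contains a frequent bi-gram, so the tri-grams add little new information.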
Vectorize - This is the important step of turning our words into numbers. The method that gave me the best results was scikit-learn's CountVectorizer. It counts the number of times each word appears in each document. You end up with a matrix in which each column is a word (or n-gram) and each row is a document (talk), so each entry is the frequency of that word in that document. As you can imagine, there will be a large number of zeros in this matrix; we call this a sparse matrix.
from sklearn.feature_extraction.text import CountVectorizer

c_vectorizer = CountVectorizer(ngram_range=(1,3),
                               stop_words='english',
                               max_df=0.6,
                               max_features=10000)
# call `fit` to build the vocabulary
c_vectorizer.fit(cleaned_talks)
# finally, call `transform` to convert text to a bag of words
c_x = c_vectorizer.transform(cleaned_talks)
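To make the "rows are talks, columns are words" picture concrete, here is a hand-rolled sketch of a document-term matrix built with the standard library. It is a simplified stand-in, not how CountVectorizer is implemented: it handles single words only, ignores options like max_df, and builds a dense list of lists. The three toy "talks" are invented.

```python
from collections import Counter

toy_talks = [
    "human creativity school creativity",
    "brain science human brain",
    "school education child",
]

# One Counter of word counts per talk.
counts = [Counter(doc.split()) for doc in toy_talks]

# Vocabulary: every distinct word, sorted; each word becomes a column.
vocab = sorted(set(word for c in counts for word in c))

# Dense document-term matrix: rows are talks, columns are words.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)

zeros = sum(value == 0 for row in matrix for value in row)
print(f"{zeros} of {len(matrix) * len(vocab)} cells are zero")
```

Even with three tiny documents, more than half of the cells are zero. With thousands of talks and 10,000 features the proportion is far higher, which is why the real vectorizer returns a SciPy sparse matrix that stores only the nonzero entries.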
Now we are ready for topic modeling! Stay tuned for next week's post on Latent Dirichlet Allocation and visualization with t-SNE!