Above is a sample of the transcript from one of the most popular (most-viewed) Ted Talks, 'Do Schools Kill Creativity?' by Sir Ken Robinson. I made a Ted Talk recommender using Natural Language Processing (NLP) and topic modeling. The recommender lets you enter keywords from the title of a talk; it then finds that talk and returns the URLs of 5 Ted Talks that are similar to it. This post covers cleaning, Part 2 covers topic modeling, and Part 3 covers the recommender and app. The code is in my GitHub repository.
The data for this project consisted of transcripts from Ted and TedX talks. Thanks to Rounak Banik and his web scraping I was able to obtain 2467 transcripts from 355 different Ted and TedX events from 2001-2017. I downloaded this corpus from Kaggle, along with metadata about every talk.
The first thing I saw when looking at these transcripts was that there were a lot of parentheticals for various non-speech sounds, for example (Laughter) or (applause) or (Music). There were even some cute little notes when the lyrics of a performance were transcribed:
someone like him ♫♫ He was tall and strong
I decided to look only at the words that the speaker said and to remove these parenthetical notes. That said, it would be interesting to collect these non-speech events and keep a count of them in the main matrix, especially for things like laughter, applause, or multimedia (present/not present), when making recommendations or calculating the popularity of a talk.
Lucky for me, all of the parentheses contained these non-speech sounds, and any of the speaker's words that needed parentheses were in brackets instead, so I just removed the parentheticals with a simple regular expression. Thank you, Ted transcribers, for making my life a little easier!
clean_parens = re.sub(r'\([^)]*\)', ' ', text)
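As a minimal sketch of that step, the same pattern can be wrapped in a small function. The sample string and the helper name `remove_parentheticals` are made up for illustration and are not taken from the project code; the whitespace cleanup is an extra step added here.

```python
import re

# Matches an opening parenthesis, any run of characters other than ")",
# and the closing parenthesis. Nested parentheses are not handled.
PAREN_PATTERN = re.compile(r'\([^)]*\)')

def remove_parentheticals(text):
    """Replace parenthesized notes such as (Laughter) with a space, then tidy whitespace."""
    without_parens = PAREN_PATTERN.sub(' ', text)
    # Collapsing repeated whitespace is an addition, not part of the original one-liner.
    return ' '.join(without_parens.split())

sample = "Good morning. (Laughter) I've been blown away [by] the whole thing. (Applause)"
cleaned = remove_parentheticals(sample)
print(cleaned)
```

Note that the bracketed text is left alone, which is the behavior the transcripts rely on.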
Cleaning Text with NLTK
Four important steps for cleaning the text and getting it into a format that we can analyze:
1. Tokenize
2. Remove stop words/punctuation
3. Lemmatize
4. Vectorize
NLTK (Natural Language Toolkit) is a Python library for NLP. I found it very easy to use and highly effective.
Tokenization is the process of splitting up the document (talk) into words. There are a few tokenizers in NLTK, and the one called wordpunct was my favorite because it separates the punctuation as well.
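To make the tokenizing step concrete, here is a rough standard-library sketch. It assumes the regular expression `\w+|[^\w\s]+`, which is the pattern behind NLTK's wordpunct tokenizer; with NLTK installed, `from nltk.tokenize import wordpunct_tokenize` could take the place of the hypothetical `tokenize` helper. Lowercasing is also an assumption, based on the cleaned excerpt being all lowercase.

```python
import re

# Runs of word characters, or runs of symbols (anything that is neither a
# word character nor whitespace). Assumed to mirror NLTK's wordpunct tokenizer.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Good morning. How are you? It's been great, hasn't it?")
print(tokens)
```

Because punctuation comes out as separate tokens, contractions are split at the apostrophe, and the leftover pieces can be dropped later along with the stop words.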
The music notes were easy to remove by adding them to my stopwords. Stopwords are the words that don't give us much information (e.g., the, and, it, she, as). Along with the punctuation, we want to remove these from our text, too.
We can do this by importing NLTK's list of stopwords and then adding to it. I went through many iterations of cleaning in order to figure out which words to add to my stopwords. I added a lot of words and little things that weren't getting picked up, but this is a sample of my list.
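A small sketch of stop word removal follows. The `BASE_STOPWORDS` set is a tiny stand-in invented for this example; in practice it would come from NLTK via `nltk.corpus.stopwords.words('english')`. The custom entries are guesses at the kind of additions described here, not the actual list from the project.

```python
import string

# Tiny stand-in for NLTK's English stop word list (illustrative only).
BASE_STOPWORDS = {"the", "and", "it", "she", "as", "a", "of", "i", "to", "was"}

# Hypothetical custom additions: music-note symbols and every ASCII punctuation mark.
CUSTOM_STOPWORDS = {"♫", "♫♫"} | set(string.punctuation)

STOPWORDS = BASE_STOPWORDS | CUSTOM_STOPWORDS

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the combined stop word set."""
    return [token for token in tokens if token.lower() not in STOPWORDS]

tokens = ["♫♫", "he", "was", "tall", "and", "strong", ",", "she", "said", "."]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set keeps each membership check fast, which matters when the filter runs over every token in a few thousand transcripts.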
In lemmatization, we get each word down to its root form. I chose the lemmatizer over the stemmer because it was more conservative and was able to change the ending to the appropriate one (e.g., children --> child, capacities --> capacity). This came at the expense of missing a few obvious ones (starting, unpredictability).
Now we have squeaky clean text! Here's the same excerpt that I showed you at the top of the post.
good morning great blown away whole thing fact leaving three theme running conference relevant want talk one extraordinary evidence human creativity
As you can see, it no longer makes a ton of sense, but it will still be very informative once we process these words over the whole corpus of talks.
Let's look at some of the n-grams. These are just pairs (or triplets) of words that show up together. They tell us something about our corpus, and they also guide us in our next step of vectorization. Here's what we get for the top 30 most frequent bi-grams.
The tri-grams were not very informative or useful aside from "new york city" and "0000 year ago", which get picked up in the bi-grams.
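Counting n-grams can be sketched with the standard library alone. The token lists below are toy data invented for the example, and the `ngrams` helper is hypothetical; it is not the code used to produce the counts in this post.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the consecutive n-token windows of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy stand-in for two cleaned talks (made-up tokens, not from the corpus).
talks = [
    ["new", "york", "city", "year", "ago", "new", "york"],
    ["year", "ago", "moved", "new", "york", "city"],
]

bigram_counts = Counter()
trigram_counts = Counter()
for tokens in talks:
    # Count within each talk so that no n-gram spans two different talks.
    bigram_counts.update(ngrams(tokens, 2))
    trigram_counts.update(ngrams(tokens, 3))

top_bigrams = bigram_counts.most_common(3)
print(top_bigrams)
print(trigram_counts.most_common(1))
```

Even in this toy data, the most frequent tri-gram is largely covered by its two bi-grams, which is the same pattern noted for the real corpus.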
Vectorization is the important step of turning our words into numbers. The method that gave me the best results was the count vectorizer. It takes each word in each document and counts the number of times the word appears. You end up with each word (and n-gram) as a column and each document (talk) as a row, so each entry is the frequency of that word in that document. As you can imagine, there will be a large number of zeros in this matrix; we call this a sparse matrix.
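To show what such a matrix looks like, here is a hand-rolled sketch of count vectorization on three made-up documents. It is only an illustration of the idea; the post does not show the actual call, and a library implementation such as scikit-learn's `CountVectorizer` would normally be used instead and would store the result in a sparse format.

```python
from collections import Counter

# Toy cleaned documents (made up for illustration).
docs = [
    "human creativity education creativity",
    "education school child",
    "music human",
]

# Vocabulary: every distinct word in the corpus, sorted so column order is stable.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return one matrix row: how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word.
matrix = [count_vector(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)

# Share of zero entries, to show how sparse even this tiny matrix is.
total = len(matrix) * len(vocabulary)
zeros = sum(value == 0 for row in matrix for value in row)
print(f"{zeros}/{total} entries are zero")
```

With thousands of talks and a vocabulary in the tens of thousands, the share of zeros grows much larger, which is why a sparse representation is the usual choice.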
Now we are ready for topic modeling! In Part 2 we do topic modeling with Latent Dirichlet Allocation and visualization with t-SNE!