TED Talk Recommender (Part 2): Topic Modeling and tSNE

I created a TED talk recommender using Natural Language Processing (NLP). See Part 1 for the initial exploration and cleaning of the data. This post describes the different topic modeling methods I tried and how I used t-distributed Stochastic Neighbor Embedding (tSNE) to visualize the topic space. The code is available on my GitHub.

Now that we have cleaned and vectorized our data, we can move on to topic modeling. A topic is just a group of words that tend to show up together throughout the corpus of documents. The model that worked best for me was Latent Dirichlet Allocation (LDA), which creates a latent space where documents with similar topics are closer together. This works because documents about a similar topic tend to contain similar words, and LDA picks up on this. We then assign a label to each topic based on its 20 most frequent words. This is where NLP can be tricky: you have to decide whether the topics make sense, given your corpus of documents and the model(s) you have used.

In more detail, LDA provides a (strength) score for each topic for each document, and I simply assigned the strongest topic to each document. One other thing about LDA: it will also yield a 'junk' topic (hopefully there's only one...) that won't make much sense, and that's fine. It collects all of the words without a group, i.e., words that don't tend to show up alongside other words. Depending on how much you clean your data, this topic can be tricky to recognize. When my data was still pretty dirty, it was where all of the less meaningful words ended up (e.g., stuff, guy, yo).
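As a toy illustration of the assignment step (the scores below are made up, not real TED data), picking the strongest topic per document is just a row-wise argmax over the document-topic matrix:

```python
import numpy as np

# Toy document-topic matrix: 3 documents, 4 topics.
# Each row holds LDA's (strength) scores for one document.
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],   # document 0 is mostly topic 0
    [0.05, 0.05, 0.85, 0.05],   # document 1 is mostly topic 2
    [0.25, 0.40, 0.20, 0.15],   # document 2 leans toward topic 1
])

# assign the strongest topic to each document
strongest = np.argmax(doc_topic, axis=1)
print(strongest)  # [0 2 1]
```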

First, get the cleaned_talks from the previous step. Then import the models from sklearn:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

I put the model into a function along with the vectorizer so that I could easily manipulate parameters like the number of topics, the number of iterations (max_iter), the n-gram size (ngram_min, ngram_max), the maximum document frequency (max_df), and the number of features (max_feats). We can tune these hyperparameters to see which settings give us the best topics (the ones that make sense to you).
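If you want a rough quantitative signal alongside the eyeball test, sklearn's LDA exposes a perplexity score (this is my own suggestion, not part of the original workflow; the tiny corpus below is made up for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# a tiny illustrative corpus with two obvious themes
docs = [
    "space star planet galaxy universe",
    "planet space telescope star light",
    "food farm crop soil plant",
    "plant seed soil farm grow",
]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# compare a few topic counts; lower perplexity suggests a better fit
for k in (2, 3):
    lda = LatentDirichletAllocation(n_components=k, random_state=42)
    lda.fit(X)
    print(k, lda.perplexity(X))
```

Perplexity is only a rough guide; topics that read sensibly to a human matter more for a recommender.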


 

def topic_mod_lda(data,topics=5,iters=10,ngram_min=1, ngram_max=3, max_df=0.6, max_feats=5000):
    
    """ vectorizer - turn words into numbers for each document(rows)
    then use Latent Dirichlet Allocation to get topics"""
    
    vectorizer = CountVectorizer(ngram_range=(ngram_min,ngram_max), 
                             stop_words='english', 
                             max_df = max_df, 
                             max_features=max_feats)
    vect_data = vectorizer.fit_transform(data)
    
    lda = LatentDirichletAllocation(n_components=topics,
                                    max_iter=iters,
                                    random_state=42,
                                    learning_method='online',
                                    n_jobs=-1)
    
    lda_dat = lda.fit_transform(vect_data)
    
    # to display a list of topic words and their scores 
    def display_topics(model, feature_names, no_top_words):
        for ix, topic in enumerate(model.components_):
            print("Topic ", ix)
            print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
    
    display_topics(lda, vectorizer.get_feature_names_out(), 20)
    
    return vectorizer, vect_data, lda, lda_dat

The parameters I settled on were the following: 

vect_mod, vect_data, lda_mod, lda_data = topic_mod_lda(cleaned_talks, topics=20,
                                             iters=100, ngram_min=1, ngram_max=2, 
                                             max_df=0.5, max_feats=2000)

The function will print the topics and the 20 most frequent words in each. Here are a few examples from the TED talks.

 --------------------------------------------------------------------------------------------------------------------------------------------------

Topic  0
woman men child girl family community young black mother sex boy home man country white school story female father gender

Topic  1
food plant water eat farmer product plastic waste grow seed feed farm produce crop egg soil diet eating percent agriculture

Topic  2
universe earth planet space light star science mars physic particle galaxy sun theory billion matter hole black number image away

Topic  3
water ocean specie animal 000 tree sea forest fish earth planet ice area year ago river million coral bird foot shark

---------------------------------------------------------------------------------------------------------------------------------------------------

At this point, try to put a name to each topic. For example, I called Topic 3 'marine biology', but you might call it 'environment'. Now that we have topics, we can assign the strongest topic to each document and create a dataframe of labels for our next step.

import numpy as np
import pandas as pd

# assign the strongest topic to each document
topic_ind = np.argmax(lda_data, axis=1)
y = topic_ind

# create text labels for plotting
tsne_labels = pd.DataFrame(y)

# save to csv
tsne_labels.to_csv(path + 'tsne_labels.csv')

Then we create the dataframe with our custom labels:

topic_names = tsne_labels
topic_names[topic_names==0] = "family"
topic_names[topic_names==1] = "agriculture"
topic_names[topic_names==2] = "space" .... 

If we want to view this topic space, to further check whether our modeling worked the way we wanted, we can use tSNE. tSNE is a technique for reducing the dimensionality of a high-dimensional space so that we can visualize it.


 

from sklearn.manifold import TSNE

# a t-SNE model
# an angle value close to 1 means sacrificing accuracy for speed
# pca initialization usually leads to better results
tsne_model = TSNE(n_components=3, verbose=1, random_state=44, angle=0.5,
                  perplexity=18, early_exaggeration=1, learning_rate=50.0)  # , init='pca'

# 20-D -> 3-D
tsne_lda = tsne_model.fit_transform(lda_data)

# put into a DataFrame for the plotly import
tsne_data = pd.DataFrame(tsne_lda)

We humans are bad at visualizing 20 dimensions, as you know. Once we plot our tSNE, we want to look for clusters, meaning that the points (documents) belonging to the same topic sit close together rather than spread out. I have code to plot it in matplotlib, but I prefer plotly. Each point is a document (talk) and each color is a different topic, so points of the same color should be close together if our topic modeling is working. If they are spread out, the documents in that topic are not very similar (which may not be the worst thing in the world, depending on what you are trying to do). I was not able to get a better color bar, so the plot is easier to read if you isolate a few topics rather than trying to view all 20. You can do this by clicking on the topics in the legend.
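For reference, a minimal matplotlib version of this scatter plot might look like the sketch below (the random arrays stand in for tsne_lda and the topic labels y, since the real data isn't reproduced here):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# stand-in data: 100 documents in the 3-D tSNE space, 20 topic labels
rng = np.random.default_rng(0)
tsne_lda = rng.normal(size=(100, 3))
y = rng.integers(0, 20, size=100)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(tsne_lda[:, 0], tsne_lda[:, 1], tsne_lda[:, 2],
                c=y, cmap="tab20", s=12)   # one color per topic
fig.colorbar(sc, label="topic")
fig.savefig("tsne_topics.png")
```

Unlike the plotly version, this one is static, so you can't click topics in a legend to isolate them.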

tSNE Ted Talk Topics

Stay tuned for the actual recommender, which I created using Flask.