Linear regression of multiple county-level statistics on the obesity rate in the United States.
Topic modeling of TED talks using Latent Dirichlet Allocation and visualization with tSNE.
This post is about the Ted Talk recommender I made using Natural Language Processing (NLP) and topic modeling. The recommender lets you enter key words from the title of a talk and returns the URLs of 5 Ted Talks that are similar to yours. I'm planning to host my little flask app soon so that you too can use the Ted Talk Recommender! The github repository is here.
The data for this project consisted of transcripts from Ted and TedX talks. Thanks to Rounak Banik and his web scraping, I was able to obtain 2467 transcripts from 355 different Ted and TedX events from 2001-2017. I downloaded this corpus from Kaggle, along with metadata about every talk. Here's a sample of the transcript from the most popular (highest views) Ted Talk, 'Do Schools Kill Creativity?' by Sir Ken Robinson.
Good morning. How are you? (Laughter) It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving. (Laughter) There have been three themes running through the conference which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity
The first thing I saw when looking at these transcripts was that there were a lot of parentheticals for various non-speech sounds. For example, (Laughter) or (applause) or (Music). There were even some cute little notes when the lyrics of a performance were transcribed:
someone like him ♫♫ He was tall and strong
I decided that I wanted to look at only the words that the speaker said, and to remove these words in parentheses. It would be interesting, though, to collect these non-speech events and keep a count in the main matrix, especially for things like 'laughter' or 'applause' or multimedia (present/not present), for use in making recommendations or calculating the popularity of a talk.
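That idea could be sketched by tallying the parentheticals before stripping them out (the snippet of 'transcript' below is invented for illustration):

```python
import re
from collections import Counter

# a made-up fragment in the style of the transcripts
text = "How are you?(Laughter)It's been great.(Applause)(Laughter)"

# tally every parenthesized non-speech event, case-insensitively
events = Counter(m.lower() for m in re.findall(r'\(([^)]*)\)', text))
# Counter({'laughter': 2, 'applause': 1})
```

A column like `events['laughter']` per talk could then ride along in the main matrix.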
Lucky for me, all of the parentheses contained these non-speech sounds, and any of the speaker's words that required parentheses were in brackets, so I just removed them with a simple regular expression. Thank you, Ted transcribers, for making my life a little easier!!!
clean_parens = re.sub(r'\([^)]*\)', ' ', text)
Cleaning Text with NLTK
Four important steps for cleaning the text and getting it into a format that we can analyze:
1. tokenize
2. lemmatize
3. remove stop words/punctuation
4. vectorize
NLTK (Natural Language ToolKit) is a Python library for NLP. I found it very easy to use and highly effective.
Tokenize - This is the process of splitting up the document (talk) into words. There are a few tokenizers in NLTK, and the one called wordpunct was my favorite because it separates out the punctuation as well.
from nltk.tokenize import wordpunct_tokenize

doc_words2 = [wordpunct_tokenize(docs[fileid]) for fileid in fileids]
print(' '.join(doc_words2[0][:7]))

OUTPUT:
Good morning . How are you ?
The notes were easy to remove by adding them to my stop words. Stop words are the words that don't give us much information (i.e., the, and, it, she, as), along with the punctuation. We want to remove these from our text, too.
We can do this by importing NLTK's list of stopwords and then adding to it. I went through many iterations of cleaning in order to figure out which words to add to my stop words. I added a lot of words and little things that weren't getting picked up, but this is a sample of my list.
from nltk.corpus import stopwords

stop = stopwords.words('english')
stop += ['.', "'", 'ok', 'okay', 'yeah', 'ya', 'stuff', '?']
Lemmatization - In this step, we get each word down to its root form. I chose the lemmatizer over the stemmer because it was more conservative and was able to change the ending to the appropriate one (e.g., children --> child, capacities --> capacity). This was at the expense of missing a few obvious ones (starting, unpredictability).
from nltk.stem import WordNetLemmatizer

lemmizer = WordNetLemmatizer()
clean_words = []
for word in docwords2:
    # remove stop words
    if word.lower() not in stop:
        low_word = lemmizer.lemmatize(word)
        # another shot at removing stopwords
        if low_word.lower() not in stop:
            clean_words.append(low_word.lower())
Now we have squeaky clean text! Here's the same excerpt that I showed you at the top of the post.
good morning great blown away whole thing fact leaving three theme running conference relevant want talk one extraordinary evidence human creativity
As you can see it no longer makes a ton of sense, but it will still be very informative once we process these words over the whole corpus of talks.
Let's look at some of the n-grams. These are just pairs (or triplets) of words that show up together. They will tell us something about our corpus, but also guide us in our next step of vectorization. Here's what we get for the top 30 most frequent bi-grams.
from collections import Counter
from textblob import TextBlob
from nltk import ngrams

counter = Counter()
n = 2
for doc in cleaned_talks:
    words = TextBlob(doc).words
    bigrams = ngrams(words, n)
    counter += Counter(bigrams)

for phrase, count in counter.most_common(30):
    print('%20s %i' % (" ".join(phrase), count))

OUTPUT:
            year ago 2074
          little bit 1607
            year old 1365
       united states 1103
           one thing 1041
        around world 938
            new york 894
             can not 877
          first time 751
           every day 692
         many people 656
           last year 604
        every single 573
             one day 559
             10 year 541
The tri-grams were not very informative or useful aside from "new york city" and "000 year ago", which get picked up in the bi-grams.
OUTPUT:
new york city 236
000 year ago 135
new york times 123
10 year ago 118
every single day 109
million year ago 109
people around world 101
two year ago 100
world war ii 99
one two three 97
couple year ago 96
20 year ago 83
five year old 78
talk little bit 71
spend lot time 71
every single one 69
three year ago 69
six year old 69
sub saharan africa 68
Vectorization - This is the important step of turning our words into numbers. The method that gave me the best results was the count vectorizer. This function takes each word in each document and counts the number of times the word appears. You end up with each word (and n-gram) as your columns, and each row is a document (talk), so the data is the frequency of each word in each document. As you can imagine, there will be a large number of zeros in this matrix; we call this a sparse matrix.
from sklearn.feature_extraction.text import CountVectorizer

c_vectorizer = CountVectorizer(ngram_range=(1, 3),
                               stop_words='english',
                               max_df=0.6,
                               max_features=10000)
# call `fit` to build the vocabulary
c_vectorizer.fit(cleaned_talks)
# finally, call `transform` to convert text to a bag of words
c_x = c_vectorizer.transform(cleaned_talks)
Now we are ready for topic modeling! Stay tuned for next week's post on Latent Dirichlet Allocation and visualization with tSNE!
Dog and Cat sounds. I reduced the dimensionality of the 1D FFTs of the sounds and then used several models to see which one is able to classify them. Gaussian Naive Bayes won.
Dog and Cat sounds. This post is mostly spectrograms.
Merging of YMCA location data from zip code format to county, for data exploration and comparison to health data per county.
Opinion: we are inadvertently weeding women out in the name of 'rigor'.
I have been doing some thinking about the factors that contribute to obesity in the US (CDC).
One of the issues for people who struggle with obesity in low-income areas is access to a safe place to exercise. For example, as someone who lives in Baltimore, I can certainly see why you wouldn't just go running if you don't know the area. It can make you an easy target for someone who is looking to get your phone/cash or just make trouble. I was thinking about what the least expensive options are for people who want to go to the gym, pool or basketball court, and I immediately thought of the good old YMCA! My family has always been members of the Y. It's where I learned to swim, where I did step aerobics in the 80s, and ran on the track with my Dad when it was cold outside. It also happens to be the most cost effective option around town. The Y was always a good place to build community and get some exercise. I even went to a lock-in on New Year's with my friends. Much love to the Pat Jones YMCA.
I decided to look at the data on obesity in the US and whether or not it has a relationship with the locations of YMCAs. i.e., Are there more YMCAs in areas with lower obesity? In order to do this I needed the locations of all YMCAs, but the only way to get it was to enter in all the states, cities, or zip codes to this page (**as of Oct. 5th, 2017**)
When you enter the state, it only shows the 20 closest locations, and I wanted them all, so the most systematic way to get all the locations was to enter the zip codes. Thanks to selenium, this can be automated once you obtain a list of zip codes. I used selenium with Python 3 and wrote a little function that allowed me to feed it a list of zip codes (I separated them into states); it automates the process of entering the zip code, clicking 'go', and pulling all of the HTML from the next page. This page (upper right) consists of a series of tables that list the name, address, city, state, zip, phone, etc. The code for it is below.
First, import a bunch of libraries.
import time, os, pickle, sys, selenium, requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

sys.setrecursionlimit(10000)

# path to your pickled zipcode files (note the trailing slash)
path = '/user/me/placeI_keep_zipcodes/'
Then, we make a little function to unpickle them (pickle is the format I used for storing the pandas data frames of zip codes).
def get_zips(state_name):
    """Open the pickle file of zipcodes for a state
    -----------
    IN: state abbreviation
    OUT: the data in the file
    """
    with open(path + state_name + "_zips.pkl", 'rb') as picklefile:
        return pickle.load(picklefile)
Now, we will make a very long function, but it will be worth it. Don't worry, I'll explain it in pieces.
def get_ymca_locations(state_name):
    """Open the page to enter in zip code, parse HTML, save in a dataframe
    -------------------
    IN: 2-letter abbreviation of a state
    OUT: data frame of ymca locations (name, address, city, zip) in the state
    """
    # open file that contains all zipcodes for selected state
    zipcodes_for_site = get_zips(state_name)

    # create a data frame, name the cols we will fill up
    y_df = pd.DataFrame(index=range(len(zipcodes_for_site)*20),
                        columns=['zipcode', 'state', 'city',
                                 'adds', 'name', 'locations'])
Next, we iterate through a loop over the zip codes, opening up Chrome remotely with chromedriver (be sure you have it installed; the process is simple).
    row = 0
    for zipy in zipcodes_for_site:
        # open chrome
        chromedriver = "/Applications/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)
Now, we enter the URL of the page we want to go to.
        # url of YMCA's page
        driver.get("http://www.ymca.net/find-your-y/")
When you look at the 'source' code for this website, you can find out what the 'element ID' is of the box where we need to enter our zip code. This is the word that you put in. For us, it was 'address'.
        query = driver.find_element_by_id("address")
        # Then we ask (query) selenium to put in our zipcode (left)
        query.send_keys(zipy)
        # and hit enter
        query.send_keys(Keys.RETURN)
        # now we have arrived at the page we want! (right)
        # It lists our locations in a big table.
The HTML was parsed using BeautifulSoup4 in Python 3. Create a variable called soup2 (soup1 was lost in a terrible accident, don't ask) and fill it up with all the HTML from our current page (the one that lists our 20 locations).
I figured out that each location had this unique style tag, so that's what I searched for in order to pull the name, address, city, state, and zip. This gets entered into our 'find_all' call.
locationsoup = soup2.find_all(style="padding-left: 17px; text-indent: -17px;")
This takes each item and gets the first 'a' tag, which is a web address; the text associated with it is the name of that YMCA. That is what we save in the name variable. See all those 'br' tags? The 'next_sibling.next_sibling' method proved to be my best friend!
        for item in locationsoup:
            name1 = item.find('a')
            name = name1.text
            # then, we go to the next item (address) and store that
            adds = name1.next_sibling.next_sibling
            # Do it again for city/state/zip
            nn = adds.next_sibling.next_sibling
Now, we add another check to see if the location has been stored already; if not, it stores the variables and separates that big 'nn' variable into city/state/zip.
            if adds not in y_df.adds.values:
                y_df.name.iloc[row] = name
                y_df.adds.iloc[row] = adds
                y_df.city.iloc[row] = nn.split(',')[0]
                y_df.state.iloc[row] = nn.split()[-2]
                y_df.zipcode.iloc[row] = str(nn.split()[-1])[0:5]
                row += 1
The last thing to do is close the web driver!!!!! Don't forget this step. Then drop the null rows and add a locations column.
        driver.close()

    # drop null rows
    y_df = y_df[y_df['name'].notnull()]
    # this is how we will sum later
    y_df['locations'] = 1
    # get rid of locations in nearby states
    y_df = y_df.loc[(y_df.state == state_name)]
    return y_df
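To see the sibling-hopping trick in isolation, here is a toy version of the parsing (the HTML snippet and the YMCA name are invented; the real page's markup may differ slightly):

```python
from bs4 import BeautifulSoup

# invented snippet in the style of the YMCA results table
html = ('<p style="padding-left: 17px; text-indent: -17px;">'
        '<a href="/y1">Druid Hill YMCA</a><br/>'
        '123 Main St<br/>Baltimore, MD 21201</p>')

soup2 = BeautifulSoup(html, 'html.parser')
locationsoup = soup2.find_all(style="padding-left: 17px; text-indent: -17px;")

item = locationsoup[0]
name1 = item.find('a')
name = name1.text                        # 'Druid Hill YMCA'
adds = name1.next_sibling.next_sibling   # address text after the first <br/>
nn = adds.next_sibling.next_sibling      # city/state/zip after the second <br/>

print(name, '|', adds, '|', nn.split()[-2], '|', str(nn.split()[-1])[0:5])
```

Each `.next_sibling` hop lands on the `<br/>` tag, and the second hop lands on the text node after it, which is exactly why the double hop works here.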
I scraped one state at a time, and some states have a LOT more zip codes than others. My average time was about 3 zip codes per minute, which meant a few nights of letting my computer stay on all night to work while I snore.
Please see my other posts about initial data cleaning and analysis.
WHICH MTA STATIONS ARE OPTIMAL FOR STREET TEAMS?
A fictional nonprofit company called WomenTechWomenYes is on a mission to get more girls and women involved with technology. They are throwing their yearly fundraising GALA in the early summer. In order to maximize attendance and raise awareness, they are planning to send street teams into the city to collect emails from locals. People who give their email will then be sent free tickets to attend the fundraising GALA. We were tasked with locating the best subway stations in NYC at which to place 8 street teams collecting emails on 2 different days (1 weekday, 1 weekend).
WTWY asked that we use the MTA data to determine the busiest stations. We decided that they should be looking at more than just the busiest stations. There are issues with the top 5 stations, as they are filled with both tourists and commuters, and tourists are not likely to attend a GALA or contribute to a local non-profit. The other issue with the busiest stations is that with too many people it will be very difficult to stand and gather information, because you will just be in the way of a huge crowd. How, then, do we decide where to place the street teams? We want to target as many people as possible who will also be interested in the GALA.
OBJECTIVE: LOCATE SUBWAY STATIONS WITH HIGH TRAFFIC IN HIGH INCOME NEIGHBORHOODS.
First we downloaded the turnstile data from the MTA.info website. This data is collected from every turnstile, every 4(ish) hours, and put into a .csv file that contains a week's worth of data. We pulled data from April 30, 2016 - June 25, 2016, and April 29, 2017 - June 24, 2017. Memorial Day was excluded for both years since ridership is abnormal on major holidays.
Each turnstile reports a cumulative count of turns every 4 hours, and resets to zero when it reaches the maximum. So, we need to take the difference to get the number of people that came through for the 4-hour interval. But because of resets, you end up with a few extreme outliers. Below (left) is the plot of histograms for the data, and as you can see there are some extreme outliers. Once we clean it up and get rid of numbers that are not possible, we get the histogram (distribution) of the real data (right).
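The differencing-and-cleaning step can be sketched like this (the counts and the plausibility cap are invented for illustration):

```python
# hypothetical cumulative entry counts from one turnstile;
# the counter resets to a small number mid-series
cum = [1000, 1123, 1301, 5, 210, 498]

# successive differences give people per 4-hour interval
diffs = [b - a for a, b in zip(cum, cum[1:])]
# [123, 178, -1296, 205, 288]

# drop negative jumps (resets) and absurdly large values
MAX_PLAUSIBLE = 10000   # assumed cap on people per interval
clean = [d for d in diffs if 0 <= d <= MAX_PLAUSIBLE]
# [123, 178, 205, 288]
```

The reset shows up as a large negative difference, which the filter removes along with any impossibly large positives.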
After cleaning the data, we summed the entries and exits collected for each time interval to get a total number of people in/out. Then, we summed over the time intervals to get a daily total for each turnstile. Then, we summed over the turnstiles for each station to get a daily total of people per station. We then wanted to know how to narrow it down by day of the week, since there are over 300 unique stations in this dataset (MTA's count is higher, I realize). Below is a figure of the mean traffic (all stations) for each day of the week.
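The levels of summing can be sketched with a toy pandas frame (the column names and numbers below are mine, not the real MTA schema):

```python
import pandas as pd

# toy stand-in for the cleaned turnstile data (values are made up)
df = pd.DataFrame({
    'station':   ['GRD CNTRL-42 ST'] * 4 + ['34 ST-HERALD SQ'] * 2,
    'turnstile': ['A', 'A', 'B', 'B', 'C', 'C'],
    'date':      ['2017-05-02'] * 6,
    'traffic':   [120, 95, 80, 60, 200, 150],  # entries + exits per interval
})

# interval totals -> daily total per turnstile -> daily total per station
per_turnstile = df.groupby(['station', 'turnstile', 'date'])['traffic'].sum()
per_station = per_turnstile.groupby(level=['station', 'date']).sum()
```

Here `per_station` ends up with one daily total per station, which is the quantity plotted by day of the week.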
As you can see, the MTA is used more by commuters during the week going to and from work than on weekends. Tuesday, Wednesday, and Thursday are the best days to try and catch some locals going to/from work. As for the weekend, Saturday sees more traffic than Sunday. Then, we look at the top 50 stations for a particular weekday and match them to the income level of the neighborhood.
The second plot shows the top 20 stations for Tuesdays (mean over all the Tuesdays we collected). The expected stations appear here, though I thought Times Square would have been higher on the list.
In order to map stations, we linked the station names to the latitude and longitude. This information is kept in a different file (from the MTA), and the station names are slightly different in the 2 files, e.g., 'GRD CNTRL-42 ST' vs. 'Grand Central - 42 ST' (see figure below). So, we used a Python library called fuzzy wuzzy to match the names from the turnstile files to the names listed in the latitude/longitude file (also obtained from MTA's site).
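The fuzzy wuzzy call itself isn't shown here, but the same idea can be sketched with Python's built-in difflib (the `best_match` helper and the candidate list are mine, using just the two names from the example):

```python
import difflib

geo_names = ['Grand Central - 42 St', 'Times Square - 42 St']

def best_match(name, candidates):
    # score each candidate by similarity ratio, case-insensitively
    scores = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return max(scores)[1]

best_match('GRD CNTRL-42 ST', geo_names)   # 'Grand Central - 42 St'
```

The abbreviated turnstile name scores much higher against its geo-file counterpart than against any other station, so picking the top score links the two files.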
We acquired the median household income from US Census Block Data 2015. It can be downloaded as a shape file and contains latitude and longitude along with other Geo data.
BUSIEST STATIONS IN HIGH INCOME DATA AREAS
Now that we have both the income and the subway traffic data sets, we can give each subway station a weight based on the level of income for that area (block) and choose the top stations based on traffic and income.
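The exact weighting formula isn't spelled out in this post; one simple version would scale each station's traffic by its block's normalized income (all station names and numbers below are invented):

```python
# toy inputs: daily traffic and median household income per station's block
stations = {
    'GRD CNTRL-42 ST':  {'traffic': 90000, 'income': 85000},
    'BEDFORD AV':       {'traffic': 25000, 'income': 95000},
    'CONEY IS-STILLW':  {'traffic': 30000, 'income': 40000},
}

# weight traffic by income relative to the richest block
max_income = max(s['income'] for s in stations.values())
scores = {name: s['traffic'] * (s['income'] / max_income)
          for name, s in stations.items()}

# rank stations by the combined score, best first
ranked = sorted(scores, key=scores.get, reverse=True)
```

With made-up numbers like these, a high-traffic station in a lower-income area can fall below a quieter station in a high-income area, which is the intended trade-off.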
MAPPING STATIONS + INCOME
Lastly, we map these stations onto the heatmap of income data. We used a Python library called Basemap for the mapping tasks.