YMCA web scraping: what the Village People have to do with obesity (Part 1)

Github of code.

I have been doing some thinking about the factors that contribute to obesity in the US (CDC).

One of the issues for people who struggle with obesity in low-income areas is access to a safe place to exercise. For example, as someone who lives in Baltimore, I can certainly see why you wouldn't just go running if you don't know the area. It can make you an easy target for someone who is looking to get your phone/cash or just make trouble. I was thinking about what the least expensive options are for people who want to go to the gym, pool or basketball court, and I immediately thought of the good old YMCA! My family has always been members of the Y. It's where I learned to swim, where I did step aerobics in the 80s, and where I ran on the track with my Dad when it was cold outside. It also happens to be the most cost-effective option around town. The Y was always a good place to build community and get some exercise. I even went to a lock-in on New Year's with my friends. Much love to the Pat Jones YMCA!

I decided to look at the data on obesity in the US and whether it is related to the locations of YMCAs, i.e., are there more YMCAs in areas with lower obesity? To do this I needed the locations of all YMCAs, but the only way to get them was to enter states, cities, or zip codes into this page (**as of Oct. 5th, 2017**):

Find Your Y

List of Locations

When you enter a state, the page only shows the 20 closest locations, and I wanted them all, so the most systematic way to get every location was to enter zip codes one by one. Thanks to Selenium, this can be automated once you have a list of zip codes. I used Selenium with Python 3 and wrote a little function that takes a list of zip codes (I separated them by state) and automates entering each zip code, clicking 'go', and pulling all of the HTML from the next page. That page (upper right) is made up of a series of tables that list the name, address, city, state, zip, phone, etc. The code for it is below.

First, import a bunch of libraries.

import time, os, pickle, sys, selenium, requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# path to your pickled zipcode files
path = '/user/me/placeI_keep_zipcodes' 

Then, we make a little function to unpickle the zip code files (pickle is Python's built-in format for saving objects, such as a pandas data frame, to disk).

def get_zips(state_name):
    """Open the pickle file of zipcodes for a state.
    IN: state abbreviation
    OUT: the data in the file
    """
    # os.path.join adds the separator that plain string concatenation would miss
    with open(os.path.join(path, state_name + "_zips.pkl"), 'rb') as picklefile:
        return pickle.load(picklefile)
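
If you don't already have the per-state pickle files, here is a minimal sketch of how one could be written and read back. The `save_zips` and `load_zips` helpers, the sample zip codes, and the temporary folder are assumptions for illustration; only the `<state>_zips.pkl` naming comes from the function above.

```python
import os
import pickle
import tempfile


def save_zips(folder, state_name, zipcodes):
    """Hypothetical helper: pickle a list of zip codes as <state>_zips.pkl."""
    filename = os.path.join(folder, state_name + "_zips.pkl")
    with open(filename, "wb") as picklefile:
        pickle.dump(list(zipcodes), picklefile)
    return filename


def load_zips(folder, state_name):
    """Read the list back, mirroring get_zips but with an explicit folder."""
    with open(os.path.join(folder, state_name + "_zips.pkl"), "rb") as picklefile:
        return pickle.load(picklefile)


# Zip codes are kept as strings so that any leading zeros survive.
with tempfile.TemporaryDirectory() as folder:
    save_zips(folder, "MD", ["21201", "21218", "21231"])
    round_trip = load_zips(folder, "MD")
```

Storing the zip codes as strings rather than integers is a design choice here: the search box takes text anyway, and an integer would drop a leading zero.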

Now, we will make a very long function, but it will be worth it. Don't worry, I'll explain it in pieces. 

def get_ymca_locations(state_name):
    """Open the page to enter in zip code, parse HTML, save in a dataframe.
    IN: 2-letter abbreviation of a state
    OUT: data frame of ymca locations (name, address, city, zip) in the state
    """
    # open file that contains all zipcodes for selected state
    zipcodes_for_site = [get_zips(state_name)]

    #create a data frame, name the cols we will fill up
    y_df = pd.DataFrame(index=range(len(zipcodes_for_site[0])*20),
                        # column names inferred from the assignments further down
                        columns=['name', 'adds', 'city', 'state', 'zipcode'])

Loop over the zip codes.

For each one, open up Chrome under Selenium's control with chromedriver. Be sure you have it installed. The process is simple.

    row = 0
    for zipy in zipcodes_for_site[0]:
        #open chrome
        chromedriver = "/Applications/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)

Now, we enter the URL of the page we want to go to.

#url of YMCA's page

When you look at the source code for this website, you can find the element ID of the box where we need to enter our zip code. That ID is what you pass to Selenium. For us, it was 'address'.

# Selenium 3 call; in Selenium 4 the equivalent is driver.find_element(By.ID, "address")
query = driver.find_element_by_id("address")

# Then we ask (query) selenium to put in our zipcode (left)

# and hit enter

# now we have arrived at the page we want! (right)
# It lists our locations in a big table. 
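
The code for these steps did not survive in the listing above, so here is a minimal sketch of what they might look like, not the original lines. It relies on Selenium's `get`, `find_element`, `send_keys` and `page_source`; the function name `search_by_zip`, the `url` argument (the real address is not shown in this post), and the pause length are hypothetical. It is written against whatever driver object you pass in, so the locator and the key are plain strings: `"id"` is the value of Selenium's `By.ID`, and `"\ue006"` is the value of `Keys.RETURN`.

```python
import time

BY_ID = "id"           # same string as selenium.webdriver.common.by.By.ID
RETURN_KEY = "\ue006"  # same string as selenium.webdriver.common.keys.Keys.RETURN


def search_by_zip(driver, url, zipcode, pause_seconds=2.0):
    """Type one zip code into the 'address' box and return the results page HTML."""
    driver.get(url)                                # open the search page
    query = driver.find_element(BY_ID, "address")  # the zip code box
    query.send_keys(str(zipcode))                  # put in our zip code
    query.send_keys(RETURN_KEY)                    # and hit enter
    time.sleep(pause_seconds)                      # crude wait for the results to load
    return driver.page_source                      # HTML to hand to BeautifulSoup
```

A fixed pause is the simplest thing that could work; Selenium's explicit waits would be a sturdier choice if the results page loads slowly.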

The HTML was parsed using BeautifulSoup4 in Python 3. Create a variable called soup2 (soup1 was lost in a terrible accident. Don't ask.) and fill it up with all the HTML from our current page (the one that lists our 20 locations).

html from 'find your Y' page

I figured out that each location had this unique style attribute, so that's what I searched for in order to pull the name, address, city, state, and zip. This gets entered into our 'find_all' call.

locationsoup = soup2.find_all(style="padding-left: 17px; text-indent: -17px;")

For each item, this finds the first 'a' tag, which is a link; the text of that link is the name of that YMCA, and that is what we save in the name variable. See all those 'br' tags? Each piece of text is separated from the next by a 'br', so stepping over it with 'next_sibling.next_sibling' proved to be my best friend!

        for item in locationsoup:
            name1 = item.find('a')
            name = name1.text

            # then, we go to the next item (address) and store that
            adds = name1.next_sibling.next_sibling

            # do it again for city/state/zip
            nn = adds.next_sibling.next_sibling

Now, we add a check to see whether the location has been stored already; if not, it stores the variables and splits that big 'nn' variable into city, state, and zip.

            if adds not in y_df.adds.values:
                # .loc[row, col] writes into y_df itself; chained .iloc can end up writing to a copy
                y_df.loc[row, 'name'] = name
                y_df.loc[row, 'adds'] = adds
                y_df.loc[row, 'city'] = nn.split(',')[0]
                y_df.loc[row, 'state'] = nn.split()[-2]
                y_df.loc[row, 'zipcode'] = str(nn.split()[-1])[0:5]
                row += 1
Parsing HTML with beautifulsoup4 
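
The duplicate check and the city/state/zip split above can be tried out on plain strings, without a browser or a data frame. This is a sketch under assumptions: the helper names and the sample rows are made up, and it tracks addresses in a set rather than checking the data frame column the way the post does.

```python
def split_city_state_zip(line):
    """Split 'City, ST 12345-6789' into (city, state, 5-digit zip)."""
    parts = line.split()
    city = line.split(",")[0].strip()
    state = parts[-2]
    zipcode = parts[-1][:5]  # drop any ZIP+4 suffix
    return city, state, zipcode


def add_if_new(records, seen_addresses, name, address, city_state_zip):
    """Append a location only if its street address has not been stored yet."""
    if address in seen_addresses:
        return False
    seen_addresses.add(address)
    city, state, zipcode = split_city_state_zip(city_state_zip)
    records.append({"name": name, "adds": address, "city": city,
                    "state": state, "zipcode": zipcode})
    return True


records, seen = [], set()
# Made-up sample rows; the same Y turns up for two neighbouring zip codes.
add_if_new(records, seen, "Sample Family Y", "1 Example St", "Baltimore, MD 21218-1234")
add_if_new(records, seen, "Sample Family Y", "1 Example St", "Baltimore, MD 21218-1234")
add_if_new(records, seen, "Other Sample Y", "2 Example Ave", "Silver Spring, MD 20910")
```

Because nearby zip codes return overlapping lists of 20 locations, most results are repeats, which is why the check happens before anything is stored. Note that the split takes the state and zip from the end of the string, so city names with spaces still come out whole.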

The last thing to do is close the web driver! Don't forget this step. Then drop the null rows and add a 'locations' column.

    #drop null rows
    y_df = y_df[y_df['name'].notnull()]
    # this is how we will sum later
    y_df['locations'] = 1
    #get rid of locations in nearby states
    y_df = y_df.loc[(y_df.state == state_name)]

    return y_df
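
The listing does not show the line that closes the driver, so here is one possible way to make that step hard to forget. It is a sketch, not the original code: `managed_driver` is a hypothetical helper, and the only Selenium call it relies on is the driver's `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raises an error
```

Inside the zip code loop this could be used as `with managed_driver(lambda: webdriver.Chrome(chromedriver)) as driver:`, so a failed page load at 3 a.m. does not leave a pile of open Chrome windows behind.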

I scraped one state at a time, and some states have a LOT more zip codes than others. My average rate was about 3 zip codes per minute, which meant a few nights of leaving my computer on all night to work while I snored.
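
To get a feel for what that rate means, here is a back-of-the-envelope estimate. The 1,000 zip codes is a made-up round number, not a count for any real state, and the function name is hypothetical.

```python
def estimate_hours(n_zipcodes, zips_per_minute=3):
    """Rough scraping time in hours at a steady rate of zip codes per minute."""
    return n_zipcodes / zips_per_minute / 60


# 1,000 zip codes is a made-up round number, not a count for any real state.
hours = estimate_hours(1000)
```

At 3 zip codes per minute, 1,000 zip codes works out to about five and a half hours, which is roughly one night of snoring.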


Please see my other posts about initial data cleaning and analysis.