Bloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents.

Takes a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string and marks the duplicate text, which can then be highlighted, bolded, or removed. Both the marked text and the tokens are returned as output.


Busiest NYC subway stations

WHICH MTA STATIONS ARE OPTIMAL FOR STREET TEAMS?

MTA nyc map

THE PROBLEM

A fictional nonprofit called WomenTechWomenYes (WTWY) is on a mission to get more girls and women involved with technology. They are throwing their yearly fundraising gala in the early summer. To maximize attendance and raise awareness, they plan to send street teams into the city to collect emails from locals. People who give their email will then be sent free tickets to the fundraising gala. We were tasked with locating the best subway stations in NYC at which to place 8 email-collecting street teams on 2 different days (1 weekday, 1 weekend day).

APPROACH

WTWY asked that we use the MTA data to determine the busiest stations, but we decided they should look at more than raw traffic. The top 5 stations have two problems. First, they are filled with both tourists and commuters, and tourists are unlikely to attend a gala or contribute to a local nonprofit. Second, at the very busiest stations it is difficult to stand and gather information, because a street team will simply be in the way of a huge crowd. How, then, do we decide where to place the street teams? We want to reach as many people as possible who will also be interested in the gala.

OBJECTIVE: LOCATE SUBWAY STATIONS WITH HIGH TRAFFIC IN HIGH INCOME NEIGHBORHOODS.

First, we downloaded the turnstile data from the MTA.info website. The data is collected from every turnstile roughly every 4 hours and published as .csv files that each contain a week's worth of data. We pulled data from April 30, 2016 to June 25, 2016, and from April 29, 2017 to June 24, 2017. Memorial Day was excluded in both years, since ridership is abnormal on major holidays.
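As a minimal sketch of the date bookkeeping, the snippet below lists one date per week across each range and computes the Memorial Day dates to exclude. It assumes the range endpoints (which both fall on Saturdays) mark weekly file dates; the function names `weekly_dates` and `memorial_day` are hypothetical helpers, not part of any MTA tooling.

```python
from datetime import date, timedelta

def weekly_dates(start, end):
    """Return every 7th day from start through end, inclusive."""
    days = []
    current = start
    while current <= end:
        days.append(current)
        current += timedelta(days=7)
    return days

def memorial_day(year):
    """Memorial Day is the last Monday of May."""
    d = date(year, 5, 31)
    while d.weekday() != 0:  # Monday == 0
        d -= timedelta(days=1)
    return d

# Assumption: one weekly file per Saturday in each range.
weeks_2016 = weekly_dates(date(2016, 4, 30), date(2016, 6, 25))
weeks_2017 = weekly_dates(date(2017, 4, 29), date(2017, 6, 24))

# Days to drop before computing any averages.
excluded_days = {memorial_day(2016), memorial_day(2017)}
```

Under that assumption, each range covers nine weekly files.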

Histograms of raw (left) and cleaned (right) data.

 

ASSUMPTIONS

Each turnstile keeps a cumulative count of turns, reported every 4 hours, and resets to zero when it reaches its maximum. To get the number of people who came through during each 4-hour interval, we take the difference between consecutive readings. Because of the resets, this produces a few extreme outliers, which are visible in the histogram of the raw data (left). Once we clean the data and remove values that are not possible, we get the distribution of the real data (right).
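The differencing and cleaning step can be sketched in plain Python as follows. The cutoff `max_per_interval` is a hypothetical threshold chosen for illustration; the actual limit used in the project is not stated here.

```python
def interval_counts(readings, max_per_interval=5000):
    """Turn cumulative counter readings (in time order, one turnstile)
    into per-interval counts, dropping impossible values.

    max_per_interval is an assumed plausibility cutoff, not a value
    taken from the original analysis.
    """
    counts = []
    for previous, current in zip(readings, readings[1:]):
        diff = current - previous
        # A negative difference means the counter reset; a huge positive
        # one means a jump no turnstile could produce in one interval.
        if diff < 0 or diff > max_per_interval:
            continue
        counts.append(diff)
    return counts

# Example: the counter resets between the third and fourth readings.
example = interval_counts([100, 250, 400, 5, 60])
```

Dropping the bad interval is only one option. Another would be to replace it with a typical value for that turnstile, which keeps the number of intervals per day constant.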

 

weekdayBars.png

WEEKLY TOTALS

After cleaning the data, we summed the entries and exits collected for each time interval to get the total number of people passing in or out. Then we summed over the time intervals to get a daily total for each turnstile, and summed over the turnstiles at each station to get a daily total of people per station. Because there are over 300 unique stations in this dataset (MTA's own count is higher, I realize), we next wanted to narrow things down by day of the week. Below is a figure of the mean traffic across all stations for each day of the week.

download-2.png

As you can see, the MTA is used more during the week, by commuters going to and from work, than on weekends. Tuesday, Wednesday, and Thursday are the best days to catch locals going to or from work. On the weekend, Saturday sees more traffic than Sunday. Then we looked at the top 50 stations for a particular weekday and matched them to the income level of the neighborhood.

The second plot shows the top 20 stations for Tuesdays (mean over all the Tuesdays we collected). The expected stations appear here, though I thought Times Square would have been higher on the list.
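The aggregation described in this section can be sketched roughly as follows. The row layout `(station, turnstile, day, entries, exits)` is an assumed shape for the cleaned per-interval data, and the weekday mean here is taken over all station-day totals, which is one possible reading of "mean traffic (all stations)".

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def daily_station_totals(rows):
    """Sum entries + exits over intervals and turnstiles,
    giving one total per (station, day)."""
    totals = defaultdict(int)
    for station, _turnstile, day, entries, exits in rows:
        totals[(station, day)] += entries + exits
    return dict(totals)

def mean_by_weekday(totals):
    """Average the station-day totals for each day of the week."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for (_station, day), total in totals.items():
        name = WEEKDAYS[day.weekday()]
        sums[name] += total
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}

# Illustrative rows only; these are not real MTA figures.
rows = [
    ("A", "t1", date(2016, 5, 3), 100, 50),
    ("A", "t2", date(2016, 5, 3), 30, 20),
    ("B", "t1", date(2016, 5, 3), 10, 10),
    ("A", "t1", date(2016, 5, 7), 40, 10),
]
totals = daily_station_totals(rows)
weekday_means = mean_by_weekday(totals)
```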

LOCATIONS

In order to map the stations, we linked the station names to their latitude and longitude. This information is kept in a different file (also obtained from the MTA's site), and the station names differ slightly between the 2 files, e.g., 'GRD CNTRL-42 ST' vs. 'Grand Central - 42 ST' (see figure below). So we used a Python library called fuzzywuzzy to match the names from the turnstile files to the names listed in the latitude/longitude file.

Project+Benson.jpg
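To illustrate the idea of fuzzy name matching without depending on a third-party package, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. It is not the fuzzywuzzy call used in the project, and the helper name `best_match` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    """Return (candidate, score) for the candidate most similar to name.

    Similarity is difflib's ratio on lowercased strings, a stand-in
    for the fuzzywuzzy scoring mentioned in the text.
    """
    def score(candidate):
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()

    best = max(candidates, key=score)
    return best, score(best)

# Hypothetical names standing in for the latitude/longitude file.
location_names = ["Times Sq - 42 ST", "Grand Central - 42 ST", "Canal St"]
match, similarity = best_match("GRD CNTRL-42 ST", location_names)
```

In practice it helps to inspect low-scoring matches by hand, since abbreviated station names can match the wrong station when several share a street number.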

INCOME DATA

We acquired the median household income from US Census Block Data 2015. It can be downloaded as a shapefile and contains latitude and longitude along with other geographic data.

BUSIEST STATIONS IN HIGH INCOME AREAS

Now that we have both the income and the subway traffic data sets, we can give each subway station a weight based on the level of income for that area (block) and choose the top stations based on traffic and income. 

NY_incomemap.png
Project+Benson+(3).jpg
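One simple way to combine the two data sets is sketched below. The weighting formula (income divided by the highest income, multiplied by traffic) is an assumption made for illustration; the exact weighting used in the project is not specified here, and the station names and numbers are made up.

```python
def rank_stations(traffic, income, top_n=8):
    """Rank stations by traffic weighted by relative income.

    Assumed scoring: traffic * (income / highest income).
    Stations missing from either input are skipped.
    """
    max_income = max(income.values())
    scores = {
        station: traffic[station] * (income[station] / max_income)
        for station in traffic
        if station in income
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative values only, not real ridership or census figures.
traffic = {"A": 1000, "B": 800, "C": 500}
income = {"A": 40000, "B": 100000, "C": 90000}
top_two = rank_stations(traffic, income, top_n=2)
```

With this scoring, a very busy station in a lower-income block can rank below a quieter station in a higher-income block, which matches the stated objective of finding high traffic in high income neighborhoods.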

MAPPING STATIONS + INCOME

Lastly, we mapped these stations onto the heatmap of income data. We used a Python library called Basemap for the mapping tasks.

Project+Benson+(1).jpg