Seattle Location History

Kivan Polimis, Mon 21 August 2017, How-to

The goal of this post is to visualize time spent in various Seattle neighborhoods using Google location data and Python.

Overview

  1. Setup
    • download data
    • install modules
  2. Data wrangling
    • data extraction
    • data exploration
  3. Working with Shapefiles in Python
  4. Prep data and pare down locations
  5. Compute your measurement metric
  6. Choropleth Plot
  7. Hexbin Map

Setup

  1. Use Google Takeout to download your Google location history
  2. If you've previously enabled Google location reporting on your smartphone, your GPS data is periodically uploaded to Google's servers. Google Takeout lets you download that location history.
    • When and how this data gets uploaded is opaque to the end user, but as you'll see below, Android appears to upload a GPS location roughly every 60 seconds. That's plenty of data to work with.
  3. After downloading your data, install the required modules

Google Takeout

Google Takeout is a Google service that lets users export their personal Google data. We'll use Takeout to download our raw location history as a one-time snapshot; since Google Latitude was retired, there's no API for accessing location history in real time.

Download location data:

  • Go to takeout. Uncheck all services except "Location History"
  • The data will be in JSON format, which works great for us. Download it in your favorite compression type.
  • When Google has finished creating your archive, you'll get an email notification and a link to download.
  • Download and unzip the file, and you should be looking at a LocationHistory.json file (a quick peek at its structure is sketched after this list).

Working with location data in Pandas: Pandas is an incredibly powerful tool that simplifies working with complex data types and performing statistical analysis in the style of R. Chris Albon has great primers on using Pandas here under the "Data Wrangling" section.
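
If you're curious about the raw structure before loading anything into Pandas, a quick peek like the sketch below shows the fields the rest of the post relies on (adjust the path to wherever you unzipped the archive):

import json

# Peek at the raw Takeout export
with open('LocationHistory.json', 'r') as f:
    raw = json.load(f)

print(len(raw['locations']))                # one record per GPS fix
print(sorted(raw['locations'][0].keys()))   # expect accuracy, latitudeE7, longitudeE7, timestampMs, ...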

Install modules

  • If you use Anaconda to manage your Python packages, I recommend creating a conda virtual environment for the dependencies. Run the lines under each step below in a terminal to create the environment, build a requirements.txt, and install the packages.
    • conda create -n test-env python=3.5 anaconda
    • source activate test-env
  • make a requirements.txt file for dependencies
    • (echo descartes; echo IPython; echo shapely; echo fiona; echo Basemap) >> requirements.txt
  • install requirements.txt
    • conda install --yes --file requirements.txt
  • Windows users:

After completing the setup, we'll read in the LocationHistory.json file from Google Takeout and create a DataFrame.

In [1]:
from __future__ import division
# utils bundles the shared imports used in the cells below
# (json, pandas as pd, numpy as np, datetime, fiona, Basemap, and the shapely helpers)
from utils import *

Data Wrangling

  • data extraction
In [2]:
with open('data/LocationHistory/2018/LocationHistory.json', 'r') as location_file:
    raw = json.loads(location_file.read())

# load the raw location records into a DataFrame
location_data = pd.DataFrame(raw['locations'])
del raw #free up some memory

# convert E7-scaled integer coordinates to decimal degrees
location_data['latitudeE7'] = location_data['latitudeE7']/float(1e7) 
location_data['longitudeE7'] = location_data['longitudeE7']/float(1e7)

# convert timestampMs to seconds
location_data['timestampMs'] = location_data['timestampMs'].map(lambda x: float(x)/1000) 
location_data['datetime'] = location_data.timestampMs.map(datetime.datetime.fromtimestamp)

# Rename fields based on the conversions
location_data.rename(columns={'latitudeE7':'latitude',
                              'longitudeE7':'longitude',
                              'timestampMs':'timestamp'}, inplace=True)

# Ignore locations with accuracy estimates over 1000m
location_data = location_data[location_data.accuracy < 1000]
location_data.reset_index(drop=True, inplace=True)

Explore Data

  • view data and datatypes
In [3]:
print(location_data.dtypes)
location_data.describe()
accuracy                     int64
activity                    object
altitude                   float64
heading                    float64
latitude                   float64
longitude                  float64
timestamp                  float64
velocity                   float64
verticalAccuracy           float64
datetime            datetime64[ns]
dtype: object
Out[3]:
accuracy altitude heading latitude longitude timestamp velocity verticalAccuracy
count 745660.000000 101260.000000 44100.000000 745660.000000 745660.000000 7.456600e+05 58874.000000 4921.000000
mean 58.997173 67.057525 186.597551 37.748367 -102.506537 1.417774e+09 7.769678 23.099776
std 125.358984 242.209547 101.643968 9.004123 23.609836 3.356510e+07 11.790783 45.139324
min 1.000000 -715.000000 0.000000 13.689757 -123.260751 1.376790e+09 0.000000 2.000000
25% 22.000000 -18.000000 98.000000 29.817569 -122.306596 1.391259e+09 0.000000 2.000000
50% 31.000000 2.000000 181.000000 29.986634 -95.246060 1.413249e+09 1.000000 2.000000
75% 50.000000 60.000000 270.000000 47.664284 -94.995603 1.428049e+09 13.000000 30.000000
max 999.000000 6738.000000 359.000000 50.105984 23.782015 1.519330e+09 208.000000 473.000000
  • accuracy code "999" may represent missingness (a quick count is sketched after this list)
  • find earliest and latest observations in the data
    • save for later
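
A quick count shows how common that 999 sentinel actually is (just a sketch; whether to drop those rows is a judgment call I leave alone here):

# Count fixes reporting the suspicious accuracy value of 999 (a possible "unknown accuracy" flag)
n_999 = (location_data['accuracy'] == 999).sum()
print("fixes with accuracy == 999: {} of {}".format(n_999, len(location_data)))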
In [4]:
print("earliest observed date: {}".format(min(location_data["datetime"]).strftime('%m-%d-%Y')))
print("latest observed date: {}".format(max(location_data["datetime"]).strftime('%m-%d-%Y')))

earliest_obs = min(location_data["datetime"]).strftime('%m-%d-%Y')
latest_obs = max(location_data["datetime"]).strftime('%m-%d-%Y')
earliest observed date: 08-17-2013
latest observed date: 02-22-2018
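
In the setup notes I mentioned that Android appears to upload a GPS location roughly every 60 seconds. You can check that claim against your own history by looking at the gaps between consecutive timestamps (a rough sketch; the sampling rate varies with battery, signal, and movement):

# Median gap, in seconds, between consecutive GPS fixes
sorted_ts = location_data['timestamp'].sort_values()
gaps = sorted_ts.diff().dropna()
print("median gap between fixes: {:.0f} seconds".format(gaps.median()))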
  • location_data is a Pandas DataFrame containing all your location history and related info.
  • Columns include latitude, longitude, and a timestamp; additional columns are accuracy, activity, altitude, heading, and velocity.
  • All we'll need are latitude, longitude, and time.
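
If you prefer to work with just those fields, a slimmed-down copy is a one-liner (optional; the rest of the post keeps the full DataFrame):

# Optional: keep only the columns the analysis below actually uses
slim = location_data[['latitude', 'longitude', 'datetime']].copy()
slim.head()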

Working with Shapefiles in Python

Shapefile is a widely-used data format for describing points, lines, and polygons. To work with shapefiles, Python gives us shapely. To read and write shapefiles, we'll use fiona.

To learn Shapely and write this blog post, I leaned heavily on this article from sensitivecities.com.

First up, you'll need to download shapefile data for the part of the world you're interested in plotting. I wanted to focus on my current home of Seattle, which like many cities provides city shapefile map data for free. It's even broken into city neighborhoods! The US Census Bureau provides a ton of national shapefiles here. Your city likely provides this kind of data too. Tom MacWright has a GIS with Python, Shapely, and Fiona overview with more detail on Python mapping with these tools.
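
Before wiring the shapefile into Basemap, it can help to peek at what it contains. The sketch below (using the same local path as the next cell; attribute names can vary by download) checks that the neighborhood-name field used later, S_HOOD, is present and that the layer really is in WGS84:

import fiona

# Inspect one record of the neighborhoods shapefile
with fiona.open('data/Seattle_Neighborhoods/WGS84/Neighborhoods.shp') as shp:
    first = next(iter(shp))
    print(list(first['properties']))   # expect a neighborhood-name field such as S_HOOD
    print(shp.crs)                     # coordinate reference system; should report WGS84 (EPSG:4326)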

Next, we'll need to import the shapefile data we downloaded from the data.seattle.gov link above.

In [5]:
shapefilename = 'data/Seattle_Neighborhoods/WGS84/Neighborhoods'
shp = fiona.open(shapefilename+'.shp')
coords = shp.bounds  # bounding box of the shapefile: (lon_min, lat_min, lon_max, lat_max)
shp.close()

width, height = coords[2] - coords[0], coords[3] - coords[1]
extra = 0.01  # pad the map extent by 1% on each side
  • Use Basemap to set up the map projection and read in the shapefile (a quick outline plot is sketched after this cell)
In [6]:
m = Basemap(
    projection='tmerc', ellps='WGS84',
    lon_0=np.mean([coords[0], coords[2]]),
    lat_0=np.mean([coords[1], coords[3]]),
    llcrnrlon=coords[0] - extra * width,
    llcrnrlat=coords[1] - (extra * height), 
    urcrnrlon=coords[2] + extra * width,
    urcrnrlat=coords[3] + (extra * height),
    resolution='i',  suppress_ticks=True)

_out = m.readshapefile(shapefilename, name='seattle', drawbounds=False, color='none', zorder=2)
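
If you want to confirm the shapefile loaded correctly before going further, you can sketch the raw neighborhood outlines with descartes (one of the packages in requirements.txt). This is just an optional sanity check, not part of the analysis itself; the choropleth and hexbin maps come later in the post.

import matplotlib.pyplot as plt
from descartes import PolygonPatch
from shapely.geometry import Polygon

fig, ax = plt.subplots(figsize=(8, 8))

# draw each neighborhood outline from the vertices Basemap just read in
for hood_points in m.seattle:
    ax.add_patch(PolygonPatch(Polygon(hood_points), fc='none', ec='#555555', lw=0.5, zorder=2))

# frame the axes with the projected shapefile bounds (calling m() converts lon/lat to map coordinates)
xmin, ymin = m(coords[0], coords[1])
xmax, ymax = m(coords[2], coords[3])
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.set_aspect('equal')
ax.axis('off')
plt.show()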

Prep data and pare down locations

The first step is to pare down your location history to only contain points within the map's borders.

In [7]:
# set up a map dataframe
df_map = pd.DataFrame({
    'poly': [Polygon(hood_points) for hood_points in m.seattle],
    'name': [hood['S_HOOD'] for hood in m.seattle_info]
})

# Convert our latitude and longitude into Basemap cartesian map coordinates
mapped_points = [Point(m(mapped_x, mapped_y)) for mapped_x, mapped_y in zip(location_data['longitude'], 
            location_data['latitude'])]
all_points = MultiPoint(mapped_points)

# Use prep to optimize the polygons for faster point-in-polygon tests
hood_polygons = MultiPolygon(list(df_map['poly'].values))
prepared_polygons = prep(hood_polygons)

# Filter out the points that do not fall within the map we're making
city_points_filter = filter(prepared_polygons.contains, all_points)
city_points_list = list(city_points_filter)
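
As an optional sanity check, you can see how much of your history actually falls inside the Seattle polygons:

# Share of GPS fixes that land inside the neighborhood polygons
print("{} of {} points fall within the map".format(len(city_points_list), len(mapped_points)))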
In [8]:
hood_polygons
Out[8]: (notebook output: an inline rendering of the Seattle neighborhood MultiPolygon)