Code
import os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoupIt’s Christmas and that means a full slate of NBA games. This time of year also provokes some great NBA discussions and the best NBA discussions are comparative. Arguments like: would Jordan’s ’96 Bulls would beat the ’17 Warriors?
Another comparison that is sure to create great debate is who had the best (or worst) Most Valuable Player (MVP) season in history? This question has a strong empirical dimension, in that we can observe quantifiable aspects across seasons, and can leverage programming to both gather and and analyze the data.
The MVP data will gather comes from basketball-reference.com. Basketball-reference is part of the Sports-Reference sites, “a group of sites providing both basic and sabermetric statistics and resources for sports fans everywhere. [Sports-Reference aims] to be the easiest-to-use, fastest, most complete sources for sports statistics anywhere” (sports-reference.com).
In this post, we will gather and pre-process all the data for the a multi-part series to determine: 1. the MVP finalist with best case for winning their year 2. predict the 2018-2019 MVP
Let the shade begin
Beautiful Soup to parse webpages into .csvPandas DataFrameIn [1]:
In [2]:
In [3]:
The column headers we need for our DataFrame are found in the th element
In [4]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
print(column_headers)
column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced')][:-7]
print(column_headers)['Shooting', 'Advanced', 'Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '2017-18', '2016-17', '2015-16', '2014-15', '2013-14', '2012-13', '2011-12']
['Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
table_rows is a list of tag elements.In [6]:
<class 'list'>
<tr><th class="left " data-stat="season" scope="row"><a href="/leagues/NBA_2018.html">2017-18</a></th><td class="left " data-stat="lg_id"><a href="/leagues/NBA_2018.html">NBA</a></td><td class="left " csk="Harden,James" data-append-csv="hardeja01" data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td><td class="center " data-stat="voting"> (<a href="/awards/awards_2018.html#mvp">V</a>)</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/HOU/2018.html">HOU</a></td><td class="right " data-stat="g">72</td><td class="right " data-stat="mp_per_g">35.4</td><td class="right " data-stat="pts_per_g">30.4</td><td class="right " data-stat="trb_per_g">5.4</td><td class="right " data-stat="ast_per_g">8.8</td><td class="right " data-stat="stl_per_g">1.8</td><td class="right " data-stat="blk_per_g">0.7</td><td class="right " data-stat="fg_pct">.449</td><td class="right " data-stat="fg3_pct">.367</td><td class="right " data-stat="ft_pct">.858</td><td class="right " data-stat="ws">15.4</td><td class="right " data-stat="ws_per_48">.289</td></tr>
td (or table data) elements.table_rows. The comments should walk you through what each part of the function does.In [7]:
def extract_mvp_data(table_rows):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP observation
"""
# create the empty list to store the player data
player_data = []
for row in table_rows: # for each row do the following
# Get the text for each table data (td) element in the row
player_year = [td.get_text() for td in row.find_all("th")]
player_list = [td.get_text() for td in row.find_all("td")]
# there are some empty table rows, which are the repeated
# column headers in the table
# we skip over those rows and and continue the for loop
if not player_list:
continue
# Now append the data to list of data
player_info = player_year+player_list
player_data.append(player_info)
return player_dataDataFrame with the MVP dataIn [8]:
In [9]:
In [10]:
the MVP dataframe has 113 rows (player-year observations) and 18 columns
| Season | Lg | Player | Voting | Age | Tm | G | MP | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-18 | NBA | James Harden | (V) | 28 | HOU | 72 | 35.4 | 30.4 | 5.4 | 8.8 | 1.8 | 0.7 | .449 | .367 | .858 | 15.4 | .289 |
| 1 | 2016-17 | NBA | Russell Westbrook | (V) | 28 | OKC | 81 | 34.6 | 31.6 | 10.7 | 10.4 | 1.6 | 0.4 | .425 | .343 | .845 | 13.1 | .224 |
| 2 | 2015-16 | NBA | Stephen Curry | (V) | 27 | GSW | 79 | 34.2 | 30.1 | 5.4 | 6.7 | 2.1 | 0.2 | .504 | .454 | .908 | 17.9 | .318 |
| 3 | 2014-15 | NBA | Stephen Curry | (V) | 26 | GSW | 80 | 32.7 | 23.8 | 4.3 | 7.7 | 2.0 | 0.2 | .487 | .443 | .914 | 15.7 | .288 |
| 4 | 2013-14 | NBA | Kevin Durant | (V) | 25 | OKC | 81 | 38.5 | 32.0 | 7.4 | 5.5 | 1.3 | 0.7 | .503 | .391 | .873 | 19.2 | .295 |
In [11]:
In [12]:
# get the html
# examine webpage
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
webbrowser.open_new_tab(url)
# create the BeautifulSoup object
soup_finalist = BeautifulSoup(html_finalist, "lxml")
# Extract the necessary values for the column headers from the table and store them as a list
column_headers_finalist = [th.getText() for th in soup_finalist.findAll('th', limit=30)]
column_headers_finalist = [s for s in column_headers_finalist if len(s) != 0]
column_headers_finalist = column_headers_finalist[1:]
print("raw column names in finalist table: {}".format(column_headers_finalist))
column_headers_finalist = [e for e in column_headers_finalist if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print("formatted column names in finalist table: {}".format(column_headers_finalist))
print("{} columns in finalist table".format(len(column_headers_finalist)))
# The data is found within the `tr` elements of the first `tbody` element
# We want the elements from the 3rd row and on
table_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(type(table_rows_finalist))
table_rows_finalist[-1] # take a look at the last rowraw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
<class 'bs4.element.ResultSet'>
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
In [13]:
def extract_player_data(table_rows):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP finalist observations
"""
# create the empty list to store the player data
player_data = []
for row in table_rows: # for each row do the following
# Get the text for each table data (td) element in the row
player_rank = [td.get_text() for td in row.find_all("th")]
player_list = [td.get_text() for td in row.find_all("td")]
# there are some empty table rows, which are the repeated
# column headers in the table
# we skip over those rows and and continue the for loop
if not player_list:
continue
# Now append the data to list of data
player_info = player_rank+player_list
player_data.append(player_info)
return player_dataIn [14]:
# extract the data we want
data = extract_player_data(table_rows_finalist)
# and then store it in a DataFrame
example_player_df = pd.DataFrame(data)
example_player_df.columns = column_headers_finalist
print("the MVP finalist dataframe has {} rows (player-year observations) and {} columns".format(example_player_df.shape[0],
example_player_df.shape[1]))
example_player_df.tail(6)the MVP finalist dataframe has 13 rows (player-year observations) and 20 columns
| Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | DeMar DeRozan | 28 | TOR | 0.0 | 32.0 | 1010 | 0.032 | 80 | 33.9 | 23.0 | 3.9 | 5.2 | 1.1 | 0.3 | .456 | .310 | .825 | 9.6 | .170 |
| 8 | 9 | LaMarcus Aldridge | 32 | SAS | 0.0 | 6.0 | 1010 | 0.006 | 75 | 33.5 | 23.1 | 8.5 | 2.0 | 0.6 | 1.2 | .510 | .293 | .837 | 10.9 | .209 |
| 9 | 10T | Jimmy Butler | 28 | MIN | 0.0 | 5.0 | 1010 | 0.005 | 59 | 36.7 | 22.2 | 5.3 | 4.9 | 2.0 | 0.4 | .474 | .350 | .854 | 8.9 | .198 |
| 10 | 10T | Stephen Curry | 29 | GSW | 0.0 | 5.0 | 1010 | 0.005 | 51 | 32.0 | 26.4 | 5.1 | 6.1 | 1.6 | 0.2 | .495 | .423 | .921 | 9.1 | .267 |
| 11 | 12 | Joel Embiid | 23 | PHI | 0.0 | 4.0 | 1010 | 0.004 | 63 | 30.3 | 22.9 | 11.0 | 3.2 | 0.6 | 1.8 | .483 | .308 | .769 | 6.2 | .155 |
| 12 | 13 | Victor Oladipo | 25 | IND | 0.0 | 2.0 | 1010 | 0.002 | 75 | 34.0 | 23.1 | 5.2 | 4.3 | 2.4 | 0.8 | .477 | .371 | .799 | 8.2 | .155 |
Scraping the for finalist data since 1956 follows is essentially the same process as above, just repeated for each year, using a for loop.
DataFrame for each year of MVP finalist data, and append it to a large list of DataFrames that contains all the MVP finalists data. We will also have a separate list that will contain any errors and the url associated with that error. This will let us know if there are any issues with our scraper, and which url is causing the error.In [15]:
In [17]:
# The url template that we pass in the finalist year info
url_template = "https://www.basketball-reference.com/awards/awards_{year}.html#mvp"
# for each year from 1956 to (and including) 2018
for year in range(1956, 2019):
# Use try/except block to catch and inspect any urls that cause an error
try:
# get the MVP finalist data url
url = url_template.format(year=year)
# get the html
html = urllib.request.urlopen(url)
# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# get the column headers
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
# select the data from the table
table_rows = soup.find_all("tr")[2:]
# extract the player data from the table rows
player_data = extract_player_data(table_rows)
# create the dataframe for the current year's mvp finalist data
# subset to only include MVP finalists
year_df = pd.DataFrame(player_data)
year_df.columns = column_headers
year_df = year_df.loc[year_df["Pts Max"]==year_df["Pts Max"].unique()[0]]
# add the year of the MVP finalist data to the dataframe
year_df.insert(0, "Year", year)
# append the current dataframe to the list of dataframes
mvp_finalist_list.append(year_df)
except Exception as e:
# Store the url and the error it causes in a list
error =[url, e]
# then append it to the list of errors
errors_list.append(error)In [18]:
DataFrames we scraped and create one large DataFrame containing all the finalist dataIn [21]:
[ Year Rank Player Age Tm First Pts Won Pts Max Share G ... \
0 1956 1 Bob Pettit 23 STL 33.0 33.0 80 0.413 72 ...
1 1956 2 Paul Arizin 27 PHW 21.0 21.0 80 0.263 72 ...
2 1956 3 Bob Cousy 27 BOS 11.0 11.0 80 0.138 72 ...
3 1956 4 Mel Hutchins 27 FTW 9.0 9.0 80 0.113 66 ...
4 1956 5T Dolph Schayes 27 SYR 2.0 2.0 80 0.025 72 ...
5 1956 5T Bill Sharman 29 BOS 2.0 2.0 80 0.025 72 ...
6 1956 7T Tom Gola 23 PHW 1.0 1.0 80 0.013 68 ...
7 1956 7T Maurice Stokes 22 ROC 1.0 1.0 80 0.013 67 ...
PTS TRB AST STL BLK FG% 3P% FT% WS WS/48
0 25.7 16.2 2.6 .429 .736 13.8 .236
1 24.2 7.5 2.6 .448 .810 12.2 .214
2 18.8 6.8 8.9 .360 .844 6.8 .119
3 12.0 7.5 2.7 .425 .643 4.4 .095
4 20.4 12.4 2.8 .387 .858 11.8 .225
5 19.9 3.6 4.7 .438 .867 8.8 .157
6 10.8 9.1 5.9 .412 .733 6.5 .132
7 16.8 16.3 4.9 .354 .714 6.0 .125
[8 rows x 21 columns]]
In [22]:
In [23]:
In [24]:
Now that we fixed up the necessary columns, let’s write out the raw data to a CSV file.
In [25]:
In [26]:
| Year | Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | ... | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1956 | 1 | Bob Pettit | 23 | STL | 33.0 | 33.0 | 80 | 0.413 | 72 | ... | 25.7 | 16.2 | 2.6 | .429 | .736 | 13.8 | .236 | |||
| 1 | 1956 | 2 | Paul Arizin | 27 | PHW | 21.0 | 21.0 | 80 | 0.263 | 72 | ... | 24.2 | 7.5 | 2.6 | .448 | .810 | 12.2 | .214 | |||
| 2 | 1956 | 3 | Bob Cousy | 27 | BOS | 11.0 | 11.0 | 80 | 0.138 | 72 | ... | 18.8 | 6.8 | 8.9 | .360 | .844 | 6.8 | .119 | |||
| 3 | 1956 | 4 | Mel Hutchins | 27 | FTW | 9.0 | 9.0 | 80 | 0.113 | 66 | ... | 12.0 | 7.5 | 2.7 | .425 | .643 | 4.4 | .095 | |||
| 4 | 1956 | 5T | Dolph Schayes | 27 | SYR | 2.0 | 2.0 | 80 | 0.025 | 72 | ... | 20.4 | 12.4 | 2.8 | .387 | .858 | 11.8 | .225 |
5 rows × 21 columns
In [27]:
In [28]:
In [29]:
| Season | Lg | Player | Voting | Age | Tm | G | MP | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-18 | NBA | James Harden | (V) | 28.0 | HOU | 72.0 | 35.4 | 30.4 | 5.4 | 8.8 | 1.8 | 0.7 | 0.449 | 0.367 | 0.858 | 15.4 | 0.289 |
| 1 | 2016-17 | NBA | Russell Westbrook | (V) | 28.0 | OKC | 81.0 | 34.6 | 31.6 | 10.7 | 10.4 | 1.6 | 0.4 | 0.425 | 0.343 | 0.845 | 13.1 | 0.224 |
| 2 | 2015-16 | NBA | Stephen Curry | (V) | 27.0 | GSW | 79.0 | 34.2 | 30.1 | 5.4 | 6.7 | 2.1 | 0.2 | 0.504 | 0.454 | 0.908 | 17.9 | 0.318 |
| 3 | 2014-15 | NBA | Stephen Curry | (V) | 26.0 | GSW | 80.0 | 32.7 | 23.8 | 4.3 | 7.7 | 2.0 | 0.2 | 0.487 | 0.443 | 0.914 | 15.7 | 0.288 |
| 4 | 2013-14 | NBA | Kevin Durant | (V) | 25.0 | OKC | 81.0 | 38.5 | 32.0 | 7.4 | 5.5 | 1.3 | 0.7 | 0.503 | 0.391 | 0.873 | 19.2 | 0.295 |
In [30]:
mvp_data_columns_dict = {'Season':'season', 'Lg':'league', 'Player':'player', 'Voting': 'voting', 'Tm': 'team', 'Age': 'age',
'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points', 'TRB': 'avg_rebounds',
'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks', 'FG%': 'field_goal_pct',
'3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
'WS/48': 'win_shares_per_48'
}
mvp_finalist_columns_dict = {'Year':'year', 'Player':'player', 'Rank': 'rank', 'Tm': 'team', 'Age': 'age',
'First': 'first_place_votes', 'Pts Won': 'points_won', 'Pts Max': 'points_max',
'Share':'vote_share', 'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points',
'TRB': 'avg_rebounds', 'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks',
'FG%': 'field_goal_pct', '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
'WS/48': 'win_shares_per_48'
}In [31]:
| season | league | player | voting | age | team | games_played | avg_minutes | avg_points | avg_rebounds | avg_assists | avg_steals | avg_blocks | field_goal_pct | three_pt_pct | free_throw_pct | win_shares | win_shares_per_48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-18 | NBA | James Harden | (V) | 28.0 | HOU | 72.0 | 35.4 | 30.4 | 5.4 | 8.8 | 1.8 | 0.7 | 0.449 | 0.367 | 0.858 | 15.4 | 0.289 |
| 1 | 2016-17 | NBA | Russell Westbrook | (V) | 28.0 | OKC | 81.0 | 34.6 | 31.6 | 10.7 | 10.4 | 1.6 | 0.4 | 0.425 | 0.343 | 0.845 | 13.1 | 0.224 |
| 2 | 2015-16 | NBA | Stephen Curry | (V) | 27.0 | GSW | 79.0 | 34.2 | 30.1 | 5.4 | 6.7 | 2.1 | 0.2 | 0.504 | 0.454 | 0.908 | 17.9 | 0.318 |
| 3 | 2014-15 | NBA | Stephen Curry | (V) | 26.0 | GSW | 80.0 | 32.7 | 23.8 | 4.3 | 7.7 | 2.0 | 0.2 | 0.487 | 0.443 | 0.914 | 15.7 | 0.288 |
| 4 | 2013-14 | NBA | Kevin Durant | (V) | 25.0 | OKC | 81.0 | 38.5 | 32.0 | 7.4 | 5.5 | 1.3 | 0.7 | 0.503 | 0.391 | 0.873 | 19.2 | 0.295 |
In [32]:
| year | rank | player | age | team | first_place_votes | points_won | points_max | vote_share | games_played | ... | avg_points | avg_rebounds | avg_assists | avg_steals | avg_blocks | field_goal_pct | three_pt_pct | free_throw_pct | win_shares | win_shares_per_48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1956 | 1 | Bob Pettit | 23 | STL | 33.0 | 33.0 | 80 | 0.413 | 72 | ... | 25.7 | 16.2 | 2.6 | NaN | NaN | 0.429 | NaN | 0.736 | 13.8 | 0.236 |
| 1 | 1956 | 2 | Paul Arizin | 27 | PHW | 21.0 | 21.0 | 80 | 0.263 | 72 | ... | 24.2 | 7.5 | 2.6 | NaN | NaN | 0.448 | NaN | 0.810 | 12.2 | 0.214 |
| 2 | 1956 | 3 | Bob Cousy | 27 | BOS | 11.0 | 11.0 | 80 | 0.138 | 72 | ... | 18.8 | 6.8 | 8.9 | NaN | NaN | 0.360 | NaN | 0.844 | 6.8 | 0.119 |
| 3 | 1956 | 4 | Mel Hutchins | 27 | FTW | 9.0 | 9.0 | 80 | 0.113 | 66 | ... | 12.0 | 7.5 | 2.7 | NaN | NaN | 0.425 | NaN | 0.643 | 4.4 | 0.095 |
| 4 | 1956 | 5T | Dolph Schayes | 27 | SYR | 2.0 | 2.0 | 80 | 0.025 | 72 | ... | 20.4 | 12.4 | 2.8 | NaN | NaN | 0.387 | NaN | 0.858 | 11.8 | 0.225 |
5 rows × 21 columns
In [33]:
<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, 0 to 112
Data columns (total 18 columns):
season 113 non-null object
league 113 non-null object
player 113 non-null object
voting 73 non-null object
age 73 non-null float64
team 73 non-null object
games_played 73 non-null float64
avg_minutes 73 non-null float64
avg_points 73 non-null float64
avg_rebounds 73 non-null float64
avg_assists 73 non-null float64
avg_steals 50 non-null float64
avg_blocks 50 non-null float64
field_goal_pct 73 non-null float64
three_pt_pct 48 non-null float64
free_throw_pct 73 non-null float64
win_shares 73 non-null float64
win_shares_per_48 73 non-null float64
dtypes: float64(13), object(5)
memory usage: 16.8+ KB
In [34]:
<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year 972 non-null int64
rank 972 non-null object
player 972 non-null object
age 972 non-null int64
team 972 non-null object
first_place_votes 972 non-null float64
points_won 972 non-null float64
points_max 972 non-null int64
vote_share 972 non-null float64
games_played 972 non-null int64
avg_minutes 972 non-null float64
avg_points 972 non-null float64
avg_rebounds 972 non-null float64
avg_assists 972 non-null float64
avg_steals 756 non-null float64
avg_blocks 756 non-null float64
field_goal_pct 972 non-null float64
three_pt_pct 621 non-null float64
free_throw_pct 972 non-null float64
win_shares 972 non-null float64
win_shares_per_48 972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB
We are not done yet. A lot of out numeric columns are missing data because players didn’t accumulate any of those stats. For example, the 3 point line is introduced in 1982 and all players in preceding seasons don’t have this statistic. Additionally, we want to select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.
In [35]:
# Get the column names for the numeric columns
num_cols_mvp = mvp_data_df_clean.columns[mvp_data_df_clean.dtypes != object]
num_cols_finalist = mvp_finalist_df_clean.columns[mvp_finalist_df_clean.dtypes != object]
# Replace all NaNs with 0
mvp_data_df_clean.loc[:, num_cols_mvp] = mvp_data_df_clean.loc[:, num_cols_mvp].fillna(0)
mvp_finalist_df_clean.loc[:, num_cols_finalist] = mvp_finalist_df_clean.loc[:, num_cols_finalist].fillna(0)In [36]:
<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year 972 non-null int64
rank 972 non-null object
player 972 non-null object
age 972 non-null int64
team 972 non-null object
first_place_votes 972 non-null float64
points_won 972 non-null float64
points_max 972 non-null int64
vote_share 972 non-null float64
games_played 972 non-null int64
avg_minutes 972 non-null float64
avg_points 972 non-null float64
avg_rebounds 972 non-null float64
avg_assists 972 non-null float64
avg_steals 972 non-null float64
avg_blocks 972 non-null float64
field_goal_pct 972 non-null float64
three_pt_pct 972 non-null float64
free_throw_pct 972 non-null float64
win_shares 972 non-null float64
win_shares_per_48 972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB
In [37]:
We are finally done cleaning the data and now we can save it to a CSV file.
In [38]:
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['player'])]
mvp_data_df_clean.sort_values(['season'], ascending=False, axis=0, inplace=True)
mvp_finalist_df_clean = mvp_finalist_df_clean[pd.notnull(mvp_finalist_df_clean['player'])]
mvp_finalist_df_clean.sort_values(['year', 'points_won'], ascending=False, axis=0, inplace=True)In [39]:
(63, 18)
| season | league | player | voting | age | team | games_played | avg_minutes | avg_points | avg_rebounds | avg_assists | avg_steals | avg_blocks | field_goal_pct | three_pt_pct | free_throw_pct | win_shares | win_shares_per_48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-18 | NBA | James Harden | (V) | 28.0 | HOU | 72.0 | 35.4 | 30.4 | 5.4 | 8.8 | 1.8 | 0.7 | 0.449 | 0.367 | 0.858 | 15.4 | 0.289 |
| 1 | 2016-17 | NBA | Russell Westbrook | (V) | 28.0 | OKC | 81.0 | 34.6 | 31.6 | 10.7 | 10.4 | 1.6 | 0.4 | 0.425 | 0.343 | 0.845 | 13.1 | 0.224 |
| 2 | 2015-16 | NBA | Stephen Curry | (V) | 27.0 | GSW | 79.0 | 34.2 | 30.1 | 5.4 | 6.7 | 2.1 | 0.2 | 0.504 | 0.454 | 0.908 | 17.9 | 0.318 |
| 3 | 2014-15 | NBA | Stephen Curry | (V) | 26.0 | GSW | 80.0 | 32.7 | 23.8 | 4.3 | 7.7 | 2.0 | 0.2 | 0.487 | 0.443 | 0.914 | 15.7 | 0.288 |
| 4 | 2013-14 | NBA | Kevin Durant | (V) | 25.0 | OKC | 81.0 | 38.5 | 32.0 | 7.4 | 5.5 | 1.3 | 0.7 | 0.503 | 0.391 | 0.873 | 19.2 | 0.295 |
In [40]:
(972, 21)
| year | rank | player | age | team | first_place_votes | points_won | points_max | vote_share | games_played | ... | avg_points | avg_rebounds | avg_assists | avg_steals | avg_blocks | field_goal_pct | three_pt_pct | free_throw_pct | win_shares | win_shares_per_48 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 959 | 2018 | 1 | James Harden | 28 | HOU | 86.0 | 965.0 | 1010 | 0.955 | 72 | ... | 30.4 | 5.4 | 8.8 | 1.8 | 0.7 | 0.449 | 0.367 | 0.858 | 15.4 | 0.289 |
| 960 | 2018 | 2 | LeBron James | 33 | CLE | 15.0 | 738.0 | 1010 | 0.731 | 82 | ... | 27.5 | 8.6 | 9.1 | 1.4 | 0.9 | 0.542 | 0.367 | 0.731 | 14.0 | 0.221 |
| 961 | 2018 | 3 | Anthony Davis | 24 | NOP | 0.0 | 445.0 | 1010 | 0.441 | 75 | ... | 28.1 | 11.1 | 2.3 | 1.5 | 2.6 | 0.534 | 0.340 | 0.828 | 13.7 | 0.241 |
| 962 | 2018 | 4 | Damian Lillard | 27 | POR | 0.0 | 207.0 | 1010 | 0.205 | 73 | ... | 26.9 | 4.5 | 6.6 | 1.1 | 0.4 | 0.439 | 0.361 | 0.916 | 12.6 | 0.227 |
| 963 | 2018 | 5 | Russell Westbrook | 29 | OKC | 0.0 | 76.0 | 1010 | 0.075 | 80 | ... | 25.4 | 10.1 | 10.3 | 1.8 | 0.3 | 0.449 | 0.298 | 0.737 | 10.1 | 0.166 |
5 rows × 21 columns
Beautiful Soup module to parse tables on multiple webpage into .csvPandas moduleIn [41]:
import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")last updated: 2019-05-27 22:35
System and module version information:
Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3