NBA MVP Comparisons

Part 1

25 ^th December 2018

It’s Christmas and that means a full slate of NBA games. This time of year also provokes some great NBA discussions and the best NBA discussions are comparative. Arguments like: would Jordan’s ’96 Bulls would beat the ’17 Warriors?

Another comparison that is sure to create great debate is who had the best (or worst) Most Valuable Player (MVP) season in history? This question has a strong empirical dimension, in that we can observe quantifiable aspects across seasons, and can leverage programming to both gather and and analyze the data.

The MVP data will gather comes from basketball-reference.com. Basketball-reference is part of the Sports-Reference sites, “a group of sites providing both basic and sabermetric statistics and resources for sports fans everywhere. [Sports-Reference aims] to be the easiest-to-use, fastest, most complete sources for sports statistics anywhere” (sports-reference.com).

In this post, we will gather and pre-process all the data for the a multi-part series to determine: 1. the MVP finalist with best case for winning their year 2. predict the 2018-2019 MVP

Let the shade begin

Outline

Import modules
Examine html structure of webpage
Use a function with Beautiful Soup to parse webpages into .csv
Analyze .csv of webpage as a Pandas DataFrame
Process the data

import relevant modules
standard library modules:
- os
- urllib
- webbrowser
- datetime
open source modules:
- pandas
- Beautiful Soup

In [1]:

Code

import os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

Let’s examine the webpage with all the MVP data from the 1966-1956 season to the 2017-2018 season

In [2]:

Code

url = 'https://www.basketball-reference.com/awards/mvp.html'
webbrowser.open_new_tab(url)

True

In [3]:

Code

# get the html
html = urllib.request.urlopen(url)

# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

Scraping the Column Headers

The column headers we need for our DataFrame are found in the th element

In [4]:

Code

# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
print(column_headers)
column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced')][:-7]
print(column_headers)

['Shooting', 'Advanced', 'Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '2017-18', '2016-17', '2015-16', '2014-15', '2013-14', '2012-13', '2011-12']
['Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']

In [5]:

Code

len(column_headers)

Scraping the Data

Note that table_rows is a list of tag elements.

In [6]:

Code

# The data is found within the table rows
# We want the elements from the 3rd row and on
table_rows = soup.find_all("tr")[2:]
print(type(table_rows))
table_rows[0] # take a look at the first row

<class 'list'>

<tr><th class="left " data-stat="season" scope="row"><a href="/leagues/NBA_2018.html">2017-18</a></th><td class="left " data-stat="lg_id"><a href="/leagues/NBA_2018.html">NBA</a></td><td class="left " csk="Harden,James" data-append-csv="hardeja01" data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td><td class="center " data-stat="voting"> (<a href="/awards/awards_2018.html#mvp">V</a>)</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/HOU/2018.html">HOU</a></td><td class="right " data-stat="g">72</td><td class="right " data-stat="mp_per_g">35.4</td><td class="right " data-stat="pts_per_g">30.4</td><td class="right " data-stat="trb_per_g">5.4</td><td class="right " data-stat="ast_per_g">8.8</td><td class="right " data-stat="stl_per_g">1.8</td><td class="right " data-stat="blk_per_g">0.7</td><td class="right " data-stat="fg_pct">.449</td><td class="right " data-stat="fg3_pct">.367</td><td class="right " data-stat="ft_pct">.858</td><td class="right " data-stat="ws">15.4</td><td class="right " data-stat="ws_per_48">.289</td></tr>

The data we want for each player is found within the the td (or table data) elements.
Below I’ve created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.

In [7]:

Code

def extract_mvp_data(table_rows):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP observation
    """
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_year = [td.get_text() for td in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]
              
        # there are some empty table rows, which are the repeated 
        # column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the data to list of data
        
        player_info = player_year+player_list
        player_data.append(player_info)
        
    return player_data

now we can create a DataFrame with the MVP data

In [8]:

Code

# extract the data we want
mvp_data = extract_mvp_data(table_rows)
# and then store it in a DataFrame
mvp_data_df  = pd.DataFrame(mvp_data)

In [9]:

Code

mvp_data_df[0:1]

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
0	2017-18	NBA	James Harden	(V)	28	HOU	72	35.4	30.4	5.4	8.8	1.8	0.7	.449	.367	.858	15.4	.289

rename the columns
view the data

In [10]:

Code

mvp_data_df.columns = column_headers
print("the MVP dataframe has {} rows (player-year observations) and  {} columns".format(mvp_data_df.shape[0],
                                                                                        mvp_data_df.shape[1]))
mvp_data_df.head()

the MVP dataframe has 113 rows (player-year observations) and  18 columns

	Season	Lg	Player	Voting	Age	Tm	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
0	2017-18	NBA	James Harden	(V)	28	HOU	72	35.4	30.4	5.4	8.8	1.8	0.7	.449	.367	.858	15.4	.289
1	2016-17	NBA	Russell Westbrook	(V)	28	OKC	81	34.6	31.6	10.7	10.4	1.6	0.4	.425	.343	.845	13.1	.224
2	2015-16	NBA	Stephen Curry	(V)	27	GSW	79	34.2	30.1	5.4	6.7	2.1	0.2	.504	.454	.908	17.9	.318
3	2014-15	NBA	Stephen Curry	(V)	26	GSW	80	32.7	23.8	4.3	7.7	2.0	0.2	.487	.443	.914	15.7	.288
4	2013-14	NBA	Kevin Durant	(V)	25	OKC	81	38.5	32.0	7.4	5.5	1.3	0.7	.503	.391	.873	19.2	.295

now we need the data on all the finalist for the question:
- which finalist had the best argument for winning that year?

In [11]:

Code

url_2018_mvp_finalist = 'https://www.basketball-reference.com/awards/awards_2018.html#mvp'

In [12]:

Code

# get the html
# examine webpage
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
webbrowser.open_new_tab(url)

# create the BeautifulSoup object
soup_finalist = BeautifulSoup(html_finalist, "lxml") 

# Extract the necessary values for the column headers from the table and store them as a list
column_headers_finalist = [th.getText() for th in soup_finalist.findAll('th', limit=30)]
column_headers_finalist = [s for s in column_headers_finalist if len(s) != 0]
column_headers_finalist = column_headers_finalist[1:]

print("raw column names in finalist table: {}".format(column_headers_finalist))
column_headers_finalist = [e for e in column_headers_finalist if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print("formatted column names in finalist table: {}".format(column_headers_finalist))
print("{} columns in finalist table".format(len(column_headers_finalist)))

# The data is found within the `tr` elements of the first `tbody` element
# We want the elements from the 3rd row and on
table_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(type(table_rows_finalist))
table_rows_finalist[-1] # take a look at the last row

raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
<class 'bs4.element.ResultSet'>

<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>

create a function to extract MVP finalist data

In [13]:

Code

def extract_player_data(table_rows):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP finalist observations
    """
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_rank = [td.get_text() for td in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]
        
        
        # there are some empty table rows, which are the repeated 
        # column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the data to list of data
        
        player_info = player_rank+player_list
        player_data.append(player_info)
        
    return player_data

In [14]:

Code

# extract the data we want
data = extract_player_data(table_rows_finalist)

# and then store it in a DataFrame
example_player_df  = pd.DataFrame(data)
example_player_df.columns = column_headers_finalist
print("the MVP finalist dataframe has {} rows (player-year observations) and  {} columns".format(example_player_df.shape[0],
                                                                                        example_player_df.shape[1]))
example_player_df.tail(6)

the MVP finalist dataframe has 13 rows (player-year observations) and  20 columns

	Rank	Player	Age	Tm	Pts Won	Pts Max	Share	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
7	8	DeMar DeRozan	28	TOR	32.0	1010	0.032	80	33.9	23.0	3.9	5.2	1.1	0.3	.456	.310	.825	9.6	.170
8	9	LaMarcus Aldridge	32	SAS	6.0	1010	0.006	75	33.5	23.1	8.5	2.0	0.6	1.2	.510	.293	.837	10.9	.209
9	10T	Jimmy Butler	28	MIN	5.0	1010	0.005	59	36.7	22.2	5.3	4.9	2.0	0.4	.474	.350	.854	8.9	.198
10	10T	Stephen Curry	29	GSW	5.0	1010	0.005	51	32.0	26.4	5.1	6.1	1.6	0.2	.495	.423	.921	9.1	.267
11	12	Joel Embiid	23	PHI	4.0	1010	0.004	63	30.3	22.9	11.0	3.2	0.6	1.8	.483	.308	.769	6.2	.155
12	13	Victor Oladipo	25	IND	2.0	1010	0.002	75	34.0	23.1	5.2	4.3	2.4	0.8	.477	.371	.799	8.2	.155

Scraping the Data for All MVP Finalists Since 1956

Scraping the for finalist data since 1956 follows is essentially the same process as above, just repeated for each year, using a for loop.

As we loop over the years, we will create a DataFrame for each year of MVP finalist data, and append it to a large list of DataFrames that contains all the MVP finalists data. We will also have a separate list that will contain any errors and the url associated with that error. This will let us know if there are any issues with our scraper, and which url is causing the error.

In [15]:

Code

# Create an empty list that will contain all the dataframes
# (one dataframe for all finalist dataframes)
mvp_finalist_list = []

# a list to store any errors that may come up while scraping
errors_list = []

let’s time how long this loop takes

In [16]:

Code

loop_start = datetime.now()
print(loop_start)

2019-05-25 15:10:57.567450

In [17]:

Code

# The url template that we pass in the finalist year info
url_template = "https://www.basketball-reference.com/awards/awards_{year}.html#mvp"

# for each year from 1956 to (and including) 2018
for year in range(1956, 2019): 
    
    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the MVP finalist data url
        url = url_template.format(year=year)

        # get the html
        html = urllib.request.urlopen(url)

        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml") 

        # get the column headers
        column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]
        column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
       
        # select the data from the table 
        table_rows = soup.find_all("tr")[2:]

        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)
        
        # create the dataframe for the current year's mvp finalist data
        # subset to only include MVP finalists
        year_df = pd.DataFrame(player_data)
        year_df.columns = column_headers
        year_df = year_df.loc[year_df["Pts Max"]==year_df["Pts Max"].unique()[0]]

        # add the year of the MVP finalist data to the dataframe
        year_df.insert(0, "Year", year)
 
        # append the current dataframe to the list of dataframes
        mvp_finalist_list.append(year_df)
        
    except Exception as e:
        # Store the url and the error it causes in a list
        error =[url, e] 
        # then append it to the list of errors
        errors_list.append(error)

In [18]:

Code

loop_stop = datetime.now()
print(loop_stop)
print("the loop took {}".format(loop_stop-loop_start))

2019-05-25 15:12:03.441050
the loop took 0:01:05.873600

the loop took ~ 1 minute

In [19]:

Code

print(len(errors_list))
errors_list

[]

We don’t get any errors, so that’s good.
Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the finalist data

In [20]:

Code

print(type(mvp_finalist_list))
print(len(mvp_finalist_list))

<class 'list'>
63

In [21]:

Code

mvp_finalist_list[0:1]

[   Year Rank          Player Age   Tm First Pts Won Pts Max  Share   G  ...   \
 0  1956    1      Bob Pettit  23  STL  33.0    33.0      80  0.413  72  ...    
 1  1956    2     Paul Arizin  27  PHW  21.0    21.0      80  0.263  72  ...    
 2  1956    3       Bob Cousy  27  BOS  11.0    11.0      80  0.138  72  ...    
 3  1956    4    Mel Hutchins  27  FTW   9.0     9.0      80  0.113  66  ...    
 4  1956   5T   Dolph Schayes  27  SYR   2.0     2.0      80  0.025  72  ...    
 5  1956   5T    Bill Sharman  29  BOS   2.0     2.0      80  0.025  72  ...    
 6  1956   7T        Tom Gola  23  PHW   1.0     1.0      80  0.013  68  ...    
 7  1956   7T  Maurice Stokes  22  ROC   1.0     1.0      80  0.013  67  ...    
 
     PTS   TRB  AST STL BLK   FG% 3P%   FT%    WS WS/48  
 0  25.7  16.2  2.6          .429      .736  13.8  .236  
 1  24.2   7.5  2.6          .448      .810  12.2  .214  
 2  18.8   6.8  8.9          .360      .844   6.8  .119  
 3  12.0   7.5  2.7          .425      .643   4.4  .095  
 4  20.4  12.4  2.8          .387      .858  11.8  .225  
 5  19.9   3.6  4.7          .438      .867   8.8  .157  
 6  10.8   9.1  5.9          .412      .733   6.5  .132  
 7  16.8  16.3  4.9          .354      .714   6.0  .125  
 
 [8 rows x 21 columns]]

In [22]:

Code

column_headers_finalist.insert(0, "Year")
print(column_headers_finalist)
print(len(column_headers_finalist))

['Year', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
21

In [23]:

Code

# store all finalist data in one DataFrame
mvp_finalist_df = pd.concat(mvp_finalist_list, axis=0)
print(mvp_finalist_df.shape)

(972, 21)

In [24]:

Code

# Take a look at the first row
mvp_finalist_df.iloc[0]

Year             1956
Rank                1
Player     Bob Pettit
Age                23
Tm                STL
First            33.0
Pts Won          33.0
Pts Max            80
Share           0.413
G                  72
MP               38.8
PTS              25.7
TRB              16.2
AST               2.6
STL                  
BLK                  
FG%              .429
3P%                  
FT%              .736
WS               13.8
WS/48            .236
Name: 0, dtype: object

Now that we fixed up the necessary columns, let’s write out the raw data to a CSV file.

In [25]:

Code

os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)

In [26]:

Code

mvp_finalist_df.head()

	Year	Rank	Player	Age	Tm	First	Pts Won	Pts Max	Share	G	...	PTS	TRB	AST	FG%	FT%	WS	WS/48
0	1956	1	Bob Pettit	23	STL	33.0	33.0	80	0.413	72	...	25.7	16.2	2.6	.429	.736	13.8	.236
1	1956	2	Paul Arizin	27	PHW	21.0	21.0	80	0.263	72	...	24.2	7.5	2.6	.448	.810	12.2	.214
2	1956	3	Bob Cousy	27	BOS	11.0	11.0	80	0.138	72	...	18.8	6.8	8.9	.360	.844	6.8	.119
3	1956	4	Mel Hutchins	27	FTW	9.0	9.0	80	0.113	66	...	12.0	7.5	2.7	.425	.643	4.4	.095
4	1956	5T	Dolph Schayes	27	SYR	2.0	2.0	80	0.025	72	...	20.4	12.4	2.8	.387	.858	11.8	.225

5 rows × 21 columns

In [27]:

Code

# Write out the  MVP data and MVP finalist data to the raw_data folder in the data folder
mvp_data_df.to_csv("../data/raw_data/mvp_data_df_raw.csv", index=False)
mvp_finalist_df.to_csv("../data/raw_data/mvp_finalist_df_raw.csv", index=False)

Cleaning the Data

Now that we have the raw MVP data, we need to clean it up a bit for data exploration

In [28]:

Code

mvp_data_df_clean = pd.read_csv("../data/raw_data/mvp_data_df_raw.csv", encoding = "Latin-1")
mvp_finalist_df_clean = pd.read_csv("../data/raw_data/mvp_finalist_df_raw.csv", encoding = "Latin-1")

In [29]:

Code

mvp_data_df_clean.head()

	Season	Lg	Player	Voting	Age	Tm	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

create dictionaries for renaming columns
rename all columns with dictionaries

In [30]:

Code

mvp_data_columns_dict = {'Season':'season', 'Lg':'league', 'Player':'player', 'Voting': 'voting', 'Tm': 'team', 'Age': 'age',
                         'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points',  'TRB': 'avg_rebounds', 
                         'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks', 'FG%': 'field_goal_pct',
                         '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
                         'WS/48': 'win_shares_per_48'
                        }
mvp_finalist_columns_dict = {'Year':'year', 'Player':'player', 'Rank': 'rank', 'Tm': 'team', 'Age': 'age',
                             'First': 'first_place_votes', 'Pts Won': 'points_won', 'Pts Max': 'points_max',
                             'Share':'vote_share', 'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points',
                             'TRB': 'avg_rebounds', 'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks',
                             'FG%': 'field_goal_pct',  '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
                             'WS/48': 'win_shares_per_48'
                              }

In [31]:

Code

mvp_data_df_clean.rename(index=str,columns=mvp_data_columns_dict, inplace=True)
mvp_data_df_clean.head()

	season	league	player	voting	age	team	games_played	avg_minutes	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

In [32]:

Code

mvp_finalist_df_clean.rename(index=str,columns=mvp_finalist_columns_dict, inplace=True)
mvp_finalist_df_clean.head()

	year	rank	player	age	team	first_place_votes	points_won	points_max	vote_share	games_played	...	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	1956	1	Bob Pettit	23	STL	33.0	33.0	80	0.413	72	...	25.7	16.2	2.6	NaN	NaN	0.429	NaN	0.736	13.8	0.236
1	1956	2	Paul Arizin	27	PHW	21.0	21.0	80	0.263	72	...	24.2	7.5	2.6	NaN	NaN	0.448	NaN	0.810	12.2	0.214
2	1956	3	Bob Cousy	27	BOS	11.0	11.0	80	0.138	72	...	18.8	6.8	8.9	NaN	NaN	0.360	NaN	0.844	6.8	0.119
3	1956	4	Mel Hutchins	27	FTW	9.0	9.0	80	0.113	66	...	12.0	7.5	2.7	NaN	NaN	0.425	NaN	0.643	4.4	0.095
4	1956	5T	Dolph Schayes	27	SYR	2.0	2.0	80	0.025	72	...	20.4	12.4	2.8	NaN	NaN	0.387	NaN	0.858	11.8	0.225

5 rows × 21 columns

Cleaning Up the Rest of the Data

In [33]:

Code

# convert the data to proper numeric types
mvp_data_df_clean = mvp_data_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_data_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, 0 to 112
Data columns (total 18 columns):
season               113 non-null object
league               113 non-null object
player               113 non-null object
voting               73 non-null object
age                  73 non-null float64
team                 73 non-null object
games_played         73 non-null float64
avg_minutes          73 non-null float64
avg_points           73 non-null float64
avg_rebounds         73 non-null float64
avg_assists          73 non-null float64
avg_steals           50 non-null float64
avg_blocks           50 non-null float64
field_goal_pct       73 non-null float64
three_pt_pct         48 non-null float64
free_throw_pct       73 non-null float64
win_shares           73 non-null float64
win_shares_per_48    73 non-null float64
dtypes: float64(13), object(5)
memory usage: 16.8+ KB

In [34]:

Code

# convert the data to proper numeric types
mvp_finalist_df_clean = mvp_finalist_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_finalist_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           756 non-null float64
avg_blocks           756 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         621 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB

We are not done yet. A lot of out numeric columns are missing data because players didn’t accumulate any of those stats. For example, the 3 point line is introduced in 1982 and all players in preceding seasons don’t have this statistic. Additionally, we want to select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.

In [35]:

Code

# Get the column names for the numeric columns
num_cols_mvp = mvp_data_df_clean.columns[mvp_data_df_clean.dtypes != object]
num_cols_finalist = mvp_finalist_df_clean.columns[mvp_finalist_df_clean.dtypes != object]

# Replace all NaNs with 0
mvp_data_df_clean.loc[:, num_cols_mvp] = mvp_data_df_clean.loc[:, num_cols_mvp].fillna(0)
mvp_finalist_df_clean.loc[:, num_cols_finalist] = mvp_finalist_df_clean.loc[:, num_cols_finalist].fillna(0)

In [36]:

Code

mvp_finalist_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           972 non-null float64
avg_blocks           972 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         972 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB

remove ABA winners
remove MVP summary table

In [37]:

Code

mvp_data_df_clean = mvp_data_df_clean.loc[mvp_data_df_clean["league"]=="NBA"]
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['team'])]

We are finally done cleaning the data and now we can save it to a CSV file.

In [38]:

Code

mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['player'])]
mvp_data_df_clean.sort_values(['season'], ascending=False, axis=0, inplace=True)


mvp_finalist_df_clean = mvp_finalist_df_clean[pd.notnull(mvp_finalist_df_clean['player'])]
mvp_finalist_df_clean.sort_values(['year', 'points_won'], ascending=False, axis=0, inplace=True)

In [39]:

Code

mvp_data_df_clean.to_csv("../data/clean_data/mvp_data_df_clean.csv", index=False)
print(mvp_data_df_clean.shape)
mvp_data_df_clean.head()

(63, 18)

	season	league	player	voting	age	team	games_played	avg_minutes	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

In [40]:

Code

mvp_finalist_df_clean.to_csv("../data/clean_data/mvp_finalist_df_clean.csv", index=False)
print(mvp_finalist_df_clean.shape)
mvp_finalist_df_clean.head()

(972, 21)

	year	rank	player	age	team	first_place_votes	points_won	points_max	vote_share	games_played	...	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
959	2018	1	James Harden	28	HOU	86.0	965.0	1010	0.955	72	...	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
960	2018	2	LeBron James	33	CLE	15.0	738.0	1010	0.731	82	...	27.5	8.6	9.1	1.4	0.9	0.542	0.367	0.731	14.0	0.221
961	2018	3	Anthony Davis	24	NOP	0.0	445.0	1010	0.441	75	...	28.1	11.1	2.3	1.5	2.6	0.534	0.340	0.828	13.7	0.241
962	2018	4	Damian Lillard	27	POR	0.0	207.0	1010	0.205	73	...	26.9	4.5	6.6	1.1	0.4	0.439	0.361	0.916	12.6	0.227
963	2018	5	Russell Westbrook	29	OKC	0.0	76.0	1010	0.075	80	...	25.4	10.1	10.3	1.8	0.3	0.449	0.298	0.737	10.1	0.166

5 rows × 21 columns

Review

In this tutorial, we learned how to
- examine the html structure of webpage
- use functions based on the Beautiful Soup module to parse tables on multiple webpage into .csv
- analyzed a .csv file using the Pandas module

Download this notebook or see a static view here

In [41]:

Code

import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")

last updated: 2019-05-27 22:35 

System and module version information: 

Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3