NBA MVP Comparisons - Part 1

sports-analytics

python

Author

Kivan Polimis

Published

December 25, 2018

It’s Christmas, and that means a full slate of NBA games. This time of year also provokes some great NBA discussions, and the best NBA discussions are comparative. Arguments like: would Jordan’s ’96 Bulls beat the ’17 Warriors?

Another comparison that is sure to create great debate: who had the best (or worst) Most Valuable Player (MVP) season in history? This question has a strong empirical dimension, in that we can observe quantifiable aspects across seasons and can leverage programming to both gather and analyze the data.

The MVP data we will gather comes from basketball-reference.com. Basketball-reference is part of the Sports-Reference sites, “a group of sites providing both basic and sabermetric statistics and resources for sports fans everywhere. [Sports-Reference aims] to be the easiest-to-use, fastest, most complete sources for sports statistics anywhere” (sports-reference.com).

In this post, we will gather and pre-process all the data for a multi-part series that aims to:
1. determine the MVP finalist with the best case for winning their year
2. predict the 2018-2019 MVP

Let the shade begin

Outline

Import modules
Examine html structure of webpage
Use a function with Beautiful Soup to parse webpages into .csv
Analyze .csv of webpage as a Pandas DataFrame
Process the data

import relevant modules
standard library modules:
- os
- urllib
- webbrowser
- datetime
open source modules:
- pandas
- Beautiful Soup
Let’s examine the webpage with all the MVP data from the 1955-56 season to the 2017-18 season.

True

Scraping the Column Headers

The column headers we need for our DataFrame are found in the th element
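As a minimal sketch of this step, the snippet below pulls the text of every th element and then slices out the real column names. It runs on a small made-up HTML table rather than the live page, and the variable names (sample_html, all_headers, column_headers) are assumptions for illustration, not the original notebook's code.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the MVP table; the real page has many more columns and rows.
sample_html = """
<table>
  <thead>
    <tr><th>Shooting</th><th>Advanced</th></tr>
    <tr><th>Season</th><th>Lg</th><th>Player</th><th>FG%</th><th>WS</th></tr>
  </thead>
  <tbody>
    <tr><th>2017-18</th><td>NBA</td><td>James Harden</td><td>.449</td><td>15.4</td></tr>
    <tr><th>2016-17</th><td>NBA</td><td>Russell Westbrook</td><td>.425</td><td>13.1</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Every <th> in document order: group headers, column names, and row labels.
all_headers = [th.get_text(strip=True) for th in soup.find_all("th")]

# Keep only the slice that holds the real column names.
start = all_headers.index("Season")
end = all_headers.index("WS") + 1
column_headers = all_headers[start:end]

print(all_headers)
print(column_headers)
```

Slicing by the first and last expected column name avoids hard-coding positions, which would shift if the site added another group header.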

['Shooting', 'Advanced', 'Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '2017-18', '2016-17', '2015-16', '2014-15', '2013-14', '2012-13', '2011-12']
['Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']

Scraping the Data

Note that table_rows is a list of tag elements.

<class 'list'>

<tr><th class="left " data-stat="season" scope="row"><a href="/leagues/NBA_2018.html">2017-18</a></th><td class="left " data-stat="lg_id"><a href="/leagues/NBA_2018.html">NBA</a></td><td class="left " csk="Harden,James" data-append-csv="hardeja01" data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td><td class="center " data-stat="voting"> (<a href="/awards/awards_2018.html#mvp">V</a>)</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/HOU/2018.html">HOU</a></td><td class="right " data-stat="g">72</td><td class="right " data-stat="mp_per_g">35.4</td><td class="right " data-stat="pts_per_g">30.4</td><td class="right " data-stat="trb_per_g">5.4</td><td class="right " data-stat="ast_per_g">8.8</td><td class="right " data-stat="stl_per_g">1.8</td><td class="right " data-stat="blk_per_g">0.7</td><td class="right " data-stat="fg_pct">.449</td><td class="right " data-stat="fg3_pct">.367</td><td class="right " data-stat="ft_pct">.858</td><td class="right " data-stat="ws">15.4</td><td class="right " data-stat="ws_per_48">.289</td></tr>

The data we want for each player is found within the td (or table data) elements.
Below I’ve created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.
Now we can create a DataFrame with the MVP data.
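The extraction function itself is not shown here, so the following is only a sketch of what it might look like. The name extract_player_data is hypothetical, and the input is a shortened copy of the James Harden row above. Note that the season sits in a th element while the other values sit in td elements, so the sketch collects both.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the table row shown above (first five cells only).
row_html = """
<table><tbody>
<tr><th data-stat="season"><a href="/leagues/NBA_2018.html">2017-18</a></th>
<td data-stat="lg_id"><a href="/leagues/NBA_2018.html">NBA</a></td>
<td data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td>
<td data-stat="voting"> (<a href="/awards/awards_2018.html#mvp">V</a>)</td>
<td data-stat="age">28</td></tr>
</tbody></table>
"""


def extract_player_data(table_rows):
    """Return one list of cell strings per table row (hypothetical helper)."""
    player_data = []
    for row in table_rows:
        # The season is in a <th>; every other value is in a <td>.
        cells = row.find_all(["th", "td"])
        values = [cell.get_text(strip=True) for cell in cells]
        if values:  # skip rows that contain no cells
            player_data.append(values)
    return player_data


soup = BeautifulSoup(row_html, "html.parser")
table_rows = soup.find_all("tr")
mvp_df = pd.DataFrame(extract_player_data(table_rows))
print(mvp_df)
```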

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
0	2017-18	NBA	James Harden	(V)	28	HOU	72	35.4	30.4	5.4	8.8	1.8	0.7	.449	.367	.858	15.4	.289

rename the columns
view the data

the MVP dataframe has 113 rows (player-year observations) and  18 columns

	Season	Lg	Player	Voting	Age	Tm	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
0	2017-18	NBA	James Harden	(V)	28	HOU	72	35.4	30.4	5.4	8.8	1.8	0.7	.449	.367	.858	15.4	.289
1	2016-17	NBA	Russell Westbrook	(V)	28	OKC	81	34.6	31.6	10.7	10.4	1.6	0.4	.425	.343	.845	13.1	.224
2	2015-16	NBA	Stephen Curry	(V)	27	GSW	79	34.2	30.1	5.4	6.7	2.1	0.2	.504	.454	.908	17.9	.318
3	2014-15	NBA	Stephen Curry	(V)	26	GSW	80	32.7	23.8	4.3	7.7	2.0	0.2	.487	.443	.914	15.7	.288
4	2013-14	NBA	Kevin Durant	(V)	25	OKC	81	38.5	32.0	7.4	5.5	1.3	0.7	.503	.391	.873	19.2	.295

Now we need the data on all the finalists to answer the question:
- which finalist had the best argument for winning that year?

raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
<class 'bs4.element.ResultSet'>

<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>

create a function to extract MVP finalist data

the MVP finalist dataframe has 13 rows (player-year observations) and  20 columns

	Rank	Player	Age	Tm	Pts Won	Pts Max	Share	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
7	8	DeMar DeRozan	28	TOR	32.0	1010	0.032	80	33.9	23.0	3.9	5.2	1.1	0.3	.456	.310	.825	9.6	.170
8	9	LaMarcus Aldridge	32	SAS	6.0	1010	0.006	75	33.5	23.1	8.5	2.0	0.6	1.2	.510	.293	.837	10.9	.209
9	10T	Jimmy Butler	28	MIN	5.0	1010	0.005	59	36.7	22.2	5.3	4.9	2.0	0.4	.474	.350	.854	8.9	.198
10	10T	Stephen Curry	29	GSW	5.0	1010	0.005	51	32.0	26.4	5.1	6.1	1.6	0.2	.495	.423	.921	9.1	.267
11	12	Joel Embiid	23	PHI	4.0	1010	0.004	63	30.3	22.9	11.0	3.2	0.6	1.8	.483	.308	.769	6.2	.155
12	13	Victor Oladipo	25	IND	2.0	1010	0.002	75	34.0	23.1	5.2	4.3	2.4	0.8	.477	.371	.799	8.2	.155

Scraping the Data for All MVP Finalists Since 1956

Scraping the finalist data since 1956 is essentially the same process as above, just repeated for each year using a for loop.

As we loop over the years, we will create a DataFrame for each year of MVP finalist data, and append it to a large list of DataFrames that contains all the MVP finalists data. We will also have a separate list that will contain any errors and the url associated with that error. This will let us know if there are any issues with our scraper, and which url is causing the error.
Let’s time how long this loop takes.
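The loop described above can be sketched as follows. This is not the original notebook code: the URL template is assembled from the site name and the /awards/awards_2018.html links visible in the table rows, scrape_year is a hypothetical callable that turns one url into a DataFrame, and fake_scrape_year is a stand-in so the sketch runs without network access.

```python
from datetime import datetime

import pandas as pd

# Assumed pattern, built from the links seen in the scraped rows.
URL_TEMPLATE = "https://www.basketball-reference.com/awards/awards_{year}.html"


def scrape_all_years(years, scrape_year):
    """Collect one DataFrame per year and record failures instead of stopping."""
    finalist_dfs = []
    errors_list = []
    for year in years:
        url = URL_TEMPLATE.format(year=year)
        try:
            year_df = scrape_year(url)
            year_df.insert(0, "Year", year)  # tag each row with its year
            finalist_dfs.append(year_df)
        except Exception as e:
            # Keep the url next to the error so the failing page is easy to find.
            errors_list.append((url, e))
    return finalist_dfs, errors_list


def fake_scrape_year(url):
    """Stand-in scraper (hypothetical) that fails for one year on purpose."""
    if "1957" in url:
        raise ValueError("table not found")
    return pd.DataFrame({"Rank": ["1", "2"], "Player": ["A", "B"]})


start = datetime.now()
finalist_dfs, errors_list = scrape_all_years(range(1956, 1959), fake_scrape_year)
print("the loop took", datetime.now() - start)
print(errors_list)
```

Catching the exception per year means one bad page does not discard the data already collected from the others.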

2019-05-25 15:10:57.567450

2019-05-25 15:12:03.441050
the loop took 0:01:05.873600

the loop took ~ 1 minute

[]

We don’t get any errors, so that’s good.
Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the finalist data

<class 'list'>
63

[   Year Rank          Player Age   Tm First Pts Won Pts Max  Share   G  ...   \
 0  1956    1      Bob Pettit  23  STL  33.0    33.0      80  0.413  72  ...    
 1  1956    2     Paul Arizin  27  PHW  21.0    21.0      80  0.263  72  ...    
 2  1956    3       Bob Cousy  27  BOS  11.0    11.0      80  0.138  72  ...    
 3  1956    4    Mel Hutchins  27  FTW   9.0     9.0      80  0.113  66  ...    
 4  1956   5T   Dolph Schayes  27  SYR   2.0     2.0      80  0.025  72  ...    
 5  1956   5T    Bill Sharman  29  BOS   2.0     2.0      80  0.025  72  ...    
 6  1956   7T        Tom Gola  23  PHW   1.0     1.0      80  0.013  68  ...    
 7  1956   7T  Maurice Stokes  22  ROC   1.0     1.0      80  0.013  67  ...    
 
     PTS   TRB  AST STL BLK   FG% 3P%   FT%    WS WS/48  
 0  25.7  16.2  2.6          .429      .736  13.8  .236  
 1  24.2   7.5  2.6          .448      .810  12.2  .214  
 2  18.8   6.8  8.9          .360      .844   6.8  .119  
 3  12.0   7.5  2.7          .425      .643   4.4  .095  
 4  20.4  12.4  2.8          .387      .858  11.8  .225  
 5  19.9   3.6  4.7          .438      .867   8.8  .157  
 6  10.8   9.1  5.9          .412      .733   6.5  .132  
 7  16.8  16.3  4.9          .354      .714   6.0  .125  
 
 [8 rows x 21 columns]]

['Year', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
21

(972, 21)
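A minimal sketch of the concatenation step, using two tiny made-up per-year DataFrames in place of the 63 scraped ones. Passing ignore_index=True is an assumption here; it renumbers the rows from 0 instead of repeating each year's own index.

```python
import pandas as pd

# Two small stand-ins for the per-year DataFrames in the list.
finalist_dfs = [
    pd.DataFrame({"Year": [1956, 1956], "Player": ["Bob Pettit", "Paul Arizin"]}),
    pd.DataFrame({"Year": [2018], "Player": ["James Harden"]}),
]

# Stack the per-year tables vertically into one DataFrame.
finalists_df = pd.concat(finalist_dfs, axis=0, ignore_index=True)
print(finalists_df.shape)
```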

Year             1956
Rank                1
Player     Bob Pettit
Age                23
Tm                STL
First            33.0
Pts Won          33.0
Pts Max            80
Share           0.413
G                  72
MP               38.8
PTS              25.7
TRB              16.2
AST               2.6
STL                  
BLK                  
FG%              .429
3P%                  
FT%              .736
WS               13.8
WS/48            .236
Name: 0, dtype: object

Now that we fixed up the necessary columns, let’s write out the raw data to a CSV file.

	Year	Rank	Player	Age	Tm	First	Pts Won	Pts Max	Share	G	...	PTS	TRB	AST	FG%	FT%	WS	WS/48
0	1956	1	Bob Pettit	23	STL	33.0	33.0	80	0.413	72	...	25.7	16.2	2.6	.429	.736	13.8	.236
1	1956	2	Paul Arizin	27	PHW	21.0	21.0	80	0.263	72	...	24.2	7.5	2.6	.448	.810	12.2	.214
2	1956	3	Bob Cousy	27	BOS	11.0	11.0	80	0.138	72	...	18.8	6.8	8.9	.360	.844	6.8	.119
3	1956	4	Mel Hutchins	27	FTW	9.0	9.0	80	0.113	66	...	12.0	7.5	2.7	.425	.643	4.4	.095
4	1956	5T	Dolph Schayes	27	SYR	2.0	2.0	80	0.025	72	...	20.4	12.4	2.8	.387	.858	11.8	.225

5 rows × 21 columns

Cleaning the Data

Now that we have the raw MVP data, we need to clean it up a bit for data exploration

	Season	Lg	Player	Voting	Age	Tm	G	MP	PTS	TRB	AST	STL	BLK	FG%	3P%	FT%	WS	WS/48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

create dictionaries for renaming columns
rename all columns with dictionaries
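A sketch of the renaming step on a three-column sample. The dictionary maps the site's abbreviations to the descriptive names used in the tables that follow. The pd.to_numeric conversion is an assumption about how string cells such as ".367" and empty cells could become floats and NaN; the original conversion code is not shown.

```python
import pandas as pd

# Scraped cells arrive as strings; empty cells are empty strings.
raw_df = pd.DataFrame({
    "Player": ["Bob Pettit", "James Harden"],
    "PTS": ["25.7", "30.4"],
    "3P%": ["", ".367"],
})

# Map the site's abbreviations to descriptive column names.
column_names = {"Player": "player", "PTS": "avg_points", "3P%": "three_pt_pct"}
clean_df = raw_df.rename(columns=column_names)

# Coerce stat columns to numbers; cells that cannot be parsed become NaN.
for col in ["avg_points", "three_pt_pct"]:
    clean_df[col] = pd.to_numeric(clean_df[col], errors="coerce")

print(clean_df.dtypes)
```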

	season	league	player	voting	age	team	games_played	avg_minutes	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

	year	rank	player	age	team	first_place_votes	points_won	points_max	vote_share	games_played	...	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	1956	1	Bob Pettit	23	STL	33.0	33.0	80	0.413	72	...	25.7	16.2	2.6	NaN	NaN	0.429	NaN	0.736	13.8	0.236
1	1956	2	Paul Arizin	27	PHW	21.0	21.0	80	0.263	72	...	24.2	7.5	2.6	NaN	NaN	0.448	NaN	0.810	12.2	0.214
2	1956	3	Bob Cousy	27	BOS	11.0	11.0	80	0.138	72	...	18.8	6.8	8.9	NaN	NaN	0.360	NaN	0.844	6.8	0.119
3	1956	4	Mel Hutchins	27	FTW	9.0	9.0	80	0.113	66	...	12.0	7.5	2.7	NaN	NaN	0.425	NaN	0.643	4.4	0.095
4	1956	5T	Dolph Schayes	27	SYR	2.0	2.0	80	0.025	72	...	20.4	12.4	2.8	NaN	NaN	0.387	NaN	0.858	11.8	0.225

5 rows × 21 columns

Cleaning Up the Rest of the Data

<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, 0 to 112
Data columns (total 18 columns):
season               113 non-null object
league               113 non-null object
player               113 non-null object
voting               73 non-null object
age                  73 non-null float64
team                 73 non-null object
games_played         73 non-null float64
avg_minutes          73 non-null float64
avg_points           73 non-null float64
avg_rebounds         73 non-null float64
avg_assists          73 non-null float64
avg_steals           50 non-null float64
avg_blocks           50 non-null float64
field_goal_pct       73 non-null float64
three_pt_pct         48 non-null float64
free_throw_pct       73 non-null float64
win_shares           73 non-null float64
win_shares_per_48    73 non-null float64
dtypes: float64(13), object(5)
memory usage: 16.8+ KB

<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           756 non-null float64
avg_blocks           756 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         621 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB

We are not done yet. A lot of our numeric columns are missing data because players in earlier seasons could not accumulate those stats. For example, the NBA introduced the 3-point line in the 1979-80 season, so players in preceding seasons don’t have this statistic. We want to select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.
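The replacement described above can be sketched like this, on a two-row sample with made-up gaps. Selecting columns with select_dtypes is one way to pick the numeric columns and is an assumption about the approach, not the original code.

```python
import pandas as pd

nan = float("nan")
finalists_df = pd.DataFrame({
    "player": ["Bob Pettit", "James Harden"],
    "avg_points": [25.7, 30.4],
    "avg_steals": [nan, 1.8],
    "three_pt_pct": [nan, 0.367],
})

# Pick only the numeric columns so text columns are left untouched.
numeric_cols = finalists_df.select_dtypes(include="number").columns

# Replace missing values in those columns with 0.
finalists_df[numeric_cols] = finalists_df[numeric_cols].fillna(0)
print(finalists_df)
```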

<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           972 non-null float64
avg_blocks           972 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         972 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB

remove ABA winners
remove MVP summary table
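One possible way to do both removals is sketched below on three made-up rows. It assumes that rows from the summary table can be recognised by their missing age (the info output above shows only 73 of 113 rows have an age) and that ABA winners are marked in the league column. The original filtering code is not shown, so treat both conditions as assumptions.

```python
import pandas as pd

# Made-up sample: one NBA winner, one ABA winner, one summary-table row.
mvp_raw = pd.DataFrame({
    "season": ["2017-18", "1975-76", "2017-18"],
    "league": ["NBA", "ABA", "NBA"],
    "player": ["James Harden", "Julius Erving", "James Harden"],
    "age": [28.0, 25.0, float("nan")],
})

# Summary-table rows carry no per-season stats, so their age is missing.
mvp_nba = mvp_raw.dropna(subset=["age"])

# Keep NBA winners only and renumber the rows.
mvp_nba = mvp_nba[mvp_nba["league"] == "NBA"].reset_index(drop=True)
print(mvp_nba.shape)
```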

We are finally done cleaning the data and now we can save it to a CSV file.

(63, 18)

	season	league	player	voting	age	team	games_played	avg_minutes	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
0	2017-18	NBA	James Harden	(V)	28.0	HOU	72.0	35.4	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
1	2016-17	NBA	Russell Westbrook	(V)	28.0	OKC	81.0	34.6	31.6	10.7	10.4	1.6	0.4	0.425	0.343	0.845	13.1	0.224
2	2015-16	NBA	Stephen Curry	(V)	27.0	GSW	79.0	34.2	30.1	5.4	6.7	2.1	0.2	0.504	0.454	0.908	17.9	0.318
3	2014-15	NBA	Stephen Curry	(V)	26.0	GSW	80.0	32.7	23.8	4.3	7.7	2.0	0.2	0.487	0.443	0.914	15.7	0.288
4	2013-14	NBA	Kevin Durant	(V)	25.0	OKC	81.0	38.5	32.0	7.4	5.5	1.3	0.7	0.503	0.391	0.873	19.2	0.295

(972, 21)

	year	rank	player	age	team	first_place_votes	points_won	points_max	vote_share	games_played	...	avg_points	avg_rebounds	avg_assists	avg_steals	avg_blocks	field_goal_pct	three_pt_pct	free_throw_pct	win_shares	win_shares_per_48
959	2018	1	James Harden	28	HOU	86.0	965.0	1010	0.955	72	...	30.4	5.4	8.8	1.8	0.7	0.449	0.367	0.858	15.4	0.289
960	2018	2	LeBron James	33	CLE	15.0	738.0	1010	0.731	82	...	27.5	8.6	9.1	1.4	0.9	0.542	0.367	0.731	14.0	0.221
961	2018	3	Anthony Davis	24	NOP	0.0	445.0	1010	0.441	75	...	28.1	11.1	2.3	1.5	2.6	0.534	0.340	0.828	13.7	0.241
962	2018	4	Damian Lillard	27	POR	0.0	207.0	1010	0.205	73	...	26.9	4.5	6.6	1.1	0.4	0.439	0.361	0.916	12.6	0.227
963	2018	5	Russell Westbrook	29	OKC	0.0	76.0	1010	0.075	80	...	25.4	10.1	10.3	1.8	0.3	0.449	0.298	0.737	10.1	0.166

5 rows × 21 columns

Review

In this tutorial, we learned how to
- examine the html structure of a webpage
- use functions based on the Beautiful Soup module to parse tables on multiple webpages into .csv
- analyze a .csv file using the Pandas module

last updated: 2019-05-27 22:35 

System and module version information: 

Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3
