True
NBA MVP Comparisons - Part 3
Kivan Polimis
March 6, 2019
NBA MVP Comparisons
Part 3
- 6 th March 2019
In the last two blog posts we:
1. Gathered the relevant data for all NBA MVPs 2. Evaluated several machine learning models to select the MVP and chose the Latent Discriminant Analysis model as the best predictor.
The goal of this blog post is to gather the data necessary to predict the 2018-2019 MVP
We will look select from the MVP finalists in 2017-2018 * MVP will likely be a finalist from the previous year * Is this a valid assumption?
Update
On May 17th, the NBA announced the 2018-2019 MVP finalists: * Giannis Antetokounmpo, Milwaukee Bucks * Paul George, Oklahoma City Thunder * James Harden, Houston Rockets * source: https://www.nba.com/article/2019/05/17/finalists-2019-nba-awards
Outline
- Import modules
- Examine html structure of webpage
- Use a function with
Beautiful Soupto parse webpages into .csv - Analyze .csv of webpage as a
PandasDataFrame - Process the data
- import relevant modules
- standard library modules:
- open source modules:
- examine webpage
get the html
create the BeautifulSoup object
Extract the necessary values for the column headers from the table and store them as a list
raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
- The data is found within the
trelements of the firsttbodyelement - We want the elements from the 3rd row and on
- take a look at the last row to examine data quality
the subset soup object is of type: <class 'bs4.element.ResultSet'>
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
- to get a player’s 2018-2019 statistical information, we can extract the player’s web page from the
BeautifulSoupobject
'/players/o/oladivi01.html'
- however, the player link is just a stub url and needs a base url for basketball-reference.com appended to it to access the player’s web page
'https://www.basketball-reference.com/players/o/oladivi01.html'
True
create a function to extract MVP finalist data
extract the data we want
and then store it in a
DataFrame
the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
| Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | ... | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | player_link | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | DeMar DeRozan | 28 | TOR | 0.0 | 32.0 | 1010 | 0.032 | 80 | 33.9 | ... | 3.9 | 5.2 | 1.1 | 0.3 | .456 | .310 | .825 | 9.6 | .170 | https://www.basketball-reference.com/players/d... |
| 8 | 9 | LaMarcus Aldridge | 32 | SAS | 0.0 | 6.0 | 1010 | 0.006 | 75 | 33.5 | ... | 8.5 | 2.0 | 0.6 | 1.2 | .510 | .293 | .837 | 10.9 | .209 | https://www.basketball-reference.com/players/a... |
| 9 | 10T | Jimmy Butler | 28 | MIN | 0.0 | 5.0 | 1010 | 0.005 | 59 | 36.7 | ... | 5.3 | 4.9 | 2.0 | 0.4 | .474 | .350 | .854 | 8.9 | .198 | https://www.basketball-reference.com/players/b... |
| 10 | 10T | Stephen Curry | 29 | GSW | 0.0 | 5.0 | 1010 | 0.005 | 51 | 32.0 | ... | 5.1 | 6.1 | 1.6 | 0.2 | .495 | .423 | .921 | 9.1 | .267 | https://www.basketball-reference.com/players/c... |
| 11 | 12 | Joel Embiid | 23 | PHI | 0.0 | 4.0 | 1010 | 0.004 | 63 | 30.3 | ... | 11.0 | 3.2 | 0.6 | 1.8 | .483 | .308 | .769 | 6.2 | .155 | https://www.basketball-reference.com/players/e... |
| 12 | 13 | Victor Oladipo | 25 | IND | 0.0 | 2.0 | 1010 | 0.002 | 75 | 34.0 | ... | 5.2 | 4.3 | 2.4 | 0.8 | .477 | .371 | .799 | 8.2 | .155 | https://www.basketball-reference.com/players/o... |
6 rows × 21 columns
- Extract the necessary values for the column headers from the table and store them as a list
- create the
BeautifulSoupobject
the columns in the career data tables are:
['Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
there are 29 total columns in the career table
- let’s examine the data to make sure it conforms with our expectations
player name: James Harden
player years active: ['2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
player career data: ['20', 'OKC', 'NBA', 'SG', '76', '0', '22.9', '3.1', '7.6', '.403', '1.2', '3.3', '.375', '1.8', '4.3', '.424', '.484', '2.6', '3.2', '.808', '0.6', '2.6', '3.2', '1.8', '1.1', '0.3', '1.4', '2.6', '9.9', '21', 'OKC', 'NBA', 'SG', '82', '5', '26.7', '3.6', '8.3', '.436', '1.4', '4.0', '.349', '2.3', '4.4', '.514', '.518', '3.5', '4.2', '.843', '0.5', '2.6', '3.1', '2.1', '1.1', '0.3', '1.3', '2.5', '12.2', '22', 'OKC', 'NBA', 'SG', '62', '2', '31.4', '5.0', '10.1', '.491', '1.8', '4.7', '.390', '3.1', '5.4', '.579', '.582', '5.0', '6.0', '.846', '0.5', '3.6', '4.1', '3.7', '1.0', '0.2', '2.2', '2.4', '16.8', '23', 'HOU', 'NBA', 'SG', '78', '78', '38.3', '7.5', '17.1', '.438', '2.3', '6.2', '.368', '5.2', '10.9', '.477', '.504', '8.6', '10.2', '.851', '0.8', '4.1', '4.9', '5.8', '1.8', '0.5', '3.8', '2.3', '25.9', '24', 'HOU', 'NBA', 'SG', '73', '73', '38.0', '7.5', '16.5', '.456', '2.4', '6.6', '.366', '5.1', '9.9', '.515', '.529', '7.9', '9.1', '.866', '0.8', '3.9', '4.7', '6.1', '1.6', '0.4', '3.6', '2.4', '25.4', '25', 'HOU', 'NBA', 'SG', '81', '81', '36.8', '8.0', '18.1', '.440', '2.6', '6.9', '.375', '5.4', '11.3', '.480', '.511', '8.8', '10.2', '.868', '0.9', '4.7', '5.7', '7.0', '1.9', '0.7', '4.0', '2.6', '27.4', '26', 'HOU', 'NBA', 'SG', '82', '82', '38.1', '8.7', '19.7', '.439', '2.9', '8.0', '.359', '5.8', '11.7', '.494', '.512', '8.8', '10.2', '.860', '0.8', '5.3', '6.1', '7.5', '1.7', '0.6', '4.6', '2.8', '29.0', '27', 'HOU', 'NBA', 'PG', '81', '81', '36.4', '8.3', '18.9', '.440', '3.2', '9.3', '.347', '5.1', '9.6', '.530', '.525', '9.2', '10.9', '.847', '1.2', '7.0', '8.1', '11.2', '1.5', '0.5', '5.7', '2.7', '29.1', '28', 'HOU', 'NBA', 'SG', '72', '72', '35.4', '9.0', '20.1', '.449', '3.7', '10.0', '.367', '5.4', '10.1', '.531', '.541', '8.7', '10.1', '.858', '0.6', '4.8', '5.4', '8.8', '1.8', '0.7', '4.4', '2.3', '30.4', '29', 'HOU', 'NBA', 'PG', '78', '78', '36.8', '10.8', '24.5', '.442', '4.8', '13.2', '.368', '6.0', '11.3', '.528', '.541', '9.7', '11.0', '.879', '0.8', '5.8', '6.6', '7.5', '2.0', '0.7', '5.0', '3.1', '36.1', '', '', 'NBA', '', '765', '552', '34.1', '7.2', '16.2', '.443', '2.6', '7.3', '.365', '4.5', '9.0', '.506', '.525', '7.3', '8.5', '.857', '0.8', '4.5', '5.2', '6.2', '1.6', '0.5', '3.6', '2.6', '24.3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'HOU', 'NBA', '', '545', '545', '37.1', '8.5', '19.3', '.443', '3.1', '8.6', '.364', '5.4', '10.7', '.506', '.524', '8.8', '10.2', '.861', '0.8', '5.1', '6.0', '7.7', '1.8', '0.6', '4.4', '2.6', '29.0', '', 'OKC', 'NBA', '', '220', '7', '26.7', '3.8', '8.6', '.444', '1.5', '3.9', '.370', '2.4', '4.7', '.506', '.529', '3.6', '4.3', '.835', '0.5', '2.9', '3.4', '2.5', '1.1', '0.3', '1.6', '2.5', '12.7']
the player career data is a list that includes every year the player was active
- this list is not broken up by year
to create yearly player data from the player career data list, we need a function to seperate the list
we can use the following function is from StackOverflow
every 29 items in the list represents a year of player data
- we will split the
player_career_dataobject every 29 steps (using the language of theslice_perfunction) to create a year of data
- we will split the
then we add the column headers to the dataframe as a sanity check that the information was extracted correctly
<class 'list'>
| Age | Tm | Lg | Pos | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | 7.6 | .403 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | 8.3 | .436 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | 10.1 | .491 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | 17.1 | .438 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | 16.5 | .456 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
| 5 | 25 | HOU | NBA | SG | 81 | 81 | 36.8 | 8.0 | 18.1 | .440 | ... | .868 | 0.9 | 4.7 | 5.7 | 7.0 | 1.9 | 0.7 | 4.0 | 2.6 | 27.4 |
| 6 | 26 | HOU | NBA | SG | 82 | 82 | 38.1 | 8.7 | 19.7 | .439 | ... | .860 | 0.8 | 5.3 | 6.1 | 7.5 | 1.7 | 0.6 | 4.6 | 2.8 | 29.0 |
| 7 | 27 | HOU | NBA | PG | 81 | 81 | 36.4 | 8.3 | 18.9 | .440 | ... | .847 | 1.2 | 7.0 | 8.1 | 11.2 | 1.5 | 0.5 | 5.7 | 2.7 | 29.1 |
| 8 | 28 | HOU | NBA | SG | 72 | 72 | 35.4 | 9.0 | 20.1 | .449 | ... | .858 | 0.6 | 4.8 | 5.4 | 8.8 | 1.8 | 0.7 | 4.4 | 2.3 | 30.4 |
| 9 | 29 | HOU | NBA | PG | 78 | 78 | 36.8 | 10.8 | 24.5 | .442 | ... | .879 | 0.8 | 5.8 | 6.6 | 7.5 | 2.0 | 0.7 | 5.0 | 3.1 | 36.1 |
| 10 | NBA | 765 | 552 | 34.1 | 7.2 | 16.2 | .443 | ... | .857 | 0.8 | 4.5 | 5.2 | 6.2 | 1.6 | 0.5 | 3.6 | 2.6 | 24.3 | |||
| 11 | ... | ||||||||||||||||||||
| 12 | HOU | NBA | 545 | 545 | 37.1 | 8.5 | 19.3 | .443 | ... | .861 | 0.8 | 5.1 | 6.0 | 7.7 | 1.8 | 0.6 | 4.4 | 2.6 | 29.0 | ||
| 13 | OKC | NBA | 220 | 7 | 26.7 | 3.8 | 8.6 | .444 | ... | .835 | 0.5 | 2.9 | 3.4 | 2.5 | 1.1 | 0.3 | 1.6 | 2.5 | 12.7 |
14 rows × 29 columns
now we need a function to extract the player career data using the link to the player’s profile as the input
use list comprehension to extract career data for each player link in
mvp_finalist_2018_datause
Pandasconcat function to store all finalist data in oneDataFramestore all finalist data in one
DataFrame
the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
| Player | Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
- rename columns by concatentating
column_headers_playerwith the two new columns we added in ourextract_career_datafunction
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
'PTS'],
dtype='object')
- examine the
mvp_finalist_2019_career_datadata
| Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Victor Oladipo | 2014-15 | 22 | ORL | NBA | SG | 72 | 71 | 35.7 | 6.6 | ... | .819 | 0.7 | 3.5 | 4.2 | 4.1 | 1.7 | 0.3 | 2.8 | 2.6 | 17.9 |
| 2 | Victor Oladipo | 2015-16 | 23 | ORL | NBA | SG | 72 | 52 | 33.0 | 5.9 | ... | .830 | 0.7 | 4.1 | 4.8 | 3.9 | 1.6 | 0.8 | 2.1 | 2.4 | 16.0 |
| 3 | Victor Oladipo | 2016-17 | 24 | OKC | NBA | SG | 67 | 67 | 33.2 | 6.1 | ... | .753 | 0.6 | 3.8 | 4.3 | 2.6 | 1.2 | 0.3 | 1.8 | 2.3 | 15.9 |
| 4 | Victor Oladipo | 2017-18 | 25 | IND | NBA | SG | 75 | 75 | 34.0 | 8.5 | ... | .799 | 0.6 | 4.6 | 5.2 | 4.3 | 2.4 | 0.8 | 2.9 | 2.3 | 23.1 |
| 5 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | .730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
5 rows × 31 columns
Now that we fixed up the necessary columns, let’s write out the raw data to a CSV file.
- Write out the career data for 2018 MVP finalists to the raw_data folder in the data folder
Cleaning the Data
- Now that we have the raw MVP data, we need to clean it up a bit for data exploration
| Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
'PTS'],
dtype='object')
- create dictionaries for renaming columns
- rename all columns with dictionaries
| player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
Index(['player', 'year', 'age', 'team', 'league', 'Pos', 'games_played',
'games_started', 'avg_minutes', 'field_goals_made_per_game',
'field_goals_attempted_per_game', 'field_goal_pct',
'three_pt_fg_made_per_game', 'three_pt_fg_attempted_per_game',
'three_pt_pct', 'two_pt_fg_made_per_game',
'two_pt_fg_attempted_per_game', 'two_pt_fg_pct', 'effective_fg_pct',
'free_throws_made_per_game', 'free_throws_attempted_per_game',
'free_throw_pct', 'offensive_rebounds_per_game',
'defensive_rebounds_per_game', 'total_rebounds_per_game',
'assists_per_game', 'steals_per_game', 'blocks_per_game',
'turnovers_per_game', 'fouls_committed_per_game', 'points_per_game'],
dtype='object')
Cleaning Up the Rest of the Data
- convert the data to proper numeric types
<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player 121 non-null object
year 121 non-null object
age 121 non-null int64
team 121 non-null object
league 121 non-null object
Pos 121 non-null object
games_played 121 non-null int64
games_started 121 non-null int64
avg_minutes 121 non-null float64
field_goals_made_per_game 121 non-null float64
field_goals_attempted_per_game 121 non-null float64
field_goal_pct 121 non-null float64
three_pt_fg_made_per_game 121 non-null float64
three_pt_fg_attempted_per_game 121 non-null float64
three_pt_pct 121 non-null float64
two_pt_fg_made_per_game 121 non-null float64
two_pt_fg_attempted_per_game 121 non-null float64
two_pt_fg_pct 121 non-null float64
effective_fg_pct 121 non-null float64
free_throws_made_per_game 121 non-null float64
free_throws_attempted_per_game 121 non-null float64
free_throw_pct 121 non-null float64
offensive_rebounds_per_game 121 non-null float64
defensive_rebounds_per_game 121 non-null float64
total_rebounds_per_game 121 non-null float64
assists_per_game 121 non-null float64
steals_per_game 121 non-null float64
blocks_per_game 121 non-null float64
turnovers_per_game 121 non-null float64
fouls_committed_per_game 121 non-null float64
points_per_game 121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
- Get the column names for the numeric columns
- Replace all NaNs with 0
<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player 121 non-null object
year 121 non-null object
age 121 non-null int64
team 121 non-null object
league 121 non-null object
Pos 121 non-null object
games_played 121 non-null int64
games_started 121 non-null int64
avg_minutes 121 non-null float64
field_goals_made_per_game 121 non-null float64
field_goals_attempted_per_game 121 non-null float64
field_goal_pct 121 non-null float64
three_pt_fg_made_per_game 121 non-null float64
three_pt_fg_attempted_per_game 121 non-null float64
three_pt_pct 121 non-null float64
two_pt_fg_made_per_game 121 non-null float64
two_pt_fg_attempted_per_game 121 non-null float64
two_pt_fg_pct 121 non-null float64
effective_fg_pct 121 non-null float64
free_throws_made_per_game 121 non-null float64
free_throws_attempted_per_game 121 non-null float64
free_throw_pct 121 non-null float64
offensive_rebounds_per_game 121 non-null float64
defensive_rebounds_per_game 121 non-null float64
total_rebounds_per_game 121 non-null float64
assists_per_game 121 non-null float64
steals_per_game 121 non-null float64
blocks_per_game 121 non-null float64
turnovers_per_game 121 non-null float64
fouls_committed_per_game 121 non-null float64
points_per_game 121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
- We are finally done cleaning the data and now we can save it to a CSV file.
the dimensions for the final data are: (121, 31) (rows, columns)
| player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | 0.730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
| 25 | LeBron James | 2018-19 | 34 | LAL | NBA | SF | 55 | 55 | 35.2 | 10.1 | ... | 0.665 | 1.0 | 7.4 | 8.5 | 8.3 | 1.3 | 0.6 | 3.6 | 1.7 | 27.4 |
| 91 | LaMarcus Aldridge | 2018-19 | 33 | SAS | NBA | C | 81 | 81 | 33.2 | 8.4 | ... | 0.847 | 3.1 | 6.1 | 9.2 | 2.4 | 0.5 | 1.3 | 1.8 | 2.2 | 21.3 |
| 39 | Damian Lillard | 2018-19 | 28 | POR | NBA | PG | 80 | 80 | 35.5 | 8.5 | ... | 0.912 | 0.9 | 3.8 | 4.6 | 6.9 | 1.1 | 0.4 | 2.7 | 1.9 | 25.8 |
| 32 | Anthony Davis | 2018-19 | 25 | NOP | NBA | C | 56 | 56 | 33.0 | 9.5 | ... | 0.794 | 3.1 | 8.9 | 12.0 | 3.9 | 1.6 | 2.4 | 2.0 | 2.4 | 25.9 |
5 rows × 31 columns
Review
- In this tutorial, we learned how to:
- examine the html structure of webpage
- use functions based on the
Beautiful Soupmodule to parse tables on multiple webpage into .csv - analyzed a .csv file using the
Pandasmodule
Download this notebook or see a static view here
last updated: 2019-07-01 09:13
System and module version information:
Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3