Code
import os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoupIn the last two blog posts we:
1. Gathered the relevant data for all NBA MVPs 2. Evaluated several machine learning models to select the MVP and chose the Latent Discriminant Analysis model as the best predictor.
We will look select from the MVP finalists in 2017-2018 * MVP will likely be a finalist from the previous year * Is this a valid assumption?
On May 17th, the NBA announced the 2018-2019 MVP finalists: * Giannis Antetokounmpo, Milwaukee Bucks * Paul George, Oklahoma City Thunder * James Harden, Houston Rockets * source: https://www.nba.com/article/2019/05/17/finalists-2019-nba-awards
Beautiful Soup to parse webpages into .csvPandas DataFrameprint(f"raw column names in finalist table: {column_headers_finalist}")
column_headers_finalist = [header for header in column_headers_finalist if header not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print(f"formatted column names in finalist table: {column_headers_finalist}")
print(f"{len(column_headers_finalist)} columns in finalist table")raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
tr elements of the first tbody elementthe subset soup object is of type: <class 'bs4.element.ResultSet'>
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
BeautifulSoup object'/players/o/oladivi01.html'
'https://www.basketball-reference.com/players/o/oladivi01.html'
def extract_finalist_data(table_rows):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP finalist observations
"""
base_basketball_ref_url = "https://www.basketball-reference.com"
# create the empty list to store the player data
player_data = []
for row in table_rows: # for each row do the following
# Get the text for each table data (td) element in the row
player_rank = [td.get_text() for td in row.find_all("th")]
player_list = [td.get_text() for td in row.find_all("td")]
player_href = [href for href in row.find_all("a")][0]["href"]
player_link = [base_basketball_ref_url + player_href]
# there are some empty table rows, which are the repeated column headers in the table
# we skip over those rows and and continue the for loop
if not player_list:
continue
# Now append the individual player data to list of all player data
player_info = player_rank+player_list+player_link
player_data.append(player_info)
return player_dataDataFrameextracted_finalist_2018_data = extract_finalist_data(table_rows_finalist)
mvp_finalist_2018_data = pd.DataFrame(extracted_finalist_2018_data)
mvp_finalist_2018_data.columns = column_headers_finalist+["player_link"]
print(f"the MVP finalist dataframe has {mvp_finalist_2018_data.shape[0]} rows (player-year observations) and {mvp_finalist_2018_data.shape[1]} columns")
mvp_finalist_2018_data.tail(6)the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
| Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | ... | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | player_link | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | DeMar DeRozan | 28 | TOR | 0.0 | 32.0 | 1010 | 0.032 | 80 | 33.9 | ... | 3.9 | 5.2 | 1.1 | 0.3 | .456 | .310 | .825 | 9.6 | .170 | https://www.basketball-reference.com/players/d... |
| 8 | 9 | LaMarcus Aldridge | 32 | SAS | 0.0 | 6.0 | 1010 | 0.006 | 75 | 33.5 | ... | 8.5 | 2.0 | 0.6 | 1.2 | .510 | .293 | .837 | 10.9 | .209 | https://www.basketball-reference.com/players/a... |
| 9 | 10T | Jimmy Butler | 28 | MIN | 0.0 | 5.0 | 1010 | 0.005 | 59 | 36.7 | ... | 5.3 | 4.9 | 2.0 | 0.4 | .474 | .350 | .854 | 8.9 | .198 | https://www.basketball-reference.com/players/b... |
| 10 | 10T | Stephen Curry | 29 | GSW | 0.0 | 5.0 | 1010 | 0.005 | 51 | 32.0 | ... | 5.1 | 6.1 | 1.6 | 0.2 | .495 | .423 | .921 | 9.1 | .267 | https://www.basketball-reference.com/players/c... |
| 11 | 12 | Joel Embiid | 23 | PHI | 0.0 | 4.0 | 1010 | 0.004 | 63 | 30.3 | ... | 11.0 | 3.2 | 0.6 | 1.8 | .483 | .308 | .769 | 6.2 | .155 | https://www.basketball-reference.com/players/e... |
| 12 | 13 | Victor Oladipo | 25 | IND | 0.0 | 2.0 | 1010 | 0.002 | 75 | 34.0 | ... | 5.2 | 4.3 | 2.4 | 0.8 | .477 | .371 | .799 | 8.2 | .155 | https://www.basketball-reference.com/players/o... |
6 rows × 21 columns
BeautifulSoup objectcolumn_headers_player = [th.get_text() for th in player_profile_soup.find_all('th', limit=30)]
column_headers_player = [header for header in column_headers_player if len(header) != 0]
column_headers_player = column_headers_player[1:]
print(f"the columns in the career data tables are: \n {column_headers_player} \n")
print(f"there are {len(column_headers_player)} total columns in the career table")the columns in the career data tables are:
['Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
there are 29 total columns in the career table
player name: James Harden
player years active: ['2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
player career data: ['20', 'OKC', 'NBA', 'SG', '76', '0', '22.9', '3.1', '7.6', '.403', '1.2', '3.3', '.375', '1.8', '4.3', '.424', '.484', '2.6', '3.2', '.808', '0.6', '2.6', '3.2', '1.8', '1.1', '0.3', '1.4', '2.6', '9.9', '21', 'OKC', 'NBA', 'SG', '82', '5', '26.7', '3.6', '8.3', '.436', '1.4', '4.0', '.349', '2.3', '4.4', '.514', '.518', '3.5', '4.2', '.843', '0.5', '2.6', '3.1', '2.1', '1.1', '0.3', '1.3', '2.5', '12.2', '22', 'OKC', 'NBA', 'SG', '62', '2', '31.4', '5.0', '10.1', '.491', '1.8', '4.7', '.390', '3.1', '5.4', '.579', '.582', '5.0', '6.0', '.846', '0.5', '3.6', '4.1', '3.7', '1.0', '0.2', '2.2', '2.4', '16.8', '23', 'HOU', 'NBA', 'SG', '78', '78', '38.3', '7.5', '17.1', '.438', '2.3', '6.2', '.368', '5.2', '10.9', '.477', '.504', '8.6', '10.2', '.851', '0.8', '4.1', '4.9', '5.8', '1.8', '0.5', '3.8', '2.3', '25.9', '24', 'HOU', 'NBA', 'SG', '73', '73', '38.0', '7.5', '16.5', '.456', '2.4', '6.6', '.366', '5.1', '9.9', '.515', '.529', '7.9', '9.1', '.866', '0.8', '3.9', '4.7', '6.1', '1.6', '0.4', '3.6', '2.4', '25.4', '25', 'HOU', 'NBA', 'SG', '81', '81', '36.8', '8.0', '18.1', '.440', '2.6', '6.9', '.375', '5.4', '11.3', '.480', '.511', '8.8', '10.2', '.868', '0.9', '4.7', '5.7', '7.0', '1.9', '0.7', '4.0', '2.6', '27.4', '26', 'HOU', 'NBA', 'SG', '82', '82', '38.1', '8.7', '19.7', '.439', '2.9', '8.0', '.359', '5.8', '11.7', '.494', '.512', '8.8', '10.2', '.860', '0.8', '5.3', '6.1', '7.5', '1.7', '0.6', '4.6', '2.8', '29.0', '27', 'HOU', 'NBA', 'PG', '81', '81', '36.4', '8.3', '18.9', '.440', '3.2', '9.3', '.347', '5.1', '9.6', '.530', '.525', '9.2', '10.9', '.847', '1.2', '7.0', '8.1', '11.2', '1.5', '0.5', '5.7', '2.7', '29.1', '28', 'HOU', 'NBA', 'SG', '72', '72', '35.4', '9.0', '20.1', '.449', '3.7', '10.0', '.367', '5.4', '10.1', '.531', '.541', '8.7', '10.1', '.858', '0.6', '4.8', '5.4', '8.8', '1.8', '0.7', '4.4', '2.3', '30.4', '29', 'HOU', 'NBA', 'PG', '78', '78', '36.8', '10.8', '24.5', '.442', '4.8', '13.2', '.368', '6.0', '11.3', '.528', '.541', '9.7', '11.0', '.879', '0.8', '5.8', '6.6', '7.5', '2.0', '0.7', '5.0', '3.1', '36.1', '', '', 'NBA', '', '765', '552', '34.1', '7.2', '16.2', '.443', '2.6', '7.3', '.365', '4.5', '9.0', '.506', '.525', '7.3', '8.5', '.857', '0.8', '4.5', '5.2', '6.2', '1.6', '0.5', '3.6', '2.6', '24.3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'HOU', 'NBA', '', '545', '545', '37.1', '8.5', '19.3', '.443', '3.1', '8.6', '.364', '5.4', '10.7', '.506', '.524', '8.8', '10.2', '.861', '0.8', '5.1', '6.0', '7.7', '1.8', '0.6', '4.4', '2.6', '29.0', '', 'OKC', 'NBA', '', '220', '7', '26.7', '3.8', '8.6', '.444', '1.5', '3.9', '.370', '2.4', '4.7', '.506', '.529', '3.6', '4.3', '.835', '0.5', '2.9', '3.4', '2.5', '1.1', '0.3', '1.6', '2.5', '12.7']
player_career_data object every 29 steps (using the language of the slice_per function) to create a year of data<class 'list'>
| Age | Tm | Lg | Pos | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | 7.6 | .403 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | 8.3 | .436 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | 10.1 | .491 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | 17.1 | .438 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | 16.5 | .456 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
| 5 | 25 | HOU | NBA | SG | 81 | 81 | 36.8 | 8.0 | 18.1 | .440 | ... | .868 | 0.9 | 4.7 | 5.7 | 7.0 | 1.9 | 0.7 | 4.0 | 2.6 | 27.4 |
| 6 | 26 | HOU | NBA | SG | 82 | 82 | 38.1 | 8.7 | 19.7 | .439 | ... | .860 | 0.8 | 5.3 | 6.1 | 7.5 | 1.7 | 0.6 | 4.6 | 2.8 | 29.0 |
| 7 | 27 | HOU | NBA | PG | 81 | 81 | 36.4 | 8.3 | 18.9 | .440 | ... | .847 | 1.2 | 7.0 | 8.1 | 11.2 | 1.5 | 0.5 | 5.7 | 2.7 | 29.1 |
| 8 | 28 | HOU | NBA | SG | 72 | 72 | 35.4 | 9.0 | 20.1 | .449 | ... | .858 | 0.6 | 4.8 | 5.4 | 8.8 | 1.8 | 0.7 | 4.4 | 2.3 | 30.4 |
| 9 | 29 | HOU | NBA | PG | 78 | 78 | 36.8 | 10.8 | 24.5 | .442 | ... | .879 | 0.8 | 5.8 | 6.6 | 7.5 | 2.0 | 0.7 | 5.0 | 3.1 | 36.1 |
| 10 | NBA | 765 | 552 | 34.1 | 7.2 | 16.2 | .443 | ... | .857 | 0.8 | 4.5 | 5.2 | 6.2 | 1.6 | 0.5 | 3.6 | 2.6 | 24.3 | |||
| 11 | ... | ||||||||||||||||||||
| 12 | HOU | NBA | 545 | 545 | 37.1 | 8.5 | 19.3 | .443 | ... | .861 | 0.8 | 5.1 | 6.0 | 7.7 | 1.8 | 0.6 | 4.4 | 2.6 | 29.0 | ||
| 13 | OKC | NBA | 220 | 7 | 26.7 | 3.8 | 8.6 | .444 | ... | .835 | 0.5 | 2.9 | 3.4 | 2.5 | 1.1 | 0.3 | 1.6 | 2.5 | 12.7 |
14 rows × 29 columns
def extract_career_data(player_link):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP finalist observations
"""
player_profile_request = urllib.request.urlopen(player_link)
# create the BeautifulSoup object
player_profile_soup = BeautifulSoup(player_profile_request, "lxml")
extracted_player_data = [td for td in player_profile_soup.find_all("tbody")]
player_name = player_profile_soup.find_all("h1")[0].text
player_years_active = [th.get_text() for th in player_profile_soup.find_all("tbody")[0].find_all("th")]
player_career_data = [td.get_text() for td in extracted_player_data[0].find_all("td")]
# slice list of list (player_data into yearly lists)
sliced_player_data = slice_per(player_career_data, 29)
player_career_df = pd.DataFrame(sliced_player_data).transpose()
player_career_df.insert(0, "Player", player_name)
player_career_df.insert(1, "Year", player_years_active)
return player_career_dfmvp_finalist_2018_dataPandas concat function to store all finalist data in one DataFrameDataFramethe MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
| Player | Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
column_headers_player with the two new columns we added in our extract_career_data functionIndex(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
'PTS'],
dtype='object')
mvp_finalist_2019_career_data data| Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Victor Oladipo | 2014-15 | 22 | ORL | NBA | SG | 72 | 71 | 35.7 | 6.6 | ... | .819 | 0.7 | 3.5 | 4.2 | 4.1 | 1.7 | 0.3 | 2.8 | 2.6 | 17.9 |
| 2 | Victor Oladipo | 2015-16 | 23 | ORL | NBA | SG | 72 | 52 | 33.0 | 5.9 | ... | .830 | 0.7 | 4.1 | 4.8 | 3.9 | 1.6 | 0.8 | 2.1 | 2.4 | 16.0 |
| 3 | Victor Oladipo | 2016-17 | 24 | OKC | NBA | SG | 67 | 67 | 33.2 | 6.1 | ... | .753 | 0.6 | 3.8 | 4.3 | 2.6 | 1.2 | 0.3 | 1.8 | 2.3 | 15.9 |
| 4 | Victor Oladipo | 2017-18 | 25 | IND | NBA | SG | 75 | 75 | 34.0 | 8.5 | ... | .799 | 0.6 | 4.6 | 5.2 | 4.3 | 2.4 | 0.8 | 2.9 | 2.3 | 23.1 |
| 5 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | .730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
5 rows × 31 columns
Now that we fixed up the necessary columns, let’s write out the raw data to a CSV file.
| Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
'PTS'],
dtype='object')
mvp_finalist_columns_dict = {'Player':'player', 'Year':'year', 'Age': 'age', 'Tm': 'team', 'Lg': 'league',
'POS': 'position', 'G': 'games_played', 'GS': 'games_started', 'MP': 'avg_minutes',
'FG': 'field_goals_made_per_game', 'FGA': 'field_goals_attempted_per_game',
'FG%': 'field_goal_pct', '3P': 'three_pt_fg_made_per_game', '3PA': 'three_pt_fg_attempted_per_game',
'3P%': 'three_pt_pct', '2P': 'two_pt_fg_made_per_game', '2PA': 'two_pt_fg_attempted_per_game',
'2P%': 'two_pt_fg_pct', 'eFG%': 'effective_fg_pct',
'FT': 'free_throws_made_per_game', 'FTA': 'free_throws_attempted_per_game',
'FT%': 'free_throw_pct', 'ORB': 'offensive_rebounds_per_game', 'DRB': 'defensive_rebounds_per_game',
'TRB': 'total_rebounds_per_game', 'AST': 'assists_per_game', 'STL': 'steals_per_game',
'BLK': 'blocks_per_game', 'TOV': 'turnovers_per_game', 'PF': 'fouls_committed_per_game', 'PTS': 'points_per_game'
}| player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
| 1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
| 2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
| 3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
| 4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
Index(['player', 'year', 'age', 'team', 'league', 'Pos', 'games_played',
'games_started', 'avg_minutes', 'field_goals_made_per_game',
'field_goals_attempted_per_game', 'field_goal_pct',
'three_pt_fg_made_per_game', 'three_pt_fg_attempted_per_game',
'three_pt_pct', 'two_pt_fg_made_per_game',
'two_pt_fg_attempted_per_game', 'two_pt_fg_pct', 'effective_fg_pct',
'free_throws_made_per_game', 'free_throws_attempted_per_game',
'free_throw_pct', 'offensive_rebounds_per_game',
'defensive_rebounds_per_game', 'total_rebounds_per_game',
'assists_per_game', 'steals_per_game', 'blocks_per_game',
'turnovers_per_game', 'fouls_committed_per_game', 'points_per_game'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player 121 non-null object
year 121 non-null object
age 121 non-null int64
team 121 non-null object
league 121 non-null object
Pos 121 non-null object
games_played 121 non-null int64
games_started 121 non-null int64
avg_minutes 121 non-null float64
field_goals_made_per_game 121 non-null float64
field_goals_attempted_per_game 121 non-null float64
field_goal_pct 121 non-null float64
three_pt_fg_made_per_game 121 non-null float64
three_pt_fg_attempted_per_game 121 non-null float64
three_pt_pct 121 non-null float64
two_pt_fg_made_per_game 121 non-null float64
two_pt_fg_attempted_per_game 121 non-null float64
two_pt_fg_pct 121 non-null float64
effective_fg_pct 121 non-null float64
free_throws_made_per_game 121 non-null float64
free_throws_attempted_per_game 121 non-null float64
free_throw_pct 121 non-null float64
offensive_rebounds_per_game 121 non-null float64
defensive_rebounds_per_game 121 non-null float64
total_rebounds_per_game 121 non-null float64
assists_per_game 121 non-null float64
steals_per_game 121 non-null float64
blocks_per_game 121 non-null float64
turnovers_per_game 121 non-null float64
fouls_committed_per_game 121 non-null float64
points_per_game 121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player 121 non-null object
year 121 non-null object
age 121 non-null int64
team 121 non-null object
league 121 non-null object
Pos 121 non-null object
games_played 121 non-null int64
games_started 121 non-null int64
avg_minutes 121 non-null float64
field_goals_made_per_game 121 non-null float64
field_goals_attempted_per_game 121 non-null float64
field_goal_pct 121 non-null float64
three_pt_fg_made_per_game 121 non-null float64
three_pt_fg_attempted_per_game 121 non-null float64
three_pt_pct 121 non-null float64
two_pt_fg_made_per_game 121 non-null float64
two_pt_fg_attempted_per_game 121 non-null float64
two_pt_fg_pct 121 non-null float64
effective_fg_pct 121 non-null float64
free_throws_made_per_game 121 non-null float64
free_throws_attempted_per_game 121 non-null float64
free_throw_pct 121 non-null float64
offensive_rebounds_per_game 121 non-null float64
defensive_rebounds_per_game 121 non-null float64
total_rebounds_per_game 121 non-null float64
assists_per_game 121 non-null float64
steals_per_game 121 non-null float64
blocks_per_game 121 non-null float64
turnovers_per_game 121 non-null float64
fouls_committed_per_game 121 non-null float64
points_per_game 121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
the dimensions for the final data are: (121, 31) (rows, columns)
| player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | 0.730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
| 25 | LeBron James | 2018-19 | 34 | LAL | NBA | SF | 55 | 55 | 35.2 | 10.1 | ... | 0.665 | 1.0 | 7.4 | 8.5 | 8.3 | 1.3 | 0.6 | 3.6 | 1.7 | 27.4 |
| 91 | LaMarcus Aldridge | 2018-19 | 33 | SAS | NBA | C | 81 | 81 | 33.2 | 8.4 | ... | 0.847 | 3.1 | 6.1 | 9.2 | 2.4 | 0.5 | 1.3 | 1.8 | 2.2 | 21.3 |
| 39 | Damian Lillard | 2018-19 | 28 | POR | NBA | PG | 80 | 80 | 35.5 | 8.5 | ... | 0.912 | 0.9 | 3.8 | 4.6 | 6.9 | 1.1 | 0.4 | 2.7 | 1.9 | 25.8 |
| 32 | Anthony Davis | 2018-19 | 25 | NOP | NBA | C | 56 | 56 | 33.0 | 9.5 | ... | 0.794 | 3.1 | 8.9 | 12.0 | 3.9 | 1.6 | 2.4 | 2.0 | 2.4 | 25.9 |
5 rows × 31 columns
Beautiful Soup module to parse tables on multiple webpage into .csvPandas moduleDownload this notebook or see a static view here
import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")last updated: 2019-07-01 09:13
System and module version information:
Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3