Kivan Polimis, Wed 06 March 2019, Sports
In the last two blog posts we:
We will look select from the MVP finalists in 2017-2018
On May 17th, the NBA announced the 2018-2019 MVP finalists:
Beautiful Soup
to parse webpages into .csvPandas
DataFrameimport os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
url_2018_mvp_finalist = "https://www.basketball-reference.com/awards/awards_2018.html"
webbrowser.open_new_tab(url_2018_mvp_finalist)
True
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
soup_finalist = BeautifulSoup(html_finalist, "lxml")
column_headers_finalist = [th.get_text() for th in soup_finalist.find_all('th', limit=30)]
column_headers_finalist = [header for header in column_headers_finalist if len(header) != 0]
column_headers_finalist = column_headers_finalist[1:]
print(f"raw column names in finalist table: {column_headers_finalist}")
column_headers_finalist = [header for header in column_headers_finalist if header not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print(f"formatted column names in finalist table: {column_headers_finalist}")
print(f"{len(column_headers_finalist)} columns in finalist table")
raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4'] formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48'] 20 columns in finalist table
tr
elements of the first tbody
elementtable_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(f"the subset soup object is of type: {type(table_rows_finalist)}")
table_rows_finalist[-1]
the subset soup object is of type: <class 'bs4.element.ResultSet'>
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
BeautifulSoup
objectplayer_href = [href for href in table_rows_finalist[-1].find_all("a")][0]["href"]
player_href
'/players/o/oladivi01.html'
base_basketball_ref_url = "https://www.basketball-reference.com"
player_link = base_basketball_ref_url + player_href
player_link
'https://www.basketball-reference.com/players/o/oladivi01.html'
webbrowser.open_new_tab(player_link)
True
def extract_finalist_data(table_rows):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP finalist observations
"""
base_basketball_ref_url = "https://www.basketball-reference.com"
# create the empty list to store the player data
player_data = []
for row in table_rows: # for each row do the following
# Get the text for each table data (td) element in the row
player_rank = [td.get_text() for td in row.find_all("th")]
player_list = [td.get_text() for td in row.find_all("td")]
player_href = [href for href in row.find_all("a")][0]["href"]
player_link = [base_basketball_ref_url + player_href]
# there are some empty table rows, which are the repeated column headers in the table
# we skip over those rows and and continue the for loop
if not player_list:
continue
# Now append the individual player data to list of all player data
player_info = player_rank+player_list+player_link
player_data.append(player_info)
return player_data
DataFrame
extracted_finalist_2018_data = extract_finalist_data(table_rows_finalist)
mvp_finalist_2018_data = pd.DataFrame(extracted_finalist_2018_data)
mvp_finalist_2018_data.columns = column_headers_finalist+["player_link"]
print(f"the MVP finalist dataframe has {mvp_finalist_2018_data.shape[0]} rows (player-year observations) and {mvp_finalist_2018_data.shape[1]} columns")
mvp_finalist_2018_data.tail(6)
the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | ... | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | player_link | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | DeMar DeRozan | 28 | TOR | 0.0 | 32.0 | 1010 | 0.032 | 80 | 33.9 | ... | 3.9 | 5.2 | 1.1 | 0.3 | .456 | .310 | .825 | 9.6 | .170 | https://www.basketball-reference.com/players/d... |
8 | 9 | LaMarcus Aldridge | 32 | SAS | 0.0 | 6.0 | 1010 | 0.006 | 75 | 33.5 | ... | 8.5 | 2.0 | 0.6 | 1.2 | .510 | .293 | .837 | 10.9 | .209 | https://www.basketball-reference.com/players/a... |
9 | 10T | Jimmy Butler | 28 | MIN | 0.0 | 5.0 | 1010 | 0.005 | 59 | 36.7 | ... | 5.3 | 4.9 | 2.0 | 0.4 | .474 | .350 | .854 | 8.9 | .198 | https://www.basketball-reference.com/players/b... |
10 | 10T | Stephen Curry | 29 | GSW | 0.0 | 5.0 | 1010 | 0.005 | 51 | 32.0 | ... | 5.1 | 6.1 | 1.6 | 0.2 | .495 | .423 | .921 | 9.1 | .267 | https://www.basketball-reference.com/players/c... |
11 | 12 | Joel Embiid | 23 | PHI | 0.0 | 4.0 | 1010 | 0.004 | 63 | 30.3 | ... | 11.0 | 3.2 | 0.6 | 1.8 | .483 | .308 | .769 | 6.2 | .155 | https://www.basketball-reference.com/players/e... |
12 | 13 | Victor Oladipo | 25 | IND | 0.0 | 2.0 | 1010 | 0.002 | 75 | 34.0 | ... | 5.2 | 4.3 | 2.4 | 0.8 | .477 | .371 | .799 | 8.2 | .155 | https://www.basketball-reference.com/players/o... |
6 rows × 21 columns
BeautifulSoup
objectplayer_test_table = "https://www.basketball-reference.com/players/h/hardeja01.html#per_game::none"
player_profile_request = urllib.request.urlopen(player_test_table)
player_profile_soup = BeautifulSoup(player_profile_request, "lxml")
column_headers_player = [th.get_text() for th in player_profile_soup.find_all('th', limit=30)]
column_headers_player = [header for header in column_headers_player if len(header) != 0]
column_headers_player = column_headers_player[1:]
print(f"the columns in the career data tables are: \n {column_headers_player} \n")
print(f"there are {len(column_headers_player)} total columns in the career table")
the columns in the career data tables are: ['Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'] there are 29 total columns in the career table
player_name = player_profile_soup.find_all("h1")[0].text
player_years_active = [th.get_text() for th in player_profile_soup.find_all("tbody")[0].find_all("th")]
player_career_data = [td.get_text() for td in player_profile_soup.find_all("td")]
print(f"player name: {player_name}")
print(f"player years active: {player_years_active}")
print(f"player career data: {player_career_data}")
player name: James Harden player years active: ['2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19'] player career data: ['20', 'OKC', 'NBA', 'SG', '76', '0', '22.9', '3.1', '7.6', '.403', '1.2', '3.3', '.375', '1.8', '4.3', '.424', '.484', '2.6', '3.2', '.808', '0.6', '2.6', '3.2', '1.8', '1.1', '0.3', '1.4', '2.6', '9.9', '21', 'OKC', 'NBA', 'SG', '82', '5', '26.7', '3.6', '8.3', '.436', '1.4', '4.0', '.349', '2.3', '4.4', '.514', '.518', '3.5', '4.2', '.843', '0.5', '2.6', '3.1', '2.1', '1.1', '0.3', '1.3', '2.5', '12.2', '22', 'OKC', 'NBA', 'SG', '62', '2', '31.4', '5.0', '10.1', '.491', '1.8', '4.7', '.390', '3.1', '5.4', '.579', '.582', '5.0', '6.0', '.846', '0.5', '3.6', '4.1', '3.7', '1.0', '0.2', '2.2', '2.4', '16.8', '23', 'HOU', 'NBA', 'SG', '78', '78', '38.3', '7.5', '17.1', '.438', '2.3', '6.2', '.368', '5.2', '10.9', '.477', '.504', '8.6', '10.2', '.851', '0.8', '4.1', '4.9', '5.8', '1.8', '0.5', '3.8', '2.3', '25.9', '24', 'HOU', 'NBA', 'SG', '73', '73', '38.0', '7.5', '16.5', '.456', '2.4', '6.6', '.366', '5.1', '9.9', '.515', '.529', '7.9', '9.1', '.866', '0.8', '3.9', '4.7', '6.1', '1.6', '0.4', '3.6', '2.4', '25.4', '25', 'HOU', 'NBA', 'SG', '81', '81', '36.8', '8.0', '18.1', '.440', '2.6', '6.9', '.375', '5.4', '11.3', '.480', '.511', '8.8', '10.2', '.868', '0.9', '4.7', '5.7', '7.0', '1.9', '0.7', '4.0', '2.6', '27.4', '26', 'HOU', 'NBA', 'SG', '82', '82', '38.1', '8.7', '19.7', '.439', '2.9', '8.0', '.359', '5.8', '11.7', '.494', '.512', '8.8', '10.2', '.860', '0.8', '5.3', '6.1', '7.5', '1.7', '0.6', '4.6', '2.8', '29.0', '27', 'HOU', 'NBA', 'PG', '81', '81', '36.4', '8.3', '18.9', '.440', '3.2', '9.3', '.347', '5.1', '9.6', '.530', '.525', '9.2', '10.9', '.847', '1.2', '7.0', '8.1', '11.2', '1.5', '0.5', '5.7', '2.7', '29.1', '28', 'HOU', 'NBA', 'SG', '72', '72', '35.4', '9.0', '20.1', '.449', '3.7', '10.0', '.367', '5.4', '10.1', '.531', '.541', '8.7', '10.1', '.858', '0.6', '4.8', '5.4', '8.8', '1.8', '0.7', '4.4', '2.3', '30.4', '29', 'HOU', 'NBA', 'PG', '78', '78', '36.8', '10.8', '24.5', '.442', '4.8', '13.2', '.368', '6.0', '11.3', '.528', '.541', '9.7', '11.0', '.879', '0.8', '5.8', '6.6', '7.5', '2.0', '0.7', '5.0', '3.1', '36.1', '', '', 'NBA', '', '765', '552', '34.1', '7.2', '16.2', '.443', '2.6', '7.3', '.365', '4.5', '9.0', '.506', '.525', '7.3', '8.5', '.857', '0.8', '4.5', '5.2', '6.2', '1.6', '0.5', '3.6', '2.6', '24.3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'HOU', 'NBA', '', '545', '545', '37.1', '8.5', '19.3', '.443', '3.1', '8.6', '.364', '5.4', '10.7', '.506', '.524', '8.8', '10.2', '.861', '0.8', '5.1', '6.0', '7.7', '1.8', '0.6', '4.4', '2.6', '29.0', '', 'OKC', 'NBA', '', '220', '7', '26.7', '3.8', '8.6', '.444', '1.5', '3.9', '.370', '2.4', '4.7', '.506', '.529', '3.6', '4.3', '.835', '0.5', '2.9', '3.4', '2.5', '1.1', '0.3', '1.6', '2.5', '12.7']
def slice_per(source, step):
return [source[i::step] for i in range(step)]
player_career_data
object every 29 steps (using the language of the slice_per
function) to create a year of datasliced_player_data = slice_per(player_career_data, 29)
print(type(sliced_player_data))
player_career_df = pd.DataFrame(sliced_player_data).transpose()
player_career_df.columns = column_headers_player
player_career_df
<class 'list'>
Age | Tm | Lg | Pos | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | 7.6 | .403 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
1 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | 8.3 | .436 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
2 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | 10.1 | .491 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
3 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | 17.1 | .438 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
4 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | 16.5 | .456 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 | 25 | HOU | NBA | SG | 81 | 81 | 36.8 | 8.0 | 18.1 | .440 | ... | .868 | 0.9 | 4.7 | 5.7 | 7.0 | 1.9 | 0.7 | 4.0 | 2.6 | 27.4 |
6 | 26 | HOU | NBA | SG | 82 | 82 | 38.1 | 8.7 | 19.7 | .439 | ... | .860 | 0.8 | 5.3 | 6.1 | 7.5 | 1.7 | 0.6 | 4.6 | 2.8 | 29.0 |
7 | 27 | HOU | NBA | PG | 81 | 81 | 36.4 | 8.3 | 18.9 | .440 | ... | .847 | 1.2 | 7.0 | 8.1 | 11.2 | 1.5 | 0.5 | 5.7 | 2.7 | 29.1 |
8 | 28 | HOU | NBA | SG | 72 | 72 | 35.4 | 9.0 | 20.1 | .449 | ... | .858 | 0.6 | 4.8 | 5.4 | 8.8 | 1.8 | 0.7 | 4.4 | 2.3 | 30.4 |
9 | 29 | HOU | NBA | PG | 78 | 78 | 36.8 | 10.8 | 24.5 | .442 | ... | .879 | 0.8 | 5.8 | 6.6 | 7.5 | 2.0 | 0.7 | 5.0 | 3.1 | 36.1 |
10 | NBA | 765 | 552 | 34.1 | 7.2 | 16.2 | .443 | ... | .857 | 0.8 | 4.5 | 5.2 | 6.2 | 1.6 | 0.5 | 3.6 | 2.6 | 24.3 | |||
11 | ... | ||||||||||||||||||||
12 | HOU | NBA | 545 | 545 | 37.1 | 8.5 | 19.3 | .443 | ... | .861 | 0.8 | 5.1 | 6.0 | 7.7 | 1.8 | 0.6 | 4.4 | 2.6 | 29.0 | ||
13 | OKC | NBA | 220 | 7 | 26.7 | 3.8 | 8.6 | .444 | ... | .835 | 0.5 | 2.9 | 3.4 | 2.5 | 1.1 | 0.3 | 1.6 | 2.5 | 12.7 |
14 rows × 29 columns
def extract_career_data(player_link):
"""
Extract and return the the desired information from the td elements within the table rows.
:param: table_rows: list of soup `tr` elements
:return: list of player-year MVP finalist observations
"""
player_profile_request = urllib.request.urlopen(player_link)
# create the BeautifulSoup object
player_profile_soup = BeautifulSoup(player_profile_request, "lxml")
extracted_player_data = [td for td in player_profile_soup.find_all("tbody")]
player_name = player_profile_soup.find_all("h1")[0].text
player_years_active = [th.get_text() for th in player_profile_soup.find_all("tbody")[0].find_all("th")]
player_career_data = [td.get_text() for td in extracted_player_data[0].find_all("td")]
# slice list of list (player_data into yearly lists)
sliced_player_data = slice_per(player_career_data, 29)
player_career_df = pd.DataFrame(sliced_player_data).transpose()
player_career_df.insert(0, "Player", player_name)
player_career_df.insert(1, "Year", player_years_active)
return player_career_df
mvp_finalist_2018_data
all_player_career_data = [extract_career_data(player_link) for player_link in mvp_finalist_2018_data["player_link"]]
Pandas
concat function to store all finalist data in one DataFrame
DataFrame
mvp_finalist_2019_career_data = pd.concat(all_player_career_data, axis=0)
print(f"the MVP finalist dataframe has {mvp_finalist_2018_data.shape[0]} rows (player-year observations) and {mvp_finalist_2018_data.shape[1]} columns")
mvp_finalist_2019_career_data.head()
the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
Player | Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | .808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | .843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | .846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | .851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | .866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
column_headers_player
with the two new columns we added in our extract_career_data
functionmvp_finalist_2019_career_data.columns = ["Player"] + ["Year"] + column_headers_player
mvp_finalist_2019_career_data.columns
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'], dtype='object')
mvp_finalist_2019_career_data
datamvp_finalist_2019_career_data.tail()
Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Victor Oladipo | 2014-15 | 22 | ORL | NBA | SG | 72 | 71 | 35.7 | 6.6 | ... | .819 | 0.7 | 3.5 | 4.2 | 4.1 | 1.7 | 0.3 | 2.8 | 2.6 | 17.9 |
2 | Victor Oladipo | 2015-16 | 23 | ORL | NBA | SG | 72 | 52 | 33.0 | 5.9 | ... | .830 | 0.7 | 4.1 | 4.8 | 3.9 | 1.6 | 0.8 | 2.1 | 2.4 | 16.0 |
3 | Victor Oladipo | 2016-17 | 24 | OKC | NBA | SG | 67 | 67 | 33.2 | 6.1 | ... | .753 | 0.6 | 3.8 | 4.3 | 2.6 | 1.2 | 0.3 | 1.8 | 2.3 | 15.9 |
4 | Victor Oladipo | 2017-18 | 25 | IND | NBA | SG | 75 | 75 | 34.0 | 8.5 | ... | .799 | 0.6 | 4.6 | 5.2 | 4.3 | 2.4 | 0.8 | 2.9 | 2.3 | 23.1 |
5 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | .730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
5 rows × 31 columns
Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.
os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)
mvp_finalist_2019_career_data.to_csv("../data/raw_data/mvp_finalist_2019_career_data.csv", index=False)
mvp_finalist_2019_df_clean = pd.read_csv("../data/raw_data/mvp_finalist_2019_career_data.csv", encoding = "Latin-1")
mvp_finalist_2019_df_clean.head()
Player | Year | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
mvp_finalist_2019_df_clean.columns
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'], dtype='object')
mvp_finalist_columns_dict = {'Player':'player', 'Year':'year', 'Age': 'age', 'Tm': 'team', 'Lg': 'league',
'POS': 'position', 'G': 'games_played', 'GS': 'games_started', 'MP': 'avg_minutes',
'FG': 'field_goals_made_per_game', 'FGA': 'field_goals_attempted_per_game',
'FG%': 'field_goal_pct', '3P': 'three_pt_fg_made_per_game', '3PA': 'three_pt_fg_attempted_per_game',
'3P%': 'three_pt_pct', '2P': 'two_pt_fg_made_per_game', '2PA': 'two_pt_fg_attempted_per_game',
'2P%': 'two_pt_fg_pct', 'eFG%': 'effective_fg_pct',
'FT': 'free_throws_made_per_game', 'FTA': 'free_throws_attempted_per_game',
'FT%': 'free_throw_pct', 'ORB': 'offensive_rebounds_per_game', 'DRB': 'defensive_rebounds_per_game',
'TRB': 'total_rebounds_per_game', 'AST': 'assists_per_game', 'STL': 'steals_per_game',
'BLK': 'blocks_per_game', 'TOV': 'turnovers_per_game', 'PF': 'fouls_committed_per_game', 'PTS': 'points_per_game'
}
mvp_finalist_2019_df_clean.rename(index=str,columns=mvp_finalist_columns_dict, inplace=True)
mvp_finalist_2019_df_clean.head()
player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | James Harden | 2009-10 | 20 | OKC | NBA | SG | 76 | 0 | 22.9 | 3.1 | ... | 0.808 | 0.6 | 2.6 | 3.2 | 1.8 | 1.1 | 0.3 | 1.4 | 2.6 | 9.9 |
1 | James Harden | 2010-11 | 21 | OKC | NBA | SG | 82 | 5 | 26.7 | 3.6 | ... | 0.843 | 0.5 | 2.6 | 3.1 | 2.1 | 1.1 | 0.3 | 1.3 | 2.5 | 12.2 |
2 | James Harden | 2011-12 | 22 | OKC | NBA | SG | 62 | 2 | 31.4 | 5.0 | ... | 0.846 | 0.5 | 3.6 | 4.1 | 3.7 | 1.0 | 0.2 | 2.2 | 2.4 | 16.8 |
3 | James Harden | 2012-13 | 23 | HOU | NBA | SG | 78 | 78 | 38.3 | 7.5 | ... | 0.851 | 0.8 | 4.1 | 4.9 | 5.8 | 1.8 | 0.5 | 3.8 | 2.3 | 25.9 |
4 | James Harden | 2013-14 | 24 | HOU | NBA | SG | 73 | 73 | 38.0 | 7.5 | ... | 0.866 | 0.8 | 3.9 | 4.7 | 6.1 | 1.6 | 0.4 | 3.6 | 2.4 | 25.4 |
5 rows × 31 columns
mvp_finalist_2019_df_clean.columns
Index(['player', 'year', 'age', 'team', 'league', 'Pos', 'games_played', 'games_started', 'avg_minutes', 'field_goals_made_per_game', 'field_goals_attempted_per_game', 'field_goal_pct', 'three_pt_fg_made_per_game', 'three_pt_fg_attempted_per_game', 'three_pt_pct', 'two_pt_fg_made_per_game', 'two_pt_fg_attempted_per_game', 'two_pt_fg_pct', 'effective_fg_pct', 'free_throws_made_per_game', 'free_throws_attempted_per_game', 'free_throw_pct', 'offensive_rebounds_per_game', 'defensive_rebounds_per_game', 'total_rebounds_per_game', 'assists_per_game', 'steals_per_game', 'blocks_per_game', 'turnovers_per_game', 'fouls_committed_per_game', 'points_per_game'], dtype='object')
mvp_finalist_2019_df_clean = mvp_finalist_2019_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_finalist_2019_df_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 121 entries, 0 to 120 Data columns (total 31 columns): player 121 non-null object year 121 non-null object age 121 non-null int64 team 121 non-null object league 121 non-null object Pos 121 non-null object games_played 121 non-null int64 games_started 121 non-null int64 avg_minutes 121 non-null float64 field_goals_made_per_game 121 non-null float64 field_goals_attempted_per_game 121 non-null float64 field_goal_pct 121 non-null float64 three_pt_fg_made_per_game 121 non-null float64 three_pt_fg_attempted_per_game 121 non-null float64 three_pt_pct 121 non-null float64 two_pt_fg_made_per_game 121 non-null float64 two_pt_fg_attempted_per_game 121 non-null float64 two_pt_fg_pct 121 non-null float64 effective_fg_pct 121 non-null float64 free_throws_made_per_game 121 non-null float64 free_throws_attempted_per_game 121 non-null float64 free_throw_pct 121 non-null float64 offensive_rebounds_per_game 121 non-null float64 defensive_rebounds_per_game 121 non-null float64 total_rebounds_per_game 121 non-null float64 assists_per_game 121 non-null float64 steals_per_game 121 non-null float64 blocks_per_game 121 non-null float64 turnovers_per_game 121 non-null float64 fouls_committed_per_game 121 non-null float64 points_per_game 121 non-null float64 dtypes: float64(23), int64(3), object(5) memory usage: 30.2+ KB
num_cols_finalist = mvp_finalist_2019_df_clean.columns[mvp_finalist_2019_df_clean.dtypes != object]
mvp_finalist_2019_df_clean.loc[:, num_cols_finalist] = mvp_finalist_2019_df_clean.loc[:, num_cols_finalist].fillna(0)
mvp_finalist_2019_df_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 121 entries, 0 to 120 Data columns (total 31 columns): player 121 non-null object year 121 non-null object age 121 non-null int64 team 121 non-null object league 121 non-null object Pos 121 non-null object games_played 121 non-null int64 games_started 121 non-null int64 avg_minutes 121 non-null float64 field_goals_made_per_game 121 non-null float64 field_goals_attempted_per_game 121 non-null float64 field_goal_pct 121 non-null float64 three_pt_fg_made_per_game 121 non-null float64 three_pt_fg_attempted_per_game 121 non-null float64 three_pt_pct 121 non-null float64 two_pt_fg_made_per_game 121 non-null float64 two_pt_fg_attempted_per_game 121 non-null float64 two_pt_fg_pct 121 non-null float64 effective_fg_pct 121 non-null float64 free_throws_made_per_game 121 non-null float64 free_throws_attempted_per_game 121 non-null float64 free_throw_pct 121 non-null float64 offensive_rebounds_per_game 121 non-null float64 defensive_rebounds_per_game 121 non-null float64 total_rebounds_per_game 121 non-null float64 assists_per_game 121 non-null float64 steals_per_game 121 non-null float64 blocks_per_game 121 non-null float64 turnovers_per_game 121 non-null float64 fouls_committed_per_game 121 non-null float64 points_per_game 121 non-null float64 dtypes: float64(23), int64(3), object(5) memory usage: 30.2+ KB
mvp_finalist_2019_df_clean = mvp_finalist_2019_df_clean[pd.notnull(mvp_finalist_2019_df_clean['player'])]
mvp_finalist_2019_df_clean.sort_values(['year'], ascending=False, axis=0, inplace=True)
mvp_finalist_2019_df_clean.to_csv("../data/clean_data/mvp_finalist_2019_df_clean.csv", index=False)
print(f" the dimensions for the final data are: {mvp_finalist_2019_df_clean.shape} (rows, columns)")
mvp_finalist_2019_df_clean.head()
the dimensions for the final data are: (121, 31) (rows, columns)
player | year | age | team | league | Pos | games_played | games_started | avg_minutes | field_goals_made_per_game | ... | free_throw_pct | offensive_rebounds_per_game | defensive_rebounds_per_game | total_rebounds_per_game | assists_per_game | steals_per_game | blocks_per_game | turnovers_per_game | fouls_committed_per_game | points_per_game | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | Victor Oladipo | 2018-19 | 26 | IND | NBA | SG | 36 | 36 | 31.9 | 6.9 | ... | 0.730 | 0.6 | 5.0 | 5.6 | 5.2 | 1.7 | 0.3 | 2.3 | 2.0 | 18.8 |
25 | LeBron James | 2018-19 | 34 | LAL | NBA | SF | 55 | 55 | 35.2 | 10.1 | ... | 0.665 | 1.0 | 7.4 | 8.5 | 8.3 | 1.3 | 0.6 | 3.6 | 1.7 | 27.4 |
91 | LaMarcus Aldridge | 2018-19 | 33 | SAS | NBA | C | 81 | 81 | 33.2 | 8.4 | ... | 0.847 | 3.1 | 6.1 | 9.2 | 2.4 | 0.5 | 1.3 | 1.8 | 2.2 | 21.3 |
39 | Damian Lillard | 2018-19 | 28 | POR | NBA | PG | 80 | 80 | 35.5 | 8.5 | ... | 0.912 | 0.9 | 3.8 | 4.6 | 6.9 | 1.1 | 0.4 | 2.7 | 1.9 | 25.8 |
32 | Anthony Davis | 2018-19 | 25 | NOP | NBA | C | 56 | 56 | 33.0 | 9.5 | ... | 0.794 | 3.1 | 8.9 | 12.0 | 3.9 | 1.6 | 2.4 | 2.0 | 2.4 | 25.9 |
5 rows × 31 columns
Beautiful Soup
module to parse tables on multiple webpage into .csvPandas
moduleimport sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")
last updated: 2019-07-01 09:13 System and module version information: Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0) urllib.request version: 3.7 pandas version: 0.23.4 Beautiful Soup version: 4.6.3