We are going to use machine learning and statistics to predict NBA matchups. To do this, we are going to need data on NBA games, and lots of it. So let’s get all the team matchups and box scores from stats.nba.com, and make them ready for use.
This post has two purposes. The first is to show you how to do the actual web scraping. The second is to show you how to examine data before you use it. Data are almost always a bit messy and need to be handled with care, so it's important to take some time to look at them and make sure they're clean before use.
This post has a lot of code. If you don't want to spend time going through it, you can simply run it: the code will scrape and store the data for you, and you won't need to run it again after you have the data.
A Jupyter Notebook with the code and portions of the text from this post can be found here.
I think the most useful information in this post is learning how to assess the data quality, and make smart decisions about how to address potential problems. If you decide to just skim the code and focus on the decisions, hopefully this example will still be useful for you. You may also find it interesting to see how real-world NBA facts (such as expansion teams, team moves and name changes) actually impact the data!
The Data
We are going to scrape Team Box Scores for completed NBA seasons from stats.nba.com. We are going to start with traditional box scores, which give us plenty of statistics to consider, as well as all the matchup information we need. If you explore the site, you'll see there are also other more advanced data available. We'll consider these other data in future posts.
The advanced data are only available going back to the 1996-97 season, so we’ll limit our traditional box scores to the same time period. The site actually has traditional box scores going back to the 1946-47 season. Such old data are interesting, but aren’t useful for predicting current matchups. We have to consider how much the game has changed over the years. Although there have been a number of NBA rule changes since 1996 and game style has evolved, I think data over this time period are still useful today.
The data from 1996-97 to 2016-17 comprise 21 complete seasons, including pre-season data from recent years. This is plenty of data for our purposes. Notice that we’re not going to download data from the current 2017-18 season yet. We are going to use the historical data for building prediction models, since we know how the seasons turned out. After we build the model, we can apply it to the current season and see how it does over the remainder of the season.
Why Write a Web Scraper
As I mentioned, there’s a lot of code in this post. You may wonder why it’s worth it to build a web scraper.
First, in many cases it's the easiest way to get the data you need. Even if it takes a few hours to build a scraper, it's a lot faster than downloading the data manually. Consider the excellent site, Basketball Reference. Basketball Reference offers you the ability to download games into a CSV file. Here is a link to the 2016-17 season games.
Notice that this site shows games grouped by month. You can download the games by looking at the “Share & More” drop-down menu next to the month schedule header. Each month requires a separate download. So, if you want to download all the games for the 21 complete seasons since 1996-97, you’d need to manually download around 180 CSV files (there are 9 months in the typical NBA season). Then, to create a master spreadsheet of matchup results, you’d need to open them up in Microsoft Excel or Google Sheets and put all the data together in one place.
Also, if you click on one of the box score links (for example, the October 25, 2016 matchup between the Knicks and the Cavaliers), you’ll see very detailed box score information, but no easy way to extract and download all the box scores at once. Scraping the original data is much faster and more practical. As you’ll see, it’s not actually that hard.
Lastly, writing a scraper is essential if you want to update the data relatively frequently. If you create a matchup prediction model, you'll want to run it after every game, to update the probabilities for future matchups. Imagine having to type information into a spreadsheet or download a few new spreadsheets every night during the NBA season.
The web scraper we'll develop here will also be able to scrape current season games. So once you have it working, you can run it each day during the NBA season to get the latest matchup results and team box scores automatically.
Scraping a JavaScript-Rendered Web Site
stats.nba.com is a JavaScript-rendered site. As discussed in our previous post on web-scraping, these sites work differently from HTML-rendered sites.
We can still use the Requests package to communicate with the web site, but we can’t use BeautifulSoup like we did previously. BeautifulSoup is great for scraping HTML-rendered sites, but it doesn’t work with JavaScript-rendered sites.
Scraping the box scores isn't actually very hard, but it has one very tricky aspect. The URL to which you send the GET request isn't the same URL that you see in your browser. Remember that with JavaScript-rendered pages, the web site returns JavaScript code to your browser, and the browser then runs that JavaScript to fetch the data and render it in the window. The URL you want to scrape is really that second URL, which returns the data.
Let’s look at the specific links for the box scores. The URL for the link to the page we can view is:
https://stats.nba.com/teams/boxscores-traditional/?Season=2016-17&SeasonType=Regular%20Season
It's clear from the URL that this page has traditional box scores for the 2016-17 regular season, and that we can access different seasons by changing the parameters at the end of the URL.
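For example (you can verify this in your browser), changing the parameters should give the 2015-16 playoffs:
https://stats.nba.com/teams/boxscores-traditional/?Season=2015-16&SeasonType=Playoffs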
Now here’s the trick. The URL which returns the actual data is:
https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00
&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0
&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season
&ShotClockRange=&VsConference=&VsDivision=
That's all actually one line, which I've broken up for clarity. You can open a browser tab with the raw data by clicking this link. Notice that this URL has a lot of parameters.
So, how did I figure out that this is the correct URL?
Use the Browser Developer Tools
As mentioned in the prior post on web scraping, your browser’s developer tools make it easier to see what is going on. In that post, we used the Elements section of the developer tools to look at the different HTML tags. To scrape this JavaScript-rendered page, we will instead use the Network section. Here are instructions for how to use these tools in Chrome. Here are similar instructions for Firefox. I think these browsers are the easiest to use to understand JavaScript-rendered sites.
Under the Network section, check out the XHR tab. XHR is short for XMLHttpRequest, but the details don't matter. The point is that if you load (or reload) the page (with the "normal" URL above), you will see a few lines show up under the XHR tab, one of which starts with teamgamelogs. That's the item we want.
If you select that teamgamelogs item, in either Chrome or Firefox, you can explore what's in that object. I'll move on to the scraping now, but hopefully this brief introduction will serve to explain how you could figure this out for yourself on a different web page. As we go through the steps below, see if you can follow along and find the corresponding information in your browser's developer tools on that XHR tab.
If you explore other pages on stats.nba.com with the developer tools open to the XHR tab, you’ll be able to apply these techniques to figure out how to scrape those other pages, too.
Scraping the Site
Let’s do all the necessary imports.
from itertools import chain
from pathlib import Path
from time import sleep
from datetime import datetime
We are going to use Requests to fetch the data from the JavaScript-rendered site.
import requests
from tqdm import tqdm
tqdm.monitor_interval = 0  # work around a tqdm monitor thread bug in some versions
We will use pandas and numpy to explore and organize the data toward the end.
import numpy as np
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
We will store the data (both in raw form and processed for later analysis) on disk. The code below will create directories on your computer if they don't exist already. Feel free to change the directories below if you're running this code on your computer.
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data' / 'scraped' / 'teamgamelogs'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
USER_AGENT = (
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +
'AppleWebKit/537.36 (KHTML, like Gecko) ' +
'Chrome/61.0.3163.100 Safari/537.36'
)
REQUEST_HEADERS = {
'user-agent': USER_AGENT,
}
Below is the base URL for the team box scores. This type of URL is also called a web API endpoint. Recall that an API is an application programming interface, which makes it easier to communicate with a web site and obtain information.
The NBA data defines the NBA itself as league ’00’.
NBA_URL = 'https://stats.nba.com/stats/teamgamelogs'
NBA_ID = '00'
The site can give us information for the regular season, playoffs and pre-season.
NBA_SEASON_TYPES = {
'regular': 'Regular Season',
'playoffs': 'Playoffs',
'preseason': 'Pre Season',
}
To explore, we will start with the 2016-17 season, the most recent completed season. We will end up scraping all completed seasons since 1996-97.
season = '2016-17'
season_type = NBA_SEASON_TYPES['regular']
nba_params = {
'LeagueID': NBA_ID,
'Season': season,
'SeasonType': season_type,
}
r = requests.get(NBA_URL, params=nba_params, headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)
r.status_code
Figuring out the JSON
The GET request worked. Let’s look at what type of data was returned.
r.headers['content-type']
The data are in JavaScript Object Notation, or JSON. Remember, if you want to look at the raw data in JSON format, you can open a browser tab pointed to the web API endpoint by clicking this link, or you can explore with your browser's developer tools.
Requests makes it easy to process JSON.
json = r.json()
type(json)
json.keys()
You can see which parameters are available, and how they were filled in by the GET request.
json['parameters']
Now let’s get the results.
type(json['resultSets'])
len(json['resultSets'])
results = json['resultSets'][0]
type(results)
results.keys()
headers = results['headers']
type(headers)
len(headers)
There are 56 columns of data per row.
rows = results['rowSet']
type(rows)
len(rows)
There are 30 NBA teams, and each team played 82 games in the 2016-17 season. Since 82 times 30 is 2460, there had better be 2460 rows in the results.
type(rows[0])
Each row is itself a list. Let’s see how many columns are in the row.
len(rows[0])
As expected, there are 56 columns per row, same as we saw for the column headers. Let’s look at the first row in the results.
print(rows[0])
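One quick way to make sense of a raw row, before we build anything fancier, is to pair each value with its column header:
dict(zip(headers, rows[0]))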
This is the Houston Rockets playing against the Minnesota Timberwolves on April 12, 2017. The Rockets won. To make sense of all the numbers, let's put things into a more useful framework.
Scraping and Saving the Raw Data
First, let's put what we've learned about scraping the JSON into a function. We are going to store our data in a pandas DataFrame to make analysis easier.
def scrape_teamgamelogs(season, season_type, sleep_for=None):
"""Process JSON from stats.nba.com teamgamelogs endpoint and return unformatted DataFrame."""
if sleep_for:
sleep(sleep_for) # be nice to server by sleeping if we are scraping inside a loop
nba_params = {
'LeagueID': NBA_ID,
'Season': season,
'SeasonType': season_type,
}
r = requests.get(
NBA_URL,
params=nba_params,
headers=REQUEST_HEADERS,
allow_redirects=False,
timeout=15,
)
r.raise_for_status()
results = r.json()['resultSets'][0]
headers = results['headers']
rows = results['rowSet']
return pd.DataFrame(rows, columns=headers)
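As a quick sanity check (this call hits the live endpoint, so it requires the site to be up), we can try the function directly and confirm we get the same 2460 rows and 56 columns as before:
df = scrape_teamgamelogs(season, season_type)
df.shape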
It's important when scraping data from the web to save it to your computer, especially if it's data that won't change in the future. There's no point scraping data again and again if it won't change. In this case, we are looking at NBA games from prior seasons, so once we scrape these games we are done.
Let's call our scraping function from inside another function. This outer function will first check if a file already exists on our computer that has the data. If the file exists, we don't need to scrape the data; we can just load it from the file into a DataFrame.
On the other hand, if we scrape the data, our function needs to save them to a file so we don’t need to scrape the web site again. Also, our function has a parameter to overwrite the files on disk with freshly scraped data, in case we need or want to do that.
It’s important when scraping data (or collecting data of any kind) to save the raw results first. We are going to be doing a lot of processing to get the data into a useful format. We are going to end up saving those results too. However, if you have an error in your code and don’t save the raw data first, you might not be able to figure out later where you went wrong. That’s why it’s good practice to save the raw data, so you can always start over and reconstruct your results.
def raw_teamgamelogs(season, season_type, data_dir=None, overwrite=False, sleep_for=None):
"""Scrape stats.nba.com teamgamelogs or read from a CSV file if it exists."""
if data_dir:
csv_filename = 'stats_nba_com-teamgamelogs-{season}-{season_type}.csv'.format(
season=season.replace('-', '_'),
season_type=season_type.replace(' ', '_').lower(),
)
csvfile = data_dir.joinpath(csv_filename)
else:
csvfile = None
if csvfile and not overwrite and csvfile.exists():
df = pd.read_csv(csvfile)
else:
df = scrape_teamgamelogs(season, season_type, sleep_for)
if csvfile:
df.to_csv(csvfile, index=False)
return df
raw = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)
raw.shape
Remember that this function has also saved the raw data for us behind the scenes. If we call it again for this season and season type, it will read it from the file, rather than scrape it from the web site.
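You can verify this yourself: a second call returns almost immediately, because it reads the CSV file instead of making a network request.
raw_again = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)  # reads from the CSV this time
raw_again.shape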
raw.head()
Cleaning up the Columns
Now we have the 2016-17 regular season games inside a pandas DataFrame. Let's see what the columns are.
raw.columns
We don't need a number of these columns. All of the '_RANK' columns are designed to help the browser sort the table by different columns. The data in the '_RANK' columns are just the sort orders for each row, if the user wants to sort the results by that particular column. (For example, the first row ranks 244th by field goals made, per 'FGM_RANK'.)
We are going to throw away all these '_RANK' columns, since we can sort in pandas by any column we wish anyway. We are also going to throw away the '_PCT' columns, since we can compute any percentages we wish if we need them. For example, 'FG_PCT' = 'FGM' / 'FGA' for field goals, and similarly for three-point field goals ('FG3') and free throws ('FT').
We can throw away a few other columns too. The first is 'PLUS_MINUS', which is just margin of victory (i.e., winner points minus loser points). We can compute this later from the actual points. Next is 'REB' = 'OREB' + 'DREB'. The last two are 'BLKA' (blocks allowed) and 'PFD' (personal fouls drawn). These are stats that apply to the opponent. As we will soon see, there are better ways to get this information.
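Before dropping them, we can spot-check that these columns really are recoverable (the tolerance below assumes the site rounds percentages to three decimals):
np.allclose(raw['FGM'] / raw['FGA'], raw['FG_PCT'], atol=0.001)  # field goal percentage from makes and attempts
(raw['REB'] == raw['OREB'] + raw['DREB']).all()  # total rebounds from offensive and defensive rebounds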
Let’s write a function to exclude the columns we don’t need.
def drop_raw_columns(df):
    """Drop the '_RANK', '_PCT' and other redundant columns from the raw DataFrame."""
cols = [col for col in df.columns if '_RANK' not in col]
cols = [col for col in cols if '_PCT' not in col]
cols = [col for col in cols if col not in ['PLUS_MINUS', 'REB', 'BLKA', 'PFD']]
return df[cols]
Next, let’s write a function to rename the columns to tidy things up.
def rename_raw_columns(df):
    """Rename raw columns to friendlier names and lowercase everything."""
df = df.rename(columns={
'SEASON_YEAR': 'season',
'TEAM_ABBREVIATION': 'team',
'GAME_DATE': 'date',
})
df.columns = df.columns.str.lower()
return df
If we apply these functions to the data, here’s what we’d get so far. Let’s look at the column names and the data types.
rename_raw_columns(drop_raw_columns(raw)).columns
rename_raw_columns(drop_raw_columns(raw)).dtypes
For some reason, game minutes ('MIN') and turnovers ('TOV') are floating-point, not integers. Let's keep things simple and round game minutes to the nearest minute. Also, let's make sure the game date is a datetime type.
def type_raw_columns(df):
    """Convert the date column to datetime and make minutes and turnovers integers."""
df['date'] = pd.to_datetime(df['date'])
df['min'] = df['min'].round().astype(int) # round minutes
df['tov'] = df['tov'].astype(int)
return df
Lastly, let’s reorder the columns to make the table a little easier to read.
def reorder_raw_columns(df):
    """Reorder columns so the most useful ones come first and identifiers come last."""
first_cols = [
'season',
'date',
'team',
'matchup',
'wl',
'pts',
'min',
]
last_cols = [
'game_id',
'team_id',
'team_name',
]
cols = (
first_cols +
[col for col in df.columns if col not in (first_cols+last_cols)] +
last_cols
)
return df[cols]
We can put all these steps in one function to clean up the raw data.
def formatted_columns(df):
"""Formatted stats.nba.com teamgamelogs DataFrame."""
    # Order matters: drop first (uses raw column names), then rename; the rest can go in any order
df = drop_raw_columns(df)
df = rename_raw_columns(df)
df = type_raw_columns(df)
df = reorder_raw_columns(df)
return df
df = formatted_columns(raw)
df.head()
Figuring out the Matchups
Now we’re getting somewhere. What we need to do now is figure out how to represent the matchup information in a more useful way. It’s pretty clear that the ‘@’ sign means an away game, while ‘vs.’ is a home game. We can use pandas
to split apart the matchup information into a more useful representation.
Let’s try out a few things.
df['matchup'].str.contains('@').head()
df['matchup'].str.split(' ').head()
df['matchup'].str.split(' ').str.get(-1).head()
We can put this together into a new function which will remove the ‘matchup’ column and replace it with two new, more useful columns. The first is whether the game is home (‘H’) or away (‘A’). The second is the opponent abbreviation.
def parse_matchup(df):
"""Add more useful columns based upon matchup information."""
df['ha'] = np.where(df['matchup'].str.contains('@'), 'A', 'H')
df['opp'] = df['matchup'].str.split(' ').str.get(-1)
# Put new columns where matchup used to be, and drop matchup
cols = []
for col in df.columns:
if col not in ['matchup', 'ha', 'opp']:
cols.append(col)
elif col == 'matchup':
cols.append('ha')
cols.append('opp')
return df[cols]
parse_matchup(df).head()
More Data Cleanup
Now things are looking good. Let's put all this formatting into a new function, which will return the formatted DataFrame. We will also use the 'game_id' as the index for the DataFrame. Notice that the 'game_id' is an integer. This is an internal identifier that the web site database is using to uniquely identify each NBA game. We should keep track of this identifier, since we might want to go back to get more data on this game in the future. For instance, we might want to get shot charts or possession-by-possession information. We would need that 'game_id' in order to scrape the more detailed game data.
def formatted_teamgamelogs(df):
"""Formatted stats.nba.com teamgamelogs DataFrame from raw DataFrame."""
df = formatted_columns(df)
df = parse_matchup(df)
return df.set_index('game_id')
games = formatted_teamgamelogs(raw)
games.head()
Now let's see what happens if we select all rows in the DataFrame with that first 'game_id'.
games.loc[21601224]
Interesting. Each game has two rows in the DataFrame, one for each team. One of the teams is home, and the other is away. One of the teams won, and the other lost. Each team's box score is recorded on its own row.
This explains why we could drop the columns ‘BLKA’ and ‘PFD’ earlier. For example, the ‘BLKA’ from the Rockets’ row is the same as the ‘BLK’ column from the Timberwolves’ row. The ‘BLKA’ and ‘PFD’ information is redundant.
Let's look at how many unique 'game_id' values there are. You can think of these as unique matchups.
len(games.index.unique())
There are half as many matchups as there are team games played. That’s because there are two teams per matchup.
home_games = games[games['ha'] == 'H']
home_games.shape
away_games = games[games['ha'] == 'A']
away_games.shape
Creating a New Table of Matchups
What we really want for analysis is a table of matchups. We want to predict matchups, so that is the right unit of data. We can build a new table of matchups, where each matchup is identified by the 'game_id'. We can keep track of which team was the home team and which was the away team, and put both teams' box scores on the same row, as long as we keep track of which stats belong to the home team and which to the away team.
home_games.join(away_games, lsuffix='_h', rsuffix='_a').head()
This new DataFrame is the right next step, but we need to clean it up a bit. First, we don't need two win/loss columns. We only need one, telling us whether the home team won or the away team won.
def home_away_win(row):
    """Return 'H' if the home team won the matchup, 'A' if the away team won."""
home = row['wl_h']
away = row['wl_a']
assert home in ['W', 'L']
assert away in ['W', 'L']
assert home != away
if home == 'W':
return 'H'
else:
return 'A'
We can drop a bunch of redundant columns and rename the ones that apply to the matchup itself, rather than particular teams.
def drop_matchup_columns(df):
return df.drop([
'wl_h',
'wl_a',
'season_a',
'date_a',
'ha_h',
'opp_h',
'ha_a',
'opp_a',
], axis='columns')
def rename_matchup_columns(df):
return df.rename(columns={
'season_h': 'season',
'date_h': 'date',
})
def reorder_matchup_columns(df):
"""Column order for game statistics DataFrame, alternating home and away stats."""
first_cols = ['season', 'date', 'won',]
raw_cols = [col.replace('_h', '') for col in df.columns if col.endswith('_h')]
cols = list(chain.from_iterable((col+'_h', col+'_a') for col in raw_cols))
return df[first_cols + cols]
Putting everything together in one place, here is a function which will create the matchups and reformat the resulting DataFrame. It will also add a column storing which team (home or away) won the matchup.
def matchups_from_teamgamelogs(df):
"""DataFrame with one unique game_id per row and team stats identified by home and away teams."""
home_games = df[df['ha'] == 'H']
away_games = df[df['ha'] == 'A']
matchups = home_games.join(away_games, lsuffix='_h', rsuffix='_a')
    # Add new 'won' column: 'H' if home team wins or 'A' if away team wins
matchups['won'] = matchups.apply(home_away_win, axis='columns')
matchups = drop_matchup_columns(matchups)
matchups = rename_matchup_columns(matchups)
matchups = reorder_matchup_columns(matchups)
return matchups
matchups_from_teamgamelogs(games).head()
One last step is to set some categorical data types and reorder the columns. Categories make a pandas DataFrame faster and more memory-efficient for columns that only take on a few possible values. For instance, the 'won' column can only be 'H' or 'A'. The function below sets certain columns to be categorical data. It also makes sure the box score stats are integers.
def format_matchups(df):
    """Set categorical and integer dtypes and put the identifying columns first."""
df['season'] = df['season'].astype('category')
df['season_type'] = df['season_type'].astype('category')
df['won'] = df['won'].astype('category')
df['team_h'] = df['team_h'].astype('category')
df['team_a'] = df['team_a'].astype('category')
# Set all 'object' columns to int (except for team names)
for col in df.columns:
if 'team' not in col and (col.endswith('_h') or col.endswith('_a')):
if df[col].dtype == 'object':
df[col] = df[col].astype(int)
first_cols = ['season', 'season_type', 'date', 'won']
cols = first_cols + [col for col in df.columns if col not in first_cols]
return df[cols]
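If you're curious how much categoricals actually save, here's a small standalone illustration (exact numbers will vary by platform):
s = pd.Series(['H', 'A'] * 10000)
s.memory_usage(deep=True)  # object dtype stores a full Python string per entry
s.astype('category').memory_usage(deep=True)  # category stores small integer codes plus two labels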
Processing an Entire NBA Season
Now we are almost done. We want a function which can scrape (or read from a file) the raw data for a given season, for all possible season types (regular, playoffs or pre-season). This function should then process the raw data and create the matchups for that season. The rows should be tagged with their season type so we can tell the regular season games from the other types later.
def nba_season_matchups(season=None, data_dir=None, overwrite=False):
"""All NBA matchups for a given season (regular, playoffs and pre-season)."""
matchups = None
for season_type in NBA_SEASON_TYPES:
df = raw_teamgamelogs(
season,
NBA_SEASON_TYPES[season_type],
data_dir,
overwrite,
sleep_for=2,
)
if len(df) == 0:
# no rows came back; this is probably a pre-season with no data, so just continue
continue
df = formatted_teamgamelogs(df)
df = matchups_from_teamgamelogs(df)
df['season_type'] = season_type
if matchups is None:
matchups = df.copy()
else:
matchups = matchups.append(df)
return format_matchups(matchups)
Now we can run our master function on the entire 2016-17 season.
matchups = nba_season_matchups(season, data_dir=DATA_DIR)
matchups['season_type'].value_counts()
As we already knew, there are 1230 regular season matchups. We also see we got the pre-season and playoff matchups as well.
matchups.head()
Getting All Completed NBA Seasons
The next step is to get all the completed seasons since 1996-97 (the earliest season for which the web site has advanced stats). To do this, all we need to do is call our nba_season_matchups() function above for each completed season. Let's write a few small functions to help us figure out the latest completed NBA season.
def nba_season_from_year(y):
"""Convert integer year into stats.nba.com season string."""
return str(y) + '-' + str(int(y)+1)[2:]
def nba_year_from_season(s):
"""Convert stats.nba.com season string into integer year"""
return int(s[:4])
def current_nba_season_year():
"""Year in which current NBA season began."""
now = datetime.now()
if now.month > 8:
return now.year
else:
return now.year - 1
def current_nba_season():
"""Current NBA season."""
return nba_season_from_year(current_nba_season_year())
def completed_nba_season_year():
"""Year in which most recent completed NBA season began."""
return current_nba_season_year() - 1
def completed_nba_season():
"""Most recent completed NBA season."""
return nba_season_from_year(completed_nba_season_year())
completed_nba_season()
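A couple of quick checks on these helpers (run at the time of writing, during the 2017-18 season):
nba_season_from_year(1996)  # '1996-97'
nba_year_from_season('2016-17')  # 2016
current_nba_season()  # '2017-18'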
Now it's easy to just call our season-by-season scraper and collect all the results in one large DataFrame. Notice that we are using the tqdm package here. This package creates a "progress bar" to show you what percentage of an iteration has been accomplished, along with some timing information. This is helpful when scraping, since scraping can be slow.
def completed_nba_season_matchups(data_dir=None, overwrite=False):
"""Get NBA matchups for completed seasons."""
end_year = completed_nba_season_year()
seasons = [nba_season_from_year(y) for y in range(1996, end_year+1)]
matchups = None
for season in tqdm(seasons):
df = nba_season_matchups(season, data_dir, overwrite)
if matchups is None:
matchups = df.copy()
else:
matchups = matchups.append(df)
return format_matchups(matchups)
matchups = completed_nba_season_matchups(data_dir=DATA_DIR)
matchups.shape
Sanity-Checking the Data
We have almost 27,000 matchups in our DataFrame. Let's break things out by season and type of game to make sure they make sense.
matchups.pivot_table(
index='season',
columns='season_type',
values='team_h',
aggfunc='count',
fill_value=0,
margins=True,
)
We only have pre-season games since the 2014-15 season.
Looking at the season totals, you can see the effect of the lockouts in 1998-99 and 2011-12. You can also see the impact of the league expansion from 29 to 30 teams in 2004 (with the addition of Charlotte). See the history of NBA seasons here.
You might think we are done, but there is one complication that we need to work through. This shows the importance of looking carefully at your data. It’s hard to go through almost 27,000 rows by eye, so you need to think carefully and make sure things make sense.
Checking the Teams
One thing we can and should check is: what teams are represented in our data?
def unique_home_away_teams(df, ha):
"""Unique home or away teams (as specified) from matchups DataFrame."""
ha = ha.lower()
if ha not in ['h', 'a']:
raise ValueError('invalid home or away symbol')
team_abbr = 'team_' + ha
team_id = 'team_id_' + ha
team_name = 'team_name_' + ha
return (
df[[team_abbr, team_id, team_name]]
.reset_index()
.drop(columns=['game_id'])
.drop_duplicates()
.sort_values(team_id)
.rename(
columns={
team_abbr: 'team',
team_id: 'team_id',
team_name: 'team_name',
}
)
.set_index('team_id')
)
def unique_teams(df):
    """Unique teams in matchups DataFrame."""
    teams = unique_home_away_teams(df, 'h')
    teams = teams.append(unique_home_away_teams(df, 'a'))  # append returns a new DataFrame
    return teams.drop_duplicates().sort_index()
unique_teams(matchups)
Whoa, right away we see that we have some non-NBA teams represented, from pre-season exhibition-type games. We will exclude those in a moment, since we only want to have NBA matchups in our analysis.
That’s the easy part. The harder part is that team names and abbreviations have changed over time. Let’s see what’s going on in more detail.
First, there are some simple cases that aren’t too difficult to understand. The Clippers changed their name in 2015 from the Los Angeles Clippers to the LA Clippers. The abbreviation and ‘team_id’ didn’t change, however. No big deal. There’s a similar situation with Washington.
For a team move like the Thunder, you have to keep track of the abbreviation change. Although the 'team_id' stays the same, the abbreviation changes from 'SEA' to 'OKC', reflecting the team's move from Seattle to Oklahoma City in 2008. There's a similar situation with the Grizzlies and the Nets.
The most complicated situation is with Charlotte and New Orleans.
Hornets and Pelicans
As described in detail here, the Charlotte Hornets moved to New Orleans in 2002. The Charlotte Bobcats joined the NBA as an expansion team in 2004. Prior to the 2014-15 season, the Charlotte Bobcats and the New Orleans Hornets agreed to a deal whereby New Orleans would become the Pelicans and the Hornets name and history would revert to Charlotte. This explains why the abbreviation 'CHA' is used for both the Charlotte Bobcats and the Charlotte Hornets, in addition to the 'CHH' abbreviation for the other Charlotte Hornets (the team that moved in 2002 and ultimately became the Pelicans).
Another complication is that because of Hurricane Katrina, the New Orleans Hornets had to relocate for much of the 2005-06 and 2006-07 seasons to Oklahoma City. That's why there is an abbreviation 'NOK' to represent the games that were played under the label New Orleans/Oklahoma City Hornets.
Let’s sort this out and make sure we understand this in detail.
First, let's build a table with the current 30 NBA teams, so we can keep track of historical abbreviations and how they map to current teams. Remember that the 'team_id' is meant to be a unique identifier for each team, even if the abbreviation or team name changes.
def current_teams(df):
"""Team abbreviations for the most recent completed NBA season."""
current_teams = (
df.loc[
(df['season'] == completed_nba_season()) & (df['season_type'] == 'regular'),
['team_h', 'team_id_h']
]
.drop_duplicates()
.sort_values(['team_id_h'])
.rename(
columns={
'team_h': 'curr_team',
'team_id_h': 'team_id',
}
)
.set_index('team_id')
)
return current_teams
len(current_teams(matchups))
teams = unique_teams(matchups).merge(current_teams(matchups), left_index=True, right_index=True)
Let's build a table of any teams that have had a name change or an abbreviation change.
team_changes = teams[['team', 'curr_team']].groupby('curr_team').filter(lambda team: len(team) > 1)
team_changes
Now, let’s look at how many games, per season, were played by various Charlotte and New Orleans franchises.
regular_season_summary = (
matchups.loc[
matchups['season_type'] == 'regular',
['season', 'team_h', 'team_a']
].pivot_table(
index='season',
columns='team_h',
values='team_a',
aggfunc='count',
fill_value=0,
margins=True,
)
)
regular_season_summary.loc[:, ['CHA', 'CHH', 'NOH', 'NOK', 'NOP']]
This table makes it clear that ‘CHH’ is the original Charlotte Hornets, who became the New Orleans Hornets (and the New Orleans/Oklahoma City Hornets), before becoming the New Orleans Pelicans (‘NOP’). The abbreviation ‘CHA’ refers to the Bobcats, who then became the new Charlotte Hornets in the 2014-15 season.
“Fixing” the Team Data
What might surprise you is that the NBA team data are wrong!
If you look at the team table above, you’ll see that ‘CHH’ has the same ‘team_id’ as ‘CHA’. Now, you may think this is correct, since the Bobcats and the New Orleans Hornets did a deal returning the history of the original Charlotte Hornets (‘CHH’) to Charlotte. That’s why ‘CHH’ and ‘CHA’ have the same ‘team_id’.
On the other hand, if your goal is to track the performance of a group of players as their franchise changed cities, this is incorrect. The guys who played on the 'CHH' squad in the 2001-02 season should be connected with the 'NOH' squad in the 2002-03 season.
The NBA team data as recorded don’t reflect that.
To keep track of this, we will do something that you should never do lightly. We are going to override our team data table to make sure the ‘CHH’ games are grouped with the current Pelicans franchise. We are effectively going to undo the Charlotte/New Orleans deal to transfer the Hornets history.
teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'
teams.loc[teams['team'].isin(['CHA', 'CHH', 'NOH', 'NOK', 'NOP']), :]
Of course, we’re not overriding the raw data. But it’s important to keep track anytime you change data.
Let’s write a function to record what we’re doing.
def team_info(df):
"""DataFrame of NBA team IDs, unadjusted abbreviations, names and adjusted abbrevations."""
teams = unique_teams(df).merge(current_teams(df), left_index=True, right_index=True)
    teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'  # OVERRIDING Hornets/Pelicans
return teams
Lastly, we will write a function to exclude non-NBA matchups. Again, these were pre-season exhibition-style games. To exclude those games all we need to do is filter out any teams whose ‘team_id’ isn’t one of the current 30 NBA teams.
def exclude_non_nba_matchups(df):
"""Filter out matchups involving non-NBA teams."""
nba_teams = current_teams(df)
non_nba_matchups = list(
df.loc[~df['team_id_h'].isin(nba_teams.index)].index
)
non_nba_matchups.extend(list(
df.loc[~df['team_id_a'].isin(nba_teams.index)].index)
)
return df.loc[~df.index.isin(non_nba_matchups), :]
Let’s test this to see how many games get filtered out.
matchups.shape
exclude_non_nba_matchups(matchups).shape
The function filtered out 25 games.
Let’s put all these team cleanup steps together. First, we will write a function to give us a mapping between the historical team abbreviation, and the “correct” current team abbreviation. In this case, “correct” means adjusted for the Hornets/Pelicans issue.
def team_mapping(df):
"""Dict mapping historical NBA team abbreviation to abbreviation of current franchise."""
teams = team_info(df) # this creates the lookup table with Hornets/Pelicans issue corrected
list_of = teams.to_dict(orient='list')
return dict(zip(list_of['team'], list_of['curr_team']))
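A quick spot-check of the mapping, using the franchise moves discussed above:
abbr_to_current = team_mapping(matchups)
abbr_to_current['SEA']  # 'OKC': Seattle games map to the current Oklahoma City franchise
abbr_to_current['CHH']  # 'NOP': the original Charlotte Hornets map to the Pelicans, per our override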
def update_matchups(df):
    """Prepare matchups for use in analytics."""
    df = exclude_non_nba_matchups(df)  # filter out any matchups that include non-NBA teams
abbr_to_current = team_mapping(df)
home_map = df['team_h'].map(abbr_to_current).rename('team_curr_h')
away_map = df['team_a'].map(abbr_to_current).rename('team_curr_a')
# Add new columns for current team abbreviations
df = pd.concat([df.copy(), home_map, away_map], axis='columns').reset_index()
df['team_curr_h'] = df['team_curr_h'].astype('category')
df['team_curr_a'] = df['team_curr_a'].astype('category')
return df.set_index('game_id')
Notice that the above function doesn’t actually replace the team abbreviations in our matchups table. Rather, it creates two new columns for the current teams (home and away). That way, we can see what the original information was and make sure it maps correctly in our analysis. This is safer than just overwriting the old team abbreviations.
We will use the current team abbreviations in building our matchup prediction models. This will allow us to track changes in team quality over time without getting confused by name changes or franchise moves.
One last step is to write a function that will save our final, processed DataFrame for future analysis. This DataFrame is relatively large, and we've put some effort into formatting it. So, instead of using CSV, we'll use another built-in format that pandas can write for us: Python's pickle format. The pickle format is an efficient binary format that will store all the information in our DataFrame, including the data types and categories we set up.
As noted in the pickle documentation, you should be very careful about unpickling an object that you obtained from the Internet or via email. Only unpickle objects that you have previously pickled yourself or you obtained from somebody you really, really trust. By the way, it turns out that hackers have even figured out ways to inject malware into CSV files, so the general and best advice is: always be very careful opening any type of file that you've downloaded or received by email!
def save_matchups(df, output_dir):
"""Save pickle file of NBA matchups prepared for analytics."""
seasons = list(df['season'].unique())
start_season = min(seasons).replace('-', '_')
end_season = max(seasons).replace('-', '_')
PKL_TEMPLATE = 'stats_nba_com-matchups-{start_season}-{end_season}'
pklfilename = (
PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
'.pkl'
)
pklfile = output_dir.joinpath(pklfilename)
if pklfile.exists():
timestamp = str(datetime.now().strftime("%Y-%m-%d-%H-%M"))
backupfilename = (
PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
'.bak.{timestamp}'.format(timestamp=timestamp) +
'.pkl'
)
backupfile = output_dir.joinpath(backupfilename)
pklfile.rename(backupfile)
df.to_pickle(pklfile)
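Loading the prepared data back later is a one-liner (the filename below assumes the 1996-97 through 2016-17 span we scraped; adjust it if your span differs):
df = pd.read_pickle(OUTPUT_DIR / 'stats_nba_com-matchups-1996_97-2016_17.pkl')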
Finishing Up
Now, we can finally put everything together into one master function which reads all the matchups since 1996 and saves them to a pickle file, ready for later analysis.
def nba_stats_matchups(data_dir=None, output_dir=None, overwrite=False):
"""Get and prepare stats.nba.com matchups for all completed seasons since 1996."""
df = completed_nba_season_matchups(data_dir=data_dir, overwrite=overwrite)
df = update_matchups(df)
if output_dir:
save_matchups(df, output_dir)
return df
matchups = nba_stats_matchups(data_dir=DATA_DIR, output_dir=OUTPUT_DIR, overwrite=False)
To conclude the data gathering, here’s a summary of all the information we’ve obtained.
matchups.info()
We’ve accomplished a lot. Now that we have data, we’re ready to move on to the matchup predictions in future posts. To conclude, let’s look at one important fact we can immediately learn from the data.
A Quick Look at Home Court Advantage in the NBA
Let’s calculate what home court has been worth in terms of regular season win percentage between the 1996-97 and 2016-17 seasons.
matchups.loc[matchups['season_type'] == 'regular', 'won'].value_counts()
def count_home_away_wins(df, ha, season_type='regular'):
return df.loc[df['season_type'] == season_type, 'won'].value_counts()[ha]
home_wins = count_home_away_wins(matchups, 'H')
away_wins = count_home_away_wins(matchups, 'A')
home_wins/(home_wins+away_wins)
away_wins/(home_wins+away_wins)
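To see how stable this edge is from year to year, here's a quick per-season breakdown (a sketch using the matchups DataFrame from above):
regular = matchups[matchups['season_type'] == 'regular']
(regular['won'] == 'H').groupby(regular['season']).mean()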
The home team has won about 60% of the time on average during the NBA regular season. Any prediction model is going to have to address this fact. We’ll look a lot more closely at this topic in future posts.