We are going to use machine learning and statistics to predict NBA matchups. To do this, we are going to need data on NBA games, and lots of it. So let’s get all the team matchups and box scores from stats.nba.com, and make them ready for use.
This post has two purposes. The first is to show you how to do the actual web scraping. The second is to show you how to examine data before you use it. Data are almost always a bit messy and need to be handled with care, so it's important to take some time to look at them and make sure they're clean before use.
This post has a lot of code. If you don't want to spend time going through it, you can simply run it: the code will scrape and store the data for you, and you won't need to run it again after you have the data.
A Jupyter Notebook with the code and portions of the text from this post can be found here.
I think the most useful information in this post is learning how to assess the data quality, and make smart decisions about how to address potential problems. If you decide to just skim the code and focus on the decisions, hopefully this example will still be useful for you. You may also find it interesting to see how real-world NBA facts (such as expansion teams, team moves and name changes) actually impact the data!
The Data
We are going to scrape Team Box Scores for completed NBA seasons from stats.nba.com. We are going to start with traditional box scores, which give us plenty of statistics to consider, as well as all the matchup information we need. If you explore the site, you'll see there are also other more advanced data available. We'll consider these other data in future posts.
The advanced data are only available going back to the 1996-97 season, so we’ll limit our traditional box scores to the same time period. The site actually has traditional box scores going back to the 1946-47 season. Such old data are interesting, but aren’t useful for predicting current matchups. We have to consider how much the game has changed over the years. Although there have been a number of NBA rule changes since 1996 and game style has evolved, I think data over this time period are still useful today.
The data from 1996-97 to 2016-17 comprise 21 complete seasons, including pre-season data from recent years. This is plenty of data for our purposes. Notice that we’re not going to download data from the current 2017-18 season yet. We are going to use the historical data for building prediction models, since we know how the seasons turned out. After we build the model, we can apply it to the current season and see how it does over the remainder of the season.
Why Write a Web Scraper
As I mentioned, there’s a lot of code in this post. You may wonder why it’s worth it to build a web scraper.
First, in many cases it's the easiest way to get the data you need. Even if it takes a few hours to build a scraper, it's a lot faster than downloading the data manually. Consider the excellent site, Basketball Reference. Basketball Reference offers you the ability to download games into a CSV file. Here is a link to the 2016-17 season games.
Notice that this site shows games grouped by month. You can download the games by looking at the “Share & More” drop-down menu next to the month schedule header. Each month requires a separate download. So, if you want to download all the games for the 21 complete seasons since 1996-97, you’d need to manually download around 180 CSV files (there are 9 months in the typical NBA season). Then, to create a master spreadsheet of matchup results, you’d need to open them up in Microsoft Excel or Google Sheets and put all the data together in one place.
Also, if you click on one of the box score links (for example, the October 25, 2016 matchup between the Knicks and the Cavaliers), you’ll see very detailed box score information, but no easy way to extract and download all the box scores at once. Scraping the original data is much faster and more practical. As you’ll see, it’s not actually that hard.
Lastly, writing a scraper is essential if you want to update the data relatively frequently. If you create a matchup prediction model, you'll want to run it after every game, to update the probabilities for future matchups. Imagine having to type information into a spreadsheet or download a few new spreadsheets every night during the NBA season.
The web scraper we'll develop here will also be able to scrape current season games. So once you have it working, you can run it each day during the NBA season to get the latest matchup results and team box scores automatically.
Scraping a JavaScript-Rendered Web Site
stats.nba.com is a JavaScript-rendered site. As discussed in our previous post on web-scraping, these sites work differently from HTML-rendered sites.
We can still use the Requests package to communicate with the web site, but we can’t use BeautifulSoup like we did previously. BeautifulSoup is great for scraping HTML-rendered sites, but it doesn’t work with JavaScript-rendered sites.
Scraping the box scores isn't actually very hard, but it has one very tricky aspect. The URL to which you send the GET request isn't the same URL that you see in your browser. Remember that with JavaScript-rendered pages, the web site returns JavaScript code to your browser, and the browser then runs that JavaScript to fetch the data and render it in the window. The URL you want to scrape is really that second URL, which returns the data.
Let’s look at the specific links for the box scores. The URL for the link to the page we can view is:
https://stats.nba.com/teams/boxscores-traditional/?Season=2016-17&SeasonType=Regular%20Season
It's clear from the URL that this page has traditional box scores for the 2016-17 regular season, and that we can access different seasons by changing the parameters at the end of the URL.
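For example (you can verify this in your browser), changing the parameters should give the 2015-16 playoffs:
https://stats.nba.com/teams/boxscores-traditional/?Season=2015-16&SeasonType=Playoffs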
Now here’s the trick. The URL which returns the actual data is:
https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00
&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0
&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season
&ShotClockRange=&VsConference=&VsDivision=
That's all actually one line, which I've broken up for clarity. You can open a browser tab with the raw data by clicking this link. Notice that this URL has a lot of parameters.
So, how did I figure out that this is the correct URL?
Use the Browser Developer Tools
As mentioned in the prior post on web scraping, your browser’s developer tools make it easier to see what is going on. In that post, we used the Elements section of the developer tools to look at the different HTML tags. To scrape this JavaScript-rendered page, we will instead use the Network section. Here are instructions for how to use these tools in Chrome. Here are similar instructions for Firefox. I think these browsers are the easiest to use to understand JavaScript-rendered sites.
Under the Network section, check out the XHR tab. XHR is short for XMLHttpRequest, but the details don't matter. The point is that if you load (or reload) the page (with the "normal" URL above), you will see a few lines show up under the XHR tab, one of which starts with teamgamelogs. That's the item we want.
If you select that teamgamelogs item, in either Chrome or Firefox, you can explore what's in that object. I'll move on to the scraping now, but hopefully this brief introduction will serve to explain how you could figure this out for yourself on a different web page. As we go through the steps below, see if you can follow along and find the corresponding information in your browser's developer tools on that XHR tab.
If you explore other pages on stats.nba.com with the developer tools open to the XHR tab, you’ll be able to apply these techniques to figure out how to scrape those other pages, too.
Scraping the Site
Let’s do all the necessary imports.
from itertools import chain
from pathlib import Path
from time import sleep
from datetime import datetime
We are going to use Requests to fetch the data from the JavaScript-rendered site.
import requests
from tqdm import tqdm
tqdm.monitor_interval = 0  # work around a tqdm monitor thread bug in some versions
We will use pandas and numpy to explore and organize the data toward the end.
import numpy as np
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
We will store the data (both in raw form and processed for later analysis) on disk. The code below will create directories on your computer if they don't exist already. Feel free to change the directories below if you're running this code on your computer.
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data' / 'scraped' / 'teamgamelogs'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
USER_AGENT = (
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +
'AppleWebKit/537.36 (KHTML, like Gecko) ' +
'Chrome/61.0.3163.100 Safari/537.36'
)
REQUEST_HEADERS = {
'user-agent': USER_AGENT,
}
Below is the base URL for the team box scores. This type of URL is also called a web API endpoint. Recall that an API is an application programming interface, which makes it easier to communicate with a web site and obtain information.
The NBA data defines the NBA itself as league ’00’.
NBA_URL = 'https://stats.nba.com/stats/teamgamelogs'
NBA_ID = '00'
The site can give us information for the regular season, playoffs and pre-season.
NBA_SEASON_TYPES = {
'regular': 'Regular Season',
'playoffs': 'Playoffs',
'preseason': 'Pre Season',
}
To explore, we will start with the 2016-17 season, the most recent completed season. We will end up scraping all completed seasons since 1996-97.
season = '2016-17'
season_type = NBA_SEASON_TYPES['regular']
nba_params = {
'LeagueID': NBA_ID,
'Season': season,
'SeasonType': season_type,
}
r = requests.get(NBA_URL, params=nba_params, headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)
r.status_code
Figuring out the JSON
The GET request worked. Let’s look at what type of data was returned.
r.headers['content-type']
The data are in JavaScript Object Notation, or JSON. Remember, if you want to look at the raw data in JSON format, you can open a browser tab pointed to the web API endpoint by clicking this link, or you can explore with your browser's developer tools.
Requests makes it easy to process JSON.
json = r.json()
type(json)
json.keys()
You can see which parameters are available, and how they were filled in by the GET request.
json['parameters']
Now let’s get the results.
type(json['resultSets'])
len(json['resultSets'])
results = json['resultSets'][0]
type(results)
results.keys()
headers = results['headers']
type(headers)
len(headers)
There are 56 columns of data per row.
rows = results['rowSet']
type(rows)
len(rows)
There are 30 NBA teams, and each team played 82 games in the 2016-17 season. Since 82 times 30 is 2460, there had better be 2460 rows in the results.
type(rows[0])
Each row is itself a list. Let’s see how many columns are in the row.
len(rows[0])
As expected, there are 56 columns per row, same as we saw for the column headers. Let’s look at the first row in the results.
print(rows[0])
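One quick way to make sense of a raw row, before we build anything fancier, is to pair each value with its column header:
dict(zip(headers, rows[0]))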
This is the Houston Rockets playing against the Minnesota Timberwolves on April 12, 2017. The Rockets won. To make sense of all the numbers, let's put things into a more useful framework.
Scraping and Saving the Raw Data
First, let's put what we've learned about scraping the JSON into a function. We are going to store our data in a pandas DataFrame to make analysis easier.
def scrape_teamgamelogs(season, season_type, sleep_for=None):
"""Process JSON from stats.nba.com teamgamelogs endpoint and return unformatted DataFrame."""
if sleep_for:
sleep(sleep_for) # be nice to server by sleeping if we are scraping inside a loop
nba_params = {
'LeagueID': NBA_ID,
'Season': season,
'SeasonType': season_type,
}
r = requests.get(
NBA_URL,
params=nba_params,
headers=REQUEST_HEADERS,
allow_redirects=False,
timeout=15,
)
r.raise_for_status()
results = r.json()['resultSets'][0]
headers = results['headers']
rows = results['rowSet']
return pd.DataFrame(rows, columns=headers)
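As a quick sanity check (this call hits the live endpoint, so it requires the site to be up), we can try the function directly and confirm we get the same 2460 rows and 56 columns as before:
df = scrape_teamgamelogs(season, season_type)
df.shape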
It's important when scraping data from the web to save it to your computer, especially if it's data that won't change in the future. There's no point scraping data again and again if it won't change. In this case, we are looking at NBA games from prior seasons, so once we scrape these games we are done.
Let's call our scraping function from inside another function. This outer function will first check if a file already exists on our computer that has the data. If the file exists, we don't need to scrape the data; we can just load it from the file into a DataFrame.
On the other hand, if we scrape the data, our function needs to save them to a file so we don’t need to scrape the web site again. Also, our function has a parameter to overwrite the files on disk with freshly scraped data, in case we need or want to do that.
It’s important when scraping data (or collecting data of any kind) to save the raw results first. We are going to be doing a lot of processing to get the data into a useful format. We are going to end up saving those results too. However, if you have an error in your code and don’t save the raw data first, you might not be able to figure out later where you went wrong. That’s why it’s good practice to save the raw data, so you can always start over and reconstruct your results.
def raw_teamgamelogs(season, season_type, data_dir=None, overwrite=False, sleep_for=None):
"""Scrape stats.nba.com teamgamelogs or read from a CSV file if it exists."""
if data_dir:
csv_filename = 'stats_nba_com-teamgamelogs-{season}-{season_type}.csv'.format(
season=season.replace('-', '_'),
season_type=season_type.replace(' ', '_').lower(),
)
csvfile = data_dir.joinpath(csv_filename)
else:
csvfile = None
if csvfile and not overwrite and csvfile.exists():
df = pd.read_csv(csvfile)
else:
df = scrape_teamgamelogs(season, season_type, sleep_for)
if csvfile:
df.to_csv(csvfile, index=False)
return df
raw = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)
raw.shape
Remember that this function has also saved the raw data for us behind the scenes. If we call it again for this season and season type, it will read it from the file, rather than scrape it from the web site.
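You can verify this yourself: a second call returns almost immediately, because it reads the CSV file instead of making a network request.
raw_again = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)  # reads from the CSV this time
raw_again.shape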
raw.head()
Cleaning up the Columns
Now we have the 2016-17 regular season games inside a pandas DataFrame. Let's see what the columns are.
raw.columns
We don't need a number of these columns. All of the '_RANK' columns are designed to help the browser sort the table by different columns. The data in the '_RANK' columns are just the sort orders for each row, if the user wants to sort the results by that particular column. (For example, the first row ranks 244th by field goals made, per 'FGM_RANK'.)
We are going to throw away all these '_RANK' columns, since we can sort in pandas by any column we wish anyway. We are also going to throw away the '_PCT' columns, since we can compute any percentages we wish if we need them. For example, 'FG_PCT' = 'FGM' / 'FGA' for field goals, and similarly for three-point field goals ('FG3') and free throws ('FT').
We can throw away a few other columns too. The first is 'PLUS_MINUS', which is just margin of victory (i.e., winner points minus loser points). We can compute this later from the actual points. Next is 'REB' = 'OREB' + 'DREB'. The last two are 'BLKA' (blocks allowed) and 'PFD' (personal fouls drawn). These are stats that apply to the opponent. As we will soon see, there are better ways to get this information.
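Before dropping them, we can spot-check that these columns really are recoverable (the tolerance below assumes the site rounds percentages to three decimals):
np.allclose(raw['FGM'] / raw['FGA'], raw['FG_PCT'], atol=0.001)  # field goal percentage from makes and attempts
(raw['REB'] == raw['OREB'] + raw['DREB']).all()  # total rebounds from offensive and defensive rebounds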
Let’s write a function to exclude the columns we don’t need.
def drop_raw_columns(df):
    """Drop the '_RANK', '_PCT' and other redundant columns from the raw DataFrame."""
cols = [col for col in df.columns if '_RANK' not in col]
cols = [col for col in cols if '_PCT' not in col]
cols = [col for col in cols if col not in ['PLUS_MINUS', 'REB', 'BLKA', 'PFD']]
return df[cols]
Next, let’s write a function to rename the columns to tidy things up.
def rename_raw_columns(df):
    """Rename raw columns to friendlier names and lowercase everything."""
df = df.rename(columns={
'SEASON_YEAR': 'season',
'TEAM_ABBREVIATION': 'team',
'GAME_DATE': 'date',
})
df.columns = df.columns.str.lower()
return df
If we apply these functions to the data, here’s what we’d get so far. Let’s look at the column names and the data types.
rename_raw_columns(drop_raw_columns(raw)).columns
rename_raw_columns(drop_raw_columns(raw)).dtypes
For some reason, game minutes ('MIN') and turnovers ('TOV') are floating-point, not integers. Let's keep things simple and round game minutes to the nearest minute. Also, let's make sure the game date is a datetime type.
def type_raw_columns(df):
    """Convert the date column to datetime and make minutes and turnovers integers."""
df['date'] = pd.to_datetime(df['date'])
df['min'] = df['min'].round().astype(int) # round minutes
df['tov'] = df['tov'].astype(int)
return df
Lastly, let’s reorder the columns to make the table a little easier to read.
def reorder_raw_columns(df):
    """Reorder columns so the most useful ones come first and identifiers come last."""
first_cols = [
'season',
'date',
'team',
'matchup',
'wl',
'pts',
'min',
]
last_cols = [
'game_id',
'team_id',
'team_name',
]
cols = (
first_cols +
[col for col in df.columns if col not in (first_cols+last_cols)] +
last_cols
)
return df[cols]
We can put all these steps in one function to clean up the raw data.
def formatted_columns(df):
"""Formatted stats.nba.com teamgamelogs DataFrame."""
    # Order matters: drop first (uses raw column names), then rename; the rest can go in any order
df = drop_raw_columns(df)
df = rename_raw_columns(df)
df = type_raw_columns(df)
df = reorder_raw_columns(df)
return df
df = formatted_columns(raw)
df.head()
Figuring out the Matchups
Now we’re getting somewhere. What we need to do now is figure out how to represent the matchup information in a more useful way. It’s pretty clear that the ‘@’ sign means an away game, while ‘vs.’ is a home game. We can use pandas
to split apart the matchup information into a more useful representation.
Let’s try out a few things.
df['matchup'].str.contains('@').head()
df['matchup'].str.split(' ').head()
df['matchup'].str.split(' ').str.get(-1).head()
We can put this together into a new function which will remove the ‘matchup’ column and replace it with two new, more useful columns. The first is whether the game is home (‘H’) or away (‘A’). The second is the opponent abbreviation.
def parse_matchup(df):
"""Add more useful columns based upon matchup information."""
df['ha'] = np.where(df['matchup'].str.contains('@'), 'A', 'H')
df['opp'] = df['matchup'].str.split(' ').str.get(-1)
# Put new columns where matchup used to be, and drop matchup
cols = []
for col in df.columns:
if col not in ['matchup', 'ha', 'opp']:
cols.append(col)
elif col == 'matchup':
cols.append('ha')
cols.append('opp')
return df[cols]
parse_matchup(df).head()
More Data Cleanup
Now things are looking good. Let's put all this formatting into a new function, which will return the formatted DataFrame. We will also use the 'game_id' as the index for the DataFrame. Notice that the 'game_id' is an integer. This is an internal identifier that the web site database is using to uniquely identify each NBA game. We should keep track of this identifier, since we might want to go back to get more data on this game in the future. For instance, we might want to get shot charts or possession-by-possession information. We would need that 'game_id' in order to scrape the more detailed game data.
def formatted_teamgamelogs(df):
"""Formatted stats.nba.com teamgamelogs DataFrame from raw DataFrame."""
df = formatted_columns(df)
df = parse_matchup(df)
return df.set_index('game_id')
games = formatted_teamgamelogs(raw)
games.head()
Now let's see what happens if we select all rows in the DataFrame with that first 'game_id'.
games.loc[21601224]
Interesting. Each game has two rows in the DataFrame, one for each team. One of the teams is home, and the other is away. One of the teams won, and the other lost. Each team's box score is recorded on its own row.
This explains why we could drop the columns ‘BLKA’ and ‘PFD’ earlier. For example, the ‘BLKA’ from the Rockets’ row is the same as the ‘BLK’ column from the Timberwolves’ row. The ‘BLKA’ and ‘PFD’ information is redundant.
Let's look at how many unique 'game_id' values there are. You can think of these as unique matchups.
len(games.index.unique())
There are half as many matchups as there are team games played. That’s because there are two teams per matchup.
home_games = games[games['ha'] == 'H']
home_games.shape
away_games = games[games['ha'] == 'A']
away_games.shape
Creating a New Table of Matchups
What we really want for analysis is a table of matchups. We want to predict matchups, so that is the right unit of data. We can build a new table of matchups, where each matchup is identified by the 'game_id'. We can keep track of which team was the home team and which was the away team, and put both teams' box scores on the same row, as long as we keep track of which stats belong to the home team and which to the away team.
home_games.join(away_games, lsuffix='_h', rsuffix='_a').head()
This new DataFrame is the right next step, but we need to clean it up a bit. First, we don't need two win/loss columns. We only need one, telling us whether the home team won or the away team won.
def home_away_win(row):
    """Return 'H' if the home team won the matchup, 'A' if the away team won."""
home = row['wl_h']
away = row['wl_a']
assert home in ['W', 'L']
assert away in ['W', 'L']
assert home != away
if home == 'W':
return 'H'
else:
return 'A'
We can drop a bunch of redundant columns and rename the ones that apply to the matchup itself, rather than particular teams.
def drop_matchup_columns(df):
return df.drop([
'wl_h',
'wl_a',
'season_a',
'date_a',
'ha_h',
'opp_h',
'ha_a',
'opp_a',
], axis='columns')
def rename_matchup_columns(df):
return df.rename(columns={
'season_h': 'season',
'date_h': 'date',
})
def reorder_matchup_columns(df):
"""Column order for game statistics DataFrame, alternating home and away stats."""
first_cols = ['season', 'date', 'won',]
raw_cols = [col.replace('_h', '') for col in df.columns if col.endswith('_h')]
cols = list(chain.from_iterable((col+'_h', col+'_a') for col in raw_cols))
return df[first_cols + cols]
Putting everything together in one place, here is a function which will create the matchups and reformat the resulting DataFrame. It will also add a column storing which team (home or away) won the matchup.
def matchups_from_teamgamelogs(df):
"""DataFrame with one unique game_id per row and team stats identified by home and away teams."""
home_games = df[df['ha'] == 'H']
away_games = df[df['ha'] == 'A']
matchups = home_games.join(away_games, lsuffix='_h', rsuffix='_a')
    # Add new 'won' column: 'H' if home team wins or 'A' if away team wins
matchups['won'] = matchups.apply(home_away_win, axis='columns')
matchups = drop_matchup_columns(matchups)
matchups = rename_matchup_columns(matchups)
matchups = reorder_matchup_columns(matchups)
return matchups
matchups_from_teamgamelogs(games).head()
One last step is to set some categorical data types and reorder the columns. Categories make a pandas DataFrame faster and more memory-efficient for columns that only take on a few possible values. For instance, the 'won' column can only be 'H' or 'A'. The function below sets certain columns to be categorical data. It also makes sure the box score stats are integers.
def format_matchups(df):
    """Set categorical and integer dtypes and put the identifying columns first."""
df['season'] = df['season'].astype('category')
df['season_type'] = df['season_type'].astype('category')
df['won'] = df['won'].astype('category')
df['team_h'] = df['team_h'].astype('category')
df['team_a'] = df['team_a'].astype('category')
# Set all 'object' columns to int (except for team names)
for col in df.columns:
if 'team' not in col and (col.endswith('_h') or col.endswith('_a')):
if df[col].dtype == 'object':
df[col] = df[col].astype(int)
first_cols = ['season', 'season_type', 'date', 'won']
cols = first_cols + [col for col in df.columns if col not in first_cols]
return df[cols]
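If you're curious how much categoricals actually save, here's a small standalone illustration (exact numbers will vary by platform):
s = pd.Series(['H', 'A'] * 10000)
s.memory_usage(deep=True)  # object dtype stores a full Python string per entry
s.astype('category').memory_usage(deep=True)  # category stores small integer codes plus two labels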
Processing an Entire NBA Season
Now we are almost done. We want a function which can scrape (or read from a file) the raw data for a given season, for all possible season types (regular, playoffs or pre-season). This function should then process the raw data and create the matchups for that season. The rows should be tagged with their season type so we can tell the regular season games from the other types later.
def nba_season_matchups(season=None, data_dir=None, overwrite=False):
"""All NBA matchups for a given season (regular, playoffs and pre-season)."""
matchups = None
for season_type in NBA_SEASON_TYPES:
df = raw_teamgamelogs(
season,
NBA_SEASON_TYPES[season_type],
data_dir,
overwrite,
sleep_for=2,
)
if len(df) == 0:
# no rows came back; this is probably a pre-season with no data, so just continue
continue
df = formatted_teamgamelogs(df)
df = matchups_from_teamgamelogs(df)
df['season_type'] = season_type
if matchups is None:
matchups = df.copy()
else:
matchups = matchups.append(df)
return format_matchups(matchups)
Now we can run our master function on the entire 2016-17 season.
matchups = nba_season_matchups(season, data_dir=DATA_DIR)
matchups['season_type'].value_counts()
As we already knew, there are 1230 regular season matchups. We also see we got the pre-season and playoff matchups as well.
matchups.head()
Getting All Completed NBA Seasons
The next step is to get all the completed seasons since 1996-97 (the earliest season for which the web site has advanced stats). To do this, all we need to do is call our nba_season_matchups() function above for each completed season. Let's write a few small functions to help us figure out the latest completed NBA season.
def nba_season_from_year(y):
"""Convert integer year into stats.nba.com season string."""
return str(y) + '-' + str(int(y)+1)[2:]
def nba_year_from_season(s):
"""Convert stats.nba.com season string into integer year"""
return int(s[:4])
def current_nba_season_year():
"""Year in which current NBA season began."""
now = datetime.now()
if now.month > 8:
return now.year
else:
return now.year - 1
def current_nba_season():
"""Current NBA season."""
return nba_season_from_year(current_nba_season_year())
def completed_nba_season_year():
"""Year in which most recent completed NBA season began."""
return current_nba_season_year() - 1
def completed_nba_season():
"""Most recent completed NBA season."""
return nba_season_from_year(completed_nba_season_year())
completed_nba_season()
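A couple of quick checks on these helpers (run at the time of writing, during the 2017-18 season):
nba_season_from_year(1996)  # '1996-97'
nba_year_from_season('2016-17')  # 2016
current_nba_season()  # '2017-18'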
Now it's easy to just call our season-by-season scraper and collect all the results in one large DataFrame. Notice that we are using the tqdm package here. This package creates a "progress bar" to show you what percentage of an iteration has been accomplished, along with some timing information. This is helpful when scraping, since scraping can be slow.
def completed_nba_season_matchups(data_dir=None, overwrite=False):
"""Get NBA matchups for completed seasons."""
end_year = completed_nba_season_year()
seasons = [nba_season_from_year(y) for y in range(1996, end_year+1)]
matchups = None
for season in tqdm(seasons):
df = nba_season_matchups(season, data_dir, overwrite)
if matchups is None:
matchups = df.copy()
else:
matchups = matchups.append(df)
return format_matchups(matchups)
matchups = completed_nba_season_matchups(data_dir=DATA_DIR)
matchups.shape
Sanity-Checking the Data
We have almost 27,000 matchups in our DataFrame. Let's break things out by season and type of game to make sure they make sense.
matchups.pivot_table(
index='season',
columns='season_type',
values='team_h',
aggfunc='count',
fill_value=0,
margins=True,
)
We only have pre-season games since the 2014-15 season.
Looking at the season totals, you can see the effect of the lockouts in 1998-99 and 2011-12. You can also see the impact of the league expansion from 29 to 30 teams in 2004 (with the addition of Charlotte). See the history of NBA seasons here.
You might think we are done, but there is one complication that we need to work through. This shows the importance of looking carefully at your data. It’s hard to go through almost 27,000 rows by eye, so you need to think carefully and make sure things make sense.
Checking the Teams
One thing we can and should check is: what teams are represented in our data?
def unique_home_away_teams(df, ha):
"""Unique home or away teams (as specified) from matchups DataFrame."""
ha = ha.lower()
if ha not in ['h', 'a']:
raise ValueError('invalid home or away symbol')
team_abbr = 'team_' + ha
team_id = 'team_id_' + ha
team_name = 'team_name_' + ha
return (
df[[team_abbr, team_id, team_name]]
.reset_index()
.drop(columns=['game_id'])
.drop_duplicates()
.sort_values(team_id)
.rename(
columns={
team_abbr: 'team',
team_id: 'team_id',
team_name: 'team_name',
}
)
.set_index('team_id')
)
def unique_teams(df):
    """Unique teams in matchups DataFrame."""
    teams = unique_home_away_teams(df, 'h')
    teams = teams.append(unique_home_away_teams(df, 'a'))  # append returns a new DataFrame
    return teams.drop_duplicates().sort_index()
unique_teams(matchups)
Whoa, right away we see that we have some non-NBA teams represented, from pre-season exhibition-type games. We will exclude those in a moment, since we only want to have NBA matchups in our analysis.
That’s the easy part. The harder part is that team names and abbreviations have changed over time. Let’s see what’s going on in more detail.
First, there are some simple cases that aren’t too difficult to understand. The Clippers changed their name in 2015 from the Los Angeles Clippers to the LA Clippers. The abbreviation and ‘team_id’ didn’t change, however. No big deal. There’s a similar situation with Washington.
For a team move like the Thunder, you have to keep track of the abbreviation change. Although the 'team_id' stays the same, the abbreviation changes from 'SEA' to 'OKC', reflecting the team's move from Seattle to Oklahoma City in 2008. There's a similar situation with the Grizzlies and the Nets.
The most complicated situation is with Charlotte and New Orleans.
Hornets and Pelicans
As described in detail here, the Charlotte Hornets moved to New Orleans in 2002. The Charlotte Bobcats joined the NBA as an expansion team in 2004. Prior to the 2014-15 season, the Charlotte Bobcats and the New Orleans Hornets agreed to a deal whereby New Orleans would become the Pelicans and the Hornets name and history would revert to Charlotte. This explains why the abbreviation 'CHA' is used for both the Charlotte Bobcats and the Charlotte Hornets, in addition to the 'CHH' abbreviation for the other Charlotte Hornets (the team that moved in 2002 and ultimately became the Pelicans).
Another complication is that because of Hurricane Katrina, the New Orleans Hornets had to relocate for much of the 2005-06 and 2006-07 seasons to Oklahoma City. That's why there is an abbreviation 'NOK' to represent the games that were played under the label New Orleans/Oklahoma City Hornets.
Let’s sort this out and make sure we understand this in detail.
First, let's build a table with the current 30 NBA teams, so we can keep track of historical abbreviations and how they map to current teams. Remember that the 'team_id' is meant to be a unique identifier for each team, even if the abbreviation or team name changes.
def current_teams(df):
"""Team abbreviations for the most recent completed NBA season."""
current_teams = (
df.loc[
(df['season'] == completed_nba_season()) & (df['season_type'] == 'regular'),
['team_h', 'team_id_h']
]
.drop_duplicates()
.sort_values(['team_id_h'])
.rename(
columns={
'team_h': 'curr_team',
'team_id_h': 'team_id',
}
)
.set_index('team_id')
)
return current_teams
len(current_teams(matchups))
teams = unique_teams(matchups).merge(current_teams(matchups), left_index=True, right_index=True)
Let's build a table of any teams that have had a name change or an abbreviation change.
team_changes = teams[['team', 'curr_team']].groupby('curr_team').filter(lambda team: len(team) > 1)
team_changes
Now, let’s look at how many games, per season, were played by various Charlotte and New Orleans franchises.
regular_season_summary = (
matchups.loc[
matchups['season_type'] == 'regular',
['season', 'team_h', 'team_a']
].pivot_table(
index='season',
columns='team_h',
values='team_a',
aggfunc='count',
fill_value=0,
margins=True,
)
)
regular_season_summary.loc[:, ['CHA', 'CHH', 'NOH', 'NOK', 'NOP']]
This table makes it clear that ‘CHH’ is the original Charlotte Hornets, who became the New Orleans Hornets (and the New Orleans/Oklahoma City Hornets), before becoming the New Orleans Pelicans (‘NOP’). The abbreviation ‘CHA’ refers to the Bobcats, who then became the new Charlotte Hornets in the 2014-15 season.
“Fixing” the Team Data
What might surprise you is that the NBA team data are wrong!
If you look at the team table above, you’ll see that ‘CHH’ has the same ‘team_id’ as ‘CHA’. Now, you may think this is correct, since the Bobcats and the New Orleans Hornets did a deal returning the history of the original Charlotte Hornets (‘CHH’) to Charlotte. That’s why ‘CHH’ and ‘CHA’ have the same ‘team_id’.
On the other hand, if your goal is to track the performance of a group of players as their franchise changed cities, this is incorrect. The guys who played on the 'CHH' squad in the 2001-02 season should be connected with the 'NOH' squad in the 2002-03 season.
The NBA team data as recorded don’t reflect that.
To keep track of this, we will do something that you should never do lightly. We are going to override our team data table to make sure the ‘CHH’ games are grouped with the current Pelicans franchise. We are effectively going to undo the Charlotte/New Orleans deal to transfer the Hornets history.
teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'
teams.loc[teams['team'].isin(['CHA', 'CHH', 'NOH', 'NOK', 'NOP']), :]
Of course, we’re not overriding the raw data. But it’s important to keep track anytime you change data.
Let’s write a function to record what we’re doing.
def team_info(df):
"""DataFrame of NBA team IDs, unadjusted abbreviations, names and adjusted abbrevations."""
teams = unique_teams(df).merge(current_teams(df), left_index=True, right_index=True)
    teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'  # OVERRIDING Hornets/Pelicans
return teams
Lastly, we will write a function to exclude non-NBA matchups. Again, these were pre-season exhibition-style games. To exclude those games all we need to do is filter out any teams whose ‘team_id’ isn’t one of the current 30 NBA teams.
def exclude_non_nba_matchups(df):
"""Filter out matchups involving non-NBA teams."""
nba_teams = current_teams(df)
non_nba_matchups = list(
df.loc[~df['team_id_h'].isin(nba_teams.index)].index
)
non_nba_matchups.extend(list(
df.loc[~df['team_id_a'].isin(nba_teams.index)].index)
)
return df.loc[~df.index.isin(non_nba_matchups), :]
Let’s test this to see how many games get filtered out.
matchups.shape
exclude_non_nba_matchups(matchups).shape
The function filtered out 25 games.
Let’s put all these team cleanup steps together. First, we will write a function to give us a mapping between the historical team abbreviation, and the “correct” current team abbreviation. In this case, “correct” means adjusted for the Hornets/Pelicans issue.
def team_mapping(df):
"""Dict mapping historical NBA team abbreviation to abbreviation of current franchise."""
teams = team_info(df) # this creates the lookup table with Hornets/Pelicans issue corrected
list_of = teams.to_dict(orient='list')
return dict(zip(list_of['team'], list_of['curr_team']))
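A quick spot-check of the mapping, using the franchise moves discussed above:
abbr_to_current = team_mapping(matchups)
abbr_to_current['SEA']  # 'OKC': Seattle games map to the current Oklahoma City franchise
abbr_to_current['CHH']  # 'NOP': the original Charlotte Hornets map to the Pelicans, per our override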
def update_matchups(df):
    """Prepare matchups for use in analytics."""
    df = exclude_non_nba_matchups(df)  # filter out any matchups that include non-NBA teams
abbr_to_current = team_mapping(df)
home_map = df['team_h'].map(abbr_to_current).rename('team_curr_h')
away_map = df['team_a'].map(abbr_to_current).rename('team_curr_a')
# Add new columns for current team abbreviations
df = pd.concat([df.copy(), home_map, away_map], axis='columns').reset_index()
df['team_curr_h'] = df['team_curr_h'].astype('category')
df['team_curr_a'] = df['team_curr_a'].astype('category')
return df.set_index('game_id')
Notice that the above function doesn’t actually replace the team abbreviations in our matchups table. Rather, it creates two new columns for the current teams (home and away). That way, we can see what the original information was and make sure it maps correctly in our analysis. This is safer than just overwriting the old team abbreviations.
We will use the current team abbreviations in building our matchup prediction models. This will allow us to track changes in team quality over time without getting confused by name changes or franchise moves.
One last step is to write a function that will save our final, processed DataFrame for future analysis. This DataFrame is relatively large, and we've put some effort into formatting it. So, instead of using CSV, we'll use another built-in format that pandas can write for us: Python's pickle format. The pickle format is an efficient binary format that will store all the information in our DataFrame, including the data types and categories we set up.
As noted in the pickle documentation, you should be very careful about unpickling an object that you obtained from the Internet or via email. Only unpickle objects that you have previously pickled yourself or you obtained from somebody you really, really trust. By the way, it turns out that hackers have even figured out ways to inject malware into CSV files, so the general and best advice is: always be very careful opening any type of file that you've downloaded or received by email!
def save_matchups(df, output_dir):
"""Save pickle file of NBA matchups prepared for analytics."""
seasons = list(df['season'].unique())
start_season = min(seasons).replace('-', '_')
end_season = max(seasons).replace('-', '_')
PKL_TEMPLATE = 'stats_nba_com-matchups-{start_season}-{end_season}'
pklfilename = (
PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
'.pkl'
)
pklfile = output_dir.joinpath(pklfilename)
if pklfile.exists():
timestamp = str(datetime.now().strftime("%Y-%m-%d-%H-%M"))
backupfilename = (
PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
'.bak.{timestamp}'.format(timestamp=timestamp) +
'.pkl'
)
backupfile = output_dir.joinpath(backupfilename)
pklfile.rename(backupfile)
df.to_pickle(pklfile)
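Loading the prepared data back later is a one-liner (the filename below assumes the 1996-97 through 2016-17 span we scraped; adjust it if your span differs):
df = pd.read_pickle(OUTPUT_DIR / 'stats_nba_com-matchups-1996_97-2016_17.pkl')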
Finishing Up
Now, we can finally put everything together into one master function which reads all the matchups since 1996 and saves them to a pickle file, ready for later analysis.
def nba_stats_matchups(data_dir=None, output_dir=None, overwrite=False):
"""Get and prepare stats.nba.com matchups for all completed seasons since 1996."""
df = completed_nba_season_matchups(data_dir=data_dir, overwrite=overwrite)
df = update_matchups(df)
if output_dir:
save_matchups(df, output_dir)
return df
matchups = nba_stats_matchups(data_dir=DATA_DIR, output_dir=OUTPUT_DIR, overwrite=False)
To conclude the data gathering, here’s a summary of all the information we’ve obtained.
matchups.info()
We’ve accomplished a lot. Now that we have data, we’re ready to move on to the matchup predictions in future posts. To conclude, let’s look at one important fact we can immediately learn from the data.
A Quick Look at Home Court Advantage in the NBA
Let’s calculate what home court has been worth in terms of regular season win percentage between the 1996-97 and 2016-17 seasons.
matchups.loc[matchups['season_type'] == 'regular', 'won'].value_counts()
def count_home_away_wins(df, ha, season_type='regular'):
return df.loc[df['season_type'] == season_type, 'won'].value_counts()[ha]
home_wins = count_home_away_wins(matchups, 'H')
away_wins = count_home_away_wins(matchups, 'A')
home_wins/(home_wins+away_wins)
away_wins/(home_wins+away_wins)
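To see how stable this edge is from year to year, here's a quick per-season breakdown (a sketch using the matchups DataFrame from above):
regular = matchups[matchups['season_type'] == 'regular']
(regular['won'] == 'H').groupby(regular['season']).mean()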
The home team has won about 60% of the time on average during the NBA regular season. Any prediction model is going to have to address this fact. We’ll look a lot more closely at this topic in future posts.