Web Scraping NBA Team Matchups and Box Scores

We are going to use machine learning and statistics to predict NBA matchups. To do this, we are going to need data on NBA games, and lots of it. So let’s get all the team matchups and box scores from stats.nba.com, and make them ready for use.

This post has two purposes. The first is to show you how to do the actual web scraping. The second is to show you how to examine data before you use it. Data are almost always a bit messy and need to be handled with care, so it's important to take some time to look at them and make sure they're clean before use.

This post has a lot of code. If you don't want to spend time going through it, you can simply run it: the code in this post will scrape and store the data for you, and you won't need to run it again after you have the data.

A Jupyter Notebook with the code and portions of the text from this post can be found here.

I think the most useful information in this post is learning how to assess the data quality, and make smart decisions about how to address potential problems. If you decide to just skim the code and focus on the decisions, hopefully this example will still be useful for you. You may also find it interesting to see how real-world NBA facts (such as expansion teams, team moves and name changes) actually impact the data!

The Data

We are going to scrape Team Box Scores for completed NBA seasons from stats.nba.com. We are going to start with traditional box scores, which give us plenty of statistics to consider, as well as all the matchup information we need. If you explore the site, you'll see there are also other more advanced data available. We'll consider these other data in future posts.

The advanced data are only available going back to the 1996-97 season, so we’ll limit our traditional box scores to the same time period. The site actually has traditional box scores going back to the 1946-47 season. Such old data are interesting, but aren’t useful for predicting current matchups. We have to consider how much the game has changed over the years. Although there have been a number of NBA rule changes since 1996 and game style has evolved, I think data over this time period are still useful today.

The data from 1996-97 to 2016-17 comprise 21 complete seasons, including pre-season data from recent years. This is plenty of data for our purposes. Notice that we’re not going to download data from the current 2017-18 season yet. We are going to use the historical data for building prediction models, since we know how the seasons turned out. After we build the model, we can apply it to the current season and see how it does over the remainder of the season.
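The 21 season labels follow a simple pattern, with one wrinkle at the turn of the century ('1999-00'). A small helper can generate them (a sketch; the name `season_strings` is my own, not from the site's API):

```python
def season_strings(first_year=1996, last_year=2016):
    """Season labels like '1996-97', handling the century rollover ('1999-00')."""
    return ['{}-{:02d}'.format(year, (year + 1) % 100)
            for year in range(first_year, last_year + 1)]

seasons = season_strings()
print(seasons[0], seasons[3], seasons[-1])  # 1996-97 1999-00 2016-17
```

The zero-padded `{:02d}` format is what keeps '1999-00' from coming out as '1999-0'.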

Why Write a Web Scraper

As I mentioned, there’s a lot of code in this post. You may wonder why it’s worth it to build a web scraper.

First, in many cases it's the easiest way to get the data you need. Even if it takes a few hours to build a scraper, that's a lot faster than downloading the data manually. Consider the excellent site Basketball Reference, which offers the ability to download games into a CSV file. Here is a link to the 2016-17 season games.

Notice that this site shows games grouped by month. You can download the games by looking at the “Share & More” drop-down menu next to the month schedule header. Each month requires a separate download. So, if you want to download all the games for the 21 complete seasons since 1996-97, you'd need to manually download nearly 190 CSV files (there are 9 months in the typical NBA season). Then, to create a master spreadsheet of matchup results, you'd need to open them up in Microsoft Excel or Google Sheets and put all the data together in one place.

Also, if you click on one of the box score links (for example, the October 25, 2016 matchup between the Knicks and the Cavaliers), you’ll see very detailed box score information, but no easy way to extract and download all the box scores at once. Scraping the original data is much faster and more practical. As you’ll see, it’s not actually that hard.

Lastly, writing a scraper is essential if you want to update the data relatively frequently. If you create a matchup prediction model, you'll want to run it after every game to update the probabilities for future matchups. Imagine having to type information into a spreadsheet or download a few new spreadsheets every night during the NBA season.

The web scraper we'll develop here will also be able to scrape current-season games. So once you have it working, you can run it each day during the NBA season to get the latest matchup results and team box scores automatically.
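For daily runs, it's handy to compute the current season label rather than hard-code it. A minimal sketch (the September rollover month is my assumption about when a new season's label begins, not something the site documents):

```python
from datetime import date

def current_season(today=None):
    """Season label for a given date, e.g. '2017-18' for any date in that season.
    Assumes labels roll over in September (a hypothetical cutoff)."""
    today = today or date.today()
    start_year = today.year if today.month >= 9 else today.year - 1
    return '{}-{:02d}'.format(start_year, (start_year + 1) % 100)

print(current_season(date(2017, 11, 1)))  # 2017-18
```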

Scraping a JavaScript-Rendered Web Site

stats.nba.com is a JavaScript-rendered site. As discussed in our previous post on web-scraping, these sites work differently from HTML-rendered sites.

We can still use the Requests package to communicate with the web site, but we can’t use BeautifulSoup like we did previously. BeautifulSoup is great for scraping HTML-rendered sites, but it doesn’t work with JavaScript-rendered sites.

Scraping the box scores isn't actually very hard, but it has one very tricky aspect: the URL you send the GET request to isn't the same URL you see in your browser. Remember that the whole point of JavaScript-rendered pages is that the web site returns JavaScript code to your browser, which then runs that code to fetch the data and render it in the browser window. The URL you want to scrape is really that second URL, the one that returns the data.

Let’s look at the specific links for the box scores. The URL for the link to the page we can view is:

https://stats.nba.com/teams/boxscores-traditional/?Season=2016-17&SeasonType=Regular%20Season

It's clear from the URL that this page has traditional box scores for the 2016-17 regular season, and that we can access different seasons by changing the parameters at the end of the URL.

Now here’s the trick. The URL which returns the actual data is:

https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00
&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0
&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season
&ShotClockRange=&VsConference=&VsDivision=

That's all actually one line, which I've broken up for clarity. You can open a browser tab with the raw data by clicking this link. Notice that this URL has a lot of parameters.

So, how did I figure out that this is the correct URL?

Use the Browser Developer Tools

As mentioned in the prior post on web scraping, your browser’s developer tools make it easier to see what is going on. In that post, we used the Elements section of the developer tools to look at the different HTML tags. To scrape this JavaScript-rendered page, we will instead use the Network section. Here are instructions for how to use these tools in Chrome. Here are similar instructions for Firefox. I think these browsers are the easiest to use to understand JavaScript-rendered sites.

Under the Network section, check out the XHR tab. XHR is short for XMLHttpRequest, but the details don’t matter. The point is that if you load (or reload) the page (with the “normal” URL above), you will see a few lines show up under the XHR tab, one of which starts with teamgamelogs. That’s the item we want.

If you select that teamgamelogs item, in either Chrome or Firefox, you can explore what's in that object. I'll move on to the scraping now, but hopefully this brief introduction explains how you could figure this out for yourself on a different web page. As we go through the steps below, see if you can follow along and find the corresponding information in your browser's developer tools on that XHR tab.

If you explore other pages on stats.nba.com with the developer tools open to the XHR tab, you’ll be able to apply these techniques to figure out how to scrape those other pages, too.

Scraping the Site

Let’s do all the necessary imports.

from itertools import chain
from pathlib import Path
from time import sleep
from datetime import datetime

We are going to use Requests to scrape the JavaScript.

import requests
from tqdm import tqdm
tqdm.monitor_interval = 0

We will use pandas and numpy to explore and organize the data toward the end.

import numpy as np
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

We will store the data (both in raw form and processed for later analysis) on disk. The code below will create directories on your computer if they don't exist already. Feel free to change the directories below if you're running this code on your computer.

PROJECT_DIR = Path.cwd().parent 
DATA_DIR = PROJECT_DIR / 'data' / 'scraped' / 'teamgamelogs'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +
    'AppleWebKit/537.36 (KHTML, like Gecko) ' +
    'Chrome/61.0.3163.100 Safari/537.36'
)

REQUEST_HEADERS = {
    'user-agent': USER_AGENT,
}

Below is the base URL for the team box scores. This type of URL is also called a web API endpoint. Recall that an API is an application programming interface, which makes it easier to communicate with a web site and obtain information.

The site identifies the NBA itself as league '00'.

NBA_URL = 'http://stats.nba.com/stats/teamgamelogs'
NBA_ID = '00'

The site can give us information for the regular season, playoffs and pre-season.

NBA_SEASON_TYPES = {
    'regular': 'Regular Season',
    'playoffs': 'Playoffs',
    'preseason': 'Pre Season',
}

To explore, we will start with the 2016-17 season, the most recent completed season. We will end up scraping all completed seasons since 1996-97.

season = '2016-17'
season_type = NBA_SEASON_TYPES['regular']
nba_params = {
    'LeagueID': NBA_ID,
    'Season': season,
    'SeasonType': season_type,
}
r = requests.get(NBA_URL, params=nba_params, headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)
r.status_code
200

Figuring out the JSON

The GET request worked. Let’s look at what type of data was returned.

r.headers['content-type']
'application/json; charset=utf-8'

The data are in JavaScript Object Notation, or JSON. Remember, if you want to look at the raw data in JSON format, you can open a browser tab pointed to the web API endpoint by clicking this link, or you can explore with your browser's developer tools.

Requests makes it easy to process JSON.

json = r.json()
type(json)
dict
json.keys()
dict_keys(['resource', 'parameters', 'resultSets'])

You can see which parameters are available, and how they were filled in by the GET request.

json['parameters']
{'DateFrom': None,
 'DateTo': None,
 'GameSegment': None,
 'LastNGames': None,
 'LeagueID': '00',
 'Location': None,
 'MeasureType': None,
 'Month': None,
 'OppTeamID': None,
 'Outcome': None,
 'PORound': None,
 'PerMode': None,
 'Period': None,
 'PlayerID': None,
 'SeasonSegment': None,
 'SeasonType': 'Regular Season',
 'SeasonYear': '2016-17',
 'ShotClockRange': None,
 'TeamID': None,
 'VsConference': None,
 'VsDivision': None}

Now let’s get the results.

type(json['resultSets'])
list
len(json['resultSets'])
1
results = json['resultSets'][0]
type(results)
dict
results.keys()
dict_keys(['name', 'headers', 'rowSet'])
headers = results['headers']
type(headers)
list
len(headers)
56

There are 56 columns of data per row.

rows = results['rowSet']
type(rows)
list
len(rows)
2460

There are 30 NBA teams, and each team played 82 games in the 2016-17 season. Since 82 times 30 is 2460, there had better be 2460 rows in the results.

type(rows[0])
list

Each row is itself a list. Let’s see how many columns are in the row.

len(rows[0])
56

As expected, there are 56 columns per row, same as we saw for the column headers. Let’s look at the first row in the results.

print(rows[0])
['2016-17', 1610612745, 'HOU', 'Houston Rockets', '0021601224', '2017-04-12T00:00:00', 'HOU vs. MIN', 'W', 48.0, 45, 97, 0.464, 19, 56, 0.339, 14, 22, 0.636, 16, 31, 47, 37, 10.0, 11, 7, 3, 14, 15, 123, 5.0, 1, 1, 1, 1, 141, 244, 113, 1112, 19, 3, 1400, 1704, 1236, 2216, 128, 1551, 625, 10, 280, 235, 324, 440, 159, 2104, 164, 898]

This is the Houston Rockets playing against the Minnesota Timberwolves on April 12, 2017. The Rockets won. To make sense of all the numbers, let's put things into a more useful framework.
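One quick way to decode a single row before building the DataFrame is to zip it with the headers (abbreviated here to a few of the 56 columns):

```python
# Pair each value with its column header to make a raw row readable.
headers = ['SEASON_YEAR', 'TEAM_ABBREVIATION', 'MATCHUP', 'WL', 'PTS']
row = ['2016-17', 'HOU', 'HOU vs. MIN', 'W', 123]
game = dict(zip(headers, row))
print(game['MATCHUP'], game['WL'], game['PTS'])  # HOU vs. MIN W 123
```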

Scraping and Saving the Raw Data

First, let’s put what we’ve learned about scraping the JSON into a function. We are going to store our data in a pandas DataFrame to make analysis easier.

def scrape_teamgamelogs(season, season_type, sleep_for=None):
    """Process JSON from stats.nba.com teamgamelogs endpoint and return unformatted DataFrame."""
    if sleep_for:
        sleep(sleep_for) # be nice to server by sleeping if we are scraping inside a loop
    nba_params = {
        'LeagueID': NBA_ID,
        'Season': season,
        'SeasonType': season_type,
    }
    r = requests.get(
        NBA_URL,
        params=nba_params,
        headers=REQUEST_HEADERS,
        allow_redirects=False,
        timeout=15,
    )
    r.raise_for_status()
    results = r.json()['resultSets'][0]
    headers = results['headers']
    rows = results['rowSet']
    return pd.DataFrame(rows, columns=headers)

It's important when scraping data from the web to save it to your computer, especially if it's data that won't change in the future. There's no point scraping data again and again if it won't change. In this case, we are looking at NBA games from prior seasons, so once we scrape these games we are done.

Let’s call our scraping function from inside another function. This outer function will first check if a file already exists on our computer that has the data. If the file exists, we don’t need to scrape the data; we can just load it from the file into a DataFrame.

On the other hand, if we scrape the data, our function needs to save them to a file so we don’t need to scrape the web site again. Also, our function has a parameter to overwrite the files on disk with freshly scraped data, in case we need or want to do that.

It’s important when scraping data (or collecting data of any kind) to save the raw results first. We are going to be doing a lot of processing to get the data into a useful format. We are going to end up saving those results too. However, if you have an error in your code and don’t save the raw data first, you might not be able to figure out later where you went wrong. That’s why it’s good practice to save the raw data, so you can always start over and reconstruct your results.

def raw_teamgamelogs(season, season_type, data_dir=None, overwrite=False, sleep_for=None):
    """Scrape stats.nba.com teamgamelogs or read from a CSV file if it exists."""
    if data_dir:
        csv_filename = 'stats_nba_com-teamgamelogs-{season}-{season_type}.csv'.format(
            season=season.replace('-', '_'),
            season_type=season_type.replace(' ', '_').lower(),
        )
        csvfile = data_dir.joinpath(csv_filename)
    else:
        csvfile = None
    if csvfile and not overwrite and csvfile.exists():
        df = pd.read_csv(csvfile)
    else:
        df = scrape_teamgamelogs(season, season_type, sleep_for)
        if csvfile:
            df.to_csv(csvfile, index=False)
    return df 
raw = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)
raw.shape
(2460, 56)

Remember that this function has also saved the raw data for us behind the scenes. If we call it again for this season and season type, it will read it from the file, rather than scrape it from the web site.

raw.head()
SEASON_YEAR TEAM_ID TEAM_ABBREVIATION TEAM_NAME GAME_ID GAME_DATE MATCHUP WL MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT OREB DREB REB AST TOV STL BLK BLKA PF PFD PTS PLUS_MINUS GP_RANK W_RANK L_RANK W_PCT_RANK MIN_RANK FGM_RANK FGA_RANK FG_PCT_RANK FG3M_RANK FG3A_RANK FG3_PCT_RANK FTM_RANK FTA_RANK FT_PCT_RANK OREB_RANK DREB_RANK REB_RANK AST_RANK TOV_RANK STL_RANK BLK_RANK BLKA_RANK PF_RANK PFD_RANK PTS_RANK PLUS_MINUS_RANK
0 2016-17 1610612745 HOU Houston Rockets 21601224 2017-04-12T00:00:00 HOU vs. MIN W 48.0 45 97 0.464 19 56 0.339 14 22 0.636 16 31 47 37 10.0 11 7 3 14 15 123 5.0 1 1 1 1 141 244 113 1112 19 3 1400 1704 1236 2216 128 1551 625 10 280 235 324 440 159 2104 164 898
1 2016-17 1610612737 ATL Atlanta Hawks 21601226 2017-04-12T00:00:00 ATL @ IND L 48.0 30 72 0.417 11 29 0.379 15 19 0.789 8 29 37 19 20.0 5 4 7 15 14 86 -18.0 1 1231 1231 1231 141 2375 2396 1886 687 800 956 1531 1657 1101 1592 1862 2040 1762 2250 1904 1213 1925 241 2221 2324 2220
2 2016-17 1610612749 MIL Milwaukee Bucks 21601219 2017-04-12T00:00:00 MIL @ BOS L 48.0 37 84 0.440 6 23 0.261 14 17 0.824 9 34 43 26 17.0 6 9 3 20 17 94 -18.0 1 1231 1231 1231 141 1497 1342 1526 2008 1700 2061 1704 1908 748 1342 1034 1203 533 1857 1594 109 440 1170 1757 2023 2220
3 2016-17 1610612760 OKC Oklahoma City Thunder 21601225 2017-04-12T00:00:00 OKC vs. DEN L 48.0 43 101 0.426 8 25 0.320 11 17 0.647 14 37 51 20 12.0 10 8 8 24 17 105 -6.0 1 1231 1231 1231 141 460 45 1757 1491 1391 1591 2097 1908 2190 307 533 263 1584 668 396 195 2138 2005 1757 1242 1564
4 2016-17 1610612746 LAC LA Clippers 21601228 2017-04-12T00:00:00 LAC vs. SAC W 48.0 43 83 0.518 10 28 0.357 19 25 0.760 11 32 43 29 6.0 4 6 1 15 18 115 20.0 1 1 1 1 141 460 1495 330 920 908 1184 899 870 1429 844 1375 1203 231 19 2142 537 40 241 1532 504 165

Cleaning up the Columns

Now we have the 2016-17 regular season games inside a pandas DataFrame. Let’s see what the columns are.

raw.columns
Index(['SEASON_YEAR', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',
       'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',
       'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS',
       'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'FGM_RANK',
       'FGA_RANK', 'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK', 'FG3_PCT_RANK',
       'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK', 'DREB_RANK',
       'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK', 'BLKA_RANK',
       'PF_RANK', 'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK'],
      dtype='object')

We don't need a number of these columns. All of the '_RANK' columns are designed to help the browser sort the table by different columns. The data in the '_RANK' columns are just the sort orders for each row, if the user wants to sort the results by that particular column. (For example, the first row ranks 244th by field goals made, per 'FGM_RANK'.)

We are going to throw away all these ‘_RANK’ columns, since we can sort in pandas by any column we wish anyway. We are also going to throw away the ‘_PCT’ columns, since we can compute any percentages we wish if we need them. For example, ‘FG_PCT’ = ‘FGM’ / ‘FGA’ for field goals, and similarly for three-point field goals (‘FG3’) and free throws (‘FT’).

We can throw away a few other columns too. The first is 'PLUS_MINUS', which is just the margin of victory (i.e., the team's points minus its opponent's points). We can compute this later from the actual points. Next is 'REB' = 'OREB' + 'DREB'. The last two are 'BLKA' (blocks allowed) and 'PFD' (personal fouls drawn). These are stats that apply to the opponent. As we will soon see, there are better ways to get this information.
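As a quick sanity check, the dropped columns really are derivable from the ones we keep. Here's the Rockets' row from above (with the Timberwolves' 118 points filled in by hand, since the opponent's score isn't on this row):

```python
import pandas as pd

# Recompute the dropped columns from the ones we keep.
row = pd.DataFrame({'FGM': [45], 'FGA': [97], 'OREB': [16], 'DREB': [31],
                    'PTS': [123], 'OPP_PTS': [118]})
fg_pct = (row['FGM'] / row['FGA']).round(3)   # reproduces FG_PCT
reb = row['OREB'] + row['DREB']               # reproduces REB
plus_minus = row['PTS'] - row['OPP_PTS']      # reproduces PLUS_MINUS
print(fg_pct[0], reb[0], plus_minus[0])  # 0.464 47 5
```

These match the 0.464, 47, and 5.0 values in the raw Rockets row.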

Let’s write a function to exclude the columns we don’t need.

def drop_raw_columns(df):
    cols = [col for col in df.columns if '_RANK' not in col]
    cols = [col for col in cols if '_PCT' not in col]
    cols = [col for col in cols if col not in ['PLUS_MINUS', 'REB', 'BLKA', 'PFD']]
    return df[cols]

Next, let’s write a function to rename the columns to tidy things up.

def rename_raw_columns(df):
    df = df.rename(columns={
        'SEASON_YEAR': 'season',
        'TEAM_ABBREVIATION': 'team',
        'GAME_DATE': 'date',
    })
    df.columns = df.columns.str.lower()
    return df

If we apply these functions to the data, here’s what we’d get so far. Let’s look at the column names and the data types.

rename_raw_columns(drop_raw_columns(raw)).columns
Index(['season', 'team_id', 'team', 'team_name', 'game_id', 'date', 'matchup',
       'wl', 'min', 'fgm', 'fga', 'fg3m', 'fg3a', 'ftm', 'fta', 'oreb', 'dreb',
       'ast', 'tov', 'stl', 'blk', 'pf', 'pts'],
      dtype='object')
rename_raw_columns(drop_raw_columns(raw)).dtypes
season        object
team_id        int64
team          object
team_name     object
game_id        int64
date          object
matchup       object
wl            object
min          float64
fgm            int64
fga            int64
fg3m           int64
fg3a           int64
ftm            int64
fta            int64
oreb           int64
dreb           int64
ast            int64
tov          float64
stl            int64
blk            int64
pf             int64
pts            int64
dtype: object

For some reason, game minutes (‘MIN’) and turnovers (‘TOV’) are floating-point, not integers. Let’s keep things simple and round game minutes to the nearest minute. Also, let’s make sure the game date is a datetime type.

def type_raw_columns(df):
    df['date'] = pd.to_datetime(df['date'])
    df['min'] = df['min'].round().astype(int) # round minutes
    df['tov'] = df['tov'].astype(int)
    return df

Lastly, let’s reorder the columns to make the table a little easier to read.

def reorder_raw_columns(df):
    first_cols = [
        'season',
        'date',
        'team',
        'matchup',
        'wl',
        'pts',
        'min',
    ]
    last_cols = [
        'game_id',
        'team_id',
        'team_name',
    ]
    cols = (
        first_cols +
        [col for col in df.columns if col not in (first_cols+last_cols)] +
        last_cols
    )
    return df[cols]

We can put all these steps in one function to clean up the raw data.

def formatted_columns(df):
    """Formatted stats.nba.com teamgamelogs DataFrame."""
    # drop must run before rename (it matches on the raw column names);
    # the remaining steps can run in any order
    df = drop_raw_columns(df)
    df = rename_raw_columns(df)
    df = type_raw_columns(df)
    df = reorder_raw_columns(df)
    return df
df = formatted_columns(raw)
df.head()
season date team matchup wl pts min fgm fga fg3m fg3a ftm fta oreb dreb ast tov stl blk pf game_id team_id team_name
0 2016-17 2017-04-12 HOU HOU vs. MIN W 123 48 45 97 19 56 14 22 16 31 37 10 11 7 14 21601224 1610612745 Houston Rockets
1 2016-17 2017-04-12 ATL ATL @ IND L 86 48 30 72 11 29 15 19 8 29 19 20 5 4 15 21601226 1610612737 Atlanta Hawks
2 2016-17 2017-04-12 MIL MIL @ BOS L 94 48 37 84 6 23 14 17 9 34 26 17 6 9 20 21601219 1610612749 Milwaukee Bucks
3 2016-17 2017-04-12 OKC OKC vs. DEN L 105 48 43 101 8 25 11 17 14 37 20 12 10 8 24 21601225 1610612760 Oklahoma City Thunder
4 2016-17 2017-04-12 LAC LAC vs. SAC W 115 48 43 83 10 28 19 25 11 32 29 6 4 6 15 21601228 1610612746 LA Clippers

Figuring out the Matchups

Now we’re getting somewhere. What we need to do now is figure out how to represent the matchup information in a more useful way. It’s pretty clear that the ‘@’ sign means an away game, while ‘vs.’ is a home game. We can use pandas to split apart the matchup information into a more useful representation.

Let’s try out a few things.

df['matchup'].str.contains('@').head()
0    False
1     True
2     True
3    False
4    False
Name: matchup, dtype: bool
df['matchup'].str.split(' ').head()
0    [HOU, vs., MIN]
1      [ATL, @, IND]
2      [MIL, @, BOS]
3    [OKC, vs., DEN]
4    [LAC, vs., SAC]
Name: matchup, dtype: object
df['matchup'].str.split(' ').str.get(-1).head()
0    MIN
1    IND
2    BOS
3    DEN
4    SAC
Name: matchup, dtype: object

We can put this together into a new function which will remove the ‘matchup’ column and replace it with two new, more useful columns. The first is whether the game is home (‘H’) or away (‘A’). The second is the opponent abbreviation.

def parse_matchup(df):
    """Add more useful columns based upon matchup information."""
    df['ha'] = np.where(df['matchup'].str.contains('@'), 'A', 'H')
    df['opp'] = df['matchup'].str.split(' ').str.get(-1)
    # Put new columns where matchup used to be, and drop matchup
    cols = []
    for col in df.columns:
        if col not in ['matchup', 'ha', 'opp']:
            cols.append(col)
        elif col == 'matchup':
            cols.append('ha')
            cols.append('opp')
    return df[cols]
parse_matchup(df).head()
season date team ha opp wl pts min fgm fga fg3m fg3a ftm fta oreb dreb ast tov stl blk pf game_id team_id team_name
0 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 14 22 16 31 37 10 11 7 14 21601224 1610612745 Houston Rockets
1 2016-17 2017-04-12 ATL A IND L 86 48 30 72 11 29 15 19 8 29 19 20 5 4 15 21601226 1610612737 Atlanta Hawks
2 2016-17 2017-04-12 MIL A BOS L 94 48 37 84 6 23 14 17 9 34 26 17 6 9 20 21601219 1610612749 Milwaukee Bucks
3 2016-17 2017-04-12 OKC H DEN L 105 48 43 101 8 25 11 17 14 37 20 12 10 8 24 21601225 1610612760 Oklahoma City Thunder
4 2016-17 2017-04-12 LAC H SAC W 115 48 43 83 10 28 19 25 11 32 29 6 4 6 15 21601228 1610612746 LA Clippers

More Data Cleanup

Now things are looking good. Let’s put all this formatting into a new function, which will return the formatted DataFrame. We will also use the ‘game_id’ as the index for the DataFrame. Notice that the ‘game_id’ is an integer. This is an internal identifier that the web site database is using to uniquely identify each NBA game. We should keep track of this identifier, since we might want to go back to get more data on this game in the future. For instance, we might want to get shot charts or possession-by-possession information. We would need that ‘game_id’ in order to scrape the more detailed game data.

def formatted_teamgamelogs(df):
    """Formatted stats.nba.com teamgamelogs DataFrame from raw DataFrame."""
    df = formatted_columns(df)
    df = parse_matchup(df)
    return df.set_index('game_id')
games = formatted_teamgamelogs(raw)
games.head()
season date team ha opp wl pts min fgm fga fg3m fg3a ftm fta oreb dreb ast tov stl blk pf team_id team_name
game_id
21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 14 22 16 31 37 10 11 7 14 1610612745 Houston Rockets
21601226 2016-17 2017-04-12 ATL A IND L 86 48 30 72 11 29 15 19 8 29 19 20 5 4 15 1610612737 Atlanta Hawks
21601219 2016-17 2017-04-12 MIL A BOS L 94 48 37 84 6 23 14 17 9 34 26 17 6 9 20 1610612749 Milwaukee Bucks
21601225 2016-17 2017-04-12 OKC H DEN L 105 48 43 101 8 25 11 17 14 37 20 12 10 8 24 1610612760 Oklahoma City Thunder
21601228 2016-17 2017-04-12 LAC H SAC W 115 48 43 83 10 28 19 25 11 32 29 6 4 6 15 1610612746 LA Clippers

Now let’s see what happens if we select all rows in the DataFrame with that first ‘game_id’.

games.loc[21601224]
season date team ha opp wl pts min fgm fga fg3m fg3a ftm fta oreb dreb ast tov stl blk pf team_id team_name
game_id
21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 14 22 16 31 37 10 11 7 14 1610612745 Houston Rockets
21601224 2016-17 2017-04-12 MIN A HOU L 118 48 48 91 11 23 11 15 10 39 37 15 7 3 15 1610612750 Minnesota Timberwolves

Interesting. Each game has two rows in the DataFrame, one for each team. One team is home and the other is away; one won and the other lost. Each team has its box score recorded on its own row.

This explains why we could drop the columns ‘BLKA’ and ‘PFD’ earlier. For example, the ‘BLKA’ from the Rockets’ row is the same as the ‘BLK’ column from the Timberwolves’ row. The ‘BLKA’ and ‘PFD’ information is redundant.
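You can express this mirror relationship directly in code. Here's a toy two-row frame built from the HOU/MIN game (the away team's 'BLKA' and 'PFD' values are inferred from the relationship being illustrated, not scraped):

```python
import pandas as pd

# Two rows of the same game: each team's "allowed/drawn" stat is
# the opponent's "committed" stat.
game = pd.DataFrame({
    'MATCHUP': ['HOU vs. MIN', 'MIN @ HOU'],
    'BLK':  [7, 3], 'BLKA': [3, 7],
    'PF':   [14, 15], 'PFD':  [15, 14],
})
home = game[game['MATCHUP'].str.contains('vs.', regex=False)].iloc[0]
away = game[game['MATCHUP'].str.contains('@')].iloc[0]
assert home['BLKA'] == away['BLK'] and home['PFD'] == away['PF']
```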

Let's look at how many unique 'game_id' values there are. You can think of these as unique matchups.

len(games.index.unique())
1230

There are half as many matchups as there are team games played. That’s because there are two teams per matchup.

home_games = games[games['ha'] == 'H']
home_games.shape
(1230, 23)
away_games = games[games['ha'] == 'A']
away_games.shape
(1230, 23)

Creating a New Table of Matchups

What we really want for analysis is a table of matchups. We want to predict matchups, so that is the right unit of data. We can build a new table of matchups, where each matchup is identified by the 'game_id'. We can keep track of which team was the home team and which was the away team, and put both teams' box scores on the same row, as long as we keep track of which stats belong to the home team and which to the away team.

home_games.join(away_games, lsuffix='_h', rsuffix='_a').head()
season_h date_h team_h ha_h opp_h wl_h pts_h min_h fgm_h fga_h fg3m_h fg3a_h ftm_h fta_h oreb_h dreb_h ast_h tov_h stl_h blk_h pf_h team_id_h team_name_h season_a date_a team_a ha_a opp_a wl_a pts_a min_a fgm_a fga_a fg3m_a fg3a_a ftm_a fta_a oreb_a dreb_a ast_a tov_a stl_a blk_a pf_a team_id_a team_name_a
game_id
21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 14 22 16 31 37 10 11 7 14 1610612745 Houston Rockets 2016-17 2017-04-12 MIN A HOU L 118 48 48 91 11 23 11 15 10 39 37 15 7 3 15 1610612750 Minnesota Timberwolves
21601225 2016-17 2017-04-12 OKC H DEN L 105 48 43 101 8 25 11 17 14 37 20 12 10 8 24 1610612760 Oklahoma City Thunder 2016-17 2017-04-12 DEN A OKC W 111 48 39 85 7 25 26 32 5 38 22 17 6 8 17 1610612743 Denver Nuggets
21601228 2016-17 2017-04-12 LAC H SAC W 115 48 43 83 10 28 19 25 11 32 29 6 4 6 15 1610612746 LA Clippers 2016-17 2017-04-12 SAC A LAC L 95 48 40 81 5 11 10 14 7 30 27 11 2 1 18 1610612758 Sacramento Kings
21601218 2016-17 2017-04-12 CLE H TOR L 83 48 28 76 10 37 17 24 10 32 20 15 5 7 14 1610612739 Cleveland Cavaliers 2016-17 2017-04-12 TOR A CLE W 98 48 40 86 8 26 10 16 12 36 20 9 8 2 20 1610612761 Toronto Raptors
21601230 2016-17 2017-04-12 POR H NOP L 100 48 41 85 13 30 5 13 14 39 22 25 4 2 18 1610612757 Portland Trail Blazers 2016-17 2017-04-12 NOP A POR W 103 48 40 90 8 24 15 21 11 31 20 11 13 3 17 1610612740 New Orleans Pelicans

This new DataFrame is the right next step, but we need to clean it up a bit. First, we don’t need two win/loss columns. We only need one, telling us whether the home team won or the away team won.

def home_away_win(row):
    home = row['wl_h']
    away = row['wl_a']
    assert home in ['W', 'L']
    assert away in ['W', 'L']
    assert home != away
    if home == 'W':
        return 'H'
    else:
        return 'A'

We can drop a bunch of redundant columns and rename the ones that apply to the matchup itself, rather than particular teams.

def drop_matchup_columns(df):
    return df.drop([
        'wl_h',
        'wl_a',
        'season_a',
        'date_a',
        'ha_h',
        'opp_h',
        'ha_a',
        'opp_a',
    ], axis='columns')
def rename_matchup_columns(df):
    return df.rename(columns={
        'season_h': 'season',
        'date_h': 'date',
    })
def reorder_matchup_columns(df):
    """Column order for game statistics DataFrame, alternating home and away stats."""
    first_cols = ['season', 'date', 'won',]
    raw_cols = [col.replace('_h', '') for col in df.columns if col.endswith('_h')]
    cols = list(chain.from_iterable((col+'_h', col+'_a') for col in raw_cols))
    return df[first_cols + cols]
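The chain.from_iterable call in reorder_matchup_columns is worth a closer look: it flattens the (home, away) pairs into one interleaved list, so each stat's home and away columns sit next to each other.

```python
from itertools import chain

# Interleave the home and away versions of each stat column.
raw_cols = ['pts', 'min', 'fgm']
cols = list(chain.from_iterable((col + '_h', col + '_a') for col in raw_cols))
print(cols)  # ['pts_h', 'pts_a', 'min_h', 'min_a', 'fgm_h', 'fgm_a']
```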

Putting everything together in one place, here is a function which will create the matchups and reformat the resulting DataFrame. It will also add a column storing which team (home or away) won the matchup.

def matchups_from_teamgamelogs(df):
    """DataFrame with one unique game_id per row and team stats identified by home and away teams."""
    home_games = df[df['ha'] == 'H']
    away_games = df[df['ha'] == 'A']
    matchups = home_games.join(away_games, lsuffix='_h', rsuffix='_a')
    # Add new 'won' column: 'H' if the home team wins or 'A' if the away team wins
    matchups['won'] = matchups.apply(home_away_win, axis='columns')
    matchups = drop_matchup_columns(matchups)
    matchups = rename_matchup_columns(matchups)
    matchups = reorder_matchup_columns(matchups)
    return matchups
matchups_from_teamgamelogs(games).head()
season date won team_h team_a pts_h pts_a min_h min_a fgm_h fgm_a fga_h fga_a fg3m_h fg3m_a fg3a_h fg3a_a ftm_h ftm_a fta_h fta_a oreb_h oreb_a dreb_h dreb_a ast_h ast_a tov_h tov_a stl_h stl_a blk_h blk_a pf_h pf_a team_id_h team_id_a team_name_h team_name_a
game_id
21601224 2016-17 2017-04-12 H HOU MIN 123 118 48 48 45 48 97 91 19 11 56 23 14 11 22 15 16 10 31 39 37 37 10 15 11 7 7 3 14 15 1610612745 1610612750 Houston Rockets Minnesota Timberwolves
21601225 2016-17 2017-04-12 A OKC DEN 105 111 48 48 43 39 101 85 8 7 25 25 11 26 17 32 14 5 37 38 20 22 12 17 10 6 8 8 24 17 1610612760 1610612743 Oklahoma City Thunder Denver Nuggets
21601228 2016-17 2017-04-12 H LAC SAC 115 95 48 48 43 40 83 81 10 5 28 11 19 10 25 14 11 7 32 30 29 27 6 11 4 2 6 1 15 18 1610612746 1610612758 LA Clippers Sacramento Kings
21601218 2016-17 2017-04-12 A CLE TOR 83 98 48 48 28 40 76 86 10 8 37 26 17 10 24 16 10 12 32 36 20 20 15 9 5 8 7 2 14 20 1610612739 1610612761 Cleveland Cavaliers Toronto Raptors
21601230 2016-17 2017-04-12 A POR NOP 100 103 48 48 41 40 85 90 13 8 30 24 5 15 13 21 14 11 39 31 22 20 25 11 4 13 2 3 18 17 1610612757 1610612740 Portland Trail Blazers New Orleans Pelicans

One last step is to set some categorical data types and reorder the columns. Categories make a pandas DataFrame faster and more memory-efficient for columns that only take on a few possible values. For instance, the ‘won’ column can only be ‘H’ or ‘A’. The function below sets certain columns to be categorical data. It also makes sure the box score stats are integers.

def format_matchups(df):
    df['season'] = df['season'].astype('category')
    df['season_type'] = df['season_type'].astype('category')
    df['won'] = df['won'].astype('category')
    df['team_h'] = df['team_h'].astype('category')
    df['team_a'] = df['team_a'].astype('category')
    # Set all 'object' columns to int (except for team names)
    for col in df.columns:
        if 'team' not in col and (col.endswith('_h') or col.endswith('_a')):
            if df[col].dtype == 'object':
                df[col] = df[col].astype(int)
    first_cols = ['season', 'season_type', 'date', 'won']
    cols = first_cols + [col for col in df.columns if col not in first_cols]
    return df[cols]
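If you’re curious why categories help, here’s a quick standalone illustration of the memory savings (on toy data, not the scraped matchups):

```python
import pandas as pd

# A column of repeated team abbreviations, stored as plain Python
# objects vs. as a pandas category (integer codes + small lookup table).
teams = pd.Series(['BOS', 'LAL', 'GSW'] * 10_000)
as_object = teams.memory_usage(deep=True)
as_category = teams.astype('category').memory_usage(deep=True)
print(as_object, as_category)  # the categorical column is far smaller
```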

Processing an Entire NBA Season

Now we are almost done. We want a function which can scrape (or read from a file) the raw data for a given season, for all possible season types (regular, playoffs or pre-season). This function should then process the raw data and create the matchups for that season. The rows for each season type should be stored so we can tell the regular season games from the other types later.

def nba_season_matchups(season=None, data_dir=None, overwrite=False):
    """All NBA matchups for a given season (regular, playoffs and pre-season).""" 
    matchups = None
    for season_type in NBA_SEASON_TYPES:
        df = raw_teamgamelogs(
            season,
            NBA_SEASON_TYPES[season_type],
            data_dir,
            overwrite,
            sleep_for=2,
        )
        if len(df) == 0:
            # no rows came back; this is probably a pre-season with no data, so just continue
            continue
        df = formatted_teamgamelogs(df)
        df = matchups_from_teamgamelogs(df)
        df['season_type'] = season_type
        if matchups is None:
            matchups = df.copy()
        else:
            matchups = matchups.append(df)
    return format_matchups(matchups)

Now we can run our master function on the entire 2016-17 season.

matchups = nba_season_matchups(season, data_dir=DATA_DIR)
matchups['season_type'].value_counts()
regular      1230
preseason     102
playoffs       79
Name: season_type, dtype: int64

As we already knew, there are 1230 regular season matchups. We can also see that we got the pre-season and playoff matchups.

matchups.head()
season season_type date won team_h team_a pts_h pts_a min_h min_a fgm_h fgm_a fga_h fga_a fg3m_h fg3m_a fg3a_h fg3a_a ftm_h ftm_a fta_h fta_a oreb_h oreb_a dreb_h dreb_a ast_h ast_a tov_h tov_a stl_h stl_a blk_h blk_a pf_h pf_a team_id_h team_id_a team_name_h team_name_a
game_id
21601224 2016-17 regular 2017-04-12 H HOU MIN 123 118 48 48 45 48 97 91 19 11 56 23 14 11 22 15 16 10 31 39 37 37 10 15 11 7 7 3 14 15 1610612745 1610612750 Houston Rockets Minnesota Timberwolves
21601225 2016-17 regular 2017-04-12 A OKC DEN 105 111 48 48 43 39 101 85 8 7 25 25 11 26 17 32 14 5 37 38 20 22 12 17 10 6 8 8 24 17 1610612760 1610612743 Oklahoma City Thunder Denver Nuggets
21601228 2016-17 regular 2017-04-12 H LAC SAC 115 95 48 48 43 40 83 81 10 5 28 11 19 10 25 14 11 7 32 30 29 27 6 11 4 2 6 1 15 18 1610612746 1610612758 LA Clippers Sacramento Kings
21601218 2016-17 regular 2017-04-12 A CLE TOR 83 98 48 48 28 40 76 86 10 8 37 26 17 10 24 16 10 12 32 36 20 20 15 9 5 8 7 2 14 20 1610612739 1610612761 Cleveland Cavaliers Toronto Raptors
21601230 2016-17 regular 2017-04-12 A POR NOP 100 103 48 48 41 40 85 90 13 8 30 24 5 15 13 21 14 11 39 31 22 20 25 11 4 13 2 3 18 17 1610612757 1610612740 Portland Trail Blazers New Orleans Pelicans

Getting All Completed NBA Seasons

The next step is to get all the completed seasons since 1996-97 (the earliest season for which the web site has advanced stats). To do this, all we need to do is call our nba_season_matchups() function above for each completed season. Let’s write a few small functions to help us figure out the latest completed NBA season.

def nba_season_from_year(y):
    """Convert integer year into stats.nba.com season string."""
    return str(y) + '-' + str(int(y)+1)[2:]
def nba_year_from_season(s):
    """Convert stats.nba.com season string into integer year"""
    return int(s[:4])
def current_nba_season_year():
    """Year in which current NBA season began."""
    now = datetime.now()
    if now.month > 8:
        return now.year
    else:
        return now.year - 1
def current_nba_season():
    """Current NBA season."""
    return nba_season_from_year(current_nba_season_year())
def completed_nba_season_year():
    """Year in which most recent completed NBA season began."""
    return current_nba_season_year() - 1
def completed_nba_season():
    """Most recent completed NBA season."""
    return nba_season_from_year(completed_nba_season_year())
completed_nba_season()
'2016-17'

Now it’s easy to just call our season-by-season scraper and collect all the results in one large DataFrame. Notice that we are using the tqdm package here. This package creates a “progress bar” to show you what percentage of an iteration has been accomplished, along with some timing information. This is helpful when scraping, since scraping can be slow.

def completed_nba_season_matchups(data_dir=None, overwrite=False):
    """Get NBA matchups for completed seasons."""
    end_year = completed_nba_season_year()
    seasons = [nba_season_from_year(y) for y in range(1996, end_year+1)]
    matchups = None
    for season in tqdm(seasons):
        df = nba_season_matchups(season, data_dir, overwrite)
        if matchups is None:
            matchups = df.copy()
        else:
            matchups = matchups.append(df)
    return format_matchups(matchups)
matchups = completed_nba_season_matchups(data_dir=DATA_DIR)
100%|██████████| 21/21 [00:22<00:00,  1.07s/it]
matchups.shape
(26812, 40)

Sanity-Checking the Data

We have almost 27,000 matchups in our DataFrame. Let’s break things out by season and type of game to make sure they make sense.

matchups.pivot_table(
    index='season',
    columns='season_type',
    values='team_h',
    aggfunc='count',
    fill_value=0,
    margins=True,
)
season_type playoffs preseason regular All
season
1996-97 72 0 1189 1261
1997-98 71 0 1189 1260
1998-99 66 0 725 791
1999-00 75 0 1189 1264
2000-01 71 0 1189 1260
2001-02 71 0 1189 1260
2002-03 88 0 1189 1277
2003-04 82 0 1189 1271
2004-05 84 0 1230 1314
2005-06 89 0 1230 1319
2006-07 79 0 1230 1309
2007-08 86 0 1230 1316
2008-09 85 0 1230 1315
2009-10 82 0 1230 1312
2010-11 81 0 1230 1311
2011-12 84 0 990 1074
2012-13 85 0 1229 1314
2013-14 89 0 1230 1319
2014-15 81 118 1230 1429
2015-16 86 109 1230 1425
2016-17 79 102 1230 1411
All 1686 329 24797 26812

We only have pre-season games since the 2014-15 season.

Looking at the season totals, you can see the effect of the lockouts in 1998-99 and 2011-12. You can also see the impact of the league expansion from 29 to 30 teams in 2004 (with the addition of Charlotte). See the history of NBA seasons here.

You might think we are done, but there is one complication that we need to work through. This shows the importance of looking carefully at your data. It’s hard to go through almost 27,000 rows by eye, so you need to think carefully and make sure things make sense.

Checking the Teams

One thing we can and should check is: what teams are represented in our data?

def unique_home_away_teams(df, ha):
    """Unique home or away teams (as specified) from matchups DataFrame."""
    ha = ha.lower()
    if ha not in ['h', 'a']:
        raise ValueError('invalid home or away symbol')
    team_abbr = 'team_' + ha
    team_id = 'team_id_' + ha
    team_name = 'team_name_' + ha
    return (
        df[[team_abbr, team_id, team_name]]
            .reset_index()
            .drop(columns=['game_id'])
            .drop_duplicates()
            .sort_values(team_id)
            .rename(
                columns={
                    team_abbr: 'team',
                    team_id: 'team_id',
                    team_name: 'team_name',
                }
            )
            .set_index('team_id')
    )
def unique_teams(df):
    """Unique teams in matchups DataFrame."""
    teams = unique_home_away_teams(df, 'h')
    # append() returns a new DataFrame, so we must capture the result
    teams = teams.append(unique_home_away_teams(df, 'a'))
    return teams.drop_duplicates().sort_index()
unique_teams(matchups)
team team_name
team_id
94 MLN Milano Olimpia Milano
12304 FCB FC Barcelona Lassa
12315 RMD Real Madrid
12315 RMD Madrid Real Madrid
12321 FBU Istanbul Fenerbahce Ulker
12323 ALB Berlin Alba Berlin
12325 FLA Flamengo Flamengo
1610612737 ATL Atlanta Hawks
1610612738 BOS Boston Celtics
1610612739 CLE Cleveland Cavaliers
1610612740 NOK New Orleans Hornets
1610612740 NOP New Orleans Pelicans
1610612740 NOH New Orleans Hornets
1610612741 CHI Chicago Bulls
1610612742 DAL Dallas Mavericks
1610612743 DEN Denver Nuggets
1610612744 GSW Golden State Warriors
1610612745 HOU Houston Rockets
1610612746 LAC Los Angeles Clippers
1610612746 LAC LA Clippers
1610612747 LAL Los Angeles Lakers
1610612748 MIA Miami Heat
1610612749 MIL Milwaukee Bucks
1610612750 MIN Minnesota Timberwolves
1610612751 NJN New Jersey Nets
1610612751 BKN Brooklyn Nets
1610612752 NYK New York Knicks
1610612753 ORL Orlando Magic
1610612754 IND Indiana Pacers
1610612755 PHI Philadelphia 76ers
1610612756 PHX Phoenix Suns
1610612757 POR Portland Trail Blazers
1610612758 SAC Sacramento Kings
1610612759 SAS San Antonio Spurs
1610612760 OKC Oklahoma City Thunder
1610612760 SEA Seattle SuperSonics
1610612761 TOR Toronto Raptors
1610612762 UTA Utah Jazz
1610612763 MEM Memphis Grizzlies
1610612763 VAN Vancouver Grizzlies
1610612764 WAS Washington Bullets
1610612764 WAS Washington Wizards
1610612765 DET Detroit Pistons
1610612766 CHA Charlotte Bobcats
1610612766 CHA Charlotte Hornets
1610612766 CHH Charlotte Hornets

Whoa, right away we see that we have some non-NBA teams represented, from pre-season exhibition-type games. We will exclude those in a moment, since we only want to have NBA matchups in our analysis.

That’s the easy part. The harder part is that team names and abbreviations have changed over time. Let’s see what’s going on in more detail.

First, there are some simple cases that aren’t too difficult to understand. The Clippers changed their name in 2015 from the Los Angeles Clippers to the LA Clippers. The abbreviation and ‘team_id’ didn’t change, however. No big deal. There’s a similar situation with Washington.

For a team move like the Thunder, you have to keep track of the abbreviation change. Although the ‘team_id’ stays the same, the abbreviation changes from ‘SEA’ to ‘OKC’, reflecting the team’s move from Seattle to Oklahoma City in 2008. There’s a similar situation with the Grizzlies and the Nets.

The most complicated situation is with Charlotte and New Orleans.

Hornets and Pelicans

As described in detail here, the Charlotte Hornets moved to New Orleans in 2002. The Charlotte Bobcats joined the NBA as an expansion team in 2004. Prior to the 2014-15 season, the Charlotte Bobcats and the New Orleans Hornets agreed to a deal whereby New Orleans would become the Pelicans and the Hornets name and history would revert to Charlotte. This explains why the abbreviation ‘CHA’ is used for both the Charlotte Bobcats and the Charlotte Hornets, in addition to the ‘CHH’ abbreviation for the other Charlotte Hornets (the team that moved in 2002 and ultimately became the Pelicans).

Another complication is that, because of Hurricane Katrina, the New Orleans Hornets had to relocate to Oklahoma City for much of the 2005-06 and 2006-07 seasons. That’s why there is an abbreviation ‘NOK’ to represent the games that were played under the label New Orleans/Oklahoma City Hornets.

Let’s sort this out and make sure we understand this in detail.

First, let’s build a table with the current 30 NBA teams, so we can keep track of historical abbreviations and how they map to current teams. Remember that the ‘team_id’ is meant to be a unique identifier for each team, even if the abbreviation or team name changes.

def current_teams(df):
    """Team abbreviations for the most recent completed NBA season."""
    current_teams = (
        df.loc[
            (df['season'] == completed_nba_season()) & (df['season_type'] == 'regular'),
            ['team_h', 'team_id_h']
        ]
        .drop_duplicates()
        .sort_values(['team_id_h'])
        .rename(
            columns={
                'team_h': 'curr_team',
                'team_id_h': 'team_id',
            }
        )
        .set_index('team_id')
    )
    return current_teams
len(current_teams(matchups))
30
teams = unique_teams(matchups).merge(current_teams(matchups), left_index=True, right_index=True)

Let’s build a table of any teams that have had a name change or an abbreviation change.

team_changes = teams[['team', 'curr_team']].groupby('curr_team').filter(lambda team: len(team) > 1)
team_changes
team curr_team
team_id
1610612740 NOK NOP
1610612740 NOP NOP
1610612740 NOH NOP
1610612746 LAC LAC
1610612746 LAC LAC
1610612751 NJN BKN
1610612751 BKN BKN
1610612760 OKC OKC
1610612760 SEA OKC
1610612763 MEM MEM
1610612763 VAN MEM
1610612764 WAS WAS
1610612764 WAS WAS
1610612766 CHA CHA
1610612766 CHA CHA
1610612766 CHH CHA

Now, let’s look at how many games, per season, were played by various Charlotte and New Orleans franchises.

regular_season_summary = (
    matchups.loc[
        matchups['season_type'] == 'regular',
        ['season', 'team_h', 'team_a']
    ].pivot_table(
        index='season',
        columns='team_h',
        values='team_a',
        aggfunc='count',
        fill_value=0,
        margins=True,
    )
)
regular_season_summary.loc[:, ['CHA', 'CHH', 'NOH', 'NOK', 'NOP']]
team_h CHA CHH NOH NOK NOP
season
1996-97 0 41 0 0 0
1997-98 0 41 0 0 0
1998-99 0 25 0 0 0
1999-00 0 41 0 0 0
2000-01 0 41 0 0 0
2001-02 0 41 0 0 0
2002-03 0 0 41 0 0
2003-04 0 0 41 0 0
2004-05 41 0 41 0 0
2005-06 41 0 0 41 0
2006-07 41 0 0 41 0
2007-08 41 0 41 0 0
2008-09 41 0 41 0 0
2009-10 41 0 41 0 0
2010-11 41 0 41 0 0
2011-12 33 0 33 0 0
2012-13 41 0 41 0 0
2013-14 41 0 0 0 41
2014-15 41 0 0 0 41
2015-16 41 0 0 0 41
2016-17 41 0 0 0 41
All 525 230 361 82 164

This table makes it clear that ‘CHH’ is the original Charlotte Hornets, who became the New Orleans Hornets (and the New Orleans/Oklahoma City Hornets), before becoming the New Orleans Pelicans (‘NOP’). The abbreviation ‘CHA’ refers to the Bobcats, who then became the new Charlotte Hornets in the 2014-15 season.

“Fixing” the Team Data

What might surprise you about this, is that the NBA team data are wrong!

If you look at the team table above, you’ll see that ‘CHH’ has the same ‘team_id’ as ‘CHA’. Now, you may think this is correct, since the Bobcats and the New Orleans Hornets did a deal returning the history of the original Charlotte Hornets (‘CHH’) to Charlotte. That’s why ‘CHH’ and ‘CHA’ have the same ‘team_id’.

On the other hand, if your goal is to track the performance of a group of players as their franchise changed cities, this is incorrect. The guys who played on the ‘CHH’ squad in the 2001-02 season should be connected with the ‘NOH’ squad in the 2002-03 season.

The NBA team data as recorded don’t reflect that.

To keep track of this, we will do something that you should never do lightly. We are going to override our team data table to make sure the ‘CHH’ games are grouped with the current Pelicans franchise. We are effectively going to undo the Charlotte/New Orleans deal to transfer the Hornets history.

teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'
teams.loc[teams['team'].isin(['CHA', 'CHH', 'NOH', 'NOK', 'NOP']), :]
team team_name curr_team
team_id
1610612740 NOK New Orleans Hornets NOP
1610612740 NOP New Orleans Pelicans NOP
1610612740 NOH New Orleans Hornets NOP
1610612766 CHA Charlotte Bobcats CHA
1610612766 CHA Charlotte Hornets CHA
1610612766 CHH Charlotte Hornets NOP

Of course, we’re not overriding the raw data. But it’s important to keep track anytime you change data.

Let’s write a function to record what we’re doing.

def team_info(df):
    """DataFrame of NBA team IDs, unadjusted abbreviations, names and adjusted abbrevations."""
    teams = unique_teams(df).merge(current_teams(df), left_index=True, right_index=True)
    teams.loc[teams['team'] == 'CHH', 'curr_team'] = 'NOP'  # OVERRIDING Hornets/Pelicans
    return teams

Lastly, we will write a function to exclude non-NBA matchups. Again, these were pre-season exhibition-style games. To exclude those games all we need to do is filter out any teams whose ‘team_id’ isn’t one of the current 30 NBA teams.

def exclude_non_nba_matchups(df):
    """Filter out matchups involving non-NBA teams."""
    nba_teams = current_teams(df)
    non_nba_matchups = list(
        df.loc[~df['team_id_h'].isin(nba_teams.index)].index
    )
    non_nba_matchups.extend(list(
        df.loc[~df['team_id_a'].isin(nba_teams.index)].index)
    )
    return df.loc[~df.index.isin(non_nba_matchups), :]

Let’s test this to see how many games get filtered out.

matchups.shape
(26812, 40)
exclude_non_nba_matchups(matchups).shape
(26787, 40)

The function filtered out 25 games.

Let’s put all these team cleanup steps together. First, we will write a function to give us a mapping between the historical team abbreviation, and the “correct” current team abbreviation. In this case, “correct” means adjusted for the Hornets/Pelicans issue.

def team_mapping(df):
    """Dict mapping historical NBA team abbreviation to abbreviation of current franchise."""
    teams = team_info(df) # this creates the lookup table with Hornets/Pelicans issue corrected
    list_of = teams.to_dict(orient='list')
    return dict(zip(list_of['team'], list_of['curr_team']))
def update_matchups(df):
    """Prepare matchups for use in analytics."""
    nba_teams = current_teams(df)
    df = exclude_non_nba_matchups(df) # filters any matchups that include non-NBA teams
    abbr_to_current = team_mapping(df)
    home_map = df['team_h'].map(abbr_to_current).rename('team_curr_h')
    away_map = df['team_a'].map(abbr_to_current).rename('team_curr_a')
    # Add new columns for current team abbreviations 
    df = pd.concat([df.copy(), home_map, away_map], axis='columns').reset_index()
    df['team_curr_h'] = df['team_curr_h'].astype('category')
    df['team_curr_a'] = df['team_curr_a'].astype('category')
    return df.set_index('game_id')

Notice that the above function doesn’t actually replace the team abbreviations in our matchups table. Rather, it creates two new columns for the current teams (home and away). That way, we can see what the original information was and make sure it maps correctly in our analysis. This is safer than just overwriting the old team abbreviations.

We will use the current team abbreviations in building our matchup prediction models. This will allow us to track changes in team quality over time without getting confused by name changes or franchise moves.
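The mechanism behind the new columns is just pd.Series.map() with a dict. Here is a toy sketch using a few of the historical-to-current mappings discussed above (a hand-built dict, not the full team_mapping() output):

```python
import pandas as pd

# Hand-built subset of the historical -> current abbreviation mapping
abbr_to_current = {'SEA': 'OKC', 'VAN': 'MEM', 'CHH': 'NOP', 'BOS': 'BOS'}

df = pd.DataFrame({'team_h': ['SEA', 'VAN', 'BOS']})
# The original column is kept; the mapped column is added alongside it
df['team_curr_h'] = df['team_h'].map(abbr_to_current).astype('category')
print(df)
```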

One last step is to write a function that will save our final, processed DataFrame for future analysis. This DataFrame is relatively large, and we’ve put some effort into formatting it. So, instead of using CSV, we’ll use another built-in format that pandas can write for us: Python’s pickle format. Pickle is an efficient binary format that will store all the information in our DataFrame, including the data types we set.

As noted in the pickle documentation, you should be very careful about unpickling an object that you obtained from the Internet or via email. Only unpickle objects that you have previously pickled yourself or you obtained from somebody you really, really trust. By the way, it turns out that hackers have even figured out ways to inject malware into CSV files, so the general and best advice is: always be very careful opening any type of file that you’ve downloaded or received by email!

def save_matchups(df, output_dir):
    """Save pickle file of NBA matchups prepared for analytics."""
    seasons = list(df['season'].unique())
    start_season = min(seasons).replace('-', '_')
    end_season = max(seasons).replace('-', '_')
    PKL_TEMPLATE = 'stats_nba_com-matchups-{start_season}-{end_season}'
    pklfilename = (
        PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
        '.pkl'
    )
    pklfile = output_dir.joinpath(pklfilename)
    if pklfile.exists():
        timestamp = str(datetime.now().strftime("%Y-%m-%d-%H-%M"))
        backupfilename = (
            PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +
            '.bak.{timestamp}'.format(timestamp=timestamp) +
            '.pkl'
        )
        backupfile = output_dir.joinpath(backupfilename)
        pklfile.rename(backupfile)
    df.to_pickle(pklfile)
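Reading the matchups back later is a single call to pd.read_pickle(). Here’s a minimal round-trip sketch (on a toy DataFrame, not the real one) showing that, unlike CSV, pickle preserves the dtypes we set:

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

# Round-trip demo: pickle preserves dtypes (including categoricals),
# which CSV would flatten back to plain strings.
df = pd.DataFrame({
    'won': pd.Categorical(['H', 'A', 'H']),
    'pts_h': [101, 95, 110],
})
with TemporaryDirectory() as tmp:
    pklfile = Path(tmp) / 'toy_matchups.pkl'
    df.to_pickle(pklfile)
    restored = pd.read_pickle(pklfile)
print(restored['won'].dtype)  # category
```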

Finishing Up

Now we can finally put everything together into one master function, which reads all the matchups since 1996 and saves them to a pickle file, ready for later analysis.

def nba_stats_matchups(data_dir=None, output_dir=None, overwrite=False):
    """Get and prepare stats.nba.com matchups for all completed seasons since 1996."""
    df = completed_nba_season_matchups(data_dir=data_dir, overwrite=overwrite)
    df = update_matchups(df)
    if output_dir:
        save_matchups(df, output_dir)
    return df


matchups = nba_stats_matchups(data_dir=DATA_DIR, output_dir=OUTPUT_DIR, overwrite=False)

To conclude the data gathering, here’s a summary of all the information we’ve obtained.

matchups.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26787 entries, 29601187 to 11600001
Data columns (total 42 columns):
season         26787 non-null category
season_type    26787 non-null category
date           26787 non-null datetime64[ns]
won            26787 non-null category
team_h         26787 non-null category
team_a         26787 non-null category
pts_h          26787 non-null int64
pts_a          26787 non-null int64
min_h          26787 non-null int64
min_a          26787 non-null int64
fgm_h          26787 non-null int64
fgm_a          26787 non-null int64
fga_h          26787 non-null int64
fga_a          26787 non-null int64
fg3m_h         26787 non-null int64
fg3m_a         26787 non-null int64
fg3a_h         26787 non-null int64
fg3a_a         26787 non-null int64
ftm_h          26787 non-null int64
ftm_a          26787 non-null int64
fta_h          26787 non-null int64
fta_a          26787 non-null int64
oreb_h         26787 non-null int64
oreb_a         26787 non-null int64
dreb_h         26787 non-null int64
dreb_a         26787 non-null int64
ast_h          26787 non-null int64
ast_a          26787 non-null int64
tov_h          26787 non-null int64
tov_a          26787 non-null int64
stl_h          26787 non-null int64
stl_a          26787 non-null int64
blk_h          26787 non-null int64
blk_a          26787 non-null int64
pf_h           26787 non-null int64
pf_a           26787 non-null int64
team_id_h      26787 non-null int64
team_id_a      26787 non-null int64
team_name_h    26787 non-null object
team_name_a    26787 non-null object
team_curr_h    26787 non-null category
team_curr_a    26787 non-null category
dtypes: category(7), datetime64[ns](1), int64(32), object(2)
memory usage: 7.5+ MB

We’ve accomplished a lot. Now that we have data, we’re ready to move on to the matchup predictions in future posts. To conclude, let’s look at one important fact we can immediately learn from the data.

A Quick Look at Home Court Advantage in the NBA

Let’s calculate what home court has been worth in terms of regular season win percentage between the 1996-97 and 2016-17 seasons.

matchups.loc[matchups['season_type'] == 'regular', 'won'].value_counts()
H    14830
A     9967
Name: won, dtype: int64
def count_home_away_wins(df, ha, season_type='regular'):
    return df.loc[df['season_type'] == season_type, 'won'].value_counts()[ha]
home_wins = count_home_away_wins(matchups, 'H')
away_wins = count_home_away_wins(matchups, 'A')
home_wins/(home_wins+away_wins)
0.59805621647779972
away_wins/(home_wins+away_wins)
0.40194378352220028

The home team has won about 60% of the time on average during the NBA regular season. Any prediction model is going to have to address this fact. We’ll look a lot more closely at this topic in future posts.
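A natural follow-up is whether that roughly 60% edge is stable season to season. Here is a sketch of a per-season breakdown, demonstrated on hand-made toy rows so it runs standalone; the same function would apply to the real matchups DataFrame built above:

```python
import pandas as pd

# Toy stand-in for the real matchups DataFrame (hypothetical results)
toy = pd.DataFrame({
    'season': ['1996-97'] * 4 + ['1997-98'] * 4,
    'season_type': ['regular'] * 8,
    'won': ['H', 'H', 'H', 'A', 'H', 'A', 'A', 'H'],
})

def home_win_pct_by_season(df):
    """Fraction of regular-season games won by the home team, per season."""
    regular = df[df['season_type'] == 'regular']
    return regular.groupby('season')['won'].apply(lambda s: (s == 'H').mean())

print(home_win_pct_by_season(toy))
```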