Scraping NBA team information from Wikipedia (Revisited)

map of NBA arenas

In this post, we are going to take another look at scraping NBA team information from Wikipedia. We will also see how to generate a map of NBA arena locations.

In an earlier post, we scraped the table using the Requests and BeautifulSoup packages.

Unfortunately, somebody modified the table in mid-December. As a result, the original code in that notebook no longer works.

Web content changes all the time, which will occasionally break web scraping code. This is particularly true of Wikipedia, where pages are open to edits by the community.

In this particular case, we could just move on and ignore the table changes. The NBA team data are basically unchanged. We could just use the saved CSV file from the prior scraping. That’s a major reason why you should always save the result of web scraping.

On the other hand, I think this is a good opportunity to try to scrape the table in a more robust and general way. You will also see examples of some useful pandas techniques to clean up the Wikipedia data. I think you will find these techniques useful in your own sports analytics projects.

I also wanted to do something useful with the Wikipedia information, beyond using it as an example to learn web scraping. Later in this post, we’ll discuss why arena location data can be useful in sports analytics. Drawing a map is a perfect way to learn how to start using geographic data in Python.

What Changed

The change the person made to the table was relatively simple. This person decided to group together certain cells in the table for the two New York teams (the Knicks and the Nets) and the LA teams (the Clippers and the Lakers). In particular, this person added HTML rowspan attributes in the City column, as well as in the Arena column for the LA teams.

Think of rowspan and colspan attributes in an HTML table as being similar to merged cells in a spreadsheet program like Microsoft Excel or Google Sheets.

You can look at the Wikipedia page prior to the table change here, and compare it to the current table. Try using your browser’s inspection tools to find the rowspan attributes that were added.

We are going to figure out how to read these merged cells, and “unspan” them to make the table layout simpler.

Scraping the Table, Again

We are going to use a general approach for scraping HTML tables. This approach will work for Wikipedia and other web pages, and will automatically handle the spanning that broke our original code.

The scraping code is part of the pracpred package, which is available on GitHub and on PyPI. You can install it in your sports analytics environment with the command pip install pracpred.

A Jupyter Notebook with the code and portions of the text of this post can be found here.

import pracpred.scrape as pps

As usual, we will do our data analysis using pandas. We will also use the Matplotlib Basemap package for plotting a map at the end of this notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib.colors import rgb2hex
%matplotlib notebook
from pathlib import Path
import warnings

You’ll notice we are going to use the warnings module from the Python standard library. This is purely cosmetic, because as you’ll see toward the end of this notebook, Basemap emits some warning messages that I want to suppress.
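If you haven’t used the warnings module before, the usual pattern is a context manager that silences warnings only for the code inside it. Here is a minimal sketch; noisy_plot_call is a made-up stand-in for a plotting call, not part of Basemap.

```python
import warnings


def noisy_plot_call():
    """Hypothetical stand-in for a plotting call that emits a warning."""
    warnings.warn('deprecated drawing option', DeprecationWarning)
    return 'map drawn'


# Warnings raised inside the block are ignored; the previous filter
# settings are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    result = noisy_plot_call()

print(result)
```

Keeping the filter inside the with block means warnings from the rest of the notebook still show up.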

PARENT_DIR = Path.cwd().parent

Getting the Raw HTML Table

If you inspect the HTML for the Wikipedia page, you’ll see that it has 5 HTML tables. We only want the one for the NBA teams. If you inspect this table in your browser, you’ll see that it has the HTML tag <table class="navbox wikitable">. We can specify this class to make sure we only get back the table we want.

URL = 'https://en.wikipedia.org/wiki/National_Basketball_Association'
NBA_TEAM_INFO = 'navbox wikitable'
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +
    'AppleWebKit/537.36 (KHTML, like Gecko) ' +
    'Chrome/61.0.3163.100 Safari/537.36'
)

REQUEST_HEADERS = {
    'user-agent': USER_AGENT,
}
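To see what filtering by class buys us, here is a small sketch that uses only the standard library’s html.parser. The sample HTML and the TableClassCounter class are invented for illustration; this is not how pracpred does it internally.

```python
from html.parser import HTMLParser


class TableClassCounter(HTMLParser):
    """Count <table> elements, and those whose class attribute matches exactly."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.total = 0
        self.matching = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'table':
            return
        self.total += 1
        # attrs is a list of (name, value) pairs for the opening tag
        if dict(attrs).get('class') == self.table_class:
            self.matching += 1


# Made-up page fragment with three tables, only one of which we want
SAMPLE_HTML = '''
<table class="infobox"><tr><td>League facts</td></tr></table>
<table class="navbox wikitable"><tr><td>Boston Celtics</td></tr></table>
<table class="wikitable sortable"><tr><td>Champions</td></tr></table>
'''

counter = TableClassCounter('navbox wikitable')
counter.feed(SAMPLE_HTML)
print(counter.total, counter.matching)
```

Only one of the three sample tables has the exact class string, which is the same reason the class filter narrows the Wikipedia page down to a single table.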

Now we can call the scraping code. You can find the source code on GitHub here. The package defines two Python classes, HTMLTables and HTMLTable. The HTMLTables class is basically a wrapper on top of Requests and BeautifulSoup. This class gets and stores the HTML for one or more tables from a URL. The HTMLTable class has the code to unspan the table and convert it to a pandas DataFrame.

tables = pps.HTMLTables(URL, table_class=NBA_TEAM_INFO, headers=REQUEST_HEADERS)
len(tables)
1
tables[0].shape
(33, 9)

We got back one table, which has 33 rows and 9 columns. Notice that the table dimensions are the largest number of rows in any column, and the largest number of columns in any row. This is the key to getting the unspanning to work. We want to view the table as a grid of cells to remove the spanning structure.
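Here is a rough sketch of the grid idea in plain Python. It is not the pracpred implementation, just an illustration under simple assumptions: each cell is given as a hypothetical (text, rowspan, colspan) tuple, and the repeat_span flag mirrors the option of the same name that we pass when converting the table to a DataFrame.

```python
def unspan(rows, repeat_span=True):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of text."""
    grid = {}  # maps (row, col) -> cell text, or None for a blanked span slot
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip grid slots already claimed by a span from an earlier row
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    is_origin = (dr == 0 and dc == 0)
                    keep = repeat_span or is_origin
                    grid[(r + dr, c + dc)] = text if keep else None
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The two LA teams share one City cell and one Arena cell (rowspan of 2)
rows = [
    [('Team', 1, 1), ('City', 1, 1), ('Arena', 1, 1)],
    [('Los Angeles Clippers', 1, 1), ('Los Angeles, CA', 2, 1), ('Staples Center', 2, 1)],
    [('Los Angeles Lakers', 1, 1)],
]

grid = unspan(rows)
sparse = unspan(rows, repeat_span=False)
for line in grid:
    print(line)
```

With repeat_span=True the Lakers row gets its own copy of the city and arena; with repeat_span=False those slots are left empty.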

Now, let’s convert the HTML table to a pandas DataFrame. For this particular table, we want to have any spanned cells repeat the values when we unspan the table.

raw = tables[0].to_df(repeat_span=True)
raw
0 1 2 3 4 5 6 7 8
0 Division Team City Arena Capacity Coordinates Founded Joined Head coach
1 Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference
2 Atlantic Boston Celtics Boston, MA TD Garden 18,624 42°21′59″N 71°03′44″W / 42.366303°N 71.06222… 1946 1946 Brad Stevens
3 Atlantic Brooklyn Nets New York City, NY Barclays Center 17,732 40°40′58″N 73°58′29″W / 40.68265°N 73.974689… 1967* 1976 Kenny Atkinson
4 Atlantic New York Knicks New York City, NY Madison Square Garden 19,812 40°45′02″N 73°59′37″W / 40.750556°N 73.99361… 1946 1946 Jeff Hornacek
5 Atlantic Philadelphia 76ers Philadelphia, PA Wells Fargo Center 21,600 39°54′04″N 75°10′19″W / 39.901111°N 75.17194… 1946* 1949 Brett Brown
6 Atlantic Toronto Raptors Toronto, ON Air Canada Centre 19,800 43°38′36″N 79°22′45″W / 43.643333°N 79.37916… 1995 1995 Dwane Casey
7 Central Chicago Bulls Chicago, IL United Center 20,917 41°52′50″N 87°40′27″W / 41.880556°N 87.67416… 1966 1966 Fred Hoiberg
8 Central Cleveland Cavaliers Cleveland, OH Quicken Loans Arena 20,562 41°29′47″N 81°41′17″W / 41.496389°N 81.68805… 1970 1970 Tyronn Lue
9 Central Detroit Pistons Detroit, MI Little Caesars Arena 20,491 42°41′49″N 83°14′44″W / 42.696944°N 83.24555… 1941* 1948 Stan Van Gundy
10 Central Indiana Pacers Indianapolis, IN Bankers Life Fieldhouse 17,923 39°45′50″N 86°09′20″W / 39.763889°N 86.15555… 1967 1976 Nate McMillan
11 Central Milwaukee Bucks Milwaukee, WI Bradley Center 18,717 43°02′37″N 87°55′01″W / 43.043611°N 87.91694… 1968 1968 Joe Prunty
12 Southeast Atlanta Hawks Atlanta, GA Philips Arena 15,711 33°45′26″N 84°23′47″W / 33.757222°N 84.39638… 1946* 1949 Mike Budenholzer
13 Southeast Charlotte Hornets Charlotte, NC Spectrum Center 19,077 35°13′30″N 80°50′21″W / 35.225°N 80.839167°W… 1988* 1988* Steve Clifford
14 Southeast Miami Heat Miami, FL American Airlines Arena 19,600 25°46′53″N 80°11′17″W / 25.781389°N 80.18805… 1988 1988 Erik Spoelstra
15 Southeast Orlando Magic Orlando, FL Amway Center 18,846 28°32′21″N 81°23′01″W / 28.539167°N 81.38361… 1989 1989 Frank Vogel
16 Southeast Washington Wizards Washington, D.C. Capital One Arena 20,356 38°53′53″N 77°01′15″W / 38.898056°N 77.02083… 1961* 1961* Scott Brooks
17 Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference
18 Northwest Denver Nuggets Denver, CO Pepsi Center 19,520 39°44′55″N 105°00′27″W / 39.748611°N 105.007… 1967 1976 Michael Malone
19 Northwest Minnesota Timberwolves Minneapolis, MN Target Center 19,356 44°58′46″N 93°16′34″W / 44.979444°N 93.27611… 1989 1989 Tom Thibodeau
20 Northwest Oklahoma City Thunder Oklahoma City, OK Chesapeake Energy Arena 18,203 35°27′48″N 97°30′54″W / 35.463333°N 97.515°W… 1967* 1967* Billy Donovan
21 Northwest Portland Trail Blazers Portland, OR Moda Center 19,441 45°31′54″N 122°40′00″W / 45.531667°N 122.666… 1970 1970 Terry Stotts
22 Northwest Utah Jazz Salt Lake City, UT Vivint Smart Home Arena 19,911 40°46′06″N 111°54′04″W / 40.768333°N 111.901… 1974* 1974* Quin Snyder
23 Pacific Golden State Warriors Oakland, CA Oracle Arena 19,596 37°45′01″N 122°12′11″W / 37.750278°N 122.203… 1946* 1946* Steve Kerr
24 Pacific Los Angeles Clippers Los Angeles, CA Staples Center 19,060 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1970* 1970* Doc Rivers
25 Pacific Los Angeles Lakers Los Angeles, CA Staples Center 18,997 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1947* 1948 Luke Walton
26 Pacific Phoenix Suns Phoenix, AZ Talking Stick Resort Arena 18,055 33°26′45″N 112°04′17″W / 33.445833°N 112.071… 1968 1968 Jay Triano
27 Pacific Sacramento Kings Sacramento, CA Golden 1 Center 17,500 38°38′57″N 121°31′05″W / 38.649167°N 121.518… 1923* 1948 Dave Joerger
28 Southwest Dallas Mavericks Dallas, TX American Airlines Center 19,200 32°47′26″N 96°48′37″W / 32.790556°N 96.81027… 1980 1980 Rick Carlisle
29 Southwest Houston Rockets Houston, TX Toyota Center 18,055 29°45′03″N 95°21′44″W / 29.750833°N 95.36222… 1967* 1967* Mike D’Antoni
30 Southwest Memphis Grizzlies Memphis, TN FedExForum 18,119 35°08′18″N 90°03′02″W / 35.138333°N 90.05055… 1995* 1995* J. B. Bickerstaff
31 Southwest New Orleans Pelicans New Orleans, LA Smoothie King Center 16,867 29°56′56″N 90°04′55″W / 29.948889°N 90.08194… 2002* 2002* Alvin Gentry
32 Southwest San Antonio Spurs San Antonio, TX AT&T Center 18,418 29°25′37″N 98°26′15″W / 29.426944°N 98.4375°… 1967* 1976 Gregg Popovich

Cleaning Up the Table

Now let’s clean up the raw information in the table.

Column Headers

First, notice that our generic scraping function doesn’t know anything about what columns are in the table. We need to create useful column headers.

def setup_columns(raw):
    df = raw.copy()
    df.columns = df.loc[0, :]
    return df.drop(df.index[0])
df = setup_columns(raw)
df
Division Team City Arena Capacity Coordinates Founded Joined Head coach
1 Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference Eastern Conference
2 Atlantic Boston Celtics Boston, MA TD Garden 18,624 42°21′59″N 71°03′44″W / 42.366303°N 71.06222… 1946 1946 Brad Stevens
3 Atlantic Brooklyn Nets New York City, NY Barclays Center 17,732 40°40′58″N 73°58′29″W / 40.68265°N 73.974689… 1967* 1976 Kenny Atkinson
4 Atlantic New York Knicks New York City, NY Madison Square Garden 19,812 40°45′02″N 73°59′37″W / 40.750556°N 73.99361… 1946 1946 Jeff Hornacek
5 Atlantic Philadelphia 76ers Philadelphia, PA Wells Fargo Center 21,600 39°54′04″N 75°10′19″W / 39.901111°N 75.17194… 1946* 1949 Brett Brown
6 Atlantic Toronto Raptors Toronto, ON Air Canada Centre 19,800 43°38′36″N 79°22′45″W / 43.643333°N 79.37916… 1995 1995 Dwane Casey
7 Central Chicago Bulls Chicago, IL United Center 20,917 41°52′50″N 87°40′27″W / 41.880556°N 87.67416… 1966 1966 Fred Hoiberg
8 Central Cleveland Cavaliers Cleveland, OH Quicken Loans Arena 20,562 41°29′47″N 81°41′17″W / 41.496389°N 81.68805… 1970 1970 Tyronn Lue
9 Central Detroit Pistons Detroit, MI Little Caesars Arena 20,491 42°41′49″N 83°14′44″W / 42.696944°N 83.24555… 1941* 1948 Stan Van Gundy
10 Central Indiana Pacers Indianapolis, IN Bankers Life Fieldhouse 17,923 39°45′50″N 86°09′20″W / 39.763889°N 86.15555… 1967 1976 Nate McMillan
11 Central Milwaukee Bucks Milwaukee, WI Bradley Center 18,717 43°02′37″N 87°55′01″W / 43.043611°N 87.91694… 1968 1968 Joe Prunty
12 Southeast Atlanta Hawks Atlanta, GA Philips Arena 15,711 33°45′26″N 84°23′47″W / 33.757222°N 84.39638… 1946* 1949 Mike Budenholzer
13 Southeast Charlotte Hornets Charlotte, NC Spectrum Center 19,077 35°13′30″N 80°50′21″W / 35.225°N 80.839167°W… 1988* 1988* Steve Clifford
14 Southeast Miami Heat Miami, FL American Airlines Arena 19,600 25°46′53″N 80°11′17″W / 25.781389°N 80.18805… 1988 1988 Erik Spoelstra
15 Southeast Orlando Magic Orlando, FL Amway Center 18,846 28°32′21″N 81°23′01″W / 28.539167°N 81.38361… 1989 1989 Frank Vogel
16 Southeast Washington Wizards Washington, D.C. Capital One Arena 20,356 38°53′53″N 77°01′15″W / 38.898056°N 77.02083… 1961* 1961* Scott Brooks
17 Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference Western Conference
18 Northwest Denver Nuggets Denver, CO Pepsi Center 19,520 39°44′55″N 105°00′27″W / 39.748611°N 105.007… 1967 1976 Michael Malone
19 Northwest Minnesota Timberwolves Minneapolis, MN Target Center 19,356 44°58′46″N 93°16′34″W / 44.979444°N 93.27611… 1989 1989 Tom Thibodeau
20 Northwest Oklahoma City Thunder Oklahoma City, OK Chesapeake Energy Arena 18,203 35°27′48″N 97°30′54″W / 35.463333°N 97.515°W… 1967* 1967* Billy Donovan
21 Northwest Portland Trail Blazers Portland, OR Moda Center 19,441 45°31′54″N 122°40′00″W / 45.531667°N 122.666… 1970 1970 Terry Stotts
22 Northwest Utah Jazz Salt Lake City, UT Vivint Smart Home Arena 19,911 40°46′06″N 111°54′04″W / 40.768333°N 111.901… 1974* 1974* Quin Snyder
23 Pacific Golden State Warriors Oakland, CA Oracle Arena 19,596 37°45′01″N 122°12′11″W / 37.750278°N 122.203… 1946* 1946* Steve Kerr
24 Pacific Los Angeles Clippers Los Angeles, CA Staples Center 19,060 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1970* 1970* Doc Rivers
25 Pacific Los Angeles Lakers Los Angeles, CA Staples Center 18,997 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1947* 1948 Luke Walton
26 Pacific Phoenix Suns Phoenix, AZ Talking Stick Resort Arena 18,055 33°26′45″N 112°04′17″W / 33.445833°N 112.071… 1968 1968 Jay Triano
27 Pacific Sacramento Kings Sacramento, CA Golden 1 Center 17,500 38°38′57″N 121°31′05″W / 38.649167°N 121.518… 1923* 1948 Dave Joerger
28 Southwest Dallas Mavericks Dallas, TX American Airlines Center 19,200 32°47′26″N 96°48′37″W / 32.790556°N 96.81027… 1980 1980 Rick Carlisle
29 Southwest Houston Rockets Houston, TX Toyota Center 18,055 29°45′03″N 95°21′44″W / 29.750833°N 95.36222… 1967* 1967* Mike D’Antoni
30 Southwest Memphis Grizzlies Memphis, TN FedExForum 18,119 35°08′18″N 90°03′02″W / 35.138333°N 90.05055… 1995* 1995* J. B. Bickerstaff
31 Southwest New Orleans Pelicans New Orleans, LA Smoothie King Center 16,867 29°56′56″N 90°04′55″W / 29.948889°N 90.08194… 2002* 2002* Alvin Gentry
32 Southwest San Antonio Spurs San Antonio, TX AT&T Center 18,418 29°25′37″N 98°26′15″W / 29.426944°N 98.4375°… 1967* 1976 Gregg Popovich

NBA Conference Information

Next, notice that the Eastern Conference and Western Conference labels each repeat across an entire row. What we want is to remove those rows, and create a new column showing the conference for each team.

def cleanup_nba_conferences(df):
    df['temporary'] = df['Division']
    df = df.set_index('temporary')
    eastern = df.index.get_loc('Eastern Conference')
    western = df.index.get_loc('Western Conference')
    df.loc[eastern+1:western, 'Conference'] = 'Eastern'
    df.loc[western+1:, 'Conference'] = 'Western'
    df = df.drop(df.index[eastern]).drop(df.index[western])
    df['Conference'] = df['Conference'].astype('category')
    df['Division'] = df['Division'].astype('category')
    return df.reset_index(drop=True)
df = cleanup_nba_conferences(df)
df
Division Team City Arena Capacity Coordinates Founded Joined Head coach Conference
0 Atlantic Boston Celtics Boston, MA TD Garden 18,624 42°21′59″N 71°03′44″W / 42.366303°N 71.06222… 1946 1946 Brad Stevens Eastern
1 Atlantic Brooklyn Nets New York City, NY Barclays Center 17,732 40°40′58″N 73°58′29″W / 40.68265°N 73.974689… 1967* 1976 Kenny Atkinson Eastern
2 Atlantic New York Knicks New York City, NY Madison Square Garden 19,812 40°45′02″N 73°59′37″W / 40.750556°N 73.99361… 1946 1946 Jeff Hornacek Eastern
3 Atlantic Philadelphia 76ers Philadelphia, PA Wells Fargo Center 21,600 39°54′04″N 75°10′19″W / 39.901111°N 75.17194… 1946* 1949 Brett Brown Eastern
4 Atlantic Toronto Raptors Toronto, ON Air Canada Centre 19,800 43°38′36″N 79°22′45″W / 43.643333°N 79.37916… 1995 1995 Dwane Casey Eastern
5 Central Chicago Bulls Chicago, IL United Center 20,917 41°52′50″N 87°40′27″W / 41.880556°N 87.67416… 1966 1966 Fred Hoiberg Eastern
6 Central Cleveland Cavaliers Cleveland, OH Quicken Loans Arena 20,562 41°29′47″N 81°41′17″W / 41.496389°N 81.68805… 1970 1970 Tyronn Lue Eastern
7 Central Detroit Pistons Detroit, MI Little Caesars Arena 20,491 42°41′49″N 83°14′44″W / 42.696944°N 83.24555… 1941* 1948 Stan Van Gundy Eastern
8 Central Indiana Pacers Indianapolis, IN Bankers Life Fieldhouse 17,923 39°45′50″N 86°09′20″W / 39.763889°N 86.15555… 1967 1976 Nate McMillan Eastern
9 Central Milwaukee Bucks Milwaukee, WI Bradley Center 18,717 43°02′37″N 87°55′01″W / 43.043611°N 87.91694… 1968 1968 Joe Prunty Eastern
10 Southeast Atlanta Hawks Atlanta, GA Philips Arena 15,711 33°45′26″N 84°23′47″W / 33.757222°N 84.39638… 1946* 1949 Mike Budenholzer Eastern
11 Southeast Charlotte Hornets Charlotte, NC Spectrum Center 19,077 35°13′30″N 80°50′21″W / 35.225°N 80.839167°W… 1988* 1988* Steve Clifford Eastern
12 Southeast Miami Heat Miami, FL American Airlines Arena 19,600 25°46′53″N 80°11′17″W / 25.781389°N 80.18805… 1988 1988 Erik Spoelstra Eastern
13 Southeast Orlando Magic Orlando, FL Amway Center 18,846 28°32′21″N 81°23′01″W / 28.539167°N 81.38361… 1989 1989 Frank Vogel Eastern
14 Southeast Washington Wizards Washington, D.C. Capital One Arena 20,356 38°53′53″N 77°01′15″W / 38.898056°N 77.02083… 1961* 1961* Scott Brooks Eastern
15 Northwest Denver Nuggets Denver, CO Pepsi Center 19,520 39°44′55″N 105°00′27″W / 39.748611°N 105.007… 1967 1976 Michael Malone Western
16 Northwest Minnesota Timberwolves Minneapolis, MN Target Center 19,356 44°58′46″N 93°16′34″W / 44.979444°N 93.27611… 1989 1989 Tom Thibodeau Western
17 Northwest Oklahoma City Thunder Oklahoma City, OK Chesapeake Energy Arena 18,203 35°27′48″N 97°30′54″W / 35.463333°N 97.515°W… 1967* 1967* Billy Donovan Western
18 Northwest Portland Trail Blazers Portland, OR Moda Center 19,441 45°31′54″N 122°40′00″W / 45.531667°N 122.666… 1970 1970 Terry Stotts Western
19 Northwest Utah Jazz Salt Lake City, UT Vivint Smart Home Arena 19,911 40°46′06″N 111°54′04″W / 40.768333°N 111.901… 1974* 1974* Quin Snyder Western
20 Pacific Golden State Warriors Oakland, CA Oracle Arena 19,596 37°45′01″N 122°12′11″W / 37.750278°N 122.203… 1946* 1946* Steve Kerr Western
21 Pacific Los Angeles Clippers Los Angeles, CA Staples Center 19,060 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1970* 1970* Doc Rivers Western
22 Pacific Los Angeles Lakers Los Angeles, CA Staples Center 18,997 34°02′35″N 118°16′02″W / 34.043056°N 118.267… 1947* 1948 Luke Walton Western
23 Pacific Phoenix Suns Phoenix, AZ Talking Stick Resort Arena 18,055 33°26′45″N 112°04′17″W / 33.445833°N 112.071… 1968 1968 Jay Triano Western
24 Pacific Sacramento Kings Sacramento, CA Golden 1 Center 17,500 38°38′57″N 121°31′05″W / 38.649167°N 121.518… 1923* 1948 Dave Joerger Western
25 Southwest Dallas Mavericks Dallas, TX American Airlines Center 19,200 32°47′26″N 96°48′37″W / 32.790556°N 96.81027… 1980 1980 Rick Carlisle Western
26 Southwest Houston Rockets Houston, TX Toyota Center 18,055 29°45′03″N 95°21′44″W / 29.750833°N 95.36222… 1967* 1967* Mike D’Antoni Western
27 Southwest Memphis Grizzlies Memphis, TN FedExForum 18,119 35°08′18″N 90°03′02″W / 35.138333°N 90.05055… 1995* 1995* J. B. Bickerstaff Western
28 Southwest New Orleans Pelicans New Orleans, LA Smoothie King Center 16,867 29°56′56″N 90°04′55″W / 29.948889°N 90.08194… 2002* 2002* Alvin Gentry Western
29 Southwest San Antonio Spurs San Antonio, TX AT&T Center 18,418 29°25′37″N 98°26′15″W / 29.426944°N 98.4375°… 1967* 1976 Gregg Popovich Western
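One caveat: cleanup_nba_conferences passes integer slices to .loc on a text index, which older pandas versions treated as positions. As far as I can tell, pandas 2.0 and later raise a TypeError for this. If you hit that error, a sketch like the following avoids positions entirely by using a boolean mask and a forward fill. The function name assign_conferences and the toy data are made up for this example.

```python
import pandas as pd


def assign_conferences(df):
    """Hypothetical alternative: label each team row with its conference."""
    df = df.copy()
    # Conference banner rows repeat "<Name> Conference" in every column
    is_banner = df['Division'].str.endswith('Conference')
    # Keep the banner text on banner rows only, then carry it down to team rows
    banner = df['Division'].where(is_banner).ffill()
    df['Conference'] = banner.str.replace(' Conference', '', regex=False)
    # Drop the banner rows now that their information lives in a column
    df = df.loc[~is_banner].reset_index(drop=True)
    df['Conference'] = df['Conference'].astype('category')
    df['Division'] = df['Division'].astype('category')
    return df


toy = pd.DataFrame({
    'Division': ['Eastern Conference', 'Atlantic', 'Central',
                 'Western Conference', 'Pacific'],
    'Team': ['Eastern Conference', 'Boston Celtics', 'Chicago Bulls',
             'Western Conference', 'Phoenix Suns'],
})

result = assign_conferences(toy)
print(result)
```

This version also copes with any number of conference banner rows, since it never needs to know where they sit in the table.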

City and Postal Code

Next, we want to split the city and the postal code into two separate columns.

To do this, we need to use pandas string-handling methods.

def split_city_postal(df):
    df['Postal'] = df['City'].str.rsplit(',', n=1).str.get(1).str.replace('.', '', regex=False).str.strip()
    df['City'] = df['City'].str.rsplit(',', n=1).str.get(0)
    return df
df = split_city_postal(df)
df.head()
Division Team City Arena Capacity Coordinates Founded Joined Head coach Conference Postal
0 Atlantic Boston Celtics Boston TD Garden 18,624 42°21′59″N 71°03′44″W / 42.366303°N 71.06222… 1946 1946 Brad Stevens Eastern MA
1 Atlantic Brooklyn Nets New York City Barclays Center 17,732 40°40′58″N 73°58′29″W / 40.68265°N 73.974689… 1967* 1976 Kenny Atkinson Eastern NY
2 Atlantic New York Knicks New York City Madison Square Garden 19,812 40°45′02″N 73°59′37″W / 40.750556°N 73.99361… 1946 1946 Jeff Hornacek Eastern NY
3 Atlantic Philadelphia 76ers Philadelphia Wells Fargo Center 21,600 39°54′04″N 75°10′19″W / 39.901111°N 75.17194… 1946* 1949 Brett Brown Eastern PA
4 Atlantic Toronto Raptors Toronto Air Canada Centre 19,800 43°38′36″N 79°22′45″W / 43.643333°N 79.37916… 1995 1995 Dwane Casey Eastern ON
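If you want to experiment with the string methods on their own, here is a small sketch on made-up input. It uses expand=True, an option of str.rsplit that returns the pieces as DataFrame columns, which is a slightly different route to the same result.

```python
import pandas as pd

cities = pd.Series(['Boston, MA', 'Washington, D.C.', 'Salt Lake City, UT'])

# Split once, from the right, so only the last comma separates the two parts
parts = cities.str.rsplit(',', n=1, expand=True)
city = parts[0].str.strip()
# regex=False makes the period a literal character rather than a wildcard
postal = parts[1].str.replace('.', '', regex=False).str.strip()

print(pd.DataFrame({'City': city, 'Postal': postal}))
```

Removing the periods is what turns D.C. into the two-letter code DC, so it lines up with the other postal abbreviations.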

Arena Latitude and Longitude

Lastly, we need to clean up the arena latitude and longitude. This is a little tricky, since there is a lot of content packed into the Coordinates column in the DataFrame. Let’s focus on one row to see what’s going on.

row = list(df.loc[df['Team'] == 'Boston Celtics', 'Coordinates'].str.split('/'))
row
[['42°21′59″N 71°03′44″W\ufeff ',
  ' \ufeff42.366303°N 71.062228°W\ufeff ',
  ' 42.366303; -71.062228\ufeff (Boston Celtics)']]

There are 3 elements per row, with different formats for the latitude and longitude. In case you were wondering, the \ufeff appearing in the text strings is a special Unicode character (the zero-width no-break space, which is also used as a byte order mark). We are going to ignore the first two elements and just get the third.

We need to split this third element into latitude and longitude by the semi-colon (;) and extract the numbers. Again, we will use pandas string-handling methods, along with regular expressions. Regular expressions are a very general way to find and extract text in many computer languages, including Python. In this particular case, the regular expression just gets numbers with a decimal point, potentially starting with a negative sign.

def get_arena_lat_lon(df):
    df['Coordinates'] = df['Coordinates'].str.split('/').str.get(2).str.split(';')
    df['Latitude'] = df['Coordinates'].str.get(0).astype(float)
    df['Longitude'] = df['Coordinates'].str.get(1).str.extract(r'(-?\d+\.\d+)', expand=False).astype(float)
    return df
df = get_arena_lat_lon(df)
df.head()
Division Team City Arena Capacity Coordinates Founded Joined Head coach Conference Postal Latitude Longitude
0 Atlantic Boston Celtics Boston TD Garden 18,624 [ 42.366303, -71.062228 (Boston Celtics)] 1946 1946 Brad Stevens Eastern MA 42.366303 -71.062228
1 Atlantic Brooklyn Nets New York City Barclays Center 17,732 [ 40.68265, -73.974689 (Brooklyn Nets)] 1967* 1976 Kenny Atkinson Eastern NY 40.682650 -73.974689
2 Atlantic New York Knicks New York City Madison Square Garden 19,812 [ 40.750556, -73.993611 (New York Knicks)] 1946 1946 Jeff Hornacek Eastern NY 40.750556 -73.993611
3 Atlantic Philadelphia 76ers Philadelphia Wells Fargo Center 21,600 [ 39.901111, -75.171944 (Philadelphia 76ers)] 1946* 1949 Brett Brown Eastern PA 39.901111 -75.171944
4 Atlantic Toronto Raptors Toronto Air Canada Centre 19,800 [ 43.643333, -79.379167 (Toronto Raptors)] 1995 1995 Dwane Casey Eastern ON 43.643333 -79.379167
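To see the coordinate parsing on a single string, here is a sketch using Python’s built-in re module instead of the pandas string methods. The pattern is the kind described above: an optional minus sign, digits, a decimal point, and more digits. The variable names are mine, and the input is the Boston Celtics value shown earlier.

```python
import re

raw = ('42°21′59″N 71°03′44″W\ufeff / \ufeff42.366303°N 71.062228°W\ufeff / '
       '42.366303; -71.062228\ufeff (Boston Celtics)')

# The third slash-separated element holds the signed decimal coordinates
decimal_part = raw.split('/')[2]
lat_text, lon_text = decimal_part.split(';')

# Optional minus sign, digits, a literal decimal point, then more digits
number = re.compile(r'-?\d+\.\d+')
latitude = float(number.search(lat_text).group())
longitude = float(number.search(lon_text).group())

print(latitude, longitude)
```

The regular expression is what lets us ignore the trailing \ufeff and the team name in parentheses after the longitude.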

Putting It All Together

Now we’ll just put all these steps into one function. This function will combine all the steps, do a few more simple cleanups, and drop columns that we don’t need at the end. We also want to save the final, cleaned-up results.

def wiki_teams_info(raw):
    df = setup_columns(raw)
    df = cleanup_nba_conferences(df)
    df = split_city_postal(df)
    df = get_arena_lat_lon(df)
    df['Capacity'] = df['Capacity'].str.replace(',', '').astype(int)
    df['Founded'] = df['Founded'].str.replace('*', '', regex=False).astype(int)
    df['Joined'] = df['Joined'].str.replace('*', '', regex=False).astype(int)
    cols = [
        'Team',
        'Conference',
        'Division',
        'City',
        'Postal',
        'Arena',
        'Capacity',
        'Latitude',
        'Longitude',
        'Founded',
        'Joined',
        'Head coach',
    ]
    return df[cols].reset_index(drop=True)
df = wiki_teams_info(raw)
df
Team Conference Division City Postal Arena Capacity Latitude Longitude Founded Joined Head coach
0 Boston Celtics Eastern Atlantic Boston MA TD Garden 18624 42.366303 -71.062228 1946 1946 Brad Stevens
1 Brooklyn Nets Eastern Atlantic New York City NY Barclays Center 17732 40.682650 -73.974689 1967 1976 Kenny Atkinson
2 New York Knicks Eastern Atlantic New York City NY Madison Square Garden 19812 40.750556 -73.993611 1946 1946 Jeff Hornacek
3 Philadelphia 76ers Eastern Atlantic Philadelphia PA Wells Fargo Center 21600 39.901111 -75.171944 1946 1949 Brett Brown
4 Toronto Raptors Eastern Atlantic Toronto ON Air Canada Centre 19800 43.643333 -79.379167 1995 1995 Dwane Casey
5 Chicago Bulls Eastern Central Chicago IL United Center 20917 41.880556 -87.674167 1966 1966 Fred Hoiberg
6 Cleveland Cavaliers Eastern Central Cleveland OH Quicken Loans Arena 20562 41.496389 -81.688056 1970 1970 Tyronn Lue
7 Detroit Pistons Eastern Central Detroit MI Little Caesars Arena 20491 42.696944 -83.245556 1941 1948 Stan Van Gundy
8 Indiana Pacers Eastern Central Indianapolis IN Bankers Life Fieldhouse 17923 39.763889 -86.155556 1967 1976 Nate McMillan
9 Milwaukee Bucks Eastern Central Milwaukee WI Bradley Center 18717 43.043611 -87.916944 1968 1968 Joe Prunty
10 Atlanta Hawks Eastern Southeast Atlanta GA Philips Arena 15711 33.757222 -84.396389 1946 1949 Mike Budenholzer
11 Charlotte Hornets Eastern Southeast Charlotte NC Spectrum Center 19077 35.225000 -80.839167 1988 1988 Steve Clifford
12 Miami Heat Eastern Southeast Miami FL American Airlines Arena 19600 25.781389 -80.188056 1988 1988 Erik Spoelstra
13 Orlando Magic Eastern Southeast Orlando FL Amway Center 18846 28.539167 -81.383611 1989 1989 Frank Vogel
14 Washington Wizards Eastern Southeast Washington DC Capital One Arena 20356 38.898056 -77.020833 1961 1961 Scott Brooks
15 Denver Nuggets Western Northwest Denver CO Pepsi Center 19520 39.748611 -105.007500 1967 1976 Michael Malone
16 Minnesota Timberwolves Western Northwest Minneapolis MN Target Center 19356 44.979444 -93.276111 1989 1989 Tom Thibodeau
17 Oklahoma City Thunder Western Northwest Oklahoma City OK Chesapeake Energy Arena 18203 35.463333 -97.515000 1967 1967 Billy Donovan
18 Portland Trail Blazers Western Northwest Portland OR Moda Center 19441 45.531667 -122.666667 1970 1970 Terry Stotts
19 Utah Jazz Western Northwest Salt Lake City UT Vivint Smart Home Arena 19911 40.768333 -111.901111 1974 1974 Quin Snyder
20 Golden State Warriors Western Pacific Oakland CA Oracle Arena 19596 37.750278 -122.203056 1946 1946 Steve Kerr
21 Los Angeles Clippers Western Pacific Los Angeles CA Staples Center 19060 34.043056 -118.267222 1970 1970 Doc Rivers
22 Los Angeles Lakers Western Pacific Los Angeles CA Staples Center 18997 34.043056 -118.267222 1947 1948 Luke Walton
23 Phoenix Suns Western Pacific Phoenix AZ Talking Stick Resort Arena 18055 33.445833 -112.071389 1968 1968 Jay Triano
24 Sacramento Kings Western Pacific Sacramento CA Golden 1 Center 17500 38.649167 -121.518056 1923 1948 Dave Joerger
25 Dallas Mavericks Western Southwest Dallas TX American Airlines Center 19200 32.790556 -96.810278 1980 1980 Rick Carlisle
26 Houston Rockets Western Southwest Houston TX Toyota Center 18055 29.750833 -95.362222 1967 1967 Mike D’Antoni
27 Memphis Grizzlies Western Southwest Memphis TN FedExForum 18119 35.138333 -90.050556 1995 1995 J. B. Bickerstaff
28 New Orleans Pelicans Western Southwest New Orleans LA Smoothie King Center 16867 29.948889 -90.081944 2002 2002 Alvin Gentry
29 San Antonio Spurs Western Southwest San Antonio TX AT&T Center 18418 29.426944 -98.437500 1967 1976 Gregg Popovich
OUTPUT_DIR = PARENT_DIR / 'data' / 'scraped'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
CSVFILE = OUTPUT_DIR.joinpath('wiki-nba_team_info.csv')
df.to_csv(CSVFILE, index=False)
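
When you load the saved file later, keep in mind that CSV stores only text, so the category columns come back as plain strings unless you ask for them. A minimal sketch, using an in-memory buffer and two made-up rows in place of the real file:

```python
import io

import pandas as pd

teams = pd.DataFrame({
    'Team': ['Boston Celtics', 'Denver Nuggets'],
    'Conference': pd.Categorical(['Eastern', 'Western']),
    'Capacity': [18624, 19520],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
teams.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to rebuild the categorical column; numbers are inferred as usual
restored = pd.read_csv(buffer, dtype={'Conference': 'category'})
print(restored.dtypes)
```

With the real file, you would pass the CSV path to read_csv in place of the buffer.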

You can use these HTML scraping tools and techniques in your own sports analytics projects. But we’re not done yet.

A Map of NBA Arenas

We haven’t done anything useful with the NBA team data from Wikipedia. One nice thing we can do is to draw a map of NBA arena locations using the latitude and longitude information.

There’s a more practical use for this arena data. Most serious strength-of-schedule analysis in the NBA looks at road games, rest, and distance traveled. In a future post, we’ll see how to incorporate this geographic information to estimate travel distance between games.
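
As a preview of that idea, one common way to estimate the distance between two arenas is the great-circle (haversine) formula. This is only a sketch of one possible approach, not necessarily the method a later post will use. The function name and the 6,371 km mean Earth radius are my assumptions, and the coordinates come from the cleaned-up table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant for this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# TD Garden (Boston) to Staples Center (Los Angeles)
distance = great_circle_km(42.366303, -71.062228, 34.043056, -118.267222)
print(round(distance), 'km')
```

A straight great-circle distance ignores actual flight routes, so treat it as a rough lower bound on travel.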

Another HTML Table to Scrape

We are going to use the Basemap package to draw a map of North America with NBA arenas. We are also going to fill in the U.S. states having NBA arenas, using a different color for each NBA Division. Sorry, Toronto and Washington fans. Any coloring for Washington, D.C. wouldn’t be visible anyway, and this example won’t fill in the province of Ontario.

In order to do this coloring, we need to use shapefiles. These files contain information about the shapes of various geographic features (in this case, U.S. states). We will overlay these shapes on our map filled with the correct color. The shapefiles we will use come from the U.S. Census Bureau.

In order to use these particular shapefiles, we need to be able to move between state names and postal abbreviations. Our arena data has only the postal abbreviations, and the shapefiles use state names.

There are plenty of ways to get this information (including typing it into a Python program yourself). However, since this post is about scraping HTML tables, we can use it as another opportunity to scrape a Wikipedia table.

Let’s scrape Wikipedia’s list of U.S. state abbreviations.

ABBR_URL = 'https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations'

We can use the same HTMLTables scraping class as before. In this case, we want to use a table class of 'sortable' to get the right table.

abbr_tables = pps.HTMLTables(ABBR_URL, headers=REQUEST_HEADERS, table_class='sortable')
len(abbr_tables)
1
abbr_df = abbr_tables[0].to_df()
abbr_df
0 1 2 3 4 5 6 7 8 9
0 Codes: ISO ISO 3166 codes (2-letter, 3-l… NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Name and status of region NaN ISO ANSI NaN USPS USCG GPO AP Other abbreviations
2 NaN NaN NaN
3 United States of America Federal state US USA 840 US 00 U.S. U.S. U.S.A.
4 Alabama State US-AL AL 01 AL AL Ala. Ala.
5 Alaska State US-AK AK 02 AK AK Alaska Alaska Alas.
6 Arizona State US-AZ AZ 04 AZ AZ Ariz. Ariz. Az.
7 Arkansas State US-AR AR 05 AR AR Ark. Ark.
8 California State US-CA CA 06 CA CF Calif. Calif. Ca., Cal.
9 Colorado State US-CO CO 08 CO CL Colo. Colo. Col.
10 Connecticut State US-CT CT 09 CT CT Conn. Conn. Ct.
11 Delaware State US-DE DE 10 DE DL Del. Del. De.
12 District of Columbia Federal district US-DC DC 11 DC DC D.C. D.C. Wash. D.C.
13 Florida State US-FL FL 12 FL FL Fla. Fla. Fl., Flor.
14 Georgia State US-GA GA 13 GA GA Ga. Ga.
15 Hawaii State US-HI HI 15 HI HA Hawaii Hawaii H.I.
16 Idaho State US-ID ID 16 ID ID Idaho Idaho Id., Ida.
17 Illinois State US-IL IL 17 IL IL Ill. Ill. Il., Ills., Ill’s
18 Indiana State US-IN IN 18 IN IN Ind. Ind. In.
19 Iowa State US-IA IA 19 IA IA Iowa Iowa Ia., Ioa.
20 Kansas State US-KS KS 20 KS KA Kans. Kan. Ks., Ka.
21 Kentucky State (Commonwealth) US-KY KY 21 KY KY Ky. Ky. Ken., Kent.
22 Louisiana State US-LA LA 22 LA LA La. La.
23 Maine State US-ME ME 23 ME ME Maine Maine Me.
24 Maryland State US-MD MD 24 MD MD Md. Md.
25 Massachusetts State (Commonwealth) US-MA MA 25 MA MS Mass. Mass.
26 Michigan State US-MI MI 26 MI MC Mich. Mich.
27 Minnesota State US-MN MN 27 MN MN Minn. Minn. Mn.
28 Mississippi State US-MS MS 28 MS MI Miss. Miss.
29 Missouri State US-MO MO 29 MO MO Mo. Mo.
51 Washington State US-WA WA 53 WA WN Wash. Wash. Wa., Wn.
52 West Virginia State US-WV WV 54 WV WV W. Va. W.Va. W.V., W. Virg.
53 Wisconsin State US-WI WI 55 WI WS Wis. Wis. Wi., Wisc.
54 Wyoming State US-WY WY 56 WY WY Wyo. Wyo. Wy.
55 American Samoa Insular area (Territory) AS ASM 016 US-AS AS 60 AS AS A.S.
56 Guam Insular area (Territory) GU GUM 316 US-GU GU 66 GU GU Guam
57 Northern Mariana Islands Insular area (Commonwealth) MP MNP 580 US-MP MP 69 MP CM M.P. CNMI
58 Puerto Rico Insular area (Territory) PR PRI 630 US-PR PR 72 PR PR P.R.
59 U.S. Virgin Islands Insular area (Territory) VI VIR 850 US-VI VI 78 VI VI V.I. U.S.V.I.
60 U.S. Minor Outlying Islands Insular areas UM UMI 581 US-UM UM 74
61 Baker Island island UM-81 81 XB
62 Howland Island island UM-84 84 XH
63 Jarvis Island island UM-86 86 XQ
64 Johnston Atoll atoll UM-67 67 XU
65 Kingman Reef atoll UM-89 89 XM
66 Midway Islands atoll UM-71 71 QM
67 Navassa Island island UM-76 76 XV
68 Palmyra Atoll atoll UM-95 95 XL
69 Wake Island atoll UM-79 79 QW
70 Micronesia Freely associated state FM FSM 583 FM 64 FM
71 Marshall Islands Freely associated state MH MHL 584 MH 68 MH
72 Palau Freely associated state PW PLW 585 PW 70 PW
73 U.S. Armed Forces – Americas US military mail code AA
74 U.S. Armed Forces – Europe US military mail code AE
75 U.S. Armed Forces – Pacific US military mail code AP
76 Northern Mariana Islands Obsolete postal code CM
77 Panama Canal Zone Obsolete postal code PZ PCZ 594 CZ
78 Nebraska Obsolete postal code NB
79 Philippine Islands Obsolete postal code PH PHL 608 PI
80 Trust Territory of the Pacific Islands Obsolete postal code PC PCI 582 TT

81 rows × 10 columns

Cleaning an Ugly Table

This is a relatively ugly table. Notice that many of the cells are blank. One of the reasons I wanted to use this example is to show how this general web scraping framework works, even for ugly tables.

In this case, we just want the name, status and USPS columns. We can also filter out any obsolete postal codes.

def usps_abbrs(raw):
    df = raw.drop(raw.index[:4]).reset_index(drop=True)
    df = df.iloc[:, [0, 1, 5]]
    df.columns = ['Name', 'Status', 'USPS']
    df = df.loc[(df['USPS'] != '') & (~df['Status'].str.contains('Obsolete')), ['Name', 'Status', 'USPS']]
    return df.reset_index(drop=True)
usps_df = usps_abbrs(abbr_df)
usps_df.tail(20)
Name Status USPS
42 Tennessee State TN
43 Texas State TX
44 Utah State UT
45 Vermont State VT
46 Virginia State (Commonwealth) VA
47 Washington State WA
48 West Virginia State WV
49 Wisconsin State WI
50 Wyoming State WY
51 American Samoa Insular area (Territory) AS
52 Guam Insular area (Territory) GU
53 Northern Mariana Islands Insular area (Commonwealth) MP
54 Puerto Rico Insular area (Territory) PR
55 U.S. Virgin Islands Insular area (Territory) VI
56 Micronesia Freely associated state FM
57 Marshall Islands Freely associated state MH
58 Palau Freely associated state PW
59 U.S. Armed Forces – Americas US military mail code AA
60 U.S. Armed Forces – Europe US military mail code AE
61 U.S. Armed Forces – Pacific US military mail code AP
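
One fragile spot in a filter like this is how blank cells are represented: depending on the scraper, they may come through as empty strings or as NaN. Here is a minimal sketch of a more defensive version, run on a small made-up frame rather than the real Wikipedia table. The helper name `clean_codes` is hypothetical, and the sketch assumes the columns have already been renamed to Name, Status and USPS.

```python
import pandas as pd

def clean_codes(df):
    """Keep rows with a non-blank USPS code whose status is not obsolete."""
    # Treat NaN and whitespace-only cells the same as empty strings
    usps = df['USPS'].fillna('').str.strip()
    # na=False treats a missing Status as "not obsolete" instead of propagating NaN
    obsolete = df['Status'].str.contains('Obsolete', na=False)
    return df.loc[(usps != '') & ~obsolete].reset_index(drop=True)

# Toy rows for illustration only (a subset of the kinds of rows in the table)
toy = pd.DataFrame({
    'Name': ['Alabama', 'Baker Island', 'Howland Island', 'Nebraska', 'Guam'],
    'Status': ['State', 'island', 'island', 'Obsolete postal code',
               'Insular area (Territory)'],
    'USPS': ['AL', '', None, 'NB', 'GU'],
})
cleaned = clean_codes(toy)
print(cleaned)
```

Only the Alabama and Guam rows should survive: the two islands have no USPS code, and the Nebraska row is marked obsolete.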

This simple table is just what we need. Now, we can build a function which will return the postal abbreviation given the state name.

def state_abbr_mapper(usps_df):
    name_usps = usps_df[['Name', 'USPS']].set_index('Name').to_dict(orient='index')
    def inner(name):
        return name_usps[name]['USPS']
    return inner
name2abbr = state_abbr_mapper(usps_df)
name2abbr('Alabama')
'AL'

Take another look at what this function does. The outer function builds a dictionary from the DataFrame and returns an inner function. This inner function “remembers” the dictionary that was built from the DataFrame passed in to the outer function (this pattern is called a closure). It will be very easy to use this simple function as a wrapper around the DataFrame in our map-drawing code below.
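
To see the closure pattern in isolation, here is a minimal sketch that uses a plain dictionary instead of a DataFrame. The helper `make_lookup` and its `default` parameter are hypothetical additions for illustration. Note that `state_abbr_mapper` above raises a `KeyError` for an unknown name, while this variant returns a default value.

```python
def make_lookup(pairs, default=None):
    """Return a function that maps a name to its abbreviation."""
    table = dict(pairs)  # captured by the inner function (a closure)

    def inner(name):
        # Fall back to `default` rather than raising KeyError
        return table.get(name, default)

    return inner

lookup = make_lookup([('Alabama', 'AL'), ('Alaska', 'AK')])
print(lookup('Alabama'))   # AL
print(lookup('Atlantis'))  # None
```

Whether to raise or return a default is a design choice. Raising makes an unexpected name in the shapefile fail loudly, which is arguably the safer behavior for the map code.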

Drawing the Map

Now we can start putting the pieces of the map together.

First, we need a function to create a Basemap of the lower 48 U.S. states, along with portions of Canada and Mexico.

def draw_basemap():
    """Lambert Conformal map of lower 48 U.S. states with portions of Canada and Mexico."""
    m = Basemap(
        llcrnrlon=-119,
        llcrnrlat=22,
        urcrnrlon=-64,
        urcrnrlat=49,
        projection='lcc',
        lat_1=32,
        lat_2=45,
        lon_0=-95,
    )
    m.fillcontinents(color='lightgray')
    return m

Reading the Shapefiles

Next, we read in our shapefiles for the U.S. states.

def read_shape_files(m):
    """ Get U.S. state shape boundaries."""
    # Shapefiles downloaded from https://www.census.gov/geo/maps-data/data/prev_cartbndry_names.html
    MAP_DATA_DIR = PARENT_DIR / 'data'
    SHAPEFILE = MAP_DATA_DIR.joinpath('st99_d00')
    return m.readshapefile(
        shapefile=str(SHAPEFILE),
        name='states',
        drawbounds=True,
        color='white',
        linewidth=1,
    )

Making a Colormap for NBA Divisions

Next, we create a colormap with a distinct color for each of the NBA Divisions.

def make_colormap(divisions, colormap='Set3'):
    """Create colormap with distinct value for each NBA division."""
    cmap = plt.get_cmap(colormap, len(divisions))
    return {div: cmap(divisions.index(div))[:3] for div in divisions}

We want to assign a color to each state that has an NBA arena. Of course, we can’t use the U.S. state shapefiles for Toronto.

def get_state_colors(df):
    colors = make_colormap(list(df['Division'].str.strip().unique()))
    state_color = {}
    for abbr in list(df['Postal'].str.strip().unique()):
        div = list(df.loc[df['Postal'].str.strip() == abbr, 'Division'].str.strip().unique())
        assert len(div) == 1 # there can only be one Division applicable for teams from one U.S. state
        div = str(div[0])
        color = colors[div]
        state_color[abbr] = rgb2hex(color)
    return state_color
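
The assert in `get_state_colors` encodes an assumption about the data: every team in a given state belongs to the same division. Here is a dependency-free sketch of that check, assuming the team records are simple `(team, postal, division)` tuples rather than DataFrame rows. The `state_divisions` helper is hypothetical, and it raises a descriptive error instead of a bare assertion failure.

```python
def state_divisions(records):
    """Map each postal code to its single division, or raise ValueError."""
    result = {}
    for team, postal, division in records:
        # Strip stray whitespace so 'Pacific ' and 'Pacific' compare equal
        postal, division = postal.strip(), division.strip()
        # setdefault stores the first division seen and returns the stored one
        seen = result.setdefault(postal, division)
        if seen != division:
            raise ValueError(f'{postal} spans {seen} and {division} ({team})')
    return result

teams = [
    ('Los Angeles Lakers', 'CA', 'Pacific'),
    ('Golden State Warriors', 'CA', 'Pacific '),  # trailing space on purpose
    ('Miami Heat', 'FL', 'Southeast'),
]
print(state_divisions(teams))
```

If the league ever realigned so that one state hosted teams from two divisions, a single fill color per state would no longer make sense, and a check like this would flag it.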

Next, we need to get the information from the shapefile for each U.S. state. This is where we need to use our function to look up the postal abbreviation given a U.S. state name from the shapefile.

def get_state_polygons(m):
    state_polygons = {}
    for info, shape in zip(m.states_info, m.states):
        abbr = name2abbr(info['NAME'])
        if abbr in state_polygons:
            state_polygons[abbr].append(Polygon(shape, True))
        else:
            state_polygons[abbr] = [Polygon(shape, True)]
    return state_polygons
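
The if/else in `get_state_polygons` is the usual pattern for collecting several values under one key, which the code needs because the same state name can appear more than once in the shape records. As an alternative sketch, `collections.defaultdict` expresses the same grouping more compactly. The helper `group_by_key` is hypothetical, and the shape values below are placeholder strings, not real shapefile data.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect values that share a key into lists."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # a missing key starts as an empty list
    return dict(groups)

shapes = [('MI', 'shape-1'), ('OH', 'shape-2'), ('MI', 'shape-3')]
print(group_by_key(shapes))
```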

Putting It All Together and Drawing the Map

To draw the map, we need to perform the following steps:

  • Create the Basemap
  • Read in the state shapefiles and assign the colors for the states that need to be filled in
  • Fill in the states with the correct colors
  • Draw markers for the arenas using the latitude and longitude information
  • Create text labels for the arenas using the team names
  • Show the map

As I mentioned above, Basemap may emit some warnings when you run this code. The warnings I get are harmless, and I’ve filtered them out using the warnings module. You can run this code without the warnings module if you want, and the map should still be fine.

def draw_nba_map(df):
    """Draw map with locations of NBA arenas."""
    
    fig, ax = plt.subplots(figsize=(12,8))
    m = draw_basemap()
    state_shapes = read_shape_files(m)
    state_polygons = get_state_polygons(m)
    state_colors = get_state_colors(df)
    
    # Color in states (skip Ontario and Washington, DC)
    for abbr in state_colors:
        if abbr not in ['ON', 'DC']:
            ax.add_collection(PatchCollection(
                state_polygons[abbr],
                facecolor=state_colors[abbr],
                edgecolor='white',
                linewidth=1,
                zorder=2)
            )

    # Display markers and labels for arenas
    cities = set()
    for _, row in df.iterrows():
        city = row['City']
        x, y = m(row['Longitude'], row['Latitude'])
        m.plot(x=x, y=y, color='black', marker='o', markersize=5)
        team = row['Team'].split()[-1]
        
        # If a city has already been plotted, offset the text so labels don't overlap
        if city in cities:
            label_x = x+40000
            label_y = y+40000
        else:
            label_x = x+40000
            label_y = y-40000
            cities.add(city)
        plt.text(x=label_x, y=label_y, s=team, fontsize='smaller')

    # Remove the box surrounding the plot
    for spine in ax.spines.values():
        spine.set_visible(False)
    plt.show()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    draw_nba_map(df)
map of NBA arenas
NBA Arenas

This map isn’t perfect. It doesn’t use shapefiles for Canada, so it doesn’t make clear that the Raptors are part of the Atlantic Division. It also doesn’t make clear that the Wizards are part of the Southeast Division. And it doesn’t have a legend to show what the fill colors mean.

Still, it’s great to be able to create a nice-looking map in a few dozen lines of Python. This simple example only scratches the surface of what you can do with geographical data in Python.
