NBA Data Analysis

The dataset used, available at https://www.kaggle.com/datasets/justinas/nba-players-data, contains more than two decades of information about NBA players.

%use dataframe
%use lets-plot
val raw_df = DataFrame.readCSV("../resources/example-datasets/datasets/all_seasons.csv")
raw_df.columnNames()
[untitled, player_name, team_abbreviation, age, player_height, player_weight, college, country, draft_year, draft_round, draft_number, gp, pts, reb, ast, net_rating, oreb_pct, dreb_pct, usg_pct, ts_pct, ast_pct, season]

As shown above, the dataset includes demographic variables such as age, height, weight and country of birth, as well as biographical details such as the team played for, draft year and draft round. It also contains basic box score statistics such as games played and average points, rebounds and assists.

Let’s look through the data types that have been inferred by DataFrame:

raw_df.schema()
untitled: Int
player_name: String
team_abbreviation: String
age: Double
player_height: Double
player_weight: Double
college: String?
country: String
draft_year: String
draft_round: String
draft_number: String
gp: Int
pts: Double
reb: Double
ast: Double
net_rating: Double
oreb_pct: Double
dreb_pct: Double
usg_pct: Double
ts_pct: Double
ast_pct: Double
season: String

At first sight, we notice that the column untitled is useless, so we will remove it. It would also be more convenient to store ages as Int instead of Double. The draft columns are typed as String because undrafted players are marked with the text Undrafted; we will keep them as String for convenience.

Let’s apply these changes and assign the resulting dataframe to a new value:

val df = raw_df.remove { untitled }
    .convert { age }.toInt()

As the nullable type String? suggests, some college values are missing; we can check how many with:

// list the columns that contain nulls, with their null counts
df.describe().filter { it["nulls"] != 0 }.select("name", "nulls")

DataFrame: rowsCount = 1, columnsCount = 2

Data Analysis

Drafts

Let us analyze players drafted and undrafted for each season.

val drafts = df.groupBy { season.map { it.split('-')[0] } }.aggregate {
    count { draft_year != "Undrafted" } into "drafted"
    count { draft_year == "Undrafted" } into "undrafted"
}.convert { season }.toInt()

drafts.head(5)

DataFrame: rowsCount = 5, columnsCount = 3
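The grouping above relies on two small steps: reducing a season label such as 1996-97 to its starting year, and counting rows on each side of the Undrafted marker. The sketch below reproduces that logic with plain Kotlin collections on made-up rows; `PlayerSeason`, `seasonStartYear` and `draftCounts` are illustrative names, not part of the notebook or the DataFrame library.

```kotlin
// Illustrative record; only the two fields used by the grouping are modelled.
data class PlayerSeason(val season: String, val draftYear: String)

// "1996-97" -> 1996: keep the part before the dash.
fun seasonStartYear(season: String): Int = season.substringBefore('-').toInt()

// For each starting year, count (drafted, undrafted) rows.
fun draftCounts(rows: List<PlayerSeason>): Map<Int, Pair<Int, Int>> =
    rows.groupBy { seasonStartYear(it.season) }
        .mapValues { (_, group) ->
            val undrafted = group.count { it.draftYear == "Undrafted" }
            (group.size - undrafted) to undrafted
        }

fun main() {
    // Made-up rows, not taken from the dataset.
    val sample = listOf(
        PlayerSeason("1996-97", "1994"),
        PlayerSeason("1996-97", "Undrafted"),
        PlayerSeason("1997-98", "1996")
    )
    println(draftCounts(sample)) // {1996=(1, 1), 1997=(1, 0)}
}
```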

We can visualize this difference over the years

val drafted = geomPoint() { y="drafted" } +
    geomLine(color="darkgreen") { y="drafted" }
    
val undrafted =  geomPoint() { y="undrafted"} +
    geomLine(color="orange") { y="undrafted"}
    
ggplot(drafts.toMap()) { x =  "season" } + drafted + undrafted +
    scaleXContinuous(breaks = (1996..2021 step 3).toList()) +
    scaleYLog10()

We can see that the number of drafted players is roughly three times the number of undrafted players in each season. The number of undrafted players rises from the 2017-18 season, the year the “two-way contract” rule was introduced, which helps undrafted players secure deals with NBA franchises.

Height and Weight

We can summarize each player’s physical data with the Body Mass Index (BMI). Before doing so, we must account for the fact that a player’s measurements change over the years, so we will average weight and height per player.

val physical_data = df.select { player_name and player_height and player_weight }
    .groupBy { player_name }.mean()
    .add("BMI") { player_weight / (Math.pow(player_height / 100, 2.0))}

physical_data[0..5]

DataFrame: rowsCount = 6, columnsCount = 4
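As a quick sanity check of the BMI column, the formula is the weight in kilograms divided by the squared height in metres, which is why the height in centimetres is divided by 100 first. A minimal standalone sketch, with a hypothetical `bmi` function and made-up measurements:

```kotlin
// BMI = weight in kilograms divided by the square of the height in metres.
fun bmi(weightKg: Double, heightCm: Double): Double {
    val heightM = heightCm / 100.0
    return weightKg / (heightM * heightM)
}

fun main() {
    // Made-up measurements, not taken from the dataset.
    println(bmi(weightKg = 100.0, heightCm = 200.0)) // 25.0
}
```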

We can then plot the distribution of height, adding the global male height average of 171 cm.

ggplot(physical_data.toMap()) { x="player_height"} +
    geomHistogram(binWidth = 2, fill="#00798c") +
    geomVLine(xintercept = 171, size = 2.0, color="#d1495b") +
    geomVLine(xintercept = physical_data.player_height.mean(),
              size = 2.0, color="#edae49") +
    labs(title="Distribution of Height", x="Height (cm)", y="Count") +
    theme(title = elementText(hjust = 0.5))
    

The red line marks the global male average height, and the golden one marks the NBA average height.

It is useful to measure how strongly weight and height are correlated. We can do so with Pearson’s correlation coefficient, defined as:

$$\rho_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2 \sum_{i=1}^{n} (Y_i - \overline{Y})^2}}$$

val corrHeightWeight = ggplot(physical_data.toMap()) { x="player_weight" ; y="player_height"} +
    geomPoint(color = "#233d4d") +
    labs(title = "Height and Weight Correlation", x = "Weight (kg)", y = "Height (cm)") +
    theme(title = elementText(hjust = 0.5)) +
    ggsize(500, 500)

// Adding correlation
val correlation = physical_data.corr { player_weight }
        .with { player_height }["player_height"].values().toList()[0]

println("Correlation: $correlation")
corrHeightWeight + 
    geomSmooth(method = "lm", deg = 1, color ="#92140c", size = 2.0, se = false)
Correlation: 0.8210705060051193

We can conclude that height and weight are fairly strongly correlated.
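To make Pearson’s coefficient concrete, here is a minimal standalone sketch that evaluates it term by term with the Kotlin standard library. The function name `pearson` and the sample values are assumptions made for illustration; the notebook itself relies on the DataFrame `corr` call.

```kotlin
import kotlin.math.sqrt

// Pearson's coefficient: the sum of products of deviations from the means,
// divided by the square root of the product of the summed squared deviations.
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.isNotEmpty()) { "Inputs must be non-empty and of equal length" }
    val meanX = xs.average()
    val meanY = ys.average()
    var cov = 0.0
    var varX = 0.0
    var varY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / sqrt(varX * varY)
}

fun main() {
    // Made-up, perfectly linear data: every extra 10 kg comes with 10 cm.
    val weights = listOf(80.0, 90.0, 100.0, 110.0)
    val heights = listOf(185.0, 195.0, 205.0, 215.0)
    println(pearson(weights, heights)) // 1.0
}
```

A value of 1.0 means a perfect increasing linear relationship, so the 0.82 measured on the real data indicates a strong but not perfect one.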

Let’s now look at the 10 players with the highest BMI:

// [0..9] keeps exactly 10 rows
val topBMI = physical_data.sortBy { BMI.desc() }[0..9]

ggplot(topBMI.toMap()) { x = "player_name" ; y = "BMI" } +
    geomBar(stat = Stat.identity, fill = "#004643", alpha=0.7) +
    coordFlip() +
    // labels follow the aesthetics, not the flipped screen axes
    labs(title = "Top 10 players by BMI", x = "Player", y = "BMI")

According to ourworldindata.org, 95% of male heights lie between 163 cm and 193 cm. With an average of 203 cm, most NBA players belong to the small share of the population taller than 193 cm.

We can then get the tallest and the shortest player ever in the NBA:

physical_data.minBy { player_height }.concat(
    physical_data.maxBy { player_height}
)

DataFrame: rowsCount = 2, columnsCount = 4

Players Nationalities

Since the NBA is the USA’s professional basketball league, most of the players are from North America. We can build a data frame and a plot showing, for each year, how many new players were from the USA and how many were foreigners.

Let’s first visualize top 15 countries.

val topCountries = df.distinctBy { player_name }
    .select { player_name and country }
    .groupBy { country }
    .count()
    .sortBy { "count"<Int>().desc() }
    
ggplot(topCountries[0..14].toMap()) { x = "country" ; y="count" } +
    geomBar(stat = Stat.identity, fill = "#456990") +
    scaleYLog10() +
    labs(title = "Top 15 Players Nationalities", x = "Country", y = "Count (log)")
    

We can also see how the number of USA players compares with the number of foreign players over the years.

val yearNationalities = df.distinctBy { player_name }
    .groupBy { season.map { it.split('-')[0] } }
    .aggregate {
        count { country != "USA" } into "Non-USA"
        count { country == "USA" } into "USA" 
    }.cumSum()
    .gather("Non-USA", "USA").into("country", "count")
    .convert { season }.toInt()
    
yearNationalities.tail(5)

DataFrame: rowsCount = 5, columnsCount = 3
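The `cumSum()` step turns the per-season counts of new players into running totals, which is what the area plot needs. A minimal sketch of the same idea with the standard library, where `cumulativeSum` is a hypothetical helper and the counts are made up:

```kotlin
// Running total: element i of the result is the sum of counts[0..i].
fun cumulativeSum(counts: List<Int>): List<Int> =
    counts.runningReduce { acc, n -> acc + n }

fun main() {
    // Made-up numbers of new players per season.
    val newPlayersPerSeason = listOf(5, 3, 4)
    println(cumulativeSum(newPlayersPerSeason)) // [5, 8, 12]
}
```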

ggplot(yearNationalities.toMap()) { x="season"; y="count"} +
    geomArea(stat = Stat.identity) { fill="country"} +
    scaleXContinuous(breaks = (1996..2021 step 3).toList()) +
    theme(title = elementText(hjust = 0.5)) +
    ggtitle("All time Player's Countries")

And we can plot the overall percentage of USA and foreign players.

val countNations = df.distinctBy { player_name }
    .groupBy { country.map { it == "USA" } }.count()
    .convert("country").with { if ("country"<Boolean>()) "USA" else "non-USA" }
    .toMap()

ggplot(countNations) +
    geomPie(stat = Stat.identity,
            size = 30, stroke = 1, strokeColor = "white", hole = 0.3,
            labels = layerLabels().line("@count").size(16),
    ) { slice = "count" ; fill = "country" } +
    theme(
        line = elementBlank(),
        axis = elementBlank(),
        title = elementText(hjust=0.5)
    ).legendPositionBottom() +
    scaleFillBrewer(palette = "Pastel1") +
    ggtitle("Country Percentage")

    
    

And finally, we can visualize foreign players trend since 1996

val foreignersCount = df.groupBy { season.map { it.split('-')[0] } }
                        .count { country != "USA" }
                        .convert { season }.toInt().toMap() 

ggplot(foreignersCount) { x = asDiscrete("season") ; y = "count" } +
    geomLine(color = "#243e36", size = 2.0) +
    geomPoint(size = 5.0, color="#7ca982") +
    ggtitle("Foreign Players on NBA trend")

Not surprisingly for the USA’s basketball league, North American players still dominate the NBA, with the USA alone at 84%, but the number of foreign players is increasing progressively. Even though they are a minority in the league, from 2019 to today (2023) the NBA’s Most Valuable Player award has been won by foreign players!

Players Statistics

In this section we will go through in-game statistics to evaluate player excellence, analyzing points, assists and rebounds per game.

Points Per Game

ggplot(df.toMap()) { x="pts" } +
    geomHistogram(binWidth = 2, fill = "#840032", alpha=0.8) { y="..count.." } +
    geomVLine(xintercept = df.pts.mean(), color = "#fcba04",
              size = 2.0, linetype = "dashed") +
    labs(title="Points per Game Distribution",
         x="Points-per-game",
         y="Player seasons") +
    theme(title = elementText(hjust=.5))

We can see that exceptional performances are above 25 points per game.

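We can write a simple helper that returns the top share of values; averaging that slice gives a reference value for the top 1% to 10% of points-per-game performances.

// Returns the largest `perc` share of `data` (perc = 0.01 keeps the top 1% of values).
fun quantile(perc: Double=0.01, data: List<Double>): List<Double> = 
    data.sortedDescending()
        .subList(0, (perc * data.size).toInt())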
val ppgQuantile = (1..10).map {
    quantile(it.toDouble() / 100.0, df.pts.toList()).average()
}

val ppgDf = dataFrameOf(
    "Percentile" to (99 downTo 90).map { it.toDouble() / 100 },
    "PPG" to ppgQuantile
)

ppgDf

DataFrame: rowsCount = 10, columnsCount = 2
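Note that the values in this table are averages of the top slice, not the smallest value needed to enter it, so they are a stricter reference than a classical percentile cut-off. The sketch below shows the difference on made-up data; `topFraction` is a hypothetical standalone copy of the helper's logic.

```kotlin
// Same logic as the helper above: the largest `fraction` share of the values.
fun topFraction(fraction: Double, data: List<Double>): List<Double> =
    data.sortedDescending().subList(0, (fraction * data.size).toInt())

fun main() {
    // Made-up data: 1.0, 2.0, ..., 20.0
    val values = (1..20).map { it.toDouble() }
    val top25 = topFraction(0.25, values) // 20.0 down to 16.0
    println(top25.average()) // 18.0, the mean of the slice
    println(top25.last())    // 16.0, the smallest value in the slice
}
```

Filtering with `>=` on the slice average therefore keeps fewer rows than filtering on the smallest value of the slice.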

Let’s now rank the players with a points-per-game value in the top 1% by the number of seasons in which they reached it.

df.filter { pts >= ppgDf.PPG[0] }
    .update { season }.with { it.split('-')[0] }
    .convert { season }.toInt()
    .groupBy { player_name }.aggregate {
        count() into "Seasons"
        mean { pts } into "Avg PPG"
    }.sortBy { "Seasons"<Int>().desc() }[0..10]

DataFrame: rowsCount = 11, columnsCount = 3

In the dataframe above, James Harden has the highest number of such seasons, averaging 31.78 points per game. Now we have a clearer picture of what characterizes an excellent scorer.

Rebound Per Game

As above, we can find out which rebounds-per-game values characterize the best rebounders in the league.

ggplot(df.toMap()) { x="reb" } +
    geomHistogram(binWidth = 1, fill = "#840032", alpha=0.8) { y="..count.." } +
    geomVLine(xintercept = df.reb.mean(), color = "#fcba04",
              size = 2.0, linetype = "dashed") +
    labs(title="Rebounds per Game Distribution",
         x="Rebounds-per-game",
         y="Player seasons") +
    theme(title = elementText(hjust=.5))

On average, an NBA player takes 3 to 4 rebounds per game.

We can find the top 1% to 10% rebounds-per-game values:

val rebQuantile = (1..10).map {
    quantile(it.toDouble() / 100.0, df.reb.toList()).average()
}

val rebDf = dataFrameOf(
    "Percentile" to (99 downTo 90).map { it.toDouble() / 100 },
    "reb" to rebQuantile
)

rebDf

DataFrame: rowsCount = 10, columnsCount = 2

The players who fall into the top 1% of rebounders for the highest number of seasons are:

df.filter { reb >= rebDf.reb[0] }
    .update { season }.with { it.split('-')[0] }
    .convert { season }.toInt()
    .groupBy { player_name }.aggregate {
        count() into "Seasons"
        mean { reb } into "Avg RPG"
    }.sortBy { "Seasons"<Int>().desc() }[0..10]

DataFrame: rowsCount = 11, columnsCount = 3

We can see that Andre Drummond is the most consistent rebounder, but Dennis Rodman is the best rebounder since 1996.

Assists Per Game

Lastly, we will cover Assists per Game.

ggplot(df.toMap()) { x="ast" } +
    geomHistogram(binWidth = 1, fill = "#840032", alpha=0.8) { y="..count.." } +
    geomVLine(xintercept = df.ast.mean(), color = "#fcba04",
              size = 2.0, linetype = "dashed") +
    labs(title="Assists per Game Distribution",
         x="Assists-per-game",
         y="Player seasons") +
    theme(title = elementText(hjust=.5))

The distribution is right-skewed: the most common values are from one to two assists per game, with a long tail of higher averages.

As above, we compute the reference values for the top 1% to 10% of assists per game.

val astQuantile = (1..10).map {
    quantile(it.toDouble() / 100.0, df.ast.toList()).average()
}

val astDf = dataFrameOf(
    "Percentile" to (99 downTo 90).map { it.toDouble() / 100 },
    "ast" to astQuantile
)

astDf

DataFrame: rowsCount = 10, columnsCount = 2

And the players with the highest number of seasons in the top 1% are:

df.filter { ast >= astDf.ast[0] }
    .update { season }.with { it.split('-')[0] }
    .convert { season }.toInt()
    .groupBy { player_name }.aggregate {
        count() into "Seasons"
        mean { ast } into "Avg APG"
    }.sortBy { "Seasons"<Int>().desc() }[0..10]

DataFrame: rowsCount = 11, columnsCount = 3

Chris Paul is the most consistent of all players from 1996 to today when it comes to assists per game, while Rajon Rondo has the best single-season assists-per-game average:

df.sortBy { ast.desc() }.select { player_name and ast }[0]

DataRow: index = 0, columnsCount = 2

College Ranking

The last section will summarize the above statistics (points, rebounds, assists).

Let’s create a college ranking based on players’ total games played in the NBA:

val careerGames = df.groupBy { player_name }.sum { gp }

val college_rank = careerGames.join(df) { player_name match right.player_name }
    .select { player_name and college and gp }
    .distinctBy { player_name }
    .rename { gp }.into("total_games")
    .groupBy { college }.sum("total_games")
    .sortByDesc("total_games")
    .filter { college != "None" }.add("rank") { index() }
college_rank[0..10]

DataFrame: rowsCount = 11, columnsCount = 3
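The ranking logic can be summarized as: sum career games per college, drop players without a college, sort in descending order and use the position as the rank, so that rank 0 is the college with the most games. A minimal sketch with plain collections follows; `CareerGames`, `collegeRank` and all sample values are made up for illustration.

```kotlin
// Illustrative record: one row per player with his career total of games.
data class CareerGames(val player: String, val college: String, val totalGames: Int)

// Returns (college, total games) pairs, best first; the list index is the rank.
fun collegeRank(careers: List<CareerGames>): List<Pair<String, Int>> =
    careers.filter { it.college != "None" }
        .groupBy { it.college }
        .mapValues { (_, players) -> players.sumOf { it.totalGames } }
        .entries.sortedByDescending { it.value }
        .map { it.key to it.value }

fun main() {
    // Made-up players and colleges.
    val careers = listOf(
        CareerGames("Player A", "College X", 900),
        CareerGames("Player B", "College Y", 1200),
        CareerGames("Player C", "College X", 500),
        CareerGames("Player D", "None", 1000)
    )
    collegeRank(careers).forEachIndexed { rank, (college, games) ->
        println("$rank $college $games")
    }
    // 0 College X 1400
    // 1 College Y 1200
}
```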

We can then plot each player’s average points per game, showing the rank of the college he comes from:

val bestScorer = df.groupBy { player_name }.mean { pts }.sortBy { pts.desc() }
                    .join(
                        df.distinctBy { player_name }
                            .select { player_name and college } 
                    ) { player_name match right.player_name }
                    .filter { college != "None" }
                    .join(
                        college_rank.select { college and rank }
                    ) { college match right.college }


bestScorer[0..10]

DataFrame: rowsCount = 11, columnsCount = 4

val tooltipOptions = layerTooltips()
                        .line("college|@college")
                        .line("rank|@rank ")
    
    
// [0..14] keeps exactly 15 players
ggplot(bestScorer[0..14].toMap()) { x="player_name" ; y="pts" } +
    geomBar(stat = Stat.identity, tooltips = tooltipOptions) { fill="rank" } +
    coordFlip() +
    // labels follow the aesthetics, not the flipped screen axes
    labs(title = "Average PPG with College Ranking",
         x = "Player",
         y = "Points-per-Game")

We can see that the higher the rank of the college, the higher the number of its players among the top 15 scorers.

Let’s see if this is true also for rebounds and assists.

val bestAst = df.groupBy { player_name }.mean { ast }.sortBy { ast.desc() }
                    .join(
                        df.distinctBy { player_name }
                            .select { player_name and college } 
                    ) { player_name match right.player_name }
                    .filter { college != "None" }
                    .join(
                        college_rank.select { college and rank }
                    ) { college match right.college }

bestAst[0..10]

DataFrame: rowsCount = 11, columnsCount = 4

// [0..14] keeps exactly 15 players
ggplot(bestAst[0..14].toMap()) { x="player_name" ; y="ast" } +
    geomBar(stat = Stat.identity, tooltips = tooltipOptions) { fill="rank" } +
    coordFlip() +
    // labels follow the aesthetics, not the flipped screen axes
    labs(title = "Average APG with College Ranking",
         x = "Player",
         y = "Assists-per-Game")

When it comes to the top 15 players by assists per game, the college ranking distribution is very similar to the points-per-game one, with the top 5 having an average ranking of 25.

val bestReb = df.groupBy { player_name }.mean { reb }.sortByDesc { reb }
                    .join(
                        df.distinctBy { player_name }
                            .select { player_name and college } 
                    ) { player_name match right.player_name }
                    .filter { college != "None" }
                    .join(
                        college_rank.select { college and rank }
                    ) { college match right.college }

bestReb[0..10]

DataFrame: rowsCount = 11, columnsCount = 4

// [0..14] keeps exactly 15 players
ggplot(bestReb[0..14].toMap()) { x="player_name" ; y="reb" } +
    geomBar(stat = Stat.identity, tooltips = tooltipOptions) { fill="rank" } +
    coordFlip() +
    // labels follow the aesthetics, not the flipped screen axes
    labs(title = "Average RPG with College Ranking",
         x = "Player",
         y = "Rebounds-per-Game")

Dennis Rodman is the best rebounder since 1996, and in his case, the college ranking did not matter. For the rest of the top 15, the college ranking is fairly low among best rebounders (max. 68).