
Commit fc72dc5

Update analyze-baseball-stats-with-pandas-and-matplotlib.mdx
1 parent 66d822e commit fc72dc5

1 file changed

Lines changed: 25 additions & 21 deletions


projects/analyze-baseball-stats-with-pandas-and-matplotlib/analyze-baseball-stats-with-pandas-and-matplotlib.mdx

@@ -22,9 +22,11 @@ tags:
2222

2323
## Introduction
2424

25-
In the early 2000s, the Oakland A's changed baseball forever.
25+
In the early 2000s, the [Oakland A's](https://en.wikipedia.org/wiki/Athletics_(baseball)) changed baseball forever.
2626

27-
As seen in the movie _Moneyball_, general manager Billy Bean and advisor Peter Brand, used a new strategy of data analysis to find players that were hidden gems. By diving deep into often overlooked statistics, like on-base percentage, the team was able to sign undervalued players and make a deep run into the 2002 playoffs on a shoestring budget.
27+
As seen in the movie _Moneyball_ (2011), general manager Billy Beane and advisor Peter Brand used a new data-analysis strategy to find players who were hidden gems. By diving deep into often-overlooked statistics like on-base percentage, the team was able to sign undervalued players and make a deep run into the 2002 playoffs on a shoestring budget.
28+
29+
<Quote text="It's about getting things down to one number. Using stats to reread them, we'll find the value of players that nobody else can see. - Peter Brand, Moneyball" />
2830

2931
The success of the A's helped usher in an era of advanced data analysis in baseball. Every year, more and more stats about the game are being collected, and every club is hungry for a team of statisticians to help them crack the code.
3032

@@ -50,14 +52,15 @@ When looking at new data, it can also be helpful to look for the official data d
5052

5153
So with all of that setup out of the way, let's start diving into some data analysis!
5254

53-
### Initial Data Exploration
55+
## Initial Data Exploration
5456

5557
Let's begin exploring our data about individual players by getting a sense of the total scale.
5658

5759
Let's answer questions like:
58-
How many unique players are there?
59-
What years are covered?
60-
What is the average number of runs a player scores in a single year?
60+
61+
- _How many unique players are there?_
62+
- _What years are covered?_
63+
- _What is the average number of runs a player scores in a single year?_
6164

6265
To begin, we'll load the players dataset into Pandas. First, download the dataset and save it in the same directory as your Python script.
6366

@@ -84,7 +87,7 @@ To find the number of unique players in the dataset, we can use `batting['player
8487

8588
Alternatively, we could use `batting['playerID'].nunique()` to get the number immediately without having to use `.size`. There are usually multiple ways to achieve the same goal with Pandas!
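
As a quick check that the two approaches agree, here is a minimal sketch. It uses a tiny made-up `batting` table (the player IDs and values are hypothetical, not rows from the real dataset) and assumes the first approach is `.unique()` followed by `.size`.

```python
import pandas as pd

# Hypothetical toy stand-in for the Batting table (not real data).
batting = pd.DataFrame({
    "playerID": ["playera01", "playera01", "playerb01", "playerc01"],
    "yearID": [2001, 2002, 2001, 2002],
    "R": [50, 60, 10, 0],
})

# Approach 1: build the array of distinct IDs, then ask for its length.
unique_via_size = batting["playerID"].unique().size

# Approach 2: let Pandas count the distinct IDs directly.
unique_via_nunique = batting["playerID"].nunique()

print(unique_via_size, unique_via_nunique)  # 3 3
```

One difference worth knowing: `.nunique()` ignores missing values by default, while `.unique()` keeps `NaN` as its own entry, so the two counts can differ by one on columns with gaps.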
8689

87-
### Filtering Out Inactive Players
90+
## Filtering Out Inactive Players
8891

8992
It can be helpful to look through your data before jumping into any heavy analysis because there are often quirks in the data that can be hard to spot without subject matter expertise. For example, when we looked at `batting.head()`, the very first row showed that the player `aardsda01` had 0 at-bats, 0 runs, 0 hits, 0 strikeouts, and so on in `2004`. It seems like this player was on the team, but never actually batted in any games.
9093

@@ -102,7 +105,7 @@ As expected, the average number of runs scored by a player shot up to `20.88` af
102105

103106
Now, is this the "correct" thing to do? Well, it depends entirely on what question you're trying to answer. In some cases, it might be very important to filter out all of the 0s, and in others you'll want to leave them in. This is why it is critical to truly understand the dataset that you're working with, so you can anticipate the consequences of the decisions you make during analysis.
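
To see how much this one decision can move a summary statistic, here is a small sketch on made-up rows. It assumes "inactive" means zero at-bats (`AB == 0`), which may not match the exact condition used in the project.

```python
import pandas as pd

# Hypothetical toy data: two rows with no at-bats, two with playing time.
batting = pd.DataFrame({
    "playerID": ["playera01", "playerb01", "playerc01", "playerd01"],
    "yearID": [2004, 2004, 2004, 2004],
    "AB": [0, 400, 0, 200],
    "R": [0, 60, 0, 20],
})

# Average runs over every row, including players who never batted.
mean_all = batting["R"].mean()

# Boolean mask keeps only rows with at least one at-bat (our assumption).
active = batting[batting["AB"] > 0]
mean_active = active["R"].mean()

print(mean_all, mean_active)  # 20.0 40.0
```

The total number of runs is the same in both cases; only the denominator changes, which is why the average rises once the zero rows are dropped.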
104107

105-
### Finding Top Performers Using Group By
108+
## Finding Top Performers Using Group By
106109

107110
If we use `.groupby()` we can find statistics for a particular player, year, or even team. For example, we can find the total number of career home runs for each player with this line of code:
108111

@@ -115,10 +118,10 @@ You might recognize some familiar names here! These are the all time home run hi
115118

116119
This line of code chains four operations together, so let’s break it down step by step:
117120

118-
`.groupby('playerID')` groups all rows that belong to the same player. Since each row represents one season, this grouping effectively collects every season of a player’s career together. If you print this object by itself, Pandas will show a `DataFrameGroupBy object`, because it doesn’t yet know how you want to summarize the data.
121+
- `.groupby('playerID')` groups all rows that belong to the same player. Since each row represents one season, this grouping effectively collects every season of a player’s career together. If you print this object by itself, Pandas will show a `DataFrameGroupBy object`, because it doesn’t yet know how you want to summarize the data.
119122
- `['HR']` selects only the home runs column from each group.
120-
.sum() is an aggregate function. It adds up the home runs within each player’s group, giving us each player’s career total. We could use other aggregate functions if we wanted other statistics. For example, if we wanted the average number of home runs per year, we could use `.mean()`.
121-
.sort_values(ascending=False) sorts the results so the players with the most home runs appear first.
123+
- `.sum()` is an aggregate function. It adds up the home runs within each player’s group, giving us each player’s career total. We could use other aggregate functions if we wanted other statistics. For example, if we wanted the average number of home runs per year, we could use `.mean()`.
124+
- `.sort_values(ascending=False)` sorts the results so the players with the most home runs appear first.
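
The same chain can be written one step at a time, which makes each intermediate object easy to inspect. This is a sketch on a hypothetical toy table; the variable names are ours.

```python
import pandas as pd

# Hypothetical toy data: one row per player per season.
batting = pd.DataFrame({
    "playerID": ["playera01", "playera01", "playerb01", "playerb01", "playerc01"],
    "yearID": [2001, 2002, 2001, 2002, 2002],
    "HR": [10, 20, 5, 7, 40],
})

grouped = batting.groupby("playerID")            # DataFrameGroupBy: nothing summarized yet
hr_by_group = grouped["HR"]                      # SeriesGroupBy: only the HR column per player
career_hr = hr_by_group.sum()                    # Series indexed by playerID with career totals
ranked = career_hr.sort_values(ascending=False)  # largest totals first

print(ranked)
```

Here `ranked` lists `playerc01` (40) first, then `playera01` (30), then `playerb01` (12). Swapping `.sum()` for `.mean()` on `hr_by_group` would give home runs per season instead.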
122125

123126
If we wanted to filter our dataset by some condition, we could do that before grouping. For example, if we wanted to see how dominant Babe Ruth was compared to his peers, we could filter for only players who played during the same years as him. To do this, we'll need to find the years he started and ended his career:
124127

@@ -143,7 +146,7 @@ ruth_years.groupby('playerID')['HR'].sum().sort_values(ascending=False)
143146

144147
Wow, Babe Ruth had almost twice the number of home runs as the next best player!
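
The filter-then-group pattern used here can be sketched on toy data as follows. The player IDs and numbers are hypothetical, and `Series.between()` (inclusive on both ends by default) is just one way to express the year range.

```python
import pandas as pd

# Hypothetical toy data: a star, a contemporary, and a player from a later era.
batting = pd.DataFrame({
    "playerID": ["starxx01", "starxx01", "peerxx01", "peerxx01", "laterx01"],
    "yearID": [1920, 1921, 1920, 1921, 1950],
    "HR": [50, 60, 10, 15, 70],
})

# Step 1: find the first and last season of the player we care about.
star_rows = batting[batting["playerID"] == "starxx01"]
first_year = star_rows["yearID"].min()
last_year = star_rows["yearID"].max()

# Step 2: keep only seasons inside that window, for every player.
same_era = batting[batting["yearID"].between(first_year, last_year)]

# Step 3: group and rank, exactly as before.
era_totals = same_era.groupby("playerID")["HR"].sum().sort_values(ascending=False)
print(era_totals)
```

Because the filter runs first, `laterx01` never reaches the grouping step, even though that player has the highest single-season total in the toy table.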
145148

146-
### Graphing Data By Year
149+
## Graphing Data By Year
147150

148151
Because baseball has such a long history, it can be interesting to see how the game has changed over time. If we use `.groupby()` to group rows by year, we can then use the graphing library Matplotlib to visualize some interesting stats. Let's give it a shot by plotting how the total number of home runs has changed over time!
149152

@@ -170,7 +173,7 @@ plt.show()
170173

171174
It's interesting that you can see the abbreviated 2020 season in this graph! Teams played only 60 regular-season games that year, down from the usual 162, due to COVID.
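
For reference, the group-by-year-then-plot pattern can be sketched on toy data like this. The numbers are made up, and the `Agg` backend is selected only so the sketch runs without a display; in an interactive session you would drop that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: several player-season rows per year.
batting = pd.DataFrame({
    "yearID": [2018, 2018, 2019, 2019, 2020],
    "HR": [30, 25, 40, 35, 20],
})

# One total per season; the index of the result is the year.
hr_by_year = batting.groupby("yearID")["HR"].sum()

fig, ax = plt.subplots()
ax.plot(hr_by_year.index, hr_by_year.values)
ax.set_xlabel("Year")
ax.set_ylabel("Total home runs")
ax.set_title("Home runs per season (toy data)")
```

A dip like the one in the last toy year is the kind of feature that is worth checking against what you know about the sport before reading too much into it.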
172175

173-
### Your Favorite Team Vs. The League
176+
## Your Favorite Team Vs. The League
174177

175178
Another fun visualization that you can create is a comparison of your favorite team's stats to the rest of the league. I grew up in Denver, so my team is the Colorado Rockies, who have been laughably bad for the majority of my life 😅.
176179

@@ -236,7 +239,7 @@ plt.show()
236239
237240
As expected, the altitude in Denver has caused some pretty high home run numbers!
238241
239-
### Replicating Moneyball
242+
## Replicating Moneyball
240243
241244
Finally, let's take on the challenge of replicating the work Billy Beane and Peter Brand did for the Oakland A's in _Moneyball_. While they almost certainly considered many statistics, they are most famous for finding players with a high **on-base percentage** (OBP) relative to their cost.
242245
@@ -245,7 +248,7 @@ OBP is not currently listed in the Batting table, but we can calculate that valu
245248
We can find a player's salary for a given year in the Salaries table. We'll need to use a join to combine the Batting and Salaries tables.
246249
Once OBP and salary are in the same table, we can find the ratio of those two values. If we sort by that ratio, we can find the players who have the best OBP for their cost!
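
Here is a rough end-to-end sketch of that plan on toy data. It assumes the standard OBP definition, (H + BB + HBP) / (AB + BB + HBP + SF), and that both tables share `playerID` and `yearID` columns; the rows, the `obp_per_million` name, and the choice to scale salary to millions are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy Batting and Salaries tables (not real players).
batting = pd.DataFrame({
    "playerID": ["playera01", "playerb01"],
    "yearID": [2010, 2010],
    "AB": [500, 400], "H": [150, 100], "BB": [50, 40], "HBP": [0, 10], "SF": [0, 0],
})
salaries = pd.DataFrame({
    "playerID": ["playera01", "playerb01"],
    "yearID": [2010, 2010],
    "salary": [400_000, 3_000_000],
})

# Times on base divided by plate appearances that count toward OBP.
batting["OBP"] = (batting["H"] + batting["BB"] + batting["HBP"]) / (
    batting["AB"] + batting["BB"] + batting["HBP"] + batting["SF"]
)

# Inner join keeps only player-seasons present in both tables.
merged = batting.merge(salaries, on=["playerID", "yearID"], how="inner")

# OBP per million dollars of salary: higher means more on-base value per dollar.
merged["obp_per_million"] = merged["OBP"] / (merged["salary"] / 1_000_000)
value_df = merged.sort_values("obp_per_million", ascending=False)

print(value_df[["playerID", "OBP", "salary", "obp_per_million"]])
```

In this toy example the two players have similar OBPs, but the lower-paid one ranks far higher on the ratio. On real data, it is worth filtering out players with very few plate appearances first, since a tiny denominator can produce a misleadingly high OBP.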
247250
248-
#### Calculating OBP
251+
### Calculating OBP
249252
250253
Let's begin by calculating OBP. This is the formula that we'll use. It takes into account not just hits, but walks, sacrifice flies, etc.
251254
@@ -345,13 +348,14 @@ value_df_sorted[value_df_sorted['yearID'] == 2010][[
345348
I went to look up `heywaja01` on baseball-reference.com, and it turns out that this data was from a player named Jason Heyward. In 2010, he was an All Star and got 2nd place in voting for Rookie of the Year! It certainly sounds like a player that was high value. You can also confirm that our calculation of OBP was correct!
346349

347350

348-
### Recap
351+
## Recap
349352

350353
Clearly there is a _ton_ that you can do with this dataset. In this project, we practiced the following skills in Pandas:
351-
Initial data exploration using `.describe()` to see summary statistics.
352-
Filtering the dataset by boolean values (for example, `[value_df_sorted['yearID'] == 2010`)
353-
Grouping rows together and using aggregate functions like `.sum()` or `.mean()`.
354-
Using Matplotlib to graph the results of Pandas operations.
355-
Using `.merge()` to join two tables together.
354+
355+
- Initial data exploration using `.describe()` to see summary statistics.
356+
- Filtering the dataset with boolean conditions (for example, `value_df_sorted[value_df_sorted['yearID'] == 2010]`).
357+
- Grouping rows together and using aggregate functions like `.sum()` or `.mean()`.
358+
- Using Matplotlib to graph the results of Pandas operations.
359+
- Using `.merge()` to join two tables together.
356360

357361
Do you have any favorite players or teams? Shohei Ohtani? Aaron Judge? The Chicago Cubs? We hope that you come up with your own questions about baseball and use your Python and Pandas skills to answer those questions!
