projects/analyze-baseball-stats-with-pandas-and-matplotlib/analyze-baseball-stats-with-pandas-and-matplotlib.mdx
In a moment, we'll introduce some questions that we want to answer, but for now, it can be helpful to click into a few files to get a sense of what data we're working with.
-
We'll primarily use the files **Batting.csv**, **People.csv**, and **Teams.csv**, but feel free to check out other files that might be interesting to you!
+
We'll primarily use three files:
+
+
- **Batting.csv**
+
- **People.csv**
+
- **Teams.csv**
As a brief example, this is what the top of the **Batting.csv** file looks like:
@@ -93,7 +97,9 @@ Alternatively, we could use `batting['playerID'].nunique()` to get the number im
## Filtering Out Inactive Players
-
It can be helpful to look through your data before jumping into any heavy analysis because there are often quirks to the data that can be hard to spot without subject matter expertise. For example, when we looked at `batting.head()`, the very first row showed the player `aardsda01` from the year `2004` had 0 at bats, 0 runs, 0 hits, 0 strike outs, and so on. It seems like this player was on the team, but never actually played in any games.
+
It can be helpful to look through your data before jumping into any heavy analysis because there are often quirks to the data that can be hard to spot without subject matter expertise.
+
+
For example, when we looked at `batting.head()`, the very first row showed the player `aardsda01` from the year `2004` had 0 at bats, 0 runs, 0 hits, 0 strike outs, and so on. It seems like this player was on the team, but never actually played in any games.
If this is a common occurrence, then that might drastically alter some of these summary statistics. If there are a ton of players that are in the database but have `0`s for all their stats, then that will drag down all of the averages that we're looking at.
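To make the concern concrete, here is a minimal sketch of how all-zero rows pull an average down. The tiny DataFrame is made up for illustration, and treating "more than 0 at bats" (the `AB` column) as the test for an active player is an assumption, not necessarily the filter the project ends up using:

```py
import pandas as pd

# Hypothetical miniature of the batting table; the values are invented.
batting = pd.DataFrame({
    "playerID": ["a", "b", "c", "d"],
    "AB": [0, 0, 400, 600],   # at bats
    "HR": [0, 0, 10, 30],     # home runs
})

# Average over every row, including players who never batted.
mean_all = batting["HR"].mean()

# Assumption: a player counts as "active" if they had at least one at bat.
active = batting[batting["AB"] > 0]
mean_active = active["HR"].mean()

print(mean_all, mean_active)
```

In this toy data, half of the rows are all zeros, so the average home run count doubles once those rows are filtered out.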
@@ -193,7 +199,7 @@ First, let's find the total number of home runs per year. This will look very fa
-
We can now plot this using Matplotlib's `plot()` function. This function needs a list of X and Y values. In our case, we want the year to be on the X axis and the total number of home runs to be on the Y axis:
+
We can now plot this using Matplotlib's `.plot()` function. This function needs a list of X and Y values. In our case, we want the year to be on the X axis and the total number of home runs to be on the Y axis:
```py
import matplotlib.pyplot as plt
@@ -273,7 +279,7 @@ plt.ylabel('Home Runs')
plt.legend()
plt.show()
-
``
+
```
As expected, the altitude in Denver has caused some pretty high home run numbers!