
Deception+ — Mechanics Reference for Dynasty Index Integration

This document explains every aspect of how Deception+ works: what it measures, how it is calculated, what the output numbers mean, and how to correctly surface the metric to Dynasty Index users.


Table of Contents

  1. What Deception+ Measures
  2. The Score Scale
  3. Core Concept: Surprise
  4. The Two-Model Architecture
  5. Feature Inputs
  6. Unpredictability Ratio
  7. Standardization Formula
  8. Alternative Metric: PPI
  9. Data Pipeline Step by Step
  10. Execution Modes
  11. Output Columns
  12. Pitch Type Taxonomy
  13. Minimum Sample Requirements
  14. Interpreting Extreme Scores
  15. Validated Correlations
  16. Known Limitations
  17. Glossary

1. What Deception+ Measures

Deception+ quantifies how difficult a pitcher's pitch selection is to predict given full knowledge of game context, sequencing history, and batter tendencies.

It does not measure:

  • Pitch quality (velocity, movement, spin)
  • Pitch mix breadth (having many pitch types)
  • Deception mechanics (arm angle, tunneling)

It does measure:

  • Whether the pitcher's pitch choices follow learnable patterns
  • Whether knowing the count, runners, batter, score, and previous pitch makes the next pitch predictable

A pitcher can have a two-pitch arsenal and still score extremely high if they use those two pitches without following recognizable situational rules. Conversely, a pitcher with six pitch types can score low if their choices are highly count-dependent and predictable.


2. The Score Scale

Deception+ is standardized to the same scale as ERA+ and wRC+:

  • Mean = 100 (league average pitcher in the training population)
  • Standard deviation = 10 (one SD above average = 110)

| Score | Interpretation |
| --- | --- |
| 115+ | Elite unpredictability — essentially impossible to anticipate |
| 110–114 | Highly unpredictable |
| 105–109 | Above average |
| 100 | League average |
| 95–99 | Slightly below average |
| 90–94 | Predictable |
| Below 90 | Highly predictable — follows recognizable patterns |

The score is always relative to the training population and period. A score of 107 means the pitcher is 0.7 standard deviations more unpredictable than the average pitcher in that reference group.


3. Core Concept: Surprise

The mathematical engine behind Deception+ is surprise (also called negative log-likelihood):

Surprise(pitch | model) = -log( P_model(actual pitch) )

Where P_model(actual pitch) is the predicted probability the model assigned to the pitch that was actually thrown.

Intuition:

  • Model says "I'm 80% sure a fastball is coming" → pitcher throws fastball → low surprise
  • Model says "I'm 10% sure a curveball is coming" → pitcher throws curveball → high surprise
  • Model says "I'm 50% sure a fastball is coming" → pitcher throws fastball → moderate surprise

Mathematical properties that make this useful:

  • Inversely proportional to probability: rare choices produce higher values
  • Proper scoring rule: the model is incentivized to give well-calibrated probabilities
  • Additive: per-pitch surprises sum meaningfully across many pitches
  • Information-theoretically grounded: the expected surprise over a distribution equals its entropy

We use the natural logarithm, so surprise is measured in nats rather than bits. The choice of base only rescales every surprise value by a constant factor (ln 2), so it has no effect on the unpredictability ratio or the final score.
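The per-pitch computation is tiny; a minimal Python sketch (the pipeline itself is written in R, so this is purely illustrative):

```python
import math

def surprise(p_actual: float) -> float:
    """Per-pitch surprise in nats: -ln of the probability the model
    assigned to the pitch type that was actually thrown."""
    return -math.log(p_actual)

# "80% sure a fastball is coming" and the fastball comes: low surprise
low = surprise(0.80)   # about 0.22 nats

# "10% sure a curveball is coming" and the curveball comes: high surprise
high = surprise(0.10)  # about 2.30 nats
```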


4. The Two-Model Architecture

Deception+ does not simply measure raw surprise from one model. It compares two models to isolate genuine unpredictability from superficial effects like pitch mix diversity or count-driven tendencies.

Full Model (Complex Predictor)

A multinomial logistic regression trained on 18 features representing full game context:

  • Current count (balls, strikes, two-strike indicator, ahead-in-count indicator)
  • Game situation (inning, outs, score differential, high-leverage indicator)
  • Base-out state (8 configurations, runner-in-scoring-position flag)
  • Batter profile (handedness, chase rate, in-zone contact, overall swing rate)
  • Times through the order in this game
  • Previous pitch type in this at-bat (or "NONE" if first pitch)

This model asks: given everything I know, what pitch would a pattern-following pitcher throw here?

Baseline Model (Simple Predictor)

A simpler model using only basic features:

  • Count (balls, strikes, two-strike indicator)
  • Runner in scoring position (yes/no)
  • Batter handedness and pitcher handedness

This model captures the most elementary situational patterns — count-based tendencies that all pitchers exhibit to some degree.

Why two models?

The ratio of their surprise values isolates what cannot be explained even with deep context:

| Scenario | Full Model | Baseline | Ratio | Meaning |
| --- | --- | --- | --- | --- |
| Very predictable pitcher | Low surprise | Low surprise | < 1.0 | Full context helps prediction a lot |
| Average pitcher | Moderate surprise | Moderate surprise | ≈ 1.0 | Context helps, but not much more than basics |
| Very unpredictable pitcher | High surprise | High surprise | > 1.0 | Full context doesn't help; pitcher is genuinely random |

Pitchers with a ratio > 1.0 surprise the complex model more than the simple model — meaning knowing more about them actually doesn't help. They are situationally independent.

Why Multinomial Logistic Regression?

The choice of algorithm for the full model is intentional:

  • Multi-class by design: handles 14+ pitch types naturally, with probabilities summing to 1.0
  • Fast: trains on 100k+ pitches in seconds, enabling daily runs
  • Interpretable: coefficients can be sanity-checked
  • Conservative: if a pitcher can fool even a simple logistic model, they are genuinely hard to predict

The model is an intentional lower bound — it deliberately doesn't use deep learning or tree ensembles, which would overfit and understate predictability. A pitcher who defies multinomial logistic regression is truly unpredictable.


5. Feature Inputs

Categorical Features (treated as unordered factors)

| Feature | Description | Values |
| --- | --- | --- |
| last_pitch_type | Previous pitch in this at-bat | FF, SL, CH, CU, SI, FC, FS, KC, ST, KN, FO, SV, CS, FT, OTHER, NONE |
| base_state | Base-out configuration | 0–7 (see encoding below) |
| stand | Batter handedness | L, R |
| p_throws | Pitcher handedness | L, R |
| inning | Inning number | Categorical integer |

Base state encoding: base3 × 4 + base2 × 2 + base1

| Value | Configuration |
| --- | --- |
| 0 | Bases empty |
| 1 | Runner on 1st |
| 2 | Runner on 2nd |
| 3 | Runners on 1st and 2nd |
| 4 | Runner on 3rd |
| 5 | Runners on 1st and 3rd |
| 6 | Runners on 2nd and 3rd |
| 7 | Bases loaded |
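The encoding is a plain binary weighting of the three runner flags; a minimal sketch (illustrative Python, with hypothetical flag names):

```python
def encode_base_state(on_1b: bool, on_2b: bool, on_3b: bool) -> int:
    """Pack the runner flags into one 0-7 integer: base3*4 + base2*2 + base1."""
    return int(on_3b) * 4 + int(on_2b) * 2 + int(on_1b)

encode_base_state(False, False, False)  # 0: bases empty
encode_base_state(True, False, True)    # 5: runners on 1st and 3rd
encode_base_state(True, True, True)     # 7: bases loaded
```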

Continuous/Ordinal Features

| Feature | Description | Range |
| --- | --- | --- |
| balls | Current ball count | 0–3 |
| strikes | Current strike count | 0–2 |
| outs | Current outs in the inning | 0–2 |
| score_diff | Home score minus away score | Typically -10 to +10 |
| o_swing_pct | Batter's chase rate (swings on pitches outside zone) | 0.0–1.0 |
| z_contact_pct | Batter's in-zone contact rate | 0.0–1.0 |
| swing_pct | Batter's overall swing rate | 0.0–1.0 |
| chase_contact_pct | Batter's contact rate on pitches outside zone | 0.0–1.0 |

Binary Features

| Feature | Description | Condition |
| --- | --- | --- |
| two_strikes | 2-strike count indicator | strikes == 2 |
| ahead_in_count | Hitter-favorable count | balls > strikes |
| is_risp | Runner in scoring position | Runner on 2nd or 3rd |
| is_top | Top of inning | inning_topbot == "TOP" |
| high_leverage | Late and close game | inning >= 7 AND … |
| n_thruorder_pitcher | Times through order | 1, 2, or 3 (from Statcast) |

Batter Metric Defaults

If insufficient history exists for a pitcher-batter pair, batter metrics default to 0.5. This prevents NA-driven model failures while remaining neutral.


6. Unpredictability Ratio

After computing per-pitch surprise from both models, the pipeline aggregates to the pitcher level:

Mean_S_model    = mean( -log(P_model(actual_pitch))   ) over all test pitches
Mean_S_baseline = mean( -log(P_baseline(actual_pitch)) ) over all test pitches

Unpredictability_Ratio = Mean_S_model / Mean_S_baseline

Interpretation guide:

| Ratio | Meaning |
| --- | --- |
| > 1.0 | Complex model more surprised than simple model; pitcher resists pattern recognition |
| = 1.0 | Equal surprise; context neither helps nor hurts prediction |
| < 1.0 | Complex model less surprised; pitcher follows complex but learnable patterns |

A ratio of exactly 1.0 is rare in practice. Most pitchers fall in the 0.90–1.10 range.

Why the ratio is robust:

  • Scale-invariant (both models see the same pitches)
  • Unaffected by overall pitch difficulty or arsenal diversity
  • Normalizes out count-level tendencies shared by both models
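The aggregation above can be sketched end to end (illustrative Python over lists of per-pitch probabilities; the real pipeline does this in R):

```python
import math

def unpredictability_ratio(p_model, p_baseline):
    """Mean full-model surprise divided by mean baseline surprise,
    computed over the same set of test pitches."""
    mean_s_model = sum(-math.log(p) for p in p_model) / len(p_model)
    mean_s_base = sum(-math.log(p) for p in p_baseline) / len(p_baseline)
    return mean_s_model / mean_s_base

# Full context helps no more than the basics -> ratio of exactly 1.0
even = unpredictability_ratio([0.25, 0.30, 0.20], [0.25, 0.30, 0.20])

# Full model consistently MORE surprised than the baseline -> ratio > 1.0
elusive = unpredictability_ratio([0.10, 0.12], [0.30, 0.35])
```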

7. Standardization Formula

Raw ratios are transformed into the 100-point Deception+ scale using the training population's distribution:

μ    = mean(Unpredictability_Ratio across all pitchers in reference population)
σ    = standard_deviation(Unpredictability_Ratio across reference population)

Deception+ = 100 + 10 × ( (Unpredictability_Ratio - μ) / σ )

Reference population for standardization:

The reference μ and σ come from the training period, not the test period. This is deliberate:

  • Anchors every analysis to a stable "league average" definition
  • Scores from different test periods are comparable as long as they share the same training baseline
  • Prevents the mean from drifting as sample sizes change between analyses

In daily mode, a pre-computed baseline_params.rds (generated from 100 random 50/50 splits over 2+ years of data) provides fixed μ and σ values, ensuring long-run comparability.
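The standardization step is a plain z-score mapping; a sketch (illustrative Python; the μ and σ values here are invented for the example, not real baseline parameters):

```python
def deception_plus(ratio: float, mu: float, sigma: float) -> float:
    """Map an unpredictability ratio onto the 100-mean, 10-SD scale
    defined by the reference population's mu and sigma."""
    return 100 + 10 * (ratio - mu) / sigma

# Example only: with mu = 1.00 and sigma = 0.10, a ratio of 1.07 lands
# 0.7 SD above average, i.e. a Deception+ of 107.
score = deception_plus(1.07, mu=1.00, sigma=0.10)
```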


8. Alternative Metric: PPI

Alongside Deception+, the pipeline computes an alternative metric called the Pitch Predictability Index (PPI):

PPI = 1 - (Mean_S_model / Mean_S_baseline)
    = 1 - Unpredictability_Ratio
| PPI Value | Meaning |
| --- | --- |
| +1.0 | Full model assigns probability 1 to every actual pitch (perfectly predictable) |
| 0.0 | Both models equally surprised (ratio = 1.0; average) |
| -1.0 | Full model at least twice as surprised as the baseline (highly unpredictable) |

PPI lives on the range [-1, 1] and is clamped to that range. It is an intuitive complement to Deception+ for users who prefer a bounded scale.
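A sketch of the bounded complement (illustrative Python; the clamp reflects the [-1, 1] range described above):

```python
def ppi(unpredictability_ratio: float) -> float:
    """Pitch Predictability Index: 1 - ratio, clamped to [-1, 1]."""
    return max(-1.0, min(1.0, 1.0 - unpredictability_ratio))

ppi(1.00)  # 0.0: both models equally surprised
ppi(1.10)  # -0.1: full model 10% more surprised (more unpredictable pitcher)
ppi(2.50)  # -1.0: clamped at the lower bound
```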

Dynasty Index recommendation: Surface Deception+ as the primary metric (familiar scale, comparable to ERA+) and PPI as a secondary detail for users who want the raw ratio information.


9. Data Pipeline Step by Step

Step 1: Data Acquisition

Source: MLB Statcast via the sabRmetrics R package (MLB) and direct Baseball Savant API (AAA).

Data is cached locally as .Rds files named:

cache/savant_raw_{start_date}_{end_date}_{game_type}_{level}.Rds

Supported game types: R (regular season), P (playoffs), W (World Series), S (spring training).

Step 2: Feature Engineering

Raw Statcast pitch-level data is transformed:

  1. Pitch type canonicalization — standardizes all pitch codes to the 14-type taxonomy (see Section 12)
  2. Count features — two_strikes and ahead_in_count computed from balls and strikes
  3. Base state encoding — the 0–7 integer encoding built from the runner flags
  4. Batter metrics — computed from rolling historical data for the pitcher-batter pair; defaults to 0.5 if insufficient history
  5. Sequence feature — last_pitch_type is the previous pitch in the current at-bat, or "NONE" for the first pitch of each plate appearance
  6. Times through order — pulled directly from Statcast n_thruorder_pitcher

Step 3: Feature Validation

Before model training, the pipeline removes features that would cause fitting errors:

  • Categorical features with only one level
  • Numeric features with zero variance

Missing values are handled as follows:

  • Categorical: NA becomes an "UNK" factor level
  • Numeric: imputed with median (or 0 if all values are NA)
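The validation and imputation rules can be sketched together (a Python sketch over a dict of columns with None marking missing values; the pipeline implements this in R, and the exact order of the drop and impute passes there may differ):

```python
import statistics

def validate_features(columns):
    """columns: feature name -> list of values, None marking missing.
    Imputes missing values, then drops degenerate features."""
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if observed and all(isinstance(v, str) for v in observed):
            # Categorical: NA becomes its own "UNK" level
            filled = [v if v is not None else "UNK" for v in values]
        else:
            # Numeric: impute median, or 0 if every value is missing
            med = statistics.median(observed) if observed else 0
            filled = [v if v is not None else med for v in values]
        # Single-level / zero-variance features would break model fitting
        if len(set(filled)) > 1:
            cleaned[name] = filled
    return cleaned
```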

Step 4: Data Splitting

The pipeline supports three splitting strategies:

| Mode | Training Data | Test Data | Use Case |
| --- | --- | --- | --- |
| Temporal (default) | train_start to train_end | test_start to test_end | Regular season → playoffs comparison |
| Random 50/50 | Random half of each pitcher's pitches | Opposite half | Stable baseline computation |
| Same-period | Full period | Full period | In-sample validation |
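The random 50/50 strategy splits within each pitcher, so both halves contain every pitcher (illustrative Python; the pitch records are hypothetical dicts, and the actual pipeline does this in R):

```python
import random

def random_split(pitches, seed=42):
    """Split each pitcher's pitches into random halves.
    `pitches` is a list of dicts with a 'pitcher_id' key."""
    by_pitcher = {}
    for p in pitches:
        by_pitcher.setdefault(p["pitcher_id"], []).append(p)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_pitcher.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train.extend(shuffled[:half])
        test.extend(shuffled[half:])
    return train, test
```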

Step 5: Model Training

Full season mode (run.R): one multinomial model trained on the entire league's training-period data.

Daily mode (run_daily.R): one individual model per pitcher, trained on that pitcher's last N pitches (default: 500) drawn from historical data.

Model configuration:

nnet::multinom(
  pitch_class ~ [all features],
  data = training_data,
  maxit = 500,
  trace = FALSE
)

No regularization is applied — the model is meant to fully fit whatever patterns exist in training data.

Typical convergence: 50–100 iterations with 1,000+ pitches.

Step 6: Prediction and Surprise Calculation

For every pitch in the test period, the pipeline:

  1. Calls predict(model, newdata = test_pitch, type = "probs") to get a probability vector over all pitch types
  2. Looks up the probability assigned to the actual pitch thrown
  3. Computes surprise_model = -log(max(p_actual, 1e-12)) (the floor prevents log(0))
  4. Repeats with the baseline model to get surprise_baseline
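Step 3's floor matters when the model assigns (numerically) zero probability to the pitch that was actually thrown, for example a pitch type absent from that pitcher's training data; a sketch:

```python
import math

def floored_surprise(p_actual: float, floor: float = 1e-12) -> float:
    """Surprise with a probability floor so an unseen pitch type
    yields a large finite value instead of -log(0) = infinity."""
    return -math.log(max(p_actual, floor))

floored_surprise(0.25)  # about 1.39 nats
floored_surprise(0.0)   # about 27.6 nats, not infinity
```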

Step 7: Pitcher-Level Aggregation

pitcher_level:
  n_pitches_test        = count of test pitches for this pitcher
  mean_surp_model       = mean(surprise_model across test pitches)
  mean_surp_base        = mean(surprise_baseline across test pitches)
  unpredictability_ratio = mean_surp_model / mean_surp_base

Step 8: Standardization

Using the reference μ and σ from the training population:

deception_plus = 100 + 10 × ((unpredictability_ratio - μ) / σ)

Step 9: Name Resolution

Pitcher names are resolved via the MLB Stats API in batches of 100 IDs:

GET https://statsapi.mlb.com/api/v1/people?personIds=id1,id2,...

Results are cached in cache/mlbam_name_cache.csv. If a name cannot be resolved, the pitcher is labeled Pitcher_{mlbam_id}.
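The batching of 100 IDs per request can be sketched as URL construction (illustrative Python; only the endpoint shown above is taken from the source, and the function name is hypothetical):

```python
def name_lookup_urls(mlbam_ids, batch_size=100):
    """Build one MLB Stats API URL per batch of up to 100 player IDs."""
    base = "https://statsapi.mlb.com/api/v1/people?personIds="
    return [
        base + ",".join(str(i) for i in mlbam_ids[start:start + batch_size])
        for start in range(0, len(mlbam_ids), batch_size)
    ]

urls = name_lookup_urls(list(range(1, 251)))
# 250 IDs -> 3 requests: batches of 100, 100, and 50
```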


10. Execution Modes

Full Season Analysis (run.R)

Designed for end-of-season or custom-range analysis. Trains one league-wide model on the specified training period, then evaluates all pitchers against the test period.

Key configuration:

MIN_TEST_PITCHES  <- 100   # Minimum test-period pitches to include a pitcher
MIN_TOTAL_PITCHES <- 250   # Minimum combined pitches
SPLIT_METHOD      <- "temporal"  # or "random"
BASELINE_TYPE     <- "conditional"  # or "marginal" or "hybrid"
TRAIN_LEVEL       <- "MLB"  # or "AAA"

Daily Analysis (run_daily.R)

Designed for daily production runs. For each pitcher who threw pitches on the target date:

  1. Pull their last 500 pitches as individual training data
  2. Train a per-pitcher multinomial model
  3. Evaluate today's pitches
  4. Standardize using the fixed baseline_params.rds

Command-line usage:

Rscript run_daily.R [YYYY-MM-DD] --level [MLB|AAA] --n_history 500 --min_history 100

Output path: output/{year}/{month}/{day}.csv

The daily output also includes role (starter or reliever) and status columns (see Section 11).

Baseline Computation (compute_baseline.R)

Run once (or periodically) to establish the reference distribution for standardization:

  1. Load 2+ years of historical data
  2. Run 100 independent random 50/50 splits
  3. For each split, train per-pitcher models and evaluate on the held-out half
  4. Compute μ and σ of the resulting unpredictability ratios
  5. Save to baseline_params.rds

11. Output Columns

Core Columns (all modes)

| Column | Type | Description |
| --- | --- | --- |
| pitcher_id | integer | MLB Advanced Media (MLBAM) player ID |
| pitcher_name | character | Full name resolved from MLB Stats API |
| total_pitches | integer | Pitches across training + test periods combined (may double-count overlapping periods) |
| n_pitches_test | integer | Pitches in the test period used for evaluation |
| mean_surp_model | numeric | Average per-pitch surprise from the full model (nats) |
| mean_surp_base | numeric | Average per-pitch surprise from the baseline model (nats) |
| ppi | numeric | Pitch Predictability Index: 1 - (mean_surp_model / mean_surp_base), clamped to [-1, 1] |
| unpredictability_ratio | numeric | mean_surp_model / mean_surp_base |
| deception_plus | numeric | Final standardized score (mean = 100, SD = 10) |

Additional Columns (daily mode only)

| Column | Type | Description |
| --- | --- | --- |
| role | character | "starter" or "reliever" based on pitch count threshold |
| status | character | Why a pitcher may be excluded (see below) |

Status values:

| Status | Meaning |
| --- | --- |
| "evaluated" | Normal result; pitcher had enough history and test pitches |
| "debut_no_history" | Pitcher has no prior MLB Statcast data |
| "insufficient_history" | Fewer than min_history pitches in historical record |
| "insufficient_test_pitches" | Threw too few pitches today to produce a reliable estimate |

12. Pitch Type Taxonomy

All Statcast pitch type codes are canonicalized to the following 14 types (plus an OTHER bucket) before any modeling:

| Code | Pitch Name |
| --- | --- |
| FF | 4-seam fastball |
| SI | Sinker |
| FT | 2-seam fastball |
| FC | Cutter |
| FS | Splitter |
| CH | Changeup |
| SL | Slider |
| CU | Curveball |
| KC | Knuckle curve |
| SV | Slurve |
| CS | Slow curve / curveball-slider variant |
| ST | Sweeper |
| KN | Knuckleball |
| FO | Forkball |
| OTHER | Any unrecognized or unusual pitch code |

Multiple raw Statcast labels map to the same canonical code (e.g., sweeper maps to SL in some historical data). Canonicalization uses case-insensitive pattern matching.
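Canonicalization can be sketched as a case-insensitive lookup with an OTHER fallback (illustrative Python; the mapping shown is a small hypothetical sample, not the pipeline's full table):

```python
# Partial example mapping; the real pipeline covers every raw Statcast label.
CANONICAL = {
    "FF": "FF", "FOUR-SEAM FASTBALL": "FF",
    "SI": "SI", "SINKER": "SI",
    "SL": "SL", "SLIDER": "SL",
    "ST": "ST", "SWEEPER": "ST",
}

def canonicalize(raw_code: str) -> str:
    """Map a raw Statcast pitch label to its canonical code,
    matching case-insensitively; unknown labels become OTHER."""
    return CANONICAL.get(raw_code.strip().upper(), "OTHER")

canonicalize("sweeper")  # "ST"
canonicalize("FF")       # "FF"
canonicalize("EEPHUS")   # "OTHER"
```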


13. Minimum Sample Requirements

| Context | Minimum | Rationale |
| --- | --- | --- |
| Historical training data (per-pitcher daily mode) | 100 pitches | Below this, model coefficients are unreliable |
| Test period pitches | 100 pitches (full season), 30 pitches (daily) | Below this, mean surprise estimates are noisy |
| Total pitches to appear in output | 250 (full season) | Prevents outlier scores from tiny samples |
| Model convergence (reliable) | 1,000+ pitches | Stable coefficients across all features |

Pitchers below minimum thresholds are excluded from the output CSV entirely in full-season mode, or flagged with an appropriate status value in daily mode.


14. Interpreting Extreme Scores

High Deception+ (115+): What's Happening

  • Situational independence: Pitch selection does not shift based on count, runners, or score
  • Sequence independence: Previous pitch does not predict the next
  • Balanced usage in "obvious" situations: Doesn't default to fastball on 3-0, doesn't always throw offspeed with 2 strikes
  • Not necessarily a large arsenal: Two-pitch pitchers can be elite if their two pitches appear without detectable rules

Real example: A reliever with only a fastball and slider who alternates them seemingly at random regardless of count, batter, or score will have a very high ratio — neither model can anticipate which he will throw.

Low Deception+ (85 and below): What's Happening

  • Strong count patterns: e.g., always throws fastball 0-0, always throws offspeed with 2 strikes
  • Strict sequencing: e.g., always follows fastball with breaking ball
  • Situation-dependent patterns: dramatically shifts pitch mix with runners on base
  • Expected edge cases: position players pitching (tiny arsenal, no strategy), knuckleballers (throws one pitch), pitchers with depleted arsenals due to injury

Real examples from 2025 data: Matt Waldron (knuckleballer — essentially one pitch), Enrique Hernandez (position player — predictable by necessity).


15. Validated Correlations

The following relationships have been observed in 2025 Statcast data for starters with 1,500+ pitches in a season:

| Outcome | Direction | Interpretation |
| --- | --- | --- |
| xFIP | Negative | Higher Deception+ → lower xFIP → better performance |
| SIERA | Negative | Higher Deception+ → lower SIERA → better performance |
| SwStr% | Positive | Higher Deception+ → more swinging strikes |
| K% | Positive | Higher Deception+ → more strikeouts |

Effect sizes are meaningful but not overwhelming — unpredictability is one factor among many. The correlations persist after controlling for raw pitch quality metrics.

Role-specific findings:

  • Starters: Effect is strongest, likely because facing the same batter 2–3 times in a game amplifies the value of unpredictability
  • Relievers: Effect is most pronounced in high-leverage appearances; smaller effect in blowouts

16. Known Limitations

| Limitation | Impact | Future Direction |
| --- | --- | --- |
| No catcher game-calling effects | Catcher influence on pitch selection is unmeasured | Pitcher-catcher dyad analysis planned |
| Linear model only | Non-linear interaction effects not captured | XGBoost / random forest comparison in progress |
| Single previous pitch | Only one-pitch sequencing (not multi-pitch patterns) | Multi-pitch sequence features planned |
| No leverage weighting | A walk-off strikeout and a garbage-time pitch count equally | Leverage-weighted surprise being evaluated |
| No platoon splits | Score is averaged across vs. LHH and vs. RHH | Separate platoon scores planned |
| Training/test temporal assumption | Assumes patterns learned in training persist to test; may drift if pitcher makes in-season adjustments | Rolling window models being tested |

17. Glossary

| Term | Definition |
| --- | --- |
| Surprise | -log(P), where P is the predicted probability of the actual pitch. Higher = more unexpected |
| Unpredictability Ratio | Mean model surprise divided by mean baseline surprise for a given pitcher |
| Full model | Multinomial logistic regression using 18 context features |
| Baseline model | Simpler model using only count and handedness features, or a frequency table lookup |
| Marginal baseline | Pitch type frequencies from the overall training set (no situational conditioning) |
| Conditional baseline | Pitch type frequencies within specific count/situation cells |
| Hybrid baseline | Conditional when a cell has ≥5 observations, marginal otherwise |
| PPI | Pitch Predictability Index: 1 - Unpredictability_Ratio, range [-1, 1] |
| Deception+ | Standardized metric: 100 + 10 × z-score of Unpredictability_Ratio |
| MLBAM ID | MLB Advanced Media player identifier used to join to other Statcast data |
| Times through order | How many times a pitcher has faced a particular batter in the current game |
| Canonical pitch type | Standardized pitch code from the 14-type taxonomy used internally |
| Training period | Date range used to fit the model (learn patterns) |
| Test period | Date range used to evaluate the model (measure surprise) |
| Reference population | Set of pitchers whose ratios define the μ and σ for standardization |

Source: Deception+ on GitHub — Conor McGovern, 2025. For commercial licensing inquiries: comcgovern@gmail.com