Defensive WAR is the noisiest part of WAR

Wins Above Replacement is the closest thing baseball has to a one-number summary of player value. It is cited in MVP debates, used as the framework for Hall of Fame arguments, and quoted in contract negotiations and front-office analyses. The metric is genuinely useful. It is also a composite of several underlying calculations whose individual reliability varies enormously, and the defensive component is by far the noisiest of them. End-of-season comparisons of WAR totals almost always understate how much of the difference between two players is just defensive sampling error. The noise is large enough that several famous "WAR says X is better than Y" arguments stop being defensible once you trace the gap down to its components.

How WAR is built

WAR for position players combines five things: batting runs, baserunning runs, fielding runs, positional adjustment, and a replacement-level baseline. Batting runs are computed from roughly 600 plate appearances per season of clean, well-recorded events: balls in play, walks, strikeouts, home runs, situational outcomes. The batting component stabilizes quickly and is reliable to within about 2-3 runs after a full season. Baserunning is a smaller number but is also reasonably stable. The positional adjustment is a fixed run value assigned by position, not a measured quantity, so it carries no measurement error of its own. The replacement-level baseline is a constant. Four of the five WAR components, in other words, are either deterministic or stable within a few runs across a full season.
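The composition described above can be sketched in a few lines. This is a rough illustration, not any site's published formula: it assumes every component is expressed in runs and simply added, and it assumes a round conversion of 10 runs per win. The class name, field names, and the sample season are all hypothetical.

```python
from dataclasses import dataclass

# Assumed round conversion from runs to wins; not a figure from this article.
RUNS_PER_WIN = 10.0


@dataclass
class SeasonRuns:
    """Run components for one position-player season (hypothetical layout)."""
    batting: float
    baserunning: float
    fielding: float      # the noisy component discussed in this article
    positional: float    # fixed value by position, no measurement error
    replacement: float   # constant baseline

    def total_runs(self) -> float:
        # Assumption: components are additive run values.
        return (self.batting + self.baserunning + self.fielding
                + self.positional + self.replacement)

    def war(self) -> float:
        return self.total_runs() / RUNS_PER_WIN


# Invented numbers for an illustrative shortstop season.
season = SeasonRuns(batting=30.0, baserunning=2.0, fielding=8.0,
                    positional=7.0, replacement=20.0)
print(round(season.war(), 1))
```

Writing it this way makes the point of the section visible: only the `fielding` field carries a large measurement error, yet it is added into the total on equal footing with the rest.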

The fielding component is the problem. Defensive runs above average are computed from a much smaller sample of events than batting runs: a shortstop fields perhaps 450 ground balls a year, an outfielder gets perhaps 350 catchable balls in his zone. The events themselves are noisier — whether a ball was reachable depends on the fielder's positioning, the batter's contact quality, the wind, the angle, the opposing baserunners. The conversion of those raw events into a runs-saved number requires a model that tries to attribute responsibility for each play, and the model introduces its own measurement error.

How noisy is the defensive component

The published year-to-year correlation for defensive runs at the same position is roughly 0.4-0.5. That means a player who posts +10 defensive runs in one season has, in expectation, a true defensive talent that produces something like +4 to +5 runs the next season — but with a wide confidence interval. The standard error on a single season's defensive measurement is roughly 6-8 runs. That is the entire difference between an "average defender" and a "good defender" in the same year's data.
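The regression arithmetic in that paragraph can be written out as a small sketch. It assumes the simplest possible model: next season's expectation is the observed figure shrunk toward zero (league average) by the year-to-year correlation, and the single-season uncertainty is roughly normal. The function names and the choice of a 7-run standard error (the middle of the 6-8 range) are illustrative assumptions.

```python
def expected_next_season(observed_runs: float, year_to_year_r: float) -> float:
    """Shrink an observed runs-above-average figure toward league average (0)."""
    return observed_runs * year_to_year_r


def approx_95_interval(observed_runs: float, se_runs: float) -> tuple[float, float]:
    """Approximate 95% interval, assuming roughly normal measurement error."""
    half_width = 1.96 * se_runs
    return (observed_runs - half_width, observed_runs + half_width)


# A +10 defender under the correlations quoted above.
for r in (0.4, 0.5):
    print(f"r = {r}: expect about {expected_next_season(10.0, r):+.1f} runs next season")

# Assumed 7-run standard error on the single-season figure.
low, high = approx_95_interval(10.0, 7.0)
print(f"approximate 95% interval: {low:+.1f} to {high:+.1f} runs")
```

Under these assumptions the interval around a +10 season reaches below zero, which is another way of saying that a single season cannot reliably separate a good defender from an average one.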

For comparison, the year-to-year correlation for batting runs at the same position is roughly 0.7-0.8, and the standard error on a single season's batting measurement is roughly 2-3 runs. The batting component is, by every published reliability standard, two to three times more stable than the defensive component. Consider two players whose WAR totals differ by 1.5 wins on the season, which is the size of a typical MVP debate gap and about 15 runs at the usual conversion of roughly 10 runs per win. If each player's defensive estimate carries a 6-8 run standard error, the difference between the two estimates carries a standard error of roughly 8-11 runs, so a 15-run gap sits inside a 95% interval for the defensive component alone. The "MVP-by-WAR" arguments that win these debates rest on a gap that is statistically inside the noise band.
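A rough way to check how wide the defensive noise band is when two players are compared is sketched below. It rests on assumptions not stated in the article: the two players' defensive errors are independent, the errors are roughly normal, and 10 runs equal one win. The function name is hypothetical.

```python
import math

# Assumed round conversion from runs to wins.
RUNS_PER_WIN = 10.0


def defensive_gap_band_wins(se_a_runs: float, se_b_runs: float, z: float = 1.96) -> float:
    """Half-width, in wins, of the interval on the difference between two
    players' defensive estimates, assuming independent errors."""
    # Independent errors add in quadrature.
    se_diff_runs = math.hypot(se_a_runs, se_b_runs)
    return z * se_diff_runs / RUNS_PER_WIN


# The 6-8 run single-season standard errors quoted earlier.
for se in (6.0, 8.0):
    band = defensive_gap_band_wins(se, se)
    print(f"SE of {se:.0f} runs each: 95% band of about +/-{band:.2f} wins")
```

Under these assumptions the band comes out wider than 1.5 wins at both ends of the range, before any uncertainty in the batting or baserunning components is added.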

The dWAR-driven Hall of Fame paradox

The clearest illustration of the problem appears in Hall of Fame WAR comparisons. Two players with similar career WAR totals can have profoundly different reliability on the underlying numbers. A player whose career WAR is 65 wins, with batting contributing 50 wins and defense contributing 15, has a more reliable estimate than a player whose career WAR is also 65 with batting at 35 wins and defense at 30. The first player's WAR is well-supported by the most stable component. The second player's WAR depends heavily on the noisiest component, and because a career total is a sum across many seasons, the absolute error in that sum keeps growing with each season rather than averaging out.
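The difference between error that accumulates and error that averages out can be made concrete with a short sketch. It assumes each season's defensive error is independent with the same standard error, which is a simplification: any persistent bias in the fielding model would add up faster than this and is not captured here. The 16-season career and the 7-run standard error are hypothetical values.

```python
import math


def career_total_se(se_per_season: float, seasons: int) -> float:
    """Standard error of a career SUM, assuming independent season errors."""
    return se_per_season * math.sqrt(seasons)


def career_average_se(se_per_season: float, seasons: int) -> float:
    """Standard error of the per-season AVERAGE over the same career."""
    return se_per_season / math.sqrt(seasons)


# Hypothetical 16-season career with a 7-run defensive standard error per season.
seasons, se = 16, 7.0
print(f"career total:       +/-{career_total_se(se, seasons):.1f} runs")
print(f"per-season average: +/-{career_average_se(se, seasons):.2f} runs")
```

The per-season average gets more precise as seasons pile up, but career WAR is quoted as a total, and the uncertainty on the total grows instead of shrinking.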

The defensive-heavy careers that score well on WAR are precisely the ones whose Hall of Fame cases produce the biggest analytics-vs-traditionalist debates. The traditionalists look at the numbers, see a player with modest batting stats, and don't believe the WAR. The analytics community responds by citing the WAR total. Both sides are partly right. The WAR total includes information the traditionalists are underweighting, and the defensive runs portion of that total is noisier than the citation suggests. The honest middle is that the defensive contribution is real but the magnitude is uncertain, and the Hall of Fame argument should reflect that uncertainty rather than treating the WAR total as a fixed number.

The Statcast era hasn't fixed it

Outs Above Average and other Statcast-based defensive metrics, available since 2016, have improved the reliability of defensive measurement substantially. The published OAA year-to-year correlation is closer to 0.5-0.6, which is better than the older UZR or DRS models. The improvement is real and the trend is in the right direction. But OAA is still measurably less reliable than batting runs, and the WAR formulas in widespread use today are mostly still based on the older models, in part because consistent historical data requires using the older models to extend the metric back through pre-Statcast seasons.

The result is that even with the newer tools, the defensive-component noise is still the dominant source of uncertainty in modern WAR comparisons. The batting numbers for the top of any MVP race are usually known to within a few runs. The defensive numbers can swing by ten runs in either direction without contradicting the actual underlying defensive performance. Most "WAR leader by 0.8 wins" arguments are essentially arguments about which player's defensive estimate happened to land on the favorable side of the standard error.

The team-context confound

The defensive component is also confounded by the quality of the pitching staff a fielder plays behind. A shortstop on a team with a ground-ball pitching staff gets many more defensive opportunities than a shortstop on a team with a fly-ball pitching staff. The shortstop with more chances has more variance to absorb in either direction in his runs total, while the shortstop with fewer chances has a per-chance rate that is more sensitive to a small handful of unusual plays. The same true defender will produce different fielding-runs totals depending on the pitching staff in front of him, and the WAR formulas only partially adjust for this.
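One way to see how chance volume interacts with variance is a simple binomial sketch. It assumes every chance is converted with the same probability, which real fielding data does not satisfy, and the chance counts and the 75% conversion rate are invented for illustration.

```python
import math


def plays_made_sd(chances: int, conversion_rate: float) -> float:
    """Standard deviation of the TOTAL plays made, under a binomial model."""
    return math.sqrt(chances * conversion_rate * (1.0 - conversion_rate))


def rate_sd(chances: int, conversion_rate: float) -> float:
    """Standard deviation of the observed conversion RATE, same model."""
    return math.sqrt(conversion_rate * (1.0 - conversion_rate) / chances)


# Same hypothetical defender behind two different pitching staffs.
for label, chances in (("ground-ball staff", 500), ("fly-ball staff", 400)):
    print(f"{label}: total SD {plays_made_sd(chances, 0.75):.2f} plays, "
          f"rate SD {rate_sd(chances, 0.75):.4f}")
```

Under this model the defender with more chances shows a larger spread in his counting total and a smaller spread in his rate, so two identical fielders can post different totals purely because of opportunity volume.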

The same effect appears across teammate quality. A second baseman on a defense with a strong shortstop will have different chance distributions than a second baseman next to a weak shortstop, because positioning, range overlap, and play attribution decisions all shift. The defensive runs column treats both second basemen as if they were measured against the same baseline. They weren't, and the difference leaks into the player-level number that gets cited as evidence of which is the better fielder.

How to read the totals

The honest read on a single-season WAR comparison is that any gap smaller than 1.5-2 wins is well inside the joint noise of the defensive measurements and should be treated as a tie. Players within that range are essentially equivalent by the metric, and the choice between them should be made on non-WAR information: clutch performance over multiple seasons (which has its own small-sample problem, just on a larger scale), durability, leadership intangibles that traditional scouts evaluate, or simple personal preference. "Player X has 7.2 WAR, Player Y has 6.8 WAR, Player X is better" is not a defensible argument given the underlying component reliability.

For career totals, the gap that matters is larger. Two players within 5 career WAR of each other have essentially equivalent careers by the metric, and the cleaner comparison is to look at the batting-runs and baserunning-runs components alone and treat the defensive contribution as a rough tier rather than a precise number. The metric is useful. It is more useful when the noisiest component is treated with the appropriate skepticism rather than averaged into a single total that gets quoted as if it were a verdict. The defensive runs column is the column that deserves the widest error bars. It is the column whose precision is most consistently overstated in how WAR gets cited.
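The reading rule in this section can be expressed as a small helper. This is only a sketch of the rule of thumb: the function name is hypothetical, and the tie bands passed in are the thresholds suggested above (1.5 wins as the lower end of the single-season range, 5 wins for careers), not statistically derived cutoffs.

```python
def compare_war(war_a: float, war_b: float, tie_band: float) -> str:
    """Return "A", "B", or "tie", treating gaps inside the band as ties."""
    gap = war_a - war_b
    if abs(gap) < tie_band:
        return "tie"
    return "A" if gap > 0 else "B"


# Single-season comparison from the text: 7.2 WAR vs 6.8 WAR.
print(compare_war(7.2, 6.8, tie_band=1.5))

# Hypothetical career comparison using the 5-win career band.
print(compare_war(68.0, 65.0, tie_band=5.0))

# A gap that clears the single-season band.
print(compare_war(9.0, 6.5, tie_band=1.5))
```

When the helper returns a tie, the decision moves to the non-WAR information discussed above rather than to the decimal places.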