Why goalies are the hardest position in sports to evaluate

Every major team sport has a position that confounds its statistics worse than the others, and in hockey it's the goaltender. The reason isn't that the position is mystical or that the data is bad. It's that the structure of the position — what a goalie does, when they do it, and how it interacts with the rest of the team — produces measurement problems that no single statistic has fully solved. Goalies are the position where the difference between "what they did" and "what they were responsible for" is largest, and the difference is large enough to swallow any individual season's evaluation.

The signal-to-noise problem

A starting goaltender faces somewhere between 1,500 and 2,000 shots in a full season. That sounds like a lot, but the relevant subset (high-danger chances, where the save percentage runs around .800 and any individual save can swing a game) is much smaller, perhaps two or three hundred per season. The save percentage gap between the league's best starting goaltender and a league-average one is something like 15 to 20 points, or .015 to .020 per shot. Over a season's worth of shots, that translates to roughly 25 to 40 goals saved above average. Over a single game or week, the random variation around a league-average baseline easily swamps that signal.

The implication is that single-season goaltender performance is dominated by short-run noise. A goalie can post a .920 save percentage in October and a .895 save percentage in November and the difference isn't a change in skill; it's the normal range of variation around the same underlying skill level. Public goalie statistics are almost always reported at time scales where the noise overwhelms the signal, and the common conclusions drawn from them — "the team needs a new goalie, he's been bad for two months" — are usually wrong.
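To put rough numbers on that noise, here is a minimal sketch that treats every shot as an independent coin flip with the same save probability. That is a simplification (real shots vary in difficulty), and the .910 skill level and the shot counts per game, month, and season are illustrative assumptions rather than league data.

```python
import math


def save_pct_standard_error(true_save_pct: float, shots: int) -> float:
    """Standard error of an observed save percentage over `shots` shots,
    treating each shot as an independent Bernoulli trial."""
    return math.sqrt(true_save_pct * (1.0 - true_save_pct) / shots)


TRUE_SAVE_PCT = 0.910  # assumed underlying skill (illustrative)
ELITE_EDGE = 0.015     # low end of the 15-to-20-point gap discussed above

# Assumed shot counts for each window (illustrative).
for label, shots in [("one game", 30), ("one month", 300), ("one season", 1800)]:
    se = save_pct_standard_error(TRUE_SAVE_PCT, shots)
    low = TRUE_SAVE_PCT - 2 * se
    high = TRUE_SAVE_PCT + 2 * se
    print(f"{label:>10}: {shots:5d} shots, SE {se:.4f}, "
          f"rough 95% range {low:.3f} to {high:.3f}")
```

Under these assumptions, one month of shots carries a standard error of about 16 points, roughly the size of the entire gap between an elite starter and an average one. A .920 month followed by a .895 month sits comfortably inside the range a single unchanged skill level would produce.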

The shot quality problem

Save percentage doesn't know what shot it's measuring. Letting in five goals on twenty point shots from forty feet looks the same in the box score as letting in five goals on twenty slot one-timers, but they're very different events. The first goalie was the failure; the second goalie was failed by his defense.

Expected goals against, adapted from team-level xG, gives a partial fix. A goalie whose actual save percentage exceeds the league-average save rate on the specific shots he faced has saved goals above expectation; one who falls short has cost his team goals. The stat is called goals saved above expected, or GSAx, and it's been the leading edge of public goalie analytics for several years.

It's better than save percentage. It is not, however, fully clean. Public xG models do not see defender positioning at the moment of the shot. They infer it from proxies such as shot type, location, and whether the shot came off a rebound, and the inference is loose. A goalie who plays behind a defense that chases shooters but lets them get cleanly to slot positions will face shots that look ordinary in the xG model but were in fact harder than the model thinks. The bias runs the other way for tight defensive systems that limit clean looks: those shots are easier than the model thinks, which flatters the goalie. The result is that GSAx is biased by team defensive style in ways that no public model currently fully corrects.
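The GSAx arithmetic itself is simple once a model has assigned each shot a goal probability. The sketch below applies it to the two box scores from the start of this section, five goals on twenty point shots and five goals on twenty slot one-timers. The `Shot` type and the per-shot xG values (.03 and .25) are hypothetical, chosen only for illustration, and do not come from any particular public model.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    xg: float      # model's goal probability for this shot (hypothetical value)
    is_goal: bool  # whether the shot actually went in


def goals_saved_above_expected(shots: list[Shot]) -> float:
    """GSAx: goals the model expected minus goals actually allowed."""
    expected_goals = sum(s.xg for s in shots)
    actual_goals = sum(1 for s in shots if s.is_goal)
    return expected_goals - actual_goals


# Both goalies allow 5 goals on 20 shots, so both post an .750 save percentage.
point_shots = [Shot(xg=0.03, is_goal=(i < 5)) for i in range(20)]
slot_shots = [Shot(xg=0.25, is_goal=(i < 5)) for i in range(20)]

print(f"point shots:     GSAx {goals_saved_above_expected(point_shots):+.1f}")
print(f"slot one-timers: GSAx {goals_saved_above_expected(slot_shots):+.1f}")
```

With these made-up inputs, the goalie beaten from the point comes out about 4.4 goals below expectation while the one facing slot one-timers breaks even, a distinction save percentage cannot make. The result is only as good as the `xg` values fed in, which is where the defensive-style bias enters.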

The equipment problem

Goalie equipment has gotten incrementally larger and more protective over decades, and the league has periodically clawed back some of the growth via rule changes. Every time it does, save percentages across the league move. This is real signal — it tells you that some fraction of the historical variation in save percentage is equipment rather than skill — but it's hard to apply on a per-goalie basis. The Hall of Fame argument over whether a 1990s save percentage of .920 was "better" than a 2020s save percentage of .920 is partly an equipment argument, and the equipment side of it is genuinely difficult to settle.

Equipment also interacts with technique. The butterfly goalies of the late 1990s and 2000s posted historically elite save percentages partly because the equipment they wore made the butterfly the highest-percentage way to cover the net. Earlier goalies playing the same style would have had worse results because the pads didn't cover the net as efficiently. Comparing across eras requires baking in equipment, technique, and league-wide shooting trends, and no public stat does this well.

The starter-backup problem

Goaltender workload distributions across a season produce a survivorship-bias problem. Goalies who play poorly get benched; goalies who play well get more starts. The season-ending statistics for a goalie reflect not just his performance but the team's reaction to his performance. A backup who played 18 games may have a .925 save percentage on a small sample, but his sample is small partly because the team didn't trust him with more games — meaning the .925 may overstate his true ability. The starter playing 65 games at .910 may be the better goalie, even though the rate stat ranks them in the opposite order.

This shows up in trade evaluation. A backup who posts strong rate stats in limited starts often regresses when given a starter's workload elsewhere. The pattern is consistent enough that front offices discount backup-goalie save percentages by sample size, but the public discourse mostly doesn't.
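One common way to discount a rate stat by sample size is to pull it toward the league average, with the strength of the pull set by how few shots the goalie has faced. The sketch below is a generic shrinkage estimate, not any front office's actual method. The league average of .905, the prior weight of 1,500 shots, and the shot totals for the 18-game backup and the 65-game starter are all assumptions made for illustration.

```python
LEAGUE_AVG_SAVE_PCT = 0.905  # assumed league average (illustrative)
PRIOR_SHOTS = 1500           # assumed weight of the league-average prior


def shrunk_save_pct(observed_save_pct: float, shots: int) -> float:
    """Blend the observed save percentage with the league average,
    weighting the observation by the number of shots behind it."""
    observed_saves = observed_save_pct * shots
    prior_saves = LEAGUE_AVG_SAVE_PCT * PRIOR_SHOTS
    return (observed_saves + prior_saves) / (shots + PRIOR_SHOTS)


# Assumed workloads: roughly 28 shots per game.
backup = shrunk_save_pct(0.925, 500)    # about 18 games
starter = shrunk_save_pct(0.910, 1800)  # about 65 games

print(f"backup:  raw .925 -> shrunk {backup:.4f}")
print(f"starter: raw .910 -> shrunk {starter:.4f}")
print(f"gap: raw 15 points -> shrunk {1000 * (backup - starter):.1f} points")
```

Under these assumptions the 15-point raw gap collapses to about 2 points, far too small to separate the two goalies. Shrinkage alone does not flip the ranking; it only shows how little the backup's small sample supports it.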

The "team context" problem

Save percentage and even GSAx don't account for the strength of the goaltender's team. A goalie behind a strong possession team faces fewer high-leverage shots per minute; a goalie behind a defensively shaky team faces a constant barrage. The rate stats normalize per shot, which partly fixes this, but the volume of shots also matters, because every shot is a chance to make a save or to let one in, and the random variation around a goalie's true save percentage shrinks as the shot volume he faces grows, roughly in proportion to one over the square root of the number of shots. High-volume goalies have more stable rate stats. Low-volume goalies have noisier rate stats and less data to evaluate.

The deepest problem is that the goaltender's contribution and the skaters' contribution are not cleanly separable. When a defenseman makes a great stick check, the shot never happens, so the goaltender's stat sheet is unchanged but the defense is better. When the same defenseman fails the same check on a different night, the shot does happen, and the goaltender either saves it or doesn't. A large share of a goaltender's rate stat is determined by which version of the defenseman showed up that month.

What scouts still do

Front offices have not converged on a single goaltender metric, and they don't pretend to have. The standard workflow is to look at the rate stats over multiple seasons — three years minimum, five years if available — for signal stability; to adjust for shot quality with public or proprietary xG-against models; to watch a lot of video; and to weight scouting reports about technique and consistency heavily. The video and scouting parts are not a backup; they are the largest single input in most evaluation processes, because the available statistical inputs are too noisy and too biased by team context to carry the decision alone.

The empirical case for this approach is that draft and trade outcomes for goaltenders evaluated primarily on stats have historically been bad. The bust rate for "analytics darlings" at the goaltender position is much higher than at any skater position. Teams that have invested heavily in statistical goalie evaluation have not produced obviously better results than teams that have leaned on traditional scouting. The opposite is also true; nobody has cracked the position via film alone. The best front offices use both and hold their conclusions tentatively.

The takeaway

The right level of confidence in any goaltender evaluation is lower than it would be for any other position. A scout who tells you he knows the third-line center on a team's roster is a slightly-above-average player can probably back that up with three different stats and two seasons of evidence. A scout telling you the same thing about a starting goaltender is making a claim with more uncertainty attached, whether he says so or not. The math is what it is. The position is genuinely hard.

For a casual viewer, the practical rule is short. Don't decide a goalie is bad based on a month, and don't decide a goalie is great based on a month. Three full seasons is the floor for forming a real opinion. Anything shorter is guessing.