Human Fallibility and the Case for Robot Baseball Umpires

How the ‘gambler’s fallacy’ and anchoring bias influence strike zones.

I, for one, will welcome our robot umpire overlords, at least when it comes to calling balls and strikes. The automated strike zone is coming, probably within the next three seasons, and I am here for it.

If you’ve spent any time on Twitter during baseball season, especially the postseason the last few years, you’ve probably stumbled on fans arguing for #RobotUmpsNow against those who argue for “the human element,” two sides of the ongoing debate over whether baseball should move to automated calling of balls and strikes. (It's a bittersweet topic; I'd kill for a missed strike call at this point, as it would mean we'd have actual baseball to watch again.) It came up yet again in the 2019 World Series, when umpire Lance Barksdale missed two obvious calls in Game 5, one of which he openly blamed on Washington catcher Yan Gomes, which led Nationals manager Davey Martinez to yell at Barksdale to “wake up,” and another so egregious that the victim, Victor Robles, jumped in anger and tossed his batting gloves after Barksdale called him out on a pitch that never even saw the strike zone. Both calls were bad, and in both cases there was at least the appearance that Barksdale was punishing the Nationals—punishing Gomes for assuming the strike call before it happened, then punishing the whole team later for questioning him in the first place. They may have simply been “human errors,” but the perception was worse.


I’m unabashedly in the former camp; calling balls and strikes is a difficult task, virtually impossible for a human to do well (especially when there’s another human, the catcher, sitting in his way), and just a few errant calls can sway the outcome of a game or series. There are some practical arguments against automation, notably that the existing pitch-tracking technology isn’t definitively more accurate than good umpires. But the “human element” argument, that we’re OK with nonplayers affecting the outcomes of games, is codswallop. Humans shouldn’t be making these calls, because humans are subject to so many biases.

We have proof that umpires are biased, too, in at least two ways. I’m not talking about the sort of player-specific bias where Davey Strikethrower always gets the benefit of the doubt on a pitch that’s an inch or two off the plate or Joey Bagodonuts gets squeezed a lot as a hitter because umpires don’t like how much he complains. Those biases may exist, and, yes, they’d go away with an automated system, but the evidence for those biases isn’t very strong, and their effects aren’t universal.

I am talking about two very specific ways in which umpires consistently make mistakes because of cognitive biases, and these are far more pervasive because they’re not player—or even umpire—specific.

If you’re human, you have these cognitive issues, and since umpires are asked to make ball/strike calls immediately after each pitch and have almost zero latitude to change a call even if they think better of it, there is no corrective procedure available to them when they do miss a call. This is not a bug of using human umpires, but a feature.

The first known issue with human umpires is that the way they call a pitch is biased by their calls on the previous pitches, especially the pitch that came right before. There is no reason why the ball/strike status of one pitch should be affected by previous pitches; pitches are independent events, and if you can predict, even with modest success, whether a pitcher is going to throw a ball or strike on his next pitch, then that pitcher is too predictable and hitters will catch on to him.

In a paper published in 2016, Daniel Chen, Tobias Moskowitz, and Kelly Shue report their findings in a study of all pitches tracked by Major League Baseball’s Pitch f/x system, which tracked every pitch thrown in every game and recorded data like pitch location, vertical or horizontal movement, and release point, from 2008 to 2012. They looked at consecutive pitches that were “called” by the umpire—that is, not hit into play, hit foul, swung at and missed, or otherwise not adjudicated by the umpire—and found 900,000 such pairs. They also categorized all called pitches as obvious (that the pitch’s status as a ball or strike was clear) or ambiguous (pitches on or near the edges of the strike zone). They report that 99 percent of “obvious” pitches were called correctly, while only 60 percent of “ambiguous” pitches were.

They began with the specific question of whether an umpire was more likely to call pitch 2 a ball if they had called pitch 1 a strike—that is, whether the call on the previous pitch biased their call on the next one. They found a small but significant effect on all pitches, where umpires were 0.9 percent more likely to call pitch 2 a ball if they’d called the previous pitch a strike, and the effect rose to 1.3 percent if the previous two pitches were called strikes. The effect was more blatant when the next pitch was “ambiguous,” with biasing effects 10 to 15 times larger than those on “obvious” pitches.
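The paper's core measurement can be sketched in a few lines. This is an illustration only: the field names and the toy data below are invented for the example, not the paper's actual Pitch f/x dataset, which contained roughly 900,000 real pairs.

```python
# Sketch of the Chen-Moskowitz-Shue measurement on a hypothetical list of
# consecutive called-pitch pairs. Each pair records the umpire's call on
# pitch 1 and pitch 2; "call1"/"call2" are illustrative field names.

def ball_rate_given_previous(pairs, previous_call):
    """P(pitch 2 called a ball | pitch 1 called `previous_call`)."""
    relevant = [p for p in pairs if p["call1"] == previous_call]
    if not relevant:
        return 0.0
    return sum(p["call2"] == "ball" for p in relevant) / len(relevant)

# Toy sample: after a called strike, balls are called slightly more often.
pairs = (
    [{"call1": "strike", "call2": "ball"}] * 509
    + [{"call1": "strike", "call2": "strike"}] * 491
    + [{"call1": "ball", "call2": "ball"}] * 500
    + [{"call1": "ball", "call2": "strike"}] * 500
)

bias = (ball_rate_given_previous(pairs, "strike")
        - ball_rate_given_previous(pairs, "ball"))
print(f"extra ball calls after a strike: {bias:.1%}")  # 0.9% in this toy data
```

The toy numbers are chosen to reproduce the paper's 0.9 percent headline effect; on unbiased data the two conditional rates would match.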

The authors categorize this as a manifestation of the “gambler’s fallacy,” the errant belief that random or even semi-random outcomes will always even out in a finite sample. For example, gamblers may claim that a roulette wheel that has come up black five times in a row is more likely to come up red on the next spin because the wheel is “due”—which, by the way, you’ll hear quite often about hitters who are having a cold streak at the plate, and which is equally absurd. They also cite the possibility of self-imposed quotas, where umpires might feel that they have to call a certain number or percentage of strikes in each game.

Anchoring effect, a different cognitive bias, provides a simpler explanation: some previous piece of information, independent of the next decision, still affects that decision by changing the mind’s estimate of the probabilities of certain outcomes. The umpire’s call on the previous pitch should have no impact on their call on the next pitch, or on their probability of getting the next call right, but it does, because the umpire’s mind does not treat the two events as independent, even though the umpire may not be aware of this biasing. It could be a matter of an internal quota: “I called that last pitch a strike, so I should try to even things out.” It could be a subconscious expectation: “The last pitch was a strike, and the pitcher isn’t that likely to throw two strikes in a row, so this pitch is more likely to be a ball.” Whatever the cause, the simplest explanation is that the umpire’s mind is anchored on that last called pitch, and the umpire’s internal calibration is thrown off for the next one. That means they’re less likely to get the next call right, and that’s another point in favor of giving the job of calling balls and strikes to machines, not humans.

The anchoring effect was first proposed by Tversky and Kahneman back in 1974, in a landmark paper modestly titled “Judgment Under Uncertainty.” The section title “Adjustment and Anchoring” begins with a statement that sounds obvious but contains multitudes: “In many situations, people make estimates by starting from an initial value that is adjusted to yield the final answer.”

When you are asked to estimate something, or find yourself in a situation where you need to make an estimate for yourself, you don’t just start the thought process from a blank slate. You begin with some piece of information that your mind deems relevant, and then you make adjustments up or down from there based on other factors or how the spirits move you. It’s a mental game reminiscent of The Price Is Right, the popular game show where contestants are often given some price for an item and asked to say whether the actual price is higher or lower. (Some games ask contestants to adjust specific digits of the price, which feels like an anchoring-and-adjustment game within an anchoring-and-adjustment game.) Your mind sets that initial anchor, grasping at whatever number is handy, and then you adjust it from there.

The most shocking result in their paper showed that research subjects’ minds would use totally irrelevant numbers as anchors for estimates. They spun a wheel that showed a random number from 0 to 100 in front of the test subjects and then asked the subjects what percentage of countries in the United Nations were African. They write: “For example, the median estimates of the percentage of African countries in the United Nations were 25 and 45 for groups that received 10 and 65, respectively, as starting points. Payoffs for accuracy did not reduce the anchoring effect.” (The correct answer would have been 32 percent, assuming they did the study in 1973.)

They characterized this as “insufficient adjustment,” although it looks more like “incompetent anchoring.” Their term applies more to their second experiment, in which they asked two groups of high school students to estimate the product of eight numbers, giving them just five seconds to answer. One group received the question as 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1, while the other received it as 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8. The first group’s median guess was 2,250; the second’s was 512.
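For reference, the true answer, which neither group's five-second anchor-and-adjust came anywhere close to:

```python
from functools import reduce
from operator import mul

# Both orderings of course give the same product; the true answer dwarfs
# both groups' medians, which were anchored by the first few factors seen.
descending = reduce(mul, [8, 7, 6, 5, 4, 3, 2, 1])
ascending = reduce(mul, [1, 2, 3, 4, 5, 6, 7, 8])
print(descending, ascending)  # 40320 40320
```

The group that saw the large factors first anchored higher (2,250 versus 512), but both medians were low by more than an order of magnitude.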

Dan Ariely, author of Predictably Irrational, describes a similar experiment he conducted at the Massachusetts Institute of Technology with his colleague Drazen Prelec where they would ask students to bid on some item, but first asked the students to write down the last two digits of their Social Security numbers as if that were the list price of the item. Those students with numbers above 50 bid more than three times as much as those students with numbers below 50. The anchor was meaningless. Its total irrelevance to the question at hand had no effect whatsoever on the students’ brains; the number was in front of them, and therefore it became an anchor from which the students adjusted up or down.

Anchoring and adjustment is one of many cognitive heuristics, or mental shortcuts, we use every day to cope with the sheer volume of information coming into our brains and the number of decisions we are expected to make. You can’t spend six hours at the grocery store trying to figure out whether each item meets or beats your optimal price, nor can you spend an hour each at six grocery stores to comparison shop. You make snap decisions on whether a price is good, and sometimes those decisions will be skewed by misinformation (for example, an item that is on sale may not be a bargain compared to other stores, or even that much of a discount from the regular price).

Umpires are asked to make most of their calls in, at most, about two seconds; when they take longer than that, there will be chirping from one dugout and probably some announcers about a “delayed call.” They make those ball/strike decisions a little faster by the use of heuristics, even ones they’re not quite aware they’re using. My hypothesis, at least, is that they are anchoring and adjusting from the previous pitch, or the previous few pitches, and thus the evidence of bias we see in their calls is the result of a persistent human cognitive error.

Before I continue with how the anchoring bias shows up in baseball, there’s another cognitive error that affects how home plate umpires call pitches, one you may have seen already if you’ve read the wonderful book Scorecasting: The Hidden Influences Behind How Sports Are Played and Games Are Won, by Tobias Moskowitz and L. Jon Wertheim. The book takes a Freakonomics-style look at issues across multiple sports, from home-field advantage to NFL draft pick values to whether “defense wins championships” to why the Chicago Cubs are cursed. (Well, they weren’t, but it’s still a good book.)

Moskowitz was a coauthor of the 2016 paper I cited earlier that looked at umpire accuracy and bias. A second effect that he and his coauthors found (also reported in Scorecasting) was that umpires were much less likely to call a pitch a ball if it would result in a batter drawing a walk, and were less likely to call a strike if it would result in a strikeout. Moskowitz and his coauthors refer to this as impact aversion, which you might think of as a bias toward doing nothing. (In fact, that’s first cousin to another bias, omission bias, which says that we view doing nothing as less harmful than doing something, even if the outcomes are the same.)

In Scorecasting, the authors looked at Pitch f/x data on pitch calls and locations over the 2007–2009 seasons, with 1.15 million called pitches in their sample. Across all situations, they found that umpires made the correct ball/strike call 85.6 percent of the time. However, when the count on the batter went to two strikes, meaning a third would result in a strikeout, and the pitch was within the strike zone, the umpires correctly called the pitch a strike only 61 percent of the time. (They excluded full counts, where either a called strike or ball would end the at bat, and thus impact aversion was not in play.) Umpires’ error rate more than doubled in those situations, likely because they shied away at least a little bit from making a decision that had a higher impact than other called pitches.

The converse situation, where there’s a three-ball count on the batter and the pitch is out of the strike zone, also showed evidence of this impact aversion. Umpires correctly called pitches out of the strike zone as balls 87.8 percent of the time, but in three-ball counts (excluding full counts) they made the correct call just 80 percent of the time. In baseball jargon, the umpire squeezes pitchers with two strikes and expands the zone with three balls.
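Those accuracy figures translate directly into error rates, which makes the size of the impact-aversion effect easy to check:

```python
# Error rates implied by the Scorecasting accuracy figures quoted above.
overall_error = 1 - 0.856        # all called pitches: 14.4%
two_strike_error = 1 - 0.61      # in-zone pitches, two-strike counts: 39%
print(f"{two_strike_error / overall_error:.1f}x")  # ~2.7x: "more than doubled"

ball_error = 1 - 0.878           # out-of-zone pitches overall: 12.2%
three_ball_error = 1 - 0.80      # out-of-zone pitches, three-ball counts: 20%
print(f"{three_ball_error / ball_error:.1f}x")     # ~1.6x
```

The squeeze on two-strike counts is the larger of the two effects, but both are far too big to be noise in a sample of over a million pitches.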

They further demonstrated that the evidence of impact aversion was highest at the two ends of the spectrum of ball-strike counts. Umpires are way more likely to errantly call a ball a strike in 3–0 counts, and way more likely to call a pitch in the strike zone a ball on 0–2 counts. This is hardly a surprise if you’ve watched much baseball; there’s no greater chance of a gift strike call than with a 3–0 count. Writing for the Hardball Times back in 2010, Pitch f/x expert John Walsh found that the strike zone was 50 percent larger in a 3–0 count than it was in an 0–2 count, saying “these umpires are a bunch of softies.”

Walsh goes on to point out that the run values of each count, meaning the expected value to the hitter of any specific ball-strike count, reach their two extremes at 3–0 (+.22 runs to the hitter, in his research) and 0–2 (-.11 runs to the hitter), so by altering the size of the strike zone more in those counts, umpires are flattening the expected values of these at bats—pulling both run values back toward zero. A previous article by Dave Allen, which Walsh references, found that an additional strike in the count had as much effect on the probability that an umpire would call a pitch a strike as would an additional inch of distance away from the center of the strike zone. Allen found that once you controlled for the ball-strike count and the amount of break on a pitch, the changes in the size of the strike zone across pitches became insignificant.

There’s an alternative explanation for this beyond “umpires are dumb.” (I’m not saying that, by the way; I happen to think the job of calling balls and strikes accurately enough in an MLB environment is beyond the capabilities of any human.) Etan Green and David Daniels argue in a 2018 paper that umpires employ statistical discrimination, using disallowed information like the count or batter handedness to improve their decision-making on balls and strikes, and a loose form of Bayesian updating (just nod and keep reading) to make more accurate and more rational calls over the course of a game. Doing so does not require knowing or using Bayes’s theorem, which allows you to calculate the probability of one event based on your prior knowledge of a condition related to the event. Green and Daniels write that this kind of intuitive correction is a heuristic honed over years of practice and constant feedback. A scout or baseball executive might call it “feel.” I see it as further argument that we should turn this job over to machines: if umpires feel the need to use information, like the game state, to get to the desired level of accuracy in ball/strike calls, that is in and of itself a problem with the system.
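The Green-Daniels idea can be illustrated with a toy Bayes's-theorem calculation. Every number below is invented for the example; the point is only that a count-based prior, by itself, can flip how the same ambiguous visual evidence gets called.

```python
# Illustrative Bayesian update: how a count-based prior could shift a
# borderline call. All probabilities here are made up for the example.

def posterior_strike(prior_strike, p_borderline_if_strike,
                     p_borderline_if_ball):
    """P(strike | the pitch looked borderline), via Bayes's theorem."""
    num = prior_strike * p_borderline_if_strike
    den = num + (1 - prior_strike) * p_borderline_if_ball
    return num / den

# Same ambiguous visual evidence (slightly favoring "ball"), but pitchers
# pour in strikes on 3-0, so the count-based prior is higher there.
print(f"{posterior_strike(0.70, 0.4, 0.6):.2f}")  # 0.61 with a 3-0 prior
print(f"{posterior_strike(0.35, 0.4, 0.6):.2f}")  # 0.26 with an 0-2 prior
```

With identical evidence, the 3-0 pitch gets called a strike and the 0-2 pitch a ball, which is exactly the count-dependent zone the data shows, whether or not any umpire consciously runs the numbers.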

Labels about players can be their own form of anchoring, and baseball does love its labels. This guy’s an ace, but this other guy’s just a number two starter. Joey Bagodonuts? He’s a bust. Twerpy McSlapperson’s a grinder, a gamer, a professional hitter (duh), or, my absolute favorite, a baseball player. (Which distinguishes him how, exactly?)

Anchoring bias is pervasive inside or outside of baseball because it is such a fundamental shortcut for our brains. You can see how widespread its effects might be just in the world of baseball. If umpires are subject to anchoring bias in their calling of balls and strikes, then hitters and pitchers would have to try to adjust, consciously or subconsciously, to those variable strike zones from game to game and even within games or within innings. If umpires are especially averse to calling ball four or strike three, that will almost certainly alter how hitters and pitchers approach pitches in those counts. If a manager anchors on the first thing they learn about a player, such as the first live look they have at the player in spring training or in his first few games in the majors, it would likely impact how often the manager uses the player (or doesn’t use him) or how he deploys the player in the lineup or on the field. If general managers use a player’s draft status or signing bonus as an anchor, that’s a potentially large inefficiency for other executives to exploit in trades, or a trap to avoid for yourself in those same situations.

How do you overcome anchoring bias? Like many cognitive biases, anchoring is a heuristic—a shortcut your mind uses to replace what might be a complex evaluation process, one you can’t do in your head or in a short period of time, with a quick one. It’s a gut reaction, and those often aren’t useful or accurate. If you can buy the time to engage in your normal process for making decisions, you always want to do so. Listing the actual variables that should go into a decision, and then basing your evaluation or calculations just on those variables, can give you evidence that is free of the anchoring bias. For example, a major-league general manager may receive a trade offer shortly before the deadline that sounds great because it includes two former first-round picks. They may feel the time pressure to respond quickly, and their unconscious mind may say that it’s a good offer because those two players are former first-rounders (or just because they’re familiar names, which would invoke availability bias as well). It may be a fair offer, but the GM can’t know that without a proper evaluation—speaking to the team’s analysts and scouts about the players involved, gathering essential data, and then using that to drive the decision.

Sometimes the optimal solution will involve removing people from the decision-making process entirely. The existing radar and optical systems that provide MLB teams with Statcast data also allow the league to automate the calling of balls and strikes with an error rate that would be no worse than that of human umpires, and likely lower. The league even experimented with this in the Arizona Fall League last year, resulting in some amusing moments when hitters started to protest strike calls only to realize they had nobody with whom to argue. The cleverly titled Automated Ball-Strike system was in place at all AFL games played at Salt River Fields, the spring training home of the Diamondbacks and the Rockies, because that stadium also has the full setup of Statcast measuring equipment. The cameras track the path of the pitch, a software program determines whether the ball passed through the official strike zone as defined in the rules, and the home plate umpire gets an audio signal indicating whether the pitch was a ball or a strike, after which the umpire can announce the call. It was different, so many players didn’t like it on principle, but it offered consistency that human umpires just can’t match, and no anchoring bias.
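The core zone check such software performs reduces to a little geometry. This is a minimal sketch of that check, not MLB's actual implementation: the plate width comes from the rulebook, the ball diameter is approximate, and the pitch's measured location and the batter-specific zone bounds are taken as given inputs.

```python
from dataclasses import dataclass

PLATE_HALF_WIDTH_FT = 17 / 2 / 12  # home plate is 17 inches wide
BALL_RADIUS_FT = 2.9 / 2 / 12      # ball diameter is roughly 2.9 inches

@dataclass
class Pitch:
    x: float            # horizontal location at the front of the plate, feet
    z: float            # height at the front of the plate, feet
    zone_top: float     # batter-specific zone bounds, feet
    zone_bottom: float

def is_strike(p: Pitch) -> bool:
    """A pitch is a strike if any part of the ball crosses the zone."""
    in_width = abs(p.x) <= PLATE_HALF_WIDTH_FT + BALL_RADIUS_FT
    in_height = (p.zone_bottom - BALL_RADIUS_FT
                 <= p.z <= p.zone_top + BALL_RADIUS_FT)
    return in_width and in_height

print(is_strike(Pitch(x=0.0, z=2.5, zone_top=3.4, zone_bottom=1.6)))  # True
print(is_strike(Pitch(x=1.2, z=2.5, zone_top=3.4, zone_bottom=1.6)))  # False
```

Note what isn't in this function: the count, the previous pitch, the catcher, or the score. A rule this simple applies the same zone on 3-0 as on 0-2, which is the whole point.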

If Major League Baseball chose to automate the calling of balls and strikes, investing further in the existing technology to improve its accuracy at the margins of the zone, the missed calls would at least become more predictable, even without any immediate drop in their frequency, because they’d all come at the edges of the strike zone, where the calls are ambiguous. Machines are not subject to anchoring bias; people are. A computer might mistake a pitch an inch outside of the zone for a strike, but it won’t miss on a pitch right down the middle because prior pitches informed its expectations. Some decisions are just hard for humans to make without bias, because they lack the time to work around it. Recognizing which type of decision you’re facing is the first step in figuring out how to avoid this trap.


Excerpted from the book The Inside Game: Bad Calls, Strange Moves, and What Baseball Behavior Teaches Us About Ourselves, by Keith Law. Copyright © 2020 by Keith Law. From William Morrow, an imprint of HarperCollins Publishers. Reprinted by permission.

