For fans of major-league baseball, one of the highlights of the current season is the rate at which Barry Bonds of the San Francisco Giants is hitting home runs.
Through June 25, Bonds has hit 39 home runs in 77 games, already setting the record for the most home runs before the all-star break in mid-July. At this rate, he could slug 82 homers by the time the 162-game season ends. That number would easily surpass the record-setting 70 home runs that Mark McGwire of the St. Louis Cardinals hit in 1998.
The popular statistics journal known as The Sporting News has gone even further in its trend analysis. In the July 2 issue, the magazine makes the following projections: Bonds would hit 94 home runs if he always batted with nobody on base; 95 home runs if the Giants played a full season against just the San Diego Padres; 98 home runs if he always faced right-handed pitchers; and 102 home runs if every month were May.
Such whimsical projections call attention to the assumptions that underlie statistical attempts to predict outcomes, identify trends, or characterize sequences of events.
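The 82-homer figure quoted above rests on one assumption: that the per-game rate so far holds for the rest of the season. As a minimal sketch of that straight-line extrapolation (the function name and its parameters are illustrative, not taken from any source):

```python
def project_season_total(home_runs, games_played, season_games=162):
    """Linear extrapolation: assume the per-game rate so far holds all season."""
    rate = home_runs / games_played
    return rate * season_games


# Bonds through June 25: 39 home runs in 77 games.
projected = project_season_total(39, 77)
print(round(projected))  # 82
```

Each of the magazine's more fanciful projections amounts to swapping in a different rate, measured over a narrower slice of games, and extrapolating it in the same way.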
In the current issue of the Journal of Recreational Mathematics, economist Paul M. Sommers of Middlebury College in Vermont addresses the question of whether top home-run sluggers knock out homers at random or whether they hit in streaks.
Statistically, a run is a sequence of the same number (or symbol or object) preceded and followed by different numbers (or symbols or objects). In tossing a coin, for example, you might get a run of four consecutive heads, with tails immediately before and after.
Suppose you toss a fair coin 250 times. Approximately 32 runs will consist of two heads or more. About half of these will contain at least one additional head, meaning that we will probably get 16 runs of three heads or more, eight runs of at least four heads, four runs of at least five heads, two runs of six heads or more, and one run of seven heads or more.
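These counts can be checked with a short calculation. A run of at least k heads begins at a given toss if that toss and the next k - 1 are heads and the toss before it (if there is one) is tails. Assuming independent tosses of a fair coin, summing those probabilities over all possible starting positions gives the expected count. The sketch below computes that expectation and compares it with a simulation; the function names are illustrative.

```python
import random


def expected_head_runs(n_tosses, min_len):
    """Expected number of maximal head runs of length >= min_len in n fair tosses."""
    if min_len > n_tosses:
        return 0.0
    # A run starting at toss 1 needs min_len heads in a row.
    at_start = 0.5 ** min_len
    # A run starting later also needs a tail just before it.
    interior = (n_tosses - min_len) * 0.5 ** (min_len + 1)
    return at_start + interior


def count_head_runs(tosses, min_len):
    """Count maximal runs of 1s (heads) with length >= min_len."""
    count = 0
    current = 0
    for toss in tosses:
        if toss == 1:
            current += 1
        else:
            if current >= min_len:
                count += 1
            current = 0
    if current >= min_len:  # a run that reaches the final toss
        count += 1
    return count


def simulate_mean_head_runs(n_tosses, min_len, trials=2000, seed=0):
    """Average count_head_runs over many simulated fair-coin sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tosses = [rng.randint(0, 1) for _ in range(n_tosses)]
        total += count_head_runs(tosses, min_len)
    return total / trials


for k in range(2, 8):
    print(k, round(expected_head_runs(250, k), 2))
```

For 250 tosses the expected counts come out at about 31.3, 15.6, 7.8, 3.9, 1.9, and 1.0 for runs of at least two through seven heads, in line with the rounded figures above.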
Most people are surprised at the occurrence of such long runs in a random sequence. People who are asked to write down a long string of heads and tails that they believe looks random rarely include sequences of four or five heads (or tails) in a row, even though such runs are likely to occur. In fact, it’s generally quite easy to distinguish a human-generated sequence from a random one because the human-written sequence typically contains too few long runs.
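To see how likely such runs are, one can compute the chance that a sequence of fair tosses contains at least one run of a given length, of either heads or tails. The following is a rough sketch, assuming independent fair tosses; it tracks the length of the current run toss by toss, and the function name is illustrative.

```python
def prob_run_at_least(n_tosses, run_len):
    """Probability that n fair tosses contain a run of >= run_len identical outcomes."""
    if n_tosses < 1:
        return 0.0
    if run_len <= 1:
        return 1.0
    # no_hit[L] = probability that the current run has length L and
    # no run of run_len has occurred yet (L ranges over 1..run_len-1).
    no_hit = [0.0] * run_len
    no_hit[1] = 1.0  # after the first toss the current run has length 1
    for _ in range(n_tosses - 1):
        updated = [0.0] * run_len
        for length in range(1, run_len):
            mass = no_hit[length]
            updated[1] += mass / 2  # outcome switches, run restarts
            if length + 1 < run_len:
                updated[length + 1] += mass / 2  # run extends, still short
            # otherwise the run reaches run_len and this mass is dropped
        no_hit = updated
    return 1.0 - sum(no_hit)


print(prob_run_at_least(250, 5))
```

For a string as long as 250 tosses, the result for a run of five is very close to 1, which is why a hand-written "random" string with no such run looks suspicious.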
Sommers examined the records of four home-run hitters: Mark McGwire (1998), Sammy Sosa (1998), Roger Maris (1961), and Babe Ruth (1927). For each player, he analyzed a sequence in which 0 represents a game in which no home run was hit and 1 represents a game in which one or more home runs were hit.
Mark McGwire, for example, hit no home runs in 104 games and one or more home runs in 58 games (games 1, 2, 3, 4, 13, 16, 19, 23, 27, 28, 34, 36, 38, 40, 42, 43, 46, 47, 48, 49, 52, 53, 59, 62, 64, 65, 69, 70, 76, 77, 79, 81, 89, 90, 95, 98, 104, 105, 115, 118, 124, 125, 126, 129, 130, 132, 136, 138, 139, 141, 143, 144, 151, 154, 156, 160, 161, and 162).
Sommers applied a statistical tool called a runs test to the resulting sequence of 1s and 0s. “The length of the longest ‘run’ was nine, which might be regarded as McGwire’s longest slump (successive games in which no home run was hit),” Sommers notes. “McGwire’s longest ‘hot streak’ (successive games with one or more home runs per game) was four games and occurred twice during the season.”
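The slump and hot-streak figures can be reproduced directly from the list of games given above. The sketch below encodes the season as 1s and 0s and measures each run; the variable and function names are illustrative, and only the game numbers come from the article.

```python
from itertools import groupby

# Games (numbered 1-162) in which McGwire hit at least one home run in 1998.
HOMER_GAMES = [
    1, 2, 3, 4, 13, 16, 19, 23, 27, 28, 34, 36, 38, 40, 42, 43, 46, 47, 48,
    49, 52, 53, 59, 62, 64, 65, 69, 70, 76, 77, 79, 81, 89, 90, 95, 98, 104,
    105, 115, 118, 124, 125, 126, 129, 130, 132, 136, 138, 139, 141, 143,
    144, 151, 154, 156, 160, 161, 162,
]
SEASON_GAMES = 162


def encode_season(homer_games, season_games):
    """Return a list with 1 for each homer game and 0 for every other game."""
    homer_set = set(homer_games)
    return [1 if game in homer_set else 0 for game in range(1, season_games + 1)]


def run_lengths(sequence):
    """Return (symbol, length) pairs, one per maximal run."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]


season = encode_season(HOMER_GAMES, SEASON_GAMES)
runs = run_lengths(season)

longest_slump = max(length for symbol, length in runs if symbol == 0)
longest_streak = max(length for symbol, length in runs if symbol == 1)
streak_count = sum(
    1 for symbol, length in runs if symbol == 1 and length == longest_streak
)

print(longest_slump, longest_streak, streak_count)  # 9 4 2
```

The nine-game slump falls between games 105 and 115, and the two four-game streaks are games 1 through 4 and games 46 through 49.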
“The statistical evidence suggests that McGwire. . .belted his 70 homers out of the park at random,” Sommers concludes. “An analysis of the data on Sosa, Maris, and Ruth suggest that they too punched them out at random.”
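The article does not say which form of the runs test Sommers used. As an illustration only, the sketch below applies the standard Wald-Wolfowitz runs test with a normal approximation to McGwire's sequence, which is an assumption about the method rather than a reproduction of the paper. It compares the observed number of runs with the number expected if the 58 homer games were scattered at random among the 162.

```python
import math

# Games (numbered 1-162) in which McGwire hit at least one home run in 1998.
HOMER_GAMES = [
    1, 2, 3, 4, 13, 16, 19, 23, 27, 28, 34, 36, 38, 40, 42, 43, 46, 47, 48,
    49, 52, 53, 59, 62, 64, 65, 69, 70, 76, 77, 79, 81, 89, 90, 95, 98, 104,
    105, 115, 118, 124, 125, 126, 129, 130, 132, 136, 138, 139, 141, 143,
    144, 151, 154, 156, 160, 161, 162,
]


def runs_test(sequence):
    """Wald-Wolfowitz runs test (normal approximation) on a 0/1 sequence.

    Returns (observed_runs, expected_runs, z_score, two_sided_p_value).
    """
    n1 = sum(sequence)
    n0 = len(sequence) - n1
    n = n0 + n1
    # A new run starts wherever neighboring entries differ.
    observed = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    expected = 2 * n0 * n1 / n + 1
    variance = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n ** 2 * (n - 1))
    z = (observed - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return observed, expected, z, p_value


homer_set = set(HOMER_GAMES)
season = [1 if game in homer_set else 0 for game in range(1, 163)]
observed, expected, z, p_value = runs_test(season)
print(observed, round(expected, 1), round(z, 2))
```

Under these assumptions the sequence has 73 runs against roughly 75.5 expected, for a z-score near -0.42. That is well inside the range of plus or minus 1.96 conventionally used at the 5 percent level, which is consistent with the conclusion Sommers reports. Far fewer runs than expected would point to streakiness, while far more would suggest homers spaced too evenly to be random.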
At the end of the current season, it’ll be possible to check whether the home-run performance of Barry Bonds fits the same statistical profile, unless a factor such as injury comes into play to cut short Bonds’ pursuit of the home-run record.