ELO rating system

The ELO rating system is a method for calculating the relative skill levels of players in two-player games such as chess and Go. It is also used as a rating system for competitive multiplayer play in a number of computer games. It was originally invented as an improved chess rating system. ELO is often written in capital letters, but is not an acronym. It is the family name of the system's creator, Árpád Élő (1903-1992), a Hungarian-born American physics professor. "ELO" is written in uppercase to distinguish it from Professor Élő. He spelled his own name "Elo" after he left Hungary, a common anglicization.

Contents

1 A statistical system, not a reward system

2 Élő's rating system model

3 Implementing Élő's scheme

4 Comparative ratings

5 Mathematical details

6 See also

7 External links

A statistical system, not a reward system

Árpád Élő was a master-level chess player and an active participant in the United States Chess Federation (USCF) from its founding in 1939. The USCF used a numerical ratings system, devised by Kenneth Harkness, to allow members to track their individual progress in terms other than tournament wins and losses. The Harkness system was reasonably fair, but in some circumstances gave rise to ratings which many observers considered inaccurate. On behalf of the USCF, Élő devised a new system with a statistical flavor.

It was (and still is) daring to substitute statistical estimation for a system of competitive rewards. Rating systems for many sports award points in accordance with subjective evaluations of the greatness of certain achievements. For example, winning an important golf tournament might be worth five times as many rating points as winning a lesser tournament, and taking third place might be worth half the points of taking first place, etc.

A statistical endeavor, in contrast, postulates a model of some aspect of reality, and seeks to mathematically estimate, based on observation, the variables in that model. Competitors may still feel that they are being rewarded and punished for good and bad results, but the lofty claim of a statistical system is that it estimates real unknowns, and thus mirrors some hidden truth.

Élő's specific assumptions about the nature of reality are open to doubt, but chess fans praise the accuracy of ELO ratings with a fervor unheard of in other sports. For example, professional tennis ratings are purely rewards based on tournament results. (Statistically rating tennis players would be complicated by variables chess doesn't have, particularly the playing surface, but the rating organizations don't even try for predictive accuracy.) As a result, it is routine for tennis fans to consider the higher-rated player an underdog in a given match. In chess the higher-rated player is regarded as the favorite in almost every case.

Élő's rating system model

Élő's central assumption was that the chess "performance" of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Élő assumed that the mean value of the performances of any given player changes only slowly over time. Élő thought of the mean of a player's performance random variable as that player's true skill.

A further assumption is necessary, because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and say, "That performance is 2039." Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, he is assumed to have performed at a higher level than his opponent for that game. Conversely, if he loses, he is assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.

Élő waved his hands at several details of his model. For example, he did not specify exactly how close two performances ought to be to result in a draw rather than a decisive result. And while he thought it likely that each player might have a different standard deviation to his performance, he made a simplifying assumption to the contrary.
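
The model described above can be illustrated with a minimal sketch. It assumes that both players' performances are independent, normally distributed, and share one standard deviation. The function names and the value of 200 for that standard deviation are illustrative assumptions, not figures given in this article, and draws are ignored.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_outperform(mean_a: float, mean_b: float, sigma: float = 200.0) -> float:
    """Probability that A's performance in one game exceeds B's.

    Assumes independent normal performances with a shared standard
    deviation `sigma` (200 is an illustrative assumption).
    """
    # The difference of two independent normals with equal spread
    # has standard deviation sigma * sqrt(2).
    return normal_cdf((mean_a - mean_b) / (sigma * math.sqrt(2.0)))

# A player whose true skill is 200 points higher outperforms
# the opponent in roughly three games out of four.
print(round(prob_outperform(1700, 1500), 2))
```

Under these assumptions, equal skills give a probability of exactly one half, and the probability grows smoothly with the skill difference.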

To simplify computation even further, Élő proposed a straightforward method of estimating the variables in his model, i.e. the true skill of each player. One could calculate relatively easily, from tables, how many games a player is expected to win based on a comparison of his rating to the ratings of his opponents. If a player won more games than he was expected to win, his rating would be adjusted upward, while if he won fewer games than expected his rating would be adjusted downward. Moreover, that adjustment was to be in exact linear proportion to the number of wins by which the player had exceeded or fallen short of his expected number of wins.

From a modern perspective, Élő's simplifying assumptions are not necessary, because computing power is inexpensive and widely available. Moreover, even within the simplified model, more efficient estimation techniques are well known. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the ELO system has proved to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what his next officially published rating will be, which helps promote a perception that the ratings are fair.

Implementing Élő's scheme

The USCF implemented Élő's suggestions in 1960, and the system quickly gained recognition as being both more fair and more accurate than the Harkness system. Élő's system was adopted by FIDE in 1970. Élő described his work in some detail in the book The Rating of Chessplayers, Past and Present, published in 1978.

Subsequent statistical tests have shown that chess performance is almost certainly not normally distributed. Weaker players have significantly greater winning chances than Élő's model predicts. Therefore, both the USCF and FIDE have switched to systems based on the logistic distribution. However, in deference to Élő's contribution, both organizations are still commonly said to use "the ELO system".

Comparative ratings

The phrase "ELO rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage is confusing and often misleading, because Élő's general ideas have been adopted by many different organizations, including the USCF (before FIDE), the Internet Chess Club (ICC), Yahoo! Games, and the now defunct Professional Chess Association (PCA). Each organization has a unique implementation, and none of them precisely follows Élő's original suggestions. It would be more accurate to refer to all of the above ratings as ELO ratings, and none of them as the ELO rating.

Instead one may refer to the organization granting the rating, e.g. "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." The ELO ratings of these various organizations are not always directly comparable. For example, someone with a FIDE rating of 2500 will generally have a USCF rating near 2600 and an ICC rating in the range of 2500 to 3100.

The following analysis of the July 2003 FIDE rating list gives a rough impression of what having a given FIDE rating means:

A player rated between 2400 and 2499 is likely to have the International Master title.
A player rated 2500 or above is likely to have the Grandmaster title.
113 players have a rating of 2600 or above.
16 players have a rating of 2700 or above.
1 player (Garry Kasparov) has a rating of 2800 or above.

The highest ever FIDE rating was 2851, which Garry Kasparov had on the July 1999 and January 2000 lists.

Mathematical details

Performance can't be measured absolutely; it can only be inferred from wins, draws and losses. Ratings therefore have meaning only relative to other ratings, so both the average and the spread of ratings can be chosen arbitrarily. Élő suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of approximately 0.75, and the USCF initially aimed for an average club player to have a rating of 1500.

A player's expected score is his probability of winning plus half his probability of drawing. Thus an expected score of 0.75 could represent a 75% chance of winning, 25% chance of losing, and 0% chance of drawing. On the other extreme it could represent a 50% chance of winning, 0% chance of losing, and 50% chance of drawing. The probability of drawing, as opposed to having a decisive result, is not specified in the ELO system. Instead a draw is considered half a win and half a loss.
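
The definition above is simple enough to check directly. The following sketch uses a hypothetical helper name and reproduces the two extreme cases mentioned in the text.

```python
def score_from_probabilities(p_win: float, p_draw: float) -> float:
    """Expected score: a win counts 1, a draw counts 1/2, a loss counts 0."""
    return p_win + 0.5 * p_draw

# 75% wins, no draws
print(score_from_probabilities(0.75, 0.0))  # 0.75
# 50% wins, 50% draws
print(score_from_probabilities(0.50, 0.50))  # 0.75
```

Both cases give the same expected score, which is why the ELO system does not need to model the draw probability separately.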

The explanation above applies to games where draws can occur. ELO ratings for games without the possibility of draws (Go, backgammon) are discussed in Go rating with ELO. That article also explains the non-cumulativeness of winning chances for large ELO differences in zero-sum, full-information games such as Go, where the result has a quantity (small or large margin) in addition to a quality (win or loss).

If Player A has true strength <math>R_A</math> and Player B has true strength <math>R_B</math>, the exact formula (using the logistic curve) for the expected score of Player A is

<math>E_A = \frac{1}{1 + 10^{\frac{R_B - R_A}{400}}}</math>.

Similarly the expected score for Player B is

<math>E_B = \frac{1}{1 + 10^{\frac{R_A - R_B}{400}}}</math>.

Note that <math>E_A + E_B = 1</math>. In practice, since the true strength of each player is unknown, the expected scores are calculated using the players' current ratings.
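
The logistic formula translates directly into code. This is a minimal sketch with a hypothetical function name, taking current ratings as stand-ins for true strength.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

e_a = expected_score(1700, 1500)
e_b = expected_score(1500, 1700)

# A 200-point advantage gives an expected score close to 0.76,
# and the two expected scores sum to 1.
print(round(e_a, 3), round(e_b, 3), round(e_a + e_b, 3))
```

Equal ratings give an expected score of exactly 0.5 for each player.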

When a player's actual tournament scores exceed his expected scores, the ELO system takes this as evidence that that player's rating is too low, and needs to be adjusted upward. Similarly when a player's actual tournament scores fall short of his expected scores, that player's rating is adjusted downward. Élő's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player overperformed or underperformed his expected score. The maximum possible adjustment per game (sometimes called the K-value) was set at K=16 for masters and K=32 for weaker players.

Suppose Player A was expected to score <math>E_A</math> points but actually scored <math>S_A</math> points. The formula for updating his rating is

<math>R_A^\prime = R_A + K(S_A - E_A)</math>.
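
As a sketch, the update rule is a one-line function. The function name and the default of K=32 are illustrative choices. Each organization sets its own K-values.

```python
def update_rating(rating: float, actual: float, expected: float, k: float = 32.0) -> float:
    """Linear update: move the rating by K times (actual - expected)."""
    return rating + k * (actual - expected)

# A 1500-rated player beats an equally rated opponent (expected score 0.5).
print(update_rating(1500, 1.0, 0.5))          # 1516.0 with K=32
print(update_rating(1500, 1.0, 0.5, k=16.0))  # 1508.0 with K=16
```

Since the actual score for a single game lies between 0 and 1, and so does the expected score, the rating can never move by more than K points per game.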

This update can be performed after each game or each tournament, or after any suitable rating period. An example may help clarify. Suppose Player A has a rating of 1613, and plays in a five-round tournament. He loses to a player rated 1609, draws with a player rated 1477, defeats a player rated 1388, defeats a player rated 1586, and loses to a player rated 1720. His actual score is (0 + 0.5 + 1 + 1 + 0) = 2.5. His expected score, calculated according to the formula above, was (0.506 + 0.686 + 0.785 + 0.539 + 0.351) = 2.867. Therefore his new rating is (1613 + 32*(2.5 - 2.867)) = 1601.

Note that while two wins, two losses, and one draw may seem like a par score, it is worse than expected for Player A because his opponents were lower rated on average. Therefore he is slightly penalized. If he had scored two wins, one loss, and two draws, for a total score of three points, that would have been slightly better than expected, and his new rating would have been (1613 + 32*(3 - 2.867)) = 1617.
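
The tournament arithmetic above can be reproduced with a short script. It is a sketch that applies the logistic expected-score formula and the linear update with K=32 once over the whole event. The function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def rate_event(rating, opponents, results, k=32.0):
    """Return (new_rating, expected_total) for one rating period."""
    expected = sum(expected_score(rating, opp) for opp in opponents)
    actual = sum(results)
    return rating + k * (actual - expected), expected

opponents = [1609, 1477, 1388, 1586, 1720]

# Loss, draw, win, win, loss: 2.5 points.
new_rating, expected = rate_event(1613, opponents, [0, 0.5, 1, 1, 0])
print(round(expected, 3), round(new_rating))  # 2.867 1601

# Draw, draw, win, win, loss: 3 points.
alt_rating, _ = rate_event(1613, opponents, [0.5, 0.5, 1, 1, 0])
print(round(alt_rating))  # 1617
```

Updating after each game instead, with the rating changing between rounds, would give a slightly different result, because later expected scores would be computed from the adjusted rating.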

This updating procedure is at the core of the ratings used by FIDE, USCF, Yahoo! Games, the ICC, and FICS. However, each organization has taken a different route to deal with the uncertainty inherent in the ratings, particularly the ratings of newcomers, and to deal with the problem of ratings inflation/deflation. New players are assigned provisional ratings, which are adjusted more drastically than established ratings, and various methods (none completely successful) have been devised to inject points into the rating system so that ratings from different eras are roughly comparable.

The principles used in these rating systems can be applied to other competitions, for instance international football matches.