NOGL, ELO and.. Glicko?

Rating words and students: Elo, Glicko, and the tests that feed them

This is a write-up of how we rate two different things on one shared scale:

* WORDS, by how hard they are, and
* STUDENTS, by how able they are,

so that the two numbers are directly comparable. If a word sits at 6000 and a student sits at 6000, the student should get that word right about half the time. That single property is the whole reason to put words and people on the same ruler.

1. The shared scale

Everything lives on a 1..10000 scale.

Each dictionary word has a difficulty rating (we call it nogl_elo), seeded at 5000 (the middle) until we learn otherwise.
Each student has an ability rating per game mode (flashcards, multiple choice, spelling bee, picture tree, memory), seeded at 0.

Motivation

Why seed students at 0 instead of the middle? Motivation. Watching a number climb from zero is more fun for a child than watching one that starts in the middle and barely moves. The cost is that a strong student is under-rated for a while; we come back to that in the section on cold starts.

The scale uses the classic Elo convention: a difference of 400 points means 10-to-1 odds. So a student 400 points above a word wins ~91% of the time; 800 points above, ~99%. We deliberately kept this “400” identical to chess Elo so a point means the same thing here as everywhere else. The absolute range (0..10000) is wider than chess, but a *point* is a point.

2. Plain Elo, and why it isn’t enough

Elo is two formulas. The expected score (probability of success) of a player rated R against an item rated D:

E = 1 / (1 + 10^(-(R - D) / 400))

and the update after a result S (1 = right, 0 = wrong):

R_new = R + K * (S - E)

K is the “step size.” The entire art of plain Elo is choosing K. And here’s the problem: one fixed K cannot be both things we need.

A new student should move fast: we know nothing, so every game is news.
A settled student should move slowly: we know them, so one fluke shouldn’t swing their rating.

A single K is a compromise that is too slow at the start and too jittery later. You can hand-tune K to decay with games played, and people do, but that is a patch. The principled fix is to track, alongside the rating, how sure we are of it. That is what Glicko adds.
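The two formulas above, and the trade-off in K, can be sketched in a few lines. This is an illustration in Python, not the code in our PHP file; the function names and the two K values (32 and 400) are picked only for the example.

```python
def expected_score(rating: float, difficulty: float, scale: float = 400.0) -> float:
    """Probability that a player rated `rating` succeeds on an item rated `difficulty`."""
    return 1.0 / (1.0 + 10.0 ** (-(rating - difficulty) / scale))


def elo_update(rating: float, difficulty: float, score: float, k: float) -> float:
    """Move `rating` by K times the surprise (actual score minus expected score)."""
    return rating + k * (score - expected_score(rating, difficulty))


# A student 400 points above a word succeeds about 91% of the time.
print(round(expected_score(5400, 5000), 3))

# The same surprising win (student at 0 beats a word at 5000) under two K values.
# The surprise is almost exactly 1, so the rating moves by almost exactly K.
for k in (32, 400):  # illustrative values, not our constants
    print(k, round(elo_update(0, 5000, 1, k), 1))
```

With a small K the new student needs well over a hundred such wins to get anywhere near 5000; with a large K a settled student would lurch by hundreds of points on a single fluke. No single value serves both cases.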

3. Glicko: the rating carries a confidence (RD)

Glicko (Mark Glickman’s system, the math behind a lot of modern ladders) pairs every rating with a Rating Deviation, RD: a measure of uncertainty, in the same points as the rating. Think of the true rating as living in a band R +/- RD.

High RD -> we’re unsure -> take big steps (fast cold start).
Low RD -> we’re sure -> take small steps (stable).

RD effectively makes K self-tuning — no hand-decayed K-factor needed. Three things fall out of it for free, and all three matter for a learning app:

(a) Cold start. A brand-new rating has maximum RD, so the first handful of games move it a long way, then it settles automatically.

(b) Idle reinflation. This is Glicko’s signature feature and the one most relevant to us. RD GROWS while a mode sits unused. A kid drills spelling bee for a week, then ignores it for a month: their bee RD swells, so the next result counts more and we stop over-trusting a stale number. In a multi-mode app where modes routinely go cold, this is exactly what you want.

(c) Verifying Difficulty. When picking the next word, a wide RD says “probe broadly to find their level,” a tight RD says “fine-tune.” The rating alone can’t tell you that; the confidence can.
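Idle reinflation (b) is easy to sketch. Standard Glicko-1 grows RD squared linearly with elapsed time and caps it at the starting RD. The Python below assumes that growth law and plugs in the RD0, RD floor, and 60-day idle return listed with our constants in section 4; whether our PHP uses exactly this curve is an assumption of the sketch, and `inflate_rd` is a name made up for the example.

```python
import math

RD0 = 500.0              # starting / maximum uncertainty
RD_FLOOR = 50.0          # most certain we ever claim to be
IDLE_RETURN_DAYS = 60.0  # time for a floor-level RD to climb back to RD0

# Pick the growth constant so that RD_FLOOR reaches RD0 after IDLE_RETURN_DAYS.
C_SQUARED = (RD0 ** 2 - RD_FLOOR ** 2) / IDLE_RETURN_DAYS


def inflate_rd(rd: float, idle_days: float) -> float:
    """Grow RD for `idle_days` without play: RD^2 rises linearly, capped at RD0."""
    return min(RD0, math.sqrt(rd ** 2 + C_SQUARED * idle_days))


for days in (0, 7, 30, 60, 120):
    print(days, round(inflate_rd(RD_FLOOR, days)))
```

Because the growth is in RD squared, most of the reopening happens early: under these assumptions a fully settled rating is already back above RD 350 after a month away.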

The Glicko update is the same shape as Elo but the step depends on RD, and each opponent’s own RD is discounted by a factor g(RD) — an uncertain opponent teaches you less. We implement Glicko-1 (rating + RD). Glicko-2 adds a third number, “volatility,” that tracks how erratic a player is; for a single-student site that’s more machinery than it earns, so we skipped it.
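For readers who want the shape of that update, here is a minimal Glicko-1 sketch in Python. It follows Glickman's published formulas on the 400 scale; it is not a copy of our PHP, and it leaves out our own additions (the RD floor and the per-update cap). The function names are invented for the example.

```python
import math

Q = math.log(10) / 400.0  # converts rating points into natural-log odds


def g(rd: float) -> float:
    """Discount for an opponent's uncertainty: 1.0 at RD 0, smaller as RD grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)


def expected(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score against an opponent, flattened toward 0.5 by the opponent's RD."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko1_update(r, rd, games):
    """Update (r, rd) from a batch of (opponent_rating, opponent_rd, score) tuples."""
    if not games:
        return r, rd
    info = 0.0      # 1/d^2: how much the batch tells us
    surprise = 0.0  # discounted sum of (actual - expected)
    for r_opp, rd_opp, score in games:
        e = expected(r, r_opp, rd_opp)
        info += (Q * g(rd_opp)) ** 2 * e * (1.0 - e)
        surprise += g(rd_opp) * (score - e)
    precision = 1.0 / rd ** 2 + info
    return r + (Q / precision) * surprise, math.sqrt(1.0 / precision)


# One even-odds win, first for an unsure student, then for a settled one.
print(glicko1_update(5000, 500, [(5000, 0, 1)]))
print(glicko1_update(5000, 50, [(5000, 0, 1)]))
```

The same win moves the RD-500 student by about 234 points and the RD-50 student by about 7. That is the self-tuning K in action: nothing was hand-set except the starting RD.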

4. It’s a two-sided system (words rate too)

This is not a fixed exam where the questions have known, frozen difficulties. The words are being rated at the same time as the students — a player-versus-item Elo where both sides learn. A word that strong students keep missing drifts upward in difficulty; a word everyone gets right drifts down.

So the word side wants a confidence too. Today we approximate it with a simple count (noglc — how many times a word has been compared/answered): more data = more trust. A cleaner future step is to give words a real RD like students have, and let the two uncertainties meet in the same formula. Glicko already expects the opponent to have an RD, so the hook is there.
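To make the word side concrete, here is one way a count-based confidence could work. This is a hypothetical sketch in Python: the step schedule in `word_step` and its parameters (`k_max`, `k_min`, `half_at`) are invented for illustration and are not how noglc is actually wired in. The only parts taken from the text are the direction of the move and the idea that more answers mean smaller steps.

```python
def expected_score(rating: float, difficulty: float, scale: float = 400.0) -> float:
    """Probability that the student gets the word right."""
    return 1.0 / (1.0 + 10.0 ** (-(rating - difficulty) / scale))


def word_step(answer_count: int, k_max: float = 200.0, k_min: float = 10.0,
              half_at: float = 20.0) -> float:
    """Hypothetical schedule: the word's step shrinks as its answer count grows."""
    return k_min + (k_max - k_min) / (1.0 + answer_count / half_at)


def update_word(difficulty: float, answer_count: int,
                student_rating: float, student_score: float):
    """The student's success is the word's loss, so the word moves the opposite way."""
    surprise = student_score - expected_score(student_rating, difficulty)
    new_difficulty = difficulty - word_step(answer_count) * surprise
    return new_difficulty, answer_count + 1


# A 6000 student misses a 5000 word: a fresh word jumps, a well-tested word nudges.
print(update_word(5000, 0, 6000, 0))
print(update_word(5000, 200, 6000, 0))
```

Swapping `word_step` for a real RD would turn this into the symmetric Glicko update described above.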

A note we learned the hard way: RD must be scaled to the *odds scale* (the 400), NOT to the rating *range* (10000). We first set the starting RD to 2000 because the scale runs to 10000, and the math detonated: a single win rocketed a student from 0 to 22000, and RD never shrank, because wildly lopsided matchups carry almost no information. (When a result carries little information, the Glicko step grows roughly with RD squared, so quadrupling RD multiplies the jump by about sixteen.) The lesson: keep RD on the order of the odds scale. Glickman’s reference Glicko-1 setup uses RD0 = 350 against a 400 scale; we use 500. The range being large does not mean the uncertainty should be.
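The blow-up is easy to reproduce on paper. The Python sketch below computes the Glicko-1 step for a single win by a student at 0 against a word at 5000. It is a simplified reconstruction, not our original run: it treats the word's difficulty as exactly known (g = 1), which is an assumption.

```python
import math

Q = math.log(10) / 400.0


def expected_score(r: float, d: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-(r - d) / 400.0))


def lopsided_win_step(rd: float, expected: float) -> float:
    """Glicko-1 step for one win, with the opponent treated as exactly known."""
    info = Q ** 2 * expected * (1.0 - expected)  # 1/d^2, tiny when the game is lopsided
    return Q / (1.0 / rd ** 2 + info) * (1.0 - expected)


e = expected_score(0, 5000)  # essentially zero chance of success
for rd in (2000, 500, 350):
    print(rd, round(lopsided_win_step(rd, e)))
```

With RD 2000 the step comes out around 23000 points, the same order as the 22000 we saw; with RD 500 it is around 1400. In both cases the information term is negligible next to 1/RD^2, so RD barely shrinks after the game.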

Our current constants (all tunable in one file, php/glicko.php):

scale        400      (400 points = 10:1 odds)
RD0          500      (starting / maximum uncertainty)
RD floor     50       (most certain we ever claim to be)
idle return  60 days  (a settled RD climbs back to RD0 over ~two months)
max delta    300      (UX governor: most a rating may move per update)

That last one isn’t Glicko; it’s a deliberate game-feel choice. We WANT a visible “period of rise” rather than the system instantly snapping a student to their true level, so we cap how far a rating can move in one update. Read it as a points-per-game ceiling: it sets the MINIMUM length of the climb (e.g. cap 300 -> reaching 6000 from 0 takes at least 20 games). We deliberately set it inside the natural per-game step range (~250-350 under adaptive matchmaking), just BELOW the largest steps, so the cap actually bites on the strong games and smooths the rise, rather than only catching outliers. We cap the visible rating but leave RD honest, so confidence still reflects the real evidence even while the displayed number is being held back. Pure statistics wants fast convergence; this is us trading a little of that for motivation, on purpose.
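The governor itself is just a clamp applied after the Glicko step. A minimal sketch in Python, with function names invented for the example (the real logic lives in our PHP and may be arranged differently):

```python
import math

MAX_DELTA = 300.0  # most a rating may move per update


def apply_governor(old_rating: float, proposed_rating: float,
                   max_delta: float = MAX_DELTA) -> float:
    """Clamp the visible move to +/- max_delta. RD is updated separately, uncapped."""
    delta = proposed_rating - old_rating
    return old_rating + max(-max_delta, min(max_delta, delta))


def min_games_to_reach(target: float, start: float = 0.0,
                       max_delta: float = MAX_DELTA) -> int:
    """Fewest updates needed to climb from `start` to `target` under the cap."""
    return math.ceil((target - start) / max_delta)


print(apply_governor(0, 1439))     # a large proposed jump is held to the cap
print(apply_governor(1000, 1100))  # a small move passes through unchanged
print(min_games_to_reach(6000))
```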

5. The test decides everything

A rating system is only as good as the games you feed it. Two designs with the identical Glicko math converge in 15 games or 150 depending entirely on *which items you show*. The governing principle:

A result is only informative when the outcome was in doubt.

Show a 6000 student a 1000 word and they get it right — you learn nothing you didn’t already assume. The information in a result peaks when the success probability is around 50%, i.e. when item difficulty sits near the student’s ability. So:

Matchmaking at the edge of ability. Pick words where the student is expected to succeed maybe 50-75% of the time — hard enough to be informative, easy enough to not be demoralizing. (The pedagogy and the statistics agree here, which is convenient.)
Don’t just serve items at the current estimate when you’re unsure. We simulated a true-6000 student seeded at 0 and matched only to words near their *current* rating: they beat easy words forever and crawled up ~58 points a game, still mis-rated at 1900 after 25 games. Switch to an ADAPTIVE staircase — raise difficulty after a win, lower it after a miss — and the same student reaches ~5900 in about 20 games. The staircase naturally walks the items up to the student’s true level and parks there, which is exactly where the informative ~50% games live.
Watch RD as well as the rating. In that adaptive run, RD actually *rose* during the long winning climb (all those lopsided wins taught us little) and only tightened once the matchups reached the 50% zone. That’s the system correctly saying “I see them winning, but I haven’t yet found where they lose.” Don’t display a confident level until RD has come down.
Cold start is a tension. Fast convergence wants high RD and items that probe a wide range early. Our “seed at 0 for motivation” choice fights that — a strong beginner must climb the ladder to be discovered. Two honest options: seed at the population middle (5000) with high RD for fast convergence, or keep the motivating 0 and lean on an aggressive early staircase to climb quickly. We chose the second to increase student motivation even though it prolongs the time taken to settle their score to an accurate level.
Batch as “rating periods”. Glicko was designed to digest a batch of games at once (a rating period) rather than one at a time. A play session is a natural period: collect the session’s results, update once. It’s both faster and statistically a little better behaved.
Partial credit is allowed. S doesn’t have to be 0 or 1. “Right but slow,” or “right on the second try,” can be a 0.5 or 0.75. The math doesn’t care, and it squeezes more signal out of each item.
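The “edge of ability” band can be turned into numbers by inverting the Elo curve. The Python sketch below is illustrative only: it uses p(1 - p) as a relative measure of how much one right/wrong result can tell us (the same factor that appears in the Glicko update), and the function names are made up for the example.

```python
import math


def difficulty_for_success(rating: float, p_success: float,
                           scale: float = 400.0) -> float:
    """Invert the Elo curve: the word difficulty at which success probability is p."""
    return rating - scale * math.log10(p_success / (1.0 - p_success))


def information(p_success: float) -> float:
    """Relative information in one right/wrong result; largest at p = 0.5."""
    return p_success * (1.0 - p_success)


student = 6000
for p in (0.50, 0.75, 0.99):
    word = difficulty_for_success(student, p)
    relative = information(p) / information(0.5)
    print(p, round(word), round(relative, 2))
```

For a 6000 student, the 50-75% band means words from 6000 down to roughly 190 points below that. A 75% word still carries three quarters of the information of a coin-flip word, while a 99% word carries about 4%, which is why easy wins barely move anything.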

6. How best to rate the WORDS themselves

Student ability is one half; word difficulty is the other, and it has its own best-practice tests:

Pairwise comparison, i.e. “which of these two is harder?” Cheap, reliable, and it’s how we bootstrap difficulties before we have real answer data: a cohort-sort over pairs (our compc tool) ranks words against each other, and those ranks seed nogl_elo. Sorting by comparisons is far more stable than asking anyone to assign an absolute number.
Let real answers move them. Once students are answering, every answer is also a datum about the word: a word that able students miss climbs. This is just the two-sided Elo from section 4, viewed from the word’s seat.
External priors inform the starting guess. Frequency lists (how common a word is: gfreq, COCA), age-of-acquisition norms, and CEFR levels are decent first guesses for difficulty and worth seeding from. But they’re correlates, not truth: a frequent word can be hard to spell, a rare one easy to recognize. Seed from them, then let the answer data correct them.
Difficulty has a mode. “Difficulty” isn’t one number: a word can be easy to recognize (multiple choice) yet hard to spell (bee) or recall cold (flashcard). That’s the same reason students get a rating PER mode. If word difficulty turns out to vary too much by mode, words may eventually want per-mode difficulties as well.

The statistical grown-up version of all this is Item Response Theory (Rasch / 2PL models), which is what standardized adaptive tests use. Elo/Glicko is, in effect, an online streaming approximation of IRT: cheaper, updates live, and plenty good for our scale. Worth a nod if the blog audience wants the academic thread to pull.
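The link to IRT is more than an analogy: the Elo expected-score curve is the Rasch model with the units changed, since 400 rating points equal ln(10) logits. A quick numerical check in Python (the names are illustrative):

```python
import math

LOGITS_PER_POINT = math.log(10) / 400.0  # 400 rating points = ln(10) logits


def elo_expected(rating: float, difficulty: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-(rating - difficulty) / 400.0))


def rasch_probability(theta: float, b: float) -> float:
    """Rasch (1PL) model: ability theta and item difficulty b, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


rating, difficulty = 6200, 6000
p_elo = elo_expected(rating, difficulty)
p_rasch = rasch_probability(rating * LOGITS_PER_POINT, difficulty * LOGITS_PER_POINT)
print(round(p_elo, 4), round(p_rasch, 4))  # the two probabilities agree
```

What Elo/Glicko changes is the estimation, not the curve: ratings are nudged after each result instead of being fitted to the whole answer history at once.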

7. Summary

One 1..10000 scale; 400 points = 10:1 odds; words seed at 5000, students at 0 (per mode).
Plain Elo’s single K can’t be fast and stable at once.
Glicko fixes that by carrying a confidence (RD): big steps when unsure, small when sure, and — crucially for us — RD re-grows while a mode is idle so stale ratings re-open.
It’s two-sided: words are rated and uncertain too (count today, RD later).
Keep RD scaled to the odds scale (~500), never to the rating range.
The TEST is half the system: probe at the edge of ability (~50-75% success), use an adaptive staircase, batch by session, and trust the level only once RD has fallen.
