Rating words and students: Elo, Glicko, and the tests that feed them
This is a write-up of how we rate two different things on one shared scale:
* WORDS, by how hard they are, and
* STUDENTS, by how able they are,
so that the two numbers are directly comparable. If a word sits at 6000 and a student sits at 6000, the student should get that word right about half the time. That single property is the whole reason to put words and people on the same ruler.
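To make the "same ruler" property concrete, here is a minimal sketch of the expected-score curve, assuming the standard logistic Elo form with the 400-point scale used throughout this write-up. The function name expected_score is ours for illustration, not an identifier from the codebase.

```python
def expected_score(student: float, word: float, scale: float = 400.0) -> float:
    """Probability that a student at `student` gets a word at `word` right.

    Assumes the standard logistic Elo curve: every `scale` points of
    advantage multiplies the odds of success by 10.
    """
    return 1.0 / (1.0 + 10.0 ** ((word - student) / scale))


# Equal ratings: a coin flip.
print(expected_score(6000, 6000))  # 0.5
# A 400-point advantage: 10:1 odds, i.e. 10/11.
print(round(expected_score(6400, 6000), 3))  # 0.909
```

Only the difference between the two ratings matters, which is why a word and a student can share one number line.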
1. The shared scale
- Each dictionary word has a difficulty rating (we call it nogl_elo), seeded at 5000 (the middle) until we learn otherwise.
- Each student has an ability rating per game mode (flashcards, multiple choice, spelling bee, picture tree, memory), seeded at 0 for motivation.
2. Plain Elo, and why it isn’t enough
- A new student should move fast: we know nothing, so every game is news.
- A settled student should move slowly: we know them, and one fluke shouldn’t swing their rating.
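The two wishes above pull a single K in opposite directions. A minimal sketch of the plain Elo step, assuming the standard update rating + K * (score - expected); the K values 16 and 200 are illustrative choices, not values from our system.

```python
def expected_score(rating: float, opponent: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation (400 points = 10:1 odds)."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / scale))


def elo_update(rating: float, opponent: float, score: float, k: float) -> float:
    """One plain Elo step: move by K times the surprise (score - expected)."""
    return rating + k * (score - expected_score(rating, opponent))


for k in (16.0, 200.0):  # hypothetical "stable" and "fast" K values
    # A settled 6000 student flukes an easy 5600 word (expected about 0.91).
    fluke = elo_update(6000.0, 5600.0, 0.0, k) - 6000.0
    # A brand-new student at 0 wins an even match against a word at 0.
    first_win = elo_update(0.0, 0.0, 1.0, k)
    print(k, round(fluke, 1), round(first_win, 1))
# 16.0 -14.5 8.0
# 200.0 -181.8 100.0
```

The small K protects the settled student from the fluke but moves the newcomer only 8 points per even win; the large K does the reverse. No single value does both jobs.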
3. Glicko: the rating carries a confidence (RD)
- High RD -> we’re unsure -> take big steps (fast cold start).
- Low RD -> we’re sure -> take small steps (stable).
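For reference, here is a sketch of the textbook Glicko-1 update (Glickman's published formulas), which takes a list of games so it can digest a whole batch at once. It is not necessarily the exact variant we run, and our system may differ in its details; the function names and the placement of the RD floor are assumptions made for illustration.

```python
import math

Q = math.log(10) / 400.0  # converts rating points to log-odds


def g(rd: float) -> float:
    """Discount factor: an uncertain opponent tells us less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def glicko_update(r, rd, games, rd_floor=50.0):
    """Textbook Glicko-1 step over a batch of games.

    games: list of (opponent_rating, opponent_rd, score), score in [0, 1].
    Returns (new_rating, new_rd). The floor on RD is our assumption.
    """
    if not games:
        return r, rd
    inv_d2 = 0.0   # information gathered from the games
    surprise = 0.0  # weighted sum of (score - expected)
    for opp_r, opp_rd, score in games:
        gj = g(opp_rd)
        e = 1.0 / (1.0 + 10.0 ** (-gj * (r - opp_r) / 400.0))
        inv_d2 += Q * Q * gj * gj * e * (1.0 - e)
        surprise += gj * (score - e)
    denom = 1.0 / (rd * rd) + inv_d2
    new_r = r + (Q / denom) * surprise
    new_rd = max(rd_floor, math.sqrt(1.0 / denom))
    return new_r, new_rd


# Same even-odds win, different confidence: the unsure rating moves far more.
unsure, _ = glicko_update(3000.0, 500.0, [(3000.0, 50.0, 1.0)])
sure, _ = glicko_update(3000.0, 50.0, [(3000.0, 50.0, 1.0)])
print(round(unsure - 3000.0), round(sure - 3000.0))
```

The step size is no longer a fixed K: it falls out of the player's own RD, so the "fast when unsure, slow when sure" behaviour comes for free.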
4. It’s a two-sided system (words rate too)
- scale 400 (400 points = 10:1 odds)
- RD0 500 (starting / maximum uncertainty)
- RD floor 50 (most certain we ever claim to be)
- idle return 60 days (a settled RD climbs back to RD0 over ~two months)
- max delta 300 (UX governor: most a rating may move per update)
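Two of the parameters above are easy to show in code: the idle return and the max-delta governor. This is a sketch under stated assumptions: it uses the usual Glicko rule RD' = sqrt(RD^2 + c^2 * t), and it picks c so that an RD sitting at the floor (50) returns to RD0 (500) after exactly 60 idle days. The constant and function names are ours, and the real schedule may be tuned differently.

```python
import math

RD0 = 500.0              # starting / maximum uncertainty
RD_FLOOR = 50.0          # most certain we ever claim to be
IDLE_RETURN_DAYS = 60.0  # floor-to-RD0 regrowth time
MAX_DELTA = 300.0        # UX governor on a single update

# Assumption: choose c^2 so that RD_FLOOR grows to RD0 in IDLE_RETURN_DAYS.
C2 = (RD0 ** 2 - RD_FLOOR ** 2) / IDLE_RETURN_DAYS


def inflate_rd(rd: float, idle_days: float) -> float:
    """Re-grow uncertainty while a mode sits idle, capped at RD0."""
    return min(RD0, math.sqrt(rd * rd + C2 * max(0.0, idle_days)))


def clamp_delta(old_rating: float, proposed: float) -> float:
    """Limit how far a rating may move in one update."""
    delta = max(-MAX_DELTA, min(MAX_DELTA, proposed - old_rating))
    return old_rating + delta


print(inflate_rd(50.0, 0.0))        # 50.0
print(inflate_rd(50.0, 60.0))       # 500.0
print(clamp_delta(1000.0, 1450.0))  # 1300.0
```

Because the growth is in RD squared, a settled rating loosens quickly at first and then more slowly as it approaches RD0.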
5. The test decides everything
- Matchmaking at the edge of ability. Pick words where the student is expected to succeed maybe 50-75% of the time — hard enough to be informative, easy enough to not be demoralizing. (The pedagogy and the statistics agree here, which is convenient.)
- Don’t just serve items at the current estimate when you’re unsure. We simulated a true-6000 student seeded at 0 and matched only to words near their *current* rating: they beat easy words forever and crawled up ~58 points a game, still mis-rated at 1900 after 25 games. Switch to an ADAPTIVE staircase — raise difficulty after a win, lower it after a miss — and the same student reaches ~5900 in about 20 games. The staircase naturally walks the items up to the student’s true level and parks there, which is exactly where the informative ~50% games live.
- Watch RD as well as the rating. In that adaptive run, RD actually *rose* during the long winning climb (all those lopsided wins taught us little) and only tightened once the matchups reached the 50% zone. That’s the system correctly saying “I see them winning, but I haven’t yet found where they lose.” Don’t display a confident level until RD has come down.
- Cold start is a tension. Fast convergence wants high RD and items that probe a wide range early. Our “seed at 0 for motivation” choice fights that — a strong beginner must climb the ladder to be discovered. Two honest options: seed at the population middle (5000) with high RD for fast convergence, or keep the motivating 0 and lean on an aggressive early staircase to climb quickly. We chose the second to increase student motivation even though it prolongs the time taken to settle their score to an accurate level.
- Batch as “rating periods”. Glicko was designed to digest a batch of games at once (a rating period) rather than one at a time. A play session is a natural period: collect the session’s results, update once. It’s both faster and statistically a little better behaved.
- Partial credit is allowed. S doesn’t have to be 0 or 1. “Right but slow,” or “right on the second try,” can be a 0.5 or 0.75. The math doesn’t care, and it squeezes more signal out of each item.
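The staircase and the session batch fit in a few lines. The sketch below is illustrative only: the 400-point step, the rule that a score of exactly 0.5 holds the level, and the names next_difficulty and play_session are all assumptions, and the simulated student is a stand-in for real answers. It collects one session's results into a list so they can be fed to a single rating update.

```python
import random


def expected_score(student: float, word: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation (400 points = 10:1 odds)."""
    return 1.0 / (1.0 + 10.0 ** ((word - student) / scale))


def next_difficulty(current, score, step=400.0, lo=1.0, hi=10000.0):
    """Adaptive staircase: up after a win, down after a miss.

    Partial credit counts as a win above 0.5 and a miss below it;
    exactly 0.5 holds the level (an assumption of this sketch).
    """
    if score > 0.5:
        target = current + step
    elif score < 0.5:
        target = current - step
    else:
        target = current
    return max(lo, min(hi, target))


def play_session(true_ability, start, games, rng):
    """Simulate one session; return (word_difficulty, score) pairs.

    The returned list is one "rating period": update once from all of it.
    """
    results = []
    difficulty = start
    for _ in range(games):
        won = rng.random() < expected_score(true_ability, difficulty)
        score = 1.0 if won else 0.0
        results.append((difficulty, score))
        difficulty = next_difficulty(difficulty, score)
    return results


session = play_session(6000.0, 0.0, 25, random.Random(0))
for difficulty, score in session[-5:]:
    print(round(difficulty), score)
```

A fixed step is the simplest choice; shrinking the step after each reversal would park the items closer to the 50% point, at the cost of a slower first climb.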
6. How best to rate the WORDS themselves
- Pairwise comparison, i.e. “which of these two is harder?” Cheap, reliable, and it’s how we bootstrap difficulties before we have real answer data: a cohort-sort over pairs (our compc tool) ranks words against each other, and those ranks seed nogl_elo. Sorting by comparisons is far more stable than asking anyone to assign an absolute number.
- Let real answers move them. Once students are answering, every answer is also a datum about the word: a word that able students miss climbs. This is just the two-sided Elo from section 4, viewed from the word’s seat.
- External priors inform the starting guess. Frequency lists (how common a word is: gfreq, COCA), age-of-acquisition norms, and CEFR levels are decent first guesses for difficulty and worth seeding from. But they’re correlates, not truth: a frequent word can be hard to spell, a rare one easy to recognize. Seed from them, then let the answer data correct them.
- Difficulty has a mode. “Difficulty” isn’t one number: a word can be easy to recognize (multiple choice) yet hard to spell (bee) or recall cold (flashcard). That’s the same reason students get a rating PER mode. If a word’s difficulty turns out to vary too much across modes, words may eventually want per-mode difficulties as well.
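Seen from the word's seat, each answer is the same surprise with the sign flipped. The sketch below uses a plain Elo step for both sides and a step size for the word that shrinks as its answer count grows. The schedule in word_k and every constant in it are hypothetical; it only illustrates the idea of using a count as the word's confidence.

```python
def expected_score(student: float, word: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation (400 points = 10:1 odds)."""
    return 1.0 / (1.0 + 10.0 ** ((word - student) / scale))


def word_k(answer_count: int, k0: float = 64.0, k_min: float = 4.0) -> float:
    """Hypothetical schedule: big steps for unseen words, small for veterans."""
    return max(k_min, k0 / (1.0 + answer_count / 20.0))


def update_pair(student, word, score, k_student, k_word):
    """Two-sided step: the student gains what surprises us, the word loses it."""
    surprise = score - expected_score(student, word)
    return student + k_student * surprise, word - k_word * surprise


# An able student (7000) misses a word rated 5000: the word climbs.
_, harder = update_pair(7000.0, 5000.0, 0.0, 32.0, word_k(0))
# The same student gets it right: nearly no news, the word barely moves.
_, same = update_pair(7000.0, 5000.0, 1.0, 32.0, word_k(0))
print(round(harder - 5000.0, 1), round(same - 5000.0, 4))
```

The asymmetry is the point: an expected result carries almost no information about the word, while an upset carries a lot.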
7. Summary
- One 1..10000 scale; 400 points = 10:1 odds; words seed at 5000, students at 0 (per mode).
- Plain Elo’s single K can’t be fast and stable at once.
- Glicko fixes that by carrying a confidence (RD): big steps when unsure, small when sure, and — crucially for us — RD re-grows while a mode is idle so stale ratings re-open.
- It’s two-sided: words are rated and uncertain too (count today, RD later).
- Keep RD scaled to the odds scale (~500), never to the rating range.
- The TEST is half the system: probe at the edge of ability (~50-75% success), use an adaptive staircase, batch by session, and trust the level only once RD has fallen.

