• NOGL, ELO and.. Glicko?

    Rating words and students: Elo, Glicko, and the tests that feed them

    This is a write-up of how we rate two different things on one shared scale:

    * WORDS, by how hard they are, and
    * STUDENTS, by how able they are,

    so that the two numbers are directly comparable. If a word sits at 6000 and a student sits at 6000, the student should get that word right about half the time. That single property is the whole reason to put words and people on the same ruler.

    1. The shared scale

    Everything lives on a 0..10000 scale.
    • Each dictionary word has a difficulty rating (we call it nogl_elo), seeded at 5000 (the middle) until we learn otherwise.
    • Each student has an ability rating per game mode (flashcards, multiple choice, spelling bee, picture tree, memory), seeded at 0.

    Motivation

    Why seed students at 0 instead of the middle? Motivation. Watching a number climb from zero is more fun for a child than watching one that starts in the middle and barely moves. The cost is that a strong student is under-rated for a while; we come back to that in the section on cold starts.
    The scale uses the classic Elo convention: a difference of 400 points means 10-to-1 odds. So a student 400 points above a word wins ~91% of the time; 800 points above, ~99%. We deliberately kept this “400” identical to chess Elo so a point means the same thing here as everywhere else. The absolute range (0..10000) is wider than chess, but a *point* is a point.

    2. Plain Elo, and why it isn’t enough

    Elo is two formulas. The expected score (probability of success) of a player rated R against an item rated D:
    E = 1 / (1 + 10^(-(R - D) / 400))
    and the update after a result S (1 = right, 0 = wrong):
    R_new = R + K * (S - E)
    K is the “step size.” The entire art of plain Elo is choosing K. And here’s the problem: one fixed K cannot be both things we need.
    • A new student should move fast: we know nothing, so every game is news.
    • A settled student should move slowly: we know them, and one fluke shouldn’t swing their rating.
    A single K is a compromise that is too slow at the start and too jittery later. You can hand-tune K to decay with games played, and people do, but that is a patch. The principled fix is to track, alongside the rating, how sure we are of it. That is what Glicko adds.
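
    The two formulas above fit in a few lines. This is an illustrative Python sketch, not the site's PHP; the function names and the default K = 32 (a common chess value) are assumptions made for the example.

```python
def expected_score(rating, difficulty, scale=400.0):
    """Probability that a student rated `rating` gets a word rated `difficulty` right."""
    return 1.0 / (1.0 + 10.0 ** (-(rating - difficulty) / scale))


def elo_update(rating, difficulty, score, k=32.0):
    """One plain-Elo step: move by K times (actual result minus expected result).

    k=32 is an assumed example value, not a constant from this write-up.
    """
    return rating + k * (score - expected_score(rating, difficulty))


if __name__ == "__main__":
    print(round(expected_score(6000, 6000), 3))   # even match -> 0.5
    print(round(expected_score(6400, 6000), 3))   # 400 points above the word -> 0.909
    print(round(elo_update(6000, 6000, 1.0), 1))  # win against an even word -> 6016.0
```

    Whatever K you pick, every student gets the same step size forever, which is the limitation described above.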

    3. Glicko: the rating carries a confidence (RD)

    Glicko (Mark Glickman’s system, the math behind a lot of modern ladders) pairs every rating with a Rating Deviation, RD: a measure of uncertainty, in the same points as the rating. Think of the true rating as living in a band R +/- RD.
    • High RD -> we’re unsure -> take big steps (fast cold start).
    • Low RD -> we’re sure -> take small steps (stable).
    RD effectively makes K self-tuning — no hand-decayed K-factor needed. Three things fall out of it for free, and all three matter for a learning app:
    (a) Cold start. A brand-new rating has maximum RD, so the first handful of games move it a long way, then it settles automatically.
    (b) Idle reinflation. This is Glicko’s signature feature and the one most relevant to us. RD GROWS while a mode sits unused. A kid drills spelling bee for a week then ignores it for a month: their bee RD swells, so the next result counts more and we stop over-trusting a stale number. In a multi-mode app where modes routinely go cold, this is exactly what you want.
    (c) Verifying Difficulty. When picking the next word, a wide RD says “probe broadly to find their level,” a tight RD says “fine-tune.” The rating alone can’t tell you that; the confidence can.
    The Glicko update is the same shape as Elo but the step depends on RD, and each opponent’s own RD is discounted by a factor g(RD) — an uncertain opponent teaches you less. We implement Glicko-1 (rating + RD). Glicko-2 adds a third number, “volatility,” that tracks how erratic a player is; for a single-student site that’s more machinery than it earns, so we skipped it.
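
    For readers who want to see the shape of the update, here is a minimal Python sketch of the standard published Glicko-1 formulas for one rating period. It is not a copy of php/glicko.php; the function names are made up for the example, and it leaves out the RD floor, the idle reinflation and the per-update cap.

```python
import math

Q = math.log(10) / 400.0  # converts rating points into natural-log odds


def g(rd):
    """Discount for an opponent's uncertainty: 1.0 when rd is 0, smaller as rd grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)


def glicko1_update(r, rd, games):
    """Return the new (rating, RD) after one rating period.

    `games` is a list of (opponent_rating, opponent_rd, score) tuples, where
    score is 1 for right and 0 for wrong (fractions work for partial credit).
    """
    if not games:
        return r, rd
    info = 0.0   # sum of g^2 * E * (1 - E): how much the period taught us
    drift = 0.0  # sum of g * (score - E): which way the results point
    for opp_r, opp_rd, score in games:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10.0 ** (-g_j * (r - opp_r) / 400.0))
        info += g_j ** 2 * e_j * (1.0 - e_j)
        drift += g_j * (score - e_j)
    precision = 1.0 / rd ** 2 + Q ** 2 * info  # 1/RD^2 + 1/d^2
    return r + (Q / precision) * drift, math.sqrt(1.0 / precision)
```

    The step size is Q / precision, so a large RD (small 1/RD^2) gives a large step, and every informative game adds to the precision and shrinks the next RD. That is the self-tuning K.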

    4. It’s a two-sided system (words rate too)

    This is not a fixed exam where the questions have known, frozen difficulties. The words are being rated at the same time as the students — a player-versus-item Elo where both sides learn. A word that strong students keep missing drifts upward in difficulty; a word everyone gets right drifts down.
    So the word side wants a confidence too. Today we approximate it with a simple count (noglc — how many times a word has been compared/answered): more data = more trust. A cleaner future step is to give words a real RD like students have, and let the two uncertainties meet in the same formula. Glicko already expects the opponent to have an RD, so the hook is there.
    A note we learned the hard way: RD must be scaled to the *odds scale* (the 400), NOT to the rating *range* (10000). We first set the starting RD to 2000 because the scale runs to 10000, and the math detonated: a single win rocketed a student from 0 to 22000, and RD never shrank, because wildly lopsided matchups carry almost no information. The lesson: keep RD on the order of the odds scale. Chess ships RD0 = 350 against a 400 scale; we use 500. The range being large does not mean the uncertainty should be.
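
    The blow-up is easy to reproduce from the standard Glicko-1 formulas. The Python sketch below is an illustration, not the site's code: it assumes a win over a word 5000 points above the student, and a word with no RD of its own (g = 1). The exact matchup behind the 22000 figure is not recorded here, so treat the numbers as the same order of magnitude rather than a replay.

```python
import math

Q = math.log(10) / 400.0


def upset_win_step(rd, gap):
    """Rating gain and new RD after one win over an item `gap` points above.

    Assumes the item's own RD is 0, so the g() discount is 1.
    """
    e = 1.0 / (1.0 + 10.0 ** (gap / 400.0))             # expected score, close to 0
    precision = 1.0 / rd ** 2 + Q ** 2 * e * (1.0 - e)  # a lopsided game adds almost nothing
    return (Q / precision) * (1.0 - e), math.sqrt(1.0 / precision)


for rd0 in (2000.0, 500.0):
    gain, new_rd = upset_win_step(rd0, gap=5000.0)
    print(f"RD0={rd0:.0f}: gain={gain:.0f}, new RD={new_rd:.0f}")
```

    When the expected score is near 0, the gain is roughly Q * RD^2, so it grows with the square of RD: about 23000 points at RD 2000 versus about 1400 at RD 500, and RD itself barely moves in either case.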
    Our current constants (all tunable in one file, php/glicko.php):
    • scale 400 (400 points = 10:1 odds)
    • RD0 500 (starting / maximum uncertainty)
    • RD floor 50 (most certain we ever claim to be)
    • idle return 60 days (a settled RD climbs back to RD0 over ~two months)
    • max delta 300 (UX governor: most a rating may move per update)
    That last one isn’t Glicko; it’s a deliberate game-feel choice. We WANT a visible “period of rise” rather than the system instantly snapping a student to their true level, so we cap how far a rating can move in one update. Read it as a points-per-game ceiling: it sets the MINIMUM length of the climb (e.g. cap 300 -> reaching 6000 takes at least ~20 games). We deliberately set it inside the natural per-game step range (~250-350 under adaptive matchmaking), below its top end, so the cap actually bites on the strong games and smooths the rise, rather than only catching outliers. We cap the visible rating but leave RD honest, so confidence still reflects the real evidence even while the displayed number is being held back. Pure statistics wants fast convergence; this is us trading a little of that for motivation, on purpose.
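
    Here is one way the constants above could fit together, as a Python sketch rather than the contents of php/glicko.php. The growth rule is an assumption: it uses Glicko's usual sqrt(RD^2 + c^2 * t) form, with c^2 chosen so that an RD sitting at the floor returns to RD0 after exactly the idle-return period. The function names are hypothetical.

```python
import math

RD0 = 500.0        # starting / maximum uncertainty
RD_FLOOR = 50.0    # most certain we ever claim to be
IDLE_DAYS = 60.0   # idle time for a floor-level RD to climb back to RD0
MAX_DELTA = 300.0  # most a displayed rating may move per update

# Assumed growth constant: solves sqrt(RD_FLOOR^2 + c^2 * IDLE_DAYS) = RD0.
C_SQUARED = (RD0 ** 2 - RD_FLOOR ** 2) / IDLE_DAYS


def inflate_rd(rd, idle_days):
    """Grow RD for time spent idle, kept between RD_FLOOR and RD0."""
    grown = math.sqrt(rd ** 2 + C_SQUARED * max(idle_days, 0.0))
    return min(max(grown, RD_FLOOR), RD0)


def clamp_step(old_rating, proposed_rating):
    """Apply the UX governor: limit the visible move to +/- MAX_DELTA."""
    delta = proposed_rating - old_rating
    return old_rating + max(-MAX_DELTA, min(MAX_DELTA, delta))
```

    For example, inflate_rd(50, 30) gives an RD of about 355 after a month away, and clamp_step(0, 1439) shows 300. The clamp touches only the rating; RD is updated separately and stays honest.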

    5. The test decides everything

    A rating system is only as good as the games you feed it. Two designs with the identical Glicko math converge in 15 games or 150 depending entirely on *which items you show*. The governing principle:
    A result is only informative when the outcome was in doubt.
    Show a 6000 student a 1000 word and they get it right — you learn nothing you didn’t already assume. The information in a result peaks when the success probability is around 50%, i.e. when item difficulty sits near the student’s ability. So:
    • Matchmaking at the edge of ability. Pick words where the student is expected to succeed maybe 50-75% of the time — hard enough to be informative, easy enough to not be demoralizing. (The pedagogy and the statistics agree here, which is convenient.)
    • Don’t just serve items at the current estimate when you’re unsure. We simulated a true-6000 student seeded at 0 and matched only to words near their *current* rating: they beat easy words forever and crawled up ~58 points a game, still mis-rated at 1900 after 25 games. Switch to an ADAPTIVE staircase — raise difficulty after a win, lower it after a miss — and the same student reaches ~5900 in about 20 games. The staircase naturally walks the items up to the student’s true level and parks there, which is exactly where the informative ~50% games live.
    • Watch RD as well as the rating. In that adaptive run, RD actually *rose* during the long winning climb (all those lopsided wins taught us little) and only tightened once the matchups reached the 50% zone. That’s the system correctly saying “I see them winning, but I haven’t yet found where they lose.” Don’t display a confident level until RD has come down.
    • Cold start is a tension. Fast convergence wants high RD and items that probe a wide range early. Our “seed at 0 for motivation” choice fights that — a strong beginner must climb the ladder to be discovered. Two honest options: seed at the population middle (5000) with high RD for fast convergence, or keep the motivating 0 and lean on an aggressive early staircase to climb quickly. We chose the second to increase student motivation even though it prolongs the time taken to settle their score to an accurate level.
    • Batch as “rating periods”. Glicko was designed to digest a batch of games at once (a rating period) rather than one at a time. A play session is a natural period: collect the session’s results, update once. It’s both faster and statistically a little better behaved.
    • Partial credit is allowed. S doesn’t have to be 0 or 1. “Right but slow,” or “right on the second try,” can be a 0.5 or 0.75. The math doesn’t care, and it squeezes more signal out of each item.
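
    The “edge of ability” window can be read straight off the expected-score formula by solving it for difficulty. This is an illustrative Python sketch with hypothetical function names; it assumes the 400-point odds scale and treats the information in a right/wrong result as p * (1 - p).

```python
import math


def difficulty_for(rating, p_success, scale=400.0):
    """Word difficulty at which a student rated `rating` succeeds with probability p_success."""
    return rating - scale * math.log10(p_success / (1.0 - p_success))


def information(p_success):
    """Variance of a right/wrong outcome; largest at p = 0.5, near zero for sure things."""
    return p_success * (1.0 - p_success)


def target_window(rating, p_low=0.50, p_high=0.75):
    """(easiest, hardest) word difficulty for the chosen success range."""
    return difficulty_for(rating, p_high), difficulty_for(rating, p_low)


if __name__ == "__main__":
    easiest, hardest = target_window(6000)
    print(round(easiest), round(hardest))  # -> 5809 6000
    print(information(0.5), round(information(0.99), 4))  # -> 0.25 0.0099
```

    So a 50-75% target for a 6000 student means words from roughly 5800 to 6000. The window is narrow on a 10000-point ruler, which is why it matters that the staircase finds the right neighborhood first.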

    6. How best to rate the WORDS themselves

    Student ability is one half; word difficulty is the other, and it has its own best-practice tests:
    • Pairwise comparison, i.e. “which of these two is harder?” Cheap, reliable, and it’s how we bootstrap difficulties before we have real answer data: a cohort-sort over pairs (our compc tool) ranks words against each other and those ranks seed nogl_elo. Sorting by comparisons is far more stable than asking anyone to assign an absolute number.
    • Let real answers move them. Once students are answering, every answer is also a datum about the word: a word that able students miss climbs. This is just the two-sided Elo from section 4, viewed from the word’s seat.
    • External priors inform the starting guess. Frequency lists (how common a word is: gfreq, COCA), age-of-acquisition norms, and CEFR levels are decent first guesses for difficulty and worth seeding from. But they’re correlates, not truth: a frequent word can be hard to spell, a rare one easy to recognize. Seed from them, then let the answer data correct them.
    • Difficulty has a mode. “Difficulty” isn’t one number — a word can be easy to recognize (multiple choice) yet hard to spell (bee) or recall cold (flashcard). That’s the same reason students get a rating PER mode. If word difficulty ever varies by mode too much, words may eventually want per-mode difficulties as well.
    The statistical grown-up version of all this is Item Response Theory (Rasch / 2PL models), which is what standardized adaptive tests use. Elo/Glicko is, in effect, an online streaming approximation of IRT: cheaper, updates live, and plenty good for our scale. Worth a nod if the blog audience wants the academic thread to pull.

    7. Summary

    • One 0..10000 scale; 400 points = 10:1 odds; words seed at 5000, students at 0 (per mode).
    • Plain Elo’s single K can’t be fast and stable at once.
    • Glicko fixes that by carrying a confidence (RD): big steps when unsure, small when sure, and — crucially for us — RD re-grows while a mode is idle so stale ratings re-open.
    • It’s two-sided: words are rated and uncertain too (count today, RD later).
    • Keep RD scaled to the odds scale (~500), never to the rating range.
    • The TEST is half the system: probe at the edge of ability (~50-75% success), use an adaptive staircase, batch by session, and trust the level only once RD has fallen.
  • Bucketing Part of Speech

    For a learner, part of speech is a bucket, not a linguistic analysis. The classic “parts of speech” set is the natural fit, and it lands on a clean 10, discounting ‘particle’.

    ┌─────┬─────────────────────────────────────┬────────┐
    │ #   │ category                            │ tag    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 1   │ noun                                │ n      │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 2   │ verb (incl. modal/aux/linking)      │ v      │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 3   │ adjective                           │ adj    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 4   │ adverb                              │ adv    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 5   │ pronoun                             │ n-p    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 6   │ preposition (incl. infinitive "to") │ prep   │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 7   │ conjunction                         │ conj   │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 8   │ determiner (incl. articles)         │ det    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 9   │ number (incl. ordinal)              │ num    │
    ├─────┼─────────────────────────────────────┼────────┤
    │ 10  │ interjection (exclamation)          │ interj │
    └─────┴─────────────────────────────────────┴────────┘

    This is just the traditional eight parts of speech + det + num. The big four (n/v/adj/adv) soak up ~95% of the language, and everything exotic collapses into a bucket. That is, jump can be a noun or a verb, but we do not care what kind of noun or verb it is, only which of the two it is.
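
    In code, the bucketing is just a lookup table. A small Python sketch: the tags on the right come from the table above, while the fine-grained labels on the left are hypothetical examples of what a source dictionary might supply.

```python
# Hypothetical fine-grained labels on the left; the ten bucket tags on the right.
POS_BUCKETS = {
    "noun": "n",
    "verb": "v", "modal": "v", "auxiliary": "v", "linking verb": "v",
    "adjective": "adj",
    "adverb": "adv",
    "pronoun": "n-p",
    "preposition": "prep", "infinitive to": "prep",
    "conjunction": "conj",
    "determiner": "det", "article": "det",
    "number": "num", "ordinal": "num",
    "interjection": "interj", "exclamation": "interj",
}


def bucket(label):
    """Return the bucket tag for a label, or None if the label is not mapped."""
    return POS_BUCKETS.get(label.strip().lower())


if __name__ == "__main__":
    print(bucket("Modal"), bucket("article"), bucket("particle"))  # -> v det None
```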

  • Building NOGL with small weights

    Language learners acquire grammatical structures in a predictable, natural order, independent of the order in which they are taught. — Dr. Stephen Krashen

    The Natural Order Hypothesis states that there is an internal sequence that the brain tends to follow when building up the grammar of a new language. Some grammatical “chunks” or vocabulary groups are easier and emerge earlier, while others come only later, even if you explicitly study them first.

    For example:

    • English learners typically acquire –ing (progressive) before 3rd person –s (he runs), no matter their first language or the teaching sequence.

    • This pattern is seen even if you teach “he runs” early — learners will still make errors like “he run” until their internal system is ready.

    The point of the NOH is extremely simple. We want to be better learners and better teachers, so some guidelines on what to teach first would be greatly appreciated! We don’t want to waste our time teaching vocabulary and grammar that the students simply cannot learn, or face significant headwinds to learn, versus what they are naturally ready to learn now.

    Examples of Natural Order

    As stated above, progressive -ing is acquired before third person -s. We know other things about the natural order too, thanks to the work done by Dr. Stephen Krashen and linguistic scientists all over the world who have reproduced his NOH experiments and verified the results.

    Input Hypothesis

    Input Hypothesis is the N+1 idea and it connects directly to the Natural Order idea. N represents what you can already understand. +1 is the next small step in the natural order — the next grammatical or lexical item your brain is ready to pick up. So, comprehensible input works best when it’s just a little above your current level, containing features that are the next step in the natural order.

    In short:

    Natural Order gives the “map”; N+1 gives the “step.”

    This is the clue we need to reverse-engineer the natural order.

    Simple Weights and Measures

    By the use of fair and simple weights and measures, we can constrain a score to match a word’s natural order. The first essential weight is anything which determines what a student already knows or doesn’t know. This is a picture of N. With a significant amount of data, we should be able to construct a bell curve that implies a progression through N to N+1 to N+X. For example, a flashcard quiz is given. The cards which are ‘known’ by an intermediate student should have a high probability of being ‘known’ by another intermediate student. That is, the words they know are not random, but very similar. There will be some words one student knows that another student doesn’t know, but for the most part, if one intermediate student knows a word, it can be considered part of N.

    Differential N is also a way to get to N+1. Given the above observation about intermediate students, consider Student 1 and Student 2, where S1N and S2N are the sets of words each one knows. If S1N > S2N, then “+1” is likely to be found in S1N - S2N. That is, the words that S1 knows and S2 doesn’t are highly likely to be close to the +1 that Student 2 needs to satisfy N+1. This is because if S1 is more advanced than S2, it is likely because he has better studied the material he is being tested on. Therefore, if we can establish a standard ‘known’ vocabulary for a certain level, and also establish what words they don’t know, it is then possible to look at all of the intermediate words that some of them know and some of them don’t and place them into three bands:

    • A) Known by approx. 90% of intermediate students
    • B) Known by 10%-20% of intermediate students
    • C) Unknown in general

    In this example, N is represented by A) and +1 is represented by B). These lists of words can be discovered by experimental data. Below I will list some of the data I am using to construct the NOGL scores.
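
    As a rough illustration, both ideas reduce to very little code. This Python sketch uses hypothetical function names and makes two assumptions beyond the text: “approx. 90%” is treated as 90% or more, and “unknown in general” as under 10%. Words known by between 20% and 90% of students get no band, because the three bands above do not cover them.

```python
def differential_n(known_by_s1, known_by_s2):
    """Words S1 knows that S2 does not: candidates for S2's +1."""
    return set(known_by_s1) - set(known_by_s2)


def band(known_rate):
    """Band for a word, given the fraction of intermediate students who know it."""
    if known_rate >= 0.90:
        return "A"  # part of N (assumed cutoff for "approx. 90%")
    if 0.10 <= known_rate <= 0.20:
        return "B"  # candidate +1
    if known_rate < 0.10:
        return "C"  # unknown in general (assumed cutoff)
    return None     # 20%-90%: not covered by the three bands


if __name__ == "__main__":
    print(differential_n({"cat", "dog", "tax"}, {"cat", "dog"}))  # -> {'tax'}
    print(band(0.95), band(0.15), band(0.02), band(0.5))  # -> A B C None
```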

    But first let’s discuss how NOGL is constructed.

    Initial Scores

    Initially, each word in the dictionary is given a unique integer number at random. This is its initial NOGL score. It is also given a NOGLC, which is the confidence rating of that score.

    Word Comparison Tool

    The first method of weighting the scores is the Word Comparison Tool (WCT). The WCT presents two random words and asks the operator (who must be a native speaker) which word is easier. It then exchanges the NOGL scores if the operator chose the word with the higher NOGL score. It also increases the NOGLC of each word by 1, whether or not an exchange was performed. It then chooses the next two words from among the words with the lowest NOGLC scores.

    The noglc increases when:

    • two words are chosen and their order is confirmed as correct by the native speaker operator.
    • two words are chosen and their order is confirmed by the native speaker and changed in the database.

    From this test, then, the noglc represents the number of times a native speaker has confirmed the ease of a word.
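
    The WCT loop described above is small enough to sketch. This is illustrative Python, not the tool's actual code: the dictionary layout and function names are assumptions, and a real tool would presumably pick at random among words tied on noglc instead of taking the first two.

```python
def next_pair(words):
    """Pick the two words with the lowest noglc (least-confirmed first)."""
    a, b = sorted(words, key=lambda w: words[w]["noglc"])[:2]
    return a, b


def record_comparison(words, a, b, easier):
    """Apply the operator's verdict on which of a and b is the easier word."""
    harder = b if easier == a else a
    # Swap only when the easier word currently holds the higher (harder) score.
    if words[easier]["nogl"] > words[harder]["nogl"]:
        words[easier]["nogl"], words[harder]["nogl"] = (
            words[harder]["nogl"], words[easier]["nogl"])
    # Confidence rises for both words whether or not a swap happened.
    words[a]["noglc"] += 1
    words[b]["noglc"] += 1


if __name__ == "__main__":
    words = {
        "pain": {"nogl": 5862, "noglc": 0},
        "conscience": {"nogl": 1911, "noglc": 0},
        "cat": {"nogl": 1000, "noglc": 3},
    }
    a, b = next_pair(words)
    record_comparison(words, a, b, easier="pain")
    print(words["pain"], words["conscience"])
```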

    Automated by CEFR and Frequency

    Although things like CEFR and word frequency do not necessarily imply natural order, it is highly unlikely that an A1 word should have a higher NOGL score than a C1 or C2 word. So, although NOGL is not intended to align with CEFR in the end, automatically processing over CEFR is a good way to push the list “towards” what a human would do, and then let a human clean up the mess left by CEFR. Several automatic iterations were therefore performed using the WCT, with each comparison decided by the words’ CEFR levels instead of by an operator.

    Automated by Game and Test Results

    Between ten and twenty students in Taiwan were given the opportunity to play web-based interactive learning games such as memory with images and words, a spelling game with images and audio cues, and a flashcard game where a word is displayed with its audio and they must choose the correct card (i.e. multiple choice). Students usually preferred to play the multiple choice flashcard game because all they had to do was listen to the audio, read the word and click on an image. While not perfect, it is a rough estimate of what a particular student knows.

    As the games were played, each time a student got a wrong answer, that word’s score would be increased by 1 point. Words were initially shown lowest scores first. This is a little like the WCT: two to four words are shown, and if a word is easy its score remains low, but the words that typically cause students to fail go up and are not shown until later difficulties. This is a slow way of organizing words by what may appear to be natural order.
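
    A minimal Python sketch of that scoring rule, with hypothetical function names and a plain dictionary standing in for the database:

```python
def serve_order(scores):
    """Words in the order they are shown: lowest score first."""
    return sorted(scores, key=scores.get)


def record_answer(scores, word, correct):
    """A wrong answer raises the word's score by 1; a right answer leaves it alone."""
    if not correct:
        scores[word] += 1


if __name__ == "__main__":
    scores = {"eminently": 0, "bullet": 0, "cat": 0}
    record_answer(scores, "eminently", correct=False)
    record_answer(scores, "eminently", correct=False)
    record_answer(scores, "bullet", correct=False)
    record_answer(scores, "cat", correct=True)
    print(serve_order(scores))  # -> ['cat', 'bullet', 'eminently']
```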

    Across four different games and the WCT overseen by several foreigners, scores began to emerge. Initially it was mostly replacements. For example,

    • hope had a score of 10122 and was exchanged with contributor, which had a score of 1997.
    • pain had a score of 5862 and was exchanged with conscience which had a score of 1911.
    • other exchanges observed were authentic vs. mail, taste vs. inadequate, and goldfish vs. chaos.

    Over time as the NOGLC increased, words began to sort into a semblance of order. Here is a comparison of the words with NOGL scores 100, 1000, 2000, 3000 and 6000:

    • 100. Color
    • 1000. Cat
    • 2000. Bullet
    • 3000. Eminently
    • 6000. Needlepoint

    Although not perfect, it does seem that the words are roughly ordered. Consider that there are 10,000 words in the dictionary; if organized into even groups, an A1 rating (1st out of 6) would represent roughly the first 1,670 words. So while you may argue that ‘Cat’ should appear earlier than ‘Color’, they are already close in terms of banding. And this is only with a NOGLC of between 2 and 3, starting from random scores. As NOGLC approaches perhaps 10 and data is added from the tests and games, we expect the NOGL score to improve over time.

    Manual Editing

    With 10,000 words in the dictionary, it is possible that a word just has a bad day. If the easiest word in the world had the most difficult score, and was compared against others randomly 10 times, it is very likely it would end up with a NOGL score close to, but still significantly different from, where it should be. This will be reflected in its NOGLC score. We don’t know what a good NOGLC score would be. The comparisons are random (though they take noglc into account), so without manual editing it may be very difficult to get a correct score even with a noglc > 10.

    The solution is manual editing. The operator simply looks at the list and chooses a word that seems drastically out of place. The operator then places that word where he thinks it belongs. This is a more advanced process since every other word needs to move its nogl score up (or down) by 1 to accommodate the move. Or, use the BASIC principle (line numbers in steps of ten): multiply all scores by ten, insert new words wherever you want, and re-order later. NOGL is relative; it can be recalculated from the existing list if needed. There is no need to remember a nogl score, only its relative position.

    Hmm! Let’s try that. nogl scores will be x10 and in practice we will just drop the last digit. Let’s call the last digit a decimal! Oh, that is interesting.
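
    Here is a quick Python sketch of the times-ten idea, to see how it behaves. The function names are hypothetical, and taking the midpoint between two neighbors is one possible choice, not a settled rule.

```python
def spread(scores, factor=10):
    """Multiply every score by `factor` to open gaps between neighbors."""
    return {word: score * factor for word, score in scores.items()}


def insert_between(scores, word, lower, upper):
    """Place `word` between two existing words without renumbering anything else."""
    lo, hi = sorted((scores[lower], scores[upper]))
    if hi - lo < 2:
        raise ValueError("no free slot between these words; re-spread the scores first")
    scores[word] = (lo + hi) // 2


def displayed(score, factor=10):
    """The score as shown in practice: drop the last digit."""
    return score // factor


if __name__ == "__main__":
    scores = spread({"color": 100, "cat": 101})
    insert_between(scores, "dog", "color", "cat")
    print(scores)                    # -> {'color': 1000, 'cat': 1010, 'dog': 1005}
    print(displayed(scores["dog"]))  # -> 100
```

    Each gap holds nine insertions at most before the list needs re-spreading, which matches the “re-order later” step.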

    In any case, the purpose of NOGL is to discover the natural order.

    Ultimately, if it replicates the mental learning curve demonstrated by actual students, then it must be useful in replicating that mental curve when teaching English. By using these weights and other similar, simple and fair weights, we believe that NOGL will come to approximate the natural order, becoming more and more accurate over time.

  • Problems with CEFR

    Let’s play a game. Which word doesn’t belong?

    • solecism
    • ambergris
    • brobdingnagian
    • bully

    How about this list?

    • accountability
    • assassination
    • atrocity
    • bat

    If you picked ‘bully’ and ‘bat’, you are just like most people. But believe it or not, all of the words on the first list are CEFR C2 level words, and all of the words on the second are C1 level!

    While it is true that some sources list words like bat and bully earlier, it seems that CEFR itself is somewhat misunderstood as a concept. While its stated goal is ‘…to provide a standardized way to describe language proficiency, which helps language professionals create consistent syllabuses, curriculum guidelines, and examinations across different countries.’ in reality it is a kind of ongoing academic study of which words appear in English texts by frequency. So it isn’t really a useful way to decide which words to teach first. It’s more like a survey of words in popular media. Since words like ‘bat’ are uncommon in general speech and writing, they score very low on the list despite being extremely easy to understand. Simply put, CEFR does not align with the natural order hypothesis. Ironically, the idea that it could tell you which words were ‘easier’ is the reason why it got popular. Well, it’s time to wake up! CEFR isn’t about that at all.

    The solution is a “new CEFR” with a different mandate and a different modus operandi. The idea is simple: go back to those lists above. It is very clear that in the first list, the word ‘bully’ is easier than ‘solecism’, or even ‘brobdingnagian’ (a word I have never encountered in my entire life). It is very clear that the word ‘bat’ is easier than ‘accountability’ or ‘atrocity’.

    Therefore, we have already established a primary means of ranking; what does an experienced English speaker believe is more or less common? In order to do this, we simply take two words and compare them. To these words then, a weight is added where one word is placed on the left and one word is placed on the right.

    Another idea is to group words by the general time it is expected that someone should know them. For example, there are ‘common animal words’ which are essentially common pets and common zoo animals. These words will be dominated by animals that appear commonly in the home, on TV in cartoons, or commonly on farms. Dog, cat, fish, mouse, and then likely words like elephant, horse, cow, pig, chicken, etc. (especially chicken, since it is also a food word). However, even here we see that ‘dog’ and ‘cat’ somehow seem easier than ‘elephant’. What to do?

    What is needed then, is a new score! We have CEFR, but maybe we also need NOGL — ahh, yes, noggles! You have ‘sefir’ (CEFR) and now you have noggles (NOGL)! It stands for Natural Order Graded Level. The idea is that the words will be ordered based on where they are expected to be in the Natural Order Hypothesis.

    Note that this will be very heavily weighted by words that appear in school textbooks, so to a lesser extent NOGL will be influenced by lists like CEFR since they may be used to construct textbooks for children. Nevertheless, the idea is that they will learn some words more easily or faster than others, so even within a level like A1, there may be a separate ‘easy A1’ and ‘the difficult half of A1’. Given CEFR, I would bet money that several of the easy A2 words would be easier than the difficult A1 words! This problem is solved with NOGL.

    Another example where this will have a direct practical application is lists like the JLPT N5 or the MoE’s “800 words for children” or “2,000 words for high school” and the like. There’s going to be a separation in these lists where some words will generally be taught before others, and this creates an expectation that they will be known. But children will also pick up words on their own naturally. NOGL must reflect both realities in order to be useful, while departing from such academic extravagances just enough to allow users to lean into the natural order hypothesis to supercharge their English teaching ability.

    In general, for a case-study game like a spelling bee or multiple choice flashcard tests, using a NOGL-based score will provide a better grading for what a student will actually, practically know, than CEFR. And, since the NOGL will align with what people actually know, it will be the most efficient way to find which words to study next.

    Towards a strong definition of NOGL

    Let’s be a bit specific here. Although this might not be the final idea, it’s a shot in that direction.

    1. Take a list of words. Attempt to organize them into two lists ‘easy’ and ‘difficult’, thus creating two lists from one. It does not really matter how you do this; you could compare two words at random, but it would probably work better to scan for the easiest words, move those, then scan for the most difficult words and move those, and repeat. You may even want to create three or four or five lists at the same time like this, but follow the KISS principle; keep it simple here.
    2. Repeat the process on the sub-lists until you have about 10 groups of words.
    3. If you did 3 rounds of simple comparison, you would have 8 groups of words. Four rounds is 16 groups and is probably accurate enough to grade the entire language. It’s also probably too many groups for practical use. 7 groups is ABCDEFG, and already has more than CEFR (6) or JLPT (5). You can also have an A1/A2 designation within the groups for 14 levels, even though each group is considered a unified thing. The A1/A2 is just which ones come first, i.e. an ‘approach rating’. You could even have A3, A4, A5, etc. which signifies its order.
    4. Each NOGL rated word has a kind of ‘elo’ score for comparing against other words; and the entire list is then split into 7 groups.
    5. But, if we had 10 groups (NOGL-1, NOGL-2, etc.) we could call it N1, N2, N3 etc. So the NOGL is its number and then we need a way to signify its group, or room system.
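
    The repeated easy/difficult split in steps 1 to 3 can be sketched as follows, in Python. This is an illustration only: `is_easy` stands for whatever judge does the sorting (in practice a human), and `shorter_half` is a throwaway stand-in that uses word length so the example runs. It is not a real difficulty measure.

```python
def split_rounds(words, is_easy, rounds):
    """Split `words` into 2**rounds ordered groups by repeated easy/difficult splits."""
    groups = [list(words)]
    for _ in range(rounds):
        next_groups = []
        for group in groups:
            easy, hard = [], []
            for word in group:
                (easy if is_easy(word, group) else hard).append(word)
            next_groups += [easy, hard]  # easy half first keeps the groups in order
        groups = next_groups
    return groups


def shorter_half(word, group):
    """Placeholder judge: a word is 'easy' if it is no longer than the group's median word."""
    lengths = sorted(len(w) for w in group)
    return len(word) <= lengths[(len(lengths) - 1) // 2]


if __name__ == "__main__":
    sample = ["bat", "bully", "atrocity", "solecism", "ambergris",
              "assassination", "accountability", "brobdingnagian"]
    for group in split_rounds(sample, shorter_half, rounds=3):
        print(group)
```

    Three rounds always yield 8 groups and four rounds 16, though some groups can come out empty when the judge cannot separate the words it is given.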

    So ultimately NOGL is intended to approximate N from input theory: if you need N+1, you first need to know what N is. Then, we can place the words into groups under the NOGL banner.

    NOGL grading is unscientific, but here’s an idea. Ungraded words all have a value of 1 (but they’re ungraded, so this isn’t shown). When a word enters the icon lex, it is compared against some words from the corpus. All words are assigned a NOGL number based on their CEFR level (to start). Then, words within each CEFR level are compared, and the more difficult word changes its value to be at least one higher than the value of the word it was compared with. Methods like these may allow us to quickly come up with a usable version of “NOGL”.

    I’ll think this over and come back with some experimental data… soon!

  • Neo’s Success

    Neo finally got his driver’s license! He even got a perfect score on the test!

  • Hello world!

    This is the first post on the blog.