Nelson Academy

Category: Input Hypothesis

CEFR-T: A graded list for Chinese ESL speakers.
Introducing CEFR‑T (CEFR-Teach): A Vocabulary Ladder Built for How We Actually Teach.

For years, almost everyone teaching English has leaned on the same six labels: A1, A2, B1, B2, C1 and C2, the levels of the *Common European Framework of Reference*, or CEFR. They’re useful. They’re everywhere. And, quietly, they’ve never quite fit the student in front of us.

Today we’re rolling out our own refinement of that ladder. We call it CEFR‑T for “CEFR, Teacher’s edition”. This post explains where it comes from, why we built it, and what it changes for your child’s learning here at Nelson Academy.

First, a quick tour: Oxford CEFR vs. CEFR‑J

Most published “CEFR” word lists you’ll meet are really Oxford’s interpretation (e.g. the Oxford 3000 and Oxford 5000 word lists). These take a large corpus of real English (newspapers, books, conversation) and tag each headword with the CEFR level at which a learner is likely to meet it. It’s careful, corpus‑driven work. But it’s calibrated around how English appears to a broadly *Western* audience. “Native frequency” is doing a lot of the sorting.

CEFR‑J comes at the same problem from the other side of the world. Developed in Japan (by Tono and colleagues) for East‑Asian learners of English, it builds on something every ESL teacher in Japan already knows in their bones: the standard A‑and‑B bands are far too wide for the years our students actually spend inside them. There are over 1,400 words at the A1 level, so a learner can sit “at A1” for a very long time. So CEFR‑J breaks the lower levels into finer steps and re‑orders the vocabulary around what an Asian EFL classroom really introduces first.

That re‑calibration is exactly why we adopted CEFR‑J over Oxford as our level authority in previous years. For our students, it simply describes reality better.

But why CEFR‑T?

Two reasons: one practical, one philosophical.

1. Even CEFR‑J’s A1 is too large. We simply do not need more than 400 or 500 words in a particular band. A label is only useful if it tells you what to teach next, and a band of 1,500+ words tells you nothing.

2. We wanted the levels to mean something a parent can see on a shelf. Walk into any bookshop in Taiwan and you’ll find the graded English magazines: the ones organized by difficulty with one star, two stars, three stars (* / ** / ***), running roughly 500 to 2,000 headwords. Beginner titles. Intermediate titles. Advanced titles. That tiered, star‑rated world is the one our students already live in. What if our vocabulary levels mapped straight onto it?

That’s CEFR‑T.

CEFR-T Roadmap

Thirteen bands, plus “unrated,” with roughly 500 words each; about 6,500 headwords end to end:
```
| Tier          | Bands        | Maps to...                                        |
|---------------|--------------|---------------------------------------------------|
| Prerequisite  | A0           | The first 250 to 500 words a student will learn.  |
| Beginner      | A1 · A2 · A3 | Beginner level *, ** and *** words. Grade school. |
| Intermediate  | B1 · B2 · B3 | Intermediate level *, ** and ***. High school.    |
| Advanced      | C1 · C2 · C3 | Advanced/University level.                        |
| End‑game      | D1 · D2 · D3 | Very difficult or obscure words                   |
```
The trick is the third sub‑level. Where standard CEFR gives you A1 and A2, CEFR‑T gives you A1, A2 and A3: three honest steps that line up with three stars. Suddenly “what comes next” is a small, teachable hop instead of a plateau that lasts years. A0 anchors the bottom with the genuinely first words, and the D tail keeps our games from ever running out of hard material for a strong student.
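For readers who like to see the arithmetic, here is a minimal Python sketch of the roadmap. It assumes the bands are cut as equal 500-word slices of a single ranked word list; that is a simplification for illustration only, since the real list is hand-curated and the bands are only "roughly" 500 words each. The names `BANDS`, `BAND_SIZE` and `band_for_rank` are hypothetical.

```python
# Sketch only: assumes each band is an equal 500-word slice of a ranked list.
BANDS = ["A0", "A1", "A2", "A3", "B1", "B2", "B3",
         "C1", "C2", "C3", "D1", "D2", "D3"]
BAND_SIZE = 500  # "roughly 500 words each" in the roadmap


def band_for_rank(rank: int) -> str:
    """Return the CEFR-T band for a 1-based rank, or 'unrated' past D3."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    index = (rank - 1) // BAND_SIZE
    return BANDS[index] if index < len(BANDS) else "unrated"


print(len(BANDS) * BAND_SIZE)  # 6500 headwords end to end
print(band_for_rank(1))        # A0
print(band_for_rank(501))      # A1
print(band_for_rank(6500))     # D3
print(band_for_rank(6501))     # unrated
```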

Textbooks

Besides periodicals, how does this line up with textbooks? Very tentatively, we can place almost all of the books within the A levels. We would most likely do something like:
- Wonderskills has Starter (Yellow), Basic (Red) and Intermediate (Green) with three books per level. We can use A0 for starter, A1 for basic and A2 for intermediate.
- EasyLink has L1, L2, L3 to L6 and maybe more. The early books are A0, then A1, roughly speaking.
- Reading Sketch Starter is probably A0, with Reading Sketch likely at A1-A2.

By A2 we can already allow 1,000 to 1,500 words; if these books are teaching words that fall outside these bands, they are bad textbooks. More than this, though, the headwords themselves are often not really the main point of the lesson. What matters most is the reading itself and what it reinforces from lesson to lesson. Headwords are important, but great consideration must be given to words that are simply assumed. Rating words effectively will be an ongoing challenge, but if a student is required to know a word at the Starter level, it is clearly an A0 word. Likewise, a word required at Basic is A1, and one required at Intermediate is A2.

After this level of textbook instruction, I feel that the next step is moving to more advanced material (especially periodicals), or to more advanced static work such as Magic Tree House.

Readers: The Parallel Concern

In addition to the above, the system must also carefully match the readers we use, which are primarily the Oxford Reading Tree or Oxford Story Tree series. We find Red books to be A0, Blue to be A0/A1, Green to be A1/A2, with Orange fleshing out the remainder of A2. Higher levels such as Pink really belong in A3 or B1. If a student is reading Pink level (The Litter Queen, Bully, Kidnappers, etc.) they are beyond * and possibly ** level in magazines like “Let’s Talk in English.” The A2/A3 rating supports this theory.

Primary Use Case: Textbook Selection

All in all there is quite a lot of room in the A series. Pegging a student as A0, A1, A2 or A3 will enable the proper choice of textbook early on. If a student is testing above the A3 level, they may be ready for readers above the Magic Tree House level, with true B-level students able to start studying American newspapers and, later, classic grade school novels like Freckle Juice, early classics like If You Give a Mouse a Cookie, and so on. If a teacher’s job (considering MCI/N+1) is presenting the right material to the students at the right time, CEFR-T becomes one of the most important metrics a teacher can have. It can help plug clear holes in a student’s knowledge and provide the fertile ground they need to truly enable MCI (massive comprehensible input) and FSR (free sustained reading) to work their magic.

Pros and cons

What do we gain?
- Granularity where it counts. Five steps from absolute beginner (A0) to solid intermediate (B3), instead of two and a half. That’s where the real teaching happens.
- Levels you can point at. A child finishing “A3” is ready for the one‑star beginner magazine. The level isn’t abstract. It’s a shelf.
- ~500 words a band. Each level is a real, finishable goal, not a warehouse.
- Real, measurable progress on a map. Learning a word is never just a shot in the dark that might not actually help a student improve. Each word represents solid, focused progress towards a definite goal.
- A built‑in top end. The D bands mean our spelling and picture games always have a harder rung to climb.

What do we lose?
- It isn’t Oxford. CEFR‑T is a local instrument, tuned to one classroom and one country’s reading culture. It is *not* an internationally recognised standard, and we’d never present it as one. If you need an official number for an exam or an application, the world’s A1–C2 still rules — and we keep that mapping intact underneath.
- It’s hand‑curated. Its strength (a human deciding the level you’ll *actually teach* a word at) is also its subjectivity. It reflects judgement, not a corpus count.
- The A‑level boundaries are ours. Reasonable teachers could draw them slightly differently. But they match our materials and are directly applicable to our lessons. It’s a good reference for students who are used to our system.

We are very excited to use this new system to help students learn English!

June 24, 2026

Building NOGL with small weights

Language learners acquire grammatical structures in a predictable, natural order, independent of the order in which they are taught. — Dr. Stephen Krashen

The Natural Order Hypothesis states that there is an internal sequence that the brain tends to follow when building up the grammar of a new language. Some grammatical “chunks” or vocabulary groups are easier and emerge earlier, while others come only later, even if you explicitly study them first.

For example:
- English learners typically acquire –ing (progressive) before 3rd person –s (he runs), no matter their first language or the teaching sequence.
- This pattern is seen even if you teach “he runs” early — learners will still make errors like “he run” until their internal system is ready.

The point of the NOH is extremely simple. We want to be better learners and better teachers, so some guidelines on what to teach first would be greatly appreciated! We don’t want to waste our time teaching vocabulary and grammar that students simply cannot learn yet, or can learn only against significant headwinds, when we could be teaching what they are naturally ready to learn now.

Examples of Natural Order

As stated above, progressive -ing is acquired before third person -s. We know other things about the natural order too, thanks to the work done by Dr. Stephen Krashen and linguists all over the world who have reproduced his NOH experiments and verified the results.

Input Hypothesis

The Input Hypothesis is the N+1 idea (Krashen writes it as i+1), and it connects directly to the Natural Order idea. N represents what you can already understand. +1 is the next small step in the natural order: the next grammatical or lexical item your brain is ready to pick up. So, comprehensible input works best when it’s just a little above your current level, containing features that are the next step in the natural order.

In short:

Natural Order gives the “map”; N+1 gives the “step.”

This is the clue we need to reverse-engineer the natural order.

Simple Weights and Measures

By using fair and simple weights and measures, we can constrain a score to match a word’s natural order. The first essential weight is anything that determines what a student already knows or doesn’t know. This gives us a picture of N. With a significant amount of data, we should be able to construct a bell curve that implies a progression through N to N+1 to N+X. For example, suppose a flashcard quiz is given. The cards that are ‘known’ by one intermediate student should have a high probability of being ‘known’ by another intermediate student. That is, the words they know are not random, but very similar. There will be some words one student knows that another doesn’t, but for the most part, if one intermediate student knows a word, it can be considered part of N.

Differential N is also a way to get to N+1. Given the above observation about intermediate students, consider Student 1 and Student 2, where S1N and S2N are the sets of words each student knows. If S1N > S2N, then “+1” is likely to be found in S1N - S2N. That is, the words that S1 knows and S2 doesn’t are highly likely to be close to the +1 that Student 2 needs to satisfy N+1. This is because if S1 is more advanced than S2, it is likely because he has better studied the material he is being tested on. Therefore, if we can establish a standard ‘known’ vocabulary for a certain level, and also establish which words students at that level don’t know, it is then possible to look at all of the intermediate words that some of them know and some of them don’t and place them into three bands:
- A) Known by approx. 90% of intermediate students
- B) Known by 10%-20% of intermediate students
- C) Unknown in general
In this example, N is represented by A) and +1 is represented by B). These lists of words can be discovered by experimental data. Below I will list some of the data I am using to construct the NOGL scores.
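
The banding and the differential can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quiz data is made up, the names `known_fraction`, `band` and `differential` are hypothetical, "unknown in general" is read as "known by fewer than 10%", and words known by between 20% and 90% of students (which the three bands above do not cover) are reported as "unbanded".

```python
from collections import Counter


def known_fraction(results, vocabulary):
    """Fraction of students who marked each word as known."""
    counts = Counter(w for words in results.values() for w in words)
    return {w: counts[w] / len(results) for w in vocabulary}


def band(fraction):
    """Map a known-fraction to band A, B or C (thresholds partly assumed)."""
    if fraction >= 0.90:
        return "A"          # N: known by approx. 90% or more
    if 0.10 <= fraction <= 0.20:
        return "B"          # +1: known by 10%-20%
    if fraction < 0.10:
        return "C"          # assumed reading of "unknown in general"
    return "unbanded"       # 20%-90% is not covered by the three bands


def differential(results, stronger, weaker):
    """S1N - S2N: words the stronger student knows and the weaker does not."""
    return results[stronger] - results[weaker]


# Made-up results for ten intermediate students.
results = {f"s{i}": {"cat"} for i in range(1, 11)}
results["s1"] |= {"bullet", "pain"}
results["s2"] |= {"bullet", "pain"}
for name in ("s3", "s4", "s5"):
    results[name].add("pain")

fractions = known_fraction(results, ["cat", "bullet", "pain", "needlepoint"])
bands = {w: band(f) for w, f in fractions.items()}
print(bands)
print(differential(results, "s1", "s3"))  # {'bullet'}
```

Here `cat` lands in A, `bullet` in B, `needlepoint` in C, and `pain` (known by half the group) stays unbanded until the thresholds are refined.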

But first let’s discuss how NOGL is constructed.

Initial Scores

Initially, each word in the dictionary is given a unique integer at random. This is its initial NOGL score. Each word is also given a NOGLC, which is the confidence rating of that score.
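
A minimal sketch of this starting state, assuming the scores are simply a random shuffle of 1 to the number of words and that every NOGLC starts at 0 (the post does not state the starting confidence). The function name `initial_scores` and the record layout are hypothetical.

```python
import random


def initial_scores(words, seed=None):
    """Give each word a unique random NOGL score and a NOGLC of 0 (assumed)."""
    rng = random.Random(seed)
    scores = list(range(1, len(words) + 1))
    rng.shuffle(scores)  # a shuffle guarantees that no two words share a score
    return {w: {"nogl": s, "noglc": 0} for w, s in zip(words, scores)}


table = initial_scores(["cat", "color", "bullet", "eminently", "needlepoint"])
for word, entry in table.items():
    print(word, entry)
```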

Word Comparison Tool

The first method of weighting the scores is the Word Comparison Tool (WCT). The WCT presents two random words and asks the operator (who must be a native speaker) which word is easier. If the operator chooses the word with the higher NOGL score, the tool exchanges the two scores. It also increases the NOGLC of each word by 1, whether or not an exchange was performed. It then chooses the next two words from among the words with the lowest NOGLC scores.

The NOGLC increases when:
- two words are chosen and their order is confirmed as correct by the native-speaker operator.
- two words are chosen and their order is judged incorrect by the native-speaker operator and changed in the database.

From this test, then, the NOGLC represents the number of times a native speaker has judged the ease of a word.
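
One WCT step might look like the following Python sketch. It is not the actual tool: the record layout and the names `compare` and `pick_pair` are hypothetical, and breaking ties between equal NOGLC values at random is an assumption.

```python
import random


def compare(table, a, b, easier):
    """Apply one operator judgement; `easier` is the word the operator chose."""
    if easier not in (a, b):
        raise ValueError("easier must be one of the two words shown")
    harder = b if easier == a else a
    # Exchange scores only if the easier word currently has the higher score.
    if table[easier]["nogl"] > table[harder]["nogl"]:
        table[easier]["nogl"], table[harder]["nogl"] = (
            table[harder]["nogl"], table[easier]["nogl"])
    # Confidence goes up for both words whether or not a swap happened.
    table[a]["noglc"] += 1
    table[b]["noglc"] += 1


def pick_pair(table, rng):
    """Pick the two words with the lowest NOGLC (ties broken at random)."""
    ordered = sorted(table, key=lambda w: (table[w]["noglc"], rng.random()))
    return ordered[0], ordered[1]


table = {
    "cat": {"nogl": 3, "noglc": 0},
    "bullet": {"nogl": 2, "noglc": 0},
    "eminently": {"nogl": 1, "noglc": 0},
}
compare(table, "cat", "eminently", easier="cat")
print(table["cat"], table["eminently"])  # scores exchanged, NOGLC now 1 each
print(pick_pair(table, random.Random()))  # "bullet" plus one of the other two
```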

Automated by CEFR and Frequency

Although things like CEFR and word frequency do not necessarily imply natural order, it is highly unlikely that an A1 word should have a higher NOGL score than a C1 or C2 word. Therefore, although it is not directly intended that the NOGL will align with CEFR in the end, automatically processing over CEFR is a good way to push the list “towards” what a human would do, and then let a human clean up the mess left by CEFR. Several automatic iterations were therefore performed using the WCT, with the operator’s judgement replaced by a comparison of the two words’ CEFR levels.
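
As a rough sketch of such an automated pass: the code below picks random pairs and lets the CEFR level act as the "operator". The pair-selection method, the function name `cefr_pass`, and the choice to leave NOGLC untouched are all assumptions; the post does not say how the real passes handled them. Pairs at the same level, or with no CEFR tag, are skipped because CEFR has no opinion about them.

```python
import random

CEFR_ORDER = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def cefr_pass(nogl, cefr, rounds, seed=None):
    """Swap NOGL scores wherever a random pair disagrees with CEFR order."""
    rng = random.Random(seed)
    words = list(nogl)
    swaps = 0
    for _ in range(rounds):
        a, b = rng.sample(words, 2)
        level_a = CEFR_ORDER.get(cefr.get(a))
        level_b = CEFR_ORDER.get(cefr.get(b))
        if level_a is None or level_b is None or level_a == level_b:
            continue  # CEFR cannot rank this pair
        easier, harder = (a, b) if level_a < level_b else (b, a)
        if nogl[easier] > nogl[harder]:
            nogl[easier], nogl[harder] = nogl[harder], nogl[easier]
            swaps += 1
    return swaps


# Made-up starting scores and CEFR tags for illustration.
nogl = {"cat": 4, "color": 3, "conscience": 1, "inadequate": 2}
cefr = {"cat": "A1", "color": "A1", "conscience": "C1", "inadequate": "C1"}
swaps = cefr_pass(nogl, cefr, rounds=200)
print(swaps, nogl)
```

Note that the pass never decides between `cat` and `color`; that is the part left for a human or for student data.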

Automated by Game and Test Results

Between ten and twenty students in Taiwan were given the opportunity to play web-based interactive learning games such as memory with images and words, a spelling game with images and audio cues, and a flashcard game where audio and a word are presented and the student must choose the correct card (i.e. multiple choice). Students usually preferred to play the multiple choice flashcard game because all they had to do was listen to the audio, read the word and click on an image. While not perfect, it gives a rough estimate of what a particular student knows.

As the games were played, each time a student got a wrong answer, that word’s score was increased by 1 point. Words were shown lowest score first. This is a little like the WCT: two to four words are shown, and if a word is easy its score remains low, but the words that typically cause students to fail go up and are not shown until later, at higher difficulties. This is a slow way of organizing words by what may appear to be natural order.
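
The scoring rule is simple enough to sketch directly. This is an illustration only, not the games' actual code: the names `next_round` and `record_answer` are hypothetical, the scores here are a stand-alone per-word difficulty count, and breaking ties alphabetically is an assumption made so the example is repeatable.

```python
def next_round(scores, size=4):
    """Return the `size` lowest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (scores[w], w))[:size]


def record_answer(scores, word, correct):
    """A wrong answer adds 1 point to the word; a right answer changes nothing."""
    if not correct:
        scores[word] += 1


scores = {"cat": 0, "color": 0, "bullet": 0, "chaos": 0, "goldfish": 0}
print(next_round(scores, size=2))   # ['bullet', 'cat']
record_answer(scores, "bullet", correct=False)
record_answer(scores, "cat", correct=True)
print(next_round(scores, size=2))   # ['cat', 'chaos']
```

After one miss, `bullet` drops out of the easiest round and will only reappear once the other zero-score words have been shown.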

Across four different games and the WCT overseen by several foreigners, scores began to emerge. Initially it was mostly exchanges. For example,
- hope had a score of 10122 and was exchanged with contributor, which had a score of 1997.
- pain had a score of 5862 and was exchanged with conscience, which had a score of 1911.
- other exchanges observed were authentic vs. mail, taste vs. inadequate, and goldfish vs. chaos.

Over time as the NOGLC increased, words began to sort into a semblance of order. Here is a comparison of the words with NOGL scores 100, 1000, 2000, 3000 and 6000:
- 100. Color
- 1000. Cat
- 2000. Bullet
- 3000. Eminently
- 6000. Needlepoint

Although not perfect, it does seem that the words are roughly ordered. Consider that there are 10,000 words in the dictionary; if they were organized into six even groups, an A1 rating (1st out of 6) would represent roughly the first 1,667 words. So while you may argue that ‘Cat’ should appear earlier than ‘Color’, they are already close in terms of banding. And this is only with a NOGLC of between 2 and 3, starting from random scores. As NOGLC approaches perhaps 10 and data is added from the tests and games, we expect the NOGL score to improve over time.

Manual Editing

With 10,000 words in the dictionary, it is possible that a word just has a bad day. If the easiest word in the world started with the most difficult score and was compared against others randomly 10 times, it would very likely end up with a NOGL score closer to, but still significantly different from, where it should be. This will be reflected in its NOGLC. We don’t know what a good NOGLC would be. The comparisons are random (though they favor words with a low NOGLC), so without manual editing it may be very difficult to get a correct score even with a NOGLC above 10.

The solution is manual editing. The operator simply looks at the list and chooses a word that seems drastically out of place. The operator then places that word where he thinks it belongs. This is a more involved process, since every word between the old and new positions needs to move its NOGL score up (or down) by 1 to accommodate the move. Alternatively, use the BASIC line-numbering principle: multiply all scores by ten, insert moved words into the gaps wherever you want, and re-order later. NOGL is relative; it can be recalculated from the existing list if needed. There is no need to remember a word’s absolute NOGL score, only its position relative to other words.

Hmm! Let’s try that. NOGL scores will be multiplied by 10, and in practice we will just drop the last digit. Let’s call the last digit a decimal! Oh, that is interesting.
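
Here is a small sketch of that times-ten idea, assuming scores are stored in a plain dictionary. The helper names `spread`, `move_between` and `renumber` are hypothetical, and taking the midpoint of the gap is just one reasonable choice of slot.

```python
def spread(nogl, factor=10):
    """Multiply every score by ten so there are free slots between words."""
    return {w: s * factor for w, s in nogl.items()}


def move_between(nogl, word, before, after):
    """Drop `word` into the gap between two neighbours, touching nothing else."""
    low, high = nogl[before], nogl[after]
    if high - low < 2:
        raise ValueError("no free slot here; renumber first")
    nogl[word] = (low + high) // 2


def renumber(nogl, step=10):
    """Rebuild evenly spaced scores from the current relative order."""
    ordered = sorted(nogl, key=nogl.get)
    return {w: (i + 1) * step for i, w in enumerate(ordered)}


nogl = spread({"color": 1, "cat": 2, "bullet": 3, "the": 4})
move_between(nogl, "the", before="color", after="cat")
print(nogl)            # 'the' now sits at 15, between color (10) and cat (20)
nogl = renumber(nogl)
print(nogl)            # color 10, the 20, cat 30, bullet 40
print({w: s // 10 for w, s in nogl.items()})  # displayed with last digit dropped
```

Only the moved word is rewritten at edit time; the shuffle of every other score is deferred to the occasional `renumber`.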

In any case, the purpose of NOGL is to discover the natural order.

Ultimately, if it replicates the mental learning curve demonstrated by actual students, then it must be useful in replicating that mental curve when teaching English. By using these weights and other similar, simple and fair weights, we believe that NOGL will come to approximate the natural order, becoming more and more accurate over time.

November 6, 2025
