Tokens Explained: How LLMs Read and Write
Tokens control API cost, context limits, and why your model gets cut off mid-sentence. Here's exactly how LLMs read and write.
TL;DR
- A token is roughly 4 characters or 0.75 words. “Apple” is 1 token. “Supercalifragilisticexpialidocious” breaks into several.
- Max Output Tokens is a hard cutoff. The model stops mid-sentence when it hits the limit, no matter what your system instruction says.
- JSON formatting costs 2-3× more tokens than plain text for the same data.
- Hindi, Arabic, and Chinese text uses 2-3× more tokens than English, increasing both cost and context window usage.
- Gemini 2.5 Flash has a 1M token context window. GPT-4o has 128K. System instructions count toward that budget on every single call.
You set Max Output Tokens to 2 and typed “Continue: Jana Gana Ma”. The model returned one word. Then nothing. Not an apology. Not a shorter version of the song. One token, then a hard stop.
That experiment is in Lesson 2 of the TinkerLLM curriculum for a reason. It makes the abstract concrete: every LLM reads your input and generates its output in discrete chunks called tokens. Not words. Not characters. Tokens. Once you understand what a token is, a lot of other things start making sense: why Hindi text costs more than English, why JSON responses burn through your budget faster, and why a long conversation history makes every new message more expensive.
What a Token Actually Is
The simplest definition: a token is roughly 4 characters of English text, or about 0.75 words. So 100 words of English prose is approximately 130 tokens.
This rule of thumb holds well for common English words. “Apple” is 1 token. “Apple pie” is 2 tokens. “Unhappiness” is probably 1 token because it appears frequently enough in training data. But the rule breaks down with rare or long words.
“Supercalifragilisticexpialidocious” breaks into several tokens. The tokenizer has never seen that word as a complete unit, so it splits it into familiar subword pieces. How many tokens? Check yourself: the number varies by model. The point is it’s not 1, and it’s not 34 (one per character). It’s somewhere in between, wherever the tokenizer finds the most efficient split.
Numbers get fragmented too. “12345” might become three or four separate tokens depending on the tokenizer. This is one reason models are unreliable at arithmetic. They’re not computing with numbers the way a calculator does. They’re doing pattern matching on token fragments that happen to look like numbers. A calculator processes the value 12345. An LLM processes whatever token sequence “12345” maps to, and the arithmetic has to emerge from patterns in the training data, not from the numeric value itself.
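If you want to poke at this from code rather than the playground, OpenAI’s open-source tiktoken library exposes a GPT-family tokenizer directly. It won’t give Gemini’s exact counts, and the specific splits below are whatever that encoding happens to produce, but the pattern is the same: common words stay whole, rare words and digit strings fragment.

```python
# pip install tiktoken  (GPT-family tokenizer; Gemini's counts will differ,
# but the splitting behavior is representative)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

for text in ["Apple", "Apple pie", "Supercalifragilisticexpialidocious", "12345"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```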
Try it yourself: Open the TinkerLLM playground and type “Supercalifragilisticexpialidocious” in the prompt box. Check the ⚡ icon in the playground header after you run it. You’ll see it registers more than one token. Now clear that and type “Apple”. One token. This is exercise 2-1 in Lesson 2, and it takes under two minutes. The difference you observe is the tokenizer at work.
Why Not Words or Characters?
Splitting text by words seems like the obvious approach. The problem is vocabulary size. English has over 170,000 dictionary words, and that’s before you add proper nouns, technical jargon, code, transliterated languages, and slang. A model that needed one vocabulary entry per word would require a lookup table too large to be practical. And most of those entries would appear so rarely in training data that the model would learn almost nothing useful about them individually.
Splitting by character is the other extreme. A character-level vocabulary is tiny, which is computationally nice, but it forces the model to work much harder. “cat” becomes three separate tokens, and the model has to piece together meaning from individual letters rather than from recognizable word units. This makes learning take much longer and produces worse results for the same training effort.
The actual solution is subword tokenization. Modern tokenizers find a middle ground: common words stay intact as single tokens, and rare or long words get broken into common subword pieces that the model recognizes from other contexts.
Two of the most widely used approaches: BPE (Byte Pair Encoding), used by GPT models, and SentencePiece, used by Gemini. BPE builds its vocabulary by iteratively merging the most frequent character pairs in the training data. You start with individual characters, merge the most common pair into a new token, and repeat until you reach your target vocabulary size. SentencePiece treats text as a raw byte stream without assuming any language’s word boundaries, which makes it more language-agnostic from the start.
If you want to see how these algorithms differ in detail, Hugging Face’s tokenizer documentation is a clear reference.
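To make the merging idea concrete, here’s a toy sketch of BPE’s core loop. The four-word corpus and the number of merge steps are invented for illustration; real tokenizers run this over enormous corpora and work at the byte level, which this sketch skips.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with an invented frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 4}

for step in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(pair, words)
    print(f"merge {step + 1}: {pair} -> new token {''.join(pair)!r}")
```

Each pass adds one new subword to the vocabulary; run it enough times and frequent words collapse into single tokens while rare ones stay in pieces.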
Gemini’s vocabulary contains roughly 256,000 tokens. GPT-4 uses around 100,000. The larger vocabulary means the tokenizer can represent common text more efficiently, which usually means a lower token count for the same input. This matters for cost and for how much text fits in a context window.
The Hard Cutoff: Max Output Tokens
Every API call has a token budget for the model’s response. Max Output Tokens is the hard ceiling on how many tokens the model can generate.
Hard means hard. The model generates one token at a time. When it hits the limit, it stops. It doesn’t know the cutoff is coming. It doesn’t try to wrap up its sentence. It doesn’t summarize. It just stops, wherever it happens to be in the output.
This is fundamentally different from a system instruction like “Answer in one sentence.” That’s a soft limit. The model tries to comply, but it’s a request, not a wall. The model might write two sentences if it thinks the topic warrants it. Max Output Tokens doesn’t ask. It cuts.
Try it yourself: In the TinkerLLM playground, set Max Output Tokens to 2 and type “Continue: Jana Gana Ma” (this is exercise 2-2 in Lesson 2). The response will be one or two tokens, then nothing. Then set Max Output Tokens to 5 and ask “Explain the Mahabharata” (exercise 2-3). The model starts an explanation and gets cut off mid-thought. There’s no truncation notice, no error message. Just an incomplete response. That’s the hard cutoff in action.
When a system instruction and a hard limit conflict, the hard limit wins. Exercise 2-7 in Lesson 2 demonstrates this directly: set Max Output Tokens to 10 and the system instruction to “Write a very long essay”. The model tries to follow the instruction. It generates 10 tokens of essay. Then it stops. Hard limits don’t care about your instructions.
For most use cases, you want Max Output Tokens set high enough that it doesn’t interfere. If you’re generating detailed documentation or long-form content, a limit of 2,048 or 4,096 tokens gives the model room to work. The limit is there as a safety bound, not as a way to control response length in normal operation.
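If you’re calling the API directly rather than using the playground, the limit is just a request parameter. Here’s a minimal sketch with the google-generativeai Python SDK; the API key and model name are placeholders, and the exact finish-reason value you see depends on the SDK version.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

response = model.generate_content(
    "Explain the Mahabharata",
    generation_config={"max_output_tokens": 5},    # the hard ceiling
)

print(response.text)                          # cut off mid-thought, no warning
print(response.candidates[0].finish_reason)   # reports MAX_TOKENS when truncated
```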
Format Cost: JSON vs Plain Text
The format you request directly affects how many tokens the response uses, and therefore what you pay.
The same three cities, in four different formats:
| Format | Example | Approximate Tokens |
|---|---|---|
| Plain text | Mumbai, Delhi, Bangalore | 5 |
| Python list | ["Mumbai", "Delhi", "Bangalore"] | 12 |
| JSON object | {"cities": ["Mumbai", "Delhi", "Bangalore"]} | 16 |
| YAML | cities:\n - Mumbai\n - Delhi\n - Bangalore | 10 |
The information is identical. The token cost isn’t. Every quotation mark, bracket, and colon is a token or part of one. JSON is the most expensive format for structured data. YAML is cheaper. Plain text is cheapest.
For a single API call, the difference is negligible. For an application making 10,000 calls per day, moving from JSON to plain text for the right use cases can reduce output token costs by 30 to 50 percent.
The question to ask before requesting JSON: do you actually need the structure, or can you parse something simpler? For a list of city names, comma-separated text is easy to split in any programming language. For nested data with multiple fields per record, JSON earns its cost because the alternative is writing a fragile custom parser.
Try it yourself: In the TinkerLLM playground, ask for “3 cities in India in JSON format” (exercise 2-8 from Lesson 2). Watch the ⚡ token count in the header. Then ask for the same 3 cities as plain comma-separated text. The JSON version will consistently use more tokens for identical content. If your application handles parsing, you can often save meaningful budget by requesting a simpler format.
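You can also measure the gap programmatically. This sketch uses tiktoken, so the absolute numbers won’t match the table above exactly (different tokenizer), but the ordering from plain text to JSON holds.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-family tokenizer; Gemini's counts differ

cities = ["Mumbai", "Delhi", "Bangalore"]
formats = {
    "plain text": ", ".join(cities),
    "Python list": str(cities),
    "JSON object": json.dumps({"cities": cities}),
    "YAML": "cities:\n  - " + "\n  - ".join(cities),
}

for name, text in formats.items():
    print(f"{name}: {len(enc.encode(text))} tokens")
```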
The Token Tax on Non-English Languages
Tokenizers are trained on whatever text was available during the model’s development. For most major LLMs, that training data was predominantly English. Common English words get efficient, compact token representations. Non-English text often doesn’t.
Here’s what that looks like concretely:
| Language | Example phrase | Approximate tokens |
|---|---|---|
| English | “Hello, how are you?” | 5 |
| Hinglish (romanized) | “Namaste, aap kaise hain?” | 7-9 |
| Hindi (Devanagari) | “नमस्ते, आप कैसे हैं?” | 10-14 |
| Arabic | “مرحبا، كيف حالك؟” | 10-15 |
| Chinese | “你好,你好吗?” | 8-12 |
Hindi and Arabic text can cost 2 to 3 times more tokens than equivalent English text. This is sometimes called the token tax. It’s not a deliberate policy decision. It’s a consequence of where the training data came from. The BPE algorithm builds its vocabulary from the most frequent patterns in the training corpus, and if that corpus is 80% English, English words get compact representations and everything else gets fragmented more.
For anyone building or using AI products for Indian audiences, this has real implications. A Hindi-language chatbot costs more to run than an equivalent English one. It also uses more of the context window per message, which means shorter effective conversation histories.
Newer models are trained on more multilingual data, and the gap is narrowing. But it’s worth accounting for in any product built for non-English speakers. You can observe it directly in exercise 2-9 of Lesson 2: type a Hinglish sentence and check the token count. Then type the same sentence in Devanagari script. The Devanagari version will usually register more tokens.
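Here’s the same comparison from code, again with tiktoken standing in for Gemini’s tokenizer. The exact counts will differ from the table, but the Devanagari version will usually land highest.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-family tokenizer; Gemini's counts differ

phrases = {
    "English": "Hello, how are you?",
    "Hinglish (romanized)": "Namaste, aap kaise hain?",
    "Hindi (Devanagari)": "नमस्ते, आप कैसे हैं?",
}

for language, phrase in phrases.items():
    print(f"{language}: {len(enc.encode(phrase))} tokens")
```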
Context Window: The Total Budget
The context window is the maximum number of tokens a model can process in a single call: input plus output combined.
Everything goes into this shared space. Your prompt. Your system instruction. The conversation history from previous messages. Whatever room is left is the model’s output budget. If your context fills up before the model finishes generating, you’ll either get an error or a truncated response.
| Model | Context Window |
|---|---|
| Gemini 2.5 Flash | 1,048,576 tokens (1M) |
| Claude 3.5 Sonnet | 200,000 tokens |
| GPT-4o | 128,000 tokens |
| Llama 3.1 70B | 128,000 tokens |
Gemini 2.5 Flash’s 1M token window is large enough that most everyday use cases won’t approach it. You could paste the full text of several novels into a single prompt and still have room for output.
But context window size is also a cost consideration. Every token in the context costs money. In a chat application that includes the full conversation history in each API call, a 50,000-token conversation history means every new user message, even a single sentence, costs 50,000+ tokens of input. At scale, long conversation histories become expensive.
And don’t forget system instructions. If your system instruction runs 500 tokens and you’re making 10,000 calls per day, that’s 5 million tokens per day in system instruction alone, before your actual user prompts or output. Keeping system instructions concise is one of the most effective ways to control costs at scale.
Tokenizer Quirks That Affect Output
A few properties of tokenization regularly surprise people.
Whitespace is part of the token. The token for “ hello” (with a leading space) is different from the token for “hello” (without one). These are distinct entries in the vocabulary. Most of the time this doesn’t affect output quality, but prompts with unusual spacing or formatting can behave slightly differently than you’d expect.
Numbers get fragmented. “1234567” might become something like ["12", "345", "67"] or another grouping. The tokenizer doesn’t recognize numeric values, only character sequences. This is why models are unreliable at counting digits, sorting numbers, or doing multi-step arithmetic. The math would have to emerge from pattern matching on fragments, not from actual numeric computation.
Rare code identifiers cost more. A function named “calculateExponentialMovingAverage” will probably cost 6 to 10 tokens. Common library names like “numpy” or “React” are likely single tokens because they appear frequently in training data. Obscure internal identifiers at your company are probably not.
You can see exactly how any text gets tokenized using the OpenAI Tokenizer tool. It shows token boundaries and total count for any input. The tokenizer is GPT-4’s, so the numbers won’t be identical for Gemini, but the general patterns hold across both.
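If you’d rather check from code, tiktoken exposes the same kind of GPT-family tokenizer as a library. A quick sketch of all three quirks; the exact splits depend on the encoding, but the behavior is representative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-family encoding; Gemini's splits differ

# Leading whitespace changes the token: "hello" and " hello" are separate vocabulary entries.
print(enc.encode("hello"), enc.encode(" hello"))

# Numbers fragment into character chunks, not a single numeric value.
print([enc.decode([t]) for t in enc.encode("1234567")])

# Compare a long, rare identifier with a short, common library name.
print(len(enc.encode("calculateExponentialMovingAverage")), len(enc.encode("numpy")))
```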
API Pricing: What Tokens Actually Cost
LLM API pricing is almost always expressed in tokens per million. The standard pattern: input tokens cost less than output tokens, because generating tokens requires more compute than reading them.
Check Google’s pricing page for current Gemini rates. The numbers change as models update and as pricing tiers are adjusted, so anything quoted here would be stale within months. What won’t change: every token in your prompt, every token in your system instruction, and every token the model generates costs something. At 10,000+ daily calls, all three add up.
The practical takeaway is straightforward. Shorter system instructions save money. Leaner output formats save money. Concise prompts save money. Not pennies: at meaningful scale, these choices make a real difference in monthly API spend.
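A back-of-envelope estimate makes the scale obvious. The prices and per-call token counts below are placeholders, not real rates; substitute current numbers from the pricing page and your own traffic.

```python
# Placeholder rates: replace with current prices from your provider's pricing page.
INPUT_PRICE_PER_M = 0.10    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 0.40   # $ per 1M output tokens (assumed)

calls_per_day = 10_000
system_instruction_tokens = 500
user_prompt_tokens = 200
output_tokens = 300

daily_input = calls_per_day * (system_instruction_tokens + user_prompt_tokens)
daily_output = calls_per_day * output_tokens

monthly_cost = 30 * (
    daily_input / 1_000_000 * INPUT_PRICE_PER_M
    + daily_output / 1_000_000 * OUTPUT_PRICE_PER_M
)
print(f"~${monthly_cost:.2f}/month")
```

Run it with your own numbers and notice how much of the input bill is just the system instruction repeated on every call.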
Putting It Together
Tokens are the unit of measurement for how much a prompt costs, how much context the model can hold, and how long its response can be. Every parameter you control in the playground, from Max Output Tokens to context window usage, is measured in tokens.
If you haven’t read the post on what temperature actually does in LLMs, that’s a natural next step. Temperature controls which token the model picks at each generation step. Understanding both concepts, what the tokens are and how the model selects between them, gives you a more complete picture of what’s happening under the hood.
The exercises in Lesson 2 of the TinkerLLM curriculum cover everything in this post hands-on. They’re designed to take under 30 minutes total, and they give you a concrete feel for each concept rather than asking you to take it on faith.
FAQ
What’s the difference between a token and a word?
A word is a linguistic unit with meaning. A token is a unit of text that a tokenizer has assigned a single vocabulary entry. Many common English words are single tokens, but the tokenizer doesn’t know about grammar or semantics. “Dog” is one token. “Dogs” might be one token or two, depending on how frequently the plural appeared in training data. “Unhappiness” could be one token if it’s common enough, or split into “Un” and “happiness”. The tokenizer is optimizing for encoding efficiency, not for linguistic meaning. The two often align, but not always.
Why does Hindi use more tokens than English?
Most LLM tokenizers were trained on data that was heavily weighted toward English. Because BPE and SentencePiece algorithms build vocabularies from the most frequent patterns in the training corpus, English words get compact single-token representations. Hindi words, especially in Devanagari script, appear less frequently and get split into more pieces. The result is that the same sentence in Hindi can cost 2 to 3 times more tokens than in English. Multilingual models trained on more balanced corpora show smaller gaps, but no major model has fully closed it.
How do I know how many tokens my prompt uses?
In the TinkerLLM playground, the ⚡ icon in the header shows your input token count after each submission. For programmatic access, the Gemini and OpenAI APIs both return token counts in their response metadata. If you want to count tokens before making a call, the OpenAI Tokenizer works well for GPT-family models. Gemini’s API includes a countTokens method that returns an exact count for any given prompt. This is useful when you’re building applications and want to validate that your prompts stay within a budget before committing to the call.
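For Gemini specifically, a minimal sketch of that pre-flight count with the google-generativeai Python SDK looks like this; the API key and model name are placeholders.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

# Count tokens before committing to a full generate call.
count = model.count_tokens("Explain the Mahabharata in three sentences.")
print(count.total_tokens)
```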
What happens when I hit the context window limit?
Behavior depends on the application. If you’re calling the API directly, you’ll get an error if your input exceeds the model’s context window. In chat applications that manage history automatically, most implementations start dropping the oldest messages to make room for new ones. This is why long conversations can start feeling like the model has forgotten things from earlier: those earlier messages were dropped from the context to keep the total under the limit. Gemini 2.5 Flash’s 1M token window is large enough that this rarely happens in practice, but GPT-4o’s 128K window fills up faster in extended, multi-turn conversations.
Does the tokenizer affect model quality, not just cost?
Yes, and it’s underestimated. Because numbers are fragmented into sub-tokens, models are genuinely unreliable at arithmetic, even on simple problems. Because whitespace creates different tokens, unusual formatting in prompts can produce unexpected outputs. Because rare code identifiers get split into pieces, the model is better at recognizing common library names than obscure internal function names. The tokenizer is the model’s interface to all text, and its limitations carry through to the model’s behavior. When a model seems to struggle with a specific type of input, the tokenizer is often part of the reason.
Why can’t I just set Max Output Tokens to a very large number for everything?
You can, and for tasks like generating long documents, you should set it high enough that it doesn’t interfere. But a few things worth knowing. First, the model stops whenever it thinks it’s done, regardless of the limit: a high Max Output Tokens doesn’t force longer output, it just prevents early cutoffs. Second, very long outputs cost proportionally more, and if the model hallucinates across a 5,000-token response, you’ve paid for 5,000 tokens of unreliable content. Third, production applications often set a reasonable Max Output Tokens as a safety bound to prevent runaway generation from exceeding budget expectations. A default of 2,048 tokens works well for most use cases, with higher limits set intentionally per task.
Is JSON always the wrong format to request?
Not at all. JSON is worth the token cost when your application genuinely needs structured, parseable output. The overhead is real, but so is the development cost of writing a reliable parser for free-form text. The useful question is whether the structure is actually necessary for your downstream code. For a simple list of items, comma-separated text is trivial to split in any language. For nested data with multiple fields per record, JSON formatting earns its cost because the alternative involves fragile string parsing. YAML is a reasonable middle ground for moderately structured data that doesn’t need the strictness of JSON.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.
Want to try this yourself?
Open the TinkerLLM playground and experiment with real models. 26 exercises free.
Start Tinkering