
Top-K, Top-P, Temperature: Which to Change

Three sampling parameters, one clear decision rule. Know which knob to reach for first based on what your output actually needs.

Dharini S
April 23, 2026

TL;DR

  • Temperature changes the shape of the probability distribution. Top-K limits how many tokens can be sampled. Top-P limits by cumulative probability. They solve different problems.
  • The pipeline order is: temperature scaling first, then Top-K filter, then Top-P filter, then sample. Order matters.
  • For most tasks, temperature alone is enough. Add Top-K or Top-P only when temperature changes don't solve the problem.
  • OpenAI doesn't expose Top-K. Gemini exposes all three (topK default: 40, topP default: 0.95, temperature default: 1.0).
  • Stacking all three at extreme values is the most common mistake. Pick one, observe the result, then adjust the next.

You set temperature to 0.3 and the output was still unpredictable. Someone on a forum said to try lowering Top-P. You did. Nothing obviously changed. Then you noticed the Top-K slider and had no idea whether you were supposed to touch it at all.

These three parameters overlap enough to confuse almost everyone the first time. They all affect which token the model picks next. But they do it differently, they’re applied in a specific order, and adjusting the wrong one won’t solve your problem even when the right adjustment would.

Here’s what follows: the decision rule for which parameter to reach for first, what each one controls, and the common mistakes that come from combining them incorrectly.

The 30-second version

If you’re scanning, here’s the short version.

The Three-Knob Reference

| Parameter | What it controls | Default (Gemini) | Reach for it when… |
|---|---|---|---|
| Temperature | Shape of the probability distribution. Low = more focused, high = more spread. | 1.0 | The output is too repetitive or too chaotic overall |
| Top-K | Number of tokens eligible to be sampled. K=1 is greedy. K=40 allows the top 40 candidates. | 40 | You want variety but the model keeps jumping to unlikely tokens |
| Top-P | Cumulative probability cutoff. P=0.9 means only tokens that together account for 90% of probability survive. | 0.95 | You want the pool size to adjust to the model’s confidence level automatically |

For most tasks: start with temperature. Add Top-K or Top-P only when temperature alone isn’t producing what you need.

What each one actually does

Temperature

Temperature scales each logit (the raw score the model assigns to every possible token) before the softmax function converts those scores into probabilities. Low temperature sharpens the distribution. High temperature flattens it.

You don’t need to reread the full math here. The temperature deep-dive covers it with the formula and worked examples. The short version: temp 0.0 means the model always picks the highest-probability token, deterministically. Temp 1.0 uses the raw distribution. Temp 1.5 and above flattens things until unlikely tokens become nearly as probable as the top choices.

For most applications, you’re working in the 0.0 to 1.0 range. Above 1.0 is available but coherence drops fast.
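
To see the effect concretely, here is a minimal NumPy sketch (with made-up logit values, not output from any real model) of how dividing logits by T reshapes the resulting probabilities:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then convert the scaled scores to probabilities."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exps = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical logits for four candidate tokens
logits = [4.0, 3.0, 2.0, 1.0]

for t in (0.2, 1.0, 1.5):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(3)}")
# Low T piles probability onto the top token; high T spreads it across the field.
# T=0.0 is handled as a special case (pure argmax), since dividing by zero is undefined.
```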

Temperature is the first thing you should adjust because it affects the entire distribution before any filtering happens. It’s the broadest of the three controls.

Top-K

Top-K filters the token list before sampling. If you set K=40, the model considers only the 40 tokens with the highest probability. Everything outside that set gets zeroed out.

At K=1, the model always picks the single most likely token. This is called greedy decoding and it produces deterministic output: the same input produces the same output every time, regardless of your temperature setting. (Temperature still scales the logits before Top-K filters them, but if only one token survives the filter, the sampling step has nothing to decide.)

At K=40, there are 40 candidates. At K=100, there are 100. The higher you set K, the more tokens are in play and the more variety you get.

Top-K is useful when you want diversity but also want to prevent the model from sampling tokens that are genuinely low-probability. Think of it as a hard floor: below a certain rank, no token is allowed in, no matter what temperature says.
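
A minimal sketch of the filter itself, again with made-up probabilities rather than real model output:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out every token below the k-th highest probability, then renormalize."""
    probs = np.asarray(probs, dtype=float).copy()
    if k < len(probs):
        kth_value = np.sort(probs)[-k]     # probability of the k-th ranked token
        probs[probs < kth_value] = 0.0     # everything ranked below k is removed from play
    return probs / probs.sum()

probs = [0.50, 0.25, 0.15, 0.07, 0.03]
print(top_k_filter(probs, 1))   # [1. 0. 0. 0. 0.] -- greedy decoding
print(top_k_filter(probs, 3))   # top three survive and are renormalized
```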

The TinkerLLM exercises 5-1 and 5-2 show this directly. Exercise 5-1 sets K=1 with temperature=1.0. The model always completes “Once upon a…” with “time” regardless of how high temperature goes, because only one token survives the filter. Exercise 5-2 opens the pool to K=40 at the same temperature, and the completions diversify.

Try it yourself: Run exercise 5-1 in the TinkerLLM playground. Send the same prompt five times. Identical answer each time. Then switch to exercise 5-2. The pool is now 40 tokens wide and the outputs start varying.

Top-P

Top-P (nucleus sampling) applies a different cutoff logic. Instead of giving you a fixed count of candidates, it takes the smallest group of tokens whose cumulative probability reaches the threshold P.

If P=0.9, the model adds tokens in probability order until their cumulative total reaches 90%. The number of surviving tokens varies. When the model is confident (one token dominates), only a few tokens survive. When the model is uncertain (many tokens have similar probabilities), many tokens survive.

This self-adjusting behavior is the point. Introduced in the Holtzman et al. 2019 nucleus sampling paper, Top-P matches the pool size to the model’s actual confidence rather than a fixed count. A P=0.9 cutoff on “The sky is _____” might leave 3 tokens (blue, grey, clear). The same P=0.9 on a creative completion might leave 200 tokens because no single continuation dominates.
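
A minimal sketch of the nucleus cutoff, with made-up probabilities that mimic a confident versus an uncertain model:

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                      # token indices, highest probability first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]   # how many tokens are needed to reach p
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

confident = [0.85, 0.10, 0.03, 0.02]   # "The sky is ___": one token dominates
uncertain = [0.30, 0.25, 0.25, 0.20]   # creative completion: no clear winner
print(top_p_filter(confident, 0.9))    # only two tokens survive the cutoff
print(top_p_filter(uncertain, 0.9))    # the whole pool survives
```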

Exercise 6-1 in the TinkerLLM Lesson 6 sets P=0.1 and prompts “The color of the clear sky is usually…” At P=0.1, only the top 10% of cumulative probability survives. “Blue” is nearly certain. Raise P to 0.95 and many more completions become possible.

The sampling pipeline

These three controls are applied in a specific order, and you need to know it to understand what each parameter is actually doing. Most implementations follow this sequence:

Raw logits
  → Temperature scaling (divide each logit by T)
  → Top-K filter (keep only the top K tokens by score)
  → Top-P filter (keep tokens until cumulative probability ≥ P)
  → Normalize remaining probabilities
  → Sample one token

Order matters. Temperature runs before either filter. Dividing every logit by the same T doesn’t change their ranking, so Top-K keeps the same K tokens at any temperature; what changes is how much probability each survivor carries. When you raise temperature to 1.5, you’re flattening the distribution before the filters apply, so a Top-P cutoff of 0.9 now needs more tokens to reach 90% cumulative probability, and the final sample is more likely to land on a lower-ranked survivor.

Top-K runs before Top-P. So if Top-K=10 cuts the list to 10 tokens, Top-P applies to only those 10. The combination is more restrictive than either filter alone.

This is why stacking extreme values across all three creates outputs you can’t debug cleanly. Each filter’s behavior depends on what the previous one produced. Change one at a time.
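
Here is the whole pipeline as one compact sketch, with illustrative logits standing in for real model output. It mirrors the order above, not any particular provider’s implementation:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=40, top_p=0.95, rng=None):
    """Temperature scaling -> Top-K -> Top-P -> renormalize -> sample one token index."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature

    # 1. Temperature: reshape the whole distribution before any filtering.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 2. Top-K: only the K highest-probability tokens stay in play.
    if top_k < len(probs):
        probs[probs < np.sort(probs)[-top_k]] = 0.0

    # 3. Top-P: among the survivors, keep the smallest set reaching cumulative p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order]) / probs.sum()
    probs[order[np.searchsorted(cumulative, top_p) + 1:]] = 0.0

    # 4. Renormalize what is left and sample.
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [5.0, 4.2, 3.1, 1.0, 0.5]
print([sample_token(logits, temperature=0.7, top_k=3, top_p=0.9) for _ in range(10)])
```

Change one argument at a time and rerun: you can watch how each stage widens or narrows what the final sample is allowed to be.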

When to change each one

This is the table you actually came for. Real tasks, concrete recommendations.

The Sampling Decision Guide

| Task | Temperature | Top-K | Top-P | Start here |
|---|---|---|---|---|
| Structured data extraction | 0.0 | 1-10 | 0.1-0.3 | Temperature (to 0.0) |
| Classification or labeling | 0.0-0.1 | Default | Default | Temperature |
| Code generation | 0.0-0.3 | Default | Default | Temperature |
| Factual Q&A | 0.0-0.3 | Default | Default | Temperature |
| Summarization | 0.3-0.5 | Default | Default | Temperature |
| Chatbot or conversational AI | 0.5-0.7 | Default | Default | Temperature |
| Creative writing (varied) | 0.7-1.0 | 40-100 | 0.9-0.95 | Temperature, then Top-K if still repetitive |
| Brainstorming | 0.8-1.0 | 40-100 | 0.9 | Temperature |
| Consistent style with variety | 0.7 | 40 | 0.9 | Top-K (to maintain style while allowing variation) |
| Reducing nonsense in creative output | 0.8 | 20-40 | 0.9 | Lower Top-K |

A few notes on reading this table. “Default” means leave the API’s default in place (for Gemini: topK=40 and topP=0.95).

And for creative tasks, the pattern is: set temperature first. Then notice if the output feels repetitive despite a high temperature value. If it does, try raising Top-K. The problem may be that the candidate pool is too small, not that the distribution is too flat.

Common mistakes people make

1. Stacking all three at extreme values

This is the most common one. Temperature 1.5, Top-K 100, Top-P 0.98. You want maximum creativity so you push everything up. What you get is output that’s unpredictable in ways you can’t trace back to any specific cause. When something breaks, you don’t know which parameter is responsible. Set one, observe, then adjust the next.

2. Raising temperature to get diversity when the real issue is Top-K

You want varied outputs. You push temperature from 0.7 to 1.2. The output gets more chaotic but not actually more diverse in the ways you wanted. The problem might be that Top-K is sitting at 10, which limits the candidate pool to 10 tokens regardless of how flat the distribution gets. Raise Top-K first. Temperature at 0.7 with K=100 gives you more useful variety than temperature at 1.2 with K=10.

3. Setting Top-P to 1.0 thinking it disables the filter

In most APIs, Top-P=1.0 means “include all tokens,” which is functionally the same as having no Top-P filter. That part is correct. But people do this while also keeping a restrictive Top-K in place, assuming the combination gives them full flexibility. It doesn’t. If Top-K=5 runs before Top-P=1.0, the pool is already cut to 5 tokens before Top-P even runs. Check both settings when you want to open up the sampling.
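
To see this concretely (made-up probabilities again): Top-K runs first, so Top-P=1.0 has nothing left to widen.

```python
import numpy as np

probs = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05])

# Top-K=5 runs first: only five tokens remain in play, regardless of Top-P.
top_k = 5
probs[probs < np.sort(probs)[-top_k]] = 0.0

# Top-P=1.0 then keeps everything that is left -- which is still only five tokens.
print(np.count_nonzero(probs))  # 5
```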

4. Adjusting Top-P when temperature is the actual problem

I see this in the TinkerLLM playground often. Someone sets Top-P to 0.5 because the output feels repetitive. But Top-P at 0.5 shrinks the candidate pool. It doesn’t make the remaining tokens’ probabilities more spread out. If the model is picking the same tokens repeatedly, that’s usually a temperature issue (too low) or a Top-K issue (too restrictive). Top-P determines which tokens qualify, but temperature determines how spread out their probabilities are within the qualified set. Adjust temperature first.

The 3-minute experiment

This sequence exercises all three parameters and takes about three minutes in the TinkerLLM playground.

  1. Open the playground and go to Lesson 5.
  2. Run exercise 5-1: Top-K=1, temperature=1.0. Prompt is “Complete the phrase commonly: ‘Once upon a…’” Send it five times. Same completion every time. That’s greedy decoding. K=1 makes temperature irrelevant.
  3. Run exercise 5-2: Top-K=40, temperature=1.0. Prompt: “Invent a name for a new fruit that tastes like starlight.” Send it five times. Different names each time because 40 tokens are now in play.
  4. Move to Lesson 6. Run exercise 6-1: Top-P=0.1. Prompt: “The color of the clear sky is usually…” At P=0.1, only the top 10% cumulative probability survives. “Blue” is nearly certain.
  5. Change Top-P to 0.95 and send the same prompt again. Notice how the range of completions expands.
  6. Finally, set temperature to 1.5 with Top-P=0.95 and Top-K=40. Run: “Write the opening line of a novel.” Send it twice. Then drop temperature to 0.7 and run again. Compare the coherence.

The goal isn’t to memorize settings. It’s to build your intuition for which direction to turn first.

Try it yourself: Work through the six steps above in the TinkerLLM playground. The whole sequence takes only a few minutes and shows the pipeline order in a way that’s hard to forget after you’ve seen it live.

Model-specific notes

These parameters aren’t available consistently across every API.

API Parameter Support

| Model / API | Temperature | Top-K | Top-P | Notes |
|---|---|---|---|---|
| Gemini (Google AI) | Yes, 0-2.0 | Yes (default: 40) | Yes (default: 0.95) | All three exposed. See the Gemini API docs. |
| GPT-4o (OpenAI) | Yes, 0-2.0 | No | Yes | OpenAI’s API does not expose Top-K. |
| Claude 3.5 (Anthropic) | Yes, 0-1.0 | Yes | Yes | Narrower temperature range. Can’t exceed 1.0. |
| Llama 3 (via Ollama or vLLM) | Yes | Yes | Yes | All three available. Defaults vary by deployment. |

The OpenAI Top-K absence is confirmed and has been consistent for years. If you’re reading a guide that tells you to “set top_k for GPT-4,” that’s either referencing a wrapper library that simulates the behavior client-side, or it’s wrong. The OpenAI API controls sampling through temperature and top_p only.

For Gemini, the defaults (topK=40, topP=0.95, temperature=1.0) are reasonable for general-purpose tasks. You don’t need to override them unless you have a specific reason. More on Gemini tokenization and vocabulary size in Tokens Explained.
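
As a concrete example, here is a minimal sketch assuming the google-generativeai Python SDK; the parameter names mirror the REST API’s generationConfig, but check the current SDK reference before relying on them:

```python
# Minimal sketch, assuming the google-generativeai Python SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Override only temperature; leave top_k and top_p at the API defaults (40 and 0.95).
response = model.generate_content(
    "Summarize the difference between Top-K and Top-P in two sentences.",
    generation_config={"temperature": 0.4},
)
print(response.text)
```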

Claude’s temperature cap at 1.0 matters if you’re porting prompts from Gemini or OpenAI, where temp=1.5 is valid. Check the accepted range before migrating settings across providers.
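
One defensive pattern when migrating is to clamp settings to the target provider’s documented range before the request goes out. This is a hypothetical helper, not part of any SDK, and the ranges should be verified against current API docs:

```python
# Hypothetical helper for porting sampling settings across providers.
# Ranges reflect the table above; verify against the current API documentation.
PROVIDER_TEMPERATURE_RANGE = {
    "gemini": (0.0, 2.0),
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
}

def clamp_temperature(value: float, provider: str) -> float:
    low, high = PROVIDER_TEMPERATURE_RANGE[provider]
    return max(low, min(high, value))

print(clamp_temperature(1.5, "anthropic"))  # 1.0 -- a Gemini setting of 1.5 is out of range for Claude
```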

FAQ

Should I change temperature or Top-P first?

Temperature, almost always. It has the most predictable effect on output behavior. Top-P is useful for fine-tuning once you’ve found a temperature that roughly works. If you start with Top-P and leave temperature at the default, you’re adjusting the coverage of a distribution you haven’t shaped yet. Get the shape right first.

Can I just use Top-P and ignore Top-K?

For most tasks, yes. Top-P is self-adjusting and handles both confident and uncertain situations without you having to pick a specific count. The practical reason to add Top-K on top of Top-P is when you want a hard ceiling on candidates regardless of the model’s confidence. If the model is generating from a 500-token pool during uncertain completions and that’s causing problems, Top-K can cut that pool to 40 or 50 without touching the self-adjusting Top-P logic. But if temperature and Top-P are producing good results, there’s no point in adding Top-K complexity.

What’s the difference between Top-K and Top-P when both are on?

They filter by different criteria and run sequentially, with Top-K running first. Top-K keeps only the top K tokens by probability rank. Top-P keeps the smallest set of tokens whose cumulative probability reaches P. If K=20 and P=0.9, you cut to the top 20 tokens first, then apply the 90% cumulative cutoff to those 20. In most settings one filter is the binding constraint, but they interact in non-obvious ways at extreme values.

Does OpenAI expose Top-K?

No. OpenAI’s API doesn’t support a top_k parameter for GPT-4o or any of its main model endpoints. The two sampling parameters available are temperature and top_p. This is confirmed in OpenAI’s API reference and has been the case consistently. If you need Top-K control, you’d need a model provider that exposes it, such as the Gemini API or a self-hosted Llama deployment via Ollama.

What Top-P value should I start with?

0.9 or 0.95 for most tasks. Gemini’s default of 0.95 is a reasonable starting point across creative and conversational work. If you need the model to stay more focused, move down to 0.8. If the output sticks to a narrow predictable set of completions despite moderate temperature, try raising Top-P toward 1.0. Values below 0.5 are typically only useful for highly constrained tasks where you want the model to draw from a very tight nucleus of confident predictions.

Does Top-K affect latency or cost?

Not directly. Top-K filtering happens inside the model’s sampling step, which is a tiny fraction of total inference time. The dominant cost factors are input tokens, output tokens, and model size. Whether K is 1 or 100, the filtering itself adds negligible compute, and API pricing is based on tokens processed, not on parameter settings. (Different sampling settings can change how long the generated output ends up being, which does affect cost, but that’s a side effect of what gets generated, not of the filter.)

Why does Top-K=1 still produce different outputs sometimes?

A few possible explanations. Even greedy decoding isn’t guaranteed to be reproducible: non-deterministic floating-point arithmetic on GPUs, especially under batching, can flip which token ranks first when two candidates are nearly tied. Without a fixed random seed, minor numerical differences can appear across calls. And differences in system instructions or stop sequences between calls can shift what the model considers most likely before the filter runs. The most reliable path to truly identical outputs is K=1 plus a fixed seed plus identical prompt and system instructions on every call.

Tags: top-k, top-p, temperature, LLM sampling, LLM parameters, prompt engineering, Gemini
Dharini S, The Educator

Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.

LinkedIn

Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 26 exercises free.

Start Tinkering