Prompt Engineering · 9 min read

What Temperature Actually Does in LLMs

Temperature controls randomness in AI output. Here's the math, the practical settings, and an experiment you can run yourself.

Dharini S
April 18, 2026

TL;DR

  • Temperature scales the logits before softmax. Low temp sharpens the probability distribution, high temp flattens it.
  • Temp 0.0 = deterministic (same answer every time). Use for data extraction, math, code.
  • Temp 0.7-1.0 = creative sweet spot. Temp 1.5+ = increasingly random and incoherent.
  • Temperature works alongside Top-K and Top-P, but they control different things.
  • You can test all of this in the TinkerLLM playground in under 2 minutes.

You changed temperature to 0 and the model started repeating the exact same answer every time. You changed it to 1.5 and the output turned into word salad. The setting clearly does something. But what?

Most explanations say “temperature controls creativity.” That’s technically accurate but practically useless. It’s like saying the steering wheel controls where the car goes.

Here’s what temperature actually does under the hood, why different values produce such different results, and how to pick the right setting for whatever you’re building.

How the model picks each word

Before temperature makes sense, you need to understand how a model decides what to write next.

LLMs don’t generate sentences. They predict one token at a time. A token is roughly one word or word fragment. For every token position, the model calculates a raw score, called a logit, for every token in its vocabulary.

A model like Gemini has a vocabulary of about 256,000 tokens. So for every single token it generates, it produces 256,000 scores. These raw logits then go through a function called softmax that converts them into probabilities.

The result looks something like this:

| Token | Raw Logit | Probability (after softmax) |
|---|---|---|
| “blue” | 4.2 | 45% |
| “red” | 3.1 | 20% |
| “green” | 2.7 | 15% |
| “purple” | 1.9 | 8% |
| “clear” | 1.4 | 5% |

The model then samples from this distribution. It picks a token based on these probabilities. “Blue” wins most of the time, but “red” or “green” can show up too.

Temperature changes the shape of this distribution before sampling happens. That’s the entire mechanism.
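
Here’s that selection step in a few lines of Python. This is a toy sketch with a five-token vocabulary, so the percentages won’t exactly match the table above (which assumes the rest of a 256,000-token vocabulary absorbs some probability mass):

```python
import numpy as np

tokens = ["blue", "red", "green", "purple", "clear"]
logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

# Softmax: exponentiate each logit, then normalize so the scores sum to 1.
probs = np.exp(logits) / np.exp(logits).sum()

for token, p in zip(tokens, probs):
    print(f"{token:>8}: {p:.1%}")

# Sample one token according to those probabilities.
rng = np.random.default_rng()
print("sampled:", rng.choice(tokens, p=probs))
```

Run it a few times: “blue” wins most often, but the other colors show up too.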

The math behind it

The standard softmax formula converts logits into probabilities:

P(token_i) = e^(logit_i) / Σ e^(logit_j)

With temperature, the formula becomes:

P(token_i) = e^(logit_i / T) / Σ e^(logit_j / T)

Where T is the temperature value. That division by T before the exponential is doing all the work.

When T is less than 1 (say 0.2), dividing by a small number stretches the logits further apart. This amplifies the differences between tokens. The highest-probability token becomes even more dominant. The distribution gets sharper.

When T is greater than 1 (say 1.5), dividing by a large number squeezes the logits toward each other. This compresses the differences. All tokens become closer in probability. The distribution gets flatter.

When T approaches 0, the highest-logit token approaches 100% probability. The model always picks its top choice. This is called greedy decoding. (In practice, T = 0 is special-cased to pick the argmax directly, since dividing by zero is undefined.)

Short version: low temperature makes the model more confident in its first choice. High temperature makes it consider more options equally.
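
You can watch the sharpening and flattening happen with the same toy logits. A minimal numpy sketch, not production decoding code:

```python
import numpy as np

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax; subtract the max for numerical stability.
    scaled = logits / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

for T in (0.2, 0.7, 1.0, 1.5):
    print(f"T={T}:", np.round(softmax_with_temperature(logits, T), 3))
```

At T=0.2 the top token takes nearly all of the probability mass. At T=1.5 the five options sit much closer together.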

Temperature 0: Deterministic mode

At temperature 0, the model picks the token with the highest logit every time. No sampling randomness. Run the same prompt 10 times and you get the same answer 10 times. (Some serving stacks introduce tiny nondeterminism of their own, but the sampling step itself is fixed.)

This is useful more often than people expect.

When to use temp 0:

  • Extracting structured data from text (emails, dates, names)
  • Classification (sentiment analysis, categorization, labeling)
  • Math and logic problems
  • Code generation where consistency matters
  • Any task where you need the same answer reliably

Try it yourself: Open the TinkerLLM playground, set temperature to 0.0, and type “What is 123 x 456?” three times. Same answer. Same explanation. Same formatting. That’s greedy decoding in action.
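
No playground handy? The same behavior falls out of the math. A toy sketch (real APIs special-case temperature 0 as an argmax rather than dividing by zero):

```python
import numpy as np

tokens = ["blue", "red", "green", "purple", "clear"]
logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

# Greedy decoding: temperature 0 means always take the highest-logit token.
for _ in range(3):
    print(tokens[int(np.argmax(logits))])  # "blue", three times out of three
```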

Temperature 0.5-0.7: The practical default

Most production applications land here. Chatbots, writing assistants, Q&A systems, summarization tools.

At 0.5-0.7, the model still favors high-probability tokens but occasionally picks alternatives. The output reads naturally without getting erratic. If you’re not sure what temperature to use, start at 0.7 and adjust from there.

When to use 0.5-0.7:

  • Customer-facing chatbots (variety without hallucinations)
  • Summarization (consistent but not robotic)
  • Translation (accuracy with natural phrasing)
  • General Q&A where tone matters

This range is where the model sounds human without being unreliable.

Temperature 1.0: Unmodified probabilities

At 1.0, you get the model’s original probability distribution. The softmax output is used exactly as computed. No scaling in either direction.

Run the same prompt twice at temperature 1.0 and you’ll get meaningfully different responses. Different word choices, different structures, sometimes different conclusions.

When to use temp 1.0:

  • Brainstorming (you want multiple diverse ideas)
  • Creative writing (poetry, fiction, humor)
  • Generating options for A/B testing
  • Any task where “surprising” is a feature, not a bug

Try it yourself: Set temperature to 1.0 in TinkerLLM. Type “Give me 3 names for a pet rock.” Send it three times. Different names each time. At temp 0, you’d get the same three every time.

Temperature 1.5+: Where coherence breaks

Above 1.0, the distribution flattens fast. Tokens that the model originally considered unlikely become almost as probable as the top choices.

At 1.5, you get surprising word combinations that sometimes feel creative. At 2.0, you get combinations that feel broken. The model produces text that’s syntactically plausible but semantically nonsensical.

The mistake people make here is equating randomness with creativity. They set temperature to 1.5 thinking they’ll get “more creative” output. What they get is “more random” output. A random sentence isn’t creative. A random sentence is random.

For genuinely creative tasks, 0.7-1.0 gives you the variety you want without sacrificing coherence. Go above 1.0 only when you understand you’re trading quality for surprise.

Try it yourself: Set temperature to 2.0 in TinkerLLM. Ask for a poem. Read it. Now set it to 0.8 and ask for the same poem. The 0.8 version will almost always be better writing, even though the 2.0 version had more “randomness.”
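
One way to put a number on “too flat” is the entropy of the distribution: 0 bits means one token is certain, and log2(5) ≈ 2.32 bits means our five toy tokens are equally likely. A minimal sketch:

```python
import numpy as np

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

def entropy_at_temperature(T):
    scaled = logits / T
    p = np.exp(scaled - scaled.max())
    p /= p.sum()
    # Shannon entropy in bits.
    return -(p * np.log2(p)).sum()

for T in (0.5, 0.8, 1.0, 1.5, 2.0):
    print(f"T={T}: {entropy_at_temperature(T):.2f} bits")
```

Entropy climbs steadily with temperature. By T=2.0 the toy distribution is close to uniform, which is exactly why the output reads like word salad.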

How temperature interacts with Top-K and Top-P

Temperature isn’t the only randomness control. It works alongside two other parameters that approach the problem differently.

Top-K restricts how many tokens the model can choose from. If K=40, the model only considers the 40 highest-probability tokens. Everything else gets zeroed out before sampling.

Top-P (Nucleus Sampling) restricts by cumulative probability instead. If P=0.9, the model considers the smallest set of tokens whose probabilities add up to 90%. When the model is confident, this means fewer candidates. When it’s uncertain, more.

These three controls are applied in sequence: logits → temperature scaling → Top-K filtering → Top-P filtering → sampling.

Temperature changes the shape of the distribution. Top-K changes the size of the candidate pool. Top-P changes the coverage of the candidate pool. They do different things and can work together.

| Setting | What It Controls | Typical Default |
|---|---|---|
| Temperature | How spread out probabilities are | 1.0 |
| Top-K | Maximum number of candidate tokens | 40 |
| Top-P | Cumulative probability threshold | 0.9-0.95 |

For most tasks, adjusting temperature alone is enough. Add Top-K or Top-P when you want finer control. For example, keeping temperature at 1.0 for diversity but using Top-K=10 to prevent truly unexpected token choices.
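
Here’s that sequence as a toy numpy sketch. It illustrates the order described above, not any particular library’s implementation:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=40, top_p=0.9, rng=None):
    """Toy decoding pipeline: temperature -> Top-K -> Top-P -> sample."""
    rng = rng or np.random.default_rng()

    # 1. Temperature scaling, then softmax (max subtracted for stability).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 2. Top-K: drop everything outside the K most probable tokens.
    if top_k < len(probs):
        kth_largest = np.sort(probs)[-top_k]
        probs[probs < kth_largest] = 0.0
        probs /= probs.sum()

    # 3. Top-P: keep the smallest set of tokens, from most probable down,
    #    whose cumulative probability reaches P.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    probs = kept / kept.sum()

    # 4. Sample from whatever survived the filters.
    return int(rng.choice(len(probs), p=probs))

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])
print(sample_next_token(logits, temperature=1.0, top_k=3, top_p=0.9))
```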

TinkerLLM’s Lessons 5 and 6 cover Top-K and Top-P with exercises where you adjust both alongside temperature and observe the combined effect.

Temperature settings by use case

Here’s a reference table based on what works in practice across different tasks. These aren’t theoretical recommendations. They’re the values that produce good results when you actually run them.

| Use Case | Temperature | Why |
|---|---|---|
| Data extraction | 0.0 | Needs identical, consistent output |
| Classification | 0.0-0.1 | Same input should produce same label |
| Math and logic | 0.0 | Deterministic reasoning |
| Code generation | 0.0-0.3 | Consistency over variety |
| Summarization | 0.3-0.5 | Accuracy first, slight natural variation |
| Chatbot / Q&A | 0.5-0.7 | Sounds natural, mostly reliable |
| Translation | 0.3-0.5 | Accuracy with natural phrasing |
| Creative writing | 0.7-1.0 | Diversity and surprise |
| Brainstorming | 0.8-1.0 | Maximum idea variety |
| Experimental | 1.0-1.5 | When you want to see what happens |

Temperature across different models

The same temperature value doesn’t behave identically across models. Each model has different internal logit distributions, so temp 0.7 on Gemini produces different output characteristics than temp 0.7 on GPT-4o. Always test with your specific model.

| Model | Default Temp | Range | Notes |
|---|---|---|---|
| Gemini 2.5 Flash | 1.0 | 0.0-2.0 | Higher default than most. Consider lowering for structured tasks. |
| Gemini Pro | 1.0 | 0.0-2.0 | Same range as Flash |
| GPT-4o | 1.0 | 0.0-2.0 | OpenAI recommends 0.7 for most tasks |
| Claude 3.5 | 1.0 | 0.0-1.0 | Narrower range. Can’t go above 1.0. |
| Llama 3 | 0.6 | 0.0-2.0 | Lower default reflects Meta’s stability preference |

The 2-minute experiment

This entire post is something you can verify yourself instead of taking on faith.

  1. Open the TinkerLLM playground
  2. Set temperature to 0.0
  3. Type: “What is 123 x 456?” and send
  4. Send the exact same prompt again. Identical answer.
  5. Set temperature to 1.0
  6. Type: “Give me 3 unique names for a pet rock” and send
  7. Send the same prompt again. Different names.
  8. Set temperature to 2.0
  9. Send any prompt. Watch the output get strange.
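
If you’d rather script the experiment than click through a playground, here’s a sketch using the OpenAI Python SDK as a stand-in (TinkerLLM isn’t required; any API that exposes a temperature parameter behaves the same way):

```python
# Assumes `pip install openai` and an OPENAI_API_KEY in your environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt, temperature):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# Temperature 0: the two answers should match.
print(ask("What is 123 x 456?", 0.0))
print(ask("What is 123 x 456?", 0.0))

# Temperature 1.0: expect different names each run.
print(ask("Give me 3 unique names for a pet rock", 1.0))
print(ask("Give me 3 unique names for a pet rock", 1.0))
```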

That’s temperature. Not a concept to memorize for an interview. Something you observe with your own eyes.

TinkerLLM’s Lessons 3 and 4 cover temperature with guided exercises for both consistency (low temp) and creativity (high temp). The exercises validate your understanding in real time. If you can set the right temperature for a given task, the exercise passes. If you can’t, it tells you what went wrong.

FAQ

What temperature should I use for chatbots?

0.5-0.7 for most customer-facing chatbots. Enough variation that responses don’t sound robotic. Enough consistency that the bot doesn’t contradict itself or hallucinate. For chatbots handling sensitive topics (medical, financial, legal), drop to 0.2-0.3. The risk of a wrong answer outweighs the benefit of natural-sounding prose.

Does temperature affect hallucinations?

Yes. Higher temperature increases the probability of selecting low-confidence tokens, which are more likely to be factually wrong. Lowering temperature is one of the first things to try when a model produces incorrect facts. It won’t eliminate hallucinations entirely (that’s a deeper architectural problem), but it reduces their frequency. TinkerLLM’s Lesson 17 covers hallucinations with exercises where you observe the difference across temperature settings.

Can I just use temperature 0 for everything?

Technically, yes. But the output will sound robotic and repetitive. Temp 0 always picks the single most likely token, which means every response follows the same patterns. For factual Q&A and data extraction, that’s fine. For anything involving natural conversation, it produces flat, predictable text that users notice.

What’s the difference between temperature and Top-P?

Temperature scales the entire probability distribution. It changes how spread out probabilities are. Top-P filters the distribution by cumulative probability, only considering tokens that together represent X% of the total probability mass. Temperature is a global multiplier. Top-P is a dynamic cutoff. They solve different problems and can be combined. More on Top-P in the TinkerLLM curriculum.

Why does temperature 1.5 produce gibberish?

At 1.5, every logit gets divided by 1.5 before softmax. This compresses the differences between tokens so that rare, unlikely tokens become almost as probable as the top choices. The model starts selecting tokens that are syntactically possible but semantically wrong. It’s not malfunctioning. The math is working exactly as designed. The distribution is just too flat to produce coherent text.

Is there a temperature equivalent in image generation models?

In diffusion models like Stable Diffusion and DALL-E, the “guidance scale” or “CFG scale” serves a similar purpose. Higher values make the model follow the prompt more strictly (similar to low temperature). Lower values allow more variation (similar to high temperature). The mechanism is different, but the trade-off between fidelity and diversity is the same.

What temperature does Google use for Gemini in production?

Google hasn’t published exact values, but based on the Gemini API defaults, the starting point is 1.0. For factual applications like AI Overviews in Search, they almost certainly use values closer to 0. For creative features like “Help me write” in Gmail, something in the 0.7-0.9 range.

Does changing temperature cost more?

No. Temperature doesn’t affect token count or API pricing. It only changes which tokens are selected during generation, not how many are generated. A prompt at temperature 0 costs exactly the same as the same prompt at temperature 2.0.

Tags: temperature · LLM parameters · prompt engineering · Gemini · AI fundamentals · softmax
Dharini S · The Educator

Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.


Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 26 exercises free.

Start Tinkering