Prompt Engineering · 9 min read

What Temperature Actually Does in LLMs

Temperature controls randomness in AI output. Here's the math, the practical settings, and an experiment you can run yourself.

Dharini S
April 18, 2026

TL;DR

  • Temperature scales the logits before softmax. Low temp sharpens the probability distribution, high temp flattens it.
  • Temp 0.0 = deterministic (same answer every time). Use for data extraction, math, code.
  • Temp 0.7-1.0 = creative sweet spot. Temp 1.5+ = increasingly random and incoherent.
  • Temperature works alongside Top-K and Top-P, but they control different things.
  • You can test all of this in the TinkerLLM playground in under 2 minutes.

You changed temperature to 0 and the model started repeating the exact same answer every time. You changed it to 1.5 and the output turned into word salad. The setting clearly does something. But what?

Most explanations say “temperature controls creativity.” That’s technically accurate but practically useless. It’s like saying the steering wheel controls where the car goes.

Here’s what temperature actually does under the hood, why different values produce such different results, and how to pick the right setting for whatever you’re building.

How the model picks each word

Before temperature makes sense, you need to understand how a model decides what to write next.

LLMs don’t generate sentences. They predict one token at a time. A token is roughly one word or word fragment. For every token position, the model calculates a raw score, called a logit, for every token in its vocabulary.

A model like Gemini has a vocabulary of about 256,000 tokens. So for every single token it generates, it produces 256,000 scores. These raw logits then go through a function called softmax that converts them into probabilities.

The result looks something like this:

| Token | Raw Logit | Probability (after softmax) |
|---|---|---|
| “blue” | 4.2 | 45% |
| “red” | 3.1 | 20% |
| “green” | 2.7 | 15% |
| “purple” | 1.9 | 8% |
| “clear” | 1.4 | 5% |

The model then samples from this distribution. It picks a token based on these probabilities. “Blue” wins most of the time, but “red” or “green” can show up too.

Temperature changes the shape of this distribution before sampling happens. That’s the entire mechanism.
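
Here’s that selection step in a few lines of Python. This is a toy sketch with a five-token vocabulary, so the percentages won’t exactly match the table above (which assumes the rest of a 256,000-token vocabulary absorbs some probability mass):

```python
import numpy as np

tokens = ["blue", "red", "green", "purple", "clear"]
logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

# Softmax: exponentiate each logit, then normalize so the scores sum to 1.
probs = np.exp(logits) / np.exp(logits).sum()

for token, p in zip(tokens, probs):
    print(f"{token:>8}: {p:.1%}")

# Sample one token according to those probabilities.
rng = np.random.default_rng()
print("sampled:", rng.choice(tokens, p=probs))
```

Run it a few times: “blue” wins most often, but the other colors show up too.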

The math behind it

The standard softmax formula converts logits into probabilities:

P(token_i) = e^(logit_i) / Σ e^(logit_j)

With temperature, the formula becomes:

P(token_i) = e^(logit_i / T) / Σ e^(logit_j / T)

Where T is the temperature value. That division by T before the exponential is doing all the work.

When T is less than 1 (say 0.2), dividing by a small number stretches the logits further apart. This amplifies the differences between tokens. The highest-probability token becomes even more dominant. The distribution gets sharper.

When T is greater than 1 (say 1.5), dividing by a large number squeezes the logits toward each other. This compresses the differences. All tokens become closer in probability. The distribution gets flatter.

When T approaches 0, the highest-logit token approaches 100% probability. The model always picks its top choice. This is called greedy decoding. (In practice, T = 0 is special-cased to pick the argmax directly, since dividing by zero is undefined.)

Short version: low temperature makes the model more confident in its first choice. High temperature makes it consider more options equally.
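
You can watch the sharpening and flattening happen with the same toy logits. A minimal numpy sketch, not production decoding code:

```python
import numpy as np

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax; subtract the max for numerical stability.
    scaled = logits / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

for T in (0.2, 0.7, 1.0, 1.5):
    print(f"T={T}:", np.round(softmax_with_temperature(logits, T), 3))
```

At T=0.2 the top token takes nearly all of the probability mass. At T=1.5 the five options sit much closer together.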

Temperature 0: Deterministic mode

At temperature 0, the model picks the token with the highest logit every time. No sampling randomness. Run the same prompt 10 times and you get the same answer 10 times. (Some serving stacks introduce tiny nondeterminism of their own, but the sampling step itself is fixed.)

This is useful more often than people expect.

When to use temp 0:

  • Extracting structured data from text (emails, dates, names)
  • Classification (sentiment analysis, categorization, labeling)
  • Math and logic problems
  • Code generation where consistency matters
  • Any task where you need the same answer reliably

Try it yourself: Open the TinkerLLM playground, set temperature to 0.0, and type “What is 123 x 456?” three times. Same answer. Same explanation. Same formatting. That’s greedy decoding in action.
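
No playground handy? The same behavior falls out of the math. A toy sketch (real APIs special-case temperature 0 as an argmax rather than dividing by zero):

```python
import numpy as np

tokens = ["blue", "red", "green", "purple", "clear"]
logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

# Greedy decoding: temperature 0 means always take the highest-logit token.
for _ in range(3):
    print(tokens[int(np.argmax(logits))])  # "blue", three times out of three
```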

Temperature 0.5-0.7: The practical default

Most production applications land here. Chatbots, writing assistants, Q&A systems, summarization tools.

At 0.5-0.7, the model still favors high-probability tokens but occasionally picks alternatives. The output reads naturally without getting erratic. If you’re not sure what temperature to use, start at 0.7 and adjust from there.

When to use 0.5-0.7:

  • Customer-facing chatbots (variety without hallucinations)
  • Summarization (consistent but not robotic)
  • Translation (accuracy with natural phrasing)
  • General Q&A where tone matters

This range is where the model sounds human without being unreliable.

Temperature 1.0: Unmodified probabilities

At 1.0, you get the model’s original probability distribution. The softmax output is used exactly as computed. No scaling in either direction.

Run the same prompt twice at temperature 1.0 and you’ll get meaningfully different responses. Different word choices, different structures, sometimes different conclusions.

When to use temp 1.0:

  • Brainstorming (you want multiple diverse ideas)
  • Creative writing (poetry, fiction, humor)
  • Generating options for A/B testing
  • Any task where “surprising” is a feature, not a bug

Try it yourself: Set temperature to 1.0 in TinkerLLM. Type “Give me 3 names for a pet rock.” Send it three times. Different names each time. At temp 0, you’d get the same three every time.

Temperature 1.5+: Where coherence breaks

Above 1.0, the distribution flattens fast. Tokens that the model originally considered unlikely become almost as probable as the top choices.

At 1.5, you get surprising word combinations that sometimes feel creative. At 2.0, you get combinations that feel broken. The model produces text that’s syntactically plausible but semantically nonsensical.

The mistake people make here is equating randomness with creativity. They set temperature to 1.5 thinking they’ll get “more creative” output. What they get is “more random” output. A random sentence isn’t creative. A random sentence is random.

For genuinely creative tasks, 0.7-1.0 gives you the variety you want without sacrificing coherence. Go above 1.0 only when you understand you’re trading quality for surprise.

Try it yourself: Set temperature to 2.0 in TinkerLLM. Ask for a poem. Read it. Now set it to 0.8 and ask for the same poem. The 0.8 version will almost always be better writing, even though the 2.0 version had more “randomness.”
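
One way to put a number on “too flat” is the entropy of the distribution: 0 bits means one token is certain, and log2(5) ≈ 2.32 bits means our five toy tokens are equally likely. A minimal sketch:

```python
import numpy as np

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])

def entropy_at_temperature(T):
    scaled = logits / T
    p = np.exp(scaled - scaled.max())
    p /= p.sum()
    # Shannon entropy in bits.
    return -(p * np.log2(p)).sum()

for T in (0.5, 0.8, 1.0, 1.5, 2.0):
    print(f"T={T}: {entropy_at_temperature(T):.2f} bits")
```

Entropy climbs steadily with temperature. By T=2.0 the toy distribution is close to uniform, which is exactly why the output reads like word salad.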

How temperature interacts with Top-K and Top-P

Temperature isn’t the only randomness control. It works alongside two other parameters that approach the problem differently.

Top-K restricts how many tokens the model can choose from. If K=40, the model only considers the 40 highest-probability tokens. Everything else gets zeroed out before sampling.

Top-P (Nucleus Sampling) restricts by cumulative probability instead. If P=0.9, the model considers the smallest set of tokens whose probabilities add up to 90%. When the model is confident, this means fewer candidates. When it’s uncertain, more.

These three controls are applied in sequence: logits → temperature scaling → Top-K filtering → Top-P filtering → sampling.

Temperature changes the shape of the distribution. Top-K changes the size of the candidate pool. Top-P changes the coverage of the candidate pool. They do different things and can work together.

| Setting | What It Controls | Typical Default |
|---|---|---|
| Temperature | How spread out probabilities are | 1.0 |
| Top-K | Maximum number of candidate tokens | 40 |
| Top-P | Cumulative probability threshold | 0.9-0.95 |

For most tasks, adjusting temperature alone is enough. Add Top-K or Top-P when you want finer control. For example, keeping temperature at 1.0 for diversity but using Top-K=10 to prevent truly unexpected token choices.
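
Here’s that sequence as a toy numpy sketch. It illustrates the order described above, not any particular library’s implementation:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=40, top_p=0.9, rng=None):
    """Toy decoding pipeline: temperature -> Top-K -> Top-P -> sample."""
    rng = rng or np.random.default_rng()

    # 1. Temperature scaling, then softmax (max subtracted for stability).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 2. Top-K: drop everything outside the K most probable tokens.
    if top_k < len(probs):
        kth_largest = np.sort(probs)[-top_k]
        probs[probs < kth_largest] = 0.0
        probs /= probs.sum()

    # 3. Top-P: keep the smallest set of tokens, from most probable down,
    #    whose cumulative probability reaches P.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    probs = kept / kept.sum()

    # 4. Sample from whatever survived the filters.
    return int(rng.choice(len(probs), p=probs))

logits = np.array([4.2, 3.1, 2.7, 1.9, 1.4])
print(sample_next_token(logits, temperature=1.0, top_k=3, top_p=0.9))
```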

TinkerLLM’s Lessons 5 and 6 cover Top-K and Top-P with exercises where you adjust both alongside temperature and observe the combined effect.

Temperature settings by use case

Here’s a reference table based on what works in practice across different tasks. These aren’t theoretical recommendations. They’re the values that produce good results when you actually run them.

| Use Case | Temperature | Why |
|---|---|---|
| Data extraction | 0.0 | Needs identical, consistent output |
| Classification | 0.0-0.1 | Same input should produce same label |
| Math and logic | 0.0 | Deterministic reasoning |
| Code generation | 0.0-0.3 | Consistency over variety |
| Summarization | 0.3-0.5 | Accuracy first, slight natural variation |
| Chatbot / Q&A | 0.5-0.7 | Sounds natural, mostly reliable |
| Translation | 0.3-0.5 | Accuracy with natural phrasing |
| Creative writing | 0.7-1.0 | Diversity and surprise |
| Brainstorming | 0.8-1.0 | Maximum idea variety |
| Experimental | 1.0-1.5 | When you want to see what happens |

Temperature across different models

The same temperature value doesn’t behave identically across models. Each model has different internal logit distributions, so temp 0.7 on Gemini produces different output characteristics than temp 0.7 on GPT-4o. Always test with your specific model.

| Model | Default Temp | Range | Notes |
|---|---|---|---|
| Gemini 2.5 Flash | 1.0 | 0.0-2.0 | Higher default than most. Consider lowering for structured tasks. |
| Gemini Pro | 1.0 | 0.0-2.0 | Same range as Flash |
| GPT-4o | 1.0 | 0.0-2.0 | OpenAI recommends 0.7 for most tasks |
| Claude 3.5 | 1.0 | 0.0-1.0 | Narrower range. Can’t go above 1.0. |
| Llama 3 | 0.6 | 0.0-2.0 | Lower default reflects Meta’s stability preference |

The 2-minute experiment

This entire post is something you can verify yourself instead of taking on faith.

  1. Open the TinkerLLM playground
  2. Set temperature to 0.0
  3. Type: “What is 123 x 456?” and send
  4. Send the exact same prompt again. Identical answer.
  5. Set temperature to 1.0
  6. Type: “Give me 3 unique names for a pet rock” and send
  7. Send the same prompt again. Different names.
  8. Set temperature to 2.0
  9. Send any prompt. Watch the output get strange.
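
If you’d rather script the experiment than click through a playground, here’s a sketch using the OpenAI Python SDK as a stand-in (TinkerLLM isn’t required; any API that exposes a temperature parameter behaves the same way):

```python
# Assumes `pip install openai` and an OPENAI_API_KEY in your environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt, temperature):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# Temperature 0: the two answers should match.
print(ask("What is 123 x 456?", 0.0))
print(ask("What is 123 x 456?", 0.0))

# Temperature 1.0: expect different names each run.
print(ask("Give me 3 unique names for a pet rock", 1.0))
print(ask("Give me 3 unique names for a pet rock", 1.0))
```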

That’s temperature. Not a concept to memorize for an interview. Something you observe with your own eyes.

TinkerLLM’s Lessons 3 and 4 cover temperature with guided exercises for both consistency (low temp) and creativity (high temp). The exercises validate your understanding in real time. If you can set the right temperature for a given task, the exercise passes. If you can’t, it tells you what went wrong.

FAQ

What temperature should I use for chatbots?

0.5-0.7 for most customer-facing chatbots. Enough variation that responses don’t sound robotic. Enough consistency that the bot doesn’t contradict itself or hallucinate. For chatbots handling sensitive topics (medical, financial, legal), drop to 0.2-0.3. The risk of a wrong answer outweighs the benefit of natural-sounding prose.

Does temperature affect hallucinations?

Yes. Higher temperature increases the probability of selecting low-confidence tokens, which are more likely to be factually wrong. Lowering temperature is one of the first things to try when a model produces incorrect facts. It won’t eliminate hallucinations entirely (that’s a deeper architectural problem), but it reduces their frequency. TinkerLLM’s Lesson 17 covers hallucinations with exercises where you observe the difference across temperature settings.

Can I just use temperature 0 for everything?

Technically, yes. But the output will sound robotic and repetitive. Temp 0 always picks the single most likely token, which means every response follows the same patterns. For factual Q&A and data extraction, that’s fine. For anything involving natural conversation, it produces flat, predictable text that users notice.

What’s the difference between temperature and Top-P?

Temperature scales the entire probability distribution. It changes how spread out probabilities are. Top-P filters the distribution by cumulative probability, only considering tokens that together represent X% of the total probability mass. Temperature is a global multiplier. Top-P is a dynamic cutoff. They solve different problems and can be combined. More on Top-P in the TinkerLLM curriculum.

Why does temperature 1.5 produce gibberish?

At 1.5, every logit gets divided by 1.5 before softmax. This compresses the differences between tokens so that rare, unlikely tokens become almost as probable as the top choices. The model starts selecting tokens that are syntactically possible but semantically wrong. It’s not malfunctioning. The math is working exactly as designed. The distribution is just too flat to produce coherent text.

Is there a temperature equivalent in image generation models?

In diffusion models like Stable Diffusion and DALL-E, the “guidance scale” or “CFG scale” serves a similar purpose. Higher values make the model follow the prompt more strictly (similar to low temperature). Lower values allow more variation (similar to high temperature). The mechanism is different, but the trade-off between fidelity and diversity is the same.

What temperature does Google use for Gemini in production?

Google hasn’t published exact values, but based on the Gemini API defaults, the starting point is 1.0. For factual applications like AI Overviews in Search, they almost certainly use values closer to 0. For creative features like “Help me write” in Gmail, something in the 0.7-0.9 range.

Does changing temperature cost more?

No. Temperature doesn’t affect token count or API pricing. It only changes which tokens are selected during generation, not how many are generated. A prompt at temperature 0 costs exactly the same as the same prompt at temperature 2.0.

Tags: temperature · LLM parameters · prompt engineering · Gemini · AI fundamentals · softmax
Dharini S · The Educator

Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.


Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 26 exercises free.

Start Tinkering