Build Stories · 12 min read

How We Built TinkerLLM: 68 Exercises, 2 Wrong Turns

Full build story: 18 lessons, 68 exercises, React 19, Firebase, and Gemini. Including the two decisions we had to reverse before TinkerLLM actually worked.

Abraham Jeron
April 19, 2026

TL;DR

  • TinkerLLM started as an internal training tool for Kalvium Labs engineers, then went B2C after CS interns learned faster on the playground than from any course we pointed them at.
  • Wrong Turn 1: we built the institutional admin CMS before the student experience. Wrong Turn 2: our first exercise validators were too strict and rejected correct answers.
  • 68 exercises, 18 lessons, client-side validation via has() and hasAny() helpers, BYOK model (your Gemini key, zero marginal cost per user).
  • 26 of 68 exercises are free. No credit card, no API key required for Lessons 1 and 2.

Every time we onboarded a new engineer at Kalvium Labs, the same thing happened. They’d spend weeks shipping LLM features for client products, handling real API calls on production traffic, and if you asked them in a code review what temperature 0.0 actually does to the probability distribution, you’d get a pause. Not a long pause. Just long enough.

We have 200+ AI engineers building products for startups across India, the US, and the Gulf. Most of them learned how LLMs work by encountering edge cases in production. That’s not a great learning environment when the edge case is a client’s data.

So in late 2024, we started building an internal training tool. Six months and two mistakes later, we had something we thought other people might actually want.

The Problem Was Simpler Than It Looked

The original design was what you’d expect from engineers who had seen a lot of courses: theory modules first, then exercises at the end. Standard structure. Logical.

We tested it internally with CS interns and junior engineers from Kalvium Labs. They kept skipping the reading. Less than 5% made it through a full theory screen before jumping to the playground and just trying things. The completion behavior was unambiguous: people wanted to send prompts.

That told us something important. The real problem wasn’t “how do we teach LLM fundamentals.” It was “how do we give someone a reason to keep experimenting until the concept actually clicks on its own.”

The answer was interleaving. Theory lives inside the exercise flow now, not above it as a separate module. You read three sentences about tokens, then you type “Supercalifragilisticexpialidocious” into the playground and watch the token counter respond. You don’t finish a module and then practice. You read and practice in the same breath, sometimes in the same scroll.

That one structural decision shaped everything else. It’s also why we have 68 exercises across 18 lessons instead of 18 lecture videos with one lab each.

Two Wrong Turns Before We Got It Right

Wrong Turn 1: We Built the Admin Dashboard First

We started with the institutional model because that matched how we imagined TinkerLLM working at Kalvium Labs. We run training programs for engineers. We wanted institutions to be able to create batches, add students, and track completion per batch. So that’s what we built first.

Two months of solid work: FireCMS Pro v3 dashboards, Firestore collections for institutions and batches, analytics aggregation via Cloud Functions, relational data models with Firestore references linking batches to institutions and students to batches. It was genuinely well-built. The CMS had drill-down analytics, custom entity drawers, real-time charts.

Then we put three CS interns in front of it.

They didn’t want an institution. They wanted to sign in with Google and start doing exercises. The idea of waiting for an admin to create a batch and onboard them was a non-starter. They weren’t students enrolled in a program through their university. They were developers who found us and wanted to learn something that afternoon.

We’d built a good tool for clients who run cohorts. We’d built nothing for the person who’d find TinkerLLM through a search at 11pm.

The pivot wasn’t clean. The institutional layer still exists in the codebase, and real clients use it for batch training. But we built a parallel B2C self-serve flow on top: any Google sign-in gets access, user doc auto-creates on first login, and the student lands directly in the exercise view. The paywall sits at Lesson 3. Lessons 1 and 2 (26 exercises) are free, no card required, no API key required.
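
For a sense of what that self-serve path looks like, here is a minimal sketch using the Firebase web SDK: Google sign-in, then auto-create the user doc only if it doesn't exist. The collection name, field names, and localStorage-free flow here are assumptions for illustration, not the actual schema.

```ts
import { initializeApp } from "firebase/app";
import { getAuth, GoogleAuthProvider, signInWithPopup } from "firebase/auth";
import { getFirestore, doc, getDoc, setDoc, serverTimestamp } from "firebase/firestore";

const app = initializeApp({ /* Firebase web config */ });
const auth = getAuth(app);
const db = getFirestore(app, "dev-db"); // named Firestore database from the stack list

export async function signInAndEnsureUserDoc() {
  // One-tap Google sign-in; there is no password flow.
  const { user } = await signInWithPopup(auth, new GoogleAuthProvider());

  // Auto-create the user doc on first login only. "users" and the fields below are illustrative.
  const ref = doc(db, "users", user.uid);
  const snapshot = await getDoc(ref);
  if (!snapshot.exists()) {
    await setDoc(ref, {
      displayName: user.displayName,
      email: user.email,
      createdAt: serverTimestamp(),
      paidLessonsUnlocked: false, // flipped later by the payment webhook
    });
  }
  return user;
}
```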

The lesson wasn’t that institutional tooling was wrong. It was that we’d built layer 2 before validating layer 1. Build the student experience first. The admin who runs reports can wait.

Wrong Turn 2: Our Validators Were Too Strict

Every exercise in TinkerLLM has a validate(userPrompt, modelResponse, config) function that runs client-side. When you submit a prompt and the model responds, the validator checks whether the response meets the exercise criteria. If it passes, the exercise marks complete and your progress writes to Firestore.

The first version used exact string matching. If an exercise expected the word “deterministic” somewhere in the model’s response, and the model instead said “always the same output every time,” the validator returned false. Student understood the concept correctly. System said no.

That failure mode appeared everywhere. Exercises covering temperature, tokenization, and hallucinations all had validators that were too rigid. Different Gemini model versions phrased things differently. The same model would phrase things differently on different days. Students were getting the right conceptual understanding and hitting a wall that made no sense to them.

We rewrote the validation layer around two helpers. has(text, ...terms) checks whether all listed terms appear in the response (case-insensitive substring matching). hasAny(text, ...terms) checks whether at least one does. The validators became intentionally loose. Exercise 3-1 (Deterministic Logic, temperature set to 0 on “123*456”) now passes when the response contains the correct arithmetic result and the config shows temperature: 0. It doesn’t care whether the model uses the word “deterministic” or not.
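
To make that concrete, here is a minimal sketch of the two helpers and a validator in the spirit of Exercise 3-1. Exact signatures and the config shape in the real codebase may differ; the behavior follows the description above (case-insensitive substring matching, and 123 × 456 = 56088).

```ts
type ExerciseConfig = { temperature?: number; maxOutputTokens?: number };

// true only if every listed term appears somewhere in the response
function has(text: string, ...terms: string[]): boolean {
  const haystack = text.toLowerCase();
  return terms.every((term) => haystack.includes(term.toLowerCase()));
}

// true if at least one listed term appears
function hasAny(text: string, ...terms: string[]): boolean {
  const haystack = text.toLowerCase();
  return terms.some((term) => haystack.includes(term.toLowerCase()));
}

// Exercise 3-1 (Deterministic Logic): pass if the arithmetic result is present
// and temperature is pinned to 0. No check for the word "deterministic".
const validateDeterministicLogic = (
  userPrompt: string,
  modelResponse: string,
  config: ExerciseConfig
): boolean => has(modelResponse, "56088") && config.temperature === 0;
```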

We also added a postCompletionTip field to every exercise. More on that below.

The tradeoff is that validation is soft by design. Someone motivated could probably game a few exercises. But the cost of a student getting the concept right and being told they’re wrong is higher than the cost of an occasional false positive. You can’t teach someone who gave up because your validator was pedantic.

What Shipped

18 lessons. 68 exercises. Covering tokenization, temperature, top-K, top-P, system instructions, zero-shot prompting, few-shot prompting, chain of thought, stop sequences, formatting, JSON mode, coding, vision, hallucinations, sycophancy, and safety. The lessons fall into four categories: Concepts, Engineering, Coding, and Advanced.

26 of those exercises are free: Lessons 1 and 2. Lesson 1 (Hello, Intelligence) starts with exercise 1-1, where you type “Jana Gana M” and let the model complete the anthem. Not a demo. Not a pre-recorded output. The model is actually responding to your prompt via the Gemini API. Exercise 1-8 asks you to type “What is the pincode of Hogwarts?” and watch what happens. By the time you hit Lesson 2 (The Currency of Tokens), you’re working with a max-tokens=2 constraint on “Continue: Jana Gana Ma” to understand what a hard cutoff actually looks like from the user’s side.

Try it yourself: Exercise 1-1 is free. Open app.tinkerllm.com, sign in with Google, and type “Jana Gana M” with no other instruction. It takes under 60 seconds, and you’ll immediately see something that most AI courses spend 10 minutes explaining.

The playground shows model, temperature, top-K, top-P, max output tokens, system instructions, stop sequences, and response format. Every parameter that matters at the API level is accessible. But the controls visible per lesson change based on what’s being taught: Lesson 3 (Temperature: Consistency) only exposes the temperature slider, because that’s the only knob you need to understand deterministic output. Showing everything at once from day one is how you confuse people.
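
A sketch of what that per-lesson gating could look like as data. The field names here are hypothetical; the post only describes the behavior (Lesson 3 exposes just the temperature slider).

```ts
type PlaygroundControl =
  | "model" | "temperature" | "topK" | "topP"
  | "maxOutputTokens" | "systemInstruction" | "stopSequences" | "responseFormat";

interface LessonConfig {
  id: number;
  title: string;
  visibleControls: PlaygroundControl[]; // only these render in the playground for this lesson
}

const lesson3: LessonConfig = {
  id: 3,
  title: "Temperature: Consistency",
  visibleControls: ["temperature"], // the one knob this lesson is teaching
};
```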

For Lessons 3-18 (the paid tier, Rs. 499 one-time in India, $9 international), the default model is Gemini 2.5 Flash via the @google/genai SDK. Lesson 14 (Coding Specialist) switches to a Pro model. Lesson 17 (Hallucinations and Sycophancy) defaults to Gemini Flash Lite 2.0, because that model is noticeably more sycophantic, which is exactly what makes those exercises work. Exercise 4-1 (Brainstorming) runs at temperature 1.0 and asks for names for a pet rock. Run it twice. You get different names each time. That’s the point.
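
As a rough sketch of what one exercise run looks like through the @google/genai SDK, assuming the student’s key lives under a hypothetical localStorage entry and an illustrative prompt:

```ts
import { GoogleGenAI } from "@google/genai";

// Exercise 4-1 (Brainstorming) pins temperature to 1.0.
async function runBrainstorming(): Promise<string> {
  const ai = new GoogleGenAI({ apiKey: localStorage.getItem("gemini_api_key") ?? "" });
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",          // default model for the paid lessons
    contents: "Suggest five names for a pet rock.",
    config: { temperature: 1.0 },       // maximum sampling variety for this exercise
  });
  return response.text ?? "";           // run it twice: you should get different names
}
```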

Why Client-Side Validation

The BYOK model (Bring Your Own Key) shapes the whole technical architecture. Students enter their Gemini API key and it’s stored in localStorage. API calls go from the browser directly to Google. TinkerLLM’s backend never touches the AI traffic.

That means we can’t do server-side validation. There’s no server in that request path seeing the responses.

But BYOK also means the platform has zero marginal cost per user. We’re not paying for API calls. A student running all 68 exercises pays for their own Google AI Studio quota, which has a free tier generous enough that most students going through the course won’t pay a cent for the API calls themselves. The economics work because we’re not in the middle.

Client-side validation with keyword matching fits this architecture exactly. It’s fast (no network round-trip for grading), it’s cheap, and it’s honest about what it is: a completion signal, not a rigorous grading system. The goal isn’t to catch students who figured out how to pass without learning. The goal is to give a clear “you got it” signal and move them to the next experiment.

OpenRouter support (for non-Gemini models) uses the same BYOK pattern. The key is stored in localStorage, calls go direct from the browser, and routing logic in geminiService.ts checks whether the model name starts with “gemini” to decide which endpoint to hit.
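
A minimal sketch of that routing decision, under the same assumptions as before (hypothetical localStorage key names, OpenRouter’s OpenAI-compatible chat endpoint). The actual code in geminiService.ts may be structured differently; only the prefix check comes from the post.

```ts
import { GoogleGenAI } from "@google/genai";

async function generate(model: string, prompt: string): Promise<string> {
  if (model.startsWith("gemini")) {
    // Gemini models go straight to Google from the browser.
    const ai = new GoogleGenAI({ apiKey: localStorage.getItem("gemini_api_key") ?? "" });
    const response = await ai.models.generateContent({ model, contents: prompt });
    return response.text ?? "";
  }

  // Everything else goes through OpenRouter, also directly from the browser.
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${localStorage.getItem("openrouter_api_key") ?? ""}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}
```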

The Part We Didn’t Plan

Every exercise in the data file has a postCompletionTip field. It wasn’t in the original spec.

What we noticed in early testing: the moment a student hit “success” on an exercise, they clicked Next. The curiosity that had driven them through the exercise evaporated the second they got the checkmark. But that moment, right after a concept clicked, was exactly when they were most willing to experiment.

So we added a tip that appears in the success overlay after completion. Exercise 1-1’s tip says “Try typing ‘Twinkle Twinkle’ and see if it continues correctly.” Exercise 2-1 (Counting Tokens) says “Try a string of emojis and check the token count.” Exercise 1-8 (Hallucination Check) says “Ask about the ‘Great Whale War of 1850’ and see if it invents history.”
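
In the exercise data, that looks roughly like the slice below. The tips quoted are from the exercises above; the interface and other field names are illustrative, not the real data file.

```ts
interface Exercise {
  id: string;
  postCompletionTip: string; // shown in the success overlay, never required for completion
}

const exercises: Exercise[] = [
  { id: "1-1", postCompletionTip: "Try typing 'Twinkle Twinkle' and see if it continues correctly." },
  { id: "2-1", postCompletionTip: "Try a string of emojis and check the token count." },
  { id: "1-8", postCompletionTip: "Ask about the 'Great Whale War of 1850' and see if it invents history." },
];
```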

None of these count toward completion. They’re optional. But early testers started doing them. The checkmark stopped being the end of the exercise and started being the beginning of something else.

We also added a Lab view for internal use: a regression testing mode where we can run all 68 exercises against the live API and see which validators pass or fail after a model update. That wasn’t planned either. It was something we hacked together after Gemini 2.5 Flash got updated and three validators suddenly started behaving differently. Now it’s a proper view in the app.
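
Conceptually, the Lab pass is just a loop: replay each exercise’s stored prompt with its config against the live API, run its validator, and report pass/fail. The shape below is an assumption; only the idea of batch-running all 68 validators comes from the post.

```ts
interface LabExercise {
  id: string;
  prompt: string;
  config: { model: string; temperature?: number; maxOutputTokens?: number };
  validate: (userPrompt: string, modelResponse: string, config: LabExercise["config"]) => boolean;
}

async function runRegression(
  exercises: LabExercise[],
  callModel: (model: string, prompt: string, config: object) => Promise<string>
): Promise<void> {
  const failures: string[] = [];
  for (const ex of exercises) {
    const response = await callModel(ex.config.model, ex.prompt, ex.config);
    const passed = ex.validate(ex.prompt, response, ex.config);
    console.log(`${ex.id}: ${passed ? "PASS" : "FAIL"}`);
    if (!passed) failures.push(ex.id);
  }
  console.log(`${exercises.length - failures.length}/${exercises.length} validators passing`, failures);
}
```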

Try it yourself: Complete exercise 1-1 at app.tinkerllm.com and read the tip that appears in the success overlay. Then actually do what it suggests. Takes 30 seconds.

What Shipped and Where It Runs

The full stack:

  • Student app: React 19, Vite 6, TypeScript, Tailwind. Hosted on Firebase Hosting at app.tinkerllm.com. State managed entirely in App.tsx via React hooks, no external state library.
  • Marketing site: Astro SSG, MDX blog (this post is here), deployed on Cloudflare Pages at tinkerllm.com.
  • Database: Firestore named database dev-db. Authentication via Firebase (Google Sign-In, no password).
  • Backend: Firebase Cloud Functions v2, Node 24. One real trigger (sketched after this list): fires on every new progress record, updates the leaderboard and cascades analytics aggregation for students, batches, and institutions.
  • AI: Gemini 2.5 Flash as the default, via the @google/genai SDK. OpenRouter for non-Google models. Students bring their own keys.
  • Payments: Lemon Squeezy. One-time purchase, Rs. 499 (India) / $9 (international). Handles tax, receipts, and Indian payment methods. Webhook writes a purchase flag to Firestore on success.
  • Admin: FireCMS Pro v3. The institutional layer we built too early, but it’s the right tool for clients running TinkerLLM for their own engineer cohorts.
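
A sketch of that one backend trigger, using the Cloud Functions v2 Firestore API and the Admin SDK. The collection paths, field names, and the leaderboard increment are assumptions; only the trigger pattern and the named dev-db database come from the stack description above.

```ts
import { initializeApp } from "firebase-admin/app";
import { FieldValue, getFirestore } from "firebase-admin/firestore";
import { onDocumentCreated } from "firebase-functions/v2/firestore";

initializeApp();
const db = getFirestore("dev-db"); // named database from the stack list

export const onProgressCreated = onDocumentCreated(
  { document: "progress/{progressId}", database: "dev-db" }, // hypothetical collection path
  async (event) => {
    const progress = event.data?.data();
    if (!progress) return;

    // Bump the student's leaderboard counter; batch and institution rollups cascade similarly.
    await db.doc(`leaderboard/${progress.userId}`).set(
      { completed: FieldValue.increment(1) },
      { merge: true }
    );
  }
);
```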

Build time from “we need an internal training tool” to “this should be a product people can buy” was about six months. That includes two months we spent on the wrong layer first.

Try it yourself: Lesson 17 is where the course gets genuinely surprising. The Strawberry Problem (exercise 17-x: count the R’s in “strawberry”) and the sycophancy exercises are the ones early testers kept sharing. Start with the free exercises and work your way there.

FAQ

What is TinkerLLM and who is it for?

TinkerLLM is a hands-on AI course built around a live LLM playground. Instead of watching someone else send prompts to an AI model, you send the prompts yourself, adjust the parameters, and observe the effects directly. It’s aimed at CS students and junior developers who use AI tools every day but would struggle to explain what temperature does or why a model gets cut off mid-sentence. If your knowledge of LLMs is “I know how to use ChatGPT” and you want your knowledge to be “I understand how this works at the API level,” that’s the gap TinkerLLM covers.

Why real models instead of simulations?

Because the interesting behaviors you need to understand (temperature variance, tokenization quirks, sycophancy, hallucination patterns) only show up when you’re talking to a real model. A simulation would show you what we think should happen. The real Gemini API shows you what actually happens, which is often more interesting and occasionally more surprising. Setting temperature to 0 and running the same prompt three times doesn’t become real until you’ve done it yourself and seen identical outputs. See the temperature explainer post for the full mechanics, including the softmax math.

How long does it take to complete the course?

Lessons 1 and 2 (the 26 free exercises) take most people 2-3 hours, depending on how much time they spend on the optional postCompletionTip experiments. The full 68-exercise course, going through all 18 lessons deliberately, takes around 8-10 hours. You can go faster if you’re already familiar with some concepts. There’s no deadline, no expiry, no live sessions. You work through it at your own pace, and your progress persists in Firestore so you can pick up where you left off.

Do I need an API key? What does it cost?

For the free tier (Lessons 1-2, 26 exercises), you don’t need an API key. The exercises run without one. To unlock Lessons 3-18, you make a one-time payment of Rs. 499 (India) or $9 (international) via Lemon Squeezy. After purchasing, you enter your own Gemini API key from Google AI Studio. That key is stored in your browser’s localStorage, never on our servers. The Gemini API free tier is generous enough that most students going through the paid exercises won’t pay anything for the API calls. You can check the current free tier limits on the Gemini API docs.

Why is the free tier 26 exercises and not just 5?

Because 5 exercises isn’t enough to know whether the course actually works for how you learn. Lessons 1 and 2 cover autocomplete as a mental model, knowledge retrieval, creativity, hallucination previews, tokenization, max token constraints, soft vs. hard limits, format cost, and how system instructions interact with token budgets. That’s real substance. By exercise 2-7 (Conflict: System vs. MaxTokens), you’ll have observed something counterintuitive about how LLMs process competing instructions. That’s worth understanding whether or not you buy the rest. And if it’s useful, you’ll want the other 42 exercises.

How do you validate an open-ended AI response?

Every exercise has a validate(userPrompt, modelResponse, config) function that runs in the browser after each model response. The function uses two helpers: has(text, ...terms) (all listed terms must appear in the response) and hasAny(text, ...terms) (at least one must appear). Validation is intentionally loose because an AI response is never exactly the same twice. Exercise 3-1 (Deterministic Logic) passes if the response contains the correct multiplication result and the config shows temperature at 0. It doesn’t check for specific phrasing. The validation is a completion signal, not a grading rubric, and that distinction matters for learning.

What’s the difference between TinkerLLM and a YouTube AI tutorial?

A YouTube tutorial shows you someone else sending prompts to an AI. You watch it, understand it in the moment, close the tab, and retain maybe 20% a day later because you never actually did anything. TinkerLLM makes you send the prompts. Every concept has a specific exercise where you configure a parameter and observe the effect with your own hands. You can browse the full 18-lesson curriculum before paying for anything. The 26 free exercises exist specifically so you can verify this format works for you before you spend Rs. 499 on the rest.

What models does the playground support?

The default for most lessons is Gemini 2.5 Flash via the @google/genai SDK. Lesson 14 (Coding Specialist) uses a Gemini Pro model for better code generation. Lesson 17 (Hallucinations and Sycophancy) defaults to Gemini Flash Lite 2.0 because it’s more sycophantic by nature, which is what makes those exercises instructive. You can add an OpenRouter API key in Settings and route to any model OpenRouter supports. The routing logic in the app checks the model name prefix: if it starts with “gemini,” the call goes to Google; everything else goes to OpenRouter. Both paths are direct from your browser, BYOK, no server proxy.


Start with Lesson 1 (free) at app.tinkerllm.com.

build story · AI education · learn LLM · TinkerLLM · AI course · Kalvium Labs
Abraham Jeron · The Builder

Engineer at Kalvium Labs. Shares build stories, what went wrong, and what shipped. Writes from the trenches of AI product development.

LinkedIn

Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 26 exercises free.

Start Tinkering