Types of AI Models (LLMs, Diffusion, etc.)

LLM vs Diffusion vs Multi-modal — when to reach for which.

Definition

Not all AI is the same. LLMs predict the next token of text. Diffusion models start from random noise and iteratively denoise it into an image. Multi-modal models accept more than one kind of input, such as text, images, audio, or video. Knowing which family of model fits your task is half the battle of building an AI feature.

The 4 Families You'll Actually Touch

Family	Examples	Best for
LLMs (text)	GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro	Code, copy, analysis, classification, RAG
Diffusion (image)	DALL·E, Stable Diffusion, Midjourney, Nano Banana	Hero images, illustrations, mockups
Multi-modal	GPT-5 Vision, Gemini 3, Claude Vision	Screenshot analysis, OCR-ish tasks, video summaries
Speech (TTS / STT)	OpenAI Whisper, ElevenLabs, OpenAI TTS	Voice notes, podcast transcripts, voice agents

Picking the Right Model

Need to write or analyse text? → LLM. Pick the cheapest tier that produces good output.
Need to generate or edit an image? → Diffusion. Nano Banana for speed, GPT Image 1 for premium quality.
Need to understand a screenshot or PDF? → Multi-modal LLM (GPT-4o / Claude Vision).
Need real-time voice in / voice out? → Whisper + LLM + TTS pipeline (or a unified voice model).
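
The four questions above amount to a small lookup table. A minimal sketch, assuming hypothetical task labels such as read_pdf and voice_chat (these names are illustrative and do not come from any provider SDK):

```python
from enum import Enum


class Family(Enum):
    LLM = "llm"
    DIFFUSION = "diffusion"
    MULTIMODAL = "multi-modal"
    SPEECH = "speech pipeline"


# Hypothetical task labels; rename them to match your own feature list.
ROUTES = {
    "write_text": Family.LLM,
    "analyse_text": Family.LLM,
    "generate_image": Family.DIFFUSION,
    "edit_image": Family.DIFFUSION,
    "read_screenshot": Family.MULTIMODAL,
    "read_pdf": Family.MULTIMODAL,
    "voice_chat": Family.SPEECH,
}


def pick_family(task: str) -> Family:
    """Return the model family for a task label, or raise if it is unknown."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no model family mapped for task {task!r}") from None


print(pick_family("read_pdf").value)  # multi-modal
```

Raising on an unknown label, rather than defaulting to an LLM, forces a deliberate choice for every new feature.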

Rule of thumb: start with the cheapest model that solves the task. Upgrade only when you can measure quality wins.
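
One way to make this rule of thumb measurable is to score each candidate on a fixed evaluation set and take the cheapest one that clears a quality bar. The sketch below assumes made-up tier names, costs, and scores; in practice the score function would run your own evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical tier label, not a real model ID
    cost_per_call: float  # placeholder cost in dollars


def cheapest_passing(
    candidates: Sequence[Candidate],
    score: Callable[[Candidate], float],
    bar: float,
) -> Optional[Candidate]:
    """Try candidates from cheapest to priciest; return the first that meets the bar."""
    for candidate in sorted(candidates, key=lambda c: c.cost_per_call):
        if score(candidate) >= bar:
            return candidate
    return None  # nothing met the bar


tiers = [
    Candidate("large", 0.030),
    Candidate("small", 0.002),
    Candidate("medium", 0.010),
]
# Stand-in scores; replace with results from a real evaluation run.
fake_scores = {"small": 0.71, "medium": 0.88, "large": 0.93}

choice = cheapest_passing(tiers, lambda c: fake_scores[c.name], bar=0.85)
print(choice.name if choice else "no candidate met the bar")  # medium
```

The larger tier scores higher here, but it is never tried because the medium tier already clears the bar.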

What Does NOT Belong in This List

Lots of older 'AI' tooling is not generative — it's classical ML or rule-based:

Logistic regression / random forests / gradient boosted trees → for tabular data.
Topic modelling (LDA) → for grouping documents by topic.
Hard-coded heuristics → which is fine! Don't drag an LLM into a problem that a one-line check like "if x in list:" solves.
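
A rules-first pattern keeps the model out of cases that plain code already decides. This is a sketch under stated assumptions: the blocked domains are made-up values and model_fallback stands in for whatever model call you would otherwise make.

```python
from typing import Callable, Optional

BLOCKED_DOMAINS = {"spam.example", "junk.example"}  # made-up values


def classify_by_rule(sender_domain: str) -> Optional[str]:
    """Return a label when a plain rule decides the case, else None."""
    if sender_domain in BLOCKED_DOMAINS:
        return "spam"
    return None


def classify(sender_domain: str, model_fallback: Callable[[str], str]) -> str:
    """Use the rule first; call the more expensive model only when no rule applies."""
    label = classify_by_rule(sender_domain)
    return label if label is not None else model_fallback(sender_domain)


model_calls = []


def fake_model(domain: str) -> str:
    """Stand-in for a real model call; records that it was invoked."""
    model_calls.append(domain)
    return "not_spam"


print(classify("spam.example", fake_model))  # spam
print(classify("news.example", fake_model))  # not_spam
print(len(model_calls))                      # 1
```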

Key Takeaways

LLM for text, Diffusion for images, Multi-modal for screenshots/video, Speech for voice.
Each family has its own SDK, pricing, and gotchas — don't pick based on hype.
Start cheap; upgrade only when output quality doesn't meet the bar.
Not every problem needs an LLM. Classical ML or plain code is often the right answer.

Practice Questions

Sketch the model architecture of an AI tutor: what models do you wire together for chat, audio, and an avatar?
Pick a feature in your app. Decide which model family fits — and write a 2-line justification.
Compare cost per 1K tokens for GPT-5.2 vs. Claude Sonnet 4.5 vs. Gemini 3 Pro. Which is the cheapest acceptable option?
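
For the cost comparison, a small helper makes the arithmetic explicit. The prices below are placeholders, not real quotes, and model_a and model_b are hypothetical names; copy current per-1K-token input and output prices from each provider's pricing page before drawing conclusions.

```python
def cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one request given per-1K-token prices for input and output."""
    return (input_tokens / 1000) * input_price_per_1k + (
        output_tokens / 1000
    ) * output_price_per_1k


# Placeholder prices (input, output) in dollars per 1K tokens. NOT real quotes.
PRICES = {
    "model_a": (0.0010, 0.0040),
    "model_b": (0.0030, 0.0150),
}

# Example request: 2,000 input tokens and 500 output tokens.
for name, (in_price, out_price) in PRICES.items():
    print(name, round(cost_usd(2000, 500, in_price, out_price), 6))
```

Input and output tokens are usually priced differently, so comparing a single per-token number can mislead when responses are long.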

Pro Tips

Bookmark provider pricing pages — they change every few months.
Cache deterministic LLM calls aggressively: for the same model and prompt, a cache hit returns the stored answer at no token cost.
Diffusion images take 2-30 seconds — never block your UI on them. Always generate async.


Interview Questions

Q01. What is the difference between an LLM and a diffusion model?

Q02. What is a 'multi-modal' model?

Q03. Which model would you reach for to summarise a 1-hour podcast?

Q04. Why is Nano Banana so popular?
