Types of AI Models (LLMs, Diffusion, etc.)
LLM vs Diffusion vs Multi-modal — when to reach for which.
Definition
Not all AI is the same. LLMs are next-token predictors for text. Diffusion models start from random noise and iteratively denoise it into an image. Multi-modal models accept some combination of text, image, audio, and video in a single model. Knowing which family of model fits your task is half the battle of building an AI feature.
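To make "next-token predictor" concrete, here is a toy sketch. The probability table `BIGRAMS` and the `generate` helper are invented for illustration; a real LLM learns its probabilities from data and conditions on the whole context, not just the previous token.

```python
# Toy "language model": for each token, the probability of each possible next token.
# The numbers are made up for illustration only.
BIGRAMS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 1.0},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "dog": {"sat": 0.5, "<end>": 0.5},
    "sat": {"<end>": 1.0},
}


def generate(max_tokens: int = 10) -> list[str]:
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens: list[str] = []
    current = "<start>"
    for _ in range(max_tokens):
        candidates = BIGRAMS[current]
        current = max(candidates, key=candidates.get)
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


print(" ".join(generate()))  # prints "the cat sat"
```

Production models usually sample from the distribution instead of always taking the top token, which is why the same prompt can give different answers.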
The 4 Families You'll Actually Touch
| Family | Examples | Best for |
|---|---|---|
| LLMs (text) | GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro | Code, copy, analysis, classification, RAG |
| Diffusion (image) | DALL·E, Stable Diffusion, Midjourney, Nano Banana | Hero images, illustrations, mockups |
| Multi-modal | GPT-5 Vision, Gemini 3, Claude Vision | Screenshot analysis, OCR-ish tasks, video summaries |
| Speech (TTS / STT) | OpenAI Whisper, ElevenLabs, OpenAI TTS | Voice notes, podcast transcripts, voice agents |
Picking the Right Model
- Need to write or analyse text? → LLM. Pick the cheapest tier that produces good output.
- Need to generate or edit an image? → Diffusion. Nano Banana for speed, GPT Image 1 for premium quality.
- Need to understand a screenshot or PDF? → Multi-modal LLM (GPT-4o / Claude Vision).
- Need real-time voice in / voice out? → Whisper + LLM + TTS pipeline (or a unified voice model).
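The checklist above can be sketched as a small routing function. `Family` and `pick_family` are hypothetical names, and the input/output labels are assumptions; adapt them to however your app describes a task.

```python
from enum import Enum


class Family(Enum):
    LLM = "llm"
    DIFFUSION = "diffusion"
    MULTIMODAL = "multi-modal"
    SPEECH = "speech pipeline"


def pick_family(input_kind: str, output_kind: str) -> Family:
    """Map the kind of input and output to a model family (illustrative helper)."""
    if "audio" in (input_kind, output_kind):
        return Family.SPEECH        # voice in / voice out
    if output_kind == "image":
        return Family.DIFFUSION     # generate or edit an image
    if input_kind in ("image", "pdf", "video"):
        return Family.MULTIMODAL    # understand a screenshot or PDF
    return Family.LLM               # write or analyse text


print(pick_family("text", "text"))   # Family.LLM
print(pick_family("image", "text"))  # Family.MULTIMODAL
```

Checking the output kind before the input kind means an image-editing task (image in, image out) lands on diffusion, matching the second bullet.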
Rule of thumb: start with the cheapest model that solves the task. Upgrade only when you can measure quality wins.
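One way to apply this rule of thumb is to walk the tiers from cheapest to most expensive and stop at the first one that clears your quality bar. This is a minimal sketch: `cheapest_passing` is a hypothetical helper, and the tier names, costs, and scores are placeholders, not real models or benchmarks. In practice `score` would run your own evaluation set.

```python
from typing import Callable, Optional


def cheapest_passing(
    tiers: list[tuple[str, float]],   # (model name, relative cost)
    score: Callable[[str], float],    # your eval: model name -> quality in [0, 1]
    bar: float,                       # minimum acceptable quality
) -> Optional[str]:
    """Return the cheapest model whose measured quality meets the bar."""
    for name, _cost in sorted(tiers, key=lambda tier: tier[1]):
        if score(name) >= bar:
            return name
    return None  # nothing passes: revisit the prompt or the bar


# Placeholder data for illustration only.
tiers = [("large", 10.0), ("small", 1.0), ("medium", 3.0)]
fake_scores = {"small": 0.71, "medium": 0.86, "large": 0.91}

print(cheapest_passing(tiers, lambda name: fake_scores[name], bar=0.85))  # medium
```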
What Does NOT Belong in This List
Lots of older 'AI' tooling is not generative — it's classical ML or rule-based:
- Logistic regression / random forests / gradient boosted trees → for tabular data.
- Topic modelling (LDA) → for discovering themes across a collection of documents.
- Hard-coded heuristics → which is fine! Don't drag an LLM into a problem that `if x in list:` solves.
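As a small illustration of the heuristic case, a membership check answers "is this value allowed?" instantly and for free. The allow-list and the `is_supported` helper below are hypothetical.

```python
# Hypothetical allow-list; in a real app this might come from config.
SUPPORTED_COUNTRIES = {"DE", "FR", "IN", "US"}


def is_supported(country_code: str) -> bool:
    """Plain membership check: no model call, no latency, no token cost."""
    return country_code.strip().upper() in SUPPORTED_COUNTRIES


print(is_supported("in"))  # True
print(is_supported("BR"))  # False
```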
Key Takeaways
- LLM for text, Diffusion for images, Multi-modal for screenshots/video, Speech for voice.
- Each family has its own SDK, pricing, and gotchas — don't pick based on hype.
- Start cheap; upgrade only when output quality doesn't meet the bar.
- Not every problem needs an LLM. Classical ML or plain code is often the right answer.
Interview Questions
Practice Questions
- Sketch the model architecture of an AI tutor: what models do you wire together for chat, audio, and an avatar?
- Pick a feature in your app. Decide which model family fits — and write a 2-line justification.
- Compare cost per 1K tokens for GPT-5.2 vs. Claude Sonnet 4.5 vs. Gemini 3 Pro. Which is the cheapest acceptable option?
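For the cost comparison, a small calculator helps once you have looked up current prices. The sketch below assumes providers bill input and output tokens at separate rates; `Price` and `request_cost` are hypothetical names, and the numbers are placeholders, not any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_1k: float   # USD per 1K input tokens
    output_per_1k: float  # USD per 1K output tokens


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given per-1K-token rates."""
    return (
        (input_tokens / 1000) * price.input_per_1k
        + (output_tokens / 1000) * price.output_per_1k
    )


# Placeholder rates: replace with figures from the provider's pricing page.
example_price = Price(input_per_1k=0.002, output_per_1k=0.008)
print(round(request_cost(example_price, input_tokens=1500, output_tokens=500), 4))
```

Multiply the per-request cost by your expected monthly request volume before comparing models; small per-token differences add up at scale.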
Pro Tips
- Bookmark provider pricing pages — they change every few months.
- Cache repeatable LLM calls aggressively: when the model, prompt, and parameters are identical, serve the stored answer and pay no token cost.
- Diffusion images take 2-30 seconds — never block your UI on them. Always generate async.
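The caching and async tips can be sketched together. This is a minimal illustration, not a production design: `cached_completion`, `generate_image`, and `handle_request` are hypothetical names, `call_model` stands in for whatever SDK call you use, and the in-memory dict would normally be replaced by a shared store.

```python
import asyncio
import hashlib
import json

_cache: dict[str, str] = {}  # in-memory only; swap for a shared store in production


def cache_key(model: str, prompt: str, temperature: float) -> str:
    """Hash everything that can change the answer."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(model, prompt, call_model, temperature=0.0) -> str:
    """Call the model only on a cache miss; repeats cost no tokens."""
    key = cache_key(model, prompt, temperature)
    if key not in _cache:
        _cache[key] = call_model(model, prompt, temperature)
    return _cache[key]


async def generate_image(prompt: str) -> str:
    """Stand-in for a slow image API call (assumption: yours is awaitable)."""
    await asyncio.sleep(0.05)
    return f"image-for:{prompt}"


async def handle_request(prompt: str) -> tuple[str, str]:
    task = asyncio.create_task(generate_image(prompt))  # runs in the background
    placeholder = "rendering..."  # the UI can show this immediately
    image = await task            # collect the result when it is ready
    return placeholder, image


print(asyncio.run(handle_request("hero banner")))
```

Including the model name and temperature in the key avoids serving an answer that was produced under different settings.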