Introduction to Prompt Engineering, Lesson 04

Types of AI Models (LLMs, Diffusion, etc.)

LLM vs Diffusion vs Multi-modal — when to reach for which.

Definition

Not all AI is the same. LLMs are next-token predictors for text. Diffusion models start from random noise and iteratively de-noise it into an image. Multi-modal models handle more than one of text, image, audio and video. Knowing which model family fits your task is half the battle of building an AI feature.
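To make "next-token predictor" concrete, here's a toy sketch. The probability table is made up, and this toy only looks at the previous token (a bigram), whereas a real LLM scores every token in its vocabulary using the whole preceding context. The loop shape is the same, though: predict a token, append it, repeat until an end token.

```python
# Toy illustration of next-token prediction (NOT a real LLM).
# The "model" is a hypothetical hand-written table of next-token
# probabilities; a real LLM computes these with a neural network.
NEXT_TOKEN_PROBS: dict[str, dict[str, float]] = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "model": 0.2},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "dog": {"ran": 0.9, "<end>": 0.1},
    "model": {"<end>": 1.0},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate_greedy(max_tokens: int = 10) -> list[str]:
    """Generate tokens one at a time, always picking the most likely next token."""
    tokens: list[str] = []
    current = "<start>"
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[current]
        # Greedy decoding: take the highest-probability continuation.
        current = max(candidates, key=candidates.get)
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


print(" ".join(generate_greedy()))  # the cat sat
```

Real APIs usually sample from the probabilities instead of always taking the top token; the temperature setting controls how adventurous that sampling is.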

The 4 Families You'll Actually Touch
| Family | Examples | Best for |
|---|---|---|
| LLMs (text) | GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro | Code, copy, analysis, classification, RAG |
| Diffusion (image) | DALL·E, Stable Diffusion, Midjourney, Nano Banana | Hero images, illustrations, mockups |
| Multi-modal | GPT-5 Vision, Gemini 3, Claude Vision | Screenshot analysis, OCR-ish tasks, video summaries |
| Speech (TTS / STT) | OpenAI Whisper, ElevenLabs, OpenAI TTS | Voice notes, podcast transcripts, voice agents |
Picking the Right Model
  • Need to write or analyse text? → LLM. Pick the cheapest tier that produces good output.
  • Need to generate or edit an image? → Diffusion. Nano Banana for speed, GPT Image 1 for premium quality.
  • Need to understand a screenshot or PDF? → Multi-modal LLM (GPT-4o / Claude Vision).
  • Need real-time voice in / voice out? → Whisper + LLM + TTS pipeline (or a unified voice model).
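The decision list above can be written as a plain lookup table. This is a minimal sketch: the task labels, the `ModelFamily` enum and the `pick_family` helper are hypothetical names for illustration, not part of any SDK.

```python
from enum import Enum


class ModelFamily(Enum):
    LLM = "LLM (text)"
    DIFFUSION = "Diffusion (image)"
    MULTIMODAL = "Multi-modal LLM"
    SPEECH_PIPELINE = "Speech-to-text + LLM + text-to-speech"


# Hypothetical task labels, each mapped to the family suggested in the list.
TASK_TO_FAMILY: dict[str, ModelFamily] = {
    "write_text": ModelFamily.LLM,
    "analyse_text": ModelFamily.LLM,
    "generate_image": ModelFamily.DIFFUSION,
    "edit_image": ModelFamily.DIFFUSION,
    "understand_screenshot": ModelFamily.MULTIMODAL,
    "understand_pdf": ModelFamily.MULTIMODAL,
    "realtime_voice": ModelFamily.SPEECH_PIPELINE,
}


def pick_family(task: str) -> ModelFamily:
    """Return the model family for a known task label; raise for unknown ones."""
    try:
        return TASK_TO_FAMILY[task]
    except KeyError:
        # Unknown tasks need a human decision rather than a silent default.
        raise ValueError(f"Unknown task {task!r}; decide manually") from None


print(pick_family("understand_pdf").value)  # Multi-modal LLM
```

Raising on unknown tasks, instead of defaulting to an LLM, keeps the "don't pick based on hype" rule honest.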

Rule of thumb: start with the cheapest model that solves the task. Upgrade only when you can measure quality wins.
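One way to apply this rule: run each candidate model on your own test set, then pick the cheapest one that clears your quality bar. The model names, prices and scores below are placeholders, not real figures, and `cheapest_acceptable` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical model name
    cost_per_1m_tokens: float  # blended USD price, taken from your own pricing sheet
    eval_score: float          # fraction of your test cases passed (0.0 to 1.0)


def cheapest_acceptable(candidates: list[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose measured score meets the bar, or None."""
    acceptable = [c for c in candidates if c.eval_score >= quality_bar]
    return min(acceptable, key=lambda c: c.cost_per_1m_tokens, default=None)


# Placeholder numbers for illustration only; NOT real prices or benchmark results.
models = [
    Candidate("small-model", cost_per_1m_tokens=0.5, eval_score=0.78),
    Candidate("medium-model", cost_per_1m_tokens=3.0, eval_score=0.91),
    Candidate("large-model", cost_per_1m_tokens=15.0, eval_score=0.94),
]
choice = cheapest_acceptable(models, quality_bar=0.90)
print(choice.name if choice else "nothing meets the bar")  # medium-model
```

The bar comes first and the price second: a cheaper model that fails your tests is not actually cheaper once you count the bad outputs.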

What Does NOT Belong in This List

Lots of older 'AI' tooling is not generative — it's classical ML or rule-based:

  • Logistic regression / random forests / gradient boosted trees → for tabular data.
  • Topic modelling (LDA) → for grouping documents by theme (a kind of soft clustering).
  • Hard-coded heuristics → and that's fine! Don't drag an LLM into a problem that a simple 'if x in list:' check already solves.
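For example, if all you need is to flag refund requests in support tickets, a keyword rule may be enough. A minimal sketch, assuming a hypothetical keyword set; measure it against real tickets before deciding you need a model.

```python
# Hypothetical keyword set for routing support tickets without any AI model.
REFUND_KEYWORDS = {"refund", "money back", "chargeback"}


def is_refund_request(message: str) -> bool:
    """Return True if the message mentions any refund keyword (case-insensitive)."""
    text = message.lower()
    return any(keyword in text for keyword in REFUND_KEYWORDS)


print(is_refund_request("I want my money back!"))        # True
print(is_refund_request("How do I reset my password?"))  # False
```

It runs in microseconds, costs nothing per call, and its behaviour is fully predictable; reach for an LLM only when rules like this measurably miss too many cases.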
Key Takeaways
  • LLM for text, Diffusion for images, Multi-modal for screenshots/video, Speech for voice.
  • Each family has its own SDK, pricing, and gotchas — don't pick based on hype.
  • Start cheap; upgrade only when output quality doesn't meet the bar.
  • Not every problem needs an LLM. Classical ML or plain code is often the right answer.

Practice Questions
  1. Sketch the model architecture of an AI tutor: what models do you wire together for chat, audio, and an avatar?
  2. Pick a feature in your app. Decide which model family fits — and write a 2-line justification.
  3. Compare the cost per 1K tokens for GPT-5.2 vs. Claude Sonnet 4.5 vs. Gemini 3 Pro (providers usually quote prices per 1M tokens, so divide by 1,000). Which is the cheapest acceptable option?
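For question 3, a small helper keeps the arithmetic honest. Providers usually quote prices in USD per 1M tokens, with separate input and output rates, so the sketch takes both. The prices in the example call are placeholders, not real quotes; copy the current numbers from each provider's pricing page.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Cost of one request in USD, given prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_1m + output_tokens * output_price_per_1m) / 1_000_000


def per_1k(price_per_1m: float) -> float:
    """Convert a per-1M-token price into a per-1K-token price."""
    return price_per_1m / 1_000


# Placeholder prices, NOT real quotes.
cost = request_cost_usd(2_000, 500, input_price_per_1m=1.0, output_price_per_1m=4.0)
print(f"${cost:.4f}")  # $0.0040
```

Comparing per-request cost on a realistic input/output mix is usually more informative than comparing headline per-token prices, because output tokens are often priced higher than input tokens.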
Pro Tips
  • Bookmark provider pricing pages — they change every few months.
  • Cache LLM calls whose answers you can reuse (same model, prompt and settings, ideally temperature 0): a cache hit returns instantly and costs no tokens.
  • Diffusion images can take 2-30 seconds to generate, so never block your UI on them. Always generate asynchronously.
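A minimal sketch of the caching tip, assuming an in-memory dict as the cache and a caller-supplied `call_llm` function (a hypothetical stand-in for whichever SDK you use). The key hashes the model name, prompt and generation settings together, so changing any of them triggers a fresh call.

```python
import hashlib
import json
from typing import Callable

# Simple in-memory cache (an assumption; production apps often use Redis or a database).
_cache: dict[str, str] = {}


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Hash everything that influences the output into one stable key."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(
    model: str,
    prompt: str,
    params: dict,
    call_llm: Callable[[str, str, dict], str],
) -> str:
    """Return the stored answer on a cache hit; otherwise call the model and store it."""
    key = cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_llm(model, prompt, params)
    return _cache[key]


# Usage with a fake model that records how often it is really called.
calls: list[str] = []


def fake_llm(model: str, prompt: str, params: dict) -> str:
    calls.append(prompt)
    return f"answer to {prompt}"


settings = {"temperature": 0}
first = cached_completion("example-model", "Summarise this lesson", settings, fake_llm)
second = cached_completion("example-model", "Summarise this lesson", settings, fake_llm)
print(first == second, len(calls))  # True 1
```

`sort_keys=True` makes the key independent of the order in which settings were written, so `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` hit the same cache entry.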
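And a minimal `asyncio` sketch of the "never block your UI" tip. `generate_image` is a hypothetical stand-in that just sleeps; in a real web app you would more likely hand the job to a background worker or queue and notify or poll the client when the image is ready.

```python
import asyncio


async def generate_image(prompt: str) -> str:
    """Hypothetical stand-in for a slow image-generation API call."""
    await asyncio.sleep(0.2)  # simulated latency; real calls often take several seconds
    return f"image-for:{prompt}"


async def main() -> list[str]:
    events: list[str] = []
    # Start generation in the background instead of awaiting it right away.
    job = asyncio.create_task(generate_image("hero banner"))
    events.append("UI rendered with placeholder")  # the UI responds immediately
    result = await job  # later: swap the placeholder for the finished image
    events.append(f"UI updated with {result}")
    return events


print(asyncio.run(main()))
```

The placeholder is shown before the image exists; the user sees a response right away and the image arrives when it is ready.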