How Search Engines Work: Crawlers, Sitemaps & robots.txt

A beginner-friendly walkthrough by Codekilla of how search engines work, covering crawlers, sitemaps, and robots.txt.

Rahul Chaudhary Thu Apr 30 2026

How Do Search Engines Work?

When you type a query into Google, Bing, or any search engine, you get results in milliseconds. But behind that speed lies a complex system of bots crawling billions of pages, indexing content, and ranking it based on relevance. Search engines don't manually read every website — they send automated programs called crawlers (or spiders) to discover, scan, and catalog your pages. These crawlers follow links, parse content, and build a massive index that powers search results.

To control how these bots interact with your site, you use two critical tools: robots.txt (a file that tells crawlers which paths not to crawl) and sitemap.xml (a roadmap that lists the pages you want indexed). Understanding how crawlers work and how to guide them is often the difference between a site that gets found and one that stays invisible.

Why It Matters

Visibility: If crawlers can't find or access your pages, you won't appear in search results — no matter how great your content is.
Budget Efficiency: Search engines allocate a limited "crawl budget" to each site. Blocking low-value pages saves resources for your important content.
SEO Performance: Properly configured robots.txt and sitemaps help search engines understand your site structure, which can speed up discovery and indexing. They do not raise rankings by themselves, but a page that is never crawled cannot rank at all.
User Experience: Faster crawling and accurate indexing mean users find fresh, relevant content when they search for topics you cover.
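Technical Control: You decide what gets crawled, keeping bots away from draft pages, admin panels, and duplicate content. To keep a URL out of search results entirely, use a noindex directive or authentication, because a URL blocked in robots.txt can still be indexed if other pages link to it.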

How Crawlers Work

Crawlers are bots that follow a simple loop: discover a URL, fetch the page, parse the content, extract links, and repeat. Google's crawler is called Googlebot, Bing uses Bingbot, and there are dozens of others. They start with a seed list of known URLs (from previous crawls, sitemaps, or external links), then recursively follow every link they find.

When a crawler visits your site, it first checks robots.txt to see if you've blocked any paths. If allowed, it downloads the HTML, processes JavaScript (if the bot supports it), and extracts text, images, and links. The content gets sent to the search engine's indexing system, which analyzes keywords, relevance, and quality. The crawler then adds any new URLs to its queue and moves on.

Crawl frequency depends on your site's authority, update frequency, and server performance. High-traffic news sites get crawled every few minutes; small blogs might wait days between visits. You can't force a crawler to visit more often, but you can optimize your site to make crawling efficient.

python
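# Simplified crawler logic (a sketch; a real crawler adds politeness delays and an indexer)
import re
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=10):
    # Load robots.txt once so every URL can be checked before fetching
    robots = urllib.robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
    robots.read()
    site = urlparse(start_url).netloc
    visited, queue = set(), deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not robots.can_fetch("*", url):
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page: move on to the next URL
        # A real crawler would send `html` to the indexing system here
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).netloc == site:  # stay on the same site
                queue.append(link)
    return visited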
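
Link extraction, the step that feeds new URLs back into the crawl queue, can be sketched with Python's standard-library HTML parser. This is a minimal illustration rather than how any particular search engine does it; the `LinkExtractor` class and the two-argument `extract_links(html, base_url)` signature are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Resolving each href against the page's own URL matters because most internal links are relative; non-web schemes such as mailto: are skipped since a crawler cannot fetch them.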

robots.txt: The Gatekeeper

The robots.txt file sits at the root of your domain (https://yoursite.com/robots.txt) and uses a simple syntax to allow or block crawler access. Every reputable bot checks this file before crawling. You specify rules with User-agent (which bot) and Disallow (which paths to block).

Common use cases: hiding admin pages (/wp-admin/), preventing duplicate content (/print/ versions), blocking search result pages (/search?q=), or stopping bots from wasting time on low-value directories like /cgi-bin/. You can also point crawlers to your sitemap here.

Directive	Purpose	Example
`User-agent`	Names the bot the following rules apply to (`*` means all bots)	`User-agent: *`
`Disallow`	Blocks a path or directory	`Disallow: /admin/`
`Allow`	Explicitly allows a path	`Allow: /public/`
`Sitemap`	Links to your sitemap	`Sitemap: https://site.com/sitemap.xml`

Critical point: robots.txt is a directive, not a security tool. Bots honour it voluntarily — malicious scrapers ignore it. Never put sensitive content behind robots.txt alone; use proper authentication.

txt
# Example robots.txt for a blog
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /wp-login.php
Allow: /wp-content/uploads/

# Slow down aggressive bots (Crawl-delay throttles a bot, it does not block it)
User-agent: AhrefsBot
Crawl-delay: 10

Sitemap: https://codekilla.com/sitemap.xml
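
One way to sanity-check rules like these before relying on them is to feed them to Python's standard-library `urllib.robotparser`. The sketch below parses the same rules from a string (no network access); the `load_rules` helper is a name invented for this example, and "Googlebot" is used only as a sample user agent.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /wp-login.php
Allow: /wp-content/uploads/

User-agent: AhrefsBot
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("Googlebot", "https://codekilla.com/admin/settings"))  # False
print(rules.can_fetch("Googlebot", "https://codekilla.com/blog/"))           # True
print(rules.crawl_delay("AhrefsBot"))                                        # 10
```

One detail worth noticing: a bot that has its own group follows only that group. In this parser, AhrefsBot matches the second group, which contains no Disallow lines, so the `/admin/` rule from the `*` group does not apply to it. If a named bot should also be kept out of those paths, the Disallow lines need to be repeated in its group.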

Sitemap: Your Site's Blueprint

A sitemap is an XML file listing all the URLs you want search engines to index, along with metadata like last-modified dates, update frequency, and priority. While crawlers can discover pages by following links, a sitemap ensures they don't miss orphaned pages (content with no internal links) or newly published articles. Treat the metadata as hints: Google has said it ignores changefreq and priority, and uses lastmod only when it is consistently accurate.

Sitemaps are especially useful for large sites, sites with deep navigation hierarchies, or content-heavy platforms where new pages are added daily. You can include images, videos, and news articles in specialized sitemap formats. Most CMS platforms and frameworks (WordPress, Shopify, Next.js via plugins) can auto-generate sitemaps.

xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://codekilla.com/blog/search-engine-working</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://codekilla.com/courses/javascript</loc>
    <lastmod>2025-01-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
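
If your platform does not generate a sitemap for you, a file like the one above can be produced with a few lines of code. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the `build_sitemap` function and the `(url, lastmod)` input shape are assumptions made for this example, and it only emits `loc` and `lastmod`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc        # special characters are escaped
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


pages = [
    ("https://codekilla.com/blog/search-engine-working", "2025-01-15"),
    ("https://codekilla.com/courses/javascript", "2025-01-10"),
]
print(build_sitemap(pages))
```

Building the XML through a library rather than string concatenation means characters such as `&` in query strings are escaped correctly, which is a common cause of sitemaps being rejected.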

Submit your sitemap through Google Search Console and Bing Webmaster Tools. This doesn't guarantee indexing, but it can speed up discovery. Each sitemap file is limited to 50,000 URLs and 50MB uncompressed, so split large sites into multiple sitemaps linked from a sitemap index.
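
The splitting step can be sketched as follows. This is an illustration under stated assumptions: `split_urls` and `build_sitemap_index` are hypothetical helpers, the `sitemap-1.xml` naming scheme is invented, and only the 50,000-URL limit is enforced (the file-size limit would need a separate check).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def split_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index that points at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


urls = [f"https://example.com/page/{n}" for n in range(120_000)]
chunks = list(split_urls(urls))
children = [f"https://example.com/sitemap-{i + 1}.xml" for i in range(len(chunks))]
print(len(chunks))  # 3 child sitemaps for 120,000 URLs
print(build_sitemap_index(children))
```

Each chunk would be written out as its own sitemap file, and only the index URL needs to be submitted or listed in robots.txt.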

Crawl Budget and Optimization

Every site has a crawl budget — the number of pages a search engine will crawl in a given timeframe. Google doesn't publicly share exact numbers, but it's influenced by server speed, site authority, and content freshness. If you waste crawl budget on duplicate pages, parameter URLs, or infinite scroll, important content might not get indexed.

Optimize crawl budget by blocking low-value paths in robots.txt, using canonical tags to consolidate duplicates, fixing broken internal links (every request that ends in a 404 is a wasted crawl), compressing images and files for faster loading, and linking paginated pages with ordinary crawlable links. Google announced in 2019 that it no longer uses rel=next/prev as an indexing signal, so that markup alone won't help with Google. Monitor crawl stats in Search Console to spot issues like sudden drops (server errors) or spikes (crawler traps).

Problem	Solution
Duplicate content (filters, sorts)	Use canonical tags or block in robots.txt
Slow server response	Upgrade hosting, enable caching, use CDN
Infinite scroll/calendar pages	Block date archives, limit crawlable pages
Orphaned pages	Add to sitemap, link from main navigation
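
To see why filter, sort, and tracking parameters eat crawl budget, it helps to look at how many distinct URLs can point at the same content. The sketch below collapses such variants into one canonical form, which is the URL you would put in a canonical tag. The `IGNORED_PARAMS` list is purely hypothetical; which parameters are safe to drop depends on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters assumed not to change what the page shows
IGNORED_PARAMS = {"sort", "order", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url):
    """Strip ignorable query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    kept.sort()  # stable order so ?a=1&b=2 and ?b=2&a=1 collapse together
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


variants = [
    "https://example.com/shoes?sort=price&color=red",
    "https://example.com/shoes?color=red&utm_source=newsletter",
    "https://example.com/shoes?color=red#reviews",
]
print({canonical_url(u) for u in variants})
```

All three variants reduce to `https://example.com/shoes?color=red`, so a crawler that fetched each of them separately would have spent three requests on one page.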

Quick Cheat Sheet

Need	Reach For
Block admin/login pages from crawlers	`robots.txt` with `Disallow: /admin/`
Help search engines find all your pages	XML sitemap at `/sitemap.xml`
Prevent duplicate content indexing	Canonical tags (don't also block the URL in robots.txt, or the tag is never seen)
Speed up indexing of new posts	Submit updated sitemap to Search Console
Check what Google sees on your site	Google Search Console > Page indexing report (formerly Coverage) and URL Inspection
Test robots.txt before deploying	Search Console > robots.txt report (the older robots.txt Tester has been retired)

Common Mistakes

Blocking CSS/JS in robots.txt: Google needs these files to render pages properly. Blocking them can hurt mobile rankings because Google can't see your responsive design.
Forgetting to submit sitemaps: Creating a sitemap but never telling search engines where it is. Always add a Sitemap line to robots.txt and submit the sitemap via webmaster tools.
Using robots.txt for security: Disallowing /secret-documents/ doesn't stop anyone from requesting the URL, and because robots.txt is public, it advertises the path. Use server-level authentication (for example, HTTP auth configured in .htaccess) instead.
Overly complex robots.txt: Long rule lists are hard to audit and easy to get wrong, and Google only processes the first 500 KiB of the file. Keep it simple: block broad directories, not individual files.
Outdated sitemap URLs: Sitemaps pointing to deleted or redirected pages waste crawl budget. Automate sitemap generation to stay current.
Not monitoring crawl errors: Search Console shows which pages return 404s or server errors. Fix or redirect these, or the broken pages will eventually drop out of the index.
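
The outdated-sitemap mistake in particular is easy to check for automatically. Below is a rough sketch that reads a sitemap and reports every entry that does not answer with 200 OK. The `find_stale_urls` and `http_status` names are invented for this example, and because `urlopen` follows redirects, this version catches deleted or erroring pages but not redirected ones.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def http_status(url):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions
    except OSError:
        return None  # DNS failure, timeout, refused connection


def find_stale_urls(sitemap_xml, get_status=http_status):
    """Return (url, status) pairs for sitemap entries that are not 200 OK."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for loc in root.iter(LOC_TAG):
        url = loc.text.strip()
        status = get_status(url)
        if status != 200:
            stale.append((url, status))
    return stale
```

The status lookup is passed in as a parameter so the check can be tried against a dictionary of known responses before pointing it at a live site, for example `find_stale_urls(xml_text, get_status={"https://example.com/old": 404}.get)`.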

💡 Think Like a Programmer: Search engines are glorified web scrapers with priorities — your job is to write clear instructions (robots.txt, sitemaps) so they spend time on what matters. Treat crawlers like collaborators, not adversaries.

// SEO · published by Codekilla