All posts
AI CrawlersRenderingTechnical

How AI Crawlers Read Your Site

ChatGPT, Perplexity, and Google AI Overviews fetch your pages differently than a browser. Here's what their crawlers see, and why server-rendered HTML wins.

By GEO Repair

When you open a page, your browser downloads the HTML, runs the JavaScript, fetches data, and paints a finished screen. AI crawlers, the bots behind ChatGPT, Perplexity, Google AI Overviews, and Claude, usually do far less. Understanding that gap is the single most useful thing you can learn about AI Search Optimization, because almost every common failure traces back to it.

What a crawler actually fetches

Most AI crawlers make a plain HTTP request and read the raw HTML that comes back, the same thing you'd see with curl or "View Source," not the fully-rendered DOM you see in DevTools. Many of them do not execute JavaScript at all, and the ones that do treat it as expensive and optional.

That has a blunt consequence: if your headline, your article body, or your answer only appears after the page hydrates on the client, a crawler that doesn't run JavaScript sees none of it. To that bot, your page is close to blank.

You can check this yourself in seconds:

curl -s https://example.com | grep -i "your headline text"

If your most important content doesn't show up in that output, it likely doesn't show up for a non-rendering crawler either.

Why server-rendered HTML wins

The fix is to make your meaningful content present in the initial HTML response. With a framework like Next.js, that means server components and server-side rendering rather than client-only data fetching for anything that matters for discovery.

Server-rendered HTML wins for AI search because it is:

  • Always visible: no JavaScript execution required to see the content.
  • Cheap to fetch: one request returns the whole answer, so crawlers don't skip it.
  • Stable: the content doesn't depend on a client runtime that might error or time out.

Interactive widgets are fine. The rule is simpler than it sounds: the selling content and the answers should exist in the HTML before any script runs. Layer interactivity on top, but don't hide the substance beneath it.

Crawlers respect your rules, including the accidental ones

AI crawlers identify themselves with user-agents like GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended, and they honor robots.txt. That's good, until a stale rule blocks them by accident. A Disallow: / left over from a staging environment, or a blanket block of an unfamiliar user-agent, will quietly remove you from consideration.

Make sure your robots.txt allows the AI user-agents you want to reach, points to a valid XML sitemap, and doesn't fence off the pages you most want read.

Structure tells a crawler what it's looking at

Once a crawler has your HTML, structure decides how confidently it can extract an answer:

  • One <h1> and a sensible heading order signal what the page is about and how it's organized.
  • Real landmarks (<main>, <article>, <nav>) separate content from chrome so a model isn't quoting your footer.
  • JSON-LD structured data states facts explicitly: this is an Article, here's the author, here's the published date.
  • Question-shaped headings with self-contained answers map directly to how people phrase queries.

This is the same semantic hygiene that helps screen readers and traditional search. AI search just raises the stakes, because the crawler is trying to understand and reuse your content, not merely rank it.

The honest takeaway

Doing all of this won't guarantee that any AI system cites you; nobody can promise that. What it does is remove the technical reasons you'd be skipped: an empty page to a non-rendering bot, a blocked crawler, or content a model can't confidently parse. Get the readiness right, and you've done the part that's actually in your control. Run a checkup to see exactly where your pages stand.