How AI search engines pick which brands to recommend
A look inside the retrieval-and-generation pipeline that powers ChatGPT Search, Perplexity, Google AI Overviews, and Gemini — and what it means for which brands get cited.
When a customer asks ChatGPT, "what's the best running shoe under $150 for flat feet?", the model does not pull a single ranked list out of its training data. It runs a multi-stage pipeline (query rewriting, retrieval, reranking, and generation, with citations attached after the fact), and your brand is either inside that pipeline or invisible to it. This article walks through each stage and shows where Generative Engine Optimization (GEO) tactics actually land.
The four-stage pipeline (in 2026)
Most consumer AI search products are built on the same underlying architecture, often described as retrieval-augmented generation (RAG). The vendors' UIs differ, but the core stages are remarkably consistent:
- Query rewriting. The user's natural-language question is rewritten into one or more web search queries that the retrieval system can execute.
- Retrieval. Those queries hit a web index (reportedly Bing for ChatGPT, Google for Gemini, a custom index for Perplexity, and a hybrid for Claude). Typically the top 5–30 results are fetched.
- Reranking + grounding. The retrieved documents are scored for relevance, chunked, and packed into the model's context window. This is the stage where most documents lose: a fetched page that doesn't survive the relevance rerank never makes it to the model.
- Generation. The model writes the answer, grounding claims in the surviving documents. Citations are typically attached after the fact, mapping back to which document supported which sentence.
Three things follow from this structure. First, your brand can win or lose at any of the four stages independently — the most common failure is being absent from retrieval at all, which we'll dig into next. Second, the model itself is "dumb" about your brand in the sense that it usually doesn't carry detailed memories of small brands through training; it relies on the retrieved documents to know what you sell. Third, whichever document is most retrievable, most relevant, and most quote-friendly wins — which is a much more concrete optimization target than "rank well".
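The four stages can be sketched as a toy pipeline. This is a minimal illustration under loud assumptions, not any vendor's implementation: the in-memory INDEX, the example.com URLs, the term-overlap scoring, and every function name here are hypothetical stand-ins, and the "generation" step only echoes the surviving document instead of calling a model.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str


# Hypothetical in-memory "web index"; a real system queries a search backend.
INDEX = [
    Document("https://example.com/flat-feet-shoes",
             "best running shoe for flat feet under $150 with arch support"),
    Document("https://example.com/trail-guide",
             "trail running shoe guide for rocky terrain"),
    Document("https://example.com/sandals",
             "summer sandals buying guide"),
]

STOP_WORDS = {"what's", "the", "for", "a", "is", "under"}


def rewrite_query(question: str) -> list[str]:
    """Stage 1: turn the question into search queries (here, one keyword query)."""
    words = [w.strip("?,.").lower() for w in question.split()]
    return [" ".join(w for w in words if w and w not in STOP_WORDS)]


def _terms(queries: list[str]) -> set[str]:
    return {term for query in queries for term in query.split()}


def retrieve(queries: list[str], k: int = 30) -> list[Document]:
    """Stage 2: fetch up to k documents sharing at least one term with a query."""
    terms = _terms(queries)
    return [d for d in INDEX if terms & set(d.text.split())][:k]


def rerank(queries: list[str], docs: list[Document], keep: int = 1) -> list[Document]:
    """Stage 3: score by term overlap and keep only the top documents."""
    terms = _terms(queries)
    ranked = sorted(docs, key=lambda d: len(terms & set(d.text.split())), reverse=True)
    return ranked[:keep]


def generate(docs: list[Document]) -> dict:
    """Stage 4: stand-in for the model; citations map back to surviving documents."""
    text = f"Based on: {docs[0].text}" if docs else "No sources found."
    return {"answer": text, "citations": [d.url for d in docs]}


def answer(question: str) -> dict:
    queries = rewrite_query(question)
    return generate(rerank(queries, retrieve(queries)))


result = answer("What's the best running shoe under $150 for flat feet?")
print(result["citations"])  # ['https://example.com/flat-feet-shoes']
```

In this toy run the trail guide is retrieved but dropped at the rerank, and the sandals page never enters the retrieved set at all, which mirrors the two ways a page can fall out before generation.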
Why most brands lose at the retrieval stage
When we run audits across 100–300 prompts in a brand's category, the single most common failure mode is not low rankings but complete absence from the retrieved set. The model never sees the brand's pages and therefore cannot cite them. It falls back on whatever third-party content it does have: reviews on aggregator sites, comparison articles from affiliates, and sometimes outdated press coverage.
Three common reasons:
- The brand's own site is JavaScript-rendered without server-side fallback. Most AI crawlers either don't execute JavaScript or do so under strict time/cost budgets. If your most important content only appears after a client-side fetch, the crawler sees an empty shell.
- The key content is locked inside PDFs or behind login walls. Detailed product specs sitting in PDF datasheets are invisible to most retrieval pipelines, even when the PDF is technically public.
- No high-authority third-party page links to the brand with the right anchor text. Retrieval indexes lean heavily on link-graph signals to decide what to crawl frequently and what to consider authoritative. A brand with no third-party coverage is a brand the index forgets about between refreshes.
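The first of these reasons is the easiest to test for yourself. The sketch below approximates what a crawler that does not execute JavaScript can read: it parses raw HTML with Python's standard html.parser, ignores script and style contents, and reports which key phrases are missing. The function name missing_phrases and both sample pages are made up for illustration; in practice you would feed it the unrendered HTML response of your own product page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, i.e. what a non-JS crawler reads."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def missing_phrases(raw_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the statically visible text."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [p for p in phrases if p.lower() not in text]


# Hypothetical client-side-rendered shell: the content only exists inside a script.
csr_shell = ('<html><body><div id="root"></div>'
             '<script>render("Arch support for flat feet")</script></body></html>')

# Hypothetical server-rendered page with the same content in the markup.
ssr_page = ('<html><body><h1>TrailRunner 2</h1>'
            '<p>Arch support for flat feet</p></body></html>')

print(missing_phrases(csr_shell, ["arch support"]))  # ['arch support']
print(missing_phrases(ssr_page, ["arch support"]))   # []
```

A non-empty result for phrases you consider essential (product name, price, key specs) suggests the page depends on client-side rendering for exactly the content you want cited.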
Why structured data matters more than it did in 2020
Before LLMs, structured data (Schema.org / JSON-LD) played a supporting role: useful for earning rich-result snippets on Google but rarely a make-or-break factor. In the RAG era, structured data has been quietly promoted to a primary input. The reason is bandwidth: when the retrieval system fetches your page and has to extract facts under a tight latency budget (on the order of 200 milliseconds), a small JSON-LD block is far cheaper to read than three thousand words of prose.
Concretely: an Organization block with sameAs links resolves identity ambiguity ("is this the same X as the X on Wikipedia?"). A Product block with offers, currencies, and availability lets a comparison query include accurate prices without the model having to guess. An FAQPage block surfaces the brand's own answers to common questions, which is often where citations end up.
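As a rough illustration, the three blocks can be built as plain dictionaries and serialized into JSON-LD script tags. The types and property names (Organization, sameAs, Product, Offer, priceCurrency, availability, FAQPage, Question, acceptedAnswer) come from the Schema.org vocabulary; the brand, URLs, price, and FAQ text are invented placeholders, and to_script_tag is a hypothetical helper rather than part of any library.

```python
import json

# All names, URLs, and prices below are placeholders for illustration only.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Running Co.",
    "url": "https://www.example.com",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Example_Running_Co",
        "https://www.linkedin.com/company/example-running-co",
    ],
}

product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "TrailRunner 2",
    "brand": {"@type": "Brand", "name": "Example Running Co."},
    "offers": {
        "@type": "Offer",
        "price": "129.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

faq_page = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Is the TrailRunner 2 suitable for flat feet?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, it has a structured midsole with arch support.",
            },
        }
    ],
}


def to_script_tag(block: dict) -> str:
    """Wrap a JSON-LD dictionary in the script tag that goes into the page HTML."""
    return f'<script type="application/ld+json">{json.dumps(block, indent=2)}</script>'


for block in (organization, product, faq_page):
    print(to_script_tag(block))
```

Keeping the blocks as data and generating the tags makes it easier to keep prices and availability in sync with the catalogue than hand-editing markup on each page.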
The citation loop, and how to break in
The retrieval index has a strong recency-and-authority bias. A brand cited by a recent authoritative source about its category gets crawled more, ranks higher on the rerank, and ends up cited more — which feeds back into more authoritative coverage. This is the positive feedback loop that established brands enjoy and small brands need to break into.
Two reliable ways to break in:
- Be quoted by name in third-party content. Pitch a guest post to a category authority, get a quote into a roundup article, or sponsor a credible review. The goal is one or two recent (last 12 months) mentions on domains the retrieval system already trusts.
- Be your own canonical source. Publish a single, structured, machine-readable profile page that the LLMs can lean on when no third-party source is at hand. This is what Citorial Brand Hubs do — a page that lives on a domain LLMs already trust, contains your structured data, your FAQ, and your sameAs links, and ships with explicit robots.txt opt-in for AI crawlers.
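What an explicit robots.txt opt-in might look like can be checked with Python's standard urllib.robotparser. This is a sketch under assumptions: GPTBot, PerplexityBot, and ClaudeBot are user-agent tokens the respective vendors have published, but the list changes over time and should be verified against each vendor's current documentation. The brand.example domain, the /private/ path, and the allowed_agents helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: AI crawlers are allowed everywhere,
# all other crawlers are kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /private/
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["GPTBot", "PerplexityBot", "ClaudeBot", "SomeOtherBot"]
print(allowed_agents(ROBOTS_TXT, "https://brand.example/hub", agents))
print(allowed_agents(ROBOTS_TXT, "https://brand.example/private/specs", agents))
```

Running the same check against your live robots.txt is a quick way to catch a blanket Disallow rule that silently excludes the crawlers you want to reach.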
What this means for your GEO roadmap
If you map your work back to the four-stage pipeline, the prioritization is straightforward. Audit first: figure out which prompts you're losing on, and at which stage (retrieval, reranking, or generation). Then fix the most common failure mode for your category, which is usually retrieval-stage absence and is fixable with structured data plus a small set of third-party citations. Only after retrieval is solid does it pay to optimize for in-answer prominence, which is the harder problem.
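The audit step can be sketched as a simple tally of where each prompt fails. The PromptResult record and its three boolean fields are assumptions about what an audit tool would log, not a description of any particular product, and the sample results are invented; "reranking" here is the relevance stage described earlier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Hypothetical audit record for one prompt."""
    prompt: str
    retrieved: bool        # brand page appeared in the retrieved set
    survived_rerank: bool  # brand page made it into the model's context
    cited: bool            # brand was cited in the final answer


def failure_stage(result: PromptResult) -> str:
    """Name the earliest stage at which the brand dropped out, or 'cited'."""
    if not result.retrieved:
        return "retrieval"
    if not result.survived_rerank:
        return "reranking"
    if not result.cited:
        return "generation"
    return "cited"


def summarize(results: list[PromptResult]) -> Counter:
    """Count prompts by the stage where the brand was lost."""
    return Counter(failure_stage(r) for r in results)


# Invented sample data for illustration.
results = [
    PromptResult("best running shoe for flat feet", False, False, False),
    PromptResult("running shoes under $150", False, False, False),
    PromptResult("shoes with arch support", True, False, False),
    PromptResult("trail shoe with wide toe box", True, True, True),
]

summary = summarize(results)
print(summary.most_common(1)[0][0])  # retrieval
```

Sorting work by the largest bucket keeps effort on the stage that is actually losing prompts, which in the audits described above is most often retrieval.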
For the higher-level introduction to GEO as a discipline, see "What is Generative Engine Optimization". For a per-product breakdown of how the four major AI engines differ in practice, read "ChatGPT vs Perplexity vs Gemini for e-commerce traffic".