2.1 How LLMs retrieve information

To understand GEO, you need to understand how a large language model (LLM) decides what it tells you. It draws on two fundamental sources: what it learned during training, and what it retrieves at the moment of your query.

Source 1: Training data

An LLM is trained on a vast amount of text, and that training has a cut-off date. Content that has been online consistently for a long time and is widely cited has a higher chance of being part of the training data, and therefore of the model’s base knowledge.

Implication: authority and presence over time count. GEO is not a sprint.

Source 2: Live retrieval (RAG)

Many modern AI systems are extended with a retrieval component, an approach known as retrieval-augmented generation (RAG). When you ask a question, the system fetches relevant documents and passes them to the model as context. ChatGPT with web search, Perplexity and Google AI Overviews all use a form of RAG.

Practical implication: fresh, well-indexed content can be retrieved and used, even if it was published after the training cut-off.
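
The retrieve-then-answer flow described above can be sketched in a few lines. This is a toy illustration, not how any particular product works: real systems typically use a search index or embeddings rather than keyword overlap, and the index contents, the example.com URLs and the function names (tokenize, score, retrieve, build_prompt) are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini-index; the URLs and texts are made up for illustration.
INDEX = [
    {"url": "https://example.com/geo-guide",
     "text": "Generative engine optimization (GEO) is the practice of making "
             "content easy for AI systems to retrieve and cite."},
    {"url": "https://example.com/recipes",
     "text": "A simple recipe for tomato soup with basil."},
    {"url": "https://example.com/seo-basics",
     "text": "Search engine optimization focuses on ranking in classic "
             "search results."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, text):
    """Toy relevance score: how often the query's words occur in the text."""
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in query_terms)


def retrieve(query, index, k=2):
    """Return up to k documents with a non-zero score, best first."""
    scored = [(score(query, doc["text"]), doc) for doc in index]
    relevant = [(s, doc) for s, doc in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in relevant[:k]]


def build_prompt(query, docs):
    """Put the retrieved documents in front of the question as context."""
    sources = "\n".join(f"[{doc['url']}] {doc['text']}" for doc in docs)
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    question = "What is generative engine optimization?"
    print(build_prompt(question, retrieve(question, INDEX)))
```

The point for GEO is the order of operations: the model only sees documents that the retrieval step selected, so a page that is not indexed, or that does not match the question, never reaches the answer.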

What does this mean in practice?

  • Write content that gives direct answers to specific questions
  • Ensure technical accessibility: fast load times, valid HTML, and no rules that block crawlers
  • Build long-term authority through consistent publishing and external mentions
  • Use structured data to make context explicit
  • Be present on platforms that AI systems use as sources
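
Two of the points above, crawler access and structured data, lend themselves to a quick check in code. The sketch below is a minimal example under stated assumptions: the robots.txt content is invented, the helper names crawler_allowed and faq_jsonld are hypothetical, and the user-agent strings are examples only, so check each provider's documentation for the names its crawlers actually use. It relies only on Python's standard library and on the schema.org FAQPage vocabulary.

```python
import json
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration: blocks one AI crawler, allows the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def faq_jsonld(question, answer):
    """Build schema.org FAQPage markup for one question and its answer."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
        ],
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    page = "https://example.com/geo-guide"  # hypothetical URL
    for agent in ("GPTBot", "PerplexityBot"):
        print(agent, "allowed:", crawler_allowed(ROBOTS_TXT, agent, page))
    print(faq_jsonld(
        "What is GEO?",
        "GEO is the practice of making content easy for AI systems "
        "to retrieve and cite.",
    ))
```

The JSON produced by faq_jsonld would go inside a script tag of type application/ld+json on the page. Running the robots.txt check against your own file is a cheap way to catch a block that was added for one crawler and forgotten.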