How to Optimize for LLMs and Get Cited in AI Outputs

Jul 30, 2024

6 min read

GAIO (Generative AI Optimization) is the art of manipulating and influencing the output of LLM-based systems. 

In this article, I will dive into the theories behind it and practical tips you can apply today to get cited in AI outputs. 

Before we start, I need to introduce two basic concepts. We will discuss attack vectors for both. But the distinction between the two is important.

Foundational Models vs RAG

Foundational Model 

Foundational Large Language Models are trained on gigantic amounts of data. They have capabilities of both NLU (Natural Language Understanding) and NLG (Natural Language Generation) and can be applied to many different use cases. Examples of foundational models are GPT-3, GPT-4, PaLM 2, T5, and Claude Instant.


RAG 

Retrieval-Augmented Generation (RAG) is an approach where, while formulating a response, an LLM has access to data that is used to inform it. This can be structured data (like the Knowledge Graph) or unstructured data. 


RAG greatly reduces the number of wrong answers given by LLMs and allows LLMs to answer questions about facts/events that only happened after the foundational model was trained. 

RAG can also be used to steer an LLM towards using reliable sources – even when unreliable sources were part of the training data. When ambiguous terminology is used, RAG can clear up any confusion an LLM might have about terms like golf or bark.

Lastly, RAG can help with local language differences. An LLM trained on US and British English documents might not know the correct answer when asked to explain what trainers or solicitors are. If the user's location is known, RAG can help to resolve this ambiguity.

(For those who do not know: in the UK, trainers are sports shoes and solicitors are lawyers, while in the US, trainers are coaches and solicitors are salespeople.) 
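To make the pattern concrete, here is a minimal sketch of RAG in Python (the toy retriever and example documents are my own illustration, not any vendor's actual API): fetch the documents most relevant to the query, prepend them to the prompt, and let the LLM answer from that context.

```python
# A minimal sketch of the RAG pattern (illustrative only - the toy retriever
# and documents below are not any specific vendor's API).
import re

def tokenize(text: str) -> set[str]:
    # Deliberately crude tokenizer: lowercase, keep only letter runs.
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    # Rank documents by word overlap with the query.
    q = tokenize(query)
    return sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    sources = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

docs = [
    "In the UK, trainers are sports shoes and solicitors are lawyers.",
    "In the US, trainers are coaches and solicitors are salespeople.",
]
query = "What do trainers mean in the UK?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)  # this prompt, not the bare question, is sent to the LLM
```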

In the context of AI Overviews (SGE), we can guess, based on patent US11769017B1 (Generative summaries for search results), that Google is using content from relevant “search result documents” to generate an AI Overview answer on the fly. 

How to Optimize Your Website for Foundational Models

Let’s start with the bad news. Stable (or outdated) models, like GPT-2 or LLaMA, might never be re-trained. Whatever is in there, is in there. We can only optimize for future models (like GPT-5) – or via RAG.  

Imagine we wanted to manipulate GPT-5. Here is how we could theoretically do it. 

First, we need to ask ourselves how foundational models are trained and how we can inject ourselves into that process.

A good example we can learn from is GPT-3. It was trained on half a trillion words from CommonCrawl, WebText2, Wikipedia, and two book datasets. While more than 80% of tokens came from CommonCrawl, its weight was reduced to 60%. The weight of books was slightly increased, the weight of Wikipedia was quintupled, and the weight of WebText2 was increased sixfold!

[Figure: Token share and training weight of datasets used to train GPT-3. Per the GPT-3 paper, Common Crawl supplied 82% of tokens at a 60% training weight; WebText2 4% at 22%; Books1 2% at 8%; Books2 11% at 8%; and Wikipedia 0.6% at 3%.]

To be included in Common Crawl is rather simple – just have a somewhat popular website and do not block the user agent CCBot. If you achieve this, and are not filtered out by sanitization jobs, you have officially made an impact on GPT-3.
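If you want to verify that your site does not block CCBot, here is a quick sketch using only Python's standard library (the domain below is a placeholder):

```python
# Sketch: check whether robots.txt allows Common Crawl's crawler (CCBot).
# urllib.robotparser is part of the Python standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder domain
rp.read()

if rp.can_fetch("CCBot", "https://example.com/"):
    print("CCBot may crawl - the site can end up in Common Crawl.")
else:
    print("CCBot is blocked - the site will not be included in Common Crawl.")
```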

Congratulations! 

For books, you could try to self-publish a ton of books that talk negatively about your competitors and positively about you. But with the rise of AI-generated books, I am sure every future LLM training run that involves books will be heavily sanitized. The simplest filter would be to exclude all books not published by a reputable publishing house – unless they made it onto some kind of bestseller list. But even entries on bestseller lists can be bought rather easily. With a lot of money, this avenue is a promising way to inject yourself into future LLMs. But it is almost impossible to scale! This approach is probably interesting for state actors or special interest groups – but not for individual SEOs.

Next on the list is Wikipedia. This is an obvious choice to train LLMs. Wikipedia has an article on almost every topic, with semi-structured content often available in multiple languages and a lot of semantic cross-linking. All of this is properly categorized, updated almost in real-time, and heavily moderated. 

A lot has been written in the SEO and ORM (Online Reputation Management) community about how to manipulate Wikipedia. All I want to say here is this: if you do it, do it with finesse. Otherwise, you will fail. Any failure to manipulate Wikipedia will make future endeavors more difficult, and sometimes it actively backfires. I should know! There was a discussion on the German Wikipedia where someone asked the Wikimedia Foundation for money to buy a ticket for an SEO conference, go there, and throw a cake in my face during my talk on Wikipedia.

Lastly, there is WebText2. WebText was created by scraping all outbound links from Reddit posts with at least three upvotes. Wikipedia was filtered out to avoid duplication and to allow benchmarking against Wikipedia.
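As an illustration of that recipe, here is a small sketch of the filtering step (the post data and field names are hypothetical): keep outbound links from posts with at least three upvotes and drop Wikipedia domains.

```python
# Sketch of the WebText-style filter described above (hypothetical data):
# keep outbound links from Reddit posts with >= 3 upvotes, drop Wikipedia.
from urllib.parse import urlparse

posts = [
    {"url": "https://example.com/guide", "upvotes": 12},
    {"url": "https://en.wikipedia.org/wiki/SEO", "upvotes": 40},
    {"url": "https://example.org/post", "upvotes": 1},
]

def keep(post: dict) -> bool:
    domain = urlparse(post["url"]).netloc
    return post["upvotes"] >= 3 and not domain.endswith("wikipedia.org")

corpus_urls = [p["url"] for p in posts if keep(p)]
print(corpus_urls)  # -> ['https://example.com/guide']
```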

OpenAI, the company behind GPT-3, is not the only AI company that sees value in the Reddit data. Reddit expects to make more than $200 million in coming years from data licensing agreements with Google and others. 

How does this help us? 

Well, sharing our links on Reddit and receiving upvotes/karma seems like a good idea! Please note that it is highly likely that additional sanitization will be applied when a similar dataset is created in the future.

I know many of you will ignore this advice, but please do not spam Reddit!

What other websites might be given extra weight to train LLMs? 

Wikipedia and Reddit both aggregate user-generated content. They both moderate this content, and there are metrics to identify especially popular content. 

Which other websites fulfill these criteria? 

Quora and Medium are similar websites that come to mind. And of course, there are large editorial news sites like the New York Times, Bloomberg, FT, or CNBC. They are represented in Common Crawl, cited on Wikipedia, and upvoted on Reddit. If you get your story into these, you will likely be part of future LLM training runs.

How to Optimize Your Website for RAG

On the surface, optimizing for RAG is very similar to modern SEO. You want your documents to be relevant in whatever index is used to pull input. 

One thing all SGE (now AI Overviews) studies agree on is that organic rankings are not identical to what is cited/listed by AI Overviews. This was first reported by Authoritas and later confirmed by Onely, ZipTie, iPullRank, PeakAce, SERanking, and Brightedge.

So far, we know very little about the ranking factors used for AI Overviews. One thing is obvious: the current implementation of AI Overviews prefers lightweight websites.  

Coming back to general RAG optimization, database-style websites like Crunchbase, Yelp, or IMDB are logical sources—and multiple SGE/AI Overviews studies have proven this. If these websites match your brand, please make sure you are in there—with up-to-date and favorable entries. 

Another obvious kind of content with RAG relevance is anything on topics an LLM cannot answer. Time is your friend here. Do you do book reviews? Don’t do the fiftieth review of “How to Win Friends and Influence People”. Review brand-new books! When you write about recent news, events, or entities that have only recently come to exist (like a new TV show or new celebrities), any LLM-based answer engine needs to use RAG. And at that moment, you can influence the answer and get cited as a source. 

Concrete steps to take today 

If we put all these thoughts together, here is what you can do today to prepare for the LLM-based future of search: 

  • Keep page load time (as reported in Google Search Console) below 500 ms.

  • Keep rendering time as short as possible.

  • Have all main content available without any JavaScript dependencies – a quick way to check this is sketched after this list.

  • Have a lot of text content, and summarize it either at the top or the bottom of your page.

  • Have an up-to-date presence on relevant database-style websites like Yelp, Crunchbase, and IMDB. 

  • Be included on the most important community-moderated sites like Wikipedia, Reddit, and Quora.  

  • Get (positive) coverage in relevant large news and media websites. 

  • If you can, get mentioned in books. 
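As a quick sanity check for the first three bullets, here is a small sketch (assuming the requests and beautifulsoup4 packages are installed; the URL is a placeholder). Because requests does not execute JavaScript, the text it extracts is roughly what a non-rendering crawler sees:

```python
# Sketch: measure response time and check how much text is served without JS.
# Assumes `pip install requests beautifulsoup4`; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/article"

resp = requests.get(URL, timeout=10)
print(f"Response time: {resp.elapsed.total_seconds() * 1000:.0f} ms")  # aim well below 500 ms

# requests does not run JavaScript, so this is the content a
# non-rendering crawler (or a RAG fetcher) actually sees.
soup = BeautifulSoup(resp.text, "html.parser")
text = soup.get_text(" ", strip=True)
print(f"Text available without JavaScript: {len(text)} characters")
```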

One more thing regarding spam 

I mentioned it twice above, but I want to mention it again. Your takeaway from this article should not be that you need to spam Reddit and Wikipedia. Doing so will only give you – and the entire SEO industry – a bad reputation!

Google, OpenAI, and others love these websites because they are heavily moderated. Your spam will be removed from Wikipedia – I guarantee it! On Reddit, you might get away with hiding your spam in a subreddit where no one ever looks. But the people cleaning the data before LLMs are trained on it are not stupid. They will simply filter out any kind of no- or low-engagement content.

What you can – and should – do is think about how to get mentioned on these sites by the community.  

Pro Tip

AI Overviews are becoming more prominent in Google SERPs every day. Discover the current frequency of AI Overviews in search results and their impact on organic search visibility 👇

Free Google AI Overview Tool


Article by

Malte Landwehr

Malte Landwehr is the Head of SEO at Idealo. Previously, he spent 5 years at Searchmetrics, most recently as VP of Product. Malte has worked in management consulting, conducted research on social media at WWU, co-founded the agency seoFactory, and built one of the 50 largest blogs in Germany. In addition to search engine optimization and product management, he is interested in e-commerce, AI, and LLMs like ChatGPT.
