In the context of generative AI, should we talk about information pollution?

Dirk Songuer
4 min read · Feb 7, 2023


After writing about what excites me about generative AI tools, I felt like I should also write about what worries me a bit.

There is this interesting part of the web that does not exist to be read by humans — it only exists to be parsed by search robots to try and manipulate their algorithms.

For Google search (or indeed any search) to work, the search system needs to understand the “knowledge” it possesses — or in this case, the pages it indexes: Is this page relevant for people interested in [search query] (good), or is it talking about something else (bad)?

That’s search ranking in a nutshell. Google states it ranks a page according to its meaning, relevance, quality, usability and context. PageRank, its best-known ingredient, scores a page by the links that point to it.

But there are financial incentives to manipulate Google search. Making your page rank higher than those of your competitors means more sales. And driving traffic to your pages that serve ads means more eyeballs and clicks.

One way of doing that is Link Farms. These are networks of pages that cross-link to each other, thus boosting their perceived relevance. Since all these websites seem reputable and high quality (they all link to each other), they can now all link to a specific website to elevate its ranking even further (they all link to a single page).
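To make the effect concrete, here is a minimal sketch using the simplified textbook PageRank (power iteration with a damping factor). It is not Google's actual ranking system, and the page names (`blog`, `shop`, `rival`, `farm0`…) as well as the helper `with_link_farm` are made up for illustration. Two pages start with identical link profiles; then a small farm of cross-linked pages points at one of them.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Toy PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline, the rest is passed along links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def with_link_farm(links, target, size):
    """Return a copy of the graph plus `size` farm pages that
    cross-link to each other and all link to `target`."""
    farm = [f"farm{i}" for i in range(size)]
    extended = {page: list(targets) for page, targets in links.items()}
    for page in farm:
        extended[page] = [other for other in farm if other != page] + [target]
    return extended


# Hypothetical mini web: "shop" and "rival" have identical link profiles.
organic = {
    "blog": ["shop", "rival"],
    "shop": ["blog"],
    "rival": ["blog"],
}

before = pagerank(organic)
after = pagerank(with_link_farm(organic, "shop", size=5))

print("before:", round(before["shop"], 4), "vs", round(before["rival"], 4))
print("after: ", round(after["shop"], 4), "vs", round(after["rival"], 4))
```

Before the farm exists, `shop` and `rival` score the same. Afterwards `shop` comes out ahead, even though nothing about its content changed.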

Google hates Link Farms because they manipulate results, so it has spent a lot of time trying to answer one question: Is this page genuinely talking about [search query] (good), or only talking about it superficially to drive traffic (bad)?

To keep Link Farms working, you need to raise the quality of their content above the “superficial” level. That’s where Content Farms come in: “Companies that employ large numbers of freelance writers to generate a large amount of textual web content which is specifically designed to satisfy algorithms for maximal retrieval by automated search engines, known as SEO.”

I think you see where this is going and why generative AI might be a problem:

  1. Take a set of web templates
  2. Give ChatGPT the topic for which you want to push your PageRank
  3. Pipe the output into text sections of the template
  4. Add stock images into image sections of the template
  5. Programmatically repeat thousands of times with slight variations of text output
  6. Monitor performance of individual pages in search, optimize text output from ChatGPT to maximize PageRank

This is the code red Google is currently facing. Not the fact that ChatGPT exists and is a threat to Google. They have that technology as well. In fact, Google invented it. It’s that spammers and scammers can now use it to attack Google’s core business.

About intentionality

ChatGPT, Bard and the current wave of generative AI solutions are based on Large Language Models (LLMs). It’s important to remember that these do not care about intentions or goals, nor do they understand what they even output — they work based on word distributions. They are a more powerful version of the “let text prediction write my message” meme.
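The “text prediction” idea can be shown with a deliberately tiny sketch: a bigram model that only counts which word follows which and then samples from those counts. Real LLMs are neural networks that work on tokens with far longer context, so this is only an analogy, and the corpus and function names here are invented for the example.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length=8, seed=0):
    """Produce text by repeatedly sampling a next word from the counts."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = counts.get(word)
        if not followers:
            break  # this word was never followed by anything in the corpus
        candidates = list(followers)
        weights = [followers[c] for c in candidates]
        word = rng.choices(candidates, weights=weights)[0]
        output.append(word)
    return " ".join(output)


corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Nothing in this code knows what a cat or a rug is. The output looks like language only because the word statistics came from language.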

People warn that ChatGPT shouldn’t be trusted because “it will confidently lie to you”. Well, lying implies an agenda and intentionality, which would be high praise. In fact, LLMs do not have any concept of reality and do not have any intentions. In terms of information theory, their output is closer to noise than to a signal. If they happen to create sentences that make sense, it’s because we make sense of them.

Think of ChatGPT output more like sociopathic hallucinations.

They are stochastic parrots, and that makes them really bad for search: a system with no concept of “intentionality” (real, simulated or curated) that is also prone to confabulation, yet gives you a definitive answer to a search query, is just asking for trouble. This is the reason Google didn’t do it sooner. The whole point of being in the search business is that people can trust your results.

But for Link & Content Farms, you don’t strictly care about the correctness of results, you care about whether an algorithm thinks they are correct.

Will we see Link Farms compete in an ever-escalating race to put out slight variations of a topic faster than others?

DALL-E: An abstract collage showing a flood of information polluting the internet

Are we talking about pollution?

The pessimistic take on this is endless instances of digital sociopaths banging on their typewriters, flooding the web with millions of near-identical pieces of information, without any intent of being relevant.

Information that, despite appearances, might not entail accuracy, informational value, or trustworthiness.

Automated polluting machines, spewing their hallucinations onto the web.

Optimistic take: If ChatGPT and others manage to provide relevant output for a given search query, it’s not really spam, is it? Is it pollution just because of the sheer amount? Clipart also didn’t break the web, even back then. Isn’t digital space effectively endless anyway?

Pessimistic take: At some point the signal-to-noise ratio might be low enough to require proactive measures. Do we really need an endless number of iterations of any piece of information, only for the purpose of selling a few more things or cashing in on some ad revenue? Using a machine that even OpenAI’s CEO says has “eye-watering compute costs”? Digital space only seems endless, and haven’t we learned that rethink, refuse and reduce are part of sustainability?

The scary part is that there might not be a technical solution for this.

The answer cannot be to ban generative AI models, because they do have value. But we should also look at the negative first- and second-order problems they (might) create through automation.

Maybe it’s time to start with another technology-sustainability “R”: re-evaluate. Take a step back, talk about the narratives & visions we want to enable with this technology, and about how we prevent misuse.


Dirk Songuer

Living in Berlin / Germany, loving technology, society, good food, well designed games and this world in general. Views are mine, k?