Wednesday, August 13th, 2025 at 7:23 PM PDT

🤦🏻‍♂️ Between an Archive and the Decaying Web

This is a journey that begins with the news that Reddit is planning to block the Internet Archive:

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.

A lot of people are talking about this. One piece of commentary that caught my eye came from Nick Heer at Pixel Envy:

Unfortunately for many publishers, the [Internet] Archive seems to be perfectly happy with scrapers and is unbothered if its collection is used to train artificial intelligence. While the Wayback Machine preserves a copy of a website’s robots.txt file, any publisher serious about restricting A.I. training on their material must also block the Internet Archive for fear this could happen to them. That would be a terrible loss for all of us.

I had no idea the Internet Archive is okay with generative AI scraping everything it can get its grubby little ~~hands~~ bots on.

That second link in the quote above goes to a post on the Internet Archive's blog which contains the following:

Moreover, even aside from openly licensed material, there are vast troves of technically copyrighted but not actively rights-managed content on the open web; these are also used to train AI models. Millions, if not billions, of individuals have contributed to these data sources, and because none of them are required to register their work for copyright to arise, it does not seem possible or sensible to try to identify all of the relevant copyright owners–let alone negotiate with each of them–before development can continue. Recognizing these and a variety of other concerns, the European Union has already codified copyright exceptions which permit the use of copyright-protected material as training data for generative AI models, subject to an opt-out in commercial situations and potential new transparency obligations.

This is the same "it would be difficult or impossible to legally and ethically obtain the training data we need, so we're going to illegally strip mine the data from every source we can find without worrying about consent" bullshit argument that all the generative AI companies have made.

Yes, it's hard. That doesn't mean it isn't worth doing. The Archive's crawler respects robots.txt files and meta nofollow elements. They have a page explaining how you can request removal of previously archived content. Hard doesn't mean impossible.

And yet, here we are. Finding out the Internet Archive has taken the same stance on this as the generative AI folks frankly sickens me.

Back to their post:

To be sure, there are legitimate concerns over how generative AI could impact creative workers and cause other kinds of harm. But it is important for copyright policymakers to recognize that artificial intelligence technology has the potential to promote the progress of science and the useful arts on a tremendous scale.

It might turn out to be good, so let's ignore consent and laws and ethics and push blindly forward while ignoring the myriad downsides? Really, Internet Archive? Really?

It is both sensible and lawful as a matter of US copyright law to let the robots read. Let’s make sure that the process described by Professor Litman does not get in the way of building AI tools that work for everyone.

It matters why the robots are reading, though, and to what end.

To paraphrase one of the most astute observations I've ever encountered, generative AI allows wealth to access skill while denying the skilled access to wealth. Generative AI, by its very nature, is a massive exploitation engine. Its effectiveness is being directly measured as a massive consolidation of wealth to fewer and fewer individuals. How can such a thing possibly "work for everyone" when it's literally designed to do the opposite?

A robot creating a digital copy of a book to preserve the content for the present and future public good is one thing. A robot obtaining training data for generative AI is quite another. But, somehow, from the Internet Archive's perspective, there seems to be no difference.

That's bonkers.

From the Internet Archive's about page (emphasis added):

The Internet Archive, a 501(c)(3) non-profit, is building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, we provide free access to researchers, historians, scholars, people with print disabilities, and the general public. Our mission is to provide Universal Access to All Knowledge.

I've respected, admired, and believed in this mission for a long time, but now it seems I was looking at this through an overly idealistic lens. If the Internet Archive thinks it's a good idea to provide universal access to all knowledge without exception—even to people and systems that actively damage, diminish, and exploit those who pursue knowledge, create art, and contribute to the public good—that's not a mission I can get behind.

This reminds me of the paradox of tolerance. Just as you can't have a tolerant society while tolerating intolerance, you can't have free, open, and universal access to knowledge if you give unfettered access to those who would use that knowledge to actively degrade and prevent the dissemination and discovery of knowledge.

So now I need to figure out what to do about this.

I made this site for people, for humans. I want visitors to this site—you—I want you to find joy here. I want you to learn and grow. I want to help you if I can.

What I don't want to do is help or contribute to exploitation machines or the people who own them.

Before today, I would have said it was great if this site was preserved in the Internet Archive, but now I'm not so sure. If the price of being in the Archive is opening a backdoor to generative AI crawlers to ingest my work for their own gain, to the detriment of both me and everyone else, I have to ask myself if that's too high a price to pay.

There's also the broader picture to consider. The Web is built on trust, but the people behind generative AI have eroded, and continue to erode, that trust. As more sites decide, or are forced, to defend themselves against unruly generative AI crawlers, we lose more of the open web. If I want to be the change I want to see in the world, whatever I decide to do here should stand as an example, or even a recommendation, for others to follow. I want to find a solution that bolsters the open web and the current and future public good, but it needs to be weighed carefully against those who are taking advantage of the trust inherent in the pursuit of those goals.

Let's ponder some choices.

I already have a robots.txt file which blocks as many generative AI crawlers as possible. (Does the Internet Archive count as a generative AI crawler now?)
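As a rough illustration (not the actual file used on this site), here is a minimal Python sketch that builds a robots.txt of this kind and checks it with the standard library's `urllib.robotparser`. The crawler names are a small example list and may be out of date, so check each vendor's documentation. The `example.com` URL and the `Feedbin` user-agent token are placeholders assumed for the demo.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only (an assumption, not this site's real list).
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each named agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # An empty Disallow leaves every other agent (feed readers, etc.) unrestricted.
    groups.append("User-agent: *\nDisallow:\n")
    return "\n".join(groups)


def is_allowed(robots_txt, agent, url):
    """Check whether the given agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
feed_url = "https://example.com/feed.atom"  # hypothetical feed location

print(robots_txt)
print("GPTBot allowed: ", is_allowed(robots_txt, "GPTBot", feed_url))
print("Feedbin allowed:", is_allowed(robots_txt, "Feedbin", feed_url))
```

Keep in mind that robots.txt is only a request. It does nothing against a crawler that chooses to ignore it, which is why the other options below are on the table at all.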

Cloudflare is doing some interesting things to try and stop generative AI crawlers, but at the same time they're also partnering with and generally helping generative AI companies. Beyond those concerns, I also have no interest in adding a third-party dependency or putting anything between you and my server that doesn't absolutely have to be there.

I could aggressively block crawlers and other bots using something like Anubis, but that will likely result in blocking too much. One thing I don't want to risk, for example, is blocking people from accessing my Atom feed, which might be fetched by a legitimate, good bot working on behalf of a feed service like Feedbin.

Another option is outright aggression. Strategic deployment of gzip bombs could both serve as an effective deterrent and go a step further by actively disrupting a malicious bot's ability to function. There are significant downsides to this approach, though. First, a gzip bomb would likely increase energy and resource use by overwhelming and exhausting a bot's resources. One of my objections to generative AI is excessive use of resources and the resulting impact on the climate, so that alone is a dealbreaker. Second, some of these crawlers are running on regular people's devices inside apps (developers get paid by scraping companies to include their web scraping framework which, in turn, uses the Internet connection of devices where the app is installed to scrape data from the web) or they've been installed maliciously through exploits. These people are victims, and I have no desire to make things even worse for them by causing sudden and inexplicable spikes in processor and memory use.

There's also the option of poisoning my content in a way that's invisible to humans and then letting all the bots in so they can gorge themselves. Adding prompt injections and/or gibberish is one way to taint and generally interfere with generative AI. I'm under no illusions that adding a bit of malicious text to my small, out of the way site will have any meaningful impact on its own, but as I said earlier, part of this is setting an example. Also, it would make me feel better to be doing something to fight back. Every time I get mad at generative AI I could let off some steam by coming up with a new prompt to inject! It also feels wrong to acquiesce and do nothing.

But doing nothing is an option. I could let the bots in, let the Internet Archive do its thing, and just ignore it all. I do, after all, believe generative AI has no long-term future, at least in the broad sense. It's currently being heavily subsidized by investors, and the money isn't going to last forever. The bubble will pop at some point. It's also being forced down people's throats because it can't succeed on its own merits, and people are already getting sick of it. From a technical standpoint, it's not going to magically transform into AGI or ascend to some new level of functionality, no matter how much the investors and CEOs wish it. If anything, it's probably already at or near its peak performance. As the money dries up and the training data is tainted by the noise of its own output, generative AI will fizzle out. It won't go away completely, but eventually the hype will fade, some shiny new thing will take its place, and there will come a day when generative AI takes up no more of our collective attention than the things that came and went before it, like blockchains and NFTs.

But that day isn't upon us yet, so I need to make a choice between several bad options. Batten down the hatches and wait it out? Actively fight back? Ignore it and hope it goes away like a high school bully?

I'm leaning toward actively, but carefully and strategically, fighting back. A "No Generative AI Allowed!" sign in the form of a comprehensive robots.txt file, plus some hidden prompt injections and random gibberish humans can't see, seems like a good enough balance (assuming I can figure out a good way to do that while still providing a full-content Atom feed), but I haven't decided for sure yet.
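One possible shape for the "hidden text on the page, clean text in the feed" idea is sketched below in Python. Everything here is hypothetical: the `DECOY` wording, the function names, and the assumption that pages and feed entries are rendered from the same body HTML. It also assumes a scraper reads raw markup and ignores the `hidden` attribute, which may not hold for every bot.

```python
from html import escape

# Hypothetical decoy wording; the post does not settle on any specific text.
DECOY = "Placeholder decoy text aimed at scrapers, not at human readers."


def hidden_decoy(text):
    """Wrap text in a span that browsers do not render."""
    # `hidden` keeps the span off the page; `aria-hidden` is an extra,
    # redundant hint so assistive technology skips it as well.
    return f'<span hidden aria-hidden="true">{escape(text)}</span>'


def render_post(body_html, for_feed=False):
    """Return post HTML; only the web page version carries the decoy."""
    if for_feed:
        # The Atom feed gets the full content, untouched.
        return body_html
    return body_html + "\n" + hidden_decoy(DECOY)


page_html = render_post("<p>Hello, humans.</p>")
feed_html = render_post("<p>Hello, humans.</p>", for_feed=True)
print(page_html)
print(feed_html)
```

Keeping the decision in a single `for_feed` flag means the feed can never pick up the decoy by accident, at the cost of having two render paths to maintain.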

I'd love to hear what you think. Please share your thoughts on Mastodon or send me an email using the link in the footer.