A Reasonable Question
Brad at IndieSeek recently asked if it was time to block Google and AI bots. As I said in a comment on his post, it was time to block Google five years ago. Not to mention Bing, Yandex, Yahoo!, and every other corporate search engine. Likewise Semrush and Ahrefs. Likewise also Twitter and Facebook (or X and Meta if you still think corporations are people and shouldn’t be deadnamed to keep alive the memory of their malfeasance).
They take what they want, and give nothing of value in return. They certainly aren’t paying us for access to our websites.
An Unreasonable Answer
I am therefore resolved to exercise the nuclear option by blocking all web crawlers by default, and only explicitly allowing specific crawlers. Here is my implementation.
User-agent: ia_archiver
Disallow:

User-agent: MojeekBot
Disallow:

User-agent: DuckDuckBot
Disallow:

User-agent: search.marginalia.nu
Disallow:

User-agent: *
Disallow: /

Sitemap: https://starbreaker.org/sitemap.xml
robots.txt file that blocks all compliant crawlers by default and explicitly grants access to certain crawlers
A reasonable reader might ask if I am not being unreasonable or shooting myself in the foot by blocking search engines that might send readers my way. First, I am not obligated to be “reasonable” on my own website. Second, I no longer believe that Google sends readers to independently owned and operated personal websites. Third, and as I’ve demonstrated above, I can grant access to specific crawlers at my discretion.
If Google wants to crawl my website, they can negotiate a licensing deal that will give them access in exchange for attribution and compensation. They can certainly afford to make such deals if they think personal websites are sufficiently valuable.
Or, they can simply clone my website’s git repository and run make to build a local copy. It’s not that hard; I do it all the time. How do you think I build and deploy this site, anyway?
The Ripley Threshold?
This should probably be a separate blog post, but whatever. The Ripley Threshold is the point at which mass destruction starts to seem reasonable, inspired by Sigourney Weaver’s line and thoroughly world-weary delivery in Aliens: “I say we take off and nuke the entire site from orbit... It’s the only way to be sure.”

I honestly think Google is long past this threshold; there is nothing wrong with it that can’t be fixed with the sort of Federal antitrust enforcement not seen since the breakup of Standard Oil or the breakup of AT&T into the Baby Bells — or an orbital kinetic bombardment platform delivering a few rods from God. Of course, waiting for either to happen is a fool’s errand. Direct action is what gets results, and blocking Google via robots.txt is the sort of direct action in which any webmaster can engage without fear of being subjected to state violence at the hands of police.
Update for 2024-06-13: Manuel Moreale on robots.txt
Manuel Moreale was kind enough to send me an email in response to my recent blog post on robots.txt, in which he reminded me of his own experiment with blocking bots and its results. Unfortunately, it didn’t work out as well for him as one might have hoped; most bots and crawlers seem to disregard robots.txt. Frankly, this says something about the character of their developers and the people who sign their paychecks, none of it flattering.
I had written the following in response, because I have some opinions about the attitude of Big Tech toward the personal web.
I had seen this (referring to Moreale’s results), and I suspect that my use of robots.txt is mainly an empty gesture of defiance akin to raising one’s middle finger to the heavens during a thunderstorm and shouting at the gods, “You missed, assholes! Is that all you’ve got?!” Nevertheless, I think it’s important to show that I do *not* consent to having my personal website used as a source by corporate search engines or AI developers.
Like you, I write for people. My writing is freely available for personal, non-commercial use. If a corporation wants to use my writing, they can damn well negotiate a licensing deal with me first.
God knows they could afford it. Corporations have for entirely too long been permitted to presume consent and require people to “opt out” instead of being required to get informed consent and provide compensation. When called on it, as we’ve seen with Maven scraping the Fediverse and importing posts without getting informed consent, they complain about how difficult or expensive it is to get consent from individual operators, sounding like college kids in 1999 who got busted using Napster to get music without paying for it.
I have no sympathy, patience, or charity to spare for these techbros any longer. They are generally, and quite frankly, the sort of men who would assume that if a woman didn’t or couldn’t refuse their advances then it was permissible to use her for their own gratification. They are nothing but intellectual and artistic rapists in spirit if not in deed, with no respect whatsoever for the work they appropriate for their own ends.
I understand that the EU is getting better about this, but I live in the US where the law protects the rich and binds the rest, instead of protecting and binding all in equal measure. So much for the motto engraved above the entrance to the US Supreme Court: "Equal Justice Under Law". I am thus determined to resort to direct action, to the extent that I can do so without drawing the ire of the state.
He also asked the following:
Do you think [it] is worth going one step further and try[ing to] block crawlers at the server level by sniffing the UA? I suspect most companies won’t respect robots in the future considering the state of things with all the AI nonsense going on.
I had the following to say in response.
I’ve considered doing this myself. I’m given to understand that the operator of Cheapskate’s Guide and Blue Dwarf blocks crawlers at the server level, but for practical reasons as well as ideological ones. He’s self-hosting on a low-power machine on a residential connection and finds bot traffic overwhelming.
In my case it currently isn’t practical because my website is hosted by Nearly Free Speech and their server configuration doesn’t let me check user agents or IP address ranges in .htaccess. I would need to migrate to a VPS or co-located bare metal if I want full control over the server config.
What I would like to do, in particular, is to send HTTP 402 (Payment Required) to known corporate search engines and AI crawlers instead of the usual HTTP 403 (Forbidden). While 402 is still officially reserved for future use and has no standardized semantics, I think it does a better job of conveying where these bots go wrong.
The internet belongs to the people, or at least it damned well should.
I’m surprised Googlebot still respects robots.txt; I think it’s a holdover from before their IPO that nobody ever got around to ‘fixing’.
Thanks again to Manuel Moreale for emailing me, and for granting permission to quote his email in this entry.
Update for 2024-06-14: Blocking Bots in .htaccess
Cory Dransfeldt brought my attention to a post by Robb Knight about blocking bots in Nginx, which referenced Blockin’ Bots by Ethan Marcotte. Since starbreaker.org is hosted on Apache, I was particularly interested in Marcotte’s post. However, based on failed attempts to block traffic referred from Hacker News, I had been under the impression that identifying referrers or user-agents in .htaccess and suggesting that they listen to Black Sabbath wasn’t viable on Nearly Free Speech.

Nevertheless, I decided to try out the following directive, based on code from Marcotte’s blog post.
RewriteEngine on
RewriteBase /

# Google, SEO bot, or AI bot?
# Computer says, ‘fuck you.’
RewriteCond %{HTTP_USER_AGENT} (Googlebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google-Mobile) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Google-Safety) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Feedfetcher-Google) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Googlebot-Image) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Googlebot-Mobile) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Googlebot-News) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Googlebot-Video) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (GoogleOther) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (bingbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (BingPreview) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DiscordBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (LinkedInBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (redditbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (TelegramBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (adbeat_bot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Turnitin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (360Spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Baiduspider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (HaoSouSpider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (msnbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (msnbot-media) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Yandex) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (YandexBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (YandexImages) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (YandexRenderResourcesBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AhrefsSiteAudit) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Semrushbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SiteAuditBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SemrushBot-BA) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SemrushBot-SI) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SemrushBot-SWA) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SemrushBot-CT) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SplitSignalBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (SemrushBot-COUB) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Dotbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Rogerbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Twitterbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Pinterestbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FacebookExternalHit) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Amazonbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Applebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Applebot-Extended) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AwarioRssBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (AwarioSmartBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Bytespider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (CCBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ChatGPT) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ChatGPT-User) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Claude-Web) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (cohere-ai) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DataForSeoBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Diffbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FacebookBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Google-Extended) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (GPTBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ImagesiftBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (magpie-crawler) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (omgili) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Omgilibot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (peer39_crawler) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (PerplexityBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (YouBot) [NC]
RewriteRule ^ - [F]
.htaccess directive that refuses connections from known bad actors
I also have an X-Robots-Tag header now, courtesy of @zgp. See his post X-Robots-Tag for GPC for additional details.
Header set X-Robots-Tag: "noai, noimageai, GPC"
.htaccess directive that sets the X-Robots-Tag header
I figured that if I cleared my cache, hit my site, and got HTTP 500, I could just undo the change and push a new .htaccess with no lasting harm done. To my surprise, it appears to work.
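It also means the 402 idea from the previous update is no longer entirely hypothetical. Here is an untested sketch of what it might look like; it leans on mod_rewrite’s R flag, which accepts non-redirect status codes (the substitution is dropped and rewriting stops), and the handful of user agents listed are only examples pulled from the longer block above.

RewriteEngine on
RewriteBase /

# Untested sketch: answer a few known AI crawlers with 402 Payment Required
# instead of 403 Forbidden. A non-3xx status on the R flag drops the
# substitution and stops rewriting.
RewriteCond %{HTTP_USER_AGENT} (GPTBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (CCBot) [NC]
RewriteRule ^ - [R=402,L]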
This means I can stop pussyfooting in robots.txt. As Marcotte notes, that file is for suggestions. .htaccess is for commands to be obeyed on pain of death.
I therefore have a much shorter and simpler robots.txt file than I’ve had before, if you strip out the posturing in the comments.
# If I want to block bots, I’ll do it in .htaccess or httpd.conf.
# robots.txt is for polite suggestions.
# I need not politely suggest on my own website.
# I shall crown myself in independence, and COMMAND.
# As your empires fall,
# let my will be done on Earth,
# and Heaven be damned.

User-agent: *
Disallow:

Sitemap: https://starbreaker.org/sitemap.xml

# In my own name, amen.
robots.txt file, now much shorter and simpler
Sure, I’m playing the edgelord here (and I manscape with that edge), but comments like these are at least half the fun for me when I wear my sysadmin hat. Otherwise I’d just get my cats to deal with it. It’s not like I write edgy comments in code I write for my day job. If you can’t be unprofessional on your personal website, then where can you?
Update for 2024-06-18: Come at me, techbros...
Looks like Taylor Troesh has shared this on Hacker News. Never mind that I had asked him not to share my posts there.
Well, to Hell with them — preferably one of their own making. Let’s see if any of them have any novel arguments for why personal website operators should simply allow their sites to be scraped by corporations and fed into virtual meat grinders so that LLMs can churn out more pink slime and make rich assholes even richer at the expense of writers and artists with day jobs.
While we’re at it, let’s see if any of them have the guts to post their arguments on their own websites instead of hiding behind their HN handles. I suspect most of them don’t even have their own websites. They’re probably content to post on Medium or Substack.
I already know these companies can spoof their user agent. I subscribe to Robb Knight’s RSS feed, too, you see. So I’ve heard about Perplexity AI’s shenanigans.
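If and when they spoof, the obvious fallback is blocking by IP range, which Apache 2.4 can do without mod_rewrite. Whether Nearly Free Speech permits authorization directives in .htaccess is an open question, and the range below is a placeholder from the documentation space rather than a real crawler network; the real ranges would have to come from the operators’ published lists or from server logs.

<RequireAll>
    Require all granted
    # 203.0.113.0/24 is a documentation placeholder (TEST-NET-3), not an actual crawler range.
    Require not ip 203.0.113.0/24
</RequireAll>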
Nor am I interested in hearing that I should just accept it because I am just one man shouting defiance at a corporation that can do as it likes with impunity. This is schoolyard logic. It is no different from being told I should just hand over my lunch money because the bully is bigger than me and has more friends than I do. The use of such logic by HN’s commenters and those they claim to admire for their success says much about their character — none of it flattering.
I really ought to figure out how to block unwelcome referrers in .htaccess without causing Nearly Free Speech to throw a hissy-fit. I’m not asking for much; I’d be content to 403 their asses.
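Something like this untested sketch ought to do it, assuming mod_rewrite treats HTTP_REFERER the same way it treats HTTP_USER_AGENT in the rules above; requests with no referrer at all are left alone.

RewriteEngine on

# Untested sketch: refuse requests referred from Hacker News with 403 Forbidden.
RewriteCond %{HTTP_REFERER} news\.ycombinator\.com [NC]
RewriteRule ^ - [F]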