Sometimes I wonder if there’s any way to prevent LLMs from scraping my website short of issuing a “deny all” directive in robots.txt, but then I remember a couple of things:
- It might be nice to be discoverable in search engines.
- There’s no guarantee that LLM scraper bots will honor robots.txt in the first place.
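For what it’s worth, a middle ground short of “deny all” is to block only the crawlers that publicly document their AI-training user agents. This is just a sketch; GPTBot (OpenAI), Google-Extended (Google’s AI-training opt-out token), and CCBot (Common Crawl) are real, documented user agents, but the list is incomplete, and nothing forces any bot to obey it:

```
# Block known AI-training crawlers; ordinary search bots are unaffected.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else can crawl as usual.
User-agent: *
Allow: /
```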
Still, I’d rather ensure that my writing isn’t used to train an AI. It might not make the AI any more racist and misogynistic than training it on Twitter would, but I doubt we want the AI saying shit like “the existence of billionaires is a market failure”.
After all, we wouldn’t want Sam Altman and other AI bigwigs shitting themselves in public.
In the meantime, I’ve got the following in each page’s <head>:
<meta name="robots" content="index, follow, noai, noimageai">
No guarantee it will be honored, but at least I’m making it clear that LLM scrapers aren’t welcome on starbreaker.org even if I can’t hunt down the people behind them and give them rebar suppositories in the tradition of Vlad Tepes…
Then again, there’s no guarantee the following will be honored, either.
<link href="https://creativecommons.org/licenses/by-nc-sa/4.0/"
rel="license text">
<link href="https://www.gnu.org/licenses/gpl-3.0.en.html"
rel="license code">
Anything you publish to the internet might as well be de facto public domain.