robots.txt 🤖
A collection of robots.txt files for sites, such as Neocities or Nekoweb, where robots.txt is the only available way to defend against scraping.
Usage Instructions:
1: Pick a file you want to use.
- blacklist-robots.txt - a large robots.txt file, around 24 KB, that attempts to block everything bad: search engines, AI crawlers, scrapers, and SEO bots.
- whitelist-robots.txt - a small robots.txt file, around 1.3 KB (you can shrink it further). It currently only allows Wiby and Marginalia Search.
2: Rename your chosen file to 'robots.txt'
2.5: If you have a sitemap, add it inside the file where instructed; if you don't, delete that line (see the sketch after step 3).
3: Upload it to your site!
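For reference, a minimal whitelist-style robots.txt might look like the sketch below. The User-agent token and the sitemap URL are placeholders rather than the exact entries in whitelist-robots.txt, so check that file for the real ones.

    # Block every crawler by default
    User-agent: *
    Disallow: /

    # Allow a specific crawler (placeholder name, see whitelist-robots.txt for the real tokens)
    User-agent: ExampleGoodBot
    Disallow:

    # Optional: point crawlers at your sitemap, or delete this line (step 2.5)
    Sitemap: https://example.com/sitemap.xml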
Request to Modify Repository 🛠️
Would you like a bot added or removed from either the whitelist or blacklist file? My email is Here. Specify the following in your email:
1: File to modify
2: Bot to add or remove
3: Name of the crawler's User-agent and, if possible, its website
4: Why the bot should be added or removed
This Gitea instance currently does not allow new sign-ups, and therefore no new PRs, so email is the only way to make a request; that will change if the instance ever opens up again. Email also lets anyone who already has an address make a request, and it should be easier to manage.
Where did you get the User-agents from?
These User-agents were manually collected from the members of Baccyflap's No AI webring: I went through the list of sites, found the User-agents they disallow, put them all into one file, and ran:
sort <File> | uniq -u > final_output.txt
You can run the same command if you want to build a list of your own.
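One thing to be aware of: uniq -u keeps only lines that appear exactly once, so a User-agent blocked by more than one member site gets dropped from the output. If you just want a deduplicated list built straight from downloaded robots.txt files, a sketch along these lines (file names are placeholders) may be closer to what you want:

    # Pull the User-agent tokens out of downloaded robots.txt files
    # (placeholder file names) and keep one copy of each, even if the
    # same token appears on several sites.
    grep -ih '^user-agent:' site1-robots.txt site2-robots.txt site3-robots.txt \
        | sed 's/^[Uu]ser-[Aa]gent:[[:space:]]*//' \
        | sort -u > final_output.txt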