Added Robots files (commit to https://git.qwik.space/left4code/robots.txt.git)

README.md:

# robots.txt 🤖
A collection of robots.txt files for sites where a robots.txt is the only available defense against scraping, such as Neocities or Nekoweb.

## Usage Instructions:
### 1: Pick a file you want to use.
- **blacklist-robots.txt** - a large robots.txt file, around **24 KB**, that attempts to block everything bad: search engines, AI crawlers, scrapers, and SEO bots.
- **whitelist-robots.txt** - a small robots.txt file, around **1.3 KB** (you can shrink it further). It currently allows only [Wiby](https://wiby.org/) and [Marginalia Search](https://marginalia-search.com/); the allow/deny pattern it uses is sketched below.
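
Both files lean on a quirk of the Robots Exclusion Protocol: an empty `Disallow:` value allows everything for the agents listed above it, while `Disallow: /` blocks the entire site. A minimal sketch of the whitelist pattern (`ExampleBot` is a placeholder, not a real crawler name):

```
# The named crawler may fetch everything: an empty Disallow matches no paths
User-agent: ExampleBot
Disallow:

# Everyone else is blocked from the whole site
User-agent: *
Disallow: /
```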
### 2: Rename your chosen file to 'robots.txt'.
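
If you are working locally, the rename is a single command (shown here assuming you picked the whitelist file):

```
mv whitelist-robots.txt robots.txt
```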
### 2.5: If you have a sitemap, add it inside the file where instructed; if you don't, delete that line.
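
For example, with a hypothetical site at `example.neocities.org`, the placeholder line at the bottom of the file would become:

```
sitemap: https://example.neocities.org/sitemap.xml
```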
### 3: Upload it to your site!
## Request to Modify Repository
Would you like a bot added to or removed from either the whitelist or blacklist file? My email address is [here](https://left4code.neocities.org/left4code_gpg.txt). Specify the following in your email:
1: File to modify
2: Bot to add or remove
3: Name of the crawler's User-agent and, if possible, its website
4: Why the bot should be added or removed
Currently this Gitea instance does not allow new sign-ups, and therefore no new PRs. Email is the only way to make requests for now; that will change if this instance ever opens up again. It also lets anyone who already has an email address make requests, and it should hopefully be easier to manage.
## Where did you get the User-agents from?
These user agents were manually obtained from [Baccyflap's No AI webring](https://baccyflap.com/noai/): I went through the list, collected the disallowed user agents from each site, put them all into one file, and ran:
`sort <File> | uniq > final_output.txt`
Run the same command if you want to create a list of your own.
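
If you are collecting user agents from several downloaded robots.txt files at once, one possible way to extract and deduplicate the `User-agent` values in a single pass (the `robots/*.txt` path is hypothetical):

```
# Grab every User-agent line (-h: no filename prefix, -i: any case),
# strip the field name, then sort and deduplicate into one list
grep -hi '^User-agent:' robots/*.txt \
  | sed 's/^[^:]*:[[:space:]]*//' \
  | sort -u > final_output.txt
```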
## I do not guarantee this will make you impervious to bots; this is simply what I use. Help would be appreciated in keeping the list updated, managed, and, hopefully in the future, documented.
blacklist-robots.txt (new file, 1077 lines; diff suppressed because it is too large)

whitelist-robots.txt (new file, 22 lines):

#[WHITELIST-ROBOTS.TXT VERSION 1.0]
#[MAINTAINED AT: https://git.qwik.space/Left4Code/robots.txt] GET A COPY OR REPORT ISSUES THERE
#_________________________________________________________________________
# Lots of AI companies and scrapers seem to use alternative, unpublished user-agents; only some get caught and shamed. Consider using some form of active AI blocking if possible, like Go-away or Anubis. For static sites hosted on Neocities, Nekoweb, etc., this file is another option.
#___________________________________________________
User-agent: WibyBot
User-agent: search.marginalia.nu
Disallow:
User-agent: *
Disallow: /
Disallow: *
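# Note: DisallowAITraining and Content-Usage below are not part of the
# classic robots.txt standard; they are newer AI-training opt-out signals,
# and crawlers that don't recognize them simply ignore the unknown fields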
DisallowAITraining: /
DisallowAITraining: *
Content-Usage: ai=n
sitemap: !REPLACE WITH SITEMAP LINK!
#___________________________________________________