New README
This commit is contained in:
320
README.md
320
README.md
@@ -19,6 +19,214 @@ If you have some suggestion or issue, feel free to open a [New Issue](https://gi
|
||||
For real-time chat and other support join IRC on [##go-away](ircs://irc.libera.chat/##go-away) on Libera.Chat. The channel may not be monitored at all times, feel free to ping the operators there.
|
||||
|
||||
|
||||
## Features
|
||||
|
||||
### Rich rule matching
|
||||
|
||||
[Common Expression Language (CEL)](https://cel.dev/overview/cel-overview) is used to allow arbitrary selection of client properties, not only limited to regex. Boolean operators are supported.
|
||||
|
||||
Templates can be defined in the Policy to allow reuse of such conditions on rule matching. Challenges can also be gated behind conditions.
|
||||
|
||||
See the [CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md) for the syntax.
|
||||
|
||||
Rules and conditions are served with this environment:
|
||||
|
||||
```
|
||||
remoteAddress (net.IP) - Connecting client remote address from headers or properties
|
||||
host (string) - HTTP Host
|
||||
method (string) - HTTP Method/Verb
|
||||
userAgent (string) - HTTP User-Agent header
|
||||
path (string) - HTTP request Path
|
||||
query (map[string]string) - HTTP request Query arguments
|
||||
headers (map[string]string) - HTTP request headers
|
||||
|
||||
Only available when TLS is enabled
|
||||
fpJA3N (string) JA3N TLS Fingerprint
|
||||
fpJA4 (string) JA4 TLS Fingerprint
|
||||
```
|
||||
|
||||
Additionally, these functions are available:
|
||||
```
|
||||
Check whether a given IP is listed on the underlying defined network or CIDR
|
||||
inNetwork(networkName string, address net.IP) bool
|
||||
inNetwork(networkCIDR string, address net.IP) bool
|
||||
|
||||
Check whether a given IP is listed on the provided DNSBL
|
||||
inDNSBL(address net.IP) bool
|
||||
```
|
||||
|
||||
### Template support
|
||||
|
||||
Internal or external templates can be loaded to customize the look of the challenge or error page. Additionally, themes can be configured to change the look of these quickly.
|
||||
|
||||
These templates are included by default:
|
||||
|
||||
* `anubis`: An anubis-like themed challenge.
|
||||
* `forgejo`: Uses the Forgejo template and assets from your own instance. Supports specifying themes like `forgejo-light` and `forgejo-dark`.
|
||||
|
||||
External templates for your site can be loaded specifying a full path to the `.gohtml` file. See [embed/templates/](embed/templates/) for examples to follow.
|
||||
|
||||
### Extended rule actions
|
||||
|
||||
In addition to the common PASS / CHALLENGE / DENY rules, we offer CHECK and POISON.
|
||||
|
||||
CHECK allows the client to be challenged but continue matching rules after these.
|
||||
|
||||
POISON sends defined responses to bad clients that will annoy them.
|
||||
|
||||
### Multiple challenge matching
|
||||
|
||||
Several challenges can be offered as options for rules. This allows users that have passed other challenges before to not be affected.
|
||||
|
||||
For example:
|
||||
```yaml
|
||||
- name: standard-browser
|
||||
action: challenge
|
||||
challenges: [http-cookie-check, self-preload-link, self-meta-refresh, self-resource-load, js-pow-sha256]
|
||||
conditions:
|
||||
- '($is-generic-browser)'
|
||||
```
|
||||
|
||||
This rule has the user be checked against a backend, then attempts pass a few browser challenges.
|
||||
|
||||
In this case the processing would stop at `self-meta-refresh` due to the behavior of earlier challenges.
|
||||
|
||||
Any of these listed challenges being passed in the past will allow the client through, including non-offered `self-resource-load` and `js-pow-sha256`.
|
||||
|
||||
### Non-Javascript challenges
|
||||
|
||||
Several challenges that do not require JavaScript are offered, some targeting the HTTP stack and others a general browser behavior, or consulting with a backend service.
|
||||
|
||||
These can be used for light checking of requests that eliminate most of the low effort scraping.
|
||||
|
||||
See [Challenges](#challenges) below for a list of them.
|
||||
|
||||
### Custom proof-of-work JS / WASM challenges
|
||||
|
||||
A WASM interface for server-side proof generation and checking is offered. We provide `js-pow-sha256` as an example of one.
|
||||
|
||||
An internal test has shown you can implement Captchas or other browser fingerprinting tests within this interface.
|
||||
|
||||
If you are interested in creating your own, see the [Development](#development) section below.
|
||||
|
||||
### Upstream PROXY support
|
||||
|
||||
Support for [HAProxy PROXY protocol](https://github.com/haproxy/haproxy/blob/master/doc/proxy-protocol.txt) can be enabled.
|
||||
|
||||
This allows sending the client IP without altering the connection or HTTP headers.
|
||||
|
||||
Supported by HAProxy, [Caddy](https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#proxy_protocol), [nginx](https://nginx.org/en/docs/stream/ngx_stream_proxy_module.html#proxy_protocol) and others.
|
||||
|
||||
### Automatic TLS support and HTTP/2 support
|
||||
|
||||
You can enable automatic certificate generation and TLS for the site via any ACME directory, which enables HTTP/2.
|
||||
|
||||
Without TLS, HTTP/2 cleartext is supported, but you will need to configure the upstream proxy to send this protocol (`h2c://` on Caddy for example).
|
||||
|
||||
|
||||
### TLS Fingerprinting
|
||||
|
||||
When running with TLS via autocert, TLS Fingerprinting of the incoming client is done.
|
||||
|
||||
This can be targeted on conditions or other application logic.
|
||||
|
||||
Read more about [JA3](https://medium.com/salesforce-engineering/tls-fingerprinting-with-ja3-and-ja3s-247362855967) and [JA4](https://github.com/FoxIO-LLC/ja4/blob/main/technical_details/README.md).
|
||||
|
||||
|
||||
### DNSBL
|
||||
|
||||
You can configure a [DNSBL (Domain Name System blocklist)](https://en.wikipedia.org/wiki/Domain_Name_System_blocklist) to be queried on rules and conditions.
|
||||
|
||||
This allows you to serve harder or different challenges to higher risk clients, or block them from specific sections.
|
||||
|
||||
Only rules that match DNSBL will cause a query to be sent, meaning the bulk of requests will not be sent to this service upstream.
|
||||
|
||||
Results will be temporarily cached
|
||||
|
||||
By default, [DroneBL](https://dronebl.org/) is used.
|
||||
|
||||
### Network range loading
|
||||
|
||||
Network ranges can be loaded via fetched JSON / TXT / HTML pages, or via lists. You can filter these using _jq_ or a regex.
|
||||
|
||||
Example for _jq_:
|
||||
```yaml
|
||||
aws-cloud:
|
||||
- url: https://ip-ranges.amazonaws.com/ip-ranges.json
|
||||
jq-path: '(.prefixes[] | select(has("ip_prefix")) | .ip_prefix), (.prefixes[] | select(has("ipv6_prefix")) | .ipv6_prefix)'
|
||||
```
|
||||
|
||||
Example for _regex_:
|
||||
```yaml
|
||||
cloudflare:
|
||||
- url: https://www.cloudflare.com/ips-v4
|
||||
regex: "(?P<prefix>[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+/[0-9]+)"
|
||||
- url: https://www.cloudflare.com/ips-v6
|
||||
regex: "(?P<prefix>[0-9a-f:]+::/[0-9]+)"
|
||||
```
|
||||
|
||||
|
||||
### Sharing of signing seed across instances
|
||||
|
||||
You can share the signing secret across multiple of your instances if you'd like to deploy multiple across the world.
|
||||
|
||||
That way signed secrets will be verifiable across all the instances.
|
||||
|
||||
By default, a random temporary key is generated every run.
|
||||
|
||||
### Multiple backend support
|
||||
|
||||
Multiple backends are supported, and rules specific on backend can be defined, and conditions and rules can match this as well.
|
||||
|
||||
This allows one instance to run multiple domains or subdomains.
|
||||
|
||||
### Package path
|
||||
|
||||
You can modify the path where challenges are served and package name, if you don't want its presence to be easily discoverable.
|
||||
|
||||
No source code editing or forking necessary!
|
||||
|
||||
## Why?
|
||||
In the past few years this small git instance has been hit by waves and waves of scraping.
|
||||
This was usually fought back by random useragent blocks for bots that did not follow [robots.txt](/robots.txt), until the past half year, where low-effort mass scraping was used more prominently.
|
||||
|
||||
Recently these networks go from using residential IP blocks to sending requests at several hundred rps.
|
||||
|
||||
If the server gets sluggish, more requests pile up. Even when denied they scrape for weeks later. Effectively spray and pray scraping, process later.
|
||||
|
||||
At some point about 300Mbit/s of incoming requests (not including the responses) was hitting the server. And all at nonsense URLs
|
||||
|
||||
If AI is so smart, why not just git clone the repositories?
|
||||
|
||||
|
||||
Xe (anubis creator) has written about similar frustrations in several blogposts:
|
||||
|
||||
* [Amazon's AI crawler is making my git server unstable](https://xeiaso.net/notes/2025/amazon-crawler/) [01/17/2025]
|
||||
* [Anubis works](https://xeiaso.net/notes/2025/anubis-works/) [04/12/2025]
|
||||
|
||||
Drew DeVault (sourcehut) has posted several articles regarding the same issues:
|
||||
* [Please stop externalizing your costs directly into my face](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html) [17/03/2025]
|
||||
* (fun tidbit: I'm the one quoted as having the feedback discussion interrupted to deal with bots!)
|
||||
|
||||
Others were also suffering at the same time [[1]](https://donotsta.re/notice/AreSNZlRlJv73AW7tI) [[2]](https://community.ipfire.org/t/suricata-ruleset-to-prevent-ai-scraping/11974) [[3]](https://gabrielsimmer.com/blog/stop-scraping-git-forge) [[4]](https://gabrielsimmer.com/blog/stop-scraping-git-forge) [[5]](https://blog.nytsoi.net/2025/03/01/obliterated-by-ai).
|
||||
|
||||
---
|
||||
Initially I deployed Anubis, and yeah, it does work!
|
||||
|
||||
This tool started as a way to replace [Anubis](https://anubis.techaro.lol/) as it was not found as featureful as desired.
|
||||
|
||||
go-away may not be as straight to configure as Anubis but this was chosen to reduce impact on legitimate users, and offers many more options to dynamically target new waves.
|
||||
|
||||
### Can't scrapers adapt?
|
||||
|
||||
Yes, they can. At the moment their spray-and-pray approach is cheap for them.
|
||||
|
||||
If they have to start adding an active browser in their scraping, that makes their collection expensive and slow.
|
||||
|
||||
This would more or less eliminate the high rate low effort passive scraping and replace it with an active model.
|
||||
|
||||
go-anubis offers a highly configurable set of challenges and rules that you can adapt to new ways.
|
||||
|
||||
## Example policies
|
||||
|
||||
### Forgejo
|
||||
@@ -150,6 +358,118 @@ services:
|
||||
|
||||
```
|
||||
|
||||
## Challenges
|
||||
|
||||
#### http
|
||||
|
||||
Verify incoming requests against a specified backend to allow the user through. Cookies and some other headers are passed.
|
||||
|
||||
For example, this allows verifying the user cookies against the backend to have the user skip all other challenges.
|
||||
|
||||
Example on Forgejo, checks that current user is authenticated:
|
||||
```yaml
|
||||
http-cookie-check:
|
||||
mode: http
|
||||
url: http://forgejo:3000/user/stopwatches
|
||||
# url: http://forgejo:3000/repo/search
|
||||
# url: http://forgejo:3000/notifications/new
|
||||
parameters:
|
||||
http-method: GET
|
||||
http-cookie: i_like_gitea
|
||||
http-code: 200
|
||||
```
|
||||
|
||||
#### preload-link
|
||||
|
||||
Requires HTTP/2+ response parsing and logic, silent challenge (does not display a challenge page).
|
||||
|
||||
Browsers that support [103 Early Hints](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/103) are indicated to fetch a CSS resource via [Link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Link) preload that solves the challenge.
|
||||
|
||||
The server waits until solved or defined timeout, then continues on other challenges if failed.
|
||||
|
||||
Example:
|
||||
```yaml
|
||||
self-preload-link:
|
||||
condition: '"Sec-Fetch-Mode" in headers && headers["Sec-Fetch-Mode"] == "navigate"'
|
||||
mode: "preload-link"
|
||||
runtime:
|
||||
# verifies that result = key
|
||||
mode: "key"
|
||||
probability: 0.1
|
||||
parameters:
|
||||
preload-early-hint-deadline: 3s
|
||||
key-code: 200
|
||||
key-mime: text/css
|
||||
key-content: ""
|
||||
```
|
||||
|
||||
#### header-refresh
|
||||
|
||||
Requires HTTP response parsing and logic, displays challenge site instantly.
|
||||
|
||||
Have the browser solve the challenge by following the URL listed on HTTP [Refresh](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Refresh) instantly.
|
||||
|
||||
|
||||
#### meta-refresh
|
||||
|
||||
Requires HTTP and HTML response parsing and logic, displays challenge site instantly.
|
||||
|
||||
Have the browser solve the challenge by following the URL listed on HTML `<meta http-equiv=refresh>` tag instantly. Equivalent to above.
|
||||
|
||||
#### resource-load
|
||||
|
||||
Requires HTTP and HTML response parsing and logic, displays challenge site.
|
||||
|
||||
Servers a challenge page with a linked resource that is loaded by the browser, which solves the challenge. Page refreshes a few seconds later via [Refresh](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Refresh).
|
||||
|
||||
Example:
|
||||
```yaml
|
||||
self-resource-load:
|
||||
mode: "resource-load"
|
||||
runtime:
|
||||
# verifies that result = key
|
||||
mode: "key"
|
||||
probability: 0.1
|
||||
parameters:
|
||||
key-code: 200
|
||||
key-mime: text/css
|
||||
key-content: ""
|
||||
```
|
||||
|
||||
#### cookie
|
||||
|
||||
Requires HTTP parsing and a Cookie Jar, silent challenge (does not display a challenge page unless failed).
|
||||
|
||||
Serves the client with a Set-Cookie that solves the challenge, and redirects it back to the same page. Browser must present the cookie to load.
|
||||
|
||||
Several tools implement this, but usually not mass scrapers.
|
||||
|
||||
#### js-pow-sha256
|
||||
|
||||
Requires JavaScript and workers, displays challenge site.
|
||||
|
||||
Has the user solve a Proof of Work using SHA256 hashes, with configurable difficulty.
|
||||
|
||||
Example:
|
||||
```yaml
|
||||
js-pow-sha256:
|
||||
# Asset must be under challenges/{name}/static/{asset}
|
||||
# Other files here will be available under that path
|
||||
mode: js
|
||||
asset: load.mjs
|
||||
parameters:
|
||||
# difficulty is number of bits that must be set to 0 from start
|
||||
# Anubis challenge difficulty 5 becomes 5 * 8 = 20
|
||||
difficulty: 20
|
||||
runtime:
|
||||
mode: wasm
|
||||
# Verify must be under challenges/{name}/runtime/{asset}
|
||||
asset: runtime.wasm
|
||||
probability: 0.02
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
## Development
|
||||
|
||||
|
@@ -254,18 +254,7 @@ conditions:
|
||||
# user activity tab
|
||||
- 'path.matches("^/[^/]+$") && "tab" in query && query.tab == "activity"'
|
||||
|
||||
# Rules and conditions are served this environment
|
||||
# remoteAddress (net.IP) - Connecting client remote address from headers or properties
|
||||
# host (string) - HTTP Host
|
||||
# method (string) - HTTP Method/Verb
|
||||
# userAgent (string) - HTTP User-Agent header
|
||||
# path (string) - HTTP request Path
|
||||
# query (map[string]string) - HTTP request Query arguments
|
||||
# headers (map[string]string) - HTTP request headers
|
||||
#
|
||||
# Additionally these functions are available
|
||||
# inNetwork(networkName string, address net.IP) bool
|
||||
# inNetwork(networkCIDR string, address net.IP) bool
|
||||
|
||||
rules:
|
||||
- name: allow-well-known-resources
|
||||
conditions:
|
||||
|
@@ -164,18 +164,7 @@ conditions:
|
||||
- 'userAgent.matches("^Mozilla/[1-4]")'
|
||||
|
||||
|
||||
# Rules and conditions are served this environment
|
||||
# remoteAddress (net.IP) - Connecting client remote address from headers or properties
|
||||
# host (string) - HTTP Host
|
||||
# method (string) - HTTP Method/Verb
|
||||
# userAgent (string) - HTTP User-Agent header
|
||||
# path (string) - HTTP request Path
|
||||
# query (map[string]string) - HTTP request Query arguments
|
||||
# headers (map[string]string) - HTTP request headers
|
||||
#
|
||||
# Additionally these functions are available
|
||||
# inNetwork(networkName string, address net.IP) bool
|
||||
# inNetwork(networkCIDR string, address net.IP) bool
|
||||
|
||||
rules:
|
||||
- name: allow-well-known-resources
|
||||
conditions:
|
||||
|
Reference in New Issue
Block a user