From 6d5aaddd0376088135bb65e73712910efe165715 Mon Sep 17 00:00:00 2001
From: WeebDataHoarder <57538841+WeebDataHoarder@users.noreply.github.com>
Date: Sun, 13 Apr 2025 13:10:56 +0200
Subject: [PATCH] New README

---
 README.md            | 320 +++++++++++++++++++++++++++++++++++++++++++
 examples/forgejo.yml |  13 +-
 examples/generic.yml |  13 +-
 3 files changed, 322 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 6da6d06..4dfe633 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,214 @@ If you have some suggestion or issue, feel free to open a [New Issue](https://gi
 
 For real-time chat and other support join IRC on [##go-away](ircs://irc.libera.chat/##go-away) on Libera.Chat. The channel may not be monitored at all times, feel free to ping the operators there.
 
+## Features
+
+### Rich rule matching
+
+[Common Expression Language (CEL)](https://cel.dev/overview/cel-overview) is used to allow arbitrary selection of client properties, not only limited to regex. Boolean operators are supported.
+
+Templates can be defined in the Policy to allow reuse of such conditions on rule matching. Challenges can also be gated behind conditions.
+
+See the [CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md) for the syntax.
+
+Rules and conditions are served with this environment:
+
+```
+remoteAddress (net.IP) - Connecting client remote address from headers or properties
+host (string) - HTTP Host
+method (string) - HTTP Method/Verb
+userAgent (string) - HTTP User-Agent header
+path (string) - HTTP request Path
+query (map[string]string) - HTTP request Query arguments
+headers (map[string]string) - HTTP request headers
+
+Only available when TLS is enabled
+    fpJA3N (string) JA3N TLS Fingerprint
+    fpJA4 (string) JA4 TLS Fingerprint
+```
+
+Additionally, these functions are available:
+```
+Check whether a given IP is listed on the underlying defined network or CIDR
+    inNetwork(networkName string, address net.IP) bool
+    inNetwork(networkCIDR string, address net.IP) bool
+
+Check whether a given IP is listed on the provided DNSBL
+    inDNSBL(address net.IP) bool
+```
+
+### Template support
+
+Internal or external templates can be loaded to customize the look of the challenge or error page. Additionally, themes can be configured to change the look of these quickly.
+
+These templates are included by default:
+
+* `anubis`: An anubis-like themed challenge.
+* `forgejo`: Uses the Forgejo template and assets from your own instance. Supports specifying themes like `forgejo-light` and `forgejo-dark`.
+
+External templates for your site can be loaded specifying a full path to the `.gohtml` file. See [embed/templates/](embed/templates/) for examples to follow.
+
+### Extended rule actions
+
+In addition to the common PASS / CHALLENGE / DENY rules, we offer CHECK and POISON.
+
+CHECK allows the client to be challenged but continue matching rules after these.
+
+POISON sends defined responses to bad clients that will annoy them.
+
+### Multiple challenge matching
+
+Several challenges can be offered as options for rules. This allows users that have passed other challenges before to not be affected.
+
+For example:
+```yaml
+  - name: standard-browser
+    action: challenge
+    challenges: [http-cookie-check, self-preload-link, self-meta-refresh, self-resource-load, js-pow-sha256]
+    conditions:
+      - '($is-generic-browser)'
+```
+
+This rule has the user be checked against a backend, then attempts to pass a few browser challenges.
+
+In this case the processing would stop at `self-meta-refresh` due to the behavior of earlier challenges.
+
+Any of these listed challenges being passed in the past will allow the client through, including non-offered `self-resource-load` and `js-pow-sha256`.
+
+### Non-Javascript challenges
+
+Several challenges that do not require JavaScript are offered, some targeting the HTTP stack and others a general browser behavior, or consulting with a backend service.
+
+These can be used for light checking of requests that eliminates most of the low-effort scraping.
+
+See [Challenges](#challenges) below for a list of them.
+
+### Custom proof-of-work JS / WASM challenges
+
+A WASM interface for server-side proof generation and checking is offered. We provide `js-pow-sha256` as an example of one.
+
+An internal test has shown you can implement Captchas or other browser fingerprinting tests within this interface.
+
+If you are interested in creating your own, see the [Development](#development) section below.
+
+### Upstream PROXY support
+
+Support for [HAProxy PROXY protocol](https://github.com/haproxy/haproxy/blob/master/doc/proxy-protocol.txt) can be enabled.
+
+This allows sending the client IP without altering the connection or HTTP headers.
+
+Supported by HAProxy, [Caddy](https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#proxy_protocol), [nginx](https://nginx.org/en/docs/stream/ngx_stream_proxy_module.html#proxy_protocol) and others.
+
+### Automatic TLS support and HTTP/2 support
+
+You can enable automatic certificate generation and TLS for the site via any ACME directory, which enables HTTP/2.
+
+Without TLS, HTTP/2 cleartext is supported, but you will need to configure the upstream proxy to send this protocol (`h2c://` on Caddy for example).
+
+
+### TLS Fingerprinting
+
+When running with TLS via autocert, TLS Fingerprinting of the incoming client is done.
+
+This can be targeted on conditions or other application logic.
+
+Read more about [JA3](https://medium.com/salesforce-engineering/tls-fingerprinting-with-ja3-and-ja3s-247362855967) and [JA4](https://github.com/FoxIO-LLC/ja4/blob/main/technical_details/README.md).
+
+
+### DNSBL
+
+You can configure a [DNSBL (Domain Name System blocklist)](https://en.wikipedia.org/wiki/Domain_Name_System_blocklist) to be queried on rules and conditions.
+
+This allows you to serve harder or different challenges to higher risk clients, or block them from specific sections.
+
+Only rules that match DNSBL will cause a query to be sent, meaning the bulk of requests will not be sent to this service upstream.
+
+Results will be temporarily cached.
+
+By default, [DroneBL](https://dronebl.org/) is used.
+
+### Network range loading
+
+Network ranges can be loaded via fetched JSON / TXT / HTML pages, or via lists. You can filter these using _jq_ or a regex.
+
+Example for _jq_:
+```yaml
+  aws-cloud:
+    - url: https://ip-ranges.amazonaws.com/ip-ranges.json
+      jq-path: '(.prefixes[] | select(has("ip_prefix")) | .ip_prefix), (.prefixes[] | select(has("ipv6_prefix")) | .ipv6_prefix)'
+```
+
+Example for _regex_:
+```yaml
+  cloudflare:
+    - url: https://www.cloudflare.com/ips-v4
+      regex: "(?P[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+/[0-9]+)"
+    - url: https://www.cloudflare.com/ips-v6
+      regex: "(?P[0-9a-f:]+::/[0-9]+)"
+```
+
+
+### Sharing of signing seed across instances
+
+You can share the signing secret across multiple of your instances if you'd like to deploy multiple across the world.
+
+That way signed secrets will be verifiable across all the instances.
+
+By default, a random temporary key is generated every run.
+
+### Multiple backend support
+
+Multiple backends are supported. Rules specific to a backend can be defined, and conditions and rules can match on the backend as well.
+
+This allows one instance to run multiple domains or subdomains.
+
+### Package path
+
+You can modify the path where challenges are served and the package name, if you don't want its presence to be easily discoverable.
+
+No source code editing or forking necessary!
+
+## Why?
+In the past few years this small git instance has been hit by waves and waves of scraping.
+This was usually fought back by random useragent blocks for bots that did not follow [robots.txt](/robots.txt), until the past half year, where low-effort mass scraping was used more prominently.
+
+Recently these networks go from using residential IP blocks to sending requests at several hundred rps.
+
+If the server gets sluggish, more requests pile up. Even when denied they scrape for weeks later. Effectively spray and pray scraping, process later.
+
+At some point about 300Mbit/s of incoming requests (not including the responses) was hitting the server. And all at nonsense URLs.
+
+If AI is so smart, why not just git clone the repositories?
+
+
+Xe (anubis creator) has written about similar frustrations in several blogposts:
+
+* [Amazon's AI crawler is making my git server unstable](https://xeiaso.net/notes/2025/amazon-crawler/) [01/17/2025]
+* [Anubis works](https://xeiaso.net/notes/2025/anubis-works/) [04/12/2025]
+
+Drew DeVault (sourcehut) has posted several articles regarding the same issues:
+* [Please stop externalizing your costs directly into my face](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html) [17/03/2025]
+    * (fun tidbit: I'm the one quoted as having the feedback discussion interrupted to deal with bots!)
+
+Others were also suffering at the same time [[1]](https://donotsta.re/notice/AreSNZlRlJv73AW7tI) [[2]](https://community.ipfire.org/t/suricata-ruleset-to-prevent-ai-scraping/11974) [[3]](https://gabrielsimmer.com/blog/stop-scraping-git-forge) [[4]](https://gabrielsimmer.com/blog/stop-scraping-git-forge) [[5]](https://blog.nytsoi.net/2025/03/01/obliterated-by-ai).
+
+---
+Initially I deployed Anubis, and yeah, it does work!
+
+This tool started as a way to replace [Anubis](https://anubis.techaro.lol/) as it was not as featureful as desired.
+
+go-away may not be as straightforward to configure as Anubis but this was chosen to reduce impact on legitimate users, and offers many more options to dynamically target new waves.
+
+### Can't scrapers adapt?
+
+Yes, they can. At the moment their spray-and-pray approach is cheap for them.
+
+If they have to start adding an active browser in their scraping, that makes their collection expensive and slow.
+
+This would more or less eliminate the high rate low effort passive scraping and replace it with an active model.
+
+go-away offers a highly configurable set of challenges and rules that you can adapt to new ways.
+
 ## Example policies
 
 ### Forgejo
@@ -150,6 +358,118 @@ services:
 ```
 
 
+## Challenges
+
+#### http
+
+Verify incoming requests against a specified backend to allow the user through. Cookies and some other headers are passed.
+
+For example, this allows verifying the user cookies against the backend to have the user skip all other challenges.
+
+Example on Forgejo, checks that current user is authenticated:
+```yaml
+  http-cookie-check:
+    mode: http
+    url: http://forgejo:3000/user/stopwatches
+    # url: http://forgejo:3000/repo/search
+    # url: http://forgejo:3000/notifications/new
+    parameters:
+      http-method: GET
+      http-cookie: i_like_gitea
+      http-code: 200
+```
+
+#### preload-link
+
+Requires HTTP/2+ response parsing and logic, silent challenge (does not display a challenge page).
+
+Browsers that support [103 Early Hints](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/103) are instructed to fetch a CSS resource via [Link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Link) preload that solves the challenge.
+
+The server waits until the challenge is solved or a defined timeout elapses, then continues on other challenges if failed.
+
+Example:
+```yaml
+  self-preload-link:
+    condition: '"Sec-Fetch-Mode" in headers && headers["Sec-Fetch-Mode"] == "navigate"'
+    mode: "preload-link"
+    runtime:
+      # verifies that result = key
+      mode: "key"
+      probability: 0.1
+    parameters:
+      preload-early-hint-deadline: 3s
+      key-code: 200
+      key-mime: text/css
+      key-content: ""
+```
+
+#### header-refresh
+
+Requires HTTP response parsing and logic, displays challenge site instantly.
+
+Have the browser solve the challenge by following the URL listed on HTTP [Refresh](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Refresh) instantly.
+
+
+#### meta-refresh
+
+Requires HTTP and HTML response parsing and logic, displays challenge site instantly.
+
+Have the browser solve the challenge by following the URL listed on HTML `<meta http-equiv=refresh>` tag instantly. Equivalent to above.
+
+#### resource-load
+
+Requires HTTP and HTML response parsing and logic, displays challenge site.
+
+Serves a challenge page with a linked resource that is loaded by the browser, which solves the challenge. Page refreshes a few seconds later via [Refresh](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Refresh).
+
+Example:
+```yaml
+  self-resource-load:
+    mode: "resource-load"
+    runtime:
+      # verifies that result = key
+      mode: "key"
+      probability: 0.1
+    parameters:
+      key-code: 200
+      key-mime: text/css
+      key-content: ""
+```
+
+#### cookie
+
+Requires HTTP parsing and a Cookie Jar, silent challenge (does not display a challenge page unless failed).
+
+Serves the client with a Set-Cookie that solves the challenge, and redirects it back to the same page. Browser must present the cookie to load.
+
+Several tools implement this, but usually not mass scrapers.
+
+#### js-pow-sha256
+
+Requires JavaScript and workers, displays challenge site.
+
+Has the user solve a Proof of Work using SHA256 hashes, with configurable difficulty.
+
+Example:
+```yaml
+  js-pow-sha256:
+    # Asset must be under challenges/{name}/static/{asset}
+    # Other files here will be available under that path
+    mode: js
+    asset: load.mjs
+    parameters:
+      # difficulty is number of bits that must be set to 0 from start
+      # Anubis challenge difficulty 5 becomes 5 * 4 = 20
+      difficulty: 20
+    runtime:
+      mode: wasm
+      # Verify must be under challenges/{name}/runtime/{asset}
+      asset: runtime.wasm
+      probability: 0.02
+```
+
+
+
 ## Development
 
 
diff --git a/examples/forgejo.yml b/examples/forgejo.yml
index 4e3e142..59972bc 100644
--- a/examples/forgejo.yml
+++ b/examples/forgejo.yml
@@ -254,18 +254,7 @@ conditions:
     # user activity tab
     - 'path.matches("^/[^/]+$") && "tab" in query && query.tab == "activity"'
 
-# Rules and conditions are served this environment
-# remoteAddress (net.IP) - Connecting client remote address from headers or properties
-# host (string) - HTTP Host
-# method (string) - HTTP Method/Verb
-# userAgent (string) - HTTP User-Agent header
-# path (string) - HTTP request Path
-# query (map[string]string) - HTTP request Query arguments
-# headers (map[string]string) - HTTP request headers
-#
-# Additionally these functions are available
-# inNetwork(networkName string, address net.IP) bool
-# inNetwork(networkCIDR string, address net.IP) bool
+
 rules:
   - name: allow-well-known-resources
     conditions:
diff --git a/examples/generic.yml b/examples/generic.yml
index c38afd0..793dbb8 100644
--- a/examples/generic.yml
+++ b/examples/generic.yml
@@ -164,18 +164,7 @@ conditions:
     - 'userAgent.matches("^Mozilla/[1-4]")'
 
 
-# Rules and conditions are served this environment
-# remoteAddress (net.IP) - Connecting client remote address from headers or properties
-# host (string) - HTTP Host
-# method (string) - HTTP Method/Verb
-# userAgent (string) - HTTP User-Agent header
-# path (string) - HTTP request Path
-# query (map[string]string) - HTTP request Query arguments
-# headers (map[string]string) - HTTP request headers
-#
-# Additionally these functions are available
-# inNetwork(networkName string, address net.IP) bool
-# inNetwork(networkCIDR string, address net.IP) bool
+
 rules:
   - name: allow-well-known-resources
     conditions: