https://github.com/simon987/architeuthis

MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

influxdb load-balancer proxy redis scraping


Host: GitHub
URL: https://github.com/simon987/architeuthis
Owner: simon987
License: gpl-3.0
Created: 2019-05-28T13:59:09.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2020-01-03T17:29:21.000Z (over 6 years ago)
Last Synced: 2025-04-17T21:16:03.324Z (over 1 year ago)
Topics: influxdb, load-balancer, proxy, redis, scraping
Language: Go
Homepage:
Size: 308 KB
Stars: 41
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Architeuthis 🦑

[![CodeFactor](https://www.codefactor.io/repository/github/simon987/architeuthis/badge)](https://www.codefactor.io/repository/github/simon987/architeuthis)

![GitHub](https://img.shields.io/github/license/simon987/Architeuthis.svg)

[![Build Status](https://ci.simon987.net/buildStatus/icon?job=architeuthis_builds)](https://ci.simon987.net/job/architeuthis_builds/)

HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

* Strictly obeys configured rate-limiting for each IP & Host

* Seamless exponential backoff retries on timeout or error HTTP codes

* Requires no additional configuration for integration into existing programs

* Configurable per-host behavior

* Monitoring with InfluxDB

![grafana](grafana.png)

### Typical use case

![user_case](use_case.png)

### Usage

```bash
git clone https://github.com/simon987/Architeuthis
cd Architeuthis
vim config.json # Configure settings here
docker-compose up
```

You can add proxies using the `/add_proxy` API:

```bash
# Quote the URL so the shell does not interpret the "&"
curl "http://<host>:5050/add_proxy?url=<proxy_url>&name=<proxy_name>"
```
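The proxy URL passed in `url=` itself contains `:` and `/`, so it is safest to percent-encode it. Below is a minimal Python sketch that builds such a request URL. The function name `build_add_proxy_url`, the example proxy address, and the default port of 5050 are assumptions for illustration; only the `/add_proxy` endpoint and its `url` and `name` parameters come from this README.

```python
from urllib.parse import urlencode


def build_add_proxy_url(host, proxy_url, name, port=5050):
    """Build an /add_proxy request URL with percent-encoded query values.

    `host` and `port` locate the Architeuthis instance (assumed values);
    `proxy_url` and `name` describe the upstream proxy to register.
    """
    query = urlencode({"url": proxy_url, "name": name})
    return f"http://{host}:{port}/add_proxy?{query}"


# Hypothetical upstream proxy, for illustration only.
request_url = build_add_proxy_url("localhost", "http://1.2.3.4:8080", "p0")
print(request_url)
```

The resulting string can be passed to `curl` or to any HTTP client.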

Or automatically using Proxybroker:

```bash
python3 import_from_broker.py "http://<host>:5050"
```

### Example usage with wget

```bash
export http_proxy="http://localhost:5050"
# --no-check-certificate is necessary for https mitm
# You don't need to specify user-agent if it's already in your config.json
wget -m -np -c --no-check-certificate -R "index.html*" http://ca.releases.ubuntu.com/
```
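Other clients can be pointed at the proxy in the same way. As a rough Python equivalent of the wget setup above, the sketch below uses only the standard library. It assumes the proxy listens on `localhost:5050`; `build_opener` and `fetch` are hypothetical helper names, and the 60 second client timeout is an arbitrary choice meant to leave room for rate-limit sleeps and retries inside the proxy.

```python
import ssl
import urllib.request

PROXY_URL = "http://localhost:5050"  # assumption: matches "addr" in config.json


def build_opener(proxy_url=PROXY_URL):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    # The proxy intercepts HTTPS traffic (MITM), so certificate verification
    # has to be disabled, like wget's --no-check-certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


def fetch(url, timeout=60):
    """Fetch `url` through the proxy and return (status, body bytes)."""
    with build_opener().open(url, timeout=timeout) as response:
        return response.status, response.read()
```

Calling `fetch("http://ca.releases.ubuntu.com/")` with the proxy running would route the request through Architeuthis.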

With `"every": "500ms"` and a single proxy, you should see output similar to this:

```
...
level=trace msg=Sleeping wait=414.324437ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA1SUMS.gpg"
level=trace msg=Sleeping wait=435.166127ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS"
level=trace msg=Sleeping wait=438.657784ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS.gpg"
level=trace msg=Sleeping wait=457.06543ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/ubuntu-12.04.5-alternate-amd64.iso"
level=trace msg=Sleeping wait=433.394361ms
...
```
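The `Sleeping` waits stay just under 500ms because part of each interval has already elapsed by the time the next request arrives. One common way to get this behavior is a token bucket; the Python sketch below illustrates the idea under that assumption and is not taken from the Architeuthis source. `TokenBucket` and `reserve` are hypothetical names, and timestamps are passed in explicitly so the example stays deterministic.

```python
class TokenBucket:
    """Token bucket with one new token every `every` seconds, up to `burst`."""

    def __init__(self, every, burst):
        self.every = every          # seconds between tokens, e.g. 0.5 for "500ms"
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # timestamp of the previous reservation

    def reserve(self, now):
        """Reserve one token at time `now`; return the seconds to sleep first."""
        # Refill according to the time elapsed since the last reservation.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed / self.every)
        self.last = now
        # Take a token; a negative balance means the caller has to wait.
        self.tokens -= 1
        if self.tokens >= 0:
            return 0.0
        return -self.tokens * self.every


bucket = TokenBucket(every=0.5, burst=1)
first_wait = bucket.reserve(now=0.0)   # token available, no sleep
second_wait = bucket.reserve(now=0.1)  # must wait until t=0.5
print(first_wait, round(second_wait, 3))
```

In this model a request arriving 100ms after the previous one sleeps for the remaining 400ms of the interval, which is the same pattern as the waits in the log.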

### Hot config reload

```bash
# Note: this resets the current rate limiters. If there are many active
# connections, this might cause a small request spike that exceeds
# the rate limits.
./reload.sh
```

### Rules

Conditions

| Left operand | Description | Allowed operators | Right operand |
| :--- | :--- | :--- | :--- |
| body | Contents of the response | `=`, `!=` | String w/ wildcard |
| body | Contents of the response | `<`, `>` | float |
| status | HTTP response code | `=`, `!=` | String w/ wildcard |
| status | HTTP response code | `<`, `>` | float |
| response_time | HTTP response time | `<`, `>` | duration (e.g. `20s`) |
| `header:<name>` | Response header | `=`, `!=` | String w/ wildcard |
| `header:<name>` | Response header | `<`, `>` | float |

Note that `response_time` can never be higher than the configured `timeout` value.

Examples:

```json
[
  {"condition": "header:X-Test>10", "action": "..."},
  {"condition": "body=*Try again in a few minutes*", "action": "..."},
  {"condition": "response_time>10s", "action": "..."},
  {"condition": "status>500", "action": "..."},
  {"condition": "status=404", "action": "..."},
  {"condition": "status=40*", "action": "..."}
]
```
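To make the condition syntax concrete, here is a small Python sketch of how such strings could be parsed and evaluated. It is an illustration only, not the proxy's actual implementation: `matches` and `parse_duration` are hypothetical names, wildcard matching is approximated with `fnmatchcase` (which also gives `?` and `[...]` a special meaning), and only the `ms`, `s` and `m` duration units are handled.

```python
import re
from fnmatch import fnmatchcase

# left operand, operator, right operand; "!=" must be tried before "="
_CONDITION = re.compile(r"^(?P<left>.+?)(?P<op>!=|=|<|>)(?P<right>.*)$")
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration(text):
    """Convert a duration such as '20s' or '500ms' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def matches(condition, status, body, headers, response_time):
    """Return True if `condition` holds; `response_time` is in seconds."""
    parsed = _CONDITION.match(condition)
    if parsed is None:
        raise ValueError(f"invalid condition: {condition!r}")
    left, op, right = parsed.group("left", "op", "right")

    if left == "status":
        value = str(status)
    elif left == "body":
        value = body
    elif left == "response_time":
        value = response_time
    elif left.startswith("header:"):
        value = headers.get(left[len("header:"):])
        if value is None:
            return False  # assumption: a missing header never matches
    else:
        raise ValueError(f"unknown operand: {left!r}")

    if op in ("=", "!="):
        hit = fnmatchcase(str(value), right)
        return hit if op == "=" else not hit

    threshold = parse_duration(right) if left == "response_time" else float(right)
    return float(value) < threshold if op == "<" else float(value) > threshold
```

For example, `matches("status=40*", 404, "", {}, 0.1)` is true in this sketch, while the same condition with status 500 is false.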

Actions

| Action | Description |
| :--- | :--- |
| should_retry | Override default retry behavior for HTTP errors (by default it retries on 403, 408, 429, 444, 499, >500) |
| force_retry | Always retry (up to `retries_hard` times) |
| dont_retry | Immediately stop retrying |

In the event of a temporary network error, `should_retry` is ignored (it will always retry unless `dont_retry` is set).

Note that having too many rules for one host might negatively impact performance (especially the `body` condition for large responses).
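The default retry codes and the three actions can be read as a small decision function. The Python sketch below is one possible interpretation of the table above, not the proxy's real logic: `default_should_retry` and `decide_retry` are hypothetical names, retry counters such as `retries_hard` are left out, and treating `should_retry` as "retry on this response" is an assumption.

```python
DEFAULT_RETRY_CODES = {403, 408, 429, 444, 499}


def default_should_retry(status):
    """Default behavior as listed: 403, 408, 429, 444, 499 and anything above 500."""
    return status in DEFAULT_RETRY_CODES or status > 500


def decide_retry(status, action=None, network_error=False):
    """Decide whether to retry, given the action of a matching rule (if any)."""
    if action == "dont_retry":
        return False  # always wins, even for temporary network errors
    if network_error:
        return True   # should_retry is ignored for temporary network errors
    if action in ("should_retry", "force_retry"):
        return True   # assumption: the rule asks for a retry on this response
    return default_should_retry(status)
```

With the sample rule `status=403` and `dont_retry`, `decide_retry(403, "dont_retry")` returns `False` even though 403 is retried by default.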

### Sample configuration

```json
{
  "addr": "localhost:5050",
  "timeout": "15s",
  "wait": "4s",
  "multiplier": 2.5,
  "retries": 3,
  "hosts": [
    {
      "host": "*",
      "every": "500ms",
      "burst": 25,
      "headers": {
        "User-Agent": "Some user agent for all requests",
        "X-Test": "Will be overwritten"
      }
    },
    {
      "host": "*.reddit.com",
      "every": "2s",
      "burst": 2,
      "headers": {
        "X-Test": "Will overwrite default"
      }
    },
    {
      "host": ".s3.amazonaws.com",
      "every": "2s",
      "burst": 30,
      "rules": [
        {"condition": "status=403", "action": "dont_retry"}
      ]
    }
  ]
}
```
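The `wait`, `multiplier` and `retries` settings control the exponential backoff. As a rough illustration, the Python sketch below computes the delay before each retry, assuming the first retry waits `wait` and each later retry multiplies the previous delay by `multiplier`. This formula is an assumption, not confirmed from the source, and `backoff_schedule` is a hypothetical name.

```python
def backoff_schedule(wait_seconds, multiplier, retries):
    """Return the assumed delay (in seconds) before each retry attempt."""
    return [wait_seconds * multiplier ** attempt for attempt in range(retries)]


# Values from the sample configuration: "wait": "4s", "multiplier": 2.5, "retries": 3
delays = backoff_schedule(4.0, 2.5, 3)
print(delays)
```

Under this assumption the sample configuration would wait 4, 10 and 25 seconds before the three retries.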
