{"id":28541349,"url":"https://github.com/archivebox/archivebox-proxy","last_synced_at":"2025-07-07T15:31:10.257Z","repository":{"id":235918972,"uuid":"747403676","full_name":"ArchiveBox/archivebox-proxy","owner":"ArchiveBox","description":"Official ArchiveBox MITM proxy: saves URLs of all requests passing through to an ArchiveBox server for archival. ","archived":false,"fork":false,"pushed_at":"2024-07-12T01:42:03.000Z","size":13,"stargazers_count":24,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-09T20:08:13.166Z","etag":null,"topics":["archivebox","digipres","digital-preservation","https-proxy","internet-archiving","mitmproxy","proxy","web-archiving","web-proxy"],"latest_commit_sha":null,"homepage":"https://github.com/ArchiveBox/archivebox-proxy","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveBox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-23T21:23:22.000Z","updated_at":"2025-06-05T12:17:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"75cdd2b2-d408-4b13-9e63-786bf0ac6246","html_url":"https://github.com/ArchiveBox/archivebox-proxy","commit_stats":null,"previous_names":["archivebox/archivebox-proxy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ArchiveBox/archivebox-proxy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Farchivebox-proxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Farchivebox-proxy/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Farchivebox-proxy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Farchivebox-proxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveBox","download_url":"https://codeload.github.com/ArchiveBox/archivebox-proxy/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Farchivebox-proxy/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264102921,"owners_count":23557848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archivebox","digipres","digital-preservation","https-proxy","internet-archiving","mitmproxy","proxy","web-archiving","web-proxy"],"created_at":"2025-06-09T20:08:10.214Z","updated_at":"2025-07-07T15:31:10.251Z","avatar_url":"https://github.com/ArchiveBox.png","language":"Python","readme":"# ArchiveBox Proxy\n\nA proxy that saves navigated URLs to [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox), implemented with [mitmproxy](https://github.com/mitmproxy/mitmproxy).\n\n*✨ Contributed by [Bruno Schroeder (@brunocek)](https://github.com/brunocek/archivebox-proxy)*\n\n---\nThis project is intended to address ArchiveBox issue 577: [Feature Request: Browser extension to submit either all history or certain URLs to a given ArchiveBox instance](https://github.com/ArchiveBox/ArchiveBox/issues/577).\n\nThe main challenge is supporting iOS, which does not allow Firefox extensions to be installed.\n\n\n## 
Installation\n\nPrerequisites:\n\n- ArchiveBox is installed and running.\n- The user can run `archivebox add \u003curl\u003e` on the machine where this project will run, and have it add the URL to the desired instance of ArchiveBox\n- python3 and pipenv are installed (if not, on Debian, run `sudo apt-get install python3 pipenv`)\n\n### Debian\n\n1. Clone this repository and place `mitmdump` in the same directory (e.g. `/home/user/archivebox-proxy`)\n\nTo install mitmdump, follow: https://docs.mitmproxy.org/stable/overview-installation/\n\n```\n# Use a dedicated variable: overwriting the shell's PATH would break git, wget, and tar\nPROXY_DIR=/home/user/archivebox-proxy\ngit clone https://codeberg.org/brunoschroeder/archivebox-proxy.git \"$PROXY_DIR\"\n\n# mitmdump installation as of 2024-01 - alternatives exist\ncd \"$PROXY_DIR\"\nwget https://downloads.mitmproxy.org/10.2.1/mitmproxy-10.2.1-linux-x86_64.tar.gz\ntar xf mitmproxy-10.2.1-linux-x86_64.tar.gz\n\n```\n\n2. Edit `config-archivebox-proxy.yaml` (e.g. with vim) and set `archivebox-path`\n\nTo switch from **record** mode to **archive** mode, change the `mode` parameter (see the **Modes** section below).\n\n3. In `$PROXY_DIR`, run:\n\n```\ncd \"$PROXY_DIR\"\npipenv shell\npipenv install\nexit\n\n```\n\n4. Edit `archivebox-proxy.service` (e.g. with vim) and set `User`, `Group`, and `WorkingDirectory`\n\n5. Run:\n\n```\nsudo ln -sv \"$PROXY_DIR/archivebox-proxy.service\" /etc/systemd/system/archivebox-proxy.service\nsudo systemctl enable archivebox-proxy\nsudo systemctl start archivebox-proxy\nsudo systemctl status archivebox-proxy\n# press q to leave the status pager\n\n```\n\n6. The **archivebox-proxy** should be up and running on port 8080. 
Test it by configuring a browser or device:\n\n\u003e HTTP Proxy: \u003cdebian-server-ip\u003e\n\u003e HTTP Port: 8080\n\u003e HTTPS Proxy: \u003cdebian-server-ip\u003e\n\u003e HTTPS Port: 8080\n\nHTTPS will not work yet (see the section below on TLS certificate installation). Check that HTTP works by navigating to an HTTP-only website, e.g.: `http://mitm.it`\n\n## Configuring an HTTPS Client\n\nThe TLS certificate must be installed on each client in order to proxy HTTPS traffic.\n\nThere are several ways to do this; please refer to: ( https://docs.mitmproxy.org/stable/concepts-certificates/ )\n\nThis solution does not involve a transparent proxy or services that would suffer from traffic going to certificate-pinning endpoints.\n\n## Modes\n\nThe config file ( `config-archivebox-proxy.yaml` ) holds a `mode` parameter that can be either **record** or **archive**.\n\nThe reason for two modes is explained in the section **Identifying User HTTP Requests - not trivial** below.\n\nIn **record mode**, archivebox-proxy records all navigation in the `record.yaml` file, and the user later needs to run `archivebox add record.yaml` manually. The user may edit the file with vim and remove the lines ( `dd` ) whose URLs should not be archived.\n\nIn **archive mode**, archivebox-proxy runs `archivebox add` on each of the identified URLs. Please read the section **Identifying User HTTP Requests - not trivial** below before using this mode.\n\n\n## Comments\n### Identifying User HTTP Requests - not trivial\n\nWhen developing this proxy, I came across research papers trying to solve the open problem of identifying user actions in an HTTP flow. It is not a trivial problem to solve, as the article cited below shows.\n\nSome of the evidence on HTTP flows at that time (2016):\n\n\u003e \"..a single request for the Huffington Post website results in the download of 408 objects from 113 unique domains. A similar analysis by Butkiewicz et al. 
[4], of 1,700 popular websites showed that the median landing page consists of at least 40 objects, requested from 10 or more servers, most of which are operated by third-party services.\"\n\n\u003e \"..Here, the pool of starting pages is randomly selected from the top-1,000 most popular webpages according to alexa.com, excluding HTTPS pages and Chinese websites (using non-Roman script). HTTPS pages were omitted to allow fair head-to-head comparison. On average, each trace of 500 page requests resulted in 29,506 HTTP requests, distributed over 14.168 connections.\"\n\n---\n\nAs of 2024, most traffic is HTTPS, but the problem remains.\n\nI implemented filters based on the authors' insights; they can be tuned by changing the float constants `__time_window_next` and `__reset_timer` in the script. (I may move them to the config file if users need to adjust them often.)\n\nAdditional filters can be used alongside:\n\n- mitmdump filter expressions, especially `'!~a'`, which filters out webpage assets. For more on these: ( https://docs.mitmproxy.org/stable/concepts-filters/ )\n- commercial VPNs such as IVPN and Mullvad, which filter adverts\n- projects such as [Pi-hole](https://github.com/pi-hole/pi-hole)\n- Firefox hardening such as [arkenfox](https://github.com/arkenfox/user.js)\n\nWith all these filters running, I still get many URLs that are not user actions. More research is needed here, and I am counting on your help in the issue tracker.\n\n---\n\nGeorgios Rizothanasis; Niklas Carlsson; Aniket Mahanti\nIdentifying User Actions from HTTP(S) Traffic\nIEEE, 2016\n( https://ieeexplore.ieee.org/document/7796839 )\n\n\n### History\n\n2024-01 Bruno Schroeder kick-starts the project, asks for contributions on the architectural decisions, and delivers a script for mitmproxy.\n\n### iOS alternative solution\n\nFor each tab:\n\n1. Hit share, and share it to iMarkdown or Obsidian\n1. Obsidian asks which file to append to - one may have one file per tag/subject\n1. 
iOS appends the URL there (but sometimes it appends the page title instead, and the work must be redone)\n1. Close the tab\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivebox%2Farchivebox-proxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchivebox%2Farchivebox-proxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivebox%2Farchivebox-proxy/lists"}