{"id":21035742,"url":"https://github.com/archiveteam/newsgrabber-warrior","last_synced_at":"2025-05-15T14:31:28.067Z","repository":{"id":113415518,"uuid":"88302747","full_name":"ArchiveTeam/NewsGrabber-Warrior","owner":"ArchiveTeam","description":null,"archived":false,"fork":false,"pushed_at":"2019-07-03T09:51:35.000Z","size":113,"stargazers_count":8,"open_issues_count":0,"forks_count":8,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-10-30T00:55:48.178Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveTeam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-04-14T20:54:47.000Z","updated_at":"2022-07-20T00:13:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"c973415e-6801-4c10-b12e-8609803e630a","html_url":"https://github.com/ArchiveTeam/NewsGrabber-Warrior","commit_stats":{"total_commits":137,"total_committers":9,"mean_commits":"15.222222222222221","dds":0.3211678832116789,"last_synced_commit":"3de01b0bde452afd2e95272a7dcf57a7080fd625"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FNewsGrabber-Warrior","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FNewsGrabber-Warrior/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FNewsGrabber-Warrior/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FNewsGrabber-Warrior/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveTeam","download_url":"https://codeload.github.com/ArchiveTeam/NewsGrabber-Warrior/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225357197,"owners_count":17461615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T13:16:00.627Z","updated_at":"2024-11-19T13:16:03.781Z","avatar_url":"https://github.com/ArchiveTeam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"NewsGrabber-Warrior\n=============\n\nMore information about the archiving project can be found on the ArchiveTeam wiki: [NewsGrabber](http://archiveteam.org/index.php?title=NewsGrabber)\n\nSetup instructions\n=========================\n\nThere are now several ways to run this; the preferred method is via the included Dockerfile.\n\nBe sure to replace `YOURNICKHERE` with the nickname that you want to be shown as on the tracker. You don't need to register it, just pick a nickname you like.\n\nIn most of the below cases (with the exception of docker by default), there will be a web interface running at http://localhost:8001/. If you don't know or care what this is, you can just ignore it; otherwise, it gives you a fancy view of what's going on.\n\n**If anything goes wrong while running the commands below, please scroll down to the bottom of this page. 
There's troubleshooting information there.**\n\nRunning with docker\n--------------------\n\n\u003cimg alt=\"Docker logo\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/79/Docker_%28container_engine%29_logo.png\" height=\"100px\"\u003e\n\nAssuming this is a standalone box (not part of a swarm, etc.), basic instructions for configuring your docker instance can be found in the [docker documentation](https://docs.docker.com/install/), or specifically for [Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/) / [Debian](https://docs.docker.com/install/linux/docker-ce/debian/).\n\nMake a directory, cd into it, and copy the included Dockerfile into it; the rest of the files are not required. Edit the final line to set the concurrency (the workload is CPU-bound due to the deduplication; 1.5 times the number of CPUs / vCPUs is recommended) and replace `UnknownDocker` with your username.\n\nBuild the image with the following arguments:\n\n    docker build -t \u003c\u003cdockername\u003e\u003e \u003c\u003cfoldername\u003e\u003e/\n\nfor example:\n\n    docker build -t newsgrabber-warrior newsgrabber-warrior/\n\nThen simply run the container with either:\n\n    docker run -d -it newsgrabber-warrior\n\nor, if you want to give it a known name and make it easier to run commands:\n\n    docker run -d -it --name newsgrabber-warrior newsgrabber-warrior\n\nor, if you really want that web page to be available:\n\n    docker run -d -it -p 8001:8001 --name newsgrabber-warrior newsgrabber-warrior\n\nStopping the container (clean):\n\n    docker exec \u003c\u003ccontainername\u003e\u003e touch STOP\n\nStopping the container (hard):\n\n    docker stop \u003c\u003ccontainername\u003e\u003e\n\nConnecting to the container console:\n\n    docker attach \u003c\u003ccontainername\u003e\u003e\n\nRunning with a warrior\n-------------------------\n\nFollow the [instructions on the ArchiveTeam wiki](http://archiveteam.org/index.php?title=Warrior) for installing the Warrior, and select the \"NewsGrabber\" 
project in the Warrior interface.\n\nRunning without a warrior\n-------------------------\nTo run this outside the warrior, clone this repository, cd into its directory and run:\n\n    pip install --upgrade seesaw\n\nthen start downloading with:\n\n    run-pipeline pipeline.py --concurrent 2 YOURNICKHERE\n\nFor more options, run:\n\n    run-pipeline --help\n\nIf you don't have root access and/or your version of pip is very old, you can replace \"pip install --upgrade seesaw\" with:\n\n    wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py ; python get-pip.py --user ; ~/.local/bin/pip install --user seesaw\n\nso that pip and seesaw are installed in your home, then run\n\n    ~/.local/bin/run-pipeline pipeline.py --concurrent 2 YOURNICKHERE\n\nRunning multiple instances on different IPs\n-------------------------------------------\n\nThis feature requires seesaw version 0.0.16 or greater. Use `pip install --upgrade seesaw` to upgrade.\n\nUse the `--context-value` argument to pass in `bind_address=123.4.5.6` (replace the IP address with your own).\n\nExample of running 2 threads, no web interface, and Wget binding of IP address:\n\n    run-pipeline pipeline.py --concurrent 2 YOURNICKHERE --disable-web-server --context-value bind_address=123.4.5.6\n\nDistribution-specific setup\n-------------------------\n### For Debian/Ubuntu:\n\n    adduser --system --group --shell /bin/bash archiveteam\n    apt-get update \u0026\u0026 apt-get install -y git-core libgnutls-dev screen python-dev python-pip bzip2 zlib1g-dev unzip\n    pip install --upgrade seesaw requests warcio dnspython\n    su -c \"cd /home/archiveteam; git clone https://github.com/ArchiveTeam/NewsGrabber-Warrior.git\" archiveteam\n    su -c \"cd /home/archiveteam/NewsGrabber-Warrior/; wget https://launchpad.net/wpull/trunk/v1.2.3/+download/wpull-1.2.3-linux-x86_64-3.4.3-20160302011013.zip; unzip wpull-1.2.3-linux-x86_64-3.4.3-20160302011013.zip; chmod +x ./wpull\" archiveteam\n    screen su -c \"cd 
/home/archiveteam/NewsGrabber-Warrior/; run-pipeline pipeline.py --concurrent 2 --address '127.0.0.1' YOURNICKHERE\" archiveteam\n    [... ctrl+A D to detach ...]\n\n### For CentOS:\n\nEnsure that you have the CentOS equivalent of bzip2 installed as well. You might need the EPEL repository to be enabled.\n\n    yum -y install gnutls-devel python-pip zlib-devel unzip\n    pip install --upgrade seesaw requests warcio dnspython\n    [... pretty much the same as above ...]\n\n### For openSUSE:\n\n    zypper install screen python-pip libgnutls-devel bzip2 python-devel gcc make unzip\n    pip install --upgrade seesaw requests warcio dnspython\n    [... pretty much the same as above ...]\n\n### For OS X:\n\nYou need Homebrew. Ensure that you have the OS X equivalent of bzip2 installed as well.\n\n    brew install python gnutls unzip\n    pip install --upgrade seesaw requests warcio dnspython\n    [... pretty much the same as above ...]\n\n**There is a known issue with some packaged versions of rsync. If you get errors during the upload stage, NewsGrabber-Warrior will not work with your rsync version.**\n\nThis supposedly fixes it:\n\n    alias rsync=/usr/local/bin/rsync\n\n### For Arch Linux:\n\nEnsure that you have the Arch equivalent of bzip2 installed as well.\n\n1. Make sure you have `python2-pip` installed.\n2. Run `pip2 install seesaw`.\n3. Modify the run-pipeline script in seesaw to point at `#!/usr/bin/python2` instead of `#!/usr/bin/python`.\n4. `useradd --system --group users --shell /bin/bash --create-home archiveteam`\n5. `su -c \"cd /home/archiveteam; git clone https://github.com/ArchiveTeam/NewsGrabber-Warrior.git\" archiveteam`\n6. `su -c \"cd /home/archiveteam/NewsGrabber-Warrior/; wget https://launchpad.net/wpull/trunk/v1.2.3/+download/wpull-1.2.3-linux-x86_64-3.4.3-20160302011013.zip; unzip wpull-1.2.3-linux-x86_64-3.4.3-20160302011013.zip; chmod +x ./wpull\" archiveteam`\n7. 
`screen su -c \"cd /home/archiveteam/NewsGrabber-Warrior/; run-pipeline pipeline.py --concurrent 2 --address '127.0.0.1' YOURNICKHERE\" archiveteam`\n\n### For FreeBSD:\n\nNo FreeBSD-specific steps are known. If you find that some are needed, please let us know on IRC (irc.efnet.org #archiveteam).\n\nTroubleshooting\n=========================\n\nBroken? These are some of the possible solutions:\n\n### Wpull not successfully running\n\nIf you have trouble getting Wpull running, please see http://wpull.readthedocs.org/en/master/install.html.\n\n### Problem with gnutls or openssl during building\n\nPlease ensure that gnutls-dev(el) and openssl-dev(el) are installed.\n\n### ImportError: No module named seesaw\n\nIf you're sure that you followed the steps to install `seesaw`, permissions on your module directory may be set incorrectly. Try the following:\n\n    chmod o+rX -R /usr/local/lib/python2.7/dist-packages\n\n### run-pipeline: command not found\n\nInstall `seesaw` using `pip2` instead of `pip`:\n\n    pip2 install seesaw\n\n### Issues in the code\n\nIf you notice a bug and want to file a bug report, please use the GitHub issues tracker.\n\nAre you a developer? Help write code for us! Look at our [developer documentation](http://archiveteam.org/index.php?title=Dev) for details.\n\n### Other problems\n\nHave an issue not listed here? Join us on IRC and ask! We can be found at irc.efnet.org #newsgrabber.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fnewsgrabber-warrior","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchiveteam%2Fnewsgrabber-warrior","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fnewsgrabber-warrior/lists"}