https://github.com/ArchiveTeam/wpull
Wget-compatible web downloader and crawler.
- Host: GitHub
- URL: https://github.com/ArchiveTeam/wpull
- Owner: ArchiveTeam
- License: gpl-3.0
- Created: 2013-12-07T13:03:15.000Z (about 11 years ago)
- Default Branch: develop
- Last Pushed: 2024-04-29T12:41:59.000Z (8 months ago)
- Last Synced: 2024-10-30T00:56:00.243Z (about 1 month ago)
- Language: HTML
- Homepage:
- Size: 3.92 MB
- Stars: 556
- Watchers: 23
- Forks: 77
- Open Issues: 198
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-digital-preservation - WPull - Wget-compatible web downloader and crawler. (Web archiving / Crawlers)
- awesome-datahoarding - wpull - Wget-compatible web downloader and crawler (Download utilities / General)
- awesome-datahoarder - wpull - Wget-compatible web downloader and crawler (Download utilities / General)
README
=====
Wpull
=====

Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web
downloader and crawler.

.. image:: https://raw.githubusercontent.com/chfoo/wpull/master/icon/wpull_logo_full.png
   :target: https://github.com/chfoo/wpull
   :alt: A dog pulling a box via a harness.

Notable Features:

* Written in Python: lightweight, modifiable, robust, & scriptable
* Graceful stopping; on-disk database resume
* PhantomJS & youtube-dl integration (experimental)

Install
=======

Wpull uses Python 3.

Once Python is installed, download Wpull from PyPI using pip::

    pip3 install wpull

For detailed installation instructions and potential caveats, please see
https://wpull.readthedocs.io/en/master/install.html.

Example Commands
================

To download the About page of Google.com::

    wpull google.com/about

To archive a website::

    wpull billy.blogsite.example \
        --warc-file blogsite-billy \
        --no-check-certificate \
        --no-robots --user-agent "InconspiuousWebBrowser/1.0" \
        --wait 0.5 --random-wait --waitretry 600 \
        --page-requisites --recursive --level inf \
        --span-hosts-allow linked-pages,page-requisites \
        --escaped-fragment --strip-session-id \
        --sitemaps \
        --reject-regex "/login\.php" \
        --tries 3 --retry-connrefused --retry-dns-error \
        --timeout 60 --session-timeout 21600 \
        --delete-after --database blogsite-billy.db \
        --quiet --output-file blogsite-billy.log

To see all options::

    wpull --help

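
If you run the archive command from a script, it can help to build the argument list in one place so the WARC, database, and log files share a job name. The following is only a minimal sketch of that idea, not part of Wpull itself: the function name ``build_wpull_args`` and its parameters ``start_url`` and ``job_name`` are hypothetical, and every flag is copied unchanged from the archive example above.

```python
import shlex


def build_wpull_args(start_url, job_name):
    """Return the argument list for the archive command in the README.

    `start_url` and `job_name` are hypothetical parameter names used for
    illustration; the flags and their values come from the README example.
    """
    return [
        "wpull", start_url,
        "--warc-file", job_name,
        "--no-check-certificate",
        "--no-robots", "--user-agent", "InconspiuousWebBrowser/1.0",
        "--wait", "0.5", "--random-wait", "--waitretry", "600",
        "--page-requisites", "--recursive", "--level", "inf",
        "--span-hosts-allow", "linked-pages,page-requisites",
        "--escaped-fragment", "--strip-session-id",
        "--sitemaps",
        "--reject-regex", r"/login\.php",
        "--tries", "3", "--retry-connrefused", "--retry-dns-error",
        "--timeout", "60", "--session-timeout", "21600",
        # The database and log files reuse the job name, as in the README.
        "--delete-after", "--database", job_name + ".db",
        "--quiet", "--output-file", job_name + ".log",
    ]


# Build the list for the README's example site and print the equivalent
# shell command (shlex.join requires Python 3.8 or later).
args = build_wpull_args("billy.blogsite.example", "blogsite-billy")
print(shlex.join(args))
```

The resulting list can be passed to ``subprocess.run(args, check=True)`` once Wpull is installed. Passing a list rather than a single string avoids a second round of shell quoting for values such as the ``--reject-regex`` pattern.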
Documentation
=============

Documentation is located at https://wpull.readthedocs.io/. Please have
a look at it before using Wpull's advanced features.

Help
====

Need help? Please see our Help page, which contains
frequently asked questions and support information.

The issue tracker is located at https://github.com/chfoo/wpull/issues.

Dev
===

.. image:: https://travis-ci.org/ArchiveTeam/wpull.png
   :target: https://travis-ci.org/ArchiveTeam/wpull
   :alt: Travis CI build status

.. image:: https://coveralls.io/repos/chfoo/wpull/badge.png
   :target: https://coveralls.io/r/chfoo/wpull
   :alt: Coveralls report

Contributions and feedback are greatly appreciated.

Credits
=======

Copyright 2013-2016 by Christopher Foo and others. License GPL v3.

This project contains third-party source code licensed under different terms:

* wpull.backport.logging
* wpull.thirdparty.robotexclusionrulesparser
* wpull.thirdparty.dammit

We would like to acknowledge the authors of GNU Wget as Wpull uses algorithms
from Wget.