{"id":26037648,"url":"https://github.com/willf/dl","last_synced_at":"2026-04-16T20:38:01.050Z","repository":{"id":279257178,"uuid":"938210335","full_name":"willf/dl","owner":"willf","description":"Bulk downloader","archived":false,"fork":false,"pushed_at":"2025-02-26T21:29:58.000Z","size":267,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-06T11:06:50.526Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/willf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-24T15:41:16.000Z","updated_at":"2025-02-26T21:30:01.000Z","dependencies_parsed_at":"2025-02-24T17:06:05.897Z","dependency_job_id":"dbc83398-1092-43d1-b77a-3349d80971aa","html_url":"https://github.com/willf/dl","commit_stats":null,"previous_names":["willf/dl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fdl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fdl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fdl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fdl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/willf","download_url":"https://codeload.github.com/willf/dl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count
":242357887,"owners_count":20114890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-07T08:29:15.972Z","updated_at":"2026-04-16T20:37:56.005Z","avatar_url":"https://github.com/willf.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dl: A Bulk Downloader with Grace and Tenacity\n\nThis script attempts to bulk download a list of URLs with some tenacity, but also\nsome grace. It tries to honor the server's rate limiting and retries\non various failures.\n\n## Understanding the use case\n\nThis was originally written to download approximately 226,000 URLs\nfrom a website that allows only 1000 requests per hour. Well, that's\ntoo many hours (about 226 of them), but a subset of only 59,000 URLs\ncould be collected instead. HTTP defines a 429 response code to indicate\nthat too many requests are being made, and the server may return a\n`Retry-After` header to indicate how long to wait before retrying.\nThese are web standards; some web servers also return headers\nthat indicate:\n\n1. The maximum number of requests allowed per hour\n2. The number of requests remaining in the current hour\n3. The time when the rate limit will reset\n\nDifferent servers call these by different names. The rate limit header usually\nlooks something like `RateLimit-Limit`. The remaining requests header usually looks like\n`RateLimit-Remaining`. The reset time header usually looks like `RateLimit-Reset`,\nalthough the reset might be a duration (the rate limit will reset in so many\nseconds) or a point in time (the rate limit will reset at this time). 
Servers\nare free to do what they will, of course, and none of these are exactly\npromises.\n\nOf course, there are lots of other things that can go wrong.\n\nThis script attempts to not go beyond the server's declared rate limits,\nand to pay attention to the other headers if they are present. It also tries\nto be tenacious in the face of failures, and tries a number of times to\ndownload a file before giving up, using exponential backoff.\n\n## Installation\n\nThis script requires Python 3.8 or later, and `uv`. This\nis not a package yet, so you have to clone the repository and run it\nfrom the repo directory.\n\nYou can install it with:\n\n```bash\n$ git clone git@github.com:willf/dl.git\n$ cd dl\n$ uv sync\n```\n\n## Usage\n\n```bash\n$ uv run dl.py --help\nUsage: dl.py [OPTIONS]\n\nUsage: dl.py [OPTIONS]\n\nOptions:\n  --url-file TEXT            Path to a file containing URLs (defaults to\n                             stdin).\n  --download-dir PATH        Directory to save downloads.  [default: download]\n  --prefixes-to-remove TEXT  Prefixes to remove from the URL path when saving\n                             the file.\n  --auto-remove-prefix       Remove the longest common prefix from the URL\n                             paths\n  --regex TEXT               Regular expression to match URLs to download.\n  --reverse                  Reverse the regex match, i.e., download URLs that\n                             do not match the regex.\n  --randomize                Randomize the order of the URLs\n  --log-file FILE            Path to a file to log output.\n  --log-level TEXT           Logging level.  
[default: INFO]\n  --max-tries INTEGER        Maximum number of retries on request failures\n                             [default: 10]\n  --version                  Show the version and exit.\n  --dry-run                  If set, do not actually download the files, just\n                             log what would be done.\n  --help                     Show this message and exit.\n\n```\n\nExamples:\n\n```bash\n$ uv run dl.py --url-file urls.txt --download-dir downloads --auto-remove-prefix\n$ cat urls.txt | uv run dl.py --download-dir downloads --max-tries 5\n```\n\n## Explanation of the options\n\n- `--url-file`: Path to a file containing URLs. If omitted, URLs are\n  read from stdin. Each line in the file should contain a single URL. Blank lines\n  and lines starting with `#` are ignored.\n- `--download-dir`: Directory to save downloads. If not specified, it defaults to `download` in the current directory.\n- `--prefixes-to-remove`: This can be used multiple times to specify\n  prefixes to remove from the URL path when saving the file. For example,\n  if the URL is `https://example.com/foo/bar/baz.txt`, and you specify\n  `--prefixes-to-remove foo`, then the file will\n  be saved as `bar/baz.txt`. This is useful if you want to save the files\n  in a directory structure that is shallower than the URL path.\n- `--auto-remove-prefix`: If set, the script will automatically remove the longest common prefix from the URL paths when saving the file. For example, if the URLs are all under `https://example.com/foo/`, then the files will be saved in the `foo` directory.\n- `--regex`: Regular expression to match URLs to download. If specified,\n  only URLs that match the regex will be downloaded. This is useful if\n  you want to download a subset of the URLs in the file.\n- `--reverse`: If set, the script will download URLs that do _not_ match\n  the regex. 
This is useful if you want to download all URLs except\n  a subset.\n- `--randomize`: If set, the script will randomize the order of the URLs before downloading them. You might want to do this in order to collect a random set of results before the server becomes unavailable.\n- `--log-file`: Path to a file to log output. If not specified, the log will only be printed to the console.\n- `--log-level`: Logging level. This can be one of `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. The default is `INFO`.\n- `--max-tries`: Maximum number of retries on request failures. The default is 10 (all told, this is about half an hour of waiting if everything fails). The max is around 20 (around 83 weeks total, hehe).\n- `--dry-run`: If set, the script will not actually download the files, but will log what would be done. This is useful if you want to see what the script would do without actually downloading the files.\n- `--version`: Show the version and exit.\n- `--help`: Show this message and exit.\n\nNote that this is a single-threaded script; it is most useful when rate limits are likely to be present.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillf%2Fdl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwillf%2Fdl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillf%2Fdl/lists"}