https://github.com/threatpatrols/hibp-downloader
Efficiently download new pwned password hashes from api.pwnedpasswords.com fast
- Host: GitHub
- URL: https://github.com/threatpatrols/hibp-downloader
- Owner: threatpatrols
- License: bsd-3-clause
- Created: 2023-07-30T12:18:26.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-18T12:34:35.000Z (7 months ago)
- Last Synced: 2025-05-31T19:03:06.736Z (5 months ago)
- Topics: haveibeenpwned, haveibeenpwned-downloader, hibp, hibp-downloader, ntlm, sha1
- Language: Python
- Homepage: https://threatpatrols.github.io/hibp-downloader/
- Size: 1.42 MB
- Stars: 23
- Watchers: 2
- Forks: 3
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# hibp-downloader
This is a CLI tool to efficiently download a local copy of the pwned password hash data from the very awesome
[HIBP](https://haveibeenpwned.com/Passwords) pwned passwords [api-endpoint](https://api.pwnedpasswords.com) using all the good bits:
multiprocessing, async-processes, local-caching, content-etags and http2-connection pooling to probably make things
as fast as is Pythonly possible.

## Features
- Interface to directly `query` for compromised password values from the *compressed* file data-store!
- Download and store acquired data gzip-compressed to save on storage and speed up queries.
- Download the full dataset in under 45 mins (generally CPU bound)
- Easily resume interrupted `download` operations into a `--data-path` without re-clobbering api-source.
- Only download hash-prefix content blocks when the source content has changed (via content ETag values), making it
easy to periodically sync-up when needed.
- Query interface performance is efficient enough to attach a user web-service with reasonable loads (i.e. don't waste
your own resources decompressing the dataset and storing it in a database!)
- Ability to generate a single text file with in-order pwned password hash values, similar to [PwnedPasswordsDownloader](https://github.com/HaveIBeenPwned/PwnedPasswordsDownloader) from
the awesome HIBP team.
- Per prefix file metadata in JSON format for easy data reuse by other tooling if required.

## Install
```commandline
pipx install hibp-downloader
```

## Usage (download)
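
As a rough illustration of the ETag-based syncing listed under Features, the sketch below shows what a conditional request for a single hash-prefix block could look like using only the Python standard library. It is not the tool's own implementation: the function names `build_range_request` and `fetch_prefix`, and the idea of passing a previously stored ETag as a plain string, are assumptions made for this example. The `range/<prefix>` URL shape comes from the public Pwned Passwords API, and `If-None-Match` with a `304 Not Modified` reply is standard HTTP behaviour.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple

API_BASE = "https://api.pwnedpasswords.com/range/"
HEX_DIGITS = "0123456789abcdefABCDEF"


def build_range_request(prefix: str, etag: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for one 5-character hex hash-prefix (hypothetical helper)."""
    if len(prefix) != 5 or any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("prefix must be exactly 5 hex characters")
    # Normalise the prefix to uppercase hex before building the URL.
    request = urllib.request.Request(API_BASE + prefix.upper())
    if etag:
        # Ask the server to answer 304 if our stored copy is still current.
        request.add_header("If-None-Match", etag)
    return request


def fetch_prefix(prefix: str, etag: Optional[str] = None) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_etag), or (None, etag) when the local copy is still current."""
    try:
        with urllib.request.urlopen(build_range_request(prefix, etag), timeout=30) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the local data as-is
            return None, etag
        raise
```

A caller would store the returned ETag next to the downloaded block and pass it back on the next sync, so unchanged prefixes cost one small request instead of a full download.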
## Performance
Sample download activity log from a host with 32 cores on a 500 Mbit/s connection.
```text
...
2024-05-16T10:18:01-0400 | INFO | hibp-downloader | prefix=f80c7 source=[lc:13616 et:3 rc:1002358 ro:25 xx:1] processed=[17836.6MB ~414462H/s] api=[918req/s 17597.4MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f81af source=[lc:13616 et:3 rc:1002558 ro:25 xx:1] processed=[17840.1MB ~414454H/s] api=[918req/s 17600.9MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f826f source=[lc:13616 et:3 rc:1002758 ro:25 xx:1] processed=[17843.6MB ~414454H/s] api=[918req/s 17604.4MB] runtime=36.4min
2024-05-16T10:18:03-0400 | INFO | hibp-downloader | prefix=f833f source=[lc:13616 et:3 rc:1002958 ro:25 xx:1] processed=[17847.1MB ~414450H/s] api=[918req/s 17607.9MB] runtime=36.4min
```

- 918 requests per second to `api.pwnedpasswords.com`
- Log sources are shorthand:
  - `lc`: 13616 from local-cache (lc) - request-responses handled locally without hitting the network.
  - `et`: 3 etag-matched (et) - request-responses that confirmed our local data was up-to-date and did not require a new download.
  - `rc`: 1002958 from remote-cache (rc) - request-responses that were downloaded to local, but came from the remote-server cache.
  - `ro`: 25 from remote-origin (ro) - request-responses that were downloaded to local, and had to be fetched from the remote origin source.
  - `xx`: 1 failed response (xx) - request-responses that failed (and were successfully retried).
- ~17GB downloaded in ~36 minutes (full dataset)
- Approximately 414k hash values received per second
- Processing in this example appears to be CPU bound, with measured traffic around 160 Mbit/s.

## Usage (query)
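
As a rough sketch of how a lookup against gzip-compressed prefix data can work without unpacking the whole dataset: hash the candidate password with SHA-1, use the first five hex characters to pick the prefix block, then scan only that block for the remaining 35 characters. The `SUFFIX:COUNT` line format is that of the Pwned Passwords range API. The function names `split_sha1` and `count_in_block`, and the assumption that a block is available as gzip'd bytes, are hypothetical and for illustration only; the tool's actual on-disk layout and `query` implementation may differ.

```python
import gzip
import hashlib
from typing import Tuple


def split_sha1(password: str) -> Tuple[str, str]:
    """Return the (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_block(gz_block: bytes, suffix: str) -> int:
    """Scan one gzip-compressed block of SUFFIX:COUNT lines; 0 means not found."""
    wanted = suffix.upper()
    for line in gzip.decompress(gz_block).decode("ascii").splitlines():
        found, _, count = line.partition(":")
        if found.upper() == wanted:
            return int(count)
    return 0
```

Because each lookup decompresses a single prefix block rather than the full dataset, the work per query stays small, which is the property the Features list relies on for serving queries directly from compressed storage.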
## Project
- Docs - [threatpatrols.github.io/hibp-downloader](https://threatpatrols.github.io/hibp-downloader)
- PyPI - [pypi.org/project/hibp-downloader/](https://pypi.org/project/hibp-downloader/)
- GitHub - [github.com/threatpatrols/hibp-downloader](https://github.com/threatpatrols/hibp-downloader)