https://github.com/philipperemy/3.7-billion-passwords-tools
Tools to manipulate the data behind Collection #1 (and #2–5) - AntiPublic.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/philipperemy/3.7-billion-passwords-tools
- Owner: philipperemy
- Created: 2020-03-23T12:07:04.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-04-21T03:07:43.000Z (over 4 years ago)
- Last Synced: 2024-10-03T12:19:45.228Z (3 months ago)
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 51
- Watchers: 1
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
Awesome Lists containing this project
- awesome-hacking-lists - philipperemy/3.7-billion-passwords-tools - Tools to manipulate the data behind Collection #1 (and #2–5) - AntiPublic. (Python)
README
# Collections 1, Collections 2-5 and AntiPublic Parser
Command-line tools to manipulate the data from these multi-billion-password collections.
Full processing takes a couple of days and generates a file structure that can be queried in close to O(1) time.
```
$ [email protected]
[email protected]:toto123
```

The total number of unique records in the final dataset (Collection 1 to 5 + AntiPublic + Breach Compilation) is around 3.72 billion (_3,372,591,561_ to be precise).
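
The near-constant-time lookup comes from the directory layout: a query only has to open one small file instead of scanning billions of records. The sketch below illustrates the idea, assuming each record is filed under the first three characters of the email (for example `j/o/h`). The function names, the `symbols` bucket, and the depth of 3 are assumptions for illustration, not taken from the tool's source.

```python
from pathlib import Path
from typing import Optional


def shard_path(root: Path, email: str, depth: int = 3) -> Path:
    """Map an email to a shard file such as root/j/o/h (hypothetical layout)."""
    key = email.strip().lower()
    # Non-alphanumeric characters go to a catch-all bucket so that every
    # character maps to a safe directory or file name.
    parts = [ch if ch.isalnum() else "symbols" for ch in key[:depth]]
    return root.joinpath(*parts)


def lookup(root: Path, email: str) -> Optional[str]:
    """Return the first password stored for email, or None if it is absent."""
    key = email.strip().lower()
    path = shard_path(root, key)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Records are stored as email:password; split on the first colon only.
            stored, _, password = line.rstrip("\n").partition(":")
            if stored.lower() == key:
                return password
    return None
```

Because the shard is chosen from the email alone, the cost of a query depends on the size of one shard file rather than on the size of the whole dataset.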
## Set Up
Create a virtual environment and install the package.
```
virtualenv -p python3 venv
source venv/bin/activate
make install
```

## Extraction
https://superuser.com/questions/1308374/how-to-extract-all-tar-gz-files-present-in-multiple-folders-at-a-time-to-another
```bash
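# -execdir runs tar inside each archive's directory, so `extracted` must exist there.
find "$1" -name '*.tar.gz' -execdir tar -xzvf '{}' -C extracted \;
# -o+ makes unrar overwrite existing files without prompting.
find . -name "*.rar" -exec unrar x -o+ {} \;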
```

## Processing
Processing Collection 1 is much faster than processing Collections 2-5. The estimates below are for Collections 2-5.
Parsing took around 20 hours on my server (i7-8700K CPU, 32 GB of memory). I didn't have an SSD large enough to hold all the temporary data, so everything
was done on a standard HDD. A faster disk should speed things up. The sorting and duplicate-removal step took 15 hours in total.
Splitting into the smaller files (this file structure is what makes every query almost instantaneous) took a couple of hours at most.
In total, expect around 2 days to process Collections 2-5.
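
For intuition about why the sorting and duplicate-removal step needs so much time and scratch space, here is a minimal Python sketch of an external sort: sort chunks that fit in memory, spill each one to a temporary file, then merge the sorted runs while dropping repeats. It only illustrates the general technique; `external_sort_unique`, `_spill`, and `run_size` are hypothetical names, and a dedicated sort utility will be far faster on data of this size.

```python
import heapq
import tempfile
from itertools import islice
from typing import Iterable, Iterator, List


def _spill(sorted_lines: List[str], tmp_dir: str):
    """Write one sorted run to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile("w+", dir=tmp_dir, encoding="utf-8")
    run.writelines(sorted_lines)
    run.seek(0)
    return run


def external_sort_unique(
    lines: Iterable[str], tmp_dir: str, run_size: int = 1_000_000
) -> Iterator[str]:
    """Yield newline-terminated lines in sorted order, without duplicates."""
    iterator = iter(lines)
    runs = []
    try:
        while True:
            # Normalize line endings, then sort and deduplicate one in-memory chunk.
            chunk = sorted(
                {line.rstrip("\n") + "\n" for line in islice(iterator, run_size)}
            )
            if not chunk:
                break
            runs.append(_spill(chunk, tmp_dir))
        previous = None
        # heapq.merge streams the runs lazily, so memory use stays bounded.
        for line in heapq.merge(*runs):
            if line != previous:
                yield line
                previous = line
    finally:
        for run in runs:
            run.close()
```

Every input record is written to the temporary directory once, which is why the scratch space has to be of the same order as the input itself. This sketch also keeps one open file per run, so `run_size` would need to be large enough to stay under the operating system's file-descriptor limit.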
```bash
breach parse --path /path/to/extracted --success_file success.txt --failure_file failure.txt --cython_acceleration
rm -rf tmp && mkdir tmp # sort needs roughly 750GB of scratch space in tmp/. By default /tmp/ is not enough for this!
cat success.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > success_sorted.txt
breach split --file success_sorted.txt --out data
```

## Converting the format of BreachCompilation to the new format
The dataset is available here: https://github.com/philipperemy/tensorflow-1.4-billion-password-analysis
It's easy to convert the large `BreachCompilation` dataset to this format by running the commands below.
Expect them to take some time (less than a day).
```bash
find /path/to/BreachCompilation/ -type f -exec cat {} + > breach_compilation.txt
rm -rf tmp && mkdir tmp # By default /tmp/ is not enough for this!
cat breach_compilation.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > breach_compilation_sorted.txt
breach split --file breach_compilation_sorted.txt --out data_breach_compilation_sorted
```

From there, a simple `breach merge` is enough to merge it into Collection 1 and Collections 2-5.
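
Conceptually, merging two datasets that are already sorted and free of duplicates is a single linear pass. The sketch below shows that idea for two streams of `email:password` records. It is an illustration only, not the implementation of `breach merge`; the name `merge_sorted_unique` is hypothetical, and it assumes both inputs are sorted in Python's string order (for example, produced with `LC_ALL=C sort`).

```python
import heapq
from typing import Iterable, Iterator


def merge_sorted_unique(src: Iterable[str], dest: Iterable[str]) -> Iterator[str]:
    """Yield the sorted union of two sorted record streams, without duplicates."""
    previous = None
    # heapq.merge consumes both inputs lazily and keeps the output sorted.
    for record in heapq.merge(src, dest):
        if record != previous:
            yield record
            previous = record
```

For example, `list(merge_sorted_unique(["a:1", "b:2"], ["a:1", "c:3"]))` gives `["a:1", "b:2", "c:3"]`.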
## Merging all the datasets together
Run Collection 1 and Collections 2-5 through the processing step described above.
You will have two directories: `/path/to/collections1_data` and `/path/to/collections2_5_data`.
If you also have the BreachCompilation dataset, you will have a third directory, `/path/to/data_breach_compilation_sorted`, generated by the conversion step above.
The merge is destructive, so it is safer to copy one dataset to a new output directory first and then merge each of the others into that copy.
```
cp -rf /path/to/collections1_data /path/to/big_dataset
breach merge --src /path/to/collections2_5_data --dest /path/to/big_dataset
breach merge --src /path/to/data_breach_compilation_sorted --dest /path/to/big_dataset
```

## Usage
The manual of the command line tool can be fetched by running `breach dumphelp`.
```
Usage: cli [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug
  --help                Show this message and exit.

Commands:
  chunk     chunk large TXT files into smaller files.
  clean     Cleans a query friendly folder PATH. Move incorrect records and
            sort the files.
  dumphelp
  evaluate  Evaluates some metrics such as precision/recall (e.g. is OLD into
            NEW).
  merge     Merges dataset SRC into dataset DEST.
  parse     Parses an unstructured folder PATH of many files and generates two
            files: SUCCESS_FILE and FAILURE_FILE. All valid email:password
            will go to SUCCESS_FILE.
  sort      Sorts a query friendly folder PATH. Target is itself.
  split     Converts a large FILE to a query friendly folder OUT (e.g. a/b/c).
            Use RESTART_FROM to resume from the i-th line.
  test      Infers passwords of a list of emails defined in FILE with a query
            friendly folder DATASET.

Usage: cli dumphelp [OPTIONS]

Options:
  --help  Show this message and exit.

Usage: cli split [OPTIONS]

Options:
  --file FILE             [required]
  --out DIRECTORY         [required]
  --restart_from INTEGER  [default: 0]
  --help                  Show this message and exit.

Usage: cli chunk [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --size INTEGER    [default: 50]
  --help            Show this message and exit.

Usage: cli sort [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli clean [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli test [OPTIONS]

Options:
  --file FILE                     [required]
  --dataset [breach_compilation|collections_1|collections_2_5|all]
                                  [required]
  --help                          Show this message and exit.

Usage: cli parse [OPTIONS]

Options:
  --path DIRECTORY                [required]
  --success_file FILE             [required]
  --failure_file FILE             [required]
  --cython_acceleration / --no-cython_acceleration
  --help                          Show this message and exit.

Usage: cli merge [OPTIONS]

Options:
  --src DIRECTORY   [required]
  --dest DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli evaluate [OPTIONS]

Options:
  --old DIRECTORY  [required]
  --new DIRECTORY  [required]
  --help           Show this message and exit.
```
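
The `parse` command in the help text above separates valid `email:password` records from everything else. The sketch below shows one way such a classification could look. It is not the tool's actual parser: the accepted separators (`:` and `;`), the loose email pattern, and the names `parse_record` and `classify` are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Deliberately loose pattern: something@something.tld with no whitespace
# or separator characters. Real-world validation may need to be stricter.
EMAIL_RE = re.compile(r"^[^@\s:;]+@[^@\s:;]+\.[^@\s:;]+$")


def parse_record(line: str) -> Optional[Tuple[str, str]]:
    """Return (email, password) for a valid record, else None."""
    line = line.strip()
    # Assumption: raw dumps use ':' or ';' between email and password.
    match = re.search(r"[:;]", line)
    if match is None:
        return None
    email, password = line[: match.start()], line[match.end():]
    if not password or not EMAIL_RE.match(email):
        return None
    return email, password


def classify(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split lines into normalized email:password successes and raw failures."""
    successes, failures = [], []
    for line in lines:
        record = parse_record(line)
        if record is None:
            failures.append(line.rstrip("\n"))
        else:
            successes.append(f"{record[0]}:{record[1]}")
    return successes, failures
```

Splitting on the first separator only means a password may itself contain `:` or `;` without being truncated.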