https://github.com/philipperemy/3.7-billion-passwords-tools

Tools to manipulate the data behind Collection #1 (and #2–5) - AntiPublic.

# Collections 1, Collections 2-5 and AntiPublic Parser

Command line tools to manipulate the data from those multi-billion passwords collections.

Full processing takes a couple of days and generates a file structure that can be queried in almost O(1) time.

```
$ [email protected]
[email protected]:toto123
```
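To illustrate why such a lookup is close to constant time, here is a minimal sketch. It assumes a hypothetical layout in which the first three characters of the email select a shard file (for example `data/t/e/s`) holding `email:password` lines. The function names, the depth of three and the `_` padding are illustrative assumptions, not the tool's actual implementation.

```python
from pathlib import Path

DEPTH = 3  # assumption: three leading characters select the shard


def shard_path(root, email, depth=DEPTH):
    """Map an email to a shard file such as root/t/e/s (hypothetical layout)."""
    # Pad short keys so every shard sits at the same depth in the tree.
    key = email.lower()[:depth].ljust(depth, "_")
    # Replace anything non-alphanumeric to keep the path filesystem-safe.
    parts = [c if c.isalnum() else "_" for c in key]
    return Path(root).joinpath(*parts)


def lookup(root, email):
    """Return all passwords stored for `email`, or an empty list."""
    path = shard_path(root, email)
    if not path.is_file():
        return []
    wanted = email.lower()
    passwords = []
    with path.open(encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Split on the first ':' only, since passwords may contain ':'.
            address, sep, password = line.rstrip("\n").partition(":")
            if sep and address.lower() == wanted:
                passwords.append(password)
    return passwords
```

Locating the shard costs a fixed number of directory hops regardless of dataset size, and only that one small file is scanned. This is where the near-constant query time would come from under these assumptions.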

The final dataset (Collections 1 to 5 + AntiPublic + Breach Compilation) contains around 3.37 billion unique records (_3,372,591,561_ to be precise).

## Set Up

Create a virtual environment and install the package.

```bash
virtualenv -p python3 venv
source venv/bin/activate
make install
```

## Extraction

The commands below are adapted from this Super User answer: https://superuser.com/questions/1308374/how-to-extract-all-tar-gz-files-present-in-multiple-folders-at-a-time-to-another

```bash
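# Extract every .tar.gz under $1; tar runs next to each archive, so an `extracted` folder must exist there.
find "$1" -name '*.tar.gz' -execdir tar -xzvf '{}' -C extracted \;
# Extract every .rar under the current directory, overwriting existing files (-o+).
find . -name "*.rar" -exec unrar x -o+ {} \;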
```

## Processing

Processing Collection 1 is much faster than processing Collections 2-5. The estimates below are for Collections 2-5.

Parsing took around 20 hours on my server (i7-8700K CPU, 32GB of memory). I didn't have an SSD large enough to hold all the temporary files, so everything
was done on a standard HDD. A faster disk should speed up processing.

The sorting and deduplication step took 15 hours in total.

Splitting into smaller files (this file structure makes every query almost instantaneous) took a couple of hours at most.

In total, expect around 2 days to process Collections 2-5.

```bash
breach parse --path /path/to/extracted --success_file success.txt --failure_file failure.txt --cython_acceleration
rm -rf tmp && mkdir tmp # sort needs about 750GB of scratch space in tmp/. By default /tmp/ is not enough for this!
cat success.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > success_sorted.txt
breach split --file success_sorted.txt --out data
```
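The splitting step can be pictured with the following minimal sketch. It assumes a hypothetical layout where the first three characters of each email pick the shard file (the `a/b/c` style folder that `breach split` describes). The function names and the padding rule are illustrative and not taken from the tool's source.

```python
from pathlib import Path


def shard_path(root, email, depth=3):
    """Map an email to a shard file such as root/a/l/i (hypothetical layout)."""
    key = email.lower()[:depth].ljust(depth, "_")
    parts = [c if c.isalnum() else "_" for c in key]
    return Path(root).joinpath(*parts)


def split_sorted_file(src, out, depth=3):
    """Stream `src` and append each record to its shard; return records written."""
    written = 0
    current_path, current_fh = None, None
    try:
        with open(src, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                email, sep, _ = line.partition(":")
                if not sep or not email:
                    continue  # skip lines that are not email:password
                path = shard_path(out, email, depth)
                if path != current_path:
                    # The input is sorted, so shards change rarely and a
                    # single open file handle is enough.
                    if current_fh is not None:
                        current_fh.close()
                    path.parent.mkdir(parents=True, exist_ok=True)
                    current_fh = path.open("a", encoding="utf-8")
                    current_path = path
                current_fh.write(line if line.endswith("\n") else line + "\n")
                written += 1
    finally:
        if current_fh is not None:
            current_fh.close()
    return written
```

Because the input is already sorted, consecutive records almost always land in the same shard, so the sketch streams the file once with constant memory.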

## Converting the format of BreachCompilation to the new format

The dataset is available here: https://github.com/philipperemy/tensorflow-1.4-billion-password-analysis

It's easy to convert the large `BreachCompilation` dataset to this format by running the commands below.

Expect these commands to take some time (less than a day).

```bash
find /path/to/BreachCompilation/ -type f -exec cat {} + > breach_compilation.txt
rm -rf tmp && mkdir tmp # By default /tmp/ is not enough for this!
cat breach_compilation.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > breach_compilation_sorted.txt
breach split --file breach_compilation_sorted.txt --out data_breach_compilation_sorted
```
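The `sort -T tmp -u` call is what makes deduplication feasible on files far larger than memory: it sorts chunks, spills them to the scratch directory and merges them. The sketch below shows the same idea in Python at a small scale. It is an illustration of external merge sort, not what GNU `sort` or the `breach` tool actually runs, and the function name and `chunk_lines` parameter are assumptions.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort_unique(src, dst, tmp_dir, chunk_lines=1_000_000):
    """Sort the lines of `src` into `dst` without duplicates; return the line count."""
    runs = []
    with open(src, encoding="utf-8", errors="replace") as fh:
        while True:
            chunk = list(islice(fh, chunk_lines))
            if not chunk:
                break
            # Normalise line endings, dedupe inside the chunk, then sort it.
            chunk = sorted({line.rstrip("\n") + "\n" for line in chunk})
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            runs.append(path)

    unique = 0
    handles = [open(p, encoding="utf-8") for p in runs]
    try:
        with open(dst, "w", encoding="utf-8") as out:
            previous = None
            # heapq.merge lazily merges the already-sorted runs.
            for line in heapq.merge(*handles):
                if line != previous:
                    out.write(line)
                    previous = line
                    unique += 1
    finally:
        for handle in handles:
            handle.close()
        for path in runs:
            os.remove(path)
    return unique
```

Note that Python compares strings by code point, whereas `sort` follows the current locale, so the two orderings can differ. The sketch also opens one file per run, which a real implementation would cap with multi-pass merging.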

From there, a simple `breach merge` is enough to merge it into Collection 1 and Collections 2-5.

## Merging all the datasets together

Run Collection 1 and Collections 2-5 through the processing step described above.

You will then have two directories: `/path/to/collections1_data` and `/path/to/collections2_5_data`.

Additionally, if you have the BreachCompilation dataset, you will have a third directory, `/path/to/data_breach_compilation_sorted`, generated by the conversion step above.

The merge is destructive (it modifies the destination in place), so it's better to copy one dataset to the output location first and then merge each of the others into it.

```bash
cp -rf /path/to/collections1_data /path/to/big_dataset
breach merge --src /path/to/collections2_5_data --dest /path/to/big_dataset
breach merge --src /path/to/data_breach_compilation_sorted --dest /path/to/big_dataset
```
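To make "destructive" concrete, here is a minimal sketch of merging one sharded folder into another. It assumes both folders use the same relative shard paths and that each shard fits in memory. The function names are hypothetical and the real `breach merge` may work differently.

```python
from pathlib import Path


def read_records(path):
    """Return the set of non-empty lines in a shard file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return {line.rstrip("\n") for line in fh if line.strip()}


def merge_datasets(src, dest):
    """Merge every shard under `src` into `dest` in place; return records added."""
    src, dest = Path(src), Path(dest)
    added = 0
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        dest_file = dest / src_file.relative_to(src)
        existing = read_records(dest_file) if dest_file.is_file() else set()
        merged = existing | read_records(src_file)
        added += len(merged) - len(existing)
        dest_file.parent.mkdir(parents=True, exist_ok=True)
        # The destination shard is overwritten, which is why a copy of the
        # original output is worth keeping before merging.
        with dest_file.open("w", encoding="utf-8") as out:
            out.writelines(record + "\n" for record in sorted(merged))
    return added
```

Only `dest` is rewritten here, while `src` is left untouched. That matches the workflow above of copying one dataset and folding the others into the copy.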

## Usage

The manual of the command line tool can be fetched by running `breach dumphelp`.

```
Usage: cli [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug
  --help                Show this message and exit.

Commands:
  chunk     chunk large TXT files into smaller files.
  clean     Cleans a query friendly folder PATH. Move incorrect records and
            sort the files.
  dumphelp
  evaluate  Evaluates some metrics such as precision/recall (e.g. is OLD into
            NEW).
  merge     Merges dataset SRC into dataset DEST.
  parse     Parses an unstructured folder PATH of many files and generates two
            files: SUCCESS_FILE and FAILURE_FILE. All valid email:password
            will go to SUCCESS_FILE.
  sort      Sorts a query friendly folder PATH. Target is itself.
  split     Converts a large FILE to a query friendly folder OUT (e.g. a/b/c).
            Use RESTART_FROM to resume from the i-th line.
  test      Infers passwords of a list of emails defined in FILE with a query
            friendly folder DATASET.

Usage: cli dumphelp [OPTIONS]

Options:
  --help  Show this message and exit.

Usage: cli split [OPTIONS]

Options:
  --file FILE             [required]
  --out DIRECTORY         [required]
  --restart_from INTEGER  [default: 0]
  --help                  Show this message and exit.

Usage: cli chunk [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --size INTEGER    [default: 50]
  --help            Show this message and exit.

Usage: cli sort [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli clean [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli test [OPTIONS]

Options:
  --file FILE                     [required]
  --dataset [breach_compilation|collections_1|collections_2_5|all]
                                  [required]
  --help                          Show this message and exit.

Usage: cli parse [OPTIONS]

Options:
  --path DIRECTORY                [required]
  --success_file FILE             [required]
  --failure_file FILE             [required]
  --cython_acceleration / --no-cython_acceleration
  --help                          Show this message and exit.

Usage: cli merge [OPTIONS]

Options:
  --src DIRECTORY   [required]
  --dest DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli evaluate [OPTIONS]

Options:
  --old DIRECTORY  [required]
  --new DIRECTORY  [required]
  --help           Show this message and exit.
```
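As an illustration of what `parse` is described as doing (routing valid `email:password` lines to SUCCESS_FILE and everything else to FAILURE_FILE), here is a minimal sketch. The regular expression, the acceptance of `;` as an alternative separator and the function names are assumptions made for this example. The tool's actual validation rules are not shown in this README.

```python
import re

# Deliberately loose pattern (assumption): local@domain.tld, then ':' or ';',
# then a non-empty password that may itself contain separators.
RECORD = re.compile(r"^\s*([^\s@:;]+@[^\s@:;]+\.[^\s@:;]+)[:;](.+?)\s*$")


def parse_line(line):
    """Return a normalised 'email:password' string, or None if invalid."""
    match = RECORD.match(line)
    if match is None:
        return None
    email, password = match.groups()
    return f"{email}:{password}"


def parse_lines(lines):
    """Split raw lines into (success, failure) lists."""
    success, failure = [], []
    for line in lines:
        record = parse_line(line)
        if record is None:
            failure.append(line.rstrip("\n"))
        else:
            success.append(record)
    return success, failure
```

In a real run, the two lists would be streamed to the success and failure files rather than held in memory.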