https://github.com/EC-DIGIT-CSIRC/credentialLeakDB

A database for storing, querying and doing stats on credential leaks
Host: GitHub
URL: https://github.com/EC-DIGIT-CSIRC/credentialLeakDB
Owner: EC-DIGIT-CSIRC
Archived: true
Created: 2021-02-13T21:57:51.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-05-23T07:09:47.000Z (about 2 years ago)
Last Synced: 2024-08-17T08:06:42.900Z (10 months ago)
Topics: breaches, credential-leaks, credential-stuffing, credentials, haveibeenpwned, leak-data, leaks, spycloud
Language: Python
Homepage:
Size: 691 KB
Stars: 37
Watchers: 7
Forks: 7
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- Contributing: CONTRIBUTING.md
- Security: SECURITY.md
README

# credentialLeakDB

[![Pylint](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/pylint.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/pylint.yml)

[![flake8 and pytest](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/python-app.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/python-app.yml)

[![CodeQL](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/codeql-analysis.yml)

[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB&metric=alert_status&token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)

[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB&metric=sqale_rating&token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)

[![Reliability Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB&metric=reliability_rating&token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)

[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB&metric=security_rating&token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)

[![codecov](https://codecov.io/gh/EC-DIGIT-CSIRC/credentialLeakDB/branch/main/graph/badge.svg?token=SS5F8EXQON)](https://codecov.io/gh/EC-DIGIT-CSIRC/credentialLeakDB)

A database structure to store leaked credentials. 

Think: our own, internal [HaveIBeenPwned](https://haveibeenpwned.com/) database.

## Why?

1. To quickly find duplicates before passing the data on for further processing

2. To load diverse credential breaches into a common structure and run common queries on them

3. To quickly generate statistics on credential leaks

4. To have a well-defined interface for passing the data on to other automation steps

## Documentation

### Installation

#### Docker

#### Via pip and venv

```bash
git clone https://github.com/EC-DIGIT-CSIRC/credentialLeakDB.git
cd credentialLeakDB

# create a virtualenv
virtualenv --python=python3.7 venv
source venv/bin/activate
pip install -r requirements.txt
```

Next, make sure the following files exist:

  * ``VIPs.txt`` ... a newline-separated list of email addresses that you consider VIPs.

  * ``api/config.py`` ... see the Configuration section below
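
As a rough illustration of the expected ``VIPs.txt`` format, the sketch below loads a newline-separated list of addresses into a set. The helper name ``load_vips`` is hypothetical, and the lower-casing and blank-line handling are assumptions made for this example; the project itself may read the file differently.

```python
from pathlib import Path
from typing import Set


def load_vips(path: str) -> Set[str]:
    """Read one email address per line and skip blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Lower-casing is an assumption so that later lookups are case-insensitive.
    return {line.strip().lower() for line in lines if line.strip()}
```

A lookup would then be a simple membership test, for example ``"alice@example.com" in load_vips("VIPs.txt")``.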

 

### Database structure

Search in Confluence for "credentialLeakDB" in the Automation space.

SQL structure: [db.sql](db.sql)

The EER diagram __intentionally__ got simplified a lot. If we are going to store billions of repeated ``text`` datatype records, we can go back to more normalization. For now, however, this seems to be enough.

![EER Diagram](EER.png)

### Meaning of the fields

#### Table ``leak``

| Column                | Type                     | Collation | Nullable | Description |
|---------------------- | ------------------------ | --------- | -------- | ----------- |
| ``id``                | integer                  |           | not null | _primary key. Auto-generated_. |
| ``breach_ts``         | timestamp with time zone |           |          | If known, the timestamp when the breach happened. |
| ``source_publish_ts`` | timestamp with time zone |           |          | The timestamp when the source (e.g. Spycloud) published the data. |
| ``ingestion_ts``      | timestamp with time zone |           | not null | The timestamp when we ingested the data. |
| ``summary``           | text                     |           | not null | A short summary (slug) of the leak. Used for displaying it somewhere. |
| ``ticket_id``         | text                     |           |          |  |
| ``reporter_name``     | text                     |           |          | The name of the reporter we got the notification from, e.g. CERT-EU, Spycloud, etc. Who sent us the data? |
| ``source_name``       | text                     |           |          | The name of the source where this leak came from. Either the name of a collection or some other name. |

```
Indexes:
    "leak_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "leak_data" CONSTRAINT "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)
```

#### Table ``leak_data``

| Column                   | Type    | Collation | Nullable | Description |
|------------------------- | ------- | --------- | -------- | ----------- |
| ``id``                   | integer |           | not null | _primary key, auto-generated_. |
| ``leak_id``              | integer |           | not null | references a ``leak(id)`` |
| ``email``                | text    |           | not null | The email address associated with the leak. |
| ``password``             | text    |           | not null | Either the encrypted or unencrypted password. If the unencrypted password is available, that is what is going to be in this field. |
| ``password_plain``       | text    |           |          | The plaintext password, if known. |
| ``password_hashed``      | text    |           |          | The hashed password, if known. |
| ``hash_algo``            | text    |           |          | If we can determine the hashing algo and the password_hashed field is set, for example "md5" or "sha1" |
| ``ticket_id``            | text    |           |          | References the ticket systems' ticket ID associated with handling this credential leak. This ticket could contain infos on how we contacted the affected user. |
| ``email_verified``       | boolean |           |          | If the email address was verified if it does exist and is active |
| ``password_verified_ok`` | boolean |           |          | Was that password still valid / active? |
| ``ip``                   | inet    |           |          | IP address of the client PC in case of a password stealer. |
| ``domain``               | text    |           |          | Domain address of the user's email address. |
| ``browser``              | text    |           |          | If the password was leaked via a password stealer malware, then the browser of the user goes here. Otherwise empty. |
| ``malware_name``         | text    |           |          | If the password was leaked via a password stealer malware, then the malware name goes here. Otherwise empty. |
| ``infected_machine``     | text    |           |          | If the password was leaked via a password stealer malware, then the infected (Windows) PC name (some ID for the machine) goes here. |
| ``dg``                   | text    |           | not null | The affected DG (in other organisations, this would be called "department") |
| ``count_seen``           | integer |           |          | How often did we already see this unique combination (leak, email, password, domain). I.e. this is a duplicate counter. |

```
Indexes:
    "leak_data_pkey" PRIMARY KEY, btree (id)
    "constr_unique_leak_data_leak_id_email_password_domain" UNIQUE CONSTRAINT, btree (leak_id, email, password, domain)
    "idx_leak_data_unique_leak_id_email_password_domain" UNIQUE, btree (leak_id, email, password, domain)
    "idx_leak_data_dg" btree (dg)
    "idx_leak_data_email" btree (upper(email))
    "idx_leak_data_email_password_machine" btree (email, password, infected_machine)
    "idx_leak_data_malware_name" btree (malware_name)
Foreign-key constraints:
    "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)
```
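
The unique constraint on ``(leak_id, email, password, domain)`` together with the ``count_seen`` column suggests a deduplication scheme along the following lines. This is a pure-Python sketch of the idea only: the function name and the in-memory dict are hypothetical stand-ins for the database, and starting the counter at 1 is an assumption. The actual ingestion code may behave differently.

```python
from typing import Dict, Tuple

# The key mirrors the unique constraint: (leak_id, email, password, domain).
Key = Tuple[int, str, str, str]


def record_sighting(seen: Dict[Key, int], leak_id: int, email: str,
                    password: str, domain: str) -> int:
    """Increment and return the duplicate counter for one credential row."""
    key = (leak_id, email, password, domain)
    seen[key] = seen.get(key, 0) + 1
    return seen[key]


seen: Dict[Key, int] = {}
first = record_sighting(seen, 18, "user@example.com", "hunter2", "example.com")
second = record_sighting(seen, 18, "user@example.com", "hunter2", "example.com")
other = record_sighting(seen, 18, "user@example.com", "different", "example.com")
```

Here the first two calls hit the same key, so the counter for that row goes up, while the third call differs in the password and therefore counts as a separate row.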

# Usage of the API

Here is how to use the API endpoints: start the server (follow the installation instructions below) and go to ``$servername/docs``, where ``$servername`` is the domain / IP address you installed it under. The ``docs/`` endpoint hosts a Swagger / OpenAPI 3 UI.

## GET parameters

These are pretty self-explanatory thanks to the swagger UI.

## POST and PUT

For HTTP POST (a.k.a. INSERT into the DB) you need to provide the following JSON:

### leak object

```json
{
  "id": 0,
  "ticket_id": "string",
  "summary": "string",
  "reporter_name": "string",
  "source_name": "string",
  "breach_ts": "2021-03-29T12:21:56.370Z",
  "source_publish_ts": "2021-03-29T12:21:56.370Z"
}
```

The ``id`` field *only* needs to be filled out when PUTing data (a.k.a. an UPDATE statement). The id is the internal, automatically generated primary key, so when you use the ``HTTP POST /leak`` endpoint to create a new leak row, please leave out ``id``. The answer will be a JSON object whose ``data`` array contains a dict with the assigned id, such as:

```json
{
  "meta": {
    "version": "0.5",
    "duration": 0.006,
    "count": 1
  },
  "data": [
    {
      "id": 18
    }
  ],
  "error": null
}
```

Meaning: the API version was 0.5, the query took 0.006 sec (6 ms), and there was one result. The ``data`` array contains one element with id=18, i.e. the ID of the inserted leak object is 18. You can now reference this ID when inserting leak_data objects.
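
A client would typically pull the new ID out of such an answer before inserting any leak_data rows. The following is a minimal sketch using only the standard library; the function name ``extract_new_id`` is hypothetical, and treating a non-null ``error`` field as a failure is an assumption about the API's behaviour.

```python
import json

# Example answer in the shape documented in this README.
response_body = """
{
  "meta": {"version": "0.5", "duration": 0.006, "count": 1},
  "data": [{"id": 18}],
  "error": null
}
"""


def extract_new_id(body: str) -> int:
    """Return the id of the first element in ``data`` (hypothetical helper)."""
    answer = json.loads(body)
    # Assumption: a non-null "error" field means the request failed.
    if answer.get("error") is not None:
        raise RuntimeError(f"API error: {answer['error']}")
    return answer["data"][0]["id"]


leak_id = extract_new_id(response_body)
```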

### leak_data object

As with the leak object, the ``id`` field *only* needs to be filled out when PUTing data (a.k.a. an UPDATE statement). Otherwise please leave it out when POSTing a new leak_data row. **Note well**: the ``leak_id`` field needs to be filled out in this case. You **first** have to create a leak object and only afterwards the leak_data object.

```json
{
  "id": 0,
  "leak_id": 0,
  "email": "user@example.com",
  "password": "string",
  "password_plain": "string",
  "password_hashed": "string",
  "hash_algo": "string",
  "ticket_id": "string",
  "email_verified": true,
  "password_verified_ok": true,
  "ip": "string",
  "domain": "string",
  "browser": "string",
  "malware_name": "string",
  "infected_machine": "string",
  "dg": "string"
}
```
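
The POST versus PUT distinction for ``id`` can be captured in a small payload builder. This is a sketch under stated assumptions: the helper ``build_leak_data_payload`` is hypothetical, it only fills the columns marked "not null" in the table above plus ``domain``, and deriving ``domain`` from the email address is an assumption. Check ``$servername/docs`` for the fields the API actually requires.

```python
import json
from typing import Any, Dict, Optional


def build_leak_data_payload(leak_id: int, email: str, password: str, dg: str,
                            row_id: Optional[int] = None) -> Dict[str, Any]:
    """Build a JSON-serialisable dict for a leak_data row (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "leak_id": leak_id,  # must reference an already created leak object
        "email": email,
        "password": password,
        # Assumption: the domain is the part after the last "@" of the email.
        "domain": email.rsplit("@", 1)[-1],
        "dg": dg,
    }
    if row_id is not None:
        payload["id"] = row_id  # only set for PUT (update), never for POST
    return payload


post_body = json.dumps(build_leak_data_payload(18, "user@example.com", "secret", "DIGIT"))
put_payload = build_leak_data_payload(18, "user@example.com", "secret", "DIGIT", row_id=5)
```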

## ``import/csv/`` endpoint

This is also pretty self-explanatory. You first need to create a leak object, then pass its ID as a GET-style parameter and upload the CSV in Spycloud format via the form.
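
"GET-style parameter" here means a query string appended to the endpoint URL. The sketch below only builds such a URL. Both the server address and the parameter name ``leak_id`` are assumptions for illustration; the Swagger UI under ``$servername/docs`` shows the real parameter name.

```python
from urllib.parse import urlencode, urljoin

base_url = "http://localhost:8080/"  # assumed server address


def import_csv_url(base: str, leak_id: int) -> str:
    """Build the import URL with the leak ID as a query parameter."""
    # "leak_id" as the parameter name is an assumption, not confirmed by the README.
    return urljoin(base, "import/csv/") + "?" + urlencode({"leak_id": leak_id})


url = import_csv_url(base_url, 18)
```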

## Installation

1. Install git and check out this repository:

```bash
apt install git
git clone ...
cd credentialLeakDB
```

2. Install PostgreSQL:

```bash
# in Ubuntu:
apt install postgresql-12
# alternatively, if you are in Debian 10, you can also use postgresql-11, both work:
# apt install postgresql-11
```

3. As user postgres, create the database and the role:

```bash
sudo su - postgres
createdb credentialleakdb
createuser credentialleakdb
psql -c "ALTER ROLE credentialleakdb WITH PASSWORD ''" template1
```

4. Create the DB structure (note that the psql option for the user name is an uppercase ``-U``):

```bash
psql -U credentialleakdb credentialleakdb < db.sql
```

5. Set the env vars:

```bash
export PORT=8080
export DBNAME=credentialleakdb
export DBUSER=credentialleakdb
export DBPASSWORD=...  ...
export DBHOST=localhost
```

6. Create a virtual environment if it does not exist yet:

   ```bash
   virtualenv --python=python3.7 venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

7. Start the program from the main directory:

```bash
export PYTHONPATH=$(pwd); uvicorn --reload --host 0.0.0.0 --port $PORT api.main:app
```
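
To show how the ``DBNAME``, ``DBUSER``, ``DBPASSWORD`` and ``DBHOST`` variables could fit together, here is a sketch that assembles them into a libpq-style keyword/value connection string. The helper ``dsn_from_env`` is hypothetical, and whether the application builds its connection this way is an assumption; the password shown is a placeholder.

```python
import os
from typing import Mapping


def dsn_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Build a keyword/value connection string from the DB* variables."""
    # Note: values containing spaces would need quoting; omitted in this sketch.
    return (
        f"host={env.get('DBHOST', 'localhost')} "
        f"dbname={env['DBNAME']} "
        f"user={env['DBUSER']} "
        f"password={env['DBPASSWORD']}"
    )


example = dsn_from_env({
    "DBNAME": "credentialleakdb",
    "DBUSER": "credentialleakdb",
    "DBPASSWORD": "changeme",  # placeholder, not a real password
    "DBHOST": "localhost",
})
```

Passing the mapping explicitly, as in ``example``, keeps the function easy to test without touching the real environment.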

## Configuration

Please copy the file ``config.SAMPLE.py`` to ``api/config.py`` and adjust accordingly.

Here you can set API keys etc.