{"id":14156175,"url":"https://github.com/EC-DIGIT-CSIRC/credentialLeakDB","last_synced_at":"2025-08-06T02:32:12.785Z","repository":{"id":40269593,"uuid":"338677765","full_name":"EC-DIGIT-CSIRC/credentialLeakDB","owner":"EC-DIGIT-CSIRC","description":"A database for storing, querying and doing stats on credential leaks","archived":true,"fork":false,"pushed_at":"2023-05-23T07:09:47.000Z","size":708,"stargazers_count":37,"open_issues_count":3,"forks_count":7,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-08-17T08:06:42.900Z","etag":null,"topics":["breaches","credential-leaks","credential-stuffing","credentials","haveibeenpwned","leak-data","leaks","spycloud"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EC-DIGIT-CSIRC.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-13T21:57:51.000Z","updated_at":"2024-08-17T08:06:44.384Z","dependencies_parsed_at":"2024-08-17T08:16:49.670Z","dependency_job_id":null,"html_url":"https://github.com/EC-DIGIT-CSIRC/credentialLeakDB","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EC-DIGIT-CSIRC%2FcredentialLeakDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EC-DIGIT-CSIRC%2FcredentialLeakDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EC-DIGIT-CSIRC%2FcredentialLeakDB/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/EC-DIGIT-CSIRC%2FcredentialLeakDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EC-DIGIT-CSIRC","download_url":"https://codeload.github.com/EC-DIGIT-CSIRC/credentialLeakDB/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228829079,"owners_count":17978147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["breaches","credential-leaks","credential-stuffing","credentials","haveibeenpwned","leak-data","leaks","spycloud"],"created_at":"2024-08-17T08:05:16.072Z","updated_at":"2024-12-09T03:31:11.679Z","avatar_url":"https://github.com/EC-DIGIT-CSIRC.png","language":"Python","funding_links":[],"categories":["others"],"sub_categories":[],"readme":"# credentialleakDB\n\n[![Pylint](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/pylint.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/pylint.yml)\n[![flak8 and pytest](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/python-app.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/python-app.yml)\n[![CodeQL](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/EC-DIGIT-CSIRC/credentialLeakDB/actions/workflows/codeql-analysis.yml)\n[![Quality Gate 
Status](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB\u0026metric=alert_status\u0026token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)\n[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB\u0026metric=sqale_rating\u0026token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)\n[![Reliability Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB\u0026metric=reliability_rating\u0026token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)\n[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=digits2_credentialLeakDB\u0026metric=security_rating\u0026token=cee9c8232570fa1000ab4770feb571fd3e85ff39)](https://sonarcloud.io/dashboard?id=digits2_credentialLeakDB)\n[![codecov](https://codecov.io/gh/EC-DIGIT-CSIRC/credentialLeakDB/branch/main/graph/badge.svg?token=SS5F8EXQON)](https://codecov.io/gh/EC-DIGIT-CSIRC/credentialLeakDB)\n\n\nA database structure to store leaked credentials. \n\nThink: our own, internal [HaveIBeenPwned](https://haveibeenpwned.com/) database.\n\n## Why?\n\n1. To quickly find duplicates before sending it on to further process the data\n2. To have a way to load diverse credential breaches into a common structure and do common queries on it\n3. To quickly generate statistics on credential leaks\n4. 
To have a well-defined interface for passing data on to other automation steps\n\n## Documentation\n\n### Installation\n\n#### Docker\n\n#### Via pip and venv\n\n```bash\ngit clone https://github.com/EC-DIGIT-CSIRC/credentialLeakDB.git\ncd credentialLeakDB\n# create a virtualenv\nvirtualenv --python=python3.7 venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\nNext, make sure the following files exist:\n  * ``VIPs.txt`` ... a newline-separated list of email addresses that you consider VIPs.\n  * ``api/config.py`` ... see below\n \n### Database structure\nSearch in Confluence for \"credentialLeakDB\" in the Automation space.\n\nSQL structure: [db.sql](db.sql)\n\nThe EER diagram was __intentionally__ simplified a lot. If we are going to store billions of repeated ``text`` datatype records, we can \ngo back to more normalization. For now, however, this seems to be enough.\n\n\n![EER Diagram](EER.png)\n\n\n\n### Meaning of the fields\n\n#### Table ``leak``\n\n|      Column       |           Type           | Collation | Nullable |  Description          |                                          \n|------------------ | ------------------------ | --------- | -------- | ----------------------------------------------------------------------------------------------------------------- |\n| ``id``                | integer                  |           | not null | _primary key. Auto-generated_. |\n| ``breach_ts``         | timestamp with time zone |           |          | If known, the timestamp when the breach happened. |\n| ``source_publish_ts`` | timestamp with time zone |           |          | The timestamp when the source (e.g. Spycloud) published the data. |\n| ``ingestion_ts``      | timestamp with time zone |           | not null | The timestamp when we ingested the data. |\n| ``summary``           | text                     |           | not null | A short summary (slug) of the leak. 
Used for displaying it somewhere |\n| ``ticket_id``         | text                     |           |          |  |\n| ``reporter_name``     | text                     |           |          | The name of the reporter where we got the notification from. E.g. CERT-eu, Spycloud, etc... Who sent us the data? |\n| ``source_name``       | text                     |           |          | The name of the source where this leak came from. Either the name of a collection or some other name. |\n\n```\nIndexes:\n    \"leak_pkey\" PRIMARY KEY, btree (id)\nReferenced by:\n    TABLE \"leak_data\" CONSTRAINT \"leak_data_leak_id_fkey\" FOREIGN KEY (leak_id) REFERENCES leak(id)\n ```\n \n #### Table ``leak_data``\n                                                                                                                    \n|        Column        |  Type   | Collation | Nullable |  Description                                                             \n--------------------- | ------- | --------- | -------- | -----------------------------------------------------------------------------------------------------------------------------------\n ``id``                   | integer |           | not null | _primary key, auto-generated_. | \n ``leak_id``              | integer |           | not null | references a ``leak(id)`` | \n ``email``                | text    |           | not null | The email address associated with the leak. | \n ``password``             | text    |           | not null | Either the encrypted or unencrypted password. If the unencrypted password is available, that is what is going to be in this field. |\n ``password_plain``       | text    |           |          | The plaintext password, if known. |\n ``password_hashed``      | text    |           |          | The hashed password, if known. 
|\n ``hash_algo``            | text    |           |          | The hashing algorithm, if we can determine it and the password_hashed field is set, for example \"md5\" or \"sha1\". |\n ``ticket_id``            | text    |           |          | References the ticket system's ticket ID associated with handling this credential leak. This ticket could contain information on how we contacted the affected user. | \n ``email_verified``       | boolean |           |          | Whether the email address was verified to exist and be active. | \n ``password_verified_ok`` | boolean |           |          | Was that password still valid / active? | \n ``ip``                   | inet    |           |          | IP address of the client PC in case of a password stealer. | \n ``domain``               | text    |           |          | Domain part of the user's email address. | \n ``browser``              | text    |           |          | If the password was leaked via a password stealer malware, then the browser of the user goes here. Otherwise empty. | \n ``malware_name``         | text    |           |          | If the password was leaked via a password stealer malware, then the malware name goes here. Otherwise empty. |\n ``infected_machine``     | text    |           |          | If the password was leaked via a password stealer malware, then the infected (Windows) PC name (some ID for the machine) goes here. |\n ``dg``                   | text    |           | not null | The affected DG (in other organisations, this would be called \"department\"). |\n ``count_seen``           | integer |           |          | How often we have already seen this unique combination (leak, email, password, domain), i.e. a duplicate counter.  
| \n\n```\nIndexes:\n    \"leak_data_pkey\" PRIMARY KEY, btree (id)\n    \"constr_unique_leak_data_leak_id_email_password_domain\" UNIQUE CONSTRAINT, btree (leak_id, email, password, domain)\n    \"idx_leak_data_unique_leak_id_email_password_domain\" UNIQUE, btree (leak_id, email, password, domain)\n    \"idx_leak_data_dg\" btree (dg)\n    \"idx_leak_data_email\" btree (upper(email))\n    \"idx_leak_data_email_password_machine\" btree (email, password, infected_machine)\n    \"idx_leak_data_malware_name\" btree (malware_name)\nForeign-key constraints:\n    \"leak_data_leak_id_fkey\" FOREIGN KEY (leak_id) REFERENCES leak(id)\n```    \n\n\n# Usage of the API\n\nTo explore the API endpoints, start the server (follow the instructions below) and go to ``$servername/docs``, where ``$servername`` is the domain / IP address you installed it under. The ``docs/`` endpoint hosts a Swagger / OpenAPI 3 UI.\n\n## GET parameters\n\nThese are pretty self-explanatory thanks to the Swagger UI.\n\n## POST and PUT\n\nFor HTTP POST (i.e. an INSERT into the DB) you will need to provide the following JSON info:\n\n### leak object\n```json\n{\n  \"id\": 0,\n  \"ticket_id\": \"string\",\n  \"summary\": \"string\",\n  \"reporter_name\": \"string\",\n  \"source_name\": \"string\",\n  \"breach_ts\": \"2021-03-29T12:21:56.370Z\",\n  \"source_publish_ts\": \"2021-03-29T12:21:56.370Z\"\n}\n\n```\n\nThe ``id`` field *only* needs to be filled out when PUTing data (i.e. an UPDATE statement). Otherwise please leave it out when POSTing a new leak row.\nThe ``id`` is the internal, automatically generated primary key and is assigned on insertion. So when you use the ``HTTP POST /leak`` endpoint, please leave out ``id``. 
The response is a JSON object whose ``data`` array contains a dict with the id, such as:\n\n```json\n{\n  \"meta\": {\n    \"version\": \"0.5\",\n    \"duration\": 0.006,\n    \"count\": 1\n  },\n  \"data\": [\n    {\n      \"id\": 18\n    }\n  ],\n  \"error\": null\n}\n```\n\nMeaning: the version of the API was 0.5, the query duration was 0.006 sec (6 millisec), and one result was returned. The ``data`` array contains one element: id=18. That is, the ID of the inserted leak object was 18. You can now reference this ID when inserting leak_data objects.\n\n### leak_data object\n\nAs with the leak object, the ``id`` field *only* needs to be filled out when PUTing data (i.e. an UPDATE statement). Otherwise please leave it out when POSTing a new leak_data row. **Note well**: the ``leak_id`` field needs to be filled out in this case. You **first** have to create a leak object and only afterwards the leak_data object.\n\n```json\n{\n  \"id\": 0,\n  \"leak_id\": 0,\n  \"email\": \"user@example.com\",\n  \"password\": \"string\",\n  \"password_plain\": \"string\",\n  \"password_hashed\": \"string\",\n  \"hash_algo\": \"string\",\n  \"ticket_id\": \"string\",\n  \"email_verified\": true,\n  \"password_verified_ok\": true,\n  \"ip\": \"string\",\n  \"domain\": \"string\",\n  \"browser\": \"string\",\n  \"malware_name\": \"string\",\n  \"infected_machine\": \"string\",\n  \"dg\": \"string\"\n}\n```\n\n## ``import/csv/`` endpoint\n\nAlso pretty self-explanatory. You need to first create a leak object, pass its ID as a GET-style parameter and upload the CSV in Spycloud format via the form.\n\n\n## Installation\n\n1. Install git and check out this repository:\n```bash\napt install git\ngit clone ...\ncd credentialLeakDB\n```\n\n2. Install PostgreSQL:\n```bash \n# in Ubuntu:\napt install postgresql-12           \n# alternatively, if you are in Debian 10, you can also use postgresql-11, both work:\n# apt install postgresql-11\n```\n\n3. 
As user postgres, create the database and the role:\n```bash\nsudo su - postgres\ncreatedb credentialleakdb\ncreateuser credentialleakdb\npsql -c \"ALTER ROLE credentialleakdb WITH PASSWORD '\u003cinsert some random password here\u003e'\" template1\n```\n\n4. Create the DB structure:\n```bash\npsql -U credentialleakdb credentialleakdb \u003c db.sql\n```\n\n5. Set the environment variables:\n```bash\nexport PORT=8080\nexport DBNAME=credentialleakdb\nexport DBUSER=credentialleakdb\nexport DBPASSWORD=... \u003cinsert the password you gave the user\u003e ...\nexport DBHOST=localhost\n```\n6. Create a virtual environment if it does not exist yet:\n   ```bash\n   virtualenv --python=python3.7 venv\n   source venv/bin/activate\n   pip install -r requirements.txt\n   ```\n7. Start the program from the main directory:\n```bash\nexport PYTHONPATH=$(pwd); uvicorn --reload --host 0.0.0.0 --port $PORT api.main:app\n```\n\n## Configuration\n\nPlease copy the file ``config.SAMPLE.py`` to ``api/config.py`` and adjust accordingly.\nHere you can set API keys etc.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEC-DIGIT-CSIRC%2FcredentialLeakDB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEC-DIGIT-CSIRC%2FcredentialLeakDB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEC-DIGIT-CSIRC%2FcredentialLeakDB/lists"}