Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/elliotwutingfeng/twitter200m
Simple analysis of the Twitter 200M Data Dump of January 2023.
200m data-science hacktoberfest haveibeenpwned leak osint twitter
- Host: GitHub
- URL: https://github.com/elliotwutingfeng/twitter200m
- Owner: elliotwutingfeng
- License: bsd-3-clause
- Created: 2023-01-10T06:01:52.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-18T05:26:31.000Z (4 months ago)
- Last Synced: 2024-07-18T07:21:51.668Z (4 months ago)
- Topics: 200m, data-science, hacktoberfest, haveibeenpwned, leak, osint, twitter
- Language: Jupyter Notebook
- Homepage:
- Size: 158 KB
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# Twitter200M
[![GitHub license](https://img.shields.io/badge/LICENSE-BSD--3--CLAUSE-GREEN?style=for-the-badge)](LICENSE)
Simple analysis of the [Twitter 200M Data Dump](https://haveibeenpwned.com/PwnedWebsites#Twitter200M) of January 2023.
Download links for the data dump are **not** included in this repository.
## Background
Quoting haveibeenpwned.com:
> In early 2023, over 200M records scraped from Twitter appeared on a popular hacking forum. The data was obtained sometime in 2021 by abusing an API that enabled email addresses to be resolved to Twitter profiles. The subsequent results were then composed into a corpus of data containing email addresses alongside public Twitter profile information including names, usernames and follower counts.
The data dump analysed in this repository is a "cleaned-up" version prepared by a user of the aforementioned forum.
## Findings
### Caveats
- Not all user accounts have been leaked; Twitter has far more than 200 million accounts.
- It is impossible to verify that the leaked dataset has not been tampered with or padded with falsified data.

The following findings assume that this dataset is representative of Twitter's actual userbase.
### Most popular email providers
```text
┌────────────────┬─────────────────┐
│ Email Provider ┆ Number of Users │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪═════════════════╡
│ gmail.com ┆ 73314131 │
│ hotmail.com ┆ 40509492 │
│ yahoo.com ┆ 33051713 │
│ aol.com ┆ 4025882 │
│ hotmail.co.uk ┆ 3298152 │
│ mail.ru ┆ 3289923 │
│ hotmail.fr ┆ 3128568 │
│ live.com ┆ 1945940 │
│ msn.com ┆ 1321923 │
│ yahoo.co.uk ┆ 1313553 │
│ yahoo.fr ┆ 1245996 │
│ ymail.com ┆ 1142144 │
│ yandex.ru ┆ 1125810 │
│ icloud.com ┆ 1093533 │
│ comcast.net ┆ 1091726 │
└────────────────┴─────────────────┘
```
Over **75%** of Twitter users in the dataset use either Google, Microsoft, or Yahoo email addresses.
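
The table above resembles Polars DataFrame output, but the notebook's actual code is not reproduced in this README. As a rough illustration of the idea only, the sketch below counts email domains with the Python standard library. The helper names `email_domain` and `provider_counts` and the sample addresses are hypothetical and are not taken from the repository or the dump.

```python
from collections import Counter
from typing import Iterable, Optional


def email_domain(address: str) -> Optional[str]:
    """Return the lower-cased domain of an email address, or None if malformed."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


def provider_counts(addresses: Iterable[str]) -> Counter:
    """Count how many addresses belong to each email domain."""
    counts = Counter()
    for address in addresses:
        domain = email_domain(address)
        if domain is not None:
            counts[domain] += 1
    return counts


# Made-up sample addresses, for illustration only (not from the data dump).
sample = [
    "alice@gmail.com",
    "bob@GMAIL.com",
    "carol@hotmail.com",
    "dave@yahoo.com",
    "not-an-email",
]

counts = provider_counts(sample)
for domain, n in counts.most_common():
    print(f"{domain:<15} {n}")
```

Domains are lower-cased before counting so that `GMAIL.com` and `gmail.com` land in the same bucket, and malformed entries are skipped rather than counted. For a dataset of roughly 200 million rows, a columnar library is likely a better fit than a pure-Python loop.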
### Account creation times
Twitter first experienced rapid user growth in 2009, with its highest new account signup rates from 2011 to 2013.
From 2016 onwards, new account signups dipped below 2009 levels and have declined steadily since.
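
A trend like this can be derived by bucketing account creation timestamps by calendar year. The sketch below is a minimal illustration under one assumption: that each record carries an ISO 8601 creation timestamp. The actual column name and timestamp format in the dump are not stated in this README, and the function name `signups_per_year` and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def signups_per_year(created_at_values: Iterable[str]) -> Dict[int, int]:
    """Count accounts by the calendar year of their creation timestamp."""
    years = Counter()
    for value in created_at_values:
        # Assumes ISO 8601 strings such as "2011-07-01T12:30:00".
        years[datetime.fromisoformat(value).year] += 1
    return dict(sorted(years.items()))


# Made-up timestamps, for illustration only (not from the data dump).
sample = [
    "2009-03-14T08:00:00",
    "2011-07-01T12:30:00",
    "2011-11-23T19:45:00",
    "2016-02-02T00:00:00",
]

per_year = signups_per_year(sample)
for year, n in per_year.items():
    print(year, n)
```

Plotting the resulting counts as a bar chart per year would make peaks and declines easy to compare.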
## Requirements
Tested on Linux x64
- Fast multicore CPU
- At least 16 GB available RAM
- Python 3.11.7
- [7zip](https://7-zip.org)
## Setup
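
Before creating the virtual environment, it may help to confirm that the interpreter matches the tested version and that a 7-Zip executable is on `PATH`. The following is an optional sketch, not part of the repository. The executable names in `SEVEN_ZIP_NAMES` are assumptions about how 7-Zip might be installed on a given system, and `preflight` is a hypothetical helper.

```python
import shutil
import sys

# Version listed under Requirements; other versions may also work.
TESTED_PYTHON = (3, 11, 7)

# Assumed candidate executable names; your installation may use a different one.
SEVEN_ZIP_NAMES = ("7z", "7zz", "7za", "7zr")


def find_7zip():
    """Return the path of the first 7-Zip executable found on PATH, or None."""
    for name in SEVEN_ZIP_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None


def preflight():
    """Collect basic facts about the local environment."""
    version = tuple(sys.version_info[:3])
    return {
        "python_version": version,
        "matches_tested_python": version == TESTED_PYTHON,
        "seven_zip_path": find_7zip(),
    }


report = preflight()
for key, value in report.items():
    print(f"{key}: {value}")
```

With those in place, create the virtual environment and install the dependencies: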
```shell
python3 -m venv venv
venv/bin/python3 -m pip install --upgrade pip
venv/bin/python3 -m pip install -r requirements.txt
```
## Run Jupyter Notebook
```shell
venv/bin/python3 -m jupyter notebook main.ipynb
```
## Formatting
```shell
venv/bin/python3 -m black .
```