# analyze-twitter-export
Analyze your Twitter export data with the help of [DuckDB](https://duckdb.org/).

## Usage
The following steps are required to analyze your Twitter export data.

1. Install DuckDB.
This can be done by running the `scripts/install_duckdb.sh` script (it assumes you're on a Linux machine). Alternatively, run `brew install duckdb` on macOS, or follow the [installation instructions](https://duckdb.org/docs/installation) for your platform on the DuckDB website.
2. Copy the downloaded Twitter export data to the `src-data` directory.
This should be the zip file you downloaded from Twitter.
3. Prepare the Twitter export data for import into DuckDB.
The data needs to be converted into a format that DuckDB can import. This can be done by running the `scripts/prepare_tweets.sh` script.
4. Create a DuckDB database from your Twitter export data.
This can be done by running the `scripts/create_database.sh` script. The result is a file called `twitter.duckdb` in the `data` directory.
5. Analyze the data.
This can be done by running `duckdb data/twitter.duckdb` in the project root directory and then executing SQL queries inside the DuckDB CLI, as sketched below.
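
As a quick first check inside the DuckDB CLI, you can count the imported tweets (a minimal sketch; it assumes the `tweet` table produced by `scripts/create_database.sh`, which is also used in the example queries below):

```sql
-- Quick sanity check after starting the CLI with `duckdb data/twitter.duckdb`:
-- count the imported tweets and show the date range they cover.
SELECT COUNT(*)      AS tweet_count,
       MIN(created_at) AS first_tweet,
       MAX(created_at) AS last_tweet
FROM tweet;
```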

## Entity Relationship Diagram
The following diagram shows the structure of the resulting database.

![Twitter Export Database ERD](docs/erd.png)
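
If you prefer inspecting the schema from the CLI instead of the diagram, DuckDB's built-in statements can list the tables and their columns (a minimal sketch; the table names are taken from the example queries below):

```sql
-- List all tables in the database.
SHOW TABLES;

-- Show the columns and types of individual tables.
DESCRIBE tweet;
DESCRIBE rel_tweet_hashtag;
```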

## SQL Workbench
You can use [SQL Workbench](https://sql-workbench.com) to analyze the data locally in the browser. Just drag & drop the `data/twitter.duckdb` file into SQL Workbench's file drop area.

Note that you need to **prefix the table names with the database name** in your queries (e.g. `SELECT * FROM twitter.tweet LIMIT 10;`).
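
For instance, the hashtag query from the examples below would look like this in SQL Workbench (a sketch, assuming the database is attached under the name `twitter`):

```sql
-- Same query as in the examples below, with the database name prefixed to each table.
SELECT h.hashtag, COUNT(DISTINCT rh.tweet_id) AS count
FROM twitter.hashtag h
INNER JOIN twitter.rel_tweet_hashtag rh ON h.hashtag_id = rh.hashtag_id
GROUP BY h.hashtag
ORDER BY count DESC
LIMIT 10;
```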

![SQL Workbench](docs/screenshot.png)

## Example Queries
The following example queries can be used to analyze the data.

### Show all tweets and replies
```sql
SELECT *
FROM tweet
ORDER BY created_at DESC;
```

### Show all tweets with expanded content (w/o replies)
```sql
SELECT tweet_id, created_at, content_expanded, favorite_count, retweet_count, language
FROM tweet
WHERE is_reply = false
ORDER BY created_at DESC;
```

### Show most liked tweets
```sql
SELECT tweet_id, created_at, content_expanded, favorite_count, retweet_count
FROM tweet
ORDER BY favorite_count DESC;
```

### Number of tweets per day
```sql
SELECT strftime(created_at, '%Y-%m-%d') AS day, COUNT(*) AS count
FROM tweet
GROUP BY day
ORDER BY day;
```

### Most used hashtags
```sql
SELECT h.hashtag, COUNT(DISTINCT rh.tweet_id) AS count
FROM hashtag h
INNER JOIN rel_tweet_hashtag rh ON h.hashtag_id = rh.hashtag_id
GROUP BY h.hashtag
ORDER BY count DESC;
```

### Most mentioned users
```sql
SELECT u.screen_name, COUNT(DISTINCT ru.tweet_id) AS count
FROM user u
INNER JOIN rel_tweet_mentioned_user ru ON u.user_id = ru.user_id
GROUP BY u.screen_name
ORDER BY count DESC;
```