# Convert Reddit data dumps into thread-views

This repo helps reconstruct thread views similar to the original [Reddit](https://reddit.com) ones. It contains only the scripts to do so.

[ArthurHeitmann/arctic_shift](https://github.com/ArthurHeitmann/arctic_shift) provides [Reddit data dumps per sub via torrents](https://github.com/ArthurHeitmann/arctic_shift/blob/master/download_links.md).

`arctic_shift` data dumps come in pairs:

1. submissions
2. comments

e.g., `careerguidance_comments.zst` and `careerguidance_submissions.zst`.

Once the dumps have been downloaded (they are in `.zst` format), one needs to extract the data and then convert it into threads while sorting the comments correctly.

I found the easiest quick-and-dirty way to accomplish this was to feed each pair of dump files into an SQLite database per sub, then, in a second script, extract each thread's submission and corresponding comments, rebuild the comment graph, and walk the tree to flatten it into the desired per-thread document.
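
To make the shape of the first step concrete, here is a minimal sketch (not the repo's actual code) of streaming a `.zst` dump into SQLite. The record fields `id` and `parent_id` are assumptions based on typical Reddit dump records:

```
#!/usr/bin/env python
# Illustrative zst -> sqlite ingest: stream-decode the dump line by line
# and store each JSON record. Field names are assumptions, not the
# repo's actual schema.
import io, json, sqlite3, sys
import zstandard  # pip install zstandard

def ingest(zst_path, db_path, table):
    conn = sqlite3.connect(db_path)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                 "(id TEXT PRIMARY KEY, parent_id TEXT, json TEXT)")
    # the dumps use a large compression window; allow up to 2 GiB
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(zst_path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            rec = json.loads(line)
            conn.execute(f"INSERT OR IGNORE INTO {table} VALUES (?, ?, ?)",
                         (rec["id"], rec.get("parent_id"), line))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # e.g. ingest("careerguidance_comments.zst", "careerguidance.db", "comments")
    ingest(sys.argv[1], sys.argv[2], sys.argv[3])
```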

This is far from the most efficient tooling; it is more of a quick solution for handling a few Reddit subs. There is no intention of turning this repo into something more than what it is, so the code is provided as is.

## Tools

To preprocess `.zst` to SQLite to `.jsonl` in two stages, run:

```
./zst2sqlite.py *.zst
./sqlite2threads.py *.db
```

The end result will be one `.jsonl` file per sub, with each thread's flattened submission+comments in a single `text` record.
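
For illustration, a record in the output might look roughly like this (the exact layout is an assumption, not the repo's actual output format):

```
{"text": "Title: How do I switch careers at 30?\n\n[submission body]\n\n* top-level comment\n  * reply to that comment"}
```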

If you want each stage broken down separately, read the following notes.

### Stage 1. Converting .zst to sqlite database

To go from `.zst` files directly to an SQLite database per sub, run:

```
./zst2sqlite.py careerguidance_*.zst
```

Potential TODO: some subs have tens of millions of records, and building a single SQLite db might take days because the insertion can't be parallelized. The input data could probably be sharded to produce multiple `.db` files, say one per million records or so.
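
Purely as an illustration of that TODO (the helper name, schema, and shard size are hypothetical, not from the repo), sharding could look like:

```
# Hypothetical sharding sketch: write every RECORDS_PER_SHARD records
# to a fresh .db file so shards could later be processed in parallel.
import sqlite3

RECORDS_PER_SHARD = 1_000_000

def shard_insert(records, db_prefix):
    """records yields (id, parent_id, raw_json) tuples."""
    shard, count, conn = 0, 0, None
    for rec_id, parent_id, raw in records:
        if conn is None or count >= RECORDS_PER_SHARD:
            if conn is not None:
                conn.commit()
                conn.close()
            conn = sqlite3.connect(f"{db_prefix}.{shard:04d}.db")
            conn.execute("CREATE TABLE IF NOT EXISTS comments "
                         "(id TEXT PRIMARY KEY, parent_id TEXT, json TEXT)")
            shard, count = shard + 1, 0
        conn.execute("INSERT OR IGNORE INTO comments VALUES (?, ?, ?)",
                     (rec_id, parent_id, raw))
        count += 1
    if conn is not None:
        conn.commit()
        conn.close()
```

Note that naive count-based sharding can split a thread's comments across shards, so the second stage would then need to look up a given thread across all shards.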

### Stage 2. Converting sqlite database into jsonl flattened threads per submission

Create flattened comment threads in `careerguidance.jsonl` from the SQLite `.db` file:
```
./sqlite2threads.py careerguidance.db
```

You can tweak the `traverse` function to format the comments differently. For example, if you want email-style reply and reply-to-reply nesting, adjust the `prefix` variable to your liking; some ideas are already in the file.
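
As a rough illustration of the kind of tweak meant here (this mirrors the idea, not the repo's exact `traverse` implementation; the `children` dict and field names are assumptions):

```
# Illustrative depth-first flattening of a comment tree. `children`
# maps a parent id to its list of child comment dicts, pre-sorted in
# the desired order.
def traverse(children, node_id, depth=0):
    lines = []
    for comment in children.get(node_id, []):
        prefix = "> " * depth  # email-style quoting; or "  " * depth + "* " for bullets
        lines.append(prefix + comment["body"])
        lines.extend(traverse(children, comment["id"], depth + 1))
    return lines
```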

## Related tools

Some additional optional tools that accomplish the same in a roundabout way. They are mostly useful if you want to inspect the contents of `.zst` files via JSON dumps.

### Converting .zst to .jsonl files w/o thread conversion

This just extracts JSON data from `.zst` files. It is not needed if the desired result is the converted threads, as it only adds an extra step.

Convert a pair of files to `.jsonl` dumps:
```
./zst2jsonl.py careerguidance_*.zst
```
Convert a folder `data` containing many `.zst` files:

```
./zst2jsonl.py data
```
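
Once converted, a quick way to peek at a single record (the file name here is just an example) is:

```
head -1 careerguidance_comments.jsonl | python -m json.tool
```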

### Converting .jsonl to sqlite database

To go from `.jsonl` files generated by `zst2jsonl.py` to an SQLite database (from which `sqlite2threads.py` can then produce the flattened Reddit threads), use this tool:

```
./jsonl2sqlite.py careerguidance_*.jsonl
```

You need both the `_submissions` and `_comments` files of each pair, but you can pass files for multiple subs at once.