Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mesmere/RedditLemmyImporter
🔥 Anti-Reddit Aktion 🔥
- Host: GitHub
- URL: https://github.com/mesmere/RedditLemmyImporter
- Owner: mesmere
- License: apache-2.0
- Created: 2022-04-10T01:56:43.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-06-21T05:39:17.000Z (over 1 year ago)
- Last Synced: 2024-08-03T17:09:07.565Z (6 months ago)
- Topics: json, lemmy, reddit, sql
- Language: Kotlin
- Homepage:
- Size: 8.66 MB
- Stars: 70
- Watchers: 1
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-lemmy - RedditLemmyImporter - 🔥 Anti-Reddit Aktion 🔥 | ![GitHub stars](https://img.shields.io/github/stars/mesmere/RedditLemmyImporter) ![GitHub commit activity](https://img.shields.io/github/commit-activity/y/mesmere/RedditLemmyImporter) (Projects / Tools)
README
This project translates Reddit API responses into a PL/pgSQL script which loads the data into a [Lemmy](https://github.com/LemmyNet/lemmy/) database.
In other words, it takes Reddit posts/comments and imports them into Lemmy.
## Screenshots
Here's an example of a backup of the now-banned r/GenZhou up and running on a Lemmy test instance:
Community|Post
---|---
![comm screenshot](https://user-images.githubusercontent.com/95945959/166649549-1d4eddfc-2a4e-4b83-a8c4-ef5935584b30.png)|![post screenshot](https://user-images.githubusercontent.com/95945959/166649995-df61648f-4346-4d6d-8545-ad26414cbd7d.png)

## Getting input data
To get the JSON API response for a single post, you can call [the proper Reddit API](https://www.reddit.com/dev/api/#GET_comments_{article}) (requires an API key), or just append `.json` to the comments URL, like this:
```
HTML: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/
      https://www.reddit.com/r/GenZedong/comments/laucjl
JSON: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/.json?limit=10000
      https://www.reddit.com/r/GenZedong/comments/laucjl.json?limit=10000
```

Note that we've also added the `limit` parameter, because otherwise Reddit will pretty aggressively prune the comment tree with "Load more comments" links.
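The URL rewrite described above can be sketched in a few lines of Python. This is only an illustration, not part of RedditLemmyImporter: the helper name `to_json_url` is hypothetical, and it assumes the input is a plain comments URL like the ones shown.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(comments_url: str, limit: int = 10000) -> str:
    """Turn a Reddit comments URL into its .json form with a limit parameter.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(comments_url)
    # Keep the path exactly as given (trailing slash included) and append
    # ".json", which reproduces both URL forms shown above.
    path = parts.path + ".json"
    # Replace any existing query string with the limit parameter and drop
    # the fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, f"limit={limit}", ""))


print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/"))
print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl"))
```

The two calls print the same two JSON URLs listed in the example above.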
The response object contains the data for that one post and any replies. You can feed this directly into RedditLemmyImporter. However, if you want to import multiple posts, you can put multiple responses in the same input file, with each one separated by a newline. For example:
```
~ $ cat urls
https://www.reddit.com/r/GenZedong/comments/tpyft9/why_is_like_half_this_sub_made_of_trans_women/
https://www.reddit.com/r/GenZedong/comments/pet8zc/therapist_trans_stalin_isnt_real_she_cant_hurt/
https://www.reddit.com/r/GenZedong/comments/ttcyok/happy_trans_visibility_day_comrades/
https://www.reddit.com/r/GenZedong/comments/t9kbdm/women_of_genzedong_i_congratulate_you_for_your_day/
~ $ xargs -I URL curl --silent --user-agent "Subreddit archiver" --cookie "REDACTED" URL.json?limit=10000 < urls > dump.json
```

## Cloning an entire subreddit
If you need a complete scraping solution, check out [this Python script](https://lemmygrad.ml/comment/130292). It pulls posts into a local MongoDB database, which means you can run it on a cron to keep a local clone of posts as they're made. To export your `dump.json` try something like this:
```
mongoexport --uri="mongodb://localhost:27017/subredditArchiveDB" --collection=GenZedong --out=dump-wrapped.json
```

/r/GenZhou was scraped by `@[email protected]` using this method. Data is available up to about a week before it was banned:

https://mega.nz/file/knBwmTJL#PpqO0I3Jv-xw-o7RBWSi0JSScjSV7-4Eb3JR5HzTc5w

Note that the script buries the data we need within a top-level property named `json`. RedditLemmyImporter can handle this directly using the `--json-pointer` option. For example:
```
java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql --json-pointer=/json GenZhouArchive.json
```

## Generating a SQL script using the release binary
Prerequisites: Java 8 or above
Download the jar file from the [releases page](https://github.com/rileynull/RedditLemmyImporter/releases) and run it:
```
java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql dump.json
```

In this case we're generating a PL/pgSQL script that will load the data from `dump.json` into the comm `genzhouarchive` under the user `archive_bot`. The script will be written to `import.sql`. Full command usage:
```
Usage: redditLemmyImporter [OPTIONS] dump
      dump                    Path to the JSON dump file from the Reddit API. Required.
                              Specify - to read from stdin.
  -c, --comm=name             Target community name. Required.
  -u, --user=name             Target user name. Required.
      --json-pointer=pointer  Locate the Reddit API response somewhere within the top-level object in each input line.
                              See RFC 6901 for the JSON Pointer specification.
  -o, --output-file=file      Output file. Prints to stdout if this option isn't specified.
  -h, --help                  Show this help message and exit.
  -V, --version               Print version information and exit.
```

## Generating a SQL script using the source repository
Prerequisites: JDK >=1.8, Maven 3.
Clone the repo and cd to the source tree. Run:
```
mvn compile
mvn exec:java -Dexec.args="-c genzhouarchive -u archive_bot -o import.sql path/to/dump.json"
```

(This will pull down dependencies from Maven Central so you must be connected to the internet during the compile step.)
You could also package a release and then follow the instructions from the previous section:
```
mvn clean package
java -jar target/redditLemmyImporter-0.3-SNAPSHOT.jar -c genzhouarchive -u archive_bot -o import.sql dump.json
```

## Running the SQL script
Copy `import.sql` to the server running Postgres and run this:
```
psql --dbname=lemmy --username=lemmy --file=import.sql
```

Note that this uses the default values for the database name and database username. If you've changed them in your [Lemmy configuration](https://join-lemmy.org/docs/en/administration/configuration.html#full-config-with-default-values) then update the values accordingly.
**The target comm and target user must already exist in your Lemmy instance or the SQL script will do nothing.**
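If your instance uses a non-default database name or username, the `psql` invocation can be assembled from those values. The sketch below is only an illustration: `build_psql_command`, `lemmy_prod`, and `lemmy_admin` are hypothetical names, and the only flags used are the three `psql` options shown in the command above.

```python
import shlex


def build_psql_command(script="import.sql", db_name="lemmy", db_user="lemmy"):
    """Build the psql argument list for running the generated script.

    Hypothetical helper; the defaults mirror the command shown above.
    """
    return [
        "psql",
        f"--dbname={db_name}",
        f"--username={db_user}",
        f"--file={script}",
    ]


# Example with made-up, non-default database settings.
cmd = build_psql_command(db_name="lemmy_prod", db_user="lemmy_admin")
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on the server running Postgres.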
## Running the SQL script with Dockerized Lemmy
Copy `import.sql` to the server running Docker and run this:
```