Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/almayor/reddit-mods-dataset
A dataset of 25'834 largest communities on Reddit and their (anonymised) moderators.
dataset graph graph-algorithms reddit social-network-analysis
- Host: GitHub
- URL: https://github.com/almayor/reddit-mods-dataset
- Owner: almayor
- License: other
- Created: 2024-02-05T22:09:00.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2024-02-09T20:23:49.000Z (11 months ago)
- Last Synced: 2024-02-09T23:17:28.021Z (11 months ago)
- Topics: dataset, graph, graph-algorithms, reddit, social-network-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 27.5 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# RedditMods: Moderators of top-25'000 subreddits
![Kaggle](https://img.shields.io/badge/Kaggle-035a7d?style=for-the-badge&logo=kaggle&logoColor=white) ![Reddit](https://img.shields.io/badge/Reddit-%23FF4500.svg?style=for-the-badge&logo=Reddit&logoColor=white)

_RedditMods_ is a dataset that lists the moderators of the 25'834 largest and most popular communities on Reddit. The dataset is well suited to studying Reddit as a bipartite graph, where a moderator-node and a community-node are connected if the corresponding user moderates that subreddit. Clustering can then be performed to identify groups of subreddits with a particular leaning, or to recommend similar communities.
## Data Collection
The data was scraped with the associated [Jupyter Notebook](code/reddit-mods-ds.ipynb). The data was publicly available and was collected on 06 Feb 2024. All usernames were anonymised by hashing with SHA-256, so that they cannot be linked to the moderators' Reddit accounts.
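
As a rough illustration of the anonymisation step, the sketch below hashes a username with SHA-256 using Python's `hashlib`. The function name `anonymise`, the UTF-8 encoding, the hexadecimal output and the absence of any salt or normalisation are assumptions made for this example; the notebook may do this differently.

```python
import hashlib


def anonymise(username: str) -> str:
    """Return the SHA-256 hex digest of a username.

    Assumptions (not confirmed by the README): UTF-8 encoding,
    hex output, no salt and no case normalisation.
    """
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


# "example_moderator" is a made-up username used only for illustration.
digest = anonymise("example_moderator")
print(len(digest))  # a SHA-256 hex digest is always 64 characters long
```

The same input always yields the same digest, which is what keeps a moderator recognisable as a single node across several subreddits.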
## Description of Files
The data is available both as a table and a bipartite graph.
#### GEXF – data in graph format
1. `graph.gexf`
A bipartite graph, where nodes in the first group (having attribute `bipartite=0`) are moderators and nodes in the second group (having attribute `bipartite=1`) are subreddits. A moderator-node is connected with a subreddit-node if that moderator moderates this subreddit.
Tags:
* `size` on subreddit-nodes, indicating the subreddit's number of members
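
The two node groups can be separated by reading the `bipartite` attribute. The sketch below does so with the standard library only and runs on a tiny in-memory sample rather than the real file. It assumes the usual GEXF layout, in which node attributes are declared under `<attributes class="node">` and referenced from `<attvalue for="..." value="..."/>`; the sample document, the node ids and the function name `split_bipartite_nodes` are invented for illustration, and `graph.gexf` itself may be structured differently.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature GEXF document, not taken from the dataset.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected" mode="static">
    <attributes class="node" mode="static">
      <attribute id="0" title="bipartite" type="long"/>
      <attribute id="1" title="size" type="long"/>
    </attributes>
    <nodes>
      <node id="mod_hash_1" label="mod_hash_1">
        <attvalues><attvalue for="0" value="0"/></attvalues>
      </node>
      <node id="example_sub" label="example_sub">
        <attvalues>
          <attvalue for="0" value="1"/>
          <attvalue for="1" value="1000"/>
        </attvalues>
      </node>
    </nodes>
    <edges>
      <edge id="0" source="mod_hash_1" target="example_sub"/>
    </edges>
  </graph>
</gexf>
"""


def split_bipartite_nodes(gexf_source):
    """Return (moderator ids, {subreddit id: size}) from a GEXF file or file object."""
    root = ET.parse(gexf_source).getroot()

    # Map attribute ids to their titles, e.g. {"0": "bipartite", "1": "size"}.
    # The "{*}" namespace wildcard requires Python 3.8 or newer.
    titles = {
        attr.get("id"): attr.get("title")
        for attr in root.iterfind(".//{*}attributes[@class='node']/{*}attribute")
    }

    moderators = set()
    subreddit_sizes = {}
    for node in root.iterfind(".//{*}node"):
        values = {
            titles.get(av.get("for")): av.get("value")
            for av in node.iterfind("{*}attvalues/{*}attvalue")
        }
        if values.get("bipartite") == "0":
            moderators.add(node.get("id"))
        elif values.get("bipartite") == "1":
            size = values.get("size")
            subreddit_sizes[node.get("id")] = int(size) if size is not None else None
    return moderators, subreddit_sizes


moderators, subreddit_sizes = split_bipartite_nodes(io.StringIO(SAMPLE_GEXF))
print(moderators, subreddit_sizes)
```

For real analysis, a graph library is likely more convenient; for instance, networkx provides `read_gexf` for loading such files.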
#### CSV – data in table format
1. `subreddits.csv`
Contains 25K subreddits from [Reddit's Top](https://www.reddit.com/best/communities/1/), combined with the [list](http://www.reddit.com/subreddits/) of Reddit's most popular communities. The two lists are not identical, as described in the [Jupyter notebook](code/reddit-mods-ds.ipynb). The headers are:
* `name`: Name of subreddit
* `n_members`: Number of members
2. `moderators.csv`
Each row describes a subreddit-moderator pair:
* `subreddit`: Name of subreddit
* `moderator`: Username of moderator (anonymised by hashing)
3. `bots.csv`
List of moderators that were identified as bots by the simple procedure described under "Notes and warnings" below. These accounts have already been removed from `moderators.csv`.
* `name`: Username of bot
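
The subreddit-moderator pairs can be turned into a simple similarity measure between communities. The sketch below counts, for each pair of subreddits, how many moderators they share. It uses the `subreddit` and `moderator` columns described above; the sample rows, the comma delimiter and the function name `shared_moderator_counts` are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical rows in the shape of moderators.csv (values are made up).
SAMPLE_MODERATORS_CSV = """subreddit,moderator
python,a1
learnpython,a1
learnpython,b2
datascience,b2
cooking,c3
"""


def shared_moderator_counts(rows):
    """Count shared moderators for every pair of subreddits.

    `rows` is an iterable of dicts with "subreddit" and "moderator" keys.
    Returns {(subreddit_a, subreddit_b): count} with each pair sorted by name.
    """
    subreddits_by_moderator = defaultdict(set)
    for row in rows:
        subreddits_by_moderator[row["moderator"]].add(row["subreddit"])

    counts = defaultdict(int)
    for subreddits in subreddits_by_moderator.values():
        # Every pair of subreddits run by the same moderator gains one link.
        for a, b in combinations(sorted(subreddits), 2):
            counts[(a, b)] += 1
    return dict(counts)


rows = csv.DictReader(io.StringIO(SAMPLE_MODERATORS_CSV))
shared = shared_moderator_counts(rows)
print(shared)
```

To run this on the real file, replace the `io.StringIO(...)` object with `open("moderators.csv", newline="")`. Moderators of many subreddits generate a quadratic number of pairs, so a cap or a sparse-matrix approach may be needed at full scale.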
## Examples
* [Visualising a cluster of subreddits moderated by a group of users](./example/example.ipynb)
## Notes and warnings
I used a very simple procedure to filter out auto-moderators: (1) a short list of known bots (e.g. `u/AutoModerator`), (2) username starts or ends with `bot`. An additional procedure to identify and remove bots might be necessary. For an example, see [this notebook](example/example.ipynb).
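
The two rules above can be sketched as follows. The list `KNOWN_BOTS` contains only the one account named in this README, and the case-insensitive matching and stripping of a leading `u/` are assumptions; the notebook's actual list and matching rules may differ.

```python
# Illustrative list: only u/AutoModerator is named in the README.
KNOWN_BOTS = {"automoderator"}


def looks_like_bot(username: str) -> bool:
    """Apply the two simple rules: known-bot list, or name starts/ends with 'bot'.

    Case-insensitive matching is an assumption of this sketch.
    """
    name = username.lower()
    if name.startswith("u/"):
        name = name[2:]  # drop the "u/" prefix if present
    return name in KNOWN_BOTS or name.startswith("bot") or name.endswith("bot")


# Made-up usernames for illustration.
candidates = ["u/AutoModerator", "example_bot", "botany_lover", "robotics_fan"]
flagged = [name for name in candidates if looks_like_bot(name)]
print(flagged)
```

The made-up name `botany_lover` is flagged although it is not obviously a bot, while a bot with an unrelated name would pass, which illustrates why an additional procedure may be needed. Because the published usernames are hashed, a name-based filter like this can only run on raw usernames before anonymisation.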