https://github.com/almayor/reddit-mods-dataset

A dataset of the 25'834 largest communities on Reddit and their (anonymised) moderators.

Host: GitHub
URL: https://github.com/almayor/reddit-mods-dataset
Owner: almayor
License: other
Created: 2024-02-05T22:09:00.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-10-11T11:55:48.000Z (9 months ago)
Last Synced: 2024-12-02T20:48:05.226Z (8 months ago)
Topics: dataset, graph, graph-algorithms, reddit, social-network-analysis
Language: Jupyter Notebook
Homepage:
Size: 36.7 MB
Stars: 3
Watchers: 1
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

README

# RedditMods: Moderators of top-25'000 subreddits
![Kaggle](https://img.shields.io/badge/Kaggle-035a7d?style=for-the-badge&logo=kaggle&logoColor=white) ![Reddit](https://img.shields.io/badge/Reddit-%23FF4500.svg?style=for-the-badge&logo=Reddit&logoColor=white)

_RedditMods_ is a dataset that lists the moderators of the 25'834 largest and most popular communities on Reddit. The dataset is well suited to studying Reddit as a bipartite graph, in which a moderator node and a community node are connected if that user moderates that subreddit. Clustering can then be used to identify groups of subreddits with a particular leaning, or to recommend similar communities.

## Data Collection

The data was scraped using the associated [Jupyter Notebook](code/reddit-mods-ds.ipynb). The data was publicly available and was collected on 06 Feb 2024. All usernames were anonymised by hashing with SHA-256, so that they cannot be linked to the moderators' Reddit accounts.
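
As an illustration of the anonymisation step, here is a minimal sketch of hashing a username with SHA-256 using Python's standard `hashlib`. The function name `anonymise`, the UTF-8 encoding, and the absence of any salt or case normalisation are assumptions made for this example; the notebook may do this differently.

```python
import hashlib


def anonymise(username: str) -> str:
    """Return the SHA-256 hex digest of a username (assumed: UTF-8, no salt)."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


# Hypothetical username, for illustration only.
digest = anonymise("example_moderator")
print(len(digest))  # a SHA-256 hex digest is always 64 characters long
```

Because the hash is deterministic, the same account maps to the same digest in every row, which is what allows moderators to be matched across subreddits.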

## Description of Files

The data is available both as a table and a bipartite graph.

#### GEXF – data in graph format

1. `graph.gexf`

A bipartite graph, where nodes in the first group (having attribute `bipartite=0`) are moderators and nodes in the second group (having attribute `bipartite=1`) are subreddits. A moderator-node is connected with a subreddit-node if that moderator moderates this subreddit.

Attributes:
* `size` on subreddit nodes, giving the number of members of the subreddit

#### CSV – data in table format

1. `subreddits.csv`

Contains 25K subreddits from [Reddit's Top](https://www.reddit.com/best/communities/1/), combined with the [list](http://www.reddit.com/subreddits/) of Reddit's most popular communities. The two lists are not identical, as described in the [Jupyter notebook](code/reddit-mods-ds.ipynb). The columns are:

* `name`: Name of subreddit
* `n_members`: Number of members

2. `moderators.csv`

Each row describes a subreddit-moderator pair:

* `subreddit`: Name of subreddit
* `moderator`: Username of moderator (anonymised by hashing)
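
The subreddit-moderator pairs can be turned into the bipartite structure directly. The sketch below uses only the Python standard library; it groups moderators by subreddit and scores pairs of subreddits by the Jaccard similarity of their moderator sets. The sample rows, the function names and the `threshold` parameter are made up for illustration, and Jaccard similarity is just one possible choice of measure, not necessarily the one used in the example notebook.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Tiny made-up sample in the same two-column layout as `moderators.csv`.
SAMPLE = """subreddit,moderator
cats,hash_a
cats,hash_b
kittens,hash_a
kittens,hash_b
kittens,hash_c
chess,hash_d
"""


def load_moderators(handle):
    """Map each subreddit to the set of (hashed) moderators listed for it."""
    mods_by_subreddit = defaultdict(set)
    for row in csv.DictReader(handle):
        mods_by_subreddit[row["subreddit"]].add(row["moderator"])
    return dict(mods_by_subreddit)


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(mods_by_subreddit, threshold=0.5):
    """Return (subreddit, subreddit, score) for pairs at or above the threshold."""
    pairs = []
    for s1, s2 in combinations(sorted(mods_by_subreddit), 2):
        score = jaccard(mods_by_subreddit[s1], mods_by_subreddit[s2])
        if score >= threshold:
            pairs.append((s1, s2, score))
    return pairs


mods = load_moderators(io.StringIO(SAMPLE))
print(similar_pairs(mods))
```

To run this on the real data, pass an open file instead of the sample, e.g. `open("moderators.csv", newline="")`. The all-pairs loop is only practical on small subsets; for the full dataset it would be cheaper to iterate over each moderator's subreddits and compare only pairs that share at least one moderator.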

3. `bots.csv`

List of moderators that were identified as bots by the simple procedure described in the Notes and warnings section below. These accounts have already been removed from `moderators.csv`.

* `name`: Username of bot

## Examples

* [Visualising a cluster of subreddits moderated by a group of users](./example/example.ipynb)

## Notes and warnings

I used a very simple procedure to filter out auto-moderators: (1) matching against a short list of known bots (e.g. `u/AutoModerator`), and (2) checking whether the username starts or ends with `bot`. An additional procedure to identify and remove bots might be necessary. For an example, see [this notebook](example/example.ipynb).
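
The two rules above could look roughly like the following sketch. The contents of `KNOWN_BOTS`, the case-insensitive comparison and the stripping of a leading `u/` are assumptions; only `u/AutoModerator` is named in this README, and the notebook's actual list may be longer.

```python
# Hypothetical short list; this README names only u/AutoModerator explicitly.
KNOWN_BOTS = {"automoderator"}


def looks_like_bot(username: str) -> bool:
    """Apply the two rules: known-bot list, or name starts or ends with 'bot'."""
    name = username.lower()  # assumption: matching is case-insensitive
    if name.startswith("u/"):
        name = name[2:]  # assumption: an optional 'u/' prefix is ignored
    return name in KNOWN_BOTS or name.startswith("bot") or name.endswith("bot")


for candidate in ["u/AutoModerator", "HelperBot", "botanist_jane", "alice"]:
    print(candidate, looks_like_bot(candidate))
```

Note that `botanist_jane` is flagged even though it is probably a person, which shows how crude the rule is. A check like this has to run on the raw usernames before hashing, since the hashed values in `moderators.csv` no longer carry any such pattern.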
