Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/huitema/socspider

Social spider for Mastodon and the fediverse, implemented in Python

mastodon mastodon-api mastodon-discovery

Host: GitHub
URL: https://github.com/huitema/socspider
Owner: huitema
License: mit
Created: 2022-12-26T22:25:53.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-01-02T02:34:00.000Z (about 2 years ago)
Last Synced: 2024-04-13T21:03:20.045Z (10 months ago)
Topics: mastodon, mastodon-api, mastodon-discovery
Language: Python
Homepage:
Size: 34.2 KB
Stars: 6
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# socspider

This toy Python program demonstrates how to build a "social spider" using the public API of Mastodon.

The program starts by reading toots in the public timeline of a "start" instance, by default `mastodon.social`. It analyzes the toots to find names of instances and handles of users in the Fediverse, and builds an approximation of the social graph by recording, for each discovered user, the handles of the users through which it was seen.
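The analysis step described above might look roughly like the following sketch. It is not the code from `socspider.py`: the function names and the `seen_by` dictionary are hypothetical, and the sketch assumes that each toot is a status object as returned by the Mastodon API, with an `account` entry, a `mentions` list and an optional `reblog`, where local accounts have an `acct` value without a domain.

```python
def normalize_handle(acct, home_instance):
    """Return a full user@instance handle; local accounts lack the domain."""
    return acct if "@" in acct else f"{acct}@{home_instance}"


def record_toot(toot, home_instance, seen_by):
    """Record every handle found in a toot as 'seen by' the toot's author.

    seen_by (hypothetical structure) maps each discovered handle to the
    set of handles through which it was discovered.
    """
    author = normalize_handle(toot["account"]["acct"], home_instance)
    seen_by.setdefault(author, set())

    # Handles mentioned in the toot, plus the original author of a boost.
    found = [mention["acct"] for mention in toot.get("mentions", [])]
    reblog = toot.get("reblog")
    if reblog:
        found.append(reblog["account"]["acct"])

    for acct in found:
        handle = normalize_handle(acct, home_instance)
        if handle != author:
            seen_by.setdefault(handle, set()).add(author)


def instances_of(seen_by):
    """List the instance names that appear in the discovered handles."""
    return sorted({handle.split("@", 1)[1] for handle in seen_by})
```

Calling `record_toot` on each toot of a timeline fills `seen_by`, and `instances_of(seen_by)` then gives the candidate instances to visit next.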

The program is not fast. It takes about 1 minute on a laptop to learn the profiles of 100 users, so processing 10 million Mastodon users could take about 2 months. But then, the program is not optimized at all. A lot of time is spent waiting for responses from remote servers. This could be reduced by running several queries in parallel, in multiple threads. Running on a big 256-core server, the 10 million accounts mentioned above could be parsed in about 6 hours. Running on a cluster of machines would be faster still.
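The estimates above can be checked with a few lines of arithmetic, and the idea of running queries in parallel can be sketched with a thread pool. This is only an illustration: `crawl_time_minutes`, `fetch_all` and the `fetch_profile` callback are hypothetical names, and the estimate assumes that throughput scales linearly with the number of workers, which real servers and rate limits may not allow.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_time_minutes(users, users_per_minute=100, workers=1):
    """Estimated crawl time, assuming perfectly linear scaling."""
    return users / (users_per_minute * workers)


def fetch_all(handles, fetch_profile, max_workers=16):
    """Call fetch_profile(handle) for each handle, several at a time.

    Threads fit this workload because most of the time is spent
    waiting for remote servers rather than computing.
    """
    handles = list(handles)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(handles, pool.map(fetch_profile, handles)))


single = crawl_time_minutes(10_000_000)                 # 100,000 minutes
parallel = crawl_time_minutes(10_000_000, workers=256)
print(f"sequential: {single / 60 / 24:.1f} days")       # roughly 2 months
print(f"256 workers: {parallel / 60:.1f} hours")        # roughly 6 hours
```

Under these assumptions, 10 million users at 100 users per minute take 100,000 minutes, and dividing that by 256 workers gives a little over 6 hours.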

The point here is not speed. The point is to demonstrate the power of public APIs such as "reading the public timeline", "reading the data of a toot", "reading a thread starting with a toot", "reading who favorited a toot" or "reading the public messages sent by an account". In the Mastodon implementation, these APIs are public. (The same APIs appear to be access controlled in servers running Pleroma.)
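For reference, the operations listed above correspond to REST endpoints of the Mastodon API. The sketch below shows one way to build the URLs and fetch the JSON responses with the standard library. The helper names and the `ENDPOINTS` table are hypothetical and not taken from `socspider.py`, and a given server may restrict or rate-limit these calls.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Mastodon REST paths for the operations named above.
ENDPOINTS = {
    "public_timeline": "/api/v1/timelines/public",
    "toot": "/api/v1/statuses/{id}",
    "thread": "/api/v1/statuses/{id}/context",
    "favourited_by": "/api/v1/statuses/{id}/favourited_by",
    "account_toots": "/api/v1/accounts/{id}/statuses",
}


def endpoint_url(instance, name, id=None, **params):
    """Build the URL of one of the endpoints on the given instance."""
    path = ENDPOINTS[name].format(id=id)
    query = f"?{urlencode(params)}" if params else ""
    return f"https://{instance}{path}{query}"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (no authentication)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(endpoint_url("mastodon.social", "public_timeline", limit=20))` would request up to 20 toots from the public timeline of the default start instance.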

The power of the API could be used for good or for bad. For example, the spider could be augmented to also collect hashtags read by users, or to assign weights to the relations between users. On the good side, this would enable building catalogs of servers or directories of users, or adding a search function to the Fediverse. On the bad side, this is exactly the kind of data required for "serving better ads", or for finding targets of harassment.

## Using the spider

To use the spider, you need to clone this repository, then run:

```
python3 socspider.py  [start-instance-url]
```

The spidering will start at the designated instance, and will crawl the Fediverse until it has learned at least 100 new user handles. The data will be saved in JSON format in the designated file.

You can run the program several times. If the data file already exists when the program is launched, it will be loaded into memory, and the results of the spidering will be added to the existing data.
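The load, merge and save cycle described above could be sketched as follows. The actual layout of the data file is not documented here, so this example assumes a hypothetical format: a JSON object that maps each user handle to the list of handles through which it was seen. The function names are illustrative as well.

```python
import json
import os


def load_data(path):
    """Load existing results, or start empty if the file does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def merge_results(data, new_results):
    """Add newly spidered handles to the existing data, without duplicates."""
    for handle, seen_by in new_results.items():
        known = set(data.get(handle, []))
        data[handle] = sorted(known | set(seen_by))
    return data


def save_data(path, data):
    """Write the merged results back in JSON format."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, sort_keys=True)
```

Each run would then call `load_data`, merge what it has just learned with `merge_results`, and write the result back with `save_data`, so repeated runs accumulate handles in the same file.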

## Participating

If you want to improve this code or otherwise comment on it, feel free to open an issue or propose a PR here. Or, contact "[email protected]" on Mastodon.