https://github.com/cldellow/bayesky

Bluesky firehose classifier.

Host: GitHub
URL: https://github.com/cldellow/bayesky
Owner: cldellow
License: mit
Created: 2024-11-17T19:00:44.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-11-23T04:17:13.000Z (over 1 year ago)
Last Synced: 2025-03-20T17:55:24.629Z (over 1 year ago)
Language: Go
Homepage:
Size: 34.2 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# bayesky

Bluesky is designed to be hackable.

It comes stock with a reverse chronological feed and a proprietary "For you"
feed. But developers can [author their own feeds](https://docs.bsky.app/docs/starter-templates/custom-feeds)
that users can subscribe to.

My goal: write a Humans Being Bros feed. The feed should include threads
where:

1. OP asks a question
2. others respond
3. OP expresses gratitude

It could be ranked by size of thread, # of likes on question or non-OP responses,
diversity of repliers, etc.

# How to discover content?

In addition to the official Relays, Bluesky publishes a lighter-weight feed that is
consumable via websocket. This is called the [Jetstream](https://github.com/bluesky-social/jetstream?tab=readme-ov-file).

You can watch the stream of new posts via the `app.bsky.feed.post` collection:

```bash
$ websocat wss://jetstream2.us-east.bsky.network/subscribe\?wantedCollections=app.bsky.feed.post
```

You'll see data like this sample post:

```json
{
  "did": "did:plc:w5l6zvlmyz3r2cl36bfqlq7a",
  "time_us": 1731868440607689,
  "type": "com",
  "kind": "commit",
  "commit": {
    "rev": "3lb627l4oc62h",
    "type": "c",
    "operation": "create",
    "collection": "app.bsky.feed.post",
    "rkey": "3lb627kz72s2r",
    "record": {
      "$type": "app.bsky.feed.post",
      "createdAt": "2024-11-17T18:33:58.271Z",
      "langs": [
        "en"
      ],
      "text": "Test post: testing JetStream."
    },
    "cid": "bafyreigwgz44ovvc4lu2nyklh3meclhjxnipewxaoswcdm4mj3vqled4ee"
  }
}
```
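
As a rough illustration of consuming these events in Go, the sketch below decodes a message shaped like the sample above and pulls out the post text. The type and function names (`Event`, `Commit`, `PostRecord`, `ParsePost`) are made up for this example; only the JSON field names come from the sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the fields of the sample message that matter here.
// The Go names are illustrative; the JSON keys come from the sample.
type Event struct {
	DID    string  `json:"did"`
	TimeUS int64   `json:"time_us"`
	Kind   string  `json:"kind"`
	Commit *Commit `json:"commit"`
}

type Commit struct {
	Operation  string          `json:"operation"`
	Collection string          `json:"collection"`
	RKey       string          `json:"rkey"`
	Record     json.RawMessage `json:"record"` // decoded later, once the collection is known
}

type PostRecord struct {
	CreatedAt string   `json:"createdAt"`
	Langs     []string `json:"langs"`
	Text      string   `json:"text"`
}

// ParsePost decodes one message. ok is false when the message is valid JSON
// but is not a newly created app.bsky.feed.post record.
func ParsePost(raw []byte) (ev Event, post PostRecord, ok bool, err error) {
	if err = json.Unmarshal(raw, &ev); err != nil {
		return ev, post, false, err
	}
	c := ev.Commit
	if c == nil || c.Operation != "create" || c.Collection != "app.bsky.feed.post" {
		return ev, post, false, nil
	}
	if err = json.Unmarshal(c.Record, &post); err != nil {
		return ev, post, false, err
	}
	return ev, post, true, nil
}

func main() {
	raw := []byte(`{
	  "did": "did:plc:w5l6zvlmyz3r2cl36bfqlq7a",
	  "time_us": 1731868440607689,
	  "kind": "commit",
	  "commit": {
	    "operation": "create",
	    "collection": "app.bsky.feed.post",
	    "rkey": "3lb627kz72s2r",
	    "record": {"$type": "app.bsky.feed.post", "langs": ["en"], "text": "Test post: testing JetStream."}
	  }
	}`)
	ev, post, ok, err := ParsePost(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if ok {
		fmt.Println(ev.DID, ev.Commit.RKey, post.Text)
	}
}
```

Keeping `record` as a `json.RawMessage` means likes and other collections can be skipped without paying to decode their bodies.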

You might also want to track likes, e.g. for ranking. Subscribe to `app.bsky.feed.like`
to see events like:

```json
{
  "did": "did:plc:ko26dqkkmj3da6yc3fmo3ate",
  "time_us": 1731870626607952,
  "type": "com",
  "kind": "commit",
  "commit": {
    "rev": "3lb64aov7ii23",
    "type": "c",
    "operation": "create",
    "collection": "app.bsky.feed.like",
    "rkey": "3lb64aov4kq23",
    "record": {
      "$type": "app.bsky.feed.like",
      "createdAt": "2024-11-17T19:10:23.248Z",
      "subject": {
        "cid": "bafyreibn5x7unywvqytekgfg43kwruq4zyqzjnfa4kn7dp2rc7tq2mgvoy",
        "uri": "at://did:plc:65otgq6ubushgm3vk5icuxzw/app.bsky.feed.post/3lb3bqy7ibe2v"
      }
    },
    "cid": "bafyreie2avqg2zhg4dxuibxunlnjxjr4fpzdhvvpbzxquvfteax2qvjrne"
  }
}
```
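
To connect a like back to a post, the `subject.uri` has to be split into the post author's DID and the record key, which correspond to the `did` and `commit.rkey` fields of the post's own event. A minimal sketch, assuming the `at://<did>/<collection>/<rkey>` shape seen in the sample; `PostRef` and `ParsePostURI` are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PostRef identifies a post by its author and record key (illustrative type).
type PostRef struct {
	AuthorDID string
	RKey      string
}

// ParsePostURI splits an at:// URI that points at an app.bsky.feed.post record.
// It rejects URIs for other collections and anything with missing parts.
func ParsePostURI(uri string) (PostRef, error) {
	const prefix = "at://"
	if !strings.HasPrefix(uri, prefix) {
		return PostRef{}, errors.New("not an at:// URI")
	}
	parts := strings.Split(strings.TrimPrefix(uri, prefix), "/")
	if len(parts) != 3 || parts[0] == "" || parts[2] == "" {
		return PostRef{}, errors.New("expected at://<did>/<collection>/<rkey>")
	}
	if parts[1] != "app.bsky.feed.post" {
		return PostRef{}, errors.New("subject is not a post")
	}
	return PostRef{AuthorDID: parts[0], RKey: parts[2]}, nil
}

func main() {
	ref, err := ParsePostURI("at://did:plc:65otgq6ubushgm3vk5icuxzw/app.bsky.feed.post/3lb3bqy7ibe2v")
	if err != nil {
		fmt.Println("bad URI:", err)
		return
	}
	fmt.Println(ref.AuthorDID, ref.RKey)
}
```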

# Overall approach

The firehose operates on events like "new post" or "liked post", but we want to surface
something higher-level like "threads with this kind of interaction".

The first building block will be a classifier for posts, so we can classify posts like
this:

1. Post is a top-level post that asks a question
2. Post is a non-top-level post that replies to a post in class 1, and is
by a different author. (Wrinkle: what if the question is itself a multi-post
thread?)
3. Post is a non-top-level post that replies to a post in class 2, and is
by the same author as the thread starter, and expresses gratitude.

I _think_ a naive Bayes classifier might be enough here, especially if we can help
it along by providing some clever feature extraction, e.g. emitting `AUTHOR_IS_THREAD_AUTHOR`
or `AUTHOR_IS_NOT_THREAD_AUTHOR` features.
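
A sketch of what that feature extraction might look like. The `Post` struct, the `Features` function, and the extra `HAS_QUESTION_MARK` and `IS_TOP_LEVEL` features are assumptions for illustration; only the two `AUTHOR_IS_*` feature names come from the text above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Post is a hypothetical, already-resolved view of a post.
// ThreadAuthorDID is empty for top-level posts.
type Post struct {
	Text            string
	AuthorDID       string
	ThreadAuthorDID string
}

// Features turns a post into classifier features: lowercased word tokens plus
// a few synthetic features. The synthetic ones are uppercase, so they cannot
// collide with word tokens.
func Features(p Post) []string {
	isSeparator := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '\''
	}
	feats := strings.FieldsFunc(strings.ToLower(p.Text), isSeparator)
	if strings.Contains(p.Text, "?") {
		feats = append(feats, "HAS_QUESTION_MARK")
	}
	switch {
	case p.ThreadAuthorDID == "":
		feats = append(feats, "IS_TOP_LEVEL")
	case p.AuthorDID == p.ThreadAuthorDID:
		feats = append(feats, "AUTHOR_IS_THREAD_AUTHOR")
	default:
		feats = append(feats, "AUTHOR_IS_NOT_THREAD_AUTHOR")
	}
	return feats
}

func main() {
	reply := Post{
		Text:            "Thanks, everyone!",
		AuthorDID:       "did:plc:op",
		ThreadAuthorDID: "did:plc:op",
	}
	fmt.Println(Features(reply)) // prints [thanks everyone AUTHOR_IS_THREAD_AUTHOR]
}
```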

I know LLMs are the new hotness, but they're expensive to run. A well-tuned naive Bayes
classifier should be able to handle the firehose on a single core without breaking
a sweat.
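
For reference, the core of such a classifier is small. The following is a generic multinomial naive Bayes with add-one smoothing, written as a sketch rather than as this repo's implementation; the type and method names are illustrative. Classifying is a handful of map lookups and additions per feature, which is why a single core should cope.

```go
package main

import (
	"fmt"
	"math"
)

// NaiveBayes holds per-class document and feature counts.
type NaiveBayes struct {
	docs      map[string]int            // class -> number of training documents
	counts    map[string]map[string]int // class -> feature -> occurrences
	totals    map[string]int            // class -> total feature occurrences
	vocab     map[string]bool           // every feature seen during training
	totalDocs int
}

func NewNaiveBayes() *NaiveBayes {
	return &NaiveBayes{
		docs:   map[string]int{},
		counts: map[string]map[string]int{},
		totals: map[string]int{},
		vocab:  map[string]bool{},
	}
}

// Train records one labelled document, given as a list of features.
func (nb *NaiveBayes) Train(class string, feats []string) {
	if nb.counts[class] == nil {
		nb.counts[class] = map[string]int{}
	}
	nb.docs[class]++
	nb.totalDocs++
	for _, f := range feats {
		nb.counts[class][f]++
		nb.totals[class]++
		nb.vocab[f] = true
	}
}

// Classify returns the class with the highest log-probability, or "" if the
// model is untrained. Ties are broken by class name so results are stable.
func (nb *NaiveBayes) Classify(feats []string) string {
	best, bestScore := "", math.Inf(-1)
	// Add-one smoothing, with one extra vocabulary slot for unseen features.
	v := float64(len(nb.vocab)) + 1
	for class, n := range nb.docs {
		score := math.Log(float64(n) / float64(nb.totalDocs)) // class prior
		for _, f := range feats {
			score += math.Log((float64(nb.counts[class][f]) + 1) / (float64(nb.totals[class]) + v))
		}
		if score > bestScore || (score == bestScore && class < best) {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	nb := NewNaiveBayes()
	nb.Train("question", []string{"how", "do", "i", "HAS_QUESTION_MARK"})
	nb.Train("other", []string{"nice", "weather", "today"})
	fmt.Println(nb.Classify([]string{"how", "HAS_QUESTION_MARK"})) // prints question
}
```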

# Training the classifier

A challenge with naive Bayes is training it. The classic approach is to label a bunch
of samples as positive or negative, then train a model.

Labelling is tedious and sucks.

Maybe there's room here for an LLM to be used: you could express your desired classes
in plain language, and apply an LLM to generate best-effort labels. A human quickly
reviews them and accepts/rejects the labels, and that becomes your training set.
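
The review step could be as simple as filtering proposed labels through an accept/reject decision. In this sketch, `Proposal` and `Review` are hypothetical names, the LLM call itself is left out, and the `accept` callback stands in for a human pressing y/n:

```go
package main

import "fmt"

// Proposal is a best-effort label suggested for a post (hypothetical type).
type Proposal struct {
	Text  string
	Label string
}

// Review keeps only the proposals the reviewer accepts; the result is the
// training set.
func Review(proposals []Proposal, accept func(Proposal) bool) []Proposal {
	var training []Proposal
	for _, p := range proposals {
		if accept(p) {
			training = append(training, p)
		}
	}
	return training
}

func main() {
	proposals := []Proposal{
		{Text: "How do I reset my router?", Label: "question"},
		{Text: "Thanks, that fixed it!", Label: "gratitude"},
		{Text: "Nice weather today", Label: "question"}, // a bad suggestion
	}
	// Stand-in reviewer: reject the one obviously wrong label.
	training := Review(proposals, func(p Proposal) bool {
		return p.Text != "Nice weather today"
	})
	for _, p := range training {
		fmt.Printf("%-10s %s\n", p.Label, p.Text)
	}
}
```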

Perhaps Llamafile with a reasonably-sized model could be used here?

# Ops questions

The Bluesky firehose is not _that_ big at present: roughly 200 posts/second and 800 likes/second.

This is just a side project, so being a little lossy is fine if it simplifies perf problems.

My overall hope is to do something like this:

- apply a Bayes classifier to the stream of posts. Hopefully we discard 99.9%+ of posts.
- track the IDs of non-discarded posts
- only track likes for non-discarded posts; buffer them in-memory and checkpoint to a SQLite
DB on some cadence so that we can interrupt/resume Jetstream processing via `cursor`
- retain persisted data for at most 7 days

The Bayes classification can be farmed out amongst threads, but the overall processing needs
to be sequential -- e.g. we have to know we've processed post X before processing any likes for it,
or before processing replies to post X.
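
One way to get both properties is to classify in goroutines but hand results to a single consumer in arrival order. The sketch below queues one result channel per post; `ProcessInOrder` is a hypothetical name and `classify` is a placeholder for the real classifier.

```go
package main

import (
	"fmt"
	"strings"
)

type result struct {
	text string
	keep bool
}

// classify is a stand-in for the Bayes classifier.
func classify(text string) bool {
	return strings.Contains(text, "?")
}

// ProcessInOrder classifies posts concurrently but calls handle strictly in
// input order, from the caller's goroutine. The buffered queue of result
// channels bounds the number of in-flight classifications to about `workers`.
func ProcessInOrder(posts []string, workers int, handle func(text string, keep bool)) {
	pending := make(chan chan result, workers)
	go func() {
		for _, p := range posts {
			ch := make(chan result, 1)
			pending <- ch // blocks when too much work is outstanding
			go func(text string) {
				ch <- result{text: text, keep: classify(text)}
			}(p)
		}
		close(pending)
	}()
	// Results are read in the order they were queued, not the order they finish.
	for ch := range pending {
		r := <-ch
		handle(r.text, r.keep)
	}
}

func main() {
	posts := []string{"How do I fix this?", "lunch was great", "Anyone know a good vet?"}
	ProcessInOrder(posts, 4, func(text string, keep bool) {
		fmt.Printf("keep=%-5v %s\n", keep, text)
	})
}
```

In the real pipeline the input would be the event stream rather than a slice, and `handle` is where post tracking and like counting would happen.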

# Golang notes

It's been years since I wrote Go code. I'm relying on ChatGPT a lot. Useful commands:

```bash
$ go test ./... # run all tests, recursively

$ gofmt -w . # format all files, recursively
```
