{"id":21714486,"url":"https://github.com/cldellow/bayesky","last_synced_at":"2026-05-20T19:02:52.757Z","repository":{"id":263316372,"uuid":"889998390","full_name":"cldellow/bayesky","owner":"cldellow","description":"Bluesky firehose classifier.","archived":false,"fork":false,"pushed_at":"2024-11-23T04:17:13.000Z","size":35,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T17:55:24.629Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cldellow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-17T19:00:44.000Z","updated_at":"2024-11-23T04:17:17.000Z","dependencies_parsed_at":"2024-11-17T20:37:31.046Z","dependency_job_id":"959a9d1c-9c58-45ff-b291-cba1f4b6a2dc","html_url":"https://github.com/cldellow/bayesky","commit_stats":null,"previous_names":["cldellow/bayesky"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fbayesky","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fbayesky/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fbayesky/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fbayesky/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cldellow","download_url":"https://codeload.github.com/cldellow/bayesky/tar.gz/refs/heads
/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244676451,"owners_count":20491828,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-26T00:35:36.406Z","updated_at":"2026-05-20T19:02:47.703Z","avatar_url":"https://github.com/cldellow.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bayesky\n\nBluesky is designed to be hackable.\n\nIt comes stock with a reverse chronological feed and a proprietary \"For you\"\nfeed. But developers can [author their own feeds](https://docs.bsky.app/docs/starter-templates/custom-feeds)\nthat users can subscribe to.\n\nMy goal: write a Humans Being Bros feed. The inclusion criteria for this feed\nis threads where:\n\n1. OP asks a question\n2. others respond\n3. OP expresses gratitude\n\nIt could be ranked by size of thread, # of likes on question or non-OP responses,\ndiversity of repliers, etc.\n\n# How to discover content?\n\nIn addition to the official Relays, Bluesky publishes a lighter-weight feed that is\nconsumable via websocket. 
This is called [Jetstream](https://github.com/bluesky-social/jetstream?tab=readme-ov-file).\n\nYou can watch the stream of new posts via the `app.bsky.feed.post` collection:\n\n```\n$ websocat wss://jetstream2.us-east.bsky.network/subscribe\\?wantedCollections=app.bsky.feed.post\n```\n\nYou'll see data like this sample post:\n\n```json\n{\n  \"did\": \"did:plc:w5l6zvlmyz3r2cl36bfqlq7a\",\n  \"time_us\": 1731868440607689,\n  \"type\": \"com\",\n  \"kind\": \"commit\",\n  \"commit\": {\n    \"rev\": \"3lb627l4oc62h\",\n    \"type\": \"c\",\n    \"operation\": \"create\",\n    \"collection\": \"app.bsky.feed.post\",\n    \"rkey\": \"3lb627kz72s2r\",\n    \"record\": {\n      \"$type\": \"app.bsky.feed.post\",\n      \"createdAt\": \"2024-11-17T18:33:58.271Z\",\n      \"langs\": [\n        \"en\"\n      ],\n      \"text\": \"Test post: testing JetStream.\"\n    },\n    \"cid\": \"bafyreigwgz44ovvc4lu2nyklh3meclhjxnipewxaoswcdm4mj3vqled4ee\"\n  }\n}\n```\n\nYou might also want to track likes, e.g. for ranking. 
You can subscribe to `app.bsky.feed.like`,\nto see things like:\n\n```json\n{\n  \"did\": \"did:plc:ko26dqkkmj3da6yc3fmo3ate\",\n  \"time_us\": 1731870626607952,\n  \"type\": \"com\",\n  \"kind\": \"commit\",\n  \"commit\": {\n    \"rev\": \"3lb64aov7ii23\",\n    \"type\": \"c\",\n    \"operation\": \"create\",\n    \"collection\": \"app.bsky.feed.like\",\n    \"rkey\": \"3lb64aov4kq23\",\n    \"record\": {\n      \"$type\": \"app.bsky.feed.like\",\n      \"createdAt\": \"2024-11-17T19:10:23.248Z\",\n      \"subject\": {\n        \"cid\": \"bafyreibn5x7unywvqytekgfg43kwruq4zyqzjnfa4kn7dp2rc7tq2mgvoy\",\n        \"uri\": \"at://did:plc:65otgq6ubushgm3vk5icuxzw/app.bsky.feed.post/3lb3bqy7ibe2v\"\n      }\n    },\n    \"cid\": \"bafyreie2avqg2zhg4dxuibxunlnjxjr4fpzdhvvpbzxquvfteax2qvjrne\"\n  }\n}\n```\n\n# Overall approach\n\nThe firehose operates on events like \"new post\" or \"liked post\", but we want to surface\nsomething higher-level like \"threads with this kind of interaction\".\n\nThe first building block will be a classifier for posts, so we can classify posts like\nthis:\n\n1. Post is a top-level post that asks a question\n2. Post is a non-top-level post that replies to a post in class 1, and is\n   by a different author. (Wrinkle: what if the question is itself a multi-post\n\t thread?)\n3. Post is a non-top-level post that replies to a post in class 2, and is\n   by the same author as the thread starter, and expresses gratitude.\n\nI _think_ a naive Bayes classifier might be enough here, especially if we can help\nit along by providing some clever feature extraction, e.g. emitting `AUTHOR_IS_THREAD_AUTHOR`\nor `AUTHOR_IS_NOT_THREAD_AUTHOR` features.\n\nI know LLMs are the new hotness, but they're expensive to run. A well-tuned naive Bayes\nclassifier should be able to handle the firehose on a single core without breaking\na sweat.\n\n# Training the classifier\n\nA challenge with naive Bayes is training it. 
The classic approach is to label a bunch\nof samples as positive or negative, then train a model.\n\nLabelling is tedious and sucks.\n\nMaybe there's room here for an LLM to be used: you could express your desired classes\nin plain language, and apply an LLM to generate best-effort labels. A human quickly\nreviews them and accepts/rejects the labels, and that becomes your training set.\n\nPerhaps Llamafile with a reasonably-sized model could be used here?\n\n# Ops questions\n\nThe Bluesky firehose is not _that_ big at present. ~200 posts/second, ~800 likes/second.\n\nThis is just a side project, so being a little lossy is fine if it simplifies perf problems.\n\nMy overall hope is to do something like this:\n\n- apply a Bayes classifier to the stream of posts. Hopefully we discard 99.9%+ of posts.\n- track the IDs of non-discarded posts\n- only track likes for non-discarded posts; buffer them in-memory and checkpoint to a SQLite\n  DB on some cadence so that we can interrupt/resume Jetstream processing via `cursor`\n- retain persisted data for at most 7 days\n\nThe Bayes classification can be farmed out amongst threads, but the overall processing needs\nto be sequential -- e.g. we have to know we've processed post X before processing any likes for it,\nor before processing replies to post X.\n\n# Golang notes\n\nIt's been years since I wrote Go code. I'm relying on ChatGPT a lot. Useful commands:\n\n```bash\n$ go test ./... # run all tests, recursively\n\n$ gofmt -w .    # format all files, recursively\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fbayesky","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcldellow%2Fbayesky","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fbayesky/lists"}