{"id":13523044,"url":"https://anvaka.github.io/redsim/","last_synced_at":"2025-04-01T00:30:51.765Z","repository":{"id":35497607,"uuid":"39767286","full_name":"anvaka/redsim","owner":"anvaka","description":"reddit discovery","archived":false,"fork":false,"pushed_at":"2024-03-09T05:00:57.000Z","size":17396,"stargazers_count":93,"open_issues_count":15,"forks_count":9,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-08-02T06:14:37.280Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://anvaka.github.io/redsim/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anvaka.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-07-27T09:53:46.000Z","updated_at":"2024-07-27T14:08:39.000Z","dependencies_parsed_at":"2024-03-09T06:31:24.557Z","dependency_job_id":null,"html_url":"https://github.com/anvaka/redsim","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fredsim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fredsim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fredsim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anvaka%2Fredsim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anvaka","download_url":"https://codeload.github.com/anvaka/redsim/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","reposi
tories_count":222688173,"owners_count":17023297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:00:54.957Z","updated_at":"2024-11-02T07:31:13.314Z","avatar_url":"https://github.com/anvaka.png","language":"JavaScript","funding_links":[],"categories":["Media"],"sub_categories":["Reddit"],"readme":"# reddit discovery\n\n**NOTE:** Check out a couple more similar projects: https://anvaka.github.io/sayit/ and https://anvaka.github.io/map-of-reddit/. The description below is slightly outdated and should be updated.\n\nYour comments on reddit are not only what makes reddit fun. They can also be\nused to x-ray the friendly alien and reveal its hidden structure.\n\n# Redditors who commented to this subreddit also commented to...\n\nThis simple idea is the core of the current recommendation website. Despite\nits simplicity, it yields amazing results.\n\nBefore you read any further, please go ahead and check for yourself:\n\n* [Subreddits related to /r/programming](https://anvaka.github.io/redsim/#!?q=programming)\n* [Subreddits related to /r/Games](https://anvaka.github.io/redsim/#!?q=Games)\n* [Subreddits related to /r/linux](https://anvaka.github.io/redsim/#!?q=linux)\n\nIt works really well for subreddits with under 1 million subscribers. 
But how\nexactly does it work?\n\n# How exactly does it work?\n\nRecently `/u/Stuck_In_the_Matrix` publicly released [reddit's ~1.7 billion comments dataset](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/).\nEach record contains the author's name and the target subreddit.\n\nIf you post to subreddits `A` and `C` very often - it doesn't necessarily mean that\n`A` and `C` are related. But if there are thousands of people posting to both\n`A` and `C`, we could suspect that the subreddits are related.\n\nOf course sometimes `A` is way more popular than `C`, and we need to take that\ninto account. Let's consider three subreddits:\n\n* `A` - has 1,000 subscribers\n* `B` - also has 1,000 subscribers; and\n* `C` - has only 100 subscribers\n\nImagine `A` and `B` share 100 redditors who posted to both `A` and `B`.\nAlso imagine `A` and `C` share another 100 redditors who posted to both `A`\nand `C`.\n\nWhich subreddit is more related to `A`? Is it `B` or `C`?\n\nOnly 10% of `B` has posted to `A`, while 100% of `C` has posted to `A`.\nThis means `C` has a very high \"relationship index\" with `A`.\n\nTurns out this \"relationship index\" has many names and forms. One of the\nsimplest forms is called the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) (similarity).\n\n### Jaccard similarity\n\nTo find how similar subreddits `A` and `B` are to each other, all we need to do is:\n\n1. Find how many subscribers who posted to `A` have also posted to `B` (intersection of `A` and `B`).\n2. Find how many subscribers have posted to `A` or `B` (union of `A` and `B`).\n3. Divide the result of step `1` by the result of step `2` to get the Jaccard similarity.\n\nIn the example above, 
Jaccard similarity of `A` and `C` is: `J(A, C) = 100/1000 = 0.1`,\nwhile Jaccard similarity of `A` and `B` is: `J(A, B) = 100/(900 + 100 + 900) = 0.053`\n\nThis makes `C` about two times more similar to `A` than `B`.\n\n### Drawbacks\n\nThis approach works extremely well for subreddits with fewer than 1,000,000 subscribers.\nFor more popular subreddits, the results get saturated by the popularity of those\nsubreddits. If you have an idea how to fix this, please [let me know :)](https://github.com/anvaka/redsim/issues/new).\n\n# Technical details\n\nNote: The details below outline my old procedure. I didn't use it to build\nthe latest snapshot, which is based on 150 million unique comments. Still keeping\nit here for reference.\n\nTo compute similarity between subreddits I downloaded only one month's worth of\npublic comments. This gives more than 50,000,000 `user ⇄ subreddit` records,\nwhich translates to almost 50,000 unique subreddits.\n\nEach record is stored in a redis database in these [50 lines of code](https://github.com/anvaka/reddata/blob/db6489e60b96bf3b1d1ef841786b5cd45708fe28/lib/redisClient.js#L81).\nThen I use [SINTERSTORE](http://redis.io/commands/sinterstore) and\n[SUNIONSTORE](http://redis.io/commands/sunionstore) to compute the intersection\nand union of subreddits ([code](https://github.com/anvaka/reddata/blob/db6489e60b96bf3b1d1ef841786b5cd45708fe28/lib/redisClient.js#L81)). 
\n\nThis is the most straightforward brute-force approach to computing similarities.\nIt took my old MacBookPro friend almost 70 CPU hours to compare all subreddits\nwith each other.\n\n# What's next?\n\nI truly hope you enjoyed the simplicity of the formula and the power of the results.\nIf you have any feedback, please let me know!\n\n# license\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/anvaka.github.io%2Fredsim%2F","html_url":"https://awesome.ecosyste.ms/projects/anvaka.github.io%2Fredsim%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/anvaka.github.io%2Fredsim%2F/lists"}