{"id":18807112,"url":"https://github.com/mshenfield/subreddit_algebra","last_synced_at":"2025-04-13T19:23:38.112Z","repository":{"id":19851367,"uuid":"87166550","full_name":"mshenfield/subreddit_algebra","owner":"mshenfield","description":"A python port and frontend to 538's subreddit analysis","archived":false,"fork":false,"pushed_at":"2023-01-24T18:47:33.000Z","size":1725,"stargazers_count":5,"open_issues_count":25,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T10:04:50.619Z","etag":null,"topics":["data-science","data-visualization","reddit"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mshenfield.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-04T09:02:09.000Z","updated_at":"2021-01-05T04:56:49.000Z","dependencies_parsed_at":"2023-02-14T00:25:14.109Z","dependency_job_id":null,"html_url":"https://github.com/mshenfield/subreddit_algebra","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mshenfield%2Fsubreddit_algebra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mshenfield%2Fsubreddit_algebra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mshenfield%2Fsubreddit_algebra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mshenfield%2Fsubreddit_algebra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mshenfield","download_url":"https://codeload.github.com/mshen
field/subreddit_algebra/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248766687,"owners_count":21158302,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","data-visualization","reddit"],"created_at":"2024-11-07T22:50:52.015Z","updated_at":"2025-04-13T19:23:38.092Z","avatar_url":"https://github.com/mshenfield.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Subreddit Algebra\nA frontend to [538's analysis](https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following) of subreddit similarity.\n\n## Methodology\n538 has some really interesting commentary at the end of [their article](https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/) on their methodology.\n\nFor convenience and personal familiarity, this ports [the R script](https://github.com/fivethirtyeight/data/blob/master/subreddit-algebra/processData.sql) used by 538 to Python. It also tweaks the methodology so that nearest neighbors can be queried efficiently using an index. Cosine similarity is not a distance metric, so a metric-based index cannot use it directly. 
This project relies on the fact that for unit vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so ordering neighbors by Euclidean distance is the same as ordering them by cosine similarity.\n\nWith this in mind, the port normalizes all feature vectors to unit length and builds a [Ball Tree](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree) index for efficient K-Nearest-Neighbors querying.\n\n## Installation\nThis requires running two development servers: one for the `Flask`-based API, which integrates with our pickled sklearn models, and another for the `create-react-app`-based frontend.\n\n### Frontend\nMake sure you have [`nodejs`](https://github.com/creationix/nvm/) installed.\n\n```bash\ncd frontend\nnpm install # or yarn install\nnpm start # or yarn start\n# you should be automatically sent to localhost:5000 in the browser.\n```\n\n\n### Backend\nBuilding the models is still a manual process of executing SQL code, downloading the results, and using a Python script to massage, index, and pickle the results.\n\nMake sure you have [`pipenv`](http://docs.pipenv.org/en/latest/) installed and run `pipenv install`.\n\n**Query**\n\nFollow the instructions in the [bigquery README](./bigquery/README.md) to execute the query and download the required file.\n\n**Index**\n\nWith your query results on disk, run:\n\n```bash\nmkdir output\npipenv run python subreddit_algebra_app/algebra/build_index.py \u003cpath_to_table_csv\u003e\n```\n\nThis will automatically run the algorithm and processing steps, and save all required data into the `output` 
folder at the root of the project.\n\nNow you can see it in action!\n\n```bash\nFLASK_APP=subreddit_algebra_app/server.py flask run\n# in a second terminal:\ncurl http://localhost:5000/algebra/highqualitygifs/-/reactiongifs\n```\n\n## API\n`/algebra/\u003csubreddit_1\u003e/\u003coperator\u003e/\u003csubreddit_2\u003e` - returns the five subreddits closest to the result of adding `subreddit_2` to, or subtracting it from, `subreddit_1`\n\n`/completions/\u003cprefix\u003e` - returns the first 10 subreddit names that start with `prefix`\n\n## Load Testing\nWe use [`locust`](http://locust.io/) for load testing. From the root of the project, run:\n\n```bash\nlocust --host=example.com # replace example.com with the URL of the instance you want to test\n```\n\n## Deployment\nThis project is configured to deploy to AWS Elastic Beanstalk using the [eb command line tool](http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/eb-cli3.html).\n\nUpload your pickled models to a bucket in S3, e.g.\n\n```bash\naws s3 sync output/ s3://path/to/your/bucket/ # replace with your bucket\n```\n\nYou'll need to customize one setting: change the `S3_DATA_BUCKET` variable in [.ebextensions/00_main.config](.ebextensions/00_main.config) to the S3 bucket associated with your Elastic Beanstalk setup.\n\n\nYou'll also want to set the `REACT_APP_GA_TRACKING_CODE` environment variable in your Elastic Beanstalk production environment.\n\n```bash\neb setenv REACT_APP_GA_TRACKING_CODE=XXXXXXXXXX # Replace with your GA tracking code\n```\n\nYou can then just use the normal commands (`eb create`, `eb deploy`).\n\n## License\n[MIT](LICENSE.md)\n\n## Contributing\nContributions  ✍  are welcome\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmshenfield%2Fsubreddit_algebra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmshenfield%2Fsubreddit_algebra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmshenfield%2Fsubreddit_algebra/lists"}