{"id":20715634,"url":"https://github.com/someshsingh22/flaireddit-midas","last_synced_at":"2025-07-08T10:05:36.146Z","repository":{"id":53638295,"uuid":"253452340","full_name":"someshsingh22/FlaiReddit-MIDAS","owner":"someshsingh22","description":"A Webapp deployed on Heroku which detects the 'flair' tags of a Reddit Post from the subreddit r/india","archived":false,"fork":false,"pushed_at":"2021-03-20T03:43:39.000Z","size":41423,"stargazers_count":4,"open_issues_count":3,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T23:22:47.914Z","etag":null,"topics":["flask","heroku","reddit","text-classification","web-application","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/someshsingh22.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-06T09:31:41.000Z","updated_at":"2021-04-02T20:41:13.000Z","dependencies_parsed_at":"2022-09-04T01:22:10.146Z","dependency_job_id":null,"html_url":"https://github.com/someshsingh22/FlaiReddit-MIDAS","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/someshsingh22%2FFlaiReddit-MIDAS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/someshsingh22%2FFlaiReddit-MIDAS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/someshsingh22%2FFlaiReddit-MIDAS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/someshsingh22%2FFlaiReddit-MIDAS/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/someshsingh22","download_url":"https://codeload.github.com/someshsingh22/FlaiReddit-MIDAS/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250419407,"owners_count":21427596,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flask","heroku","reddit","text-classification","web-application","web-scraping"],"created_at":"2024-11-17T02:39:17.801Z","updated_at":"2025-04-23T10:44:34.138Z","avatar_url":"https://github.com/someshsingh22.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# *FlaiReddit*\nFlaiReddit is an end-to-end web app deployed on Heroku that predicts the flair tag of posts in r/india. The project is structured in 5 steps.\n\n## RedditCrawler - Web Scraper\nThe data extractor collects posts from a wide time period to reduce bias towards a few hot topics.\n* You can save and load your progress at checkpoints (especially useful for online collection and storage).\n* Approximately 600 posts can be extracted per second; however, because of subreddit moderation, only about 20% of the data is actually available.
\n* All logs are written to crawler.log, and warnings are displayed.\n* To save space, posts with empty flairs are removed batch-wise.\n\n```python\nfrom modules.crawler import *\n\nstart_time = ...  # Replace with the unix timestamp of the date at which scraping should begin\nend_time = ...  # Replace with the unix timestamp of the date at which scraping should end\nscraper = Crawler(size=1000, difference=12, sleep=0.5, start=start_time)\n\n# Keep querying until the crawler's current timestamp reaches end_time\nwhile scraper.current \u003e end_time:\n\tscraper.query()  # Query the database\nscraper.dump()  # Dump the stats and csv\n```\n\nA committed notebook is available on [Kaggle](https://www.kaggle.com/someshsingh22/redditcrawlertest)\n\n## Exploratory Data Analysis\nExtensive analysis has been done: important words are visualized through word clouds, and an in-depth explanation of these and of the preprocessing is in my [Notebook](https://github.com/someshsingh22/FlaiReddit-MIDAS/blob/master/Notebooks/Part-2-EDA.ipynb)\n\n\u003e A baseline bag-of-words (BOW) model is also implemented at the end.\n## Training the Model [BERT, TFIDF]\nWe set the seed for reproducibility and use BERT - *uncased, base*, freezing all layers apart from the last layer. The weights are saved for easier inference at : \n\n**Model Summary [Inference Time]**:\n\n| Model | Micro-F1 | Macro-F1 | CPU Inference Speed |\n|--|--|--|--|\n| TFIDF Combined | 0.51 | 0.50 | **331 samples/s** |\n| BERT | **0.60** | **0.59** | 2.37 samples/s |\n| TFIDF, Feats | 0.49 | 0.48 | 273 samples/s |\n\n![Confusion Matrix](Images/CM.png)\n\n## WebApp - Flask TFIDF\n* For the web app we use the TFIDF model, keeping CPU usage and memory usage in mind [BERT BASE has 114 M parameters].\n* The app is built with Flask; the root view is a simple webpage where you enter the post's link and the predicted flair is displayed.\n* The other endpoint is `/auto`, to which a POST request is sent and the prediction JSON is sent back.\n* Logs and error pages will be enabled in a future update.\n* The colour theme is taken from Reddit's own theme :)\n\n
**Root page:**\n```bash\ncd app\npython main.py\n* Running on http://127.0.0.1:5000/\n```\n![Root page](Images/flaireddit_webapp.gif)\n\n**Auto Endpoint**\n```python\n\u003e\u003e\u003e import requests\n\u003e\u003e\u003e with open('file.txt','wb') as f:\n...     f.write(b\"r/india post urls\")\n\u003e\u003e\u003e base_url = \"https://flaireddittest.herokuapp.com\" #http://127.0.0.1:5000/ if local\n\u003e\u003e\u003e url = f\"{base_url}/auto\"\n\u003e\u003e\u003e files = {'upload_file': open('file.txt','rb')}\n\u003e\u003e\u003e r = requests.post(url, files=files)\n\u003e\u003e\u003e r\n\u003cResponse [200]\u003e\n\u003e\u003e\u003e r.json()\n{\"post_url\" : 'predicted tag'}\n```\n## HEROKU DEPLOYMENT\nFinally, the web application is deployed on Heroku and is available at [FlaiRedditTest](https://flaireddittest.herokuapp.com/); the automated endpoint is available at [FlaiRedditTest/auto](https://flaireddittest.herokuapp.com/auto)\n\u003e Here is a snapshot of a correct classification on Android\n\n As with the local webapp, you can access the automated endpoint and the root GET view by replacing http://127.0.0.1:5000/ with https://flaireddittest.herokuapp.com/\n\n **Final Android View**\n\n![Android Final View](Images/DroidView.jpeg)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomeshsingh22%2Fflaireddit-midas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsomeshsingh22%2Fflaireddit-midas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomeshsingh22%2Fflaireddit-midas/lists"}