{"id":19530555,"url":"https://github.com/bhavyac16/flairifyme","last_synced_at":"2026-05-06T15:37:43.436Z","repository":{"id":56693264,"uuid":"197047942","full_name":"BhavyaC16/FlairifyMe","owner":"BhavyaC16","description":"FlairifyMe is a Reddit Flair Detector for r/india subreddit, that takes a post's URL as user input and predicts the flair for the post using a model generated by Logistic Regression.","archived":false,"fork":false,"pushed_at":"2023-04-21T20:09:46.000Z","size":59090,"stargazers_count":0,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-08T16:32:05.116Z","etag":null,"topics":["flair-prediction","flask","hacktoberfest","linear-svm","logistic-regression","naive-bayes-classifier","nltk","praw-reddit","reddit-flair-detector","scikit-learn","scraped-data","subreddit","text-classification"],"latest_commit_sha":null,"homepage":"https://flairify-me.herokuapp.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BhavyaC16.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-15T18:01:41.000Z","updated_at":"2021-10-15T07:43:38.000Z","dependencies_parsed_at":"2022-08-15T23:30:46.233Z","dependency_job_id":null,"html_url":"https://github.com/BhavyaC16/FlairifyMe","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BhavyaC16%2FFlairifyMe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BhavyaC16%2FFlairifyMe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/B
havyaC16%2FFlairifyMe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BhavyaC16%2FFlairifyMe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BhavyaC16","download_url":"https://codeload.github.com/BhavyaC16/FlairifyMe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240783108,"owners_count":19856776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flair-prediction","flask","hacktoberfest","linear-svm","logistic-regression","naive-bayes-classifier","nltk","praw-reddit","reddit-flair-detector","scikit-learn","scraped-data","subreddit","text-classification"],"created_at":"2024-11-11T01:33:46.159Z","updated_at":"2026-05-06T15:37:38.394Z","avatar_url":"https://github.com/BhavyaC16.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FlairifyMe\nFlairifyMe is a Reddit Flair Detector for [r/india](https://www.reddit.com/r/india/) subreddit, that takes a post's URL as user input and predicts the flair for the post using a model generated by Logistic Regression. 
The web-application is hosted on Heroku at [FlairifyMe](https://flairify-me.herokuapp.com/).\n\nThe web-application also offers visual content and temporal analysis of the collected data.\n\n## Directory Structure\nThe project has been developed using Python and several of its libraries and frameworks:\n- Scikit-learn\n- PRAW\n- NLTK\n- Flask\n- numpy\n- pandas\n- PyMongo\n\nThe scraped data is saved and loaded as a MongoDB instance. The web-application is based on Flask and deployed using Heroku.\n\nThe files and folders in the repository are described below:\n\n- [Data](https://github.com/BhavyaC16/FlairifyMe/tree/master/Data): Contains CSV files with preprocessed scraped data, the MongoDB collections, and scripts for scraping, preprocessing and analysing data.\n- [Models](https://github.com/BhavyaC16/FlairifyMe/tree/master/Models): Contains the machine learning model used for predicting flairs.\n- [Training](https://github.com/BhavyaC16/FlairifyMe/tree/master/Training): Contains the script for text-classification.\n- [templates](https://github.com/BhavyaC16/FlairifyMe/tree/master/templates): Contains HTML scripts for the web-application.\n- [app.py](https://github.com/BhavyaC16/FlairifyMe/blob/master/app.py): Used to start up the Flask server.\n- [flair_predictor.py](https://github.com/BhavyaC16/FlairifyMe/blob/master/flair_predictor.py): Module to accept a valid URL and predict the post's flair by loading the model.\n- [nltk.txt](https://github.com/BhavyaC16/FlairifyMe/blob/master/nltk.txt): Contains NLTK library dependencies for deployment on Heroku.\n- [requirements.txt](https://github.com/BhavyaC16/FlairifyMe/blob/master/requirements.txt): Contains all dependencies for the project.\n\n## Usage\nThe web-application allows the user to enter an r/india URL and displays the predicted flair for the submitted post. 
The user can view content and temporal analysis of the scraped data by clicking on the 'Post Analysis' button in the top right corner of the page.\n\nTo run on a local server:\n1. Clone the repository\n```\ngit clone https://github.com/BhavyaC16/FlairifyMe.git\n```\n2. Create a virtual environment\n```\npython3 -m venv FlairifyMe\nsource FlairifyMe/bin/activate\ncd FlairifyMe/\n```\n3. Install the project dependencies\n```\npip3 install -r requirements.txt\n```\n4. Create the file `RedditAPI.py` as follows:\n```python\ndef accinfo():\n\tpersonalScript = '\u003center_Reddit_App_personal_script_here\u003e'\n\tsecretKey = '\u003center_Reddit_App_secret_key_here\u003e'\n\tapp = 'FlairifyMe'\n\tusername = '\u003center_your_Reddit_Username_here\u003e'\n\tpassword = '\u003center_your_Reddit_password\u003e'\n\treturn [personalScript, secretKey, app, username, password]\n```\nIf you want to scrape posts from Reddit, copy the same file to the directory `./Data/Scripts/` as well.\n\n5. To run the server, execute the following command\n```\npython3 app.py\n```\n\n## Approach\n### Data Scraping\nThe Python library PRAW has been used to scrape data from the subreddit r/india, with a total of 3,156 posts for 13 different flairs. The number of posts scraped per flair is as follows:\n![Number of posts scraped per flair](https://github.com/BhavyaC16/FlairifyMe/blob/master/Data/Scripts/DataSplit.png)\n\n### Data preprocessing\nThe data has been preprocessed using the NLTK library. The following procedures have been executed on the title, body and comments to clean the data:\n1. Tokenizing and removing symbols\n2. Removing stopwords\n3. 
Stemming\n\nTwo separate datasets have been prepared and saved in MongoDB for training: one with stemming and one without, since some sources report that stemming can reduce prediction accuracy in certain cases.\n\n### Training\nThe data has been loaded from MongoDB to a pandas DataFrame and split into 80-20 Training-Testing sets using scikit-learn.\nThree algorithms (Naive Bayes, Linear SVM and Logistic Regression) were trained on each of the post features (Title, Body, Comments, Title+Comments and Title+Body+Comments), for both datasets (with and without stemming).\n\nThe resulting prediction accuracies are summarized in the tables below:\n\nDATA WITHOUT STEMMING:\n\n| **Feature\\Algorithm**   | **Naive Bayes** | **Linear SVM** | **Logistic Regression** |\n|-------------------------|-----------------|----------------|-------------------------|\n| **Title**               | 0.59177         | 0.58386        | 0.54430                 |\n| **Body**                | 0.20569         | 0.24367        | 0.24051                 |\n| **Comments**            | 0.31171         | 0.59494        | 0.58069                 |\n| **Title+Comments**      | 0.37500         | 0.64082        | 0.63449                 |\n| **Title+Body+Comments** | 0.37816         | 0.64399        | **0.65189**             |\n\nDATA WITH STEMMING:\n\n| **Feature\\Algorithm**   | **Naive Bayes** | **Linear SVM** | **Logistic Regression** |\n|-------------------------|-----------------|----------------|-------------------------|\n| **Title**               | 0.57753         | 0.57120        | 0.54430                 |\n| **Body**                | 0.18354         | 0.23101        | 0.24051                 |\n| **Comments**            | 0.30063         | 0.55538        | 0.56013                 |\n| **Title+Comments**      | 0.36076         | 0.58703        | 0.60126                 |\n| **Title+Body+Comments** | 0.36551         | 0.59335        | 0.61392                 |\n\nAfter going through the flair-wise and 
overall prediction accuracies, the model trained using Title+Body+Comments on non-stemmed data with Logistic Regression was chosen.\n\n### Flair Prediction\nThe saved model is loaded for predicting the flair once the post features (title, body and comments) have been cleaned using NLTK. The returned result is displayed on the web-application.\n\n### API for querying FlairifyMe\nA developer API using Flask has been implemented, which returns a JSON response containing the predicted flair of the Reddit post queried by the user.\n\nIt can be accessed by querying:\n```\nflairify-me.herokuapp.com/api/resource?redditURL=\u003center_url_here\u003e\n```\n\nOn success, it returns JSON of the following format:\n```\n{'status': 'successful', 'status_code': 200, 'result': {'flair': '\u003cpredicted_flair\u003e'}}\n```\nOtherwise, it returns JSON of the format:\n```\n{'status': 'failed', 'status_code': \u003cerror_code\u003e, 'result': {'error': '\u003cerror_message\u003e'}}\n```\n\n## Future Extension\nI plan on adding the following features to the project:\n1. Improving the prediction by training the model on user inputs.\n2. Automating the script to allow users to develop a prediction model for any subreddit entered by them.\n\n## Learnings\nThis task has been a great learning experience for me, as it was my first time working with Machine Learning and Natural Language Processing, with tools like Heroku and MongoDB, and with several libraries like scikit-learn, nltk, praw and Flask.\n\n## References\n1. [Scraping Reddit](https://www.datasciencecentral.com/profiles/blogs/scraping-reddit)\n2. [Pre-processing Data](https://pythonhealthcare.org/2018/12/14/101-pre-processing-data-tokenization-stemming-and-removal-of-stop-words/)\n3. [Training Machine Learning Models with MongoDB](https://www.mongodb.com/blog/post/training-machine-learning-models-with-mongodb)\n4. [Text-Classification](https://medium.com/@ageitgey/text-classification-is-your-new-secret-weapon-7ca4fad15788)\n5. 
[Bag of Words in NLP](https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428)\n6. [Choosing a Text-Classifier](https://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html)\n7. [Text-Classification using Scikit-learn](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568)\n8. [Deploying Flask app to Heroku](https://github.com/datademofun/heroku-basic-flask)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhavyac16%2Fflairifyme","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbhavyac16%2Fflairifyme","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhavyac16%2Fflairifyme/lists"}