{"id":25753582,"url":"https://github.com/gabya06/twitter_models","last_synced_at":"2026-05-04T23:33:05.447Z","repository":{"id":77215119,"uuid":"53684105","full_name":"Gabya06/twitter_models","owner":"Gabya06","description":"Repository used for twitter impression models","archived":false,"fork":false,"pushed_at":"2018-09-13T20:18:56.000Z","size":15963,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-28T01:34:37.514Z","etag":null,"topics":["data","data-science","impressions","machinelearning","python","ridge-regression","sklearn","twitter"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gabya06.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-03-11T17:23:30.000Z","updated_at":"2018-09-13T20:18:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"951e763b-9c81-40cf-840d-5afa3ea92d39","html_url":"https://github.com/Gabya06/twitter_models","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Gabya06/twitter_models","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gabya06%2Ftwitter_models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gabya06%2Ftwitter_models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gabya06%2Ftwitter_models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gaby
a06%2Ftwitter_models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gabya06","download_url":"https://codeload.github.com/Gabya06/twitter_models/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gabya06%2Ftwitter_models/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32628829,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-04T10:08:07.713Z","status":"ssl_error","status_checked_at":"2026-05-04T10:08:02.005Z","response_time":58,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-science","impressions","machinelearning","python","ridge-regression","sklearn","twitter"],"created_at":"2025-02-26T15:18:47.955Z","updated_at":"2026-05-04T23:33:05.433Z","avatar_url":"https://github.com/Gabya06.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Twitter model\n\n## Project Overview\n\nThis repository models Twitter impressions from followers and retweets using linear regression models (Ridge regression in particular).\n\nThe Python code in the Jupyter notebook predicts Twitter impressions based on retweets and followers. The Model class is used to initialize the model, clean the data, train, predict and perform cross-validation. 
Three slightly different models are built:\n\n## Model 1:\nThe 1st model is built using all 27 properties after removing the top and bottom 10% of retweets and followers. Ridge regression is used with alpha set to 0.1, and R-squared is 0.764 using 90% training and 10% test data. Using an average error threshold of 60% filters out properties with larger average errors such as CNN, NFL, MTV, BET and InStyle. Performing 5-fold cross-validation to find the best alpha for Ridge regression yields about the same score with slightly lower average errors.\n\n```python\nprint \"*\" * 30\nprint \"MODEL 1 - Using all 27 properties for modeling impressions\"\nprint \".... CLEANING DATA .... REMOVING OUTLIERS ....\"\nlinModel1 = Model()\nlinModel1.clean(quantile = .1)\n\nprint \".... TRAINING MODEL 1 ....\"\nlinModel1.train(model = Ridge, perc_train = .9, alpha = .1)\n\nprint \".... GETTING RESULTS FOR MODEL 1....\"\nscore = linModel1.r_score\nprint \"ADJUSTED R2 = \", score\n\nresults = linModel1.get_results()\nprint \"Model Coefficients:\"\ncoefs = linModel1.get_coefs()\nprint coefs\n```\n\nADJUSTED R2 =  0.765818282025\n\n5-fold cross validation scores:\n\n- [fold 0] alpha: 0.000100000, score: 0.76615\n- [fold 1] alpha: 0.000152831, score: 0.76248\n- [fold 2] alpha: 0.000233572, score: 0.76689\n- [fold 3] alpha: 0.000356970, score: 0.75942\n- [fold 4] alpha: 0.000545559, score: 0.76597\n\nBest alphas\n\n|alpha     |fold     |score |\n|----------|:-------:|:----:|\n|0.000234  |   2     |0.7668|\n|0.000100  |   0     |0.7661|\n|0.000546  |   4     |0.7659|\n|0.000153  |   1     |0.7624|\n|0.000357  |   3     |0.7594|\n\nProperties with the largest errors in model 1:\n\nCNN, NFL, MTV, 106andpark, BET, InStyle, Brueggers, Dasaniwater, Essencemag, ABC11 wtvd\n\n\n## Model 2:\nThe 2nd model removes the properties that had the highest average error (over 60%), then performs 5-fold cross-validation to search for the best alpha for Ridge regression. 
The average score on the folds is higher than for the 1st model (around 0.82). The largest errors are now around 50% and come from smaller properties, which suggests building one model for large properties such as CNN and MTV and another model for smaller properties.\n\n```python\nalphas = np.logspace(-4, -.5, 20)\nprint \"*\" * 30\nprint \"MODEL 2\"\nprint \"REMOVING TWITTER USERS WITH AVG ERRORS \u003e 60%\"\n\n\nlinModel2 = Model()\n# remove names with largest average of error\nlinModel2.tw_data = linModel2.tw_data[~linModel2.tw_data.page_id.isin(tw_names_drop.page_id)]\nprint \".... CLEANING DATA .... REMOVING OUTLIERS ....\"\nlinModel2.clean(quantile = .1)\nprint \"\\n\"\nprint \"5 FOLD CROSS VALIDATION ....\"\nbest_alpha = linModel2.cross_validate(alphas=alphas, folds=5)\nprint\nprint \"FOUND BEST ALPHA USED IN MODEL 2: \", best_alpha\nprint \".... TRAINING MODEL 2....\"\nlinModel2.train(model = Ridge, perc_train = .9, alpha = best_alpha)\nscore2 = linModel2.r_score\nprint \"ADJUSTED R2 = \", score2\nresults2 = linModel2.get_results()\nprint \"Model Coefficients:\"\ncoefs2 = linModel2.get_coefs()\nprint coefs2\n```\n\nADJUSTED R2 =  0.816688528879\n\n5-FOLD CROSS VALIDATION SCORES:\n\n- [fold 0] alpha: 0.000100000, score: 0.81777\n- [fold 1] alpha: 0.000152831, score: 0.82348\n- [fold 2] alpha: 0.000233572, score: 0.82171\n- [fold 3] alpha: 0.000356970, score: 0.81639\n- [fold 4] alpha: 0.000545559, score: 0.81868\n\nBest alphas\n\n|alpha     |fold     |score |\n|----------|:-------:|:----:|\n|0.000153  |  1      |0.8234|\n|0.000234  |  2      |0.8217|\n|0.000546  |  4      |0.8186|\n|0.000100  |  0      |0.8177|\n|0.000357  |  3      |0.8163|\n\n## Model 3:\nThe 3rd model is built using only the properties with the largest errors (over 60%). Its average R-squared is close to 0.89, so this model outperforms the other two. 
While there are still some properties with large errors, some, like CNN, are predicted better by this model than by the 1st model, which includes all properties.\n\n```python\nprint \"*\" * 30\nprint \"MODEL 3\"\n\nprint \"*\" * 30\nprint \"BUILDING MODEL FOR TWITTER USERS GUILTY OF LARGEST ERRORS\"\n\nlinModel3 = Model()\nlinModel3.tw_data = linModel3.tw_data[linModel3.tw_data.page_id.isin(tw_names_drop.page_id)]\nprint \".... CLEANING DATA .... REMOVING OUTLIERS ....\"\nlinModel3.clean(quantile = .1)\nprint \"\\n\"\nprint \"10 FOLD CROSS VALIDATION ....\"\nbest_alpha_3 = linModel3.cross_validate(alphas=alphas, folds=10)\nprint\nprint \"FOUND BEST ALPHA USED IN MODEL 3: \", best_alpha_3\nprint \".... TRAINING MODEL 3....\"\nlinModel3.train(model = Ridge, perc_train = .9, alpha = best_alpha_3)\nscore3 = linModel3.r_score\nprint \"ADJUSTED R2 = \", score3\nprint \"Model Coefficients:\"\ncoefs3 = linModel3.get_coefs()\nprint coefs3\n```\n\nADJUSTED R2 =  0.888740204144\n\n10 FOLD CROSS VALIDATION SCORES:\n\n- [fold 0] alpha: 0.000100000, score: 0.89105\n- [fold 1] alpha: 0.000152831, score: 0.89236\n- [fold 2] alpha: 0.000233572, score: 0.88658\n- [fold 3] alpha: 0.000356970, score: 0.89746\n- [fold 4] alpha: 0.000545559, score: 0.88830\n- [fold 5] alpha: 0.000833782, score: 0.89534\n- [fold 6] alpha: 0.001274275, score: 0.88810\n- [fold 7] alpha: 0.001947483, score: 0.89164\n- [fold 8] alpha: 0.002976351, score: 0.89149\n- [fold 9] alpha: 0.004548778, score: 0.89680\n\nBest alphas\n\n|alpha     |fold     |score |\n|----------|:-------:|:----:|\n|0.000357  |  3      |0.8974|\n|0.004549  |  9      |0.8967|\n|0.000834  |  5      |0.8953|\n|0.000153  |  1      |0.8923|\n|0.001947  |  7      |0.8916|\n|0.002976  |  8      |0.8914|\n|0.000100  |  0      |0.8910|\n|0.000546  |  4      |0.8883|\n|0.001274  |  6      |0.8881|\n|0.000234  |  2      |0.8865|\n\n\n### A bit of documentation on the Model class:\n\nThe Model class initializes a model with the following attributes:\n- Data: 
Twitter and user data\n- Model results: a dataframe to store page_id, followers, retweets, impressions, predicted and percent differences\n- rmse score\n- n data points\n- model (linear regression)\n\nThe clean function:\n- Removes retweets and impressions in the top and bottom tails defined by the input quantile (.10 by default)\n- Adds weekday as a feature and drops the time column\n\nThe train function:\n- Gets randomized training and test data based on percent to train (default is 90%), fits the model and sets the score and predicted values.\n\nThe cross validate function:\n- Takes as input alphas and the number of k-folds to run (default is 5) and performs cross-validation to find the best alpha.\n- Returns the best alpha found based on k-fold cross-validation.\n\n```python\nclass Model:\n    '''\n    class to build linear regression model to predict impressions based on followers \u0026 re-tweets\n    '''\n    def __init__(self):\n        '''\n        Initialize model with Twitter data with user data (not used yet)\n        Has rmse, r_score, n and model attributes\n        '''\n        self.tw_data = load_tw_data()\n        self.user_data = load_user_data()\n\n        # self.result = pd.DataFrame(columns = ['base_user','cross_user','base_total_overlap','perc_overlap'])\n        self.model_results = pd.DataFrame(columns=['page_id', 'followers', 'retweets', 'impressions', 'predicted', \\\n                                                   'perc_diff', 'err_cat'])\n        self.rmse = None\n        self.r_score = None\n        self.n = None\n        self.model = None\n\n\n    def train(self, model, perc_train, alpha):\n        '''\n        fit model on training data, set score and model results \n        :param model: linear model to set (Ridge, Lasso..)  
\n        :param perc_train: percent to use for training (float between 0 \u0026 1)\n        :param alpha: penalty value for model \n\n        '''\n        x_train, y_train, x_test, y_test = self._get_train_test(perc_train)\n        self.model = model(alpha)\n        self.model.fit(x_train, y_train)\n        self.r_score = self.model.score(x_test, y_test)\n        self.model_results.predicted = self.model.predict(x_test)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabya06%2Ftwitter_models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgabya06%2Ftwitter_models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabya06%2Ftwitter_models/lists"}