{"id":15913101,"url":"https://github.com/x-tabdeveloping/language-analytics-assignment2","last_synced_at":"2025-04-03T03:16:32.903Z","repository":{"id":225115974,"uuid":"765083208","full_name":"x-tabdeveloping/language-analytics-assignment2","owner":"x-tabdeveloping","description":"Assignment 2 for Language Analytics","archived":false,"fork":false,"pushed_at":"2024-05-10T12:23:53.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T17:14:38.398Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-29T08:51:02.000Z","updated_at":"2024-05-10T12:23:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"d87bc815-dbd8-4046-b85c-aa94c2269216","html_url":"https://github.com/x-tabdeveloping/language-analytics-assignment2","commit_stats":null,"previous_names":["x-tabdeveloping/language-analytics-assignment2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment2/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/language-analytics-assignment2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927844,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T16:23:06.589Z","updated_at":"2025-04-03T03:16:32.806Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","readme":"# language-analytics-assignment2\nAssignment 2 for Language Analytics: Classifying fake news with logistic regression and ANNs based on bag-of-words representations.\n\nThe data is available on [Kaggle](https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news).\nThe downloaded csv file should be placed in a `data` directory:\n```\n- data/\n    fake_or_real_news.csv\n```\n\n## Usage\n\nInstall requirements:\n\n```bash\npip install -r requirements.txt\n```\n\nRun logistic regression benchmark:\n\n```bash\npython3 src/logistic_regression.py\n```\n\nRun Neural Network benchmark:\n\n```bash\npython3 src/neural_network.py\n```\n\nBoth files save classification reports to the `out/` folder in the form of txt files\nand serialize models as pipelines, including the vectorizer with `joblib`.\n\nTo load the fitted models in a separate script for inference:\n\n```python\nimport joblib\n\nclassifier = 
joblib.load(\"models/logistic_regression.joblib\")\nclassifier.predict([\"Write your text here\"])\n```\n\n\u003e Additionally, the scripts will produce csv files with the CO2 emissions of the subtasks in the code (`emissions/`).\n\u003e This is necessary for Assignment 5, and is not directly relevant to this assignment.\n\n\u003e Note: The `emissions/emissions.csv` file should be ignored, because codecarbon can't track process and task emissions at the same time.\n\n## Results\n\nThe neural network outperforms the logistic regression classifier by about two percentage points and generally performs better on unseen data than LR.\nThis is expected, as the MLP can build a richer representation of the data (in the hidden layer) before passing it forward to a logistic regression head.\nThe difference in accuracy, however, is relatively negligible compared to the difference in CO2 emissions and training time, and in a real-world setting these might be more important than sheer performance.\nBoth classifiers performed very reliably, and I would certainly deem them usable in a production setting, with both of them boasting F1 scores above 0.9.\n\n## Potential Limitations\nNo experimentation with the hyperparameters was done in the code, and default parameters were used in almost all scikit-learn estimators.\n\n### Vectorization\nTF-IDF weighting was used on bag-of-words representations and stop words were removed as part of the preprocessing.\nReducing the dimensionality of the language representations would probably have benefited the performance of both classifiers.\nThis could have been done either by a matrix decomposition step (NMF or SVD), which would account for polysemy and homonymy in the feature space, or simply by reducing the vocabulary.\nThe feature space could have been reduced by filtering out overly frequent or infrequent terms, or by lemmatizing the texts in preprocessing.\nThese options might 
result in better model fit and potentially even close the gap between the LR and MLP classifiers.\n\n### Model Hyperparameters\nArtificial Neural Networks are typically sensitive to the number and size of hidden layers, along with the activation functions between them.\nI have not experimented with these aspects, but a thorough and systematic evaluation might have gained us a few extra percentage points in performance.\nGrid search or Bayesian hyperparameter optimization could be used to find optimal values if maximal performance were needed.\n\n### Cross-validation\nEvaluating over multiple folds (k-fold CV) might have given us a more realistic estimate of how the model would perform in a real-world setting.\nIf we did not shuffle the dataset beforehand, we might even get a reasonable estimate of how domain shift would affect the model in production.\nk-fold CV would also enlighten us about the uncertainty around the models' performance.\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment2/lists"}