{"id":15913014,"url":"https://github.com/x-tabdeveloping/visual-analytics-assignmen3","last_synced_at":"2025-04-03T03:16:04.980Z","repository":{"id":232839212,"uuid":"785245764","full_name":"x-tabdeveloping/visual-analytics-assignmen3","owner":"x-tabdeveloping","description":"Third Assignment for Visual Analytics in Cultural Data Science","archived":false,"fork":false,"pushed_at":"2024-04-11T17:33:48.000Z","size":74,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T17:14:16.634Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-04-11T13:52:14.000Z","updated_at":"2024-04-11T16:42:23.000Z","dependencies_parsed_at":"2024-04-11T19:23:07.985Z","dependency_job_id":"2bdb4320-7803-418f-8711-26fecf57c8ba","html_url":"https://github.com/x-tabdeveloping/visual-analytics-assignmen3","commit_stats":null,"previous_names":["x-tabdeveloping/visual-analytics-assignmen3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fvisual-analytics-assignmen3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fvisual-analytics-assignmen3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fvisual-analytics-assignmen3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/x-tabdeveloping%2Fvisual-analytics-assignmen3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/visual-analytics-assignmen3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927843,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T16:22:28.754Z","updated_at":"2025-04-03T03:16:04.952Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","readme":"# visual-analytics-assignmen3\nThird Assignment for Visual Analytics in Cultural Data Science.\n\nThis repository contains code for training a classifier with Keras on the [Tobacco3482](https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg?resource=download)\ndataset using pretrained image embeddings from the VGG16 model.\n\n## Usage\n\n### Data\n\nYou should download the data from this [Kaggle page](https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg?resource=download).\nPut the archive in a `dat/` folder at the root of the repository.\n\nThen unzip the archive:\n\n```bash\ncd dat/\nunzip archive.zip\n```\n\n### Requirements\n\n```bash\npip install -r requirements.txt\n```\n\n### Training\n\nRun the `src/train.py` script to train the model.\n\n```bash\npython3 src/train.py\n```\n\nThis will output a `loss_curve.png` and a `classification_report.txt` into the `./out/` directory.\n\n## Methodology\n\n### Preprocessing\n\nAll images were resized to 
have a maximum width and height of 224 pixels, then padded with black pixels to reach the target `224x224` size.\nOne-hot encodings were produced for the labels.\n\nThe dataset was split into training (60%), validation (20%) and test (20%) sets.\nThe number of examples of each class was equalized across splits.\n\n### Training\n\nImage embeddings were extracted using the VGG16 model.\nA classifier with one hidden dense layer with ReLU activation was trained on these embeddings, using batch normalization and dropout.\nThe Adam optimizer was used with a learning rate of `1e-3`.\nThe model was trained for 10 epochs with batches of 32 examples.\n\n## Results\n\n### Loss curve\n\n![Loss curve](out/loss_curve.png)\n\nThe loss curve suggests that the model should have been trained for more epochs: the validation loss was decreasing at roughly the same pace as the training loss, and both still had a negative trend when training was terminated.\nThis indicates that what the model was still learning generalized to unseen images.\n\n### Classification Report\n\n|          | precision | recall | f1-score | support |\n|----------|-----------|--------|----------|---------|\n| ADVE     | 0.85      | 0.89   | 0.87     | 46      |\n| Email    | 0.75      | 0.84   | 0.79     | 120     |\n| Form     | 0.53      | 0.79   | 0.63     | 86      |\n| Letter   | 0.60      | 0.77   | 0.68     | 114     |\n| Memo     | 0.54      | 0.55   | 0.54     | 124     |\n| News     | 0.76      | 0.58   | 0.66     | 38      |\n| Note     | 0.71      | 0.12   | 0.21     | 40      |\n| Report   | 0.58      | 0.36   | 0.44     | 53      |\n| Resume   | 0.86      | 0.25   | 0.39     | 24      |\n| Scientific | 0.57     | 0.40   | 0.47     | 52      |\n| Accuracy |           |        | 0.63     | 697     |\n| Macro avg| 0.67      | 0.56   | 0.57     | 697     |\n| Weighted avg | 0.64  | 0.63   | 0.61     | 697     |\n\nThe model's performance on the holdout 
set was well above chance level, indicating that what it learned is generalizable.\nPerformance was not equally good across all classes: the model had trouble recognizing certain types of documents.\nResume and Note, for instance, had very low F1 scores, dragged down mostly by very low recall, meaning that many of these documents were recognized as something else (a high number of false negatives).\nOn the other hand, the high precision on these two classes indicates that there were few false positives when predicting them.\n\nWhile the dataset was fairly imbalanced, the model did not seem to learn frequent classes substantially better than infrequent ones.\nADVE documents, for example, were rather uncommon, yet still achieved an F1 score of 0.87, while Memos had many examples but were not recognized particularly well, with an F1 score of 0.54.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fvisual-analytics-assignmen3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Fvisual-analytics-assignmen3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fvisual-analytics-assignmen3/lists"}