{"id":27299947,"url":"https://github.com/demic-dev/bioinfo-tfidf","last_synced_at":"2025-04-12T00:50:04.566Z","repository":{"id":220099560,"uuid":"750754361","full_name":"demic-dev/bioinfo-tfidf","owner":"demic-dev","description":null,"archived":false,"fork":false,"pushed_at":"2024-01-31T09:00:08.000Z","size":583,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T00:50:00.501Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/demic-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-31T08:55:14.000Z","updated_at":"2024-01-31T08:57:54.000Z","dependencies_parsed_at":"2024-01-31T10:10:18.313Z","dependency_job_id":null,"html_url":"https://github.com/demic-dev/bioinfo-tfidf","commit_stats":null,"previous_names":["demic-dev/bioinfo-tfidf"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/demic-dev%2Fbioinfo-tfidf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/demic-dev%2Fbioinfo-tfidf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/demic-dev%2Fbioinfo-tfidf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/demic-dev%2Fbioinfo-tfidf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/demic-dev","download_url":"https://codeload.github.com/demic-dev/bioinfo-tfidf/tar.gz/refs/heads/main","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":248501902,"owners_count":21114681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-12T00:50:03.991Z","updated_at":"2025-04-12T00:50:04.553Z","avatar_url":"https://github.com/demic-dev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"Surname: **De Cillis**\nName: **Michele**\n\nMail address: [michele.de-cillis@universite-paris-saclay.fr](mail-to:michele.de-cillis@universite-paris-saclay.fr)\n\nStudent Number: **22311787**\n\n[GitHub repo.](https://github.com/demic-dev/bioinfo-tfidf)\n\n# TF-IDF\n\nDuring the class, we learned about TF-IDF, a measure of how important a word is to a document within a collection of documents. It's really useful because, given a corpus, we can classify its documents or identify their keywords and topics.\n\nIn this homework I'm going to analyze a small dataset (around 1000 entries) containing abstracts of medical papers about COVID-19 ([source](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge)). 
First, I reduced the dataset from two million entries to six hundred (a reduction of 99.97%) and then I loaded it into my notebook.\n\nOnce it was loaded, I set up my environment by importing all the libraries and writing a preprocessing function, which takes an abstract, tokenizes it, and then lemmatizes the resulting tokens.\n\nThen, I used the `TfidfVectorizer` class from `scikit-learn` and printed the results with `pandas`.\n\nI ran the vectorizer several times, with datasets of different sizes: first with only **1** document, then with **10** documents, and finally with **50** documents. Here are my results:\n\n## 1 document\n\nWhen passing only one document, all the tokens have a score greater than 0, and the more frequent words (such as the stop words) have a higher score.\n\nThis is expected: since there is only one document, the _idf_ factor is the same for every term, so the score reflects only the term frequency.\n\n## 10 documents\n\nNow, we're starting to see a lot more zeros for each term. Plus, the overall scores are lower, because we're passing more documents. For each document, apart from the stop words, the highest scores belong to the keywords.\n\n## 50 documents\n\nWith 50 documents, the results are similar. Of course, some terms that had a given score in the previous case now have a different one, because the number of documents is higher and a word can appear in other documents.\n\nStill, there is a problem that partially invalidates the results just discovered. 
The stop words appear too often and distort the results just found.\n\nSo, in the word preprocessing step I added an `if` check that removes any stop words found.\n\nNow, in each case there are fewer columns, because the common English terms are removed and the remaining tokens relate only to the content of the abstracts.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdemic-dev%2Fbioinfo-tfidf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdemic-dev%2Fbioinfo-tfidf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdemic-dev%2Fbioinfo-tfidf/lists"}