https://github.com/demic-dev/bioinfo-tfidf
https://github.com/demic-dev/bioinfo-tfidf
Last synced: about 1 year ago
JSON representation
- Host: GitHub
- URL: https://github.com/demic-dev/bioinfo-tfidf
- Owner: demic-dev
- Created: 2024-01-31T08:55:14.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-01-31T09:00:08.000Z (over 2 years ago)
- Last Synced: 2025-04-12T00:50:00.501Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 569 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Surname: **De Cillis**
Name: **Michele**
Mail address: [michele.de-cillis@universite-paris-saclay.fr](mail-to:michele.de-cillis@universite-paris-saclay.fr)
Student Number: **22311787**
[GitHub repo.](https://github.com/demic-dev/bioinfo-tfidf)
# TF-IDF
During the class, we've learnt the TF-IDF notation, which represents the importance of a word in a collection of documents. It's really useful because, given a corpus, we can classify the documents inside or understand the key-words and the topics about them.
In this homework I'm going to analyze a small (around ~1000) dataset which contains abstracts of medical paper about COVID-19 [source](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge). First, I've reduced the dataset from two millions to six hundered (reduced of the 99,97%) and then I loaded it into my notebook.
Once I loaded it, I started to setup my environnment by loading all the libraries and by coding a preprocess function, where I give my abstract and it starts to tokenize first and then lemmatize the retrieved tokens.
Then, I called the function `TfidfVectorizer` of `scikit-learn` and I printed the results with `panda`.
I've tried to use the function multiple times, with a different size of dataset. First with only **1** document, then with **10** documents and in the end with **50** documents. Here's my results:
## 1 document
By passing only one document, we can see how all the tokens have a percentage bigger than 0 and the words with a bigger frequency (such as the stop words) has an increased percentage.
This is expected, because in the end, since there is only one document, the _idf_ is going to be useless and it will represent only the term frequency.
## 10 documents
Now, we're starting to see a lot more zeros for each term. Plus, the overall percentages are lower, because we're giving more documents. We can see for each documents, beyond the stop words, that the higher percentages are for the keywords.
## 50 documents
With 50 documents, the results are similiar. Of course, on some items where in the previous case there was a percentage, in this case the percentage is different because the number of documents is higher and a word can appear in other documents.
Still, there is a problem that partially invalidates the results just discovered. The stop words appear too often and infect the results just found.
So, during the word preprocessing I added an `if` which removes the stop words, if found.
Now, in each case there will be less columns because the common english terms are removed and now there are tokens only relative to the content of the abstract.