Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dhairyac/news-article-classification
Use Hugging Face Transformers to classify news articles
https://github.com/dhairyac/news-article-classification
albert-transformer huggingface-transformers keras tensorflow tokenization
Last synced: about 2 months ago
JSON representation
Use Hugging Face Transformers to classify news articles
- Host: GitHub
- URL: https://github.com/dhairyac/news-article-classification
- Owner: DhairyaC
- Created: 2024-05-19T16:32:04.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-07T09:38:27.000Z (7 months ago)
- Last Synced: 2024-06-07T11:01:05.772Z (7 months ago)
- Topics: albert-transformer, huggingface-transformers, keras, tensorflow, tokenization
- Language: Jupyter Notebook
- Homepage:
- Size: 35.2 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Objectives
1. Perform web scraping to extract data (Business and News articles) from News API. It should contain at least 1000 words in total and at least two categories with at least 100 examples per category.
2. Split the dataset into training (at least 160 examples) and test (at least 40 examples) sets.
3. Fine-tune a pretrained language model (Albert-base-v2) capable of generating text on this dataset (the one I created in part 1). Report the test accuracy. Discuss what could be done to improve accuracy.
4. Try a couple of different language models (GPT-J and GPT-Sw3) to gain a better understanding.## Techniques/Models
1. Techniques:
- Web Scraping
- Tokenization
- Model optimization
- Performance evaluation
2. Models:
- Albert-base-v2 (a pre-trained model for sequence classification from Hugging Face Transformers)
- GPT-J
- GPT-Sw3## Findings
1. After training the Albert-base-v2 model with 20 epochs, we observe a decrease in the training loss at each step. The accuracy obtained was 85%, which is higher than the accuracy obtained with 4 epochs. However, we could have achieved even higher accuracy if we had used a larger model or added regularization techniques like dropout to prevent overfitting. We could further experiment different batch sizes. A smaller batch size may allow the model to generalize better.
2. GPT-J is computationally expensive since the model requires around 24.2 GB of memory space to download. It is too big to be used on regular hardware: wouldn’t fit in RAM.
3. GPT-Sw3 is not released publicly: requires token access.