Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Testing Automatic Text Summarization
https://github.com/0101011/analitika
hdf5 machine-learning natural-language-processing nlp nlp-machine-learning pickle python tokenizer transformers
- Host: GitHub
- URL: https://github.com/0101011/analitika
- Owner: 0101011
- License: mit
- Created: 2017-12-29T18:27:34.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-06T21:47:49.000Z (5 months ago)
- Last Synced: 2024-07-07T01:32:07.578Z (5 months ago)
- Topics: hdf5, machine-learning, natural-language-processing, nlp, nlp-machine-learning, pickle, python, tokenizer, transformers
- Language: Python
- Homepage:
- Size: 18.6 KB
- Stars: 15
- Watchers: 4
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🚀 Article Data Processing
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](CONTRIBUTING.md)

This is a fairly old script from my IBM Watson days that I'm trying to keep fresh :)
## 🌟 What does it do?
- 🎯 **Efficient Processing**: Tokenize and filter articles (now ***a little bit*** faster)
- 🧠 **Pre-trained Embeddings**: Load word vectors from `data/`
- 🔮 **Data Augmentation**: Expand your dataset
- 💾 **Storage**: HDF5 I/O
- 🛠 **Customizable**: Hopefully! ;)

## 🚀 Quick Start
1. Clone the repo:
```
git clone https://github.com/0101011/analitika.git
```

2. Install dependencies:
```
pip install -r requirements.txt
```

3. Run the script:
```
python analitika.py
```

## 📚 Table of Contents
- [Installation](#-installation)
- [Usage](#-usage)
- [Configuration](#-configuration)
- [Contributing](#-contributing)
- [License](#-license)

## 🎮 Usage
1. Place your `raw_data.json` in the `data/` directory
2. (Optional) Add pre-trained embeddings to `data/`
3. Run the script:
```
python analitika.py
```
4. Find processed data in `data/` as HDF5 and pickle files

## ⚙ Configuration
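These knobs typically live at module level; here is a minimal sketch of what they might look like, together with a tiny whitelist filter. The values are illustrative assumptions, not the script's actual defaults — check `analitika.py` for the real ones.

```python
# Illustrative configuration values -- not the script's actual defaults.
WHITELIST = set("abcdefghijklmnopqrstuvwxyz0123456789 .,!?")  # allowed characters
VOCAB_SIZE = 20000  # keep only the most frequent tokens
limit = {"min_length": 20, "max_length": 500}  # article length constraints

def filter_text(text: str) -> str:
    """Lowercase the text, then drop any character not in WHITELIST."""
    return "".join(ch for ch in text.lower() if ch in WHITELIST)

print(filter_text("Héllo, Wörld!"))  # → hllo, wrld!
```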
Customize the script by modifying these variables:
- `WHITELIST`: Allowed characters
- `VOCAB_SIZE`: Maximum vocabulary size
- `limit`: Length constraints for articles

## 🤝 Contributing
Here are some ways you can contribute:
- 💡 My goal was to turn this into a package or CLI tool. Maybe we'll come up with something together.
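As a starting point for that CLI idea, an `argparse` entry point might look like the sketch below. All flags and defaults here are hypothetical — they are not part of the current script.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI surface for analitika -- flags are illustrative only.
    parser = argparse.ArgumentParser(
        prog="analitika",
        description="Tokenize, filter, and store articles as HDF5/pickle.",
    )
    parser.add_argument("--data-dir", default="data",
                        help="directory containing raw_data.json")
    parser.add_argument("--vocab-size", type=int, default=20000,
                        help="maximum vocabulary size")
    return parser

def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    print(f"Processing {args.data_dir}/raw_data.json "
          f"with vocab size {args.vocab_size}")

if __name__ == "__main__":
    main()
```

Usage would then be `python -m analitika --data-dir data --vocab-size 20000`, with defaults matching today's directory layout.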
## 📜 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙌 Acknowledgements
- [NLTK](https://www.nltk.org/) for natural language processing
- [Gensim](https://radimrehurek.com/gensim/) for word embeddings
- [HDF5 for Python](https://www.h5py.org/) for efficient data storage

---
Made with ❤️ by [Your Name]