https://github.com/0101011/analitika

Testing Automatic Text Summarization

hdf5 machine-learning natural-language-processing nlp nlp-machine-learning pickle python tokenizer transformers



Host: GitHub
URL: https://github.com/0101011/analitika
Owner: 0101011
License: mit
Created: 2017-12-29T18:27:34.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-08-20T16:46:27.000Z (10 months ago)
Last Synced: 2025-04-20T19:36:47.036Z (about 2 months ago)
Topics: hdf5, machine-learning, natural-language-processing, nlp, nlp-machine-learning, pickle, python, tokenizer, transformers
Language: Python
Homepage:
Size: 19.5 KB
Stars: 17
Watchers: 2
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# 🚀 Article Data Processing

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)

[![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](CONTRIBUTING.md)

This is a really old script from the IBM Watson days that I'm trying to keep fresh :)

## 🌟 So, what does it do?

- 🎯 **Efficient Processing**: Tokenize and filter articles (now ***a little bit*** faster)

- 🧠 **Pre-trained Embeddings**: Optionally use pre-trained word embeddings placed in `data/`

- 🔮 **Data Augmentation**: Expand your dataset

- 💾 **Storage**: HDF5 input and output

- 🛠 **Customizable**: Hopefully! ;)

## 🚀 Quick Start

1. Clone the repo:

   ```
   git clone https://github.com/0101011/analitika.git
   ```

2. Install dependencies:

   ```
   pip install -r requirements.txt
   ```

3. Run the script:

   ```
   python analitika.py
   ```

## 📚 Table of Contents

- [Quick Start](#-quick-start)

- [Usage](#-usage)

- [Configuration](#-configuration)

- [Contributing](#-contributing)

- [License](#-license)

## 🎮 Usage

1. Place your `raw_data.json` in the `data/` directory

2. (Optional) Add pre-trained embeddings to `data/`

3. Run the script:

   ```
   python analitika.py
   ```

4. Find processed data in `data/` as HDF5 and pickle files
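
To check what a run produced, a small helper like the one below can list the output files and load the pickled ones. This is only a sketch: the function names `list_outputs` and `load_pickle` are hypothetical, and the file suffixes (`.h5`, `.hdf5`, `.pkl`, `.pickle`) are assumptions, since the README does not name the output files.

```python
import pickle
from pathlib import Path


def list_outputs(data_dir="data"):
    """Group the files in data_dir by kind, based on assumed suffixes."""
    groups = {"hdf5": [], "pickle": []}
    for path in sorted(Path(data_dir).iterdir()):
        if path.suffix in {".h5", ".hdf5"}:      # assumed HDF5 suffixes
            groups["hdf5"].append(path.name)
        elif path.suffix in {".pkl", ".pickle"}:  # assumed pickle suffixes
            groups["pickle"].append(path.name)
    return groups


def load_pickle(path):
    """Load one pickle file. Only unpickle files you produced yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Calling `list_outputs("data")` after a run returns the file names grouped by kind. The HDF5 files can then be opened with `h5py.File(path, "r")` from the h5py package listed in the acknowledgements.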

## ⚙ Configuration

Customize the script by modifying these variables:

- `WHITELIST`: Allowed characters

- `VOCAB_SIZE`: Maximum vocabulary size

- `limit`: Length constraints for articles
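
To make the three settings concrete, here is a minimal sketch of how they might interact. The values, the dictionary shape of `limit`, and the helper functions are all assumptions for illustration; check `analitika.py` for the real definitions.

```python
from collections import Counter

# Hypothetical values; the script's actual defaults may differ.
WHITELIST = "abcdefghijklmnopqrstuvwxyz0123456789 "
VOCAB_SIZE = 5
limit = {"min_tokens": 2, "max_tokens": 8}  # assumed shape of `limit`


def clean(text):
    """Lowercase the text and drop every character not in WHITELIST."""
    return "".join(ch for ch in text.lower() if ch in WHITELIST)


def within_limit(tokens):
    """Keep only articles whose token count falls inside `limit`."""
    return limit["min_tokens"] <= len(tokens) <= limit["max_tokens"]


def build_vocab(token_lists):
    """Return the VOCAB_SIZE most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [word for word, _ in counts.most_common(VOCAB_SIZE)]


articles = ["Hello, World! Hello again.", "Too short", "ok"]
tokenized = [clean(a).split() for a in articles]
kept = [toks for toks in tokenized if within_limit(toks)]
vocab = build_vocab(kept)
```

In this toy run the one-word article is dropped by the length limit, and punctuation is stripped by the whitelist before the vocabulary is counted.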

## 🤝 Contributing

Here is one way you can contribute:

- 💡 Help turn the script into a package or CLI tool, which was my original goal. Maybe we'll come up with something.

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙌 Acknowledgements

- [NLTK](https://www.nltk.org/) for natural language processing

- [Gensim](https://radimrehurek.com/gensim/) for word embeddings

- [HDF5 for Python](https://www.h5py.org/) for efficient data storage

---



  Made with ❤️ by [Your Name]
