https://github.com/trbndev/wikibert

👨🏻‍🏫🚧 Experiment: Training a Neural Network with Wikipedia Articles

# 👨🏻‍🏫 Wikibert

This is an experimental weekend project for testing purposes.

## 📝 Overview

**Wikibert** demonstrates text generation with a neural network trained on Wikipedia articles. It has two components:
- 🐍 A Python script (`get_data.py`) that fetches Wikipedia pages and saves them to disk.
- 📓 A Jupyter notebook (`wikibert.ipynb`) that trains a text generation model with TensorFlow.

## 🛠️ get_data.py

The `get_data.py` script performs the following tasks:
- 🔄 Fetches a Wikipedia page and its linked pages recursively up to a specified depth.
- 💾 Saves the content of each page into individual text files in a `data` folder.
- 🧹 Sanitizes filenames to ensure compatibility with the file system.
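
The three tasks above can be sketched roughly as follows. This is a minimal illustration, not the code in `get_data.py`: the names `sanitize_filename` and `crawl` are hypothetical, and the set of characters treated as unsafe is an assumption. The `page` argument is expected to offer `title`, `text`, `links` (a mapping of titles to pages), and `exists()`, which mirrors the page objects returned by the `wikipedia-api` library.

```python
import re
from pathlib import Path
from typing import Optional


def sanitize_filename(title: str) -> str:
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/*?:"<>|]', "_", title).strip()
    return cleaned or "untitled"


def crawl(page, depth: int, out_dir: Path, seen: Optional[set] = None) -> set:
    """Save `page`, then follow its links recursively for `depth` more levels.

    Returns the set of page titles that were saved.
    """
    seen = set() if seen is None else seen
    # Skip pages that were already saved or that do not exist.
    if page.title in seen or not page.exists():
        return seen
    seen.add(page.title)

    # One text file per page, named after the sanitized title.
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{sanitize_filename(page.title)}.txt"
    target.write_text(page.text, encoding="utf-8")

    # Recurse into linked pages until the depth budget is used up.
    if depth > 0:
        for linked in page.links.values():
            crawl(linked, depth - 1, out_dir, seen)
    return seen
```

With `wikipedia-api` installed, a starting page could be obtained with something like `wikipediaapi.Wikipedia(user_agent="...", language="en").page("Some title")` and passed in as `crawl(page, 1, Path("data"))`. Keep the depth small, since the number of linked pages grows quickly with each level.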

## 📊 wikibert.ipynb

The `wikibert.ipynb` notebook includes the following steps:
- 🔍 Loads and preprocesses text data from the saved Wikipedia pages.
- 🏗️ Builds and trains a GRU-based RNN model for text generation using TensorFlow.
- 💾 Saves model checkpoints during training.
- 📝 Generates text using the trained model.
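
As an illustration of the first step only, the sketch below loads the saved text files and turns them into training examples. It assumes a character-level setup in which each target sequence is the input shifted by one character; the notebook may preprocess differently. The function names `load_corpus`, `build_vocab`, and `make_examples` are hypothetical, and the model, checkpointing, and generation steps are not shown.

```python
from pathlib import Path


def load_corpus(data_dir: str) -> str:
    """Concatenate every .txt file in `data_dir` into one string."""
    files = sorted(Path(data_dir).glob("*.txt"))
    return "\n".join(f.read_text(encoding="utf-8") for f in files)


def build_vocab(text: str):
    """Return (char_to_id, id_to_char) for the distinct characters in `text`."""
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    return char_to_id, chars


def make_examples(ids, seq_length: int):
    """Split `ids` into (input, target) pairs of length `seq_length`.

    Each target is its input shifted one position to the right, so the
    model learns to predict the next character at every step.
    """
    examples = []
    for start in range(0, len(ids) - seq_length, seq_length):
        chunk = ids[start:start + seq_length + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples
```

The resulting integer pairs are the kind of input a GRU-based next-character model consumes, typically after batching and shuffling.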

## 🚀 Usage

1. Run `get_data.py` to fetch and save Wikipedia pages.
2. Open `wikibert.ipynb` in Jupyter Notebook or Google Colab.
3. Follow the cells in the notebook to train the text generation model and generate text.

## 📋 Requirements

- Python 3.x
- `wikipedia-api` library
- TensorFlow
- Jupyter Notebook (for `wikibert.ipynb`)

## 🛠️ Installation

Install the required libraries using pip:
```bash
pip install wikipedia-api tensorflow
```

## 📄 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.