https://github.com/shamspias/html-content-processor-mysql

MySQL HTML Content Processor retrieves HTML content from a MySQL database, converts it to plain text, and saves it as text files with sanitized filenames.

Host: GitHub
URL: https://github.com/shamspias/html-content-processor-mysql
Owner: shamspias
Created: 2024-09-23T12:36:27.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-11-03T09:27:57.000Z (11 months ago)
Last Synced: 2025-01-31T13:15:45.384Z (8 months ago)
Language: Python
Size: 27.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# MySQL HTML Content Processor

This project connects to a MySQL database, retrieves HTML content from the `pechen_site_content` table, converts it to
plain text, and saves each entry as a text file named after the `pagetitle`. It is designed to handle large datasets and
allows you to resume processing from a specific ID in case of interruptions.

---

## Table of Contents

- [Project Structure](#project-structure)
- [Features](#features)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Configuration](#configuration)
- [Usage](#usage)
  - [Running the Application](#running-the-application)
  - [Resuming Processing](#resuming-processing)
- [Testing](#testing)
- [Project Details](#project-details)
  - [HTML to Text Conversion](#html-to-text-conversion)
  - [Database Processing](#database-processing)
- [Best Practices](#best-practices)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

---

## Project Structure

```
project/
├── .env
├── example.env
├── requirements.txt
├── README.md
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── db_processor.py
│   └── html_to_text_converter.py
└── tests/
    └── test_processor.py
```

- **`.env`**: Environment variables (database credentials). **Do not commit this file.**
- **`example.env`**: Template for `.env` without sensitive information.
- **`requirements.txt`**: Lists all Python dependencies.
- **`README.md`**: Documentation and instructions.
- **`src/`**: Source code.
  - **`__init__.py`**: Initializes the `src` package.
  - **`main.py`**: Entry point of the application.
  - **`db_processor.py`**: Contains the `DatabaseProcessor` class.
  - **`html_to_text_converter.py`**: Contains the `HTMLToTextConverter` class.
- **`tests/`**: Unit tests.
  - **`test_processor.py`**: Tests for the classes.

---

## Features

- **HTML to Text Conversion**: Converts HTML content to plain text, preserving structure using newlines.
- **Database Interaction**: Connects to a MySQL database and retrieves content.
- **File Output**: Saves each content entry to a text file named after the `pagetitle`.
- **Resume Capability**: Can resume processing from a specific ID.
- **Modular Design**: Clean separation of concerns with classes and modules.
- **Unit Testing**: Unit tests cover the converter and processor classes.

---

## Getting Started

### Prerequisites

- Python 3.6 or higher
- MySQL database access
- Pip package manager
- Virtual environment (optional but recommended)

### Installation

1. **Clone the Repository**

Copy the clone URL from GitHub (HTTPS or SSH), then run:

```bash
git clone https://github.com/shamspias/html-content-processor-mysql.git
cd html-content-processor-mysql
```

2. **Create and Activate a Virtual Environment**

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install Dependencies**

```bash
pip install -r requirements.txt
```

### Configuration

1. **Copy `example.env` to `.env`**

```bash
cp example.env .env
```

2. **Edit `.env` and Add Your Database Credentials**

Open `.env` in a text editor and configure:

```dotenv
# Database Configuration
DB_HOST=your_db_host
DB_USER=your_db_user
DB_PASSWORD=your_db_password
DB_NAME=your_db_name

# Starting ID (optional)
START_ID=0
```

- **`DB_HOST`**: Your database host (e.g., `localhost`).
- **`DB_USER`**: Your database username.
- **`DB_PASSWORD`**: Your database password.
- **`DB_NAME`**: Name of your database.
- **`START_ID`**: (Optional) ID to start processing from.
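
As a rough illustration of how these variables might be read at startup, the sketch below builds a settings dictionary from an environment mapping. The `load_settings` helper, its return shape, and the choice to treat a missing `START_ID` as `0` are assumptions for illustration and are not taken from the project's source.

```python
import os

# The four database keys documented above are treated as required.
REQUIRED_KEYS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")


def load_settings(env=None):
    """Hypothetical helper: collect database settings and the start ID."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing required settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    # START_ID is optional; an unset or empty value falls back to 0 (assumption).
    settings["START_ID"] = int(env.get("START_ID") or 0)
    return settings
```

In the application itself, python-dotenv's `load_dotenv()` would typically be called first so that the values from `.env` are present in `os.environ` before a helper like this runs.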

---

## Usage

### Running the Application

Navigate to the project directory and run:

```bash
python -m src.main
```

This command tells Python to execute the `main.py` script located in the `src` package.

### Resuming Processing

If the script stops and you need to resume:

1. **Note the Last Processed ID**

The script outputs the ID of each processed record.

2. **Update `START_ID` in `.env`**

```dotenv
START_ID=last_processed_id
```

3. **Rerun the Application**

```bash
python -m src.main
```
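
Resuming can be pictured as a filtered, ordered query. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The column names `id` and `content`, the `fetch_from` helper, and the choice to include the `START_ID` row itself are assumptions rather than details confirmed by the project.

```python
import sqlite3


def fetch_from(conn, start_id):
    """Return (id, pagetitle, content) rows with id >= start_id, in ID order."""
    cursor = conn.execute(
        "SELECT id, pagetitle, content FROM pechen_site_content "
        "WHERE id >= ? ORDER BY id",
        (start_id,),
    )
    return cursor.fetchall()


# Build a tiny in-memory table to demonstrate the behaviour.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pechen_site_content "
    "(id INTEGER PRIMARY KEY, pagetitle TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO pechen_site_content VALUES (?, ?, ?)",
    [(1, "Home", "<p>a</p>"), (2, "About", "<p>b</p>"), (3, "Contact", "<p>c</p>")],
)

# Resume from ID 2: the row with ID 1 is skipped.
for row_id, pagetitle, _content in fetch_from(conn, start_id=2):
    print(f"Processed ID {row_id}: {pagetitle}")
```

With mysql-connector-python the parameter placeholder is `%s` rather than `?`; the rest of the query would carry over unchanged.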

---

## Testing

To run the unit tests, execute:

```bash
python -m unittest discover tests
```

This command discovers and runs all tests in the `tests` directory.
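
For orientation, a database-free test might look like the sketch below, which uses `unittest.mock` in place of a real connection. The `process_all` function is a hypothetical stand-in defined inline so the example is self-contained; the interfaces of the project's `DatabaseProcessor` and `HTMLToTextConverter` are not shown here and may differ.

```python
import unittest
from unittest.mock import MagicMock


def process_all(connection, handle_row):
    """Hypothetical loop: handle each row and always close the connection."""
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT id, pagetitle, content FROM pechen_site_content")
        for row in cursor.fetchall():
            handle_row(row)
    finally:
        connection.close()


class ProcessAllTest(unittest.TestCase):
    def test_closes_connection_even_on_error(self):
        # The mock connection returns one fake row from fetchall().
        connection = MagicMock()
        connection.cursor.return_value.fetchall.return_value = [
            (1, "Home", "<p>x</p>")
        ]
        # The row handler fails, simulating an error during processing.
        handler = MagicMock(side_effect=ValueError("boom"))
        with self.assertRaises(ValueError):
            process_all(connection, handler)
        connection.close.assert_called_once()
```

Saved under `tests/` with a filename starting with `test`, a class like this would be picked up by the discovery command above.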

---

## Project Details

### HTML to Text Conversion

The `HTMLToTextConverter` class in `html_to_text_converter.py`:

- **Purpose**: Converts HTML content to plain text.
- **Features**:
  - Parses HTML using BeautifulSoup.
  - Inserts newlines at appropriate tags (headers, etc.).
  - Removes HTML tags while preserving text content.
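
To illustrate the newline-insertion idea, the sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The class name `SimpleTextExtractor`, the `html_to_text` helper, and the set of tags treated as line-breaking are assumptions for illustration, not the project's `HTMLToTextConverter`.

```python
import re
from html.parser import HTMLParser

# Closing one of these tags ends a line (an illustrative choice of tags).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class SimpleTextExtractor(HTMLParser):
    """Collect text nodes and add newlines at line-breaking tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the newline is added when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string, keeping text and line structure."""
    parser = SimpleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    # Collapse runs of blank lines and trim surrounding whitespace.
    return re.sub(r"\n\s*\n+", "\n", text).strip()


print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p><p>Line<br>two</p>"))
```

Inline tags such as `<b>` contribute only their text, while headers, paragraphs, and line breaks each end a line.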

### Database Processing

The `DatabaseProcessor` class in `db_processor.py`:

- **Purpose**: Handles database connections and processes records.
- **Features**:
  - Connects to MySQL using credentials from `.env`.
  - Retrieves records from `pechen_site_content` starting from `START_ID`.
  - Converts HTML content to text using `HTMLToTextConverter`.
  - Saves content to text files named after the sanitized `pagetitle`.
  - Handles exceptions and ensures the database connection is closed properly.
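
The file-saving step might look like the sketch below. The rule for what counts as an unsafe character, the length limit, the `untitled` fallback, and the helper names `sanitize_filename` and `save_text` are all assumptions; the project's own sanitization rules may differ.

```python
import re
from pathlib import Path


def sanitize_filename(title, max_length=100):
    """Replace characters that are invalid on common filesystems."""
    cleaned = re.sub(r'[\\/*?:"<>|\x00-\x1f]', "_", title).strip(" .")
    # Fall back to a fixed name if nothing usable remains.
    return cleaned[:max_length] or "untitled"


def save_text(output_dir, pagetitle, text):
    """Write text to <output_dir>/<sanitized pagetitle>.txt and return the path."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (sanitize_filename(pagetitle) + ".txt")
    path.write_text(text, encoding="utf-8")
    return path
```

Note that in this sketch two records with the same `pagetitle` would overwrite each other; appending the record ID to the filename is one way to avoid that.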

---

## Best Practices

- **Environment Variables**: Use `.env` to store sensitive information.
- **Modular Code**: Organized into reusable modules and classes.
- **Testing**: Includes unit tests to ensure code reliability.
- **Logging**: Print statements provide progress updates; consider using the `logging` module for production.
- **Error Handling**: Exceptions are handled during processing, and the database connection is closed even when an error occurs.
- **Version Control**: Use `.gitignore` to exclude sensitive files and directories.
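
As one possible way to act on the logging suggestion, the sketch below replaces a progress `print` with the standard `logging` module. The logger name `html_processor`, the format string, and the `configure_logging` and `report_progress` helpers are illustrative choices, not part of the project.

```python
import logging
import sys

logger = logging.getLogger("html_processor")


def configure_logging(level=logging.INFO, stream=None):
    """Attach a single timestamped stream handler to the logger."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    # Remove earlier handlers so repeated calls do not duplicate output.
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(level)


def report_progress(record_id, pagetitle):
    """Log a processed record so its ID can later be used as START_ID."""
    logger.info("Processed ID %s (%s)", record_id, pagetitle)


configure_logging()
report_progress(7, "Home")
```

Compared with `print`, this adds timestamps and severity levels and lets the output be redirected to a file by swapping the handler.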

---

## Dependencies

- **mysql-connector-python**: For connecting to the MySQL database.
- **beautifulsoup4**: For parsing and converting HTML content.
- **python-dotenv**: For loading environment variables from the `.env` file.

Install all dependencies using:

```bash
pip install -r requirements.txt
```

---

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a new branch:

```bash
git checkout -b feature/your-feature-name
```

3. Make your changes and commit:

```bash
git commit -am 'Add new feature'
```

4. Push to the branch:

```bash
git push origin feature/your-feature-name
```

5. Open a Pull Request.

---

## Acknowledgments

- **BeautifulSoup**: For making HTML parsing easy.
- **MySQL Connector/Python**: For facilitating database interactions.
- **Python Community**: For continuous support and resources.

---

## Contact

For any questions or issues, please open an issue on the repository or contact the maintainer.
