https://github.com/notshrirang/data-extractor-app
This Python script is designed to extract structured data from PDF files containing information such as Company Identification Number (CIN), email addresses, PAN (Permanent Account Number), phone numbers, dates, and websites. The script utilizes the PyPDF2 library for PDF processing and multiprocessing for efficient extraction from multiple PDFs.
- Host: GitHub
- URL: https://github.com/notshrirang/data-extractor-app
- Owner: NotShrirang
- Created: 2024-01-31T09:44:11.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-02T06:25:53.000Z (over 2 years ago)
- Last Synced: 2025-07-16T05:13:43.292Z (11 months ago)
- Topics: multiprocessing, pypdf2, selenium
- Language: Python
- Homepage:
- Size: 1.58 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Extractor App
This Python script extracts structured data from PDF files: Company Identification Number (CIN), email addresses, PAN (Permanent Account Number), phone numbers, dates, and websites. It uses the PyPDF2 library for PDF processing and multiprocessing to extract from multiple PDFs efficiently.
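As a rough illustration of this kind of extraction, the sketch below reads text from a PDF with PyPDF2's `PdfReader` and matches fields with regular expressions. The patterns and the names `PATTERNS`, `extract_fields`, and `pdf_text` are assumptions made for this example, not the repository's actual code; real phone numbers, dates, and websites vary far more than these patterns allow.

```python
import re

# Illustrative patterns only (assumptions, not taken from the repository).
PATTERNS = {
    # CIN: L or U, 5 digits, 2-letter state code, 4-digit year, 3 letters, 6 digits
    "cin": re.compile(r"\b[LU]\d{5}[A-Z]{2}\d{4}[A-Z]{3}\d{6}\b"),
    # PAN: 5 letters, 4 digits, 1 letter
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Phone: optional +91 prefix followed by a 10-digit number (assumed format)
    "phone": re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)"),
    # Date: DD/MM/YYYY or DD-MM-YYYY (assumed format)
    "date": re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b"),
    "website": re.compile(r"\b(?:https?://|www\.)[^\s,;]+", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict mapping each field name to its unique matches, in order."""
    results = {}
    for name, pattern in PATTERNS.items():
        # dict.fromkeys removes duplicates while preserving first-seen order
        results[name] = list(dict.fromkeys(pattern.findall(text)))
    return results


def pdf_text(path):
    """Concatenate the text of every page in a PDF using PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily so extract_fields works without it

    reader = PdfReader(path)
    # extract_text() may return None or an empty string for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Calling `extract_fields(pdf_text("report.pdf"))` (with a hypothetical file name) would then return one list of matches per field.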
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
## Prerequisites
- Python 3.x
- Required Python libraries (install via `pip install -r requirements.txt`):
  - `selenium`
  - `PyPDF2`
## Installation
1. Clone the repository:
```bash
git clone https://github.com/NotShrirang/Data-Extractor-App.git
```
2. Navigate to the project directory:
```bash
cd Data-Extractor-App
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Edit the `config.json` file to configure URLs for PDFs.
2. Run the main script:
```bash
python main.py
```
To run with `multiprocessing` instead, pass it as an argument:
```bash
python main.py multiprocessing
```
3. The extracted data is saved as `output.json` in the project directory.
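The two run modes above could be wired up roughly as follows. This is a minimal sketch, not the repository's `main.py`: the function names `process_pdf`, `run`, and `mode_from_argv`, the placeholder worker, and the shape of the JSON output are all assumptions made for illustration.

```python
import json
from multiprocessing import Pool


def process_pdf(url):
    """Hypothetical per-PDF worker; a real one would download and parse the file."""
    return {"url": url, "fields": {}}


def mode_from_argv(argv):
    """Pick the run mode from command-line arguments such as sys.argv."""
    return "multiprocessing" if "multiprocessing" in argv[1:] else "sequential"


def run(urls, mode="sequential", output_path="output.json"):
    """Process every URL, optionally in parallel, and save the results as JSON."""
    if mode == "multiprocessing":
        # Pool() defaults to one worker process per CPU core
        with Pool() as pool:
            results = pool.map(process_pdf, urls)
    else:
        results = [process_pdf(url) for url in urls]
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
    return results
```

A script entry point would call `run(urls, mode_from_argv(sys.argv))` under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that spawn worker processes rather than fork them.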
## Configuration
- **config.json**: This file holds the script's configuration: the list of PDF URLs and `page_count`.
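The exact schema is not documented here, so the loader below is only a sketch under stated assumptions: the key name `urls` is hypothetical, and `page_count` is assumed to be a positive integer. Check the repository's own `config.json` for the real layout.

```python
import json


def load_config(path="config.json"):
    """Load the config and check the two settings the README mentions.

    Assumed layout (hypothetical):
        {"urls": ["https://example.com/a.pdf"], "page_count": 2}
    """
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    urls = config.get("urls")  # "urls" is an assumed key name
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")

    page_count = config.get("page_count")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(page_count, bool) or not isinstance(page_count, int) or page_count < 1:
        raise ValueError("'page_count' must be a positive integer")

    return config
```

Validating up front means a malformed config fails with a clear message before any PDF is processed.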