https://github.com/notshrirang/data-extractor-app

This Python script extracts structured data from PDF files: Company Identification Numbers (CIN), email addresses, PANs (Permanent Account Numbers), phone numbers, dates, and websites. It uses the PyPDF2 library for PDF processing and `multiprocessing` to extract from multiple PDFs efficiently.

Host: GitHub
URL: https://github.com/notshrirang/data-extractor-app
Owner: NotShrirang
Created: 2024-01-31T09:44:11.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-02T06:25:53.000Z (over 2 years ago)
Last Synced: 2025-07-16T05:13:43.292Z (11 months ago)
Topics: multiprocessing, pypdf2, selenium
Language: Python
Homepage:
Size: 1.58 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Data Extractor App

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)

## Prerequisites

- Python 3.x
- Required Python libraries (install via `pip install -r requirements.txt`):
  - `selenium`
  - `PyPDF2`

## Installation

1. Clone the repository:

```bash
git clone https://github.com/NotShrirang/Data-Extractor-App.git
```

2. Navigate to the project directory:

```bash
cd Data-Extractor-App
```

3. Install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

1. Edit the `config.json` file to configure URLs for PDFs.

2. Run the main script:

```bash
python main.py
```

To run the extraction with `multiprocessing`, pass it as an argument:

```bash
python main.py multiprocessing
```

3. The extracted data is saved as `output.json` in the project directory.
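
The steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual `main.py`: the function names, the regular expressions, and the shape of the JSON output are all assumptions made for illustration, and the sample strings stand in for text that the real script would pull from PDFs with PyPDF2.

```python
import json
import re
import sys
from multiprocessing import Pool

# Illustrative patterns only; the project's real patterns are not shown in this README.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # PAN: five letters, four digits, one letter.
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # CIN: 21 characters (L/U, 5 digits, 2 letters, 4 digits, 3 letters, 6 digits).
    "cin": re.compile(r"\b[LU][0-9]{5}[A-Z]{2}[0-9]{4}[A-Z]{3}[0-9]{6}\b"),
}


def extract_fields(text):
    """Return the unique matches for each pattern, sorted for stable output."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}


def extract_all(texts, use_multiprocessing=False):
    """Extract fields from every text, optionally spreading work over processes."""
    if use_multiprocessing:
        with Pool() as pool:
            return pool.map(extract_fields, texts)
    return [extract_fields(text) for text in texts]


def wants_multiprocessing(argv):
    """True when the script was started as `python main.py multiprocessing`."""
    return len(argv) > 1 and argv[1] == "multiprocessing"


def main(argv, output_path="output.json"):
    # Placeholder input; the real script would read text out of the configured PDFs.
    texts = ["Contact info@example.com", "PAN ABCDE1234F"]
    results = extract_all(texts, wants_multiprocessing(argv))
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)


if __name__ == "__main__":
    main(sys.argv)
```

The `if __name__ == "__main__":` guard matters when worker processes are started by re-importing the main module (the default on Windows and macOS), because it stops each worker from launching the whole script again.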

## Configuration

- **config.json**: Holds the script's configuration: the list of PDF URLs and `page_count`.
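
As a rough illustration of reading such a file, the sketch below loads `config.json` and returns the two settings. The README does not show the file's actual layout, so the `urls` key name is an assumption, as is treating `page_count` as an optional value; `load_config` is a hypothetical helper, not a function from the project.

```python
import json


def load_config(path="config.json"):
    """Load the PDF URL list and page_count from a JSON config file.

    Assumed layout (hypothetical; check the repository's own config.json):
        {"urls": ["https://example.com/a.pdf"], "page_count": 3}
    """
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    urls = config.get("urls", [])          # "urls" is an assumed key name
    page_count = config.get("page_count")  # key named in the README; meaning not documented

    if not isinstance(urls, list):
        raise ValueError("expected the URL setting to be a JSON list")
    return urls, page_count
```

A caller would then unpack the result, for example `urls, page_count = load_config()`.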
