https://github.com/samestrin/pdf-extract-api-digitalocean
A Node.js based REST PDF Text Extraction API using pdf-parse.
- Host: GitHub
- URL: https://github.com/samestrin/pdf-extract-api-digitalocean
- Owner: samestrin
- License: mit
- Created: 2024-05-01T00:12:19.000Z
- Default Branch: main
- Last Pushed: 2024-05-08T01:59:35.000Z
- Last Synced: 2025-01-22T17:46:52.499Z
- Topics: api, node, nodejs, ocr, pdf, pdf-parse, rest
- Language: JavaScript
- Homepage:
- Size: 2.79 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# pdf-extract-api-digitalocean
This project implements a simulated Optical Character Recognition (OCR) service: it extracts the text embedded in PDF files uploaded by users. Built with Node.js, Express, Multer, and pdf-parse, it is designed to be easy to set up and to integrate into other systems that need PDF text extraction.
## Features
- **PDF Text Extraction**: Allows users to upload PDF files and extracts readable text from them.
- **File Upload Management**: Utilizes Multer for efficient handling of file uploads with customizable storage options.
- **Error Handling**: Returns meaningful error responses to the client (400 for invalid requests, 500 for unexpected server errors).
## Dependencies
- **Node.js**: The script runs in a Node.js environment.
- **express**: Web framework for Node.js.
- **multer**: Middleware for handling multipart/form-data, used for uploading files.
- **pdf-parse**: Library to parse and extract text from PDF files.
- **fs.promises**: The promise-based API of the Node.js File System module, used for file operations.
- **path**: Utilities for handling and transforming file paths.
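As an illustration of how the two built-in modules in this list are commonly combined, here is a minimal sketch that reads an upload from disk and then deletes it. It assumes uploads are written to a directory before parsing; the `readAndRemove` helper and the temporary directory are hypothetical and are not taken from the repository.

```javascript
const fs = require('node:fs/promises');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: read an uploaded file into memory, then delete it
// so temporary uploads do not accumulate on disk.
async function readAndRemove(uploadDir, filename) {
  // path.basename strips directory parts, guarding against "../" in names.
  const filePath = path.join(uploadDir, path.basename(filename));
  try {
    return await fs.readFile(filePath);
  } finally {
    // force: true makes cleanup a no-op if the file is already gone.
    await fs.rm(filePath, { force: true });
  }
}

// Demo with a throwaway file in the system temp directory.
async function demo() {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'uploads-'));
  await fs.writeFile(path.join(dir, 'sample.pdf'), '%PDF-1.4');
  const bytes = await readAndRemove(dir, 'sample.pdf');
  console.log(`Read ${bytes.length} bytes`);
  await fs.rm(dir, { recursive: true, force: true });
}

demo().catch(console.error);
```

The returned `Buffer` is the kind of input a PDF parser would then receive; the parsing step itself is left out of this sketch.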
## Installing Node.js
Before installing, make sure Node.js and npm (Node Package Manager) are installed on your system. You can download Node.js, which includes npm, from the [official Node.js website](https://nodejs.org/).
## Installing pdf-extract-api-digitalocean
To install and use pdf-extract-api-digitalocean, follow these steps:
Clone the repository to your local machine:
```bash
git clone https://github.com/samestrin/pdf-extract-api-digitalocean/
```
Set the `PORT` environment variable to define the port on which the server listens. The default is 3000.
Navigate to the project's root directory, install the dependencies with `npm install` if you have not already done so, and start the server:
```bash
npm start
```
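The port selection described above can be sketched as follows. This is a minimal illustration, assuming the server reads `process.env.PORT` and falls back to 3000; the `resolvePort` helper is hypothetical and not taken from the repository.

```javascript
// Hypothetical helper: pick the listening port from the environment.
// Falls back to 3000 (the documented default) when PORT is unset or invalid.
function resolvePort(env = process.env) {
  const parsed = Number.parseInt(env.PORT, 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : 3000;
}

console.log(`Server would listen on port ${resolvePort()}`);
```

On a Unix-like shell, the variable can be set for a single run with `PORT=8080 npm start`.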
## **Endpoints**
### **Extract**
**Endpoint:** `/extract` **Method:** POST
Extract text from a PDF file.
#### **Parameters**
- `file`: The PDF file to extract text from, sent as a `multipart/form-data` field.
## **Example Usage**
Use a tool like Postman or curl to make a request, replacing `[PORT]` with the port the server listens on (3000 by default):
```bash
curl -F "file=@path_to_pdf_file.pdf" http://localhost:[PORT]/extract
```
The server will process the uploaded file and return the extracted text in JSON format.
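The same request can be made from Node.js. The sketch below assumes Node.js 18 or later (for the global `fetch`, `FormData`, and `Blob`) and a server on `http://localhost:3000`. The helper names are hypothetical. Because the shape of the JSON response is not documented here, the code returns whatever the server sends back.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Build the multipart body; the field name "file" matches the documented parameter.
function buildExtractForm(pdfBytes, filename) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), filename);
  return form;
}

// POST a local PDF to /extract and return the parsed JSON body.
async function extractText(pdfPath, baseUrl = 'http://localhost:3000') {
  const pdfBytes = await fs.readFile(pdfPath);
  const form = buildExtractForm(pdfBytes, path.basename(pdfPath));
  const response = await fetch(`${baseUrl}/extract`, { method: 'POST', body: form });
  if (!response.ok) {
    throw new Error(`Extraction failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Example call (requires a running server):
// extractText('path_to_pdf_file.pdf').then(console.log).catch(console.error);
```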
## **Error Handling**
When a request fails, the API returns one of the following HTTP status codes:
- **400 Bad Request**: Invalid request parameters.
- **500 Internal Server Error**: Unexpected server error.
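One way a server could map failures onto these two status codes is sketched below. This is an illustration only: the `toErrorResponse` and `requireFile` helpers, the `BAD_REQUEST` code, and the `{ error: ... }` body shape are all assumptions, not the project's actual implementation or response contract.

```javascript
// Hypothetical mapping from a failure to the documented status codes.
// The { error: ... } body shape is an assumption, not the project's contract.
function toErrorResponse(err) {
  // Problems caused by the request itself become 400 Bad Request.
  if (err && err.code === 'BAD_REQUEST') {
    return { status: 400, body: { error: err.message } };
  }
  // Anything else is reported as 500 without exposing internal details.
  return { status: 500, body: { error: 'Internal Server Error' } };
}

// Hypothetical validation: the "file" parameter must be present.
function requireFile(file) {
  if (!file) {
    const err = new Error('Missing "file" parameter');
    err.code = 'BAD_REQUEST';
    throw err;
  }
  return file;
}

// A request without a file is reported as a client error.
try {
  requireFile(undefined);
} catch (err) {
  console.log(toErrorResponse(err));
}
```

Keeping the mapping in one function means every route reports failures the same way, and unexpected errors never expose stack traces to the client.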
## Contribute
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes or improvements.
## License
This project is licensed under the MIT License - see the LICENSE file for details.