https://github.com/mittalsoni00/filereader
Java PdfReader API is a Spring Boot-based application that extracts text from PDF files using the Apache PDFBox library. It provides a REST API to upload PDFs and retrieve their extracted text. This project simplifies text extraction for various applications like document processing and data analysis.
- Host: GitHub
- URL: https://github.com/mittalsoni00/filereader
- Owner: mittalsoni00
- Created: 2025-03-24T18:24:36.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-08T10:45:18.000Z (about 1 year ago)
- Last Synced: 2025-04-08T11:34:10.801Z (about 1 year ago)
- Topics: github-config, java, maven, pdfbox, postman, spring, spring-boot, spring-web
- Homepage: http://localhost:8080/api/parse-pdf
- Size: 43.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Java PdfReader API using LLM Interaction
A Spring Boot application that lets users upload a **CASA bank statement PDF** and receive extracted details such as **Name**, **Email**, **Opening Balance**, and **Closing Balance**, using the **OpenAI GPT-4o** model.
**Live Hosted API**: [https://pdfreader-fped.onrender.com/api/parse-pdf](https://pdfreader-fped.onrender.com/api/parse-pdf)
---
## Switch to Master Branch
**Note:** Please switch to the `master` branch to access all the documentation and source files.
## Project Structure
### Key Files
| File | Description |
|------|-------------|
| `PdfController.java` | Main controller that handles file upload and calls service to process PDF |
| `PdfService.java` | Extracts text from the PDF and uses OpenAI API for intelligent data parsing |
| `OpenAiService.java` | Integrates with OpenAI GPT-4o via REST call using Spring's `RestTemplate` |
| `application.properties` | Stores server config and API key (linked via environment variable for security) |
| `Dockerfile` | Dockerized for public deployment on Render |
| `ChatController.java` | [Debug Endpoint] Allows sending prompt manually to OpenAI via JSON |
| `ChatPromptDTO.java` | DTO for accepting request body in JSON format |
| `PdfParserUtil.java` | Optional utility class to aid in raw PDF parsing |
The API can be accessed via Postman or curl. (I used Postman; instructions are below.)
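To make the `OpenAiService.java` row more concrete, here is a minimal sketch of the kind of request such a service sends. It is not the repository's code: the project uses Spring's `RestTemplate`, while this sketch swaps in the JDK's `java.net.http.HttpClient` types so it runs without Spring. The class name `OpenAiRequestSketch`, the prompt wording, and the hand-rolled JSON escaping are assumptions for illustration; the endpoint URL and the `model`/`messages` body shape follow OpenAI's public Chat Completions API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch, not the repository's OpenAiService.
final class OpenAiRequestSketch {

    // Public Chat Completions endpoint.
    static final String ENDPOINT = "https://api.openai.com/v1/chat/completions";

    // Escapes the characters most likely to appear in extracted PDF text.
    // Other control characters are passed through unchanged (a simplification).
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Assumed prompt wording; the real prompt lives in the repository.
    static String buildPrompt(String statementText) {
        return "Extract the Name, Email, Opening Balance and Closing Balance "
             + "from this bank statement:\n" + statementText;
    }

    // Builds the JSON request body with a single user message.
    static String buildBody(String statementText) {
        return "{\"model\":\"gpt-4o\",\"messages\":[{\"role\":\"user\",\"content\":\""
             + jsonEscape(buildPrompt(statementText)) + "\"}]}";
    }

    // Builds (but does not send) the authenticated POST request.
    static HttpRequest buildRequest(String apiKey, String statementText) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(statementText)))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(buildBody("Name: John Doe\nOpening Balance: $5,000"));
    }
}
```

Keeping request construction separate from sending makes the body easy to inspect without spending API credits.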
### Files You Can Exclude
The following files are not crucial to the core functionality:
- `ApiReaderNewApplicationTests.java`
- `Main.java`
## PdfReader.java (Testing Purpose)
The `PdfReader.java` file is a **testing utility** that demonstrates how text extraction is performed from a PDF. It directly uses the **Apache PDFBox API** to extract text from any PDF file.
---
## Tech Stack
- Java 17
- Spring Boot
- Apache PDFBox
- OpenAI GPT-4o (via REST API)
- Maven
- Docker (for deployment)
- Render (hosting platform)
- Postman / curl for testing
---
## Features
- Intelligent extraction using **OpenAI GPT-4o** API
- Accepts **PDF file** as multipart input
- Analyzes content with **LLM prompt** to extract:
  - Name
  - Email
  - Opening Balance
  - Closing Balance
- JSON formatted output
- Separate test/debug endpoint to interact with OpenAI
- **Deployed publicly** using Docker and Render
---
## Getting Started (Local Setup)
### 1. Clone the Repository
```bash
git clone https://github.com/mittalsoni00/FileReader.git
cd FileReader
git checkout master
```
### 2. Add Your OpenAI API Key
Use an environment variable for security:
```bash
export OPENAI_API_KEY=your_secret_key_here
```
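Or reference it in `application.properties`. The placeholder below reads the value from the environment variable; hardcoding a literal key in this file is only for local testing and is not recommended for production: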
```properties
OPENAI_API_KEY=${OPENAI_API_KEY}
```
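One way to catch a missing key early is to check for it explicitly at startup. The sketch below is a hypothetical helper (`ApiKeyLoader` is not part of the repository); the lookup takes a `Map` so it can be exercised without touching the real environment.

```java
import java.util.Map;

// Hypothetical helper, not part of the repository.
final class ApiKeyLoader {

    // Returns the trimmed key, or throws if the variable is unset or blank.
    static String requireApiKey(Map<String, String> env) {
        String key = env.get("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException(
                "OPENAI_API_KEY is not set; export it before starting the app");
        }
        return key.trim();
    }

    public static void main(String[] args) {
        try {
            // System.getenv() returns the process environment as a Map.
            String key = requireApiKey(System.getenv());
            System.out.println("Loaded a key of length " + key.length());
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```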
### 3. Build and Run
```bash
mvn clean install
java -jar target/Api_Reader_New-0.0.1-SNAPSHOT.jar
```
---
## Debug Endpoint (Optional)
To test the prompt-only flow (without a PDF), send:
```
POST /api/chat
Body (JSON):
{
  "prompt": "Tell me a joke about Java developers"
}
```
This will return a direct OpenAI response.
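The same call can be made from Java. This is a minimal sketch using the JDK's `HttpClient`; the base URL `http://localhost:8080` assumes a local run, the class name `ChatDebugClient` is hypothetical, and the JSON escaping only handles backslashes and double quotes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for the debug endpoint, not part of the repository.
final class ChatDebugClient {

    // Builds a POST /api/chat request with a {"prompt": "..."} body.
    static HttpRequest buildRequest(String baseUrl, String prompt) {
        // Minimal escaping: backslashes and double quotes only.
        String escaped = prompt.replace("\\", "\\\\").replace("\"", "\\\"");
        String json = "{\"prompt\": \"" + escaped + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Sends the request and returns the raw response body.
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:8080",
                "Tell me a joke about Java developers");
        // Only prints the target; call send(request) against a running server.
        System.out.println(request.method() + " " + request.uri());
    }
}
```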
---
## API Documentation
### Endpoint for PDF Parsing
```
POST /api/parse-pdf
```
### Request Type
`multipart/form-data`
### Form Key:
| Key | Value |
|-----|-------|
| `file` | [Select your PDF file] |
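For callers not using Postman, the request can also be assembled by hand. The JDK's `HttpClient` has no built-in multipart support, so this sketch builds the `multipart/form-data` body manually. `MultipartSketch`, the boundary string, and the file name `statement.pdf` are assumptions for illustration; the form key `file` comes from the table above.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical upload helper, not part of the repository.
final class MultipartSketch {

    // Wraps the PDF bytes in a single-part multipart/form-data body.
    static byte[] buildBody(String boundary, String fileName, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\""
                + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        // The closing delimiter ends with two extra dashes.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // Builds (but does not send) the POST /api/parse-pdf request.
    static HttpRequest buildRequest(String baseUrl, String fileName, byte[] pdfBytes) {
        String boundary = "----PdfReaderBoundary" + System.nanoTime();
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/parse-pdf"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        buildBody(boundary, fileName, pdfBytes)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real caller would read the PDF with Files.readAllBytes.
        byte[] fakePdf = "%PDF-1.4 placeholder".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildRequest("http://localhost:8080", "statement.pdf", fakePdf);
        System.out.println(request.headers().firstValue("Content-Type").orElse(""));
    }
}
```

The boundary in the `Content-Type` header must match the one used inside the body, which is why both are derived from the same variable.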
### Response (Success)
```json
{
  "response": "Here are the extracted details from the bank statement:\n\n- Name: John Doe\n- Email: johndoe@example.com\n- Opening Balance: $5,000\n- Closing Balance: $6,500"
}
```
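Because the fields arrive as free text inside the `response` string rather than as separate JSON keys, a client that needs them individually has to parse that text. Here is a minimal sketch, assuming the model keeps the `- Label: value` line format shown above. An LLM may word its reply differently, so treat this as best-effort. The class `StatementFields` is hypothetical, and the input is the already-unescaped string value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the example response format, not part of the repository.
final class StatementFields {

    // Matches lines such as "- Name: John Doe".
    private static final Pattern FIELD =
            Pattern.compile("^-[ \\t]*([^:\\n]+):[ \\t]*(.+)$", Pattern.MULTILINE);

    // Returns label/value pairs in the order they appear in the text.
    static Map<String, String> parse(String responseText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(responseText);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Here are the extracted details from the bank statement:\n\n"
                + "- Name: John Doe\n- Email: johndoe@example.com\n"
                + "- Opening Balance: $5,000\n- Closing Balance: $6,500";
        System.out.println(parse(text));
    }
}
```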
---
## Testing via Postman
### Steps
1. Open Postman → **New Request**
2. Choose **POST** → Enter URL:
```
https://pdfreader-fped.onrender.com/api/parse-pdf
```
3. Go to **Body** tab → Select `form-data`
4. Add a key named `file` → Upload PDF file
5. Click **Send**
### Response
You'll receive a JSON response containing the Name, Email, and balances extracted using OpenAI.
Make sure `Content-Type` is `multipart/form-data`. Postman sets this automatically, including the boundary, when `form-data` is chosen, so there is no need to add the header manually.
---
## Deployment Notes (on Render)
- Dockerized Spring Boot app using:
```dockerfile
FROM openjdk:17
WORKDIR /app
COPY target/Api_Reader_New-0.0.1-SNAPSHOT.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
```
- Pushed to GitHub repo: [https://github.com/mittalsoni00/FileReader](https://github.com/mittalsoni00/FileReader)
- Environment variable `OPENAI_API_KEY` added via Render dashboard
- Health Check path: `/api/parse-pdf`