# PDF Scraper - AI-Powered Resume Data Extraction

An enterprise-grade Next.js application for extracting and managing structured resume data from PDF files using OpenAI GPT-4o.

## 🚀 Quick Overview

**PDF Upload Pipeline:**
1. **Upload** → User uploads PDF (max 10MB)
2. **Validate** → Authentication, credits, rate limits, PDF structure
3. **Extract** → pdf2json extracts text (serverless-compatible)
4. **Parse** → GPT-4o extracts structured data (JSON Schema mode)
5. **Store** → Save to PostgreSQL with Prisma
6. **Deduct** → 100 credits deducted from the user's account

**Key Highlights:**
- ⚡ **Serverless-First**: 100% compatible with Vercel, Netlify, AWS Lambda
- 🤖 **AI-Powered**: OpenAI GPT-4o with Structured Outputs (guaranteed JSON format)
- 🔒 **Enterprise Security**: NextAuth v5, rate limiting, credit system
- 💳 **Stripe Integration**: Subscription plans with automated billing
- 📊 **Structured Data**: Strict ENUM validation for consistent data
- 🎨 **Modern UI**: 30+ custom components, dark mode, responsive design

## Features

### Core Features
- 🔐 **Authentication**: Email/password + GitHub/Google OAuth with NextAuth.js
- 📤 **PDF Upload**: Drag-and-drop PDF upload with file validation
- 🤖 **AI-Powered Extraction**: OpenAI GPT-4o with Structured Outputs for guaranteed data format
- 📄 **Text-based PDFs**: Serverless-compatible text extraction with pdf2json
- 📊 **Structured Data**: Extracts profile, experience, education, skills, and more
- 🗄️ **Database**: PostgreSQL with Prisma ORM
- 🎨 **Modern UI**: Built with TailwindCSS
- ✅ **Type Safety**: Full TypeScript with strict ENUM validation

### Additional Features
- 🚨 **Error Handling**: Comprehensive error boundaries and user-friendly error pages
- 🚦 **Rate Limiting**: Database-based rate limiting (10 uploads/hour per user)
- 🔔 **Enhanced Toasts**: Rich notifications with icons and descriptions
- ⏳ **Loading States**: Skeleton loaders and progress indicators
- 📭 **Empty States**: Helpful empty state components with actions
- ♿ **Accessibility**: WCAG AA compliant with keyboard navigation and screen reader support
- ⚡ **Performance**: Code splitting, lazy loading, and performance utilities
- 💰 **Subscription Plans**: Basic ($10/month) and Pro ($20/month) plans
- 🎫 **Credit System**: 100 credits per resume extraction
- 💳 **Stripe Checkout**: Secure hosted checkout flow
- 🔄 **Webhook Handling**: Automated subscription and payment processing
- 📊 **Usage Tracking**: Real-time credit balance display
- ⚠️ **Credit Warnings**: Low credit and no credit alerts
- 🎛️ **Billing Portal**: Manage subscriptions and payment methods
- 🔒 **Payment Security**: PCI-compliant payment processing

## Tech Stack

- **Framework**: Next.js 14+ (App Router)
- **Language**: TypeScript
- **Authentication**: NextAuth.js v5
- **Database**: PostgreSQL (via Supabase)
- **ORM**: Prisma
- **AI**: OpenAI GPT-4o (text & vision)
- **PDF Processing**: pdf2json (serverless-compatible)
- **Payments**: Stripe (subscriptions & webhooks)
- **Styling**: TailwindCSS
- **Form Validation**: Zod + React Hook Form
- **Notifications**: Sonner

## Getting Started

### Prerequisites

- Node.js 18+ installed
- PostgreSQL database (Supabase recommended)
- npm or yarn package manager

### Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd pdf-scraper
```

2. Install dependencies:
```bash
npm install
```
3. Set up environment variables:
```bash
cp .env.example .env.local
```

Edit `.env.local` and add your configuration:
```env
# Database
DATABASE_URL="postgresql://user:password@localhost:5432/pdf_scraper?schema=public"

# NextAuth
NEXTAUTH_SECRET="your-secret-key-here" # Generate with: openssl rand -base64 32
NEXTAUTH_URL="http://localhost:3000"

# GitHub OAuth (see NEXTAUTH_SETUP.md for instructions)
GITHUB_ID="your-github-oauth-client-id"
GITHUB_SECRET="your-github-oauth-client-secret"

# Google OAuth (see NEXTAUTH_SETUP.md for instructions)
GOOGLE_ID="your-google-oauth-client-id"
GOOGLE_SECRET="your-google-oauth-client-secret"

# OpenAI
OPENAI_KEY="your-openai-api-key-here"

# Stripe (Optional - for subscription features)
STRIPE_SECRET_KEY="sk_test_your-stripe-secret-key-here"
STRIPE_WEBHOOK_SECRET="whsec_your-webhook-secret-here"
STRIPE_PRICE_BASIC="price_basic_plan_id"
STRIPE_PRICE_PRO="price_pro_plan_id"
```

For detailed OAuth setup instructions, see [NEXTAUTH_SETUP.md](./NEXTAUTH_SETUP.md)

4. Generate the Prisma client and push the schema:
```bash
npx prisma generate
npx prisma db push
```

5. Run the development server:
```bash
npm run dev
```

6. Open [http://localhost:3000](http://localhost:3000) in your browser.
## Project Structure

```
pdf-scraper/
├── app/
│   ├── (auth)/
│   │   ├── login/              # Login page with OAuth
│   │   ├── register/           # Registration page
│   │   ├── forgot-password/    # Password reset flow
│   │   └── layout.tsx          # Auth layout
│   ├── (dashboard)/
│   │   ├── dashboard/          # Main dashboard with PDF upload
│   │   ├── settings/           # User settings & billing
│   │   ├── billing/            # Subscription management
│   │   └── layout.tsx          # Dashboard layout with sidebar
│   ├── api/
│   │   ├── auth/               # NextAuth API routes
│   │   ├── upload/             # PDF upload & processing (route.ts)
│   │   ├── checkout/           # Stripe checkout session
│   │   ├── billing/            # Stripe customer portal
│   │   └── webhooks/stripe/    # Stripe webhook handler
│   ├── actions/
│   │   ├── resume-actions.ts   # Server actions for resumes
│   │   ├── settings-actions.ts # Server actions for settings
│   │   └── tour-actions.ts     # Product tour actions
│   ├── layout.tsx              # Root layout
│   └── page.tsx                # Landing page
├── components/
│   ├── ui/                     # 30+ custom UI components
│   │   ├── button.tsx          # Button with variants
│   │   ├── input.tsx           # Form input
│   │   ├── card.tsx            # Card component
│   │   ├── dialog.tsx          # Modal dialog
│   │   ├── tabs.tsx            # Tabbed interface
│   │   ├── progress.tsx        # Progress bar
│   │   ├── skeleton.tsx        # Loading skeletons
│   │   └── ...                 # 20+ more components
│   ├── layout/
│   │   ├── sidebar.tsx         # Collapsible sidebar
│   │   └── header.tsx          # Dashboard header
│   ├── dashboard/
│   │   ├── stats-cards.tsx     # Dashboard statistics
│   │   └── credit-alerts.tsx   # Credit warnings
│   ├── auth/
│   │   ├── oauth-buttons.tsx   # GitHub/Google OAuth
│   │   └── feature-highlights.tsx
│   ├── billing/
│   │   ├── billing-stats.tsx   # Credit & plan display
│   │   └── test-card-modal.tsx # Test card info
│   └── product-tour.tsx        # Driver.js tour
├── lib/
│   ├── auth.ts                 # NextAuth v5 configuration
│   ├── prisma.ts               # Prisma client singleton
│   ├── rate-limit.ts           # Database-based rate limiting
│   ├── stripe-service.ts       # Stripe integration
│   ├── openai-service.ts       # OpenAI GPT-4o integration
│   ├── pdf/
│   │   └── pdf-extractor.ts    # pdf2json text extraction
│   ├── openai/
│   │   ├── client.ts           # OpenAI client config
│   │   └── resume-parser.ts    # Structured output parser
│   ├── validations/
│   │   ├── auth.ts             # Auth schemas (Zod)
│   │   └── settings.ts         # Settings schemas (Zod)
│   └── utils.ts                # Utility functions
├── prisma/
│   └── schema.prisma           # Database schema with User, ResumeHistory
├── types/
│   ├── resume.ts               # Resume data types & ENUMs
│   └── next-auth.d.ts          # NextAuth type extensions
├── emails/
│   ├── welcome-email.tsx       # Welcome email template
│   └── password-reset-email.tsx # Password reset email
├── scripts/
│   ├── setup-db.sh             # Database setup script
│   └── grant-free-credits.ts   # Admin credit script
└── middleware.ts               # Protected routes & auth
```

### Key Implementation Files

**PDF Processing:**
- `app/api/upload/route.ts` - Main upload endpoint with validation, rate limiting, credit checks
- `lib/pdf/pdf-extractor.ts` - pdf2json integration with event-driven extraction
- `lib/openai-service.ts` - GPT-4o integration with structured outputs
- `lib/openai/resume-parser.ts` - Resume parsing with JSON Schema validation

**Authentication & Authorization:**
- `lib/auth.ts` - NextAuth v5 config (credentials + OAuth)
- `middleware.ts` - Route protection and session management
- `app/api/auth/[...nextauth]/route.ts` - Auth API routes

**Billing & Credits:**
- `lib/stripe-service.ts` - Credit management and Stripe integration
- `app/api/webhooks/stripe/route.ts` - Webhook event handling
- `app/api/checkout/session/route.ts` - Checkout session creation
- `lib/rate-limit.ts` - Upload rate limiting (10/hour)

**Database:**
- `prisma/schema.prisma` - User, ResumeHistory, Account, Session models
- `lib/prisma.ts` - Prisma client with connection pooling

## Database Schema

### User Model
- Authentication and profile information
- Managed by NextAuth.js

### ResumeHistory Model
- Stores uploaded resume metadata
- Links to User model
- Contains extracted resume data in JSON format

## Authentication

The application uses NextAuth.js v5 with:
- Credentials provider (email/password)
- GitHub OAuth provider
- Google OAuth provider
- JWT session strategy
- Prisma adapter for database sessions
- Protected routes via middleware
- Password reset flow

For detailed setup instructions, see [NEXTAUTH_SETUP.md](./NEXTAUTH_SETUP.md)

## Development

### Available Scripts

- `npm run dev` - Start development server
- `npm run build` - Build for production
- `npm run start` - Start production server
- `npm run lint` - Run ESLint

### Database Commands

- `npx prisma studio` - Open Prisma Studio (database GUI)
- `npx prisma generate` - Generate Prisma Client
- `npx prisma db push` - Push schema changes to database
- `npx prisma migrate dev` - Create and apply migrations
## PDF Upload Implementation Details

### Architecture Overview

The PDF upload system is built with a **serverless-first architecture** using pure JavaScript libraries for maximum compatibility with platforms like Vercel, Netlify, and AWS Lambda.

### Technology Stack

**PDF Processing:**
- **Library**: `pdf2json` (v4.0.0) - Pure JavaScript PDF parser
- **Why pdf2json**: 100% serverless-compatible, no native dependencies (canvas/sharp)
- **Temporary Storage**: `/tmp` directory with UUID-based filenames
- **Cleanup**: Automatic file cleanup with try-finally blocks

**AI Processing:**
- **Model**: OpenAI GPT-4o (gpt-4o-2024-08-06)
- **Structured Outputs**: JSON Schema mode with strict validation
- **Token Limit**: 4096 max tokens per response
- **Temperature**: 0.1 (for consistent extraction)

### Upload Flow (Step-by-Step)

#### 1. **Client-Side Validation**
```typescript
// File type check
if (file.type !== "application/pdf") → Error

// File size check
if (file.size > 10MB) → Error
if (file.size === 0) → Error
```

#### 2. **Authentication & Authorization**
```typescript
// Check user session
const session = await auth()
if (!session?.user?.id) → 401 Unauthorized
```

#### 3. **Credit Check** (Pre-Processing)
```typescript
// Verify user has enough credits
const hasCredits = await hasEnoughCredits(userId, 100)
if (!hasCredits) → 402 Payment Required
```

#### 4. **Rate Limiting**
```typescript
// Database-based rate limiting
// Default: 10 uploads per hour per user
await checkRateLimit(userId)
if (exceeded) → 429 Too Many Requests (with Retry-After header)
```

#### 5. **PDF Validation**
```typescript
// Validate PDF buffer
- Check PDF signature (%PDF header)
- Verify file size (max 10MB)
- Ensure buffer is not empty
if (invalid) → 400 Bad Request
```

#### 6. **PDF Text Extraction** (Serverless)
```typescript
// Using pdf2json library
1. Write buffer to /tmp/{uuid}.pdf
2. Initialize PDFParser with event listeners
3. Extract text with 30-second timeout
4. Clean and normalize text content
5. Delete temporary file (cleanup)

Result: { success, text, pageCount, metadata }
```

**Text Cleaning Process:**
- Remove excessive whitespace
- Strip special Unicode characters
- Normalize line breaks
- Remove excessive line breaks (>2)
- Trim whitespace
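The cleaning steps above can be sketched as a single pure function (the name `cleanExtractedText` is an assumption, not necessarily the project's real helper):

```typescript
// Sketch of the text-cleaning step: normalize line breaks, strip control
// characters, collapse horizontal whitespace, cap consecutive breaks at 2.
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")                                   // normalize CRLF/CR to LF
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "")  // strip control characters
    .replace(/[ \t]+/g, " ")                                   // collapse spaces and tabs
    .replace(/\n{3,}/g, "\n\n")                                // no more than 2 line breaks
    .trim();
}
```

Keeping this step cheap also trims input tokens before the text reaches the OpenAI call.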
#### 7. **AI Resume Parsing**
```typescript
// Send to OpenAI GPT-4o
- System prompt: Expert resume parser instructions
- User prompt: Extracted text
- Response format: JSON Schema (strict mode)
- Validation: ENUM values enforced

Extracts:
- Profile (name, email, summary, location, etc.)
- Work experiences (with employment/location types)
- Education (with degree levels)
- Skills (array of strings)
- Licenses, languages, achievements, publications, honors
```

#### 8. **Data Validation**
```typescript
// Validate extracted data
- Check required fields (profile, workExperiences, educations)
- Verify data types
- Ensure ENUM values are valid
if (invalid) → 500 Internal Server Error
```

#### 9. **Database Storage**
```typescript
// Save to PostgreSQL via Prisma
await prisma.resumeHistory.create({
  data: {
    userId: session.user.id,
    fileName: file.name,
    resumeData: {
      pdfType: "text",
      pages: pageCount,
      processingMethod: "text",
      status: "processed",
      resumeData: extractedData,
      metadata: { pages: pageCount }
    }
  }
})
```

#### 10. **Credit Deduction** (Post-Processing)
```typescript
// Deduct credits after successful processing
await deductCredits(userId, 100)
// 100 credits per resume extraction
```

#### 11. **Response**
```typescript
// Return success response
{
  success: true,
  data: {
    id: resumeHistory.id,
    fileName: file.name,
    pdfType: "text",
    pages: pageCount,
    processingMethod: "text",
    status: "processed",
    resumeData: extractedData,
    creditsUsed: 100
  }
}
```
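Step 6 relies on pdf2json's event API: the parser emits `pdfParser_dataReady` with a `Pages` structure whose text runs are URI-encoded. A sketch of the decode step (the interface below is a minimal slice of pdf2json's output, and the function name is illustrative):

```typescript
// Minimal slice of the structure pdf2json passes to "pdfParser_dataReady".
interface Pdf2JsonOutput {
  Pages: { Texts: { R: { T: string }[] }[] }[];
}

// Text runs (R[].T) are URI-encoded, so extraction is decode-and-join.
// In the real extractor this runs inside the dataReady handler, with a
// dataError handler and a 30s timeout racing it, and /tmp cleanup in finally.
function textFromPdf2Json(data: Pdf2JsonOutput): string {
  return data.Pages
    .map((page) =>
      page.Texts.map((t) => t.R.map((r) => decodeURIComponent(r.T)).join("")).join(" ")
    )
    .join("\n");
}
```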
### File Size Handling

- **Maximum file size**: 10MB (enforced at multiple levels)
- **Serverless timeout**: 60 seconds max execution time
- **PDF extraction timeout**: 30 seconds
- **Payload limit**: Configured via Next.js route config

### Error Handling

**Comprehensive error handling at every stage:**

| Error Type | HTTP Status | User Message |
|------------|-------------|--------------|
| No authentication | 401 | "Unauthorized" |
| Insufficient credits | 402 | "Insufficient credits. Please subscribe..." |
| Rate limit exceeded | 429 | "Rate limit exceeded. Try again in X minutes" |
| Invalid file type | 400 | "Only PDF files are allowed" |
| File too large | 400 | "File size exceeds 10MB limit" |
| Empty file | 400 | "File is empty" |
| Invalid PDF structure | 400 | "Invalid PDF file" |
| No text extracted | 500 | "No meaningful text content found" |
| OpenAI rate limit | 429 | "OpenAI rate limit exceeded" |
| Processing timeout | 504 | "Processing timed out" |
| Invalid API key | 500 | "Server configuration error" |
| Generic error | 500 | "An unexpected error occurred" |

**Error Response Format:**
```json
{
  "success": false,
  "error": "User-friendly error message",
  "insufficientCredits": true,
  "retryAfter": 3600
}
```

`insufficientCredits` and `retryAfter` are optional fields; the latter is included only on rate-limit errors.
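Several of the 400-level rows above come from buffer-level checks that run before any parsing. A sketch (the helper name is an assumption):

```typescript
const MAX_PDF_BYTES = 10 * 1024 * 1024; // 10MB, matching the documented limit

// Returns the user-facing error message, or null when the buffer passes.
function validatePdfBuffer(buf: Uint8Array): string | null {
  if (buf.length === 0) return "File is empty";
  if (buf.length > MAX_PDF_BYTES) return "File size exceeds 10MB limit";
  // Every PDF starts with the signature "%PDF-" (e.g. "%PDF-1.7").
  const header = new TextDecoder().decode(buf.slice(0, 5));
  if (header !== "%PDF-") return "Invalid PDF file";
  return null;
}
```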
### Rate Limiting Details

**Configuration:**
- **Limit**: 10 uploads per hour per user
- **Window**: Rolling 1-hour window
- **Storage**: Database-based (ResumeHistory table)
- **Headers**: Includes `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`

**Implementation:**
```typescript
// Count uploads in the last hour
const uploadCount = await prisma.resumeHistory.count({
  where: {
    userId,
    uploadedAt: { gte: windowStart }
  }
})

if (uploadCount >= 10) {
  // Calculate retry time from the oldest upload in the window
  const retryAfter = Math.ceil(
    (oldestUpload.uploadedAt.getTime() + 60 * 60 * 1000 - Date.now()) / 1000
  )
  throw new RateLimitError(message, retryAfter)
}
```

### Serverless Compatibility

**Why Serverless-Compatible?**
- No native dependencies (canvas, sharp, pdfjs-dist)
- Pure JavaScript implementation
- Works on Vercel, Netlify, and AWS Lambda (any Node.js serverless runtime with a writable `/tmp`)
- No webpack configuration needed
- No build-time compilation required

**Previous Challenges (Solved):**
- ❌ `pdfjs-dist` → Required canvas (native dependency)
- ❌ `pdf-parse` → Limited text extraction
- ❌ `sharp` → Native image processing
- ✅ `pdf2json` → Pure JavaScript, event-driven, reliable

**Deployment Configuration:**
```typescript
// app/api/upload/route.ts
export const runtime = "nodejs"
export const dynamic = "force-dynamic"
export const maxDuration = 60 // 60 seconds
```

### Performance Metrics

**Typical Processing Times:**
- PDF validation: <100ms
- Text extraction: 500ms - 3s (depending on PDF size)
- OpenAI parsing: 2s - 8s (depending on content length)
- Database storage: <200ms
- **Total**: ~3-12 seconds per resume

**Resource Usage:**
- Memory: ~50-150MB per request
- Temporary storage: PDF file size (deleted after processing)
- Database: ~5-50KB per resume record

## OpenAI Integration Details

### Resume Data Extraction

The application uses OpenAI GPT-4o with **Structured Outputs** (JSON Schema mode) to extract comprehensive resume data with guaranteed format compliance.

**Model Configuration:**
- **Model**: `gpt-4o-2024-08-06` (latest GPT-4o with structured outputs)
- **Response Format**: JSON Schema with `strict: true`
- **Temperature**: 0.1 (for consistent, deterministic extraction)
- **Max Tokens**: 4096
- **Timeout**: Configurable (default: 60s)
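Under this configuration, the request passed to the OpenAI SDK's `chat.completions.create` would look roughly like the following; the schema object and prompt text are placeholders standing in for the project's real `RESUME_SCHEMA` and system prompt:

```typescript
// Placeholder for the project's full schema definition.
const RESUME_SCHEMA = { type: "object", properties: {}, required: [], additionalProperties: false } as const;

// Builds the request parameters matching the configuration listed above.
function buildExtractionRequest(resumeText: string) {
  return {
    model: "gpt-4o-2024-08-06",
    temperature: 0.1,
    max_tokens: 4096,
    messages: [
      { role: "system" as const, content: "You are an expert resume parser..." },
      { role: "user" as const, content: resumeText },
    ],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "resume_extraction", strict: true, schema: RESUME_SCHEMA },
    },
  };
}
```

In the real service this object would be passed to `client.chat.completions.create(...)` from the `openai` package, with the timeout configured on the client.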
**Processing Method:**
- Extracts text from PDF using pdf2json
- Sends cleaned text to OpenAI with expert system prompt
- Receives structured JSON matching exact schema
- Validates ENUM values and required fields
- Returns validated ResumeData object

### Extracted Data Structure

The system extracts the following information:

```typescript
{
  profile: {
    name, surname, email, headline,
    professionalSummary, linkedIn, website,
    country, city, relocation, remote
  },
  workExperiences: [{
    jobTitle, employmentType, locationType,
    company, startMonth, startYear,
    endMonth, endYear, current, description
  }],
  educations: [{
    school, degree, major,
    startYear, endYear, current, description
  }],
  skills: ["JavaScript", "React", ...],
  licenses: [{ name, issuer, issueYear, description }],
  languages: [{ language, level }],
  achievements: [{ title, organization, achieveDate, description }],
  publications: [{ title, publisher, publicationDate, publicationUrl, description }],
  honors: [{ title, issuer, issueMonth, issueYear, description }]
}
```

### ENUM Values (Strictly Enforced)

The JSON Schema enforces these exact ENUM values:

| Field | Allowed Values |
|-------|----------------|
| **employmentType** | `FULL_TIME`, `PART_TIME`, `INTERNSHIP`, `CONTRACT` |
| **locationType** | `ONSITE`, `REMOTE`, `HYBRID` |
| **degree** | `HIGH_SCHOOL`, `ASSOCIATE`, `BACHELOR`, `MASTER`, `DOCTORATE` |
| **languageLevel** | `BEGINNER`, `INTERMEDIATE`, `ADVANCED`, `NATIVE` |

**Why Strict ENUMs?**
- Ensures data consistency across all resumes
- Enables reliable filtering and searching
- Prevents typos and variations
- Simplifies frontend rendering logic
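The table maps directly onto TypeScript union types. A sketch of what `types/resume.ts` presumably defines (names assumed), plus a runtime guard mirroring the schema-level enforcement:

```typescript
type EmploymentType = "FULL_TIME" | "PART_TIME" | "INTERNSHIP" | "CONTRACT";
type LocationType = "ONSITE" | "REMOTE" | "HYBRID";
type Degree = "HIGH_SCHOOL" | "ASSOCIATE" | "BACHELOR" | "MASTER" | "DOCTORATE";
type LanguageLevel = "BEGINNER" | "INTERMEDIATE" | "ADVANCED" | "NATIVE";

const EMPLOYMENT_TYPES: readonly EmploymentType[] = ["FULL_TIME", "PART_TIME", "INTERNSHIP", "CONTRACT"];

// Runtime guard for the post-extraction validation step.
function isEmploymentType(value: string): value is EmploymentType {
  return (EMPLOYMENT_TYPES as readonly string[]).includes(value);
}
```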
### System Prompt Strategy

The system uses a carefully crafted prompt to guide GPT-4o:

**Key Instructions:**
1. Extract ALL available information from the resume
2. Use exact ENUM values (no variations)
3. Use `null` for missing single values, `[]` for missing arrays
4. Format dates correctly (numeric months 1-12, 4-digit years)
5. Set `current: true` for ongoing positions/education
6. Extract skills as array of strings
7. Be thorough with licenses, languages, achievements, publications, honors

**Prompt Engineering:**
```typescript
const SYSTEM_PROMPT = `You are an expert resume parser. Extract ALL information
from the resume and return it in the exact JSON format specified.

IMPORTANT INSTRUCTIONS:
1. Extract ALL available information from the resume
2. Use the exact ENUM values provided (e.g., FULL_TIME, REMOTE, BACHELOR, ADVANCED)
3. For missing fields, use null for single values or empty arrays [] for lists
...
Return ONLY valid JSON matching the ResumeData schema.`
```

### Structured Outputs (JSON Schema Mode)

**Why JSON Schema Mode?**
- **Guaranteed Format**: OpenAI ensures response matches schema exactly
- **No Parsing Errors**: Valid JSON guaranteed (no markdown, no explanations)
- **Type Safety**: All fields match TypeScript types
- **ENUM Enforcement**: Only allowed values are returned
- **Required Fields**: All required fields are always present

**Schema Configuration:**
```typescript
response_format: {
  type: "json_schema",
  json_schema: {
    name: "resume_extraction",
    strict: true,  // Enforces exact schema compliance
    schema: RESUME_SCHEMA
  }
}
```
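Note that `strict: true` imposes requirements on the schema itself: every object must set `additionalProperties: false` and list all of its properties under `required` (optional fields are modeled as nullable types instead). A minimal slice showing the pattern, with illustrative field names rather than the project's full schema:

```typescript
// One object from a strict-mode schema. All keys appear in required, and
// additionalProperties is false; both are mandatory under strict: true.
// Optionality is expressed as a nullable type, not by omitting the key.
const WORK_EXPERIENCE_SCHEMA = {
  type: "object",
  properties: {
    jobTitle: { type: ["string", "null"] },
    employmentType: { type: "string", enum: ["FULL_TIME", "PART_TIME", "INTERNSHIP", "CONTRACT"] },
    current: { type: "boolean" },
  },
  required: ["jobTitle", "employmentType", "current"],
  additionalProperties: false,
} as const;
```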
### Error Handling

**OpenAI-Specific Errors:**
- ✅ Rate limiting (429) → "OpenAI rate limit exceeded. Try again in a moment."
- ✅ Timeout errors → "Request timed out. Please try again."
- ✅ Invalid API key → "OpenAI API key is invalid."
- ✅ No response → "No response from OpenAI"
- ✅ Invalid JSON → Caught by structured outputs (shouldn't happen)

**Data Validation Errors:**
- ✅ Missing required fields (profile, workExperiences, educations)
- ✅ Invalid data types
- ✅ Invalid ENUM values
- ✅ Malformed resume data

All errors return user-friendly messages via toast notifications.

### Cost Optimization

**Pricing (as of 2024):**
- GPT-4o: ~$0.005 per 1K input tokens, ~$0.015 per 1K output tokens
- Average resume: ~2K input tokens, ~1K output tokens
- **Cost per resume**: ~$0.025 (2.5 cents)

**Optimization Strategies:**
1. Text extraction only (no expensive Vision API)
2. Low temperature (0.1) for consistent output
3. Token limit (4096) to prevent excessive costs
4. Efficient text cleaning to reduce input tokens
5. Structured outputs to eliminate retry costs
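The per-resume figure follows directly from the listed prices; a small helper makes the arithmetic explicit (rates hard-coded from the 2024 pricing above):

```typescript
const INPUT_COST_PER_1K = 0.005;  // USD per 1K input tokens (GPT-4o, 2024)
const OUTPUT_COST_PER_1K = 0.015; // USD per 1K output tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_COST_PER_1K + (outputTokens / 1000) * OUTPUT_COST_PER_1K;
}

// A typical resume (~2K input, ~1K output): 2 x $0.005 + 1 x $0.015, about $0.025.
```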
## Dashboard Features

### Quick Stats
- **Total Resumes**: Count of all processed resumes
- **Most Recent**: Date of latest upload
- **Upload Area**: Quick access to PDF upload

### Resume History
- **Search**: Filter resumes by filename
- **Sort**: Order by newest or oldest first
- **Pagination**: Navigate through large lists (10 per page)
- **View Details**: Click to see full extracted data
- **Delete**: Remove resumes with confirmation

### Resume Detail Modal
- **Tabbed Interface**: Profile, Experience, Education, Other
- **Profile Section**: Personal info, summary, skills
- **Experience Section**: Timeline view of work history
- **Education Section**: Academic background
- **Other Section**: Licenses, languages, achievements, publications, honors
- **Export Options**: Download JSON or copy to clipboard

## Settings Page Features

### Profile Management
- **Update Display Name**: Change your name with validation
- **Email Display**: View email (read-only)
- **Form Validation**: Real-time validation with error messages

### Password Management
- **Change Password**: Update password with current password verification
- **Password Strength**: Enforced requirements (8+ chars, uppercase, lowercase, number)
- **Show/Hide Toggle**: Toggle password visibility
- **Confirmation Matching**: Ensures new password matches confirmation

### Account Management
- **Sign Out**: Sign out from current device
- **Delete Account**: Permanently delete account with all data
- **Cascade Deletion**: Automatically removes all resume history
- **Password Confirmation**: Requires password to delete
- **Type Confirmation**: Must type "DELETE" to confirm
- **Warning Messages**: Clear warnings about data loss

### Usage Statistics
- **Total Resumes**: Count of processed resumes
- **Account Created**: Account creation date
- **Days Active**: Number of days since account creation
- **Visual Stats**: Color-coded stat cards

## Stripe Integration Setup

### Overview

The application includes a complete subscription and credit system using Stripe. Users can subscribe to plans that provide credits for resume processing.

### Subscription Plans

- **FREE**: 0 credits (default for new users)
- **BASIC**: $10/month - 10,000 credits (~100 resume extractions)
- **PRO**: $20/month - 20,000 credits (~200 resume extractions)

Each resume extraction costs **100 credits**.

### Stripe Setup Instructions

#### 1. Create a Stripe Account

1. Go to [https://stripe.com](https://stripe.com) and sign up
2. Complete account verification
3. Switch to **Test Mode** (toggle in top right)

#### 2. Get API Keys

1. Navigate to **Developers** → **API Keys**
2. Copy your **Publishable key** (starts with `pk_test_`)
3. Copy your **Secret key** (starts with `sk_test_`)
4. Add them to your `.env` file:

```env
STRIPE_SECRET_KEY="sk_test_your_key_here"
STRIPE_PUBLIC_KEY="pk_test_your_key_here"
```
#### 3. Create Subscription Products

1. Go to **Products** → **Add Product**
2. Create two products:

**Basic Plan:**
- Name: "Basic Plan"
- Description: "10,000 credits per month"
- Pricing: $10.00 USD / month (recurring)
- Copy the **Price ID** (starts with `price_`)

**Pro Plan:**
- Name: "Pro Plan"
- Description: "20,000 credits per month"
- Pricing: $20.00 USD / month (recurring)
- Copy the **Price ID** (starts with `price_`)

3. Add the Price IDs to your `.env`:

```env
STRIPE_PRICE_BASIC="price_1234567890"
STRIPE_PRICE_PRO="price_0987654321"
```

#### 4. Set Up Webhooks

Webhooks are required for automated subscription management.

**For Local Development (using Stripe CLI):**

1. Install Stripe CLI:
```bash
# macOS
brew install stripe/stripe-cli/stripe

# Windows (with Scoop)
scoop install stripe

# Linux
# Download from https://github.com/stripe/stripe-cli/releases
```

2. Login to Stripe CLI:
```bash
stripe login
```

3. Forward webhooks to your local server:
```bash
stripe listen --forward-to localhost:3000/api/webhooks/stripe
```

4. Copy the webhook signing secret (starts with `whsec_`) and add to `.env`:
```env
STRIPE_WEBHOOK_SECRET="whsec_your_secret_here"
```

**For Production:**

1. Go to **Developers** → **Webhooks** → **Add endpoint**
2. Endpoint URL: `https://yourdomain.com/api/webhooks/stripe`
3. Select events to listen to:
   - `invoice.paid`
   - `invoice.payment_failed`
   - `customer.subscription.updated`
   - `customer.subscription.deleted`
   - `checkout.session.completed`
4. Copy the **Signing secret** and add to production environment variables

#### 5. Test the Integration

**Test Cards:**
- Success: `4242 4242 4242 4242`
- Decline: `4000 0000 0000 0002`
- Requires authentication: `4000 0025 0000 3155`

Use any future expiry date, any 3-digit CVC, and any ZIP code.
**Testing Flow:**

1. Start your development server:
```bash
npm run dev
```

2. In another terminal, start Stripe webhook forwarding:
```bash
stripe listen --forward-to localhost:3000/api/webhooks/stripe
```

3. Register/login to your app
4. Go to Settings page
5. Click "Subscribe to Basic Plan" or "Subscribe to Pro Plan"
6. Complete checkout with test card `4242 4242 4242 4242`
7. Verify:
   - Credits are added to your account
   - Plan type is updated
   - You can process resumes

**Test Webhook Events:**

```bash
# Test successful payment
stripe trigger invoice.paid

# Test subscription cancellation
stripe trigger customer.subscription.deleted
```

### Credit System Integration

The credit system is automatically integrated with resume processing:

1. **Before Processing**: Checks if user has ≥100 credits
2. **If Insufficient**: Returns 402 error with message to subscribe
3. **After Success**: Deducts 100 credits from user's balance
4. **Dashboard Display**: Shows credit balance with color-coded warnings

**Credit Warnings:**
- **Green** (≥500 credits): Normal operation
- **Orange** (<500 credits): Low credit warning
- **Red** (0 credits): No credits - processing blocked
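The color thresholds above reduce to a small pure function (a sketch; the helper name is an assumption, not necessarily what the dashboard components use):

```typescript
type CreditLevel = "ok" | "low" | "empty";

// Mirrors the documented thresholds: green at >= 500, orange below 500, red at 0.
function creditLevel(credits: number): CreditLevel {
  if (credits <= 0) return "empty";
  return credits < 500 ? "low" : "ok";
}
```

The pre-processing gate is the same kind of comparison against the 100-credit extraction cost (`credits >= 100`).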
### Database Schema Changes

The User model now includes:

```prisma
model User {
  // ... existing fields
  credits               Int       @default(0)
  planType              PlanType  @default(FREE)
  stripeCustomerId      String?   @unique
  stripeSubscriptionId  String?   @unique
}

enum PlanType {
  FREE
  BASIC
  PRO
}
```

Run migration after pulling:
```bash
npx prisma generate
npx prisma db push
```

### API Routes

**Checkout Session:**
- `POST /api/checkout/session` - Create Stripe checkout session

**Billing Portal:**
- `POST /api/billing/portal` - Access Stripe customer portal

**Webhooks:**
- `POST /api/webhooks/stripe` - Handle Stripe webhook events

### Features

#### Settings Page
- View current plan and credit balance
- Subscribe to Basic or Pro plan
- Upgrade/downgrade plans
- Manage billing via Stripe Customer Portal
- Cancel subscription

#### Dashboard
- Credit balance display with plan type
- Color-coded credit warnings
- Low credit alerts (<500 credits)
- No credit alerts (0 credits)
- Links to settings for subscription

#### Resume Processing
- Pre-processing credit check
- Automatic credit deduction after success
- Insufficient credit error handling
- Credit usage tracking

### Webhook Events Handled

- `invoice.paid` - Add credits when subscription payment succeeds
- `invoice.payment_failed` - Log payment failures
- `customer.subscription.updated` - Update plan when subscription changes
- `customer.subscription.deleted` - Downgrade to FREE when cancelled
- `checkout.session.completed` - Log successful checkouts
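The event handling above amounts to a dispatch over `event.type`. A sketch (the returned labels are illustrative; in the real route the raw body must first be signature-verified, e.g. with `stripe.webhooks.constructEvent(rawBody, signature, process.env.STRIPE_WEBHOOK_SECRET!)` from the official `stripe` package):

```typescript
// Minimal shape of a verified Stripe event for dispatch purposes.
interface StripeEventLike {
  type: string;
  data: { object: unknown };
}

// Dispatch table mirroring the handled events listed above.
// Handler bodies are stubs; the real route updates credits/plan via Prisma.
function handleStripeEvent(event: StripeEventLike): string {
  switch (event.type) {
    case "invoice.paid":                    return "add-credits";
    case "invoice.payment_failed":          return "log-failure";
    case "customer.subscription.updated":   return "update-plan";
    case "customer.subscription.deleted":   return "downgrade-to-free";
    case "checkout.session.completed":      return "log-checkout";
    default:                                return "ignore"; // ack unknown events with 200
  }
}
```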
### Troubleshooting

**Webhook not receiving events:**
- Ensure Stripe CLI is running: `stripe listen --forward-to localhost:3000/api/webhooks/stripe`
- Check webhook signing secret matches `.env`
- Verify endpoint URL is correct

**Credits not added after payment:**
- Check webhook logs in Stripe Dashboard
- Verify Price IDs match in `.env`
- Check server logs for errors

**Checkout session fails:**
- Verify API keys are correct
- Ensure Price IDs exist in Stripe
- Check NEXTAUTH_URL is set correctly

**Production deployment:**
- Add webhook endpoint in Stripe Dashboard
- Use production API keys (starting with `pk_live_` and `sk_live_`)
- Set all environment variables in production
- Run a full test-mode checkout before switching to live keys

## License

MIT