# resume-extract

Fast, local resume extraction using a fine-tuned DistilBERT NER model. Extracts structured data from resume text, PDF, or DOCX via local document parsing + ONNX inference.

## Installation

**Binary (recommended):**

```bash
curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash
resume-extract --help
```

The installer downloads the latest GitHub Release asset into `~/.local/bin`. Override `INSTALL_DIR`, `REPO`, or `VERSION` if needed:

```bash
INSTALL_DIR=/usr/local/bin VERSION=v0.1.0 curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash
```

**As library:**

```bash
bun install
```

**Build from source:**

```bash
bun run build:bin
./dist/resume-extract --input ./resume.pdf --ats
```

Notes:

- `parseResume()` is the text-only fast path.
- `parseResumePdf()` and `parseResumeDocx()` use `@kreuzberg/node` for local document text extraction.
- `parseResumePdf(..., { ocr: true })` enables OCR for scanned PDFs (defaults to Tesseract). Supports `tesseract`, `easyocr`, and `paddleocr` backends via `{ ocr: { backend: "easyocr" } }`. OCR is much slower than text parsing.
- On first run, the CLI automatically downloads the required `oksomu/resume-ner` model files into a local cache if they are missing and shows download progress. Pass `--model` to use a custom directory or `--no-download` to require a pre-populated model directory.
- Library consumers should manage model directories explicitly.
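
The `ocr` option described above takes either `true` or an object naming a backend. As a rough illustration of how such a union could be resolved to a single backend name, here is a minimal sketch. The `OcrOption` type and `resolveOcrBackend` helper are hypothetical names introduced for this example and are not part of the library's API.

```typescript
// Hypothetical types for illustration; the library's real option types may differ.
type OcrBackend = "tesseract" | "easyocr" | "paddleocr";
type OcrOption = boolean | { backend: OcrBackend };

// Resolve the option to a backend name, or null when OCR is disabled.
function resolveOcrBackend(ocr?: OcrOption): OcrBackend | null {
  if (ocr === undefined || ocr === false) return null;
  if (ocr === true) return "tesseract"; // documented default backend
  return ocr.backend;
}

console.log(resolveOcrBackend(true));
console.log(resolveOcrBackend({ backend: "easyocr" }));
```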

## Features

- **Structured extraction**: name, email, phone, location, companies, titles, education, skills
- **Document input support**: parse raw text, PDF, or DOCX
- **ATS scoring**: completeness score with actionable issues list
- **Seniority inference**: from job titles + years of experience
- **Country detection**: from location + phone prefix
- **Experience years**: computed from employment dates
- **Section-aware chunking**: splits resumes longer than 512 tokens at paragraph boundaries
- **Section detection**: rule-based gap-filling for skills, certifications, and languages the model misses
- **100% local**: runs offline via ONNX, no API calls
- **Fast text parsing**: ~15ms per resume after model load
- **Optional document parsing**: PDF and DOCX via Kreuzberg, with OCR for PDFs when enabled
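
To make "experience years computed from employment dates" concrete, the sketch below shows one plausible approach: convert each role to a month range, merge overlaps so concurrent roles are not double-counted, and sum. It assumes dates in `YYYY-MM` form and treats a missing end date as "present". The package's actual date handling is not documented here, so `Role`, `toMonths`, and `experienceYears` are illustrative names only.

```typescript
// Hypothetical shape: dates as "YYYY-MM"; a missing end_date means "present".
interface Role {
  start_date: string;
  end_date?: string;
}

// Convert "YYYY-MM" to a running month index.
function toMonths(ym: string): number {
  const parts = ym.split("-");
  return Number(parts[0]) * 12 + (Number(parts[1]) - 1);
}

// Sum employment months, counting overlapping roles only once.
function experienceYears(roles: Role[], now: string): number {
  const spans = roles
    .map((r): [number, number] => [toMonths(r.start_date), toMonths(r.end_date ?? now)])
    .filter(([start, end]) => end > start)
    .sort((a, b) => a[0] - b[0]);

  let total = 0;
  let coveredUntil = -Infinity;
  for (const [start, end] of spans) {
    const from = Math.max(start, coveredUntil); // skip months already counted
    if (end > from) {
      total += end - from;
      coveredUntil = end;
    }
  }
  return Math.floor(total / 12);
}

const years = experienceYears(
  [
    { start_date: "2014-01", end_date: "2018-01" },
    { start_date: "2017-01" }, // overlaps the first role by a year
  ],
  "2024-01",
);
console.log(years);
```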

## Model

Uses [`oksomu/resume-ner`](https://huggingface.co/oksomu/resume-ner) — a DistilBERT model fine-tuned for resume NER and exported to ONNX for local structured extraction.

Latest model metrics (from [model card](https://huggingface.co/oksomu/resume-ner), noise-augmented, 25 epochs, entity-level exact-match via seqeval):

- entity F1: 97.77%
- structured micro F1: 97.88%
- clean resume F1: 99.18%
- noisy resume F1: 69.24% (OCR/scraped text)
- quantized ONNX size: 63MB

Entity types:

- NAME, EMAIL, PHONE, LOCATION, COMPANY, TITLE, DATE, DEGREE, INSTITUTION, FIELD, SKILL, CERT, LANGUAGE

Model directory should include:

- `resume_config.json` — pre-processing, post-processing, and inference rules
- `companies.json` — company gazetteer for post-processing
- `city_country_map.json` — 317 cities for country inference
- tokenizer/config files
- `onnx/model_quantized.onnx` or `onnx/model.onnx`
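
Since library consumers manage the model directory themselves, a quick completeness check can be handy. The following is a minimal sketch that takes a list of relative paths found in a directory and reports which of the files named above are missing. It deliberately skips the tokenizer/config files, because their exact names are not listed here. `missingModelFiles` is a hypothetical helper, not something the package exports.

```typescript
// Files named in the list above. Tokenizer/config file names are not
// specified there, so this sketch does not check them.
const REQUIRED_FILES = ["resume_config.json", "companies.json", "city_country_map.json"];
const ONNX_CANDIDATES = ["onnx/model_quantized.onnx", "onnx/model.onnx"];

// Given the relative paths present in a model directory, return what is missing.
function missingModelFiles(present: string[]): string[] {
  const have = new Set(present);
  const missing = REQUIRED_FILES.filter((file) => !have.has(file));
  // Either ONNX file satisfies the requirement.
  if (!ONNX_CANDIDATES.some((file) => have.has(file))) {
    missing.push(ONNX_CANDIDATES.join(" or "));
  }
  return missing;
}

console.log(missingModelFiles(["resume_config.json", "onnx/model.onnx"]));
```

Feeding it the result of a recursive directory listing would turn this into a pre-flight check before calling `parseResume()`.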

## Usage

```typescript
import {
  computeATSScore,
  parseResume,
  parseResumeDocx,
  parseResumePdf,
} from "resume-extract";

// `resumeText` (plain text) and `pdfBytes` (raw PDF bytes) are loaded by the caller.
const result = await parseResume(resumeText, "/path/to/model");
const fromPdf = await parseResumePdf("/path/to/resume.pdf", "/path/to/model");
const fromScannedPdf = await parseResumePdf(pdfBytes, "/path/to/model", { ocr: true });
const fromDocx = await parseResumeDocx("/path/to/resume.docx", "/path/to/model");

// result.personal: { name, email, phone, location }
// result.experience: [{ title, company, start_date, end_date }]
// result.education: [{ degree, field, institution }]
// result.skills: ["Python", "AWS", ...]
// result.seniority: "Senior"
// result.country: "India"
// result.experience_years: 10

const ats = computeATSScore(result);
// ats.score: 87
// ats.issues: [{ severity: "medium", message: "..." }]
```
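
Once you have the ATS result, you will often want to organize the issues before showing them. The sketch below groups issues by severity. It only assumes the `{ severity, message }` shape shown in the comments above; the `AtsIssue` interface, the sample messages, and any severity values other than `"medium"` are made up for illustration.

```typescript
// Shape taken from the usage comments above; declared locally so the sketch is standalone.
interface AtsIssue {
  severity: string;
  message: string;
}

// Group issue messages under their severity label.
function groupBySeverity(issues: AtsIssue[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const issue of issues) {
    (groups[issue.severity] ??= []).push(issue.message);
  }
  return groups;
}

// Sample data is invented for this example.
const grouped = groupBySeverity([
  { severity: "medium", message: "No certifications found" },
  { severity: "high", message: "Missing email address" },
  { severity: "medium", message: "Skills section is short" },
]);
console.log(grouped);
```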

## CLI

Run directly with Bun:

```bash
bun run cli ./resume.pdf --ats
bun run cli --text "Jane Doe..."
bun run cli ./resume.pdf --view json --output result.json
cat ./resume.txt | bun run cli

# Batch mode
bun run cli batch ./resumes/*.pdf --ats
bun run cli batch --input-dir ./resumes --glob '**/*' --output batch.jsonl
bun run cli batch --input-dir ./resumes --output batch.csv --output-format csv
bun run cli batch --input-dir ./resumes --fail-fast

# Explicit model setup and diagnostics
bun run cli setup-model
bun run cli doctor --ocr
bun run cli doctor --fix
bun run cli doctor --json
```

Common flags:

- `--model`: model directory
- `--model-repo`: alternate Hugging Face repo for first-run download
- `--model-revision`: alternate model revision for first-run download
- `--no-download`: disable automatic model download
- `--input`: input file path
- `--text`: inline text input
- `--format`: override format detection
- `--ocr`: enable PDF OCR (defaults to Tesseract)
- `--ocr-backend`: OCR backend: `tesseract`, `easyocr`, or `paddleocr`
- `--ats`: include ATS scoring in output
- `--view`: render machine-readable JSON (`json`) or human-friendly terminal output (`pretty`)
- `--output`: write structured output to a file
- `--compact`: emit minified JSON

Batch-only flags:

- `batch [inputs...]`: process many resumes at once
- `--input-dir`: scan a directory for resumes
- `--glob`: file selection pattern for directory scanning
- `--concurrency`: number of parallel batch workers, defaults to `4`
- `--fail-fast`: stop batch processing on the first extraction error
- `--output-format`: structured batch output format, for example `jsonl` or `csv`
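
For readers curious how `--concurrency` and `--fail-fast` might interact, here is a minimal sketch of a bounded worker pool. It is not the CLI's actual implementation: `runBatch` and `Outcome` are hypothetical names, and the fail-fast behavior shown (stop handing out new items, let in-flight ones finish) is an assumption.

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Process inputs with at most `concurrency` workers in flight.
// With failFast, no new items are started after the first error;
// items never started are left as undefined in the result array.
async function runBatch<I, O>(
  inputs: I[],
  worker: (input: I) => Promise<O>,
  concurrency = 4,
  failFast = false,
): Promise<(Outcome<O> | undefined)[]> {
  const results = new Array<Outcome<O> | undefined>(inputs.length).fill(undefined);
  let next = 0;
  let aborted = false;

  // Each lane repeatedly claims the next unprocessed index.
  async function lane(): Promise<void> {
    while (!aborted && next < inputs.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(inputs[i] as I) };
      } catch (error) {
        results[i] = { ok: false, error };
        if (failFast) aborted = true;
      }
    }
  }

  const laneCount = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage with a stand-in worker; a real one would call parseResumePdf().
runBatch(["a.pdf", "b.pdf", "c.pdf"], async (path) => path.length, 2).then((results) => {
  console.log(results);
});
```

Keeping results indexed by input position means output order matches input order even though completion order does not.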

Extra commands:

- `setup-model`: download the configured model into the local cache or custom `--model` path
- `update-model`: pull the latest model from Hugging Face, re-downloading all files
- `doctor`: inspect model readiness, file integrity, writable cache paths, runtime platform, and optional OCR availability
- `doctor --fix`: download/repair the configured model, then report status
- `doctor --json`: emit machine-readable diagnostics

The CLI checks for model updates once per day. If a newer model is available on Hugging Face, a warning is shown on stderr. Run `update-model` to pull the latest.
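
A once-per-day check like the one described above usually comes down to comparing a stored timestamp with the current time. The sketch below shows that comparison only. How the CLI actually stores the last-check time is not documented here, so `shouldCheckForUpdate` and its parameters are hypothetical.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Return true when no check has been recorded yet or the last one is at least a day old.
function shouldCheckForUpdate(lastCheckedMs: number | null, nowMs: number): boolean {
  return lastCheckedMs === null || nowMs - lastCheckedMs >= DAY_MS;
}

const now = Date.now();
console.log(shouldCheckForUpdate(null, now));
console.log(shouldCheckForUpdate(now - 60 * 60 * 1000, now));
```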

Output behavior:

- Single-resume commands default to `pretty` view on a TTY and `json` otherwise.
- Batch commands default to `pretty` summaries on a TTY and structured JSON otherwise.
- Use `--view json` when piping to other tools.
- Use `--output` with `batch` plus `--output-format jsonl` for machine-friendly bulk processing.
- Use `--output-format csv` when you want spreadsheet-friendly flat output with summary fields plus numbered experience and education columns.
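
If you write batch results as JSONL, downstream code needs to read one JSON document per line. Here is a minimal consumer-side sketch. The record fields are not specified above, so records are typed as `unknown` and the sample lines are invented; `parseJsonl` is a hypothetical helper.

```typescript
// Parse JSONL text into an array of records, skipping blank lines.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

// Sample content is made up; real records come from the batch output file.
const records = parseJsonl('{"file":"a.pdf"}\n{"file":"b.pdf"}\n');
console.log(records.length);
```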

## Limitations

- English resumes only
- Max 512 tokens per chunk (section-aware chunking splits at paragraph boundaries for longer resumes)
- Image-based/scanned PDFs require OCR before text extraction
- Two-column PDF layouts may flatten during text extraction
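
To illustrate the chunking limitation, the sketch below packs whole paragraphs into chunks that stay within a token budget. It is a simplified stand-in, not the package's code: tokens are approximated by whitespace-separated words, whereas a real implementation would count with the model's tokenizer, and `chunkByParagraph` is a hypothetical name.

```typescript
// Rough token estimate by whitespace; a real implementation would use the model's tokenizer.
const countTokens = (text: string): number => text.split(/\s+/).filter(Boolean).length;

// Greedily pack paragraphs into chunks of at most maxTokens.
// A single paragraph larger than the budget becomes its own oversized chunk.
function chunkByParagraph(text: string, maxTokens = 512): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current: string[] = [];
  let used = 0;
  for (const paragraph of paragraphs) {
    const size = countTokens(paragraph);
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current.length > 0 && used + size > maxTokens) {
      chunks.push(current.join("\n\n"));
      current = [];
      used = 0;
    }
    current.push(paragraph);
    used += size;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}

console.log(chunkByParagraph("a b c\n\nd e\n\nf g h", 5));
```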

## Development

```bash
bun run test # Run tests
bun run check # Biome lint + format check
bun run typecheck # TypeScript type check
bun run format # Auto-format
```

## License

MIT