{"id":50697296,"url":"https://github.com/somus/resume-extract","last_synced_at":"2026-06-09T07:30:52.756Z","repository":{"id":355458480,"uuid":"1228170740","full_name":"somus/resume-extract","owner":"somus","description":"Fast local resume extraction using ONNX NER model. Structured output + ATS scoring in ~15ms.","archived":false,"fork":false,"pushed_at":"2026-05-05T09:55:05.000Z","size":83,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-06-05T03:08:44.344Z","etag":null,"topics":["ats","bun","machine-learning","ner","nlp","onnx","resume","resume-parser","transformers-js","typescript"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/somus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-03T17:29:00.000Z","updated_at":"2026-05-29T09:49:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/somus/resume-extract","commit_stats":null,"previous_names":["somus/resume-extract"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/somus/resume-extract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somus%2Fresume-extract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somus%2Fresume-extract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somus%2Fre
sume-extract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somus%2Fresume-extract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/somus","download_url":"https://codeload.github.com/somus/resume-extract/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somus%2Fresume-extract/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34096950,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ats","bun","machine-learning","ner","nlp","onnx","resume","resume-parser","transformers-js","typescript"],"created_at":"2026-06-09T07:30:52.208Z","updated_at":"2026-06-09T07:30:52.750Z","avatar_url":"https://github.com/somus.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# resume-extract\n\nFast, local resume extraction using a fine-tuned DistilBERT NER model. 
Extracts structured data from resume text, PDF, or DOCX via local document parsing + ONNX inference.\n\n## Installation\n\n**Binary (recommended):**\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash\nresume-extract --help\n```\n\nThe installer downloads the latest GitHub Release asset into `~/.local/bin`. Override `INSTALL_DIR`, `REPO`, or `VERSION` if needed:\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | INSTALL_DIR=/usr/local/bin VERSION=v0.1.0 bash\n```\n\n**As a library:**\n\n```bash\nbun install\n```\n\n**Build from source:**\n\n```bash\nbun run build:bin\n./dist/resume-extract --input ./resume.pdf --ats\n```\n\nNotes:\n\n- `parseResume()` is the text-only fast path.\n- `parseResumePdf()` and `parseResumeDocx()` use `@kreuzberg/node` for local document text extraction.\n- `parseResumePdf(..., { ocr: true })` enables OCR for scanned PDFs (defaults to Tesseract). Supports `tesseract`, `easyocr`, and `paddleocr` backends via `{ ocr: { backend: \"easyocr\" } }`. OCR is much slower than text parsing.\n- On first run, the CLI automatically downloads the required `oksomu/resume-ner` model files into a local cache if they are missing and shows download progress. 
Pass `--model` to use a custom directory or `--no-download` to require a pre-populated model directory.\n- Library consumers should manage model directories explicitly.\n\n## Features\n\n- **Structured extraction**: name, email, phone, location, companies, titles, education, skills\n- **Document input support**: parse raw text, PDF, or DOCX\n- **ATS scoring**: completeness score with actionable issues list\n- **Seniority inference**: from job titles + years of experience\n- **Country detection**: from location + phone prefix\n- **Experience years**: computed from employment dates\n- **Section-aware chunking**: splits long resumes at paragraph boundaries for \u003e512 token texts\n- **Section detection**: rule-based gap-filling for skills, certifications, and languages the model misses\n- **100% local**: runs offline via ONNX, no API calls\n- **Fast text parsing**: ~15ms per resume after model load\n- **Optional document parsing**: PDF via Kreuzberg, including OCR when enabled; DOCX via Kreuzberg\n\n## Model\n\nUses [`oksomu/resume-ner`](https://huggingface.co/oksomu/resume-ner) — a DistilBERT model fine-tuned for resume NER and exported to ONNX for local structured extraction.\n\nLatest model metrics (from [model card](https://huggingface.co/oksomu/resume-ner), noise-augmented, 25 epochs, entity-level exact-match via seqeval):\n\n- entity F1: 97.77%\n- structured micro F1: 97.88%\n- clean resume F1: 99.18%\n- noisy resume F1: 69.24% (OCR/scraped text)\n- quantized ONNX size: 63MB\n\nEntity types:\n\n- NAME, EMAIL, PHONE, LOCATION, COMPANY, TITLE, DATE, DEGREE, INSTITUTION, FIELD, SKILL, CERT, LANGUAGE\n\nModel directory should include:\n\n- `resume_config.json` — pre-processing, post-processing, and inference rules\n- `companies.json` — company gazetteer for post-processing\n- `city_country_map.json` — 317 cities for country inference\n- tokenizer/config files\n- `onnx/model_quantized.onnx` or `onnx/model.onnx`\n\n## Usage\n\n```typescript\nimport {\n  
computeATSScore,\n  parseResume,\n  parseResumeDocx,\n  parseResumePdf,\n} from \"resume-extract\";\n\nconst result = await parseResume(resumeText, \"/path/to/model\");\nconst fromPdf = await parseResumePdf(\"/path/to/resume.pdf\", \"/path/to/model\");\nconst fromScannedPdf = await parseResumePdf(pdfBytes, \"/path/to/model\", { ocr: true });\nconst fromDocx = await parseResumeDocx(\"/path/to/resume.docx\", \"/path/to/model\");\n\n// result.personal: { name, email, phone, location }\n// result.experience: [{ title, company, start_date, end_date }]\n// result.education: [{ degree, field, institution }]\n// result.skills: [\"Python\", \"AWS\", ...]\n// result.seniority: \"Senior\"\n// result.country: \"India\"\n// result.experience_years: 10\n\nconst ats = computeATSScore(result);\n// ats.score: 87\n// ats.issues: [{ severity: \"medium\", message: \"...\" }]\n```\n\n## CLI\n\nRun directly with Bun:\n\n```bash\nbun run cli ./resume.pdf --ats\nbun run cli --text \"Jane Doe...\"\nbun run cli ./resume.pdf --view json --output result.json\ncat ./resume.txt | bun run cli\n\n# Batch mode\nbun run cli batch ./resumes/*.pdf --ats\nbun run cli batch --input-dir ./resumes --glob '**/*' --output batch.jsonl\nbun run cli batch --input-dir ./resumes --output batch.csv --output-format csv\nbun run cli batch --input-dir ./resumes --fail-fast\n\n# Explicit model setup and diagnostics\nbun run cli setup-model\nbun run cli doctor --ocr\nbun run cli doctor --fix\nbun run cli doctor --json\n```\n\nCommon flags:\n\n- `--model \u003cpath\u003e`: model directory\n- `--model-repo \u003crepo\u003e`: alternate Hugging Face repo for first-run download\n- `--model-revision \u003crev\u003e`: alternate model revision for first-run download\n- `--no-download`: disable automatic model download\n- `--input \u003cpath\u003e`: input file path\n- `--text \u003ctext\u003e`: inline text input\n- `--format \u003cauto|text|pdf|docx\u003e`: override format detection\n- `--ocr`: enable PDF OCR (defaults to 
Tesseract)\n- `--ocr-backend \u003cbackend\u003e`: OCR backend: `tesseract`, `easyocr`, or `paddleocr`\n- `--ats`: include ATS scoring in output\n- `--view \u003cjson|pretty\u003e`: render machine JSON or human-friendly terminal output\n- `--output \u003cpath\u003e`: write structured output to a file\n- `--compact`: emit minified JSON\n\nBatch-only flags:\n\n- `batch [inputs...]`: process many resumes at once\n- `--input-dir \u003cpath\u003e`: scan a directory for resumes\n- `--glob \u003cpattern\u003e`: file selection pattern for directory scanning\n- `--concurrency \u003cn\u003e`: parallel batch workers, defaults to `4`\n- `--fail-fast`: stop batch processing on the first extraction error\n- `--output-format \u003cjson|jsonl|csv\u003e`: structured batch output format\n\nExtra commands:\n\n- `setup-model`: download the configured model into the local cache or custom `--model` path\n- `update-model`: pull the latest model from Hugging Face, re-downloading all files\n- `doctor`: inspect model readiness, file integrity, writable cache paths, runtime platform, and optional OCR availability\n- `doctor --fix`: download/repair the configured model, then report status\n- `doctor --json`: emit machine-readable diagnostics\n\nThe CLI checks for model updates once per day. If a newer model is available on Hugging Face, a warning is shown on stderr. 
Run `update-model` to pull the latest.\n\nOutput behavior:\n\n- Single resume commands default to `pretty` view on a TTY and `json` otherwise.\n- Batch commands default to `pretty` summaries on a TTY and structured JSON otherwise.\n- Use `--view json` when piping to other tools.\n- Use `--output` with `batch` plus `--output-format jsonl` for machine-friendly bulk processing.\n- Use `--output-format csv` when you want spreadsheet-friendly flat output with summary fields plus numbered experience and education columns.\n\n## Limitations\n\n- English resumes only\n- Max 512 tokens per chunk (section-aware chunking splits at paragraph boundaries for longer resumes)\n- Image-based/scanned PDFs require OCR before text extraction\n- Two-column PDF layouts may flatten during text extraction\n\n## Development\n\n```bash\nbun run test        # Run tests\nbun run check       # Biome lint + format check\nbun run typecheck   # TypeScript type check\nbun run format      # Auto-format\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomus%2Fresume-extract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsomus%2Fresume-extract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomus%2Fresume-extract/lists"}