{"id":16095640,"url":"https://github.com/rigwild/ocr-search","last_synced_at":"2025-04-05T20:14:26.270Z","repository":{"id":57191354,"uuid":"436458157","full_name":"rigwild/ocr-search","owner":"rigwild","description":"🔍 Find files that contain some text with OCR","archived":false,"fork":false,"pushed_at":"2021-12-12T00:57:33.000Z","size":2527,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-12T17:06:27.875Z","etag":null,"topics":["jpg","ocr","pdf","png","search","tesseract","tesseract-ocr","webp","worker-threads"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rigwild.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-09T02:30:47.000Z","updated_at":"2024-02-02T10:13:58.000Z","dependencies_parsed_at":"2022-09-16T05:12:26.011Z","dependency_job_id":null,"html_url":"https://github.com/rigwild/ocr-search","commit_stats":null,"previous_names":["rigwild/bulk-files-ocr-search"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rigwild%2Focr-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rigwild%2Focr-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rigwild%2Focr-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rigwild%2Focr-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rigwild","download_url":"https://codeload.github.com/rigwild/ocr-search/tar.
gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247393573,"owners_count":20931813,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jpg","ocr","pdf","png","search","tesseract","tesseract-ocr","webp","worker-threads"],"created_at":"2024-10-09T17:07:15.290Z","updated_at":"2025-04-05T20:14:26.248Z","avatar_url":"https://github.com/rigwild.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCR Search\n\n[![Node.js CI](https://github.com/rigwild/ocr-search/workflows/Node.js%20CI/badge.svg)](https://github.com/rigwild/ocr-search/actions)\n[![npm package](https://img.shields.io/npm/v/ocr-search.svg?logo=npm)](https://www.npmjs.com/package/ocr-search)\n[![npm downloads](https://img.shields.io/npm/dw/ocr-search)](https://www.npmjs.com/package/ocr-search)\n[![license](https://img.shields.io/npm/l/ocr-search?color=blue)](./LICENSE)\n\n🔍 Find files that contain some text with [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition).\n\nSupported file formats:\n\n- Images: JPEG, PNG, [WebP](https://en.wikipedia.org/wiki/WebP)\n- Documents: PDF\n\nUnsupported file formats:\n\n- Images: [AVIF](https://en.wikipedia.org/wiki/AVIF), [WebP 2 (`.wp2`)](https://en.wikipedia.org/wiki/WebP#WebP_2), [JPEG XL (`.jxl`)](https://en.wikipedia.org/wiki/JPEG_XL)\n- Documents: Office (`.docx`, `.xlsx`, `.pptx`, ...)\n\n[Tesseract OCR](https://github.com/tesseract-ocr/tesseract) is used internally ([Tesseract 
Documentation](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc)). For PDF to PNG conversion, [Poppler](https://poppler.freedesktop.org/) is used.\n\nThis package uses [worker threads](https://nodejs.org/api/worker_threads.html) to spread the work across your CPU cores and run faster.\n\n**Notes:**\n\n- OCR gives poor results on rotated files or skewed text.\n  - Rotations of 90 or 180 degrees seem to give good results.\n  - Consider pre-processing your files to straighten the text.\n- A file is matched if at least one of the words is found in its text.\n\n## Install\n\nHowever you use this package, Tesseract OCR must be installed. PDF files must first be converted to images, which requires an additional package.\n\n```sh\n# OCR package (for non-Linux systems, see https://github.com/tesseract-ocr/tesseract#installing-tesseract)\nsudo apt install tesseract-ocr\n\n# PDF to PNG conversion command-line tools (for Windows, see https://stackoverflow.com/a/53960829 - macOS `brew install poppler`)\n# You can skip this if you don't plan to scan PDF files\nsudo apt install poppler-utils\n```\n\n### OCR Language\n\nTo use a language other than English, download and install the required language model from the [Tesseract OCR language models repository](https://github.com/tesseract-ocr/tessdata_fast).\n\n```sh\n# French language\nwget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata\nsudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/\n```\n\n## Use with CLI\n\nThis will install the `ocr-search` CLI.\n\n```sh\npnpm i -g ocr-search\n```\n\n```\n$ ocr-search --help\n\n  🔍 Find files that contain some text with OCR\n\n  Usage\n    $ ocr-search --words \"\u003cwords_list\u003e\" \u003cinput_files\u003e\n\n  To delete images created from PDF files pages extractions, check the other provided command:\n    $ ocr-search --help\n\n  Required\n    --words List of comma-separated words 
to search (if \"MATCH_ALL\", will match everything for mass OCR extraction)\n\n  Options\n    --ignoreExt         List of comma-separated file extensions to ignore (e.g. \".pdf,.jpg\")\n    --pdfExtractFirst   Range start of the pages to extract from PDF files (1-indexed)\n    --pdfExtractLast    Range end of the pages to extract from PDF files, last page if overflow (1-indexed)\n    --progressFile      File to save progress to, will start from where it\n                        stopped last time by looking there (no file, use \"none\")  [default=\"progress.json\"]\n    --matchesLogFile    Log all matches to this file (no file, use \"none\") [default=\"matches.txt\"]\n    --no-console-logs   Silence all console logs\n    --no-show-matches   Do not print matched files text content to the console [default=\"false\"]\n    --workers           Amount of worker threads to use (default is total CPU cores count - 2)\n\n  OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc\n    --lang  Tesseract OCR LANG configuration [default=\"eng\"]\n    --oem   Tesseract OCR OEM configuration [default=\"1\"]\n    --psm   Tesseract OCR PSM configuration [default=\"1\"]\n\n  Examples\n    Scan the \"scanned-dir\" directory and match all the files containing \"system\", \"wiki\" and \"hello\"\n      $ ocr-search --words \"system,wiki,hello\" scanned-dir\n\n    Scan the glob-matched files \"*\" and match all files (mass OCR text extraction)\n      $ ocr-search --words MATCH_ALL *\n\n    Skip .pdf and .webp files\n      $ ocr-search --words \"wiki,hello\" --ignoreExt \".pdf,.webp\" scanned-dir\n\n    Extract only page 3 to 6 in all PDF files (1-indexed)\n      $ ocr-search --words \"wiki,hello\" --pdfExtractFirst 3 --pdfExtractLast 6 scanned-dir\n\n    Use a specific Tesseract OCR configuration\n      $ ocr-search --words \"wiki,hello\" --lang fra --oem 1 --psm 3 scanned-dir\n\n  https://github.com/rigwild/ocr-search\n```\n\nAnother CLI is provided to 
easily remove all extracted PDF page images.\n\n```\n$ ocr-search-clean --help\n\n  🗑️ Find and remove content generated by ocr-search\n\n  Usage\n    $ ocr-search-clean [--pdf] [--txt] \u003cinput_files\u003e\n\n  Options\n    --pdf  Remove images that were generated by PDF files pages extraction (e.g.\"file.pdf-1.png\")\n    --txt  Remove text files that were generated by OCR (option \"--save-ocr\" in \"ocr-search\")\n\n  https://github.com/rigwild/ocr-search\n```\n\n## Use with provided runner\n\n```sh\ngit clone https://github.com/rigwild/ocr-search.git\ncd ocr-search\npnpm install # or npm install -D\npnpm build\n```\n\nPut all your files and directories in the [`data`](./data) directory. Files can be in subfolders.\n\nProgress is printed to the console and saved to the `progress.json` file.\n\nThe files that match at least one of the provided words, along with their text content, are saved to the `matches.txt` file.\n\n```sh\nnode run.js\n```\n\nSee [`run.js`](./run.js).\n\n## Use Programmatically\n\n### Install\n\n```sh\npnpm i ocr-search\n```\n\n### Directory scan\n\n```ts\nimport path from 'path'\nimport { scanDir, TesseractConfig } from 'ocr-search'\n\n// The list of options\nexport type ScanOptions = {\n  /**\n   * List of words to search (if one is matched, the file is matched)\n   *\n   * If not provided, every file will be matched (useful to do mass OCR and save the result)\n   */\n  words?: string[]\n\n  /** Should the OCR scanned content of each file be saved to a txt file (e.g. \"file.png.txt\") */\n  saveOcr?: boolean\n\n  /** Should the logs be printed to the console? (default = false) */\n  shouldConsoleLog?: boolean\n\n  /** Should the text content of matched files be printed to the console? 
(default = true) */\n  shouldConsoleLogMatches?: boolean\n\n  /**\n   * If provided, the progress will be saved to this file\n   *\n   * When restarted, the process resumes from where it last stopped by reading this file\n   */\n  progressFile?: string\n\n  /** If provided, the path and text content of every matched file are logged to this file */\n  matchesLogFile?: string\n\n  /** File extensions to ignore when looking for files (e.g. `new Set(['.pdf', '.jpg'])`) */\n  ignoreExt?: Set\u003cstring\u003e\n\n  /** Extract PDF files starting at this page, first page is 1 (1-indexed) (default = 1) */\n  pdfExtractFirst?: number\n\n  /** Extract PDF files until this page, last page if overflow (1-indexed) (default = last page of PDF file) */\n  pdfExtractLast?: number\n\n  /**\n   * Number of worker threads to use (default = your total CPU cores - 2)\n   *\n   * Note: Using all your available cores may slow down the process!\n   */\n  workerPoolSize?: number\n\n  /**\n   * Tesseract OCR config, defaults to `{ lang: 'eng', oem: 1, psm: 1 }`\n   *\n   * @see https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc\n   */\n  tesseractConfig?: TesseractConfig\n}\n\nconst scannedDir = path.resolve(__dirname, 'data')\nconst words = ['hello', 'match this', '\u003c\u003c\u003c\u003c\u003c']\nconst tesseractConfig: TesseractConfig = { lang: 'fra', oem: 1, psm: 1 }\n\nconsole.time('scan')\n\nawait scanDir(scannedDir, {\n  words,\n  shouldConsoleLog: true,\n  tesseractConfig\n})\n\nconsole.log('Scan finished!')\nconsole.timeEnd('scan')\n```\n\n### Perform OCR on a single file\n\n```ts\nimport path from 'path'\nimport { ocr, TesseractConfig } from 'ocr-search'\n\nconst file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')\n\n// Tesseract configuration\nconst tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }\n\n// Should the string be normalized? 
(lowercase, accents removed, whitespace removed)\nconst shouldCleanStr: boolean | undefined = true\n\nconst text = await ocr(file, tesseractConfig, shouldCleanStr)\nconsole.log(text)\n```\n\n### PDF to images conversion\n\nConvert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.\n\n```ts\nimport path from 'path'\nimport { pdfToImages } from 'ocr-search'\n\nconst filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')\n\n// Extract from page 1 to page 3 (1-indexed)\nconst res = await pdfToImages(filePdf, 1, 3)\nconsole.log(res) // Paths to generated PNG files\n```\n\n## License\n\n[GNU Affero General Public License v3.0](./LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frigwild%2Focr-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frigwild%2Focr-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frigwild%2Focr-search/lists"}