{"id":26665505,"url":"https://github.com/elsehow/corpus-from-pdfs","last_synced_at":"2026-02-14T22:31:06.433Z","repository":{"id":76067195,"uuid":"48306497","full_name":"elsehow/corpus-from-pdfs","owner":"elsehow","description":"make a text corpus (for machine learning) from a batch of PDFs","archived":false,"fork":false,"pushed_at":"2015-12-20T18:43:57.000Z","size":3,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-08T06:53:34.832Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elsehow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-12-20T03:55:36.000Z","updated_at":"2023-02-19T06:58:48.000Z","dependencies_parsed_at":"2023-03-15T20:00:36.544Z","dependency_job_id":null,"html_url":"https://github.com/elsehow/corpus-from-pdfs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/elsehow/corpus-from-pdfs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elsehow%2Fcorpus-from-pdfs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elsehow%2Fcorpus-from-pdfs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elsehow%2Fcorpus-from-pdfs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elsehow%2Fcorpus-from-pdfs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elsehow","download_url":"https://codeload.github.com/elsehow/corpus-from-pdfs/tar.gz/refs/heads/master","
sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elsehow%2Fcorpus-from-pdfs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29458563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T21:29:27.764Z","status":"ssl_error","status_checked_at":"2026-02-14T21:28:11.111Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-25T17:37:22.450Z","updated_at":"2026-02-14T22:31:06.415Z","avatar_url":"https://github.com/elsehow.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# corpus-from-pdfs\n\nthis makes a text corpus (for machine learning) from a batch of PDFs\n\n## usage\n\n**TODO this api**\n\n    corpus-from-pdfs my-pdf-dir/*.pdf\n\n\n## installation\n\n**TODO publish on npm**\n\n    npm install -g corpus-from-pdfs\n\nunfortunately, [pdf-extract](https://www.npmjs.com/package/pdf-extract) has a few native dependencies you'll need to install on your platform:\n\n- pdftk\n- pdftotext\n- ghostscript\n- tesseract\n\n### OSX\nTo begin on OSX, first make sure you have the homebrew package manager installed.\n\n**pdftk** is not available in Homebrew. 
However, a GUI installer is available here.\n[http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/)\n\n**pdftotext** is included as part of the **poppler** utilities library. **poppler** can be installed via homebrew\n\n``` bash\nbrew install poppler\n```\n\n**ghostscript** can be installed via homebrew\n``` bash\nbrew install gs\n```\n\n**tesseract** can be installed via homebrew as well\n\n`brew install tesseract`\n\nAfter tesseract is installed you need to install the alphanumeric config and an updated trained data file\n``` bash\ncd \u003croot of this module\u003e\nnpm install\ncp \"./node_modules/share/eng.traineddata\" \"/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata\"\ncp \"./node_modules/share/dia.traineddata\" \"/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata\"\ncp \"./node_modules/share/configs/alphanumeric\" \"/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric\"\n```\n\n### Ubuntu\n**pdftk** can be installed directly via apt-get\n```bash\napt-get install pdftk\n```\n\n**pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute\n``` bash\napt-get install poppler-utils\n```\n\n**ghostscript** can be installed via apt-get\n``` bash\napt-get install ghostscript\n```\n\n**tesseract** can be installed via apt-get. Note that unlike the OSX install, the package is called **tesseract-ocr** on Ubuntu, not **tesseract**\n``` bash\napt-get install tesseract-ocr\n```\n\nFor the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the *alphanumeric* file included with this pdf-extract module into the *tess-data* folder on your system. Also, the eng.traineddata included with the standard tesseract-ocr package is out of date. 
This pdf-extract module provides an up-to-date version which you should copy into the appropriate location on your system\n``` bash\ncd \u003croot of this module\u003e\nnpm install\ncp \"./node_modules/share/eng.traineddata\" \"/usr/share/tesseract-ocr/tessdata/eng.traineddata\"\ncp \"./node_modules/share/configs/alphanumeric\" \"/usr/share/tesseract-ocr/tessdata/configs/alphanumeric\"\n```\n\n\n### SmartOS\n**pdftk** can be installed directly via apt-get\n```bash\napt-get install pdftk\n```\n\n**pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute\n``` bash\napt-get install poppler-utils\n```\n\n**ghostscript** can be installed via pkgin. Note you may need to update the pkgin repo to include the additional sources provided by Joyent. Check [http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html](http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html) for details\n``` bash\npkgin install ghostscript\n```\n\n**tesseract** must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing leptonica is available from [http://www.leptonica.com/download.html](http://www.leptonica.com/download.html), with the latest version tarball available from [http://www.leptonica.com/source/leptonica-1.69.tar.gz](http://www.leptonica.com/source/leptonica-1.69.tar.gz)\n``` bash\npkgin install autoconf\nwget http://www.leptonica.com/source/leptonica-1.69.tar.gz\ntar -xvzf leptonica-1.69.tar.gz\ncd leptonica-1.69\n./configure\nmake\n[sudo] make install\n```\nAfter installing leptonica, move on to tesseract. 
Tesseract is available from [https://code.google.com/p/tesseract-ocr/downloads/list](https://code.google.com/p/tesseract-ocr/downloads/list) with the latest version available from [https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz\u0026can=2\u0026q=](https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz\u0026can=2\u0026q=)\n``` bash\nwget \"https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz\u0026can=2\u0026q=\"\ntar -xvzf tesseract-ocr-3.02.02.tar.gz\ncd tesseract-ocr\n./configure\nmake\n[sudo] make install\n```\n\n### Windows\nNot yet tested. If you figure out how to use pdf-extract on Windows, send me a pull request and I will update the readme accordingly\n\n## Usage\n=======\n\n### OCR Extract from scanned image\nExtract from a pdf file which contains a scanned image and no searchable text\n``` javascript\nvar inspect = require('eyes').inspector({maxLength:20000});\nvar pdf_extract = require('pdf-extract');\nvar absolute_path_to_pdf = '~/Downloads/sample.pdf'\nvar options = {\n  type: 'ocr' // perform ocr to get the text within the scanned image\n}\n\nvar processor = pdf_extract(absolute_path_to_pdf, options, function(err) {\n  if (err) {\n    return callback(err);\n  }\n});\nprocessor.on('complete', function(data) {\n  inspect(data.text_pages, 'extracted text pages');\n  callback(null, data.text_pages);\n});\nprocessor.on('error', function(err) {\n  inspect(err, 'error while extracting pages');\n  return callback(err);\n});\n```\n\n\n\n### Text extract from searchable pdf\nExtract from a pdf file which contains actual searchable text\n``` javascript\nvar inspect = require('eyes').inspector({maxLength:20000});\nvar pdf_extract = require('pdf-extract');\nvar absolute_path_to_pdf = '~/Downloads/electronic.pdf'\nvar options = {\n  type: 'text'  // extract the actual text in the pdf file\n}\nvar processor = pdf_extract(absolute_path_to_pdf, options, 
function(err) {\n  if (err) {\n    return callback(err);\n  }\n});\nprocessor.on('complete', function(data) {\n  inspect(data.text_pages, 'extracted text pages');\n  callback(null, data.text_pages);\n});\nprocessor.on('error', function(err) {\n  inspect(err, 'error while extracting pages');\n  return callback(err);\n});\n\n```\n#### Options\nAt a minimum you must specify the type of pdf extract you wish to perform\n\n**clean**\nWhen the system extracts text from a multi-page pdf, it first splits the pdf into single pages. These are written to disk before the ocr occurs. For some applications these single page files can be useful. If you need to work with the single page pdf files after the ocr is complete, set the **clean** option to **false** as shown below. Note that the single page pdf files are written to the system appropriate temp directory, so you must copy the files to a more permanent location yourself after the ocr process completes\n``` javascript\nvar options = {\n  type: 'ocr', // (required), perform ocr to get the text within the scanned image\n  clean: false, // keep the single page pdfs created during the ocr process\n  ocr_flags: [\n    '-psm 1',       // automatically detect page orientation\n    '-l dia',       // use a custom language file\n    'alphanumeric'  // only output ascii characters\n  ]\n}\n```\n\n\n### Events\nWhen processing, the module will emit various events as they occur\n\n**page**\nEmitted when a page has completed processing. The data passed with this event looks like\n``` javascript\nvar data = {\n  hash: \u003csha1 hash of the input pdf file here\u003e,\n  text: \u003cextracted text here\u003e,\n  index: 2,\n  num_pages: 4,\n  pdf_path: \"~/Downloads/input_pdf_file.pdf\",\n  single_page_pdf_path: \"/tmp/temp_pdf_file2.pdf\"\n}\n```\n\n**error**\nEmitted when an error occurs during processing. 
After this event is emitted, processing stops.\nThe data passed with this event looks like\n```\nvar data = {\n  error: 'no file exists at the path you specified',\n  pdf_path: \"~/Downloads/input_pdf_file.pdf\",\n}\n```\n\n**complete**\nEmitted when all pages have completed processing and the pdf extraction is complete\n```\nvar data = {\n  hash: \u003csha1 hash of the input pdf file here\u003e,\n  text_pages: \u003cArray of Strings, one per page\u003e,\n  pdf_path: \"~/Downloads/input_pdf_file.pdf\",\n  single_page_pdf_file_paths: [\n    \"/tmp/temp_pdf_file1.pdf\",\n    \"/tmp/temp_pdf_file2.pdf\",\n    \"/tmp/temp_pdf_file3.pdf\",\n    \"/tmp/temp_pdf_file4.pdf\",\n  ]\n}\n```\n\n**log**\nTo avoid spamming process.stdout, log events are emitted instead.\n\n## Tests\n=======\nTo test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following command in the pdf-extract module folder\n```\ncd \u003cproject_root\u003e/node_modules/pdf-extract\nnpm test\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felsehow%2Fcorpus-from-pdfs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felsehow%2Fcorpus-from-pdfs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felsehow%2Fcorpus-from-pdfs/lists"}