{"id":13665427,"url":"https://github.com/ropensci/pdftools","last_synced_at":"2025-08-27T10:03:52.632Z","repository":{"id":41545395,"uuid":"52382813","full_name":"ropensci/pdftools","owner":"ropensci","description":"Text Extraction, Rendering and Converting of PDF Documents","archived":false,"fork":false,"pushed_at":"2025-03-03T13:48:15.000Z","size":1101,"stargazers_count":529,"open_issues_count":55,"forks_count":71,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-03-13T03:42:58.949Z","etag":null,"topics":["pdf-files","pdf-format","pdftools","poppler","poppler-library","r","r-package","rstats","text-extraction"],"latest_commit_sha":null,"homepage":"https://docs.ropensci.org/pdftools","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ropensci.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-02-23T18:43:46.000Z","updated_at":"2025-03-03T13:48:19.000Z","dependencies_parsed_at":"2023-01-20T01:33:15.811Z","dependency_job_id":"39afb7e7-8aaa-4155-a7b1-098609504bd0","html_url":"https://github.com/ropensci/pdftools","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fpdftools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fpdftools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fpdftools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fpd
ftools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci","download_url":"https://codeload.github.com/ropensci/pdftools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250960013,"owners_count":21514369,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf-files","pdf-format","pdftools","poppler","poppler-library","r","r-package","rstats","text-extraction"],"created_at":"2024-08-02T06:00:37.970Z","updated_at":"2025-04-26T08:32:03.070Z","avatar_url":"https://github.com/ropensci.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# pdftools\n\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)\n[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdftools)](http://cran.r-project.org/package=pdftools)\n[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/pdftools)](http://cran.r-project.org/web/packages/pdftools/index.html)\n\n## Introduction\n\nScientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. 
From the extracted plain text, one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or on paywalled search engines.\n\nThe pdftools package overlaps slightly with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.\n\n\n## Installation\n\nOn Windows and Mac the binary packages can be installed directly from CRAN:\n\n```r\ninstall.packages(\"pdftools\")\n```\n\nInstallation on Linux requires the poppler development library. For __Ubuntu 18.04 (Bionic)__ and __Ubuntu 20.04 (Focal)__ we provide backports of poppler version 22.02 to support the latest functionality:\n\n```\nsudo add-apt-repository -y ppa:cran/poppler\nsudo apt-get update\nsudo apt-get install -y libpoppler-cpp-dev\n```\n\nOn other versions of __Debian__ or __Ubuntu__ simply use:\n\n```\nsudo apt-get install libpoppler-cpp-dev\n```\n\nIf you want to install the package from source on __macOS__ you need brew:\n\n```\nbrew install poppler\n```\n\nOn Fedora:\n\n```\nsudo yum install poppler-cpp-devel\n```\n\n## Getting started\n\nThe `?pdftools` manual page shows a brief overview of the main utilities. The most important function is `pdf_text`, which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.\n\n```r\nlibrary(pdftools)\ndownload.file(\"http://arxiv.org/pdf/1403.2805.pdf\", \"1403.2805.pdf\", mode = \"wb\")\ntxt \u003c- pdf_text(\"1403.2805.pdf\")\n\n# first page text\ncat(txt[1])\n\n# second page text\ncat(txt[2])\n```\n\nIn addition, the package has some utilities to extract other data from the PDF file. The `pdf_toc` function shows the table of contents, i.e. 
the section headers that PDF readers usually display in a menu on the left. It looks pretty in JSON:\n\n```r\n# Table of contents\ntoc \u003c- pdf_toc(\"1403.2805.pdf\")\n\n# Show as JSON\njsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)\n```\n\nOther functions provide information about fonts, attachments and metadata such as the author, creation date or tags.\n\n\n```r\n# Author, version, etc.\ninfo \u003c- pdf_info(\"1403.2805.pdf\")\n\n# Table with fonts\nfonts \u003c- pdf_fonts(\"1403.2805.pdf\")\n```\n\n## Rendering pdf files\n\nAnother feature of pdftools is rendering of PDF files to bitmap arrays (images). The poppler library provides all the functionality needed to implement a complete PDF reader, including graphical display of the content. In R we can use `pdf_render_page` to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.\n\n```r\n# renders pdf to bitmap array\nbitmap \u003c- pdf_render_page(\"1403.2805.pdf\", page = 1)\n\n# save bitmap image\npng::writePNG(bitmap, \"page.png\")\nwebp::write_webp(bitmap, \"page.webp\")\n```\n\n## Limitations and related packages\n\n### Tables\n\nData scientists are often interested in data from tables. Unfortunately, the pdf format is pretty dumb and has no notion of a table (unlike, for example, HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with `pdftools`.\n\n```r\ntxt \u003c- pdf_text(\"http://arxiv.org/pdf/1406.4806.pdf\")\n\n# some tables\ncat(txt[18])\ncat(txt[19])\n```\n\nThe [`tabulizer`](https://github.com/ropensci/tabulizer) package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. 
However, `tabulizer` depends on `rJava`, so it requires additional setup steps and may be impossible to use on systems where Java cannot be installed.\n\nWith some creativity, `pdftools` can also be used to parse tables from PDF documents, and this does not require Java to be installed.\n\n### Scanned text\n\nIf you want to extract text from scanned pages in a pdf, you'll need to use OCR (optical character recognition). Please refer to the [rOpenSci `tesseract` package](https://github.com/ropensci/tesseract) that provides bindings to the Tesseract OCR engine. In particular, read [the section of its vignette about reading from PDF files using `pdftools` and `tesseract`](https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html#read_from_pdf_files).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fropensci%2Fpdftools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fropensci%2Fpdftools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fropensci%2Fpdftools/lists"}