{"id":13425779,"url":"https://github.com/hrbrmstr/pdfbox","last_synced_at":"2025-03-21T12:30:58.731Z","repository":{"id":141238647,"uuid":"107906891","full_name":"hrbrmstr/pdfbox","owner":"hrbrmstr","description":"📄◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)","archived":false,"fork":false,"pushed_at":"2019-01-15T18:03:36.000Z","size":10119,"stargazers_count":45,"open_issues_count":2,"forks_count":3,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-18T01:11:16.368Z","etag":null,"topics":["pdf-document","pdf-files","pdfbox","pdfbox-wrapper","r","r-cyber","rstats"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2017-10-22T22:11:23.000Z","updated_at":"2025-01-19T13:10:25.000Z","dependencies_parsed_at":"2024-02-03T08:47:48.668Z","dependency_job_id":"9b633104-ad9f-46fd-ae60-0d51895edf76","html_url":"https://github.com/hrbrmstr/pdfbox","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fpdfbox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fpdfbox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fpdfbox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fpdfbox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/hrbrmstr","download_url":"https://codeload.github.com/hrbrmstr/pdfbox/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244799236,"owners_count":20512212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf-document","pdf-files","pdfbox","pdfbox-wrapper","r","r-cyber","rstats"],"created_at":"2024-07-31T00:01:18.708Z","updated_at":"2025-03-21T12:30:57.713Z","avatar_url":"https://github.com/hrbrmstr.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"---\noutput: rmarkdown::github_document\neditor_options: \n  chunk_output_type: console\n---\n\n```{r, echo = FALSE, include=FALSE}\nknitr::opts_chunk$set(\n  message = FALSE,\n  warning = FALSE,\n  collapse = TRUE,\n  comment = \"##\"\n)\n```\n\n[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/pdfbox.svg?branch=master)](https://travis-ci.org/hrbrmstr/pdfbox) \n[![Coverage Status](https://codecov.io/gh/hrbrmstr/pdfbox/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/pdfbox)\n[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdfbox)](https://cran.r-project.org/package=pdfbox)\n\n# pdfbox\n\nCreate, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)\n\n## Description\n\nI came across this thread (\u003chttps://twitter.com/derekwillis/status/922138080043241473\u003e) \nand it looks like some misguided folks are going to help promote the use of PDF \ndocuments as a legit way to disseminate data, which means that we're likely to \nsee more evil orgs and Government 
agencies try to use PDFs to hide data.\n\nPDFs are barely useful as publication holders these days let alone data sources.\n\nApache [PDFBox](https://pdfbox.apache.org/index.html) is a project that provides\na comprehensive suite of tools to do things with and to PDF documents. \n\nThe aim here is to fill in any gaps in [`pdftools`](https://github.com/ropensci/pdftools)\nsince `poppler` may not try to accommodate all the stupidity that we're now likely to see.\n\n## What's Inside The Tin\n\n- The ability to extract URI annotations\n\nThe following functions are implemented:\n\n- `extract_uris`:\tExtract URI annotations from a PDF document\n- `extract_text`:\tExtract text from a PDF document\n- `pdf_info`:\tRetrieve PDF Metadata\n\n## Installation\n\n```{r eval=FALSE}\ndevtools::install_github(\"hrbrmstr/pdfboxjars\")\ndevtools::install_github(\"hrbrmstr/pdfbox\")\n```\n\n```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}\noptions(width=120)\n```\n\n## Usage\n\n```{r message=FALSE, warning=FALSE, error=FALSE}\nlibrary(pdfbox)\n\n# current version\npackageVersion(\"pdfbox\")\n```\n\n### PDF Info\n\n```{r}\npdf_info(\n system.file(\n   \"extdata\", \"imperfect-forward-secrecy-ccs15.pdf\", package=\"pdfbox\"\n )\n) -\u003e info\n\ndplyr::glimpse(info)\n```\n\n### Extract URI Annotations\n\n```{r message=FALSE, warning=FALSE, error=FALSE}\nextract_uris(\n  system.file(\"extdata\",\"imperfect-forward-secrecy-ccs15.pdf\", package=\"pdfbox\")\n)\n```\n\n### Extract text\n\n```{r}\nextract_text(\n  system.file(\n    \"extdata\", \"imperfect-forward-secrecy-ccs15.pdf\", package=\"pdfbox\"\n  )\n) -\u003e pg_df\n\ndplyr::glimpse(pg_df)\n```\n\n### pdfbox Metrics\n\n```{r echo=FALSE}\ncloc::cloc_pkg_md()\n```\n\n## Code of Conduct\n\nPlease note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). 
\nBy participating in this project you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fpdfbox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fpdfbox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fpdfbox/lists"}