Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/hrbrmstr/pdfbox

📄◻️ Create, Maniuplate and Extract Data from PDF Files (R Apache PDFBox wrapper)
https://github.com/hrbrmstr/pdfbox

pdf-document pdf-files pdfbox pdfbox-wrapper r r-cyber rstats

Last synced: 3 months ago
JSON representation

📄◻️ Create, Maniuplate and Extract Data from PDF Files (R Apache PDFBox wrapper)

Host: GitHub
URL: https://github.com/hrbrmstr/pdfbox
Owner: hrbrmstr
License: apache-2.0
Created: 2017-10-22T22:11:23.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2019-01-15T18:03:36.000Z (about 6 years ago)
Last Synced: 2024-10-12T21:24:15.549Z (3 months ago)
Topics: pdf-document, pdf-files, pdfbox, pdfbox-wrapper, r, r-cyber, rstats
Language: Java
Homepage:
Size: 9.65 MB
Stars: 46
Watchers: 5
Forks: 3
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

README

        ---

output: rmarkdown::github_document

editor_options: 

  chunk_output_type: console

---

```{r, echo = FALSE, include=FALSE}

knitr::opts_chunk$set(

  message = FALSE,

  warning = FALSE,

  collapse = TRUE,

  comment = "##"

)

```

[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/pdfbox.svg?branch=master)](https://travis-ci.org/hrbrmstr/pdfbox) 

[![Coverage Status](https://codecov.io/gh/hrbrmstr/pdfbox/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/pdfbox)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdfbox)](https://cran.r-project.org/package=pdfbox)

# pdfbox

Create, Maniuplate and Extract Data from PDF Files (R Apache PDFBox wrapper)

## Description

I came across this thread () 

and it looks like some misguided folks are going to help promote the use of PDF 

documents as a legit way to dissemiante data, which means that we're likely to 

see more evil orgs and Government agencies try to use PDFs to hide data.

PDFs are barely useful as publication holders these days let alone data sources.

Apache [PDFBox](https://pdfbox.apache.org/index.html) is a project that provides

a comprehensive suite of tools to do things with and to PDF documents. 

The aim here is to fill in any gaps in [`pdftools`](https://github.com/ropensci/pdftools)

since `poppler` may not try to accommodate all the stupidity that we're now likley to see.

## What's Inside The Tin

- The ability to extract URI annotations

The following functions are implemented:

- `extract_uris`:	Extract URI annotations from a PDF document

- `extract_text`:	Extract text from a PDF document

- `pdf_info`:	Retrieve PDF Metadata

## Installation

```{r eval=FALSE}

devtools::install_github("hrbrmstr/pdfboxjars")

devtools::install_github("hrbrmstr/pdfbox")

```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}

options(width=120)

```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}

library(pdfbox)

# current verison

packageVersion("pdfbox")

```

### PDF Info

```{r}

pdf_info(

 system.file(

   "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"

 )

) -> info

dplyr::glimpse(info)

```

### Extract URI Annotations

```{r message=FALSE, warning=FALSE, error=FALSE}

extract_uris(

  system.file("extdata","imperfect-forward-secrecy-ccs15.pdf", package="pdfbox")

)

```

### Extract text

```{r}

extract_text(

  system.file(

    "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"

  )

) -> pg_df

dplyr::glimpse(pg_df)

```

### pdfbox Metrics

```{r echo=FALSE}

cloc::cloc_pkg_md()

```

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). 

By participating in this project you agree to abide by its terms.