https://github.com/bsorrentino/pdf-tools
Extract Markdown + Images from PDF
- Host: GitHub
- URL: https://github.com/bsorrentino/pdf-tools
- Owner: bsorrentino
- License: mit
- Created: 2020-11-08T20:22:02.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-06-23T15:39:52.000Z (8 months ago)
- Last Synced: 2024-10-13T09:25:52.064Z (4 months ago)
- Topics: extract-images, markdown, pdf
- Language: TypeScript
- Homepage:
- Size: 9.81 MB
- Stars: 42
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
[![npm](https://img.shields.io/npm/v/@bsorrentino/pdf-tools.svg)](https://www.npmjs.com/package/@bsorrentino/pdf-tools)
![example workflow](https://github.com/bsorrentino/pdf-tools/actions/workflows/npm-publish.yml/badge.svg)
# pdf-tools
Tools to extract/transform data from PDF
> inspired by project: [pdf-to-markdown](https://github.com/jzillmann/pdf-to-markdown)
## Installation
```
npm install @bsorrentino/pdf-tools -g
```
## Requirements
* Node.js >= 16
* **pdf-tools** uses [`canvas`], a [`Cairo`]-backed Canvas implementation for Node.js, so check its [requirements][reqirements] before installing.
## pdftools Commands
**common options**
```
-o, --outdir [folder] output folder (default: "out")
```
### pdfximages
Extract images (as PNG) from a PDF and save them to the given folder.
**Usage:**
```
pdftools pdfximages|pxi [options]
```
### pdf2images
Create an image (as PNG) for each PDF page.
**Usage:**
```
pdftools pdf2images|p2i
```
### pdf2md
Convert a PDF to Markdown format.
**Usage:**
```
pdftools pdf2md|p2md [options]
```
**Options:**
```
-ps, --pageseparator [separator] add page separator (default: "---")
--imageurl [url prefix] image url prefix
--stats print stats information
--debug print debug information
```
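To make the effect of `--pageseparator` and `--imageurl` easier to picture, here is a minimal TypeScript sketch. It is not taken from pdf-tools: the `PageContent` type and the `renderPages` function are hypothetical, and the way the tool actually places separators or builds image links is an assumption based only on the option descriptions above.

```typescript
// Hypothetical illustration only: none of these names come from pdf-tools.
interface PageContent {
  text: string;     // Markdown text already extracted for one page
  images: string[]; // file names of the images extracted from that page
}

// Joins pages with a separator and prefixes every image link with a URL prefix.
function renderPages(
  pages: PageContent[],
  pageSeparator: string = "---", // mirrors the documented default of --pageseparator
  imageUrlPrefix: string = ""    // stands in for --imageurl; empty means no prefix
): string {
  return pages
    .map((page) => {
      // Assumed link format: a standard Markdown image with the prefix prepended.
      const imageLinks = page.images.map(
        (name) => `![${name}](${imageUrlPrefix}${name})`
      );
      return [page.text, ...imageLinks].join("\n\n");
    })
    .join(`\n\n${pageSeparator}\n\n`);
}

// Example: two pages, one image, and a URL prefix for hosted assets.
const markdown = renderPages(
  [
    { text: "# Title", images: ["img-1.png"] },
    { text: "Second page", images: [] },
  ],
  "---",
  "https://example.com/assets/"
);
console.log(markdown);
```

Under these assumptions, a prefix such as `https://example.com/assets/` would let the generated Markdown reference images hosted elsewhere instead of files in the output folder.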
----
## Conversion to Markdown
### supported features
* Detect headers
* Detect and extract images
* Extract plain text
* Extract fonts and allow custom mapping through a generated file `.font.json`
> Supported font styles: **bold**, _italic_, `monospace`, **_bold+italic_**
* Detect code blocks (i.e. ` ``` `)
* Detect external links
### TO DO
* Detect TOC
[`canvas`]: https://www.npmjs.com/package/canvas
[`Cairo`]: http://cairographics.org/
[reqirements]: https://github.com/Automattic/node-canvas#compiling