https://github.com/bsorrentino/pdf-tools
Extract Markdown + Images from PDF
- Host: GitHub
- URL: https://github.com/bsorrentino/pdf-tools
- Owner: bsorrentino
- License: mit
- Created: 2020-11-08T20:22:02.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2024-12-19T19:13:25.000Z (10 months ago)
- Last Synced: 2025-04-02T01:35:29.776Z (7 months ago)
- Topics: extract-images, markdown, pdf
- Language: TypeScript
- Homepage:
- Size: 9.83 MB
- Stars: 45
- Watchers: 2
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
[npm package](https://www.npmjs.com/package/@bsorrentino/pdf-tools)

# pdf-tools
Tools to extract/transform data from PDF
> Inspired by the [pdf-to-markdown](https://github.com/jzillmann/pdf-to-markdown) project
## Installation
```
npm install @bsorrentino/pdf-tools -g
```
## Requirements
* Node.js >= 16
* **pdf-tools** uses [`canvas`], a [`Cairo`]-backed Canvas implementation for Node.js, so check its [requirements][reqirements] before installing
## pdftools Commands
**Common options:**
```
-o, --outdir [folder] output folder (default: "out")
```
### pdfximages
Extract images (as PNG) from a PDF and save them to the given folder.
**Usage:**
```
pdftools pdfximages|pxi [options]
```
### pdf2images
Create an image (as PNG) for each PDF page.
**Usage:**
```
pdftools pdf2images|p2i
```
### pdf2md
Convert a PDF to Markdown format.
**Usage:**
```
pdftools pdf2md|p2md [options]
```
**Options:**
```
-ps, --pageseparator [separator] add page separator (default: "---")
--imageurl [url prefix] image url prefix
--stats print stats information
--debug print debug information
```
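
To make the effect of `--pageseparator` and `--imageurl` easier to picture, here is a minimal TypeScript sketch. It is not the pdf-tools implementation: the helper names `joinPages` and `prefixImageUrls`, the sample pages, and the `https://example.com/assets` prefix are all hypothetical, and the sketch assumes the separator is placed between pages and the prefix is applied only to relative image targets.

```typescript
// Hypothetical sketch, NOT the actual pdf-tools code.
// Assumption: the separator is emitted between pages, on its own line.
function joinPages(pages: string[], separator = "---"): string {
  return pages.map((page) => page.trim()).join(`\n\n${separator}\n\n`);
}

// Assumption: the prefix is prepended to relative image targets only;
// absolute URLs (with a scheme) and root-relative paths are left untouched.
function prefixImageUrls(markdown: string, urlPrefix: string): string {
  if (urlPrefix === "") return markdown;
  const base = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
  // Matches markdown image syntax: ![alt](target)
  return markdown.replace(
    /!\[([^\]]*)\]\(([^)\s]+)\)/g,
    (match: string, alt: string, target: string) => {
      const isAbsolute = /^[a-z][a-z0-9+.-]*:/i.test(target) || target.startsWith("/");
      return isAbsolute ? match : `![${alt}](${base}${target})`;
    }
  );
}

// Two invented pages standing in for converted PDF pages.
const pages = ["# Title\n\n![figure](img-1.png)", "Second page text"];
const output = prefixImageUrls(joinPages(pages), "https://example.com/assets");
console.log(output);
```

Under these assumptions, the image reference on the first page points at the prefixed location and the two pages are divided by the default `---` separator.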
----
## Conversion to Markdown
### Supported features
* Detect headers
* Detect and extract images
* Extract plain text
* Extract fonts and allow custom mapping through a generated `.font.json` file
> Supported font styles: **bold**, _italic_, `monospace`, **_bold+italic_**
* Detect code blocks (i.e. ` ``` `)
* Detect external links
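
The font-mapping feature above can be pictured with a short TypeScript sketch. The layout of the generated `.font.json` file is not documented here, so the `fontMap` shape, the font names in it, and the `styleText` helper are assumptions made purely for illustration. Only the four output styles come from the list above.

```typescript
// Hypothetical sketch: the real `.font.json` layout may differ.
type FontStyle = "bold" | "italic" | "monospace" | "bold+italic" | "plain";

// Assumed shape: PDF font name -> markdown style. Font names are examples only.
const fontMap: Record<string, FontStyle> = {
  "Helvetica-Bold": "bold",
  "Helvetica-Oblique": "italic",
  "Courier": "monospace",
  "Helvetica-BoldOblique": "bold+italic",
};

// Wrap a text run in the markdown markers for the style mapped to its font.
// Fonts missing from the map fall back to plain text.
function styleText(text: string, fontName: string): string {
  const style: FontStyle = fontMap[fontName] ?? "plain";
  switch (style) {
    case "bold":
      return `**${text}**`;
    case "italic":
      return `_${text}_`;
    case "monospace":
      return `\`${text}\``;
    case "bold+italic":
      return `**_${text}_**`;
    default:
      return text;
  }
}

console.log(styleText("warning", "Helvetica-Bold"));
console.log(styleText("npm install", "Courier"));
```

Editing the mapping (for example, pointing a decorative font at `plain`) would change how text in that font is rendered, which is the kind of customisation a mapping file makes possible.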
### TO DO
* Detect TOC
[`canvas`]: https://www.npmjs.com/package/canvas
[`Cairo`]: http://cairographics.org/
[reqirements]: https://github.com/Automattic/node-canvas#compiling