https://github.com/tomashubelbauer/pdf-scrape

Demonstrating PDF text and image extraction with correct bounds

Host: GitHub
URL: https://github.com/tomashubelbauer/pdf-scrape
Owner: TomasHubelbauer
Created: 2020-05-01T10:12:03.000Z (about 6 years ago)
Default Branch: main
Last Pushed: 2022-04-14T21:05:17.000Z (about 4 years ago)
Last Synced: 2025-06-01T16:40:04.448Z (about 1 year ago)
Topics: pdf, pdf-js, pdf-scraping, pdfjs
Language: JavaScript
Homepage: https://tomashubelbauer.github.io/pdf-scrape
Size: 1.54 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

# [PDF Scrape](https://tomashubelbauer.github.io/pdf-scrape)

0. Print `demo.html` to `demo.pdf` or use your own document

1. Go to https://mozilla.github.io/pdf.js/getting_started

2. Download **Stable**

3. Extract `pdf.js`, `pdf.worker.js` and their corresponding `*.map` files into this directory

4. Make `index.html` and reference PDF.js:

`index.html`

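```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
  </head>
  <body>
  </body>
</html>
```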

5. Create `index.js` and reference it from `index.html`:

`index.js`

```js

```

`index.html`

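```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>
  </body>
</html>
```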

6. Update `index.js` with code to load the document and get its first page:

`index.js`

```js
void async function () {
  // Load the document and get its first page (page numbers start at 1)
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
}()
```

7. Add a `canvas` element to `index.html` where the page will be rendered:

`index.html`

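```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>
    <canvas id="pageCanvas"></canvas>
  </body>
</html>
```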

8. Extend the code to render the page to the canvas context:

`index.js`

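```js
window.addEventListener('load', async () => {
  // `document` here is the PDF document and shadows the DOM document,
  // which is why the canvas is looked up through `window.document`
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);

  // Size the canvas to the page at scale 1
  const viewport = page.getViewport({ scale: 1 });
  const canvas = window.document.getElementById('pageCanvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;

  // Render the page and wait for the render task to finish
  const context = canvas.getContext('2d');
  await page.render({ canvasContext: context, viewport }).promise;
});
```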

9. Hook up code to extract the text and highlight the bounds of text and images (see this repo)
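
As a rough illustration of what the last step involves, the sketch below computes canvas-space bounding boxes without touching the DOM. It is not the code from this repo. It assumes text items shaped like those returned by `page.getTextContent()` (a `transform` matrix plus `width` and `height` in unscaled page units), an operator list shaped like the result of `page.getOperatorList()`, and unrotated text. The helper names `multiply`, `boundsOf`, `textItemBounds` and `imageBounds` are made up for this example.

```js
// Multiply two affine matrices in PDF order [a, b, c, d, e, f].
// The result applies m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Transform the corners of the rectangle (x0, y0)-(x1, y1) and return
// the axis-aligned box around them as { left, top, width, height }.
function boundsOf([a, b, c, d, e, f], x0, y0, x1, y1) {
  const corners = [[x0, y0], [x1, y0], [x0, y1], [x1, y1]]
    .map(([x, y]) => [a * x + c * y + e, b * x + d * y + f]);
  const xs = corners.map(point => point[0]);
  const ys = corners.map(point => point[1]);
  const left = Math.min(...xs);
  const top = Math.min(...ys);
  return { left, top, width: Math.max(...xs) - left, height: Math.max(...ys) - top };
}

// Assumption: the text is unrotated and item.transform[4..5] is the
// baseline origin, so the box spans from the baseline up by item.height
// (descenders below the baseline are not covered).
function textItemBounds(item, viewportTransform) {
  const [, , , , x, y] = item.transform;
  return boundsOf(viewportTransform, x, y, x + item.width, y + item.height);
}

// Walk the operator list while tracking the current transformation matrix.
// An image is painted into the unit square of the current user space.
// `ops` is expected to carry the operator codes (for example pdfjsLib.OPS).
function imageBounds(operatorList, ops, viewportTransform) {
  const stack = [];
  const results = [];
  let ctm = [1, 0, 0, 1, 0, 0];
  operatorList.fnArray.forEach((fn, index) => {
    const args = operatorList.argsArray[index];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() || ctm;
    } else if (fn === ops.transform) {
      ctm = multiply(ctm, args);
    } else if (fn === ops.paintImageXObject) {
      results.push(boundsOf(multiply(viewportTransform, ctm), 0, 0, 1, 1));
    }
  });
  return results;
}

// Sample data standing in for a 792-unit-tall page rendered at scale 1
const sampleViewportTransform = [1, 0, 0, -1, 0, 792];
const sampleItem = { str: 'Hello', transform: [12, 0, 0, 12, 72, 700], width: 100, height: 12 };
console.log(textItemBounds(sampleItem, sampleViewportTransform));
// → { left: 72, top: 80, width: 100, height: 12 }
```

In the page from step 8, the inputs would presumably be `viewport.transform`, the `items` of `await page.getTextContent()`, `await page.getOperatorList()` and `pdfjsLib.OPS`, and each resulting box could be outlined with `context.strokeRect(left, top, width, height)`. Images painted inside form XObjects or through other paint operators are ignored in this sketch.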
