https://github.com/tomashubelbauer/pdf-scrape
Demonstrating PDF text and image extraction with correct bounds
- Host: GitHub
- URL: https://github.com/tomashubelbauer/pdf-scrape
- Owner: TomasHubelbauer
- Created: 2020-05-01T10:12:03.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2022-04-14T21:05:17.000Z (about 4 years ago)
- Last Synced: 2025-06-01T16:40:04.448Z (about 1 year ago)
- Topics: pdf, pdf-js, pdf-scraping, pdfjs
- Language: JavaScript
- Homepage: https://tomashubelbauer.github.io/pdf-scrape
- Size: 1.54 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# [PDF Scrape](https://tomashubelbauer.github.io/pdf-scrape)
0. Print `demo.html` to `demo.pdf` or use your own document
1. Go to https://mozilla.github.io/pdf.js/getting_started
2. Download **Stable**
3. Extract `pdf.js`, `pdf.worker.js`, and their corresponding `*.map` files into this directory
4. Make `index.html` and reference PDF.js:
`index.html`
```html
<title>PDF Scrape</title>
<script src="pdf.js"></script>
```
5. Create `index.js` and reference it from `index.html`:
`index.js`
```js
```
`index.html`
```html
<title>PDF Scrape</title>
<script src="pdf.js"></script>
<script src="index.js"></script>
```
6. Update `index.js` with code to load the document and get its first page:
`index.js`
```js
void async function () {
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
}()
```
7. Add a `canvas` element to `index.html` where the page will be rendered:
`index.html`
```html
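<title>PDF Scrape</title>
<script src="pdf.js"></script>
<script src="index.js"></script>
<canvas id="pageCanvas"></canvas>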
```
8. Extend the code to render the page to the canvas context:
`index.js`
```js
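window.addEventListener('load', async () => {
  // The PDF is named `document` here, so the DOM is reached via `window.document` below
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);

  // Size the canvas to the page at 100% scale
  const viewport = page.getViewport({ scale: 1 });
  const canvas = window.document.getElementById('pageCanvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;

  // Wait for rendering to finish before drawing anything on top of the page
  const context = canvas.getContext('2d');
  await page.render({ canvasContext: context, viewport }).promise;
});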
```
9. Hook up code to extract the text and highlight the bounds of text and images (see this repo)
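
The last step can be sketched roughly as follows. This is a minimal illustration, not the code from this repo. It assumes `page.getTextContent()` yields items with `transform`, `width` and `height` in PDF units, that `page.getOperatorList()` yields parallel `fnArray` and `argsArray` lists, and that `viewport.transform` maps PDF units to canvas pixels. The helper names (`applyTransform`, `multiply`, `boundsOf`, `textItemBounds`, `imageBounds`, `highlight`) are hypothetical. Text is assumed to be unrotated, and only `paintImageXObject` operators are treated as images.

```js
// Apply a PDF affine matrix [a, b, c, d, e, f] to the point (x, y).
function applyTransform([a, b, c, d, e, f], x, y) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Compose two matrices so the result applies m2 first and m1 second.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Axis-aligned box around the rectangle (x, y, w, h) after mapping it through `matrix`.
function boundsOf(matrix, x, y, w, h) {
  const points = [[x, y], [x + w, y], [x, y + h], [x + w, y + h]]
    .map(([px, py]) => applyTransform(matrix, px, py));
  const xs = points.map((point) => point[0]);
  const ys = points.map((point) => point[1]);
  const left = Math.min(...xs);
  const top = Math.min(...ys);
  return { x: left, y: top, width: Math.max(...xs) - left, height: Math.max(...ys) - top };
}

// Text: the last two entries of `item.transform` are the baseline origin.
// Assumption: unrotated text, descenders below the baseline are ignored.
function textItemBounds(item, viewportTransform) {
  const [, , , , x, y] = item.transform;
  return boundsOf(viewportTransform, x, y, item.width, item.height);
}

// Images: replay save/restore/transform to track the current matrix;
// an image is painted into the unit square of that matrix.
// Assumption: `ops` has the shape of `pdfjsLib.OPS`.
function imageBounds({ fnArray, argsArray }, ops, viewportTransform) {
  const stack = [];
  const boxes = [];
  let matrix = viewportTransform;
  fnArray.forEach((fn, index) => {
    if (fn === ops.save) stack.push(matrix);
    else if (fn === ops.restore) matrix = stack.pop() ?? matrix;
    else if (fn === ops.transform) matrix = multiply(matrix, argsArray[index]);
    else if (fn === ops.paintImageXObject) boxes.push(boundsOf(matrix, 0, 0, 1, 1));
  });
  return boxes;
}

// Browser-only glue: outline every text item and image on the rendered canvas.
async function highlight(page, viewport, context, ops) {
  const { items } = await page.getTextContent();
  const boxes = items.map((item) => textItemBounds(item, viewport.transform));
  boxes.push(...imageBounds(await page.getOperatorList(), ops, viewport.transform));
  for (const { x, y, width, height } of boxes) {
    context.strokeRect(x, y, width, height);
  }
  return boxes;
}
```

In `index.js`, this could be called as `await highlight(page, viewport, context, pdfjsLib.OPS)` once the render promise has resolved, so the outlines are drawn on top of the page rather than underneath it.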