{"id":23763680,"url":"https://github.com/tomashubelbauer/pdf-scrape","last_synced_at":"2025-07-03T00:05:54.281Z","repository":{"id":107986220,"uuid":"260429676","full_name":"TomasHubelbauer/pdf-scrape","owner":"TomasHubelbauer","description":"Demonstrating PDF text and image extraction with correct bounds","archived":false,"fork":false,"pushed_at":"2022-04-14T21:05:17.000Z","size":1611,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-01T16:40:04.448Z","etag":null,"topics":["pdf","pdf-js","pdf-scraping","pdfjs"],"latest_commit_sha":null,"homepage":"https://tomashubelbauer.github.io/pdf-scrape","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TomasHubelbauer.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-05-01T10:12:03.000Z","updated_at":"2023-02-13T19:13:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"52c50975-3965-4b5d-8795-0e2ec97ae753","html_url":"https://github.com/TomasHubelbauer/pdf-scrape","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TomasHubelbauer/pdf-scrape","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TomasHubelbauer%2Fpdf-scrape","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TomasHubelbauer%2Fpdf-scrape/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TomasHubelbauer%2Fpdf-scrape/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/TomasHubelbauer%2Fpdf-scrape/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TomasHubelbauer","download_url":"https://codeload.github.com/TomasHubelbauer/pdf-scrape/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TomasHubelbauer%2Fpdf-scrape/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263234943,"owners_count":23434918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","pdf-js","pdf-scraping","pdfjs"],"created_at":"2024-12-31T22:13:16.350Z","updated_at":"2025-07-03T00:05:54.230Z","avatar_url":"https://github.com/TomasHubelbauer.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [PDF Scrape](https://tomashubelbauer.github.io/pdf-scrape)\r\n\r\n0. Print `demo.html` to `demo.pdf` or use your own document\r\n1. Go to https://mozilla.github.io/pdf.js/getting_started\r\n2. Download **Stable**\r\n3. Extract `pdf.js` and `pdf.worker.js` and their corresponding `*.map` here\r\n4. Make `index.html` and reference PDF.js:\r\n\r\n`index.html`\r\n```html\r\n\u003c!DOCTYPE html\u003e\r\n\u003chtml lang=\"en\"\u003e\r\n  \u003chead\u003e\r\n    \u003cmeta charset=\"utf-8\" /\u003e\r\n    \u003ctitle\u003ePDF Scrape\u003c/title\u003e\r\n    \u003cscript src=\"pdf.js\"\u003e\u003c/script\u003e\r\n  \u003c/head\u003e\r\n  \u003cbody\u003e\r\n\r\n  \u003c/body\u003e\r\n\u003c/html\u003e\r\n```\r\n\r\n5. 
Create `index.js` and reference it from `index.html`:\r\n\r\n`index.js`\r\n```js\r\n```\r\n\r\n`index.html`\r\n```html\r\n\u003c!DOCTYPE html\u003e\r\n\u003chtml lang=\"en\"\u003e\r\n  \u003chead\u003e\r\n    \u003cmeta charset=\"utf-8\" /\u003e\r\n    \u003ctitle\u003ePDF Scrape\u003c/title\u003e\r\n    \u003cscript src=\"pdf.js\"\u003e\u003c/script\u003e\r\n    \u003cscript src=\"index.js\"\u003e\u003c/script\u003e\r\n  \u003c/head\u003e\r\n  \u003cbody\u003e\r\n\r\n  \u003c/body\u003e\r\n\u003c/html\u003e\r\n```\r\n\r\n6. Update `index.js` with code to load the document and get its first page:\r\n\r\n`index.js`\r\n```js\r\nvoid async function () {\r\n  const document = await pdfjsLib.getDocument('demo.pdf').promise;\r\n  const page = await document.getPage(1);\r\n}()\r\n```\r\n\r\n7. Add a `canvas` element to `index.html` where the page will be rendered:\r\n\r\n`index.html`\r\n```html\r\n\u003c!DOCTYPE html\u003e\r\n\u003chtml lang=\"en\"\u003e\r\n  \u003chead\u003e\r\n    \u003cmeta charset=\"utf-8\" /\u003e\r\n    \u003ctitle\u003ePDF Scrape\u003c/title\u003e\r\n    \u003cscript src=\"pdf.js\"\u003e\u003c/script\u003e\r\n    \u003cscript src=\"index.js\"\u003e\u003c/script\u003e\r\n  \u003c/head\u003e\r\n  \u003cbody\u003e\r\n    \u003ccanvas id=\"pageCanvas\"\u003e\u003c/canvas\u003e\r\n  \u003c/body\u003e\r\n\u003c/html\u003e\r\n```\r\n\r\n8. Extend the code to render the page to the canvas context:\r\n\r\n`index.js`\r\n```js\r\nwindow.addEventListener('load', async () =\u003e {\r\n  const document = await pdfjsLib.getDocument('demo.pdf').promise;\r\n  const page = await document.getPage(1);\r\n  const viewport = page.getViewport({ scale: 1 });\r\n  const canvas = window.document.getElementById('pageCanvas');\r\n  canvas.width = viewport.width;\r\n  canvas.height = viewport.height;\r\n  const context = canvas.getContext('2d');\r\n  // Wait for the render task to finish so errors surface here\r\n  await page.render({ canvasContext: context, viewport }).promise;\r\n});\r\n```\r\n\r\n9. 
Hook up code to extract the text and highlight the bounds of text and images (see this repo)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomashubelbauer%2Fpdf-scrape","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomashubelbauer%2Fpdf-scrape","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomashubelbauer%2Fpdf-scrape/lists"}