{"id":26109553,"url":"https://github.com/wzbsocialsciencecenter/pdf2xml-viewer","last_synced_at":"2026-03-11T14:36:53.697Z","repository":{"id":85191092,"uuid":"62876379","full_name":"WZBSocialScienceCenter/pdf2xml-viewer","owner":"WZBSocialScienceCenter","description":"A simple viewer and inspection tool for text boxes in PDF documents","archived":false,"fork":false,"pushed_at":"2022-03-07T10:45:34.000Z","size":113,"stargazers_count":95,"open_issues_count":0,"forks_count":20,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-12T20:36:24.729Z","etag":null,"topics":["d3","ocr","pdf","pdf-document","viewer","xml"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WZBSocialScienceCenter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-07-08T09:18:01.000Z","updated_at":"2025-02-27T02:26:26.000Z","dependencies_parsed_at":"2023-06-15T02:00:16.770Z","dependency_job_id":null,"html_url":"https://github.com/WZBSocialScienceCenter/pdf2xml-viewer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/WZBSocialScienceCenter/pdf2xml-viewer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fpdf2xml-viewer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fpdf2xml-viewer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fpdf2xml-viewer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
WZBSocialScienceCenter%2Fpdf2xml-viewer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WZBSocialScienceCenter","download_url":"https://codeload.github.com/WZBSocialScienceCenter/pdf2xml-viewer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fpdf2xml-viewer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30384077,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-11T14:10:17.325Z","status":"ssl_error","status_checked_at":"2026-03-11T14:09:37.934Z","response_time":84,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["d3","ocr","pdf","pdf-document","viewer","xml"],"created_at":"2025-03-09T23:12:04.660Z","updated_at":"2026-03-11T14:36:53.680Z","avatar_url":"https://github.com/WZBSocialScienceCenter.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf2xml-viewer - A simple viewer and inspection tool for text boxes in PDF documents\n\nJuly 2016 / Feb. 
2017, Markus Konrad \u003cmarkus.konrad@wzb.eu\u003e / \u003cpost@mkonrad.net\u003e / [Berlin Social Science Center](https://www.wzb.eu/en)\n\n**This project is currently not maintained.**\n\n## Introduction\n\nThis is a small tool for viewing and examining individual text boxes in PDF documents. This is\nvery helpful for analyzing the distribution of texts across a page, especially in the case of\n[OCR-processed PDFs](https://en.wikipedia.org/wiki/Optical_character_recognition) (so-called \"sandwich PDFs\") from\nwhich you might want to extract structured information (see \n[pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract) for this). With this viewer, you can examine\nsuch PDFs and have a look at the properties of individual text boxes, like position, width, height or font\nspecification. In combination with pdftabextract, you can view the grids that were generated for the detected columns and rows. [This blog post](https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/) shows an example usage.\n\nThe viewer requires you to convert your PDFs to the *pdf2xml* format. Afterwards\nyou can start up a local webserver, display this XML file in the viewer (as seen below) and examine the individual\ntext boxes with your browser's developer console.\n\n![OCR PDF example in the viewer](https://datascience.blog.wzb.eu/wp-content/uploads/10/2017/02/pdf2xml-viewer-page.png)\n\nThe created file in pdf2xml format can later also be used to extract structured information, which I explain in my\nseries of blog posts about [data mining PDFs](https://datascience.blog.wzb.eu/category/pdfs/).\n\n## How to use it\n\n### 1. Convert a PDF to pdf2xml format\n\nFirst, you need to convert your PDFs using **poppler-utils**, a package which is part of most Linux distributions\nand is also available for OSX via Homebrew or MacPorts. 
From this package we need the command `pdftohtml` and can create\nan XML file in pdf2xml format in the following way using the Terminal:\n\n```\npdftohtml -c -hidden -xml input.pdf output.xml\n```\n\nThe arguments *input.pdf* and *output.xml* are your input PDF file and the created XML file in pdf2xml format\nrespectively. It is important that you specify the *-hidden* parameter when you're dealing with OCR-processed\n(\"sandwich\") PDFs. You can furthermore add the parameters *-f n* and *-l n* to convert only a range of\npages.\n\n### 2. Start a minimal local webserver to display the text boxes in the PDF with the viewer\n\nNow that you have your file(s) in pdf2xml format, change to the directory where pdf2xml-viewer resides (where its\n*index.html* file is). You should also copy the generated XML files to this location. Now let's start up a minimal\nlocal webserver. This can be done very easily with Python, which is installed on Linux and Mac OSX by default.\nYou can do so in the Terminal with Python 2.x:\n\n```\npython -m SimpleHTTPServer 8080\n```\n\nOr with Python 3:\n```\npython3 -m http.server 8080 --bind 127.0.0.1\n```\n\nNow open your browser and go to the address http://127.0.0.1:8080. The viewer shows up and you can now enter the\nfile name of your file to load (it must be relative to the directory in which pdf2xml-viewer resides). If you just\nwant to see an example, type in *example/ocr-output.pdf.xml* and load this file. Now you can browse through the pages of\nyour PDF document and you'll see the text boxes with red frames. You can further examine these boxes by using your\nbrowser's inspection tools (right click on element and select \"Inspect\" in Chrome or Firefox) as seen below:\n\n![OCR example in pdf2xml-viewer with browser inspection tools](https://datascience.blog.wzb.eu/wp-content/uploads/10/2016/07/ocr-example-output-devconsole.png)\n\n### 3. 
Use the advanced features of the viewer\n\nYou can load a page grid JSON file that was generated with [pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract) (function `common.save_page_grids`):\n\n![Generated page grid viewed in pdf2xml-viewer](https://datascience.blog.wzb.eu/wp-content/uploads/10/2017/02/pdf2xml-viewer-pagegrid.png)\n\n### 4. Extract data from your PDFs\n\nIf you want to extract structured data from the PDFs, you should have a look at the\n[pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract) package.\n\n## Technical details\n\nThis viewer uses [d3.js](https://d3js.org) to display the pdf2xml file. I chose this approach because it is the fastest\nand simplest way to inspect individual elements of an (OCR-processed) PDF document, without using expensive special\nsoftware. Furthermore, it allows adding further features such as displaying overlays of calculated lines or grids.\n\n## License\n\nApache License 2.0. See LICENSE file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwzbsocialsciencecenter%2Fpdf2xml-viewer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwzbsocialsciencecenter%2Fpdf2xml-viewer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwzbsocialsciencecenter%2Fpdf2xml-viewer/lists"}