{"id":13463191,"url":"https://github.com/yob/pdf-reader","last_synced_at":"2025-05-14T22:03:49.135Z","repository":{"id":403279,"uuid":"21716","full_name":"yob/pdf-reader","owner":"yob","description":"The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.","archived":false,"fork":false,"pushed_at":"2025-04-27T00:52:38.000Z","size":27683,"stargazers_count":1856,"open_issues_count":70,"forks_count":277,"subscribers_count":50,"default_branch":"main","last_synced_at":"2025-04-30T07:04:47.734Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yob.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"MIT-LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2008-06-03T01:11:36.000Z","updated_at":"2025-04-27T00:52:42.000Z","dependencies_parsed_at":"2023-01-13T16:21:53.907Z","dependency_job_id":"e37cb88e-94b1-4b5c-ad52-c5d3e58e1812","html_url":"https://github.com/yob/pdf-reader","commit_stats":{"total_commits":1404,"total_committers":67,"mean_commits":"20.955223880597014","dds":"0.23575498575498577","last_synced_commit":"3c8ee4cba03dd20958f8141f01901d2915fccfef"},"previous_names":[],"tags_count":68,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yob%2Fpdf-reader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yob%2Fpdf-reader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yob%2Fpdf-reader/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/yob%2Fpdf-reader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yob","download_url":"https://codeload.github.com/yob/pdf-reader/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252264494,"owners_count":21720520,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T13:00:47.672Z","updated_at":"2025-05-07T08:26:16.919Z","avatar_url":"https://github.com/yob.png","language":"Ruby","readme":"# pdf-reader\n\nThe PDF::Reader library implements a PDF parser conforming as much as possible\nto the PDF specification from Adobe.\n\nIt provides programmatic access to the contents of a PDF file with a high\ndegree of flexibility.\n\nThe PDF 1.7 specification is a weighty document and not all aspects are\ncurrently supported. I welcome submission of PDF files that exhibit\nunsupported aspects of the spec to assist with improving our support.\n\nThis is primarily a low-level library that should be used as the foundation for\nhigher level functionality - it's not going to render a PDF for you. There are\na few exceptions to support very common use cases like extracting text from a\npage.\n\n# Installation\n\nThe recommended installation method is via Rubygems.\n\n```ruby\ngem install pdf-reader\n```\n\n# Usage\n\nBegin by creating a PDF::Reader instance that points to a PDF file. 
Document\nlevel information (metadata, page count, bookmarks, etc.) is available via\nthis object.\n\n```ruby\nreader = PDF::Reader.new(\"somefile.pdf\")\n\nputs reader.pdf_version\nputs reader.info\nputs reader.metadata\nputs reader.page_count\n```\n\nPDF::Reader.new accepts an IO stream or a filename. Here's an example with\nan IO stream:\n\n```ruby\nrequire 'open-uri'\n\n# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs\nio     = URI.open('http://example.com/somefile.pdf')\nreader = PDF::Reader.new(io)\nputs reader.info\n```\n\nIf you open a PDF with File#open or IO#open, I strongly recommend using \"rb\"\nmode to ensure the file isn't mangled by Ruby being 'helpful'. This is\nparticularly important on Windows and MRI \u003e= 1.9.2.\n\n```ruby\nFile.open(\"somefile.pdf\", \"rb\") do |io|\n  reader = PDF::Reader.new(io)\n  puts reader.info\nend\n```\n\nPDF is a page-based file format, so most visible information is available via\npage-based iteration:\n\n```ruby\nreader = PDF::Reader.new(\"somefile.pdf\")\n\nreader.pages.each do |page|\n  puts page.fonts\n  puts page.text\n  puts page.raw_content\nend\n```\n\nIf you need to access the full program for rendering a page, use the walk() method\nof PDF::Reader::Page.\n\n```ruby\nclass RedGreenBlue\n  def set_rgb_color_for_nonstroking(r, g, b)\n    puts \"R: #{r}, G: #{g}, B: #{b}\"\n  end\nend\n\nreader   = PDF::Reader.new(\"somefile.pdf\")\npage     = reader.page(1)\nreceiver = RedGreenBlue.new\npage.walk(receiver)\n```\n\nFor low-level access to the objects in a PDF file, use the ObjectHash class like\nso:\n\n```ruby\nreader  = PDF::Reader.new(\"somefile.pdf\")\nputs reader.objects.inspect\n```\n\n# Text Encoding\n\nRegardless of the internal encoding used in the PDF, all text will be converted\nto UTF-8 before it is passed back from PDF::Reader.\n\nStrings that contain binary data (like font blobs) will be marked as such.\n\n# Former API\n\nVersion 1.0.0 of PDF::Reader introduced a new page-based API that provides\nefficient and easy access to any page.\n\nThe pre-1.0 
API was deprecated during the 1.x release series, and has been\nremoved in 2.0.0.\n\n# Exceptions\n\nThere are two key exceptions that you will need to watch out for when processing a\nPDF file:\n\nMalformedPDFError - The PDF appears to be corrupt in some way. If you believe the\nfile should be valid, or that a corrupt file didn't raise an exception, please\nforward a copy of the file to the maintainers (preferably via the Google group)\nand we will attempt to improve the code.\n\nUnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently\nsupport. Again, we welcome submissions of PDF files that exhibit these features to help\nus with future code improvements.\n\nMalformedPDFError has some subclasses if you want to detect finer-grained issues. If you\ndon't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.\n\nAny other exceptions should be considered bugs in PDF::Reader (please\nreport them!).\n\n# PDF Integrity\n\nWindows developers may run into problems when running specs due to MalformedPDFErrors.\nThis is usually because CRLF characters are automatically added to some of the PDFs in\nthe spec folder when you check out a branch from Git.\n\nTo remove any invalid CRLF characters added while checking out a branch from Git, run:\n\n```ruby\nrake fix_integrity\n```\n\n# Maintainers\n\n* James Healy \u003cmailto:jimmy@deefa.com\u003e\n\n# Licensing\n\nThis library is distributed under the terms of the MIT License. See the included MIT-LICENSE file for\nmore detail.\n\n# Mailing List\n\nAny questions or feedback should be sent to the PDF::Reader Google group. 
It's\nbetter that any answers be available for others instead of hiding in someone's\ninbox.\n\nhttp://groups.google.com/group/pdf-reader\n\n# Examples\n\nThe easiest way to explain how this works in practice is to show some examples.\nCheck out the examples/ directory for a few files.\n\n# Alternate Decoder\n\nFor PDF files containing Ascii85 streams, the [ascii85_native](https://github.com/AnomalousBit/ascii85_native) gem can be used for increased performance. If the ascii85_native gem is detected, pdf-reader will automatically use the gem.\n\nFirst, run `gem install ascii85_native` and then require the gem alongside pdf-reader:\n\n```ruby\nrequire \"pdf-reader\"\nrequire \"ascii85_native\"\n```\n\nAnother way of enabling native Ascii85 decoding is to place `gem 'ascii85_native'` in your project's `Gemfile`.\n\n# Known Limitations\n\nOccasionally some text cannot be extracted properly due to the way it has been\nstored, or the use of invalid bytes. In these cases PDF::Reader will output a\nlittle UTF-8 friendly box to indicate an unrecognisable character.\n\n# Resources\n\n* PDF::Reader Code Repository: http://github.com/yob/pdf-reader\n\n* PDF Specification: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf\n\n* Adobe PDF Developer Resources: http://www.adobe.com/devnet/pdf/pdf_reference.html\n\n* PDF Tutorial Slide Presentations: https://web.archive.org/web/20150110042057/http://home.comcast.net/~jk05/presentations/PDFTutorials.html\n\n* Developing with PDF (book): http://shop.oreilly.com/product/0636920025269.do\n","funding_links":[],"categories":["Documents \u0026 Reports","RUBY","Ruby"],"sub_categories":["PDF Processing"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyob%2Fpdf-reader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyob%2Fpdf-reader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyob%2Fpdf-reader/lists"}