{"id":20209708,"url":"https://github.com/florents-tselai/pgpdf","last_synced_at":"2025-04-04T19:09:28.094Z","repository":{"id":260762548,"uuid":"861424825","full_name":"Florents-Tselai/pgpdf","owner":"Florents-Tselai","description":"pdf type for Postgres","archived":false,"fork":false,"pushed_at":"2025-03-03T11:29:08.000Z","size":12541,"stargazers_count":207,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T07:17:11.494Z","etag":null,"topics":["documents","pdf","postgresql"],"latest_commit_sha":null,"homepage":"https://tselai.com/pgpdf-pdf-type-postgres","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Florents-Tselai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["Florents-Tselai"]}},"created_at":"2024-09-22T21:10:10.000Z","updated_at":"2025-03-04T03:22:49.000Z","dependencies_parsed_at":"2024-11-02T12:26:33.818Z","dependency_job_id":"5876e9ef-4966-454e-974f-b16d38fa40a8","html_url":"https://github.com/Florents-Tselai/pgpdf","commit_stats":null,"previous_names":["florents-tselai/pgpdf"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2Fpgpdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2Fpgpdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2Fpgpdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Flor
ents-Tselai%2Fpgpdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Florents-Tselai","download_url":"https://codeload.github.com/Florents-Tselai/pgpdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247234921,"owners_count":20905854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["documents","pdf","postgresql"],"created_at":"2024-11-14T05:41:52.345Z","updated_at":"2025-04-04T19:09:28.052Z","avatar_url":"https://github.com/Florents-Tselai.png","language":"C","funding_links":["https://github.com/sponsors/Florents-Tselai"],"categories":[],"sub_categories":[],"readme":"# pgPDF: `pdf` type for Postgres\n\n[![build](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml/badge.svg)](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml)\n![GitHub Repo stars](https://img.shields.io/github/stars/Florents-Tselai/pgpdf)\n\u003ca href=\"https://hub.docker.com/repository/docker/florents/pgpdf\"\u003e\u003cimg alt=\"Docker Pulls\" src=\"https://img.shields.io/docker/pulls/florents/pgpdf\"\u003e\u003c/a\u003e\n\nThis extension for PostgreSQL provides a `pdf` data type and assorted functions.\n\nYou can create a `pdf` type, by casting either a `text` filepath or `bytea` column.\n\n```tsql\nSELECT '/tmp/pgintro.pdf'::pdf;\n```\n\n```tsql\n                                       pdf                                        \n----------------------------------------------------------------------------------\n PostgreSQL Introduction                                             
            +\n Digoal.Zhou                                                                     +\n 7/20/2011Catalog                                                                +\n  PostgreSQL Origin \n```\n\nIf you don’t have the PDF file in your filesystem,\nbut have already stored its content in a `bytea` column,\nyou can just cast it to `pdf`.\n\n```tsql\nSELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;\n```\n\n**Why?**: \nThis allows you to work with PDFs in an ACID-compliant way.\nThe usual alternative relies on external scripts or services which can easily \nmake your data ingestion pipeline brittle and leave your raw data out-of-sync.\n\nThe actual PDF parsing is done by [poppler](https://poppler.freedesktop.org).\n\nSee also the blog posts: \n- [Full Text Search on PDFs With Postgres](https://tselai.com/full-text-search-pdf-postgres)\n- [pgpdf: pdf type for Postgres](https://tselai.com/pgpdf-pdf-type-postgres)\n\n## Usage\n\nDownload some PDFs. \n\n```sh\nwget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf\nwget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf\n```\n\nCreate a table with a `pdf` column:\n\n```tsql\nCREATE TABLE pdfs(name text primary key, doc pdf);\n\nINSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');\nINSERT INTO pdfs VALUES ('sample', '/tmp/sample.pdf');\n```\n\nParsing and validation should happen automatically.\nThe files will be read from the disk only once!\n\n\u003e [!NOTE]\n\u003e The filepath should be accessible by the `postgres` process / user!\n\u003e That's different from the user running psql.\n\u003e If you don't understand what this means, ask your DBA!\n\n### Reading from URLs\n\nYou can combine pgpdf with [pgsql-http](https://github.com/pramsey/pgsql-http) \nto quickly grab remote PDFs into Postgres,\nby fetching the remote content as `bytea` and then treating it as a PDF.\n\n```tsql\nCREATE EXTENSION pgpdf;\nCREATE EXTENSION http;\n\nSELECT 
pdf_read_bytes(text_to_bytea(content))\nFROM http_get('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf');\n```\n\n### String Functions and Operators\n\nStandard Postgres [String Functions and Operators](https://www.postgresql.org/docs/17/functions-string.html)\nshould work as usual:\n\n```tsql\nSELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;\n```\n\n```tsql\nSELECT upper('/tmp/pgintro.pdf'::pdf::text);\n```\n\n``` tsql\nSELECT name\nFROM pdfs\nWHERE doc::text LIKE '%Postgres%';\n```\n\n### Full-Text Search (FTS)\n\nYou can also perform full-text search (FTS), since you can work on a `pdf` file like normal text.\n\n```tsql\nSELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');\n```\n\n```tsql\n ?column? \n----------\n t\n(1 row)\n```\n\n```tsql\nSELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');\n```\n\n```tsql\n ?column? \n----------\n f\n(1 row)\n```\n\n### Document similarity with `pg_trgm`\n\nYou can use [pg_trgm](https://postgresql.org/docs/17/interactive/pgtrgm.html)\nto get the similarity between two documents:\n\n```tsql\nCREATE EXTENSION pg_trgm;\n\nSELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);\n```\n\n### Metadata\n\nThe following functions are available:\n\n- `pdf_title(pdf) → text`\n- `pdf_author(pdf) → text`\n- `pdf_num_pages(pdf) → integer`\n\n  Total number of pages in the document\n- `pdf_page(pdf, integer) → text`\n\n  Get the i-th page as text\n- `pdf_creator(pdf) → text`\n- `pdf_keywords(pdf) → text`\n- `pdf_metadata(pdf) → text`\n- `pdf_version(pdf) → text`\n- `pdf_subject(pdf) → text`\n- `pdf_creation(pdf) → timestamp`\n- `pdf_modification(pdf) → timestamp`\n\n```tsql\nSELECT pdf_title('/tmp/pgintro.pdf');\n```\n\n```tsql\n        pdf_title        \n-------------------------\n PostgreSQL Introduction\n(1 row)\n```\n\n```tsql\nSELECT pdf_author('/tmp/pgintro.pdf');\n```\n\n```tsql\n pdf_author \n------------\n 周正中\n(1 row)\n```\n\nGetting a subset of 
pages\n\n```tsql\nSELECT pdf_num_pages('/tmp/pgintro.pdf');\n```\n\n```tsql\n pdf_num_pages \n---------------\n            24\n(1 row)\n```\n\n```tsql\nSELECT pdf_page('/tmp/pgintro.pdf', 1);\n```\n\n```tsql\n           pdf_page           \n------------------------------\n Catalog                     +\n  PostgreSQL Origin         +\n  Layout                    +\n  Features                  +\n  Enterprise Class Attribute+\n  Case\n(1 row)\n```\n\n```tsql\nSELECT pdf_subject('/tmp/pgintro.pdf');\n```\n\n```tsql\n pdf_subject \n-------------\n \n(1 row)\n```\n\n```tsql\nSELECT pdf_creation('/tmp/pgintro.pdf');\n```\n\n```tsql\n       pdf_creation       \n--------------------------\n Wed Jul 20 11:13:37 2011\n(1 row)\n```\n\n```tsql\nSELECT pdf_modification('/tmp/pgintro.pdf');\n```\n\n```tsql\n     pdf_modification     \n--------------------------\n Wed Jul 20 11:13:37 2011\n(1 row)\n```\n\n```tsql\nSELECT pdf_creator('/tmp/pgintro.pdf');\n```\n\n```tsql\n            pdf_creator             \n------------------------------------\n Microsoft® Office PowerPoint® 2007\n(1 row)\n```\n\n```tsql\nSELECT pdf_metadata('/tmp/pgintro.pdf');\n```\n\n```tsql\n pdf_metadata \n--------------\n \n(1 row)\n```\n\n```tsql\nSELECT pdf_version('/tmp/pgintro.pdf');\n```\n\n```tsql\n pdf_version \n-------------\n PDF-1.5\n(1 row)\n```\n\n## Installation\n\nInstall [poppler](https://poppler.freedesktop.org) dependencies\n\n**Linux**\n```\nsudo apt install -y libpoppler-glib-dev pkg-config\n```\n\n**Homebrew/MacOS**\n\n```\nbrew install poppler pkgconf\n```\n\n```\ncd /tmp\ngit clone https://github.com/Florents-Tselai/pgpdf.git\ncd pgpdf\nmake\nmake install # may need sudo\n```\n\nAfter the installation, in a session:\n\n```tsql\nCREATE EXTENSION pgpdf;\n```\n\n### Docker\n\nGet the [Docker image](https://hub.docker.com/r/florents/pgpdf) with:\n\n```sh\ndocker pull florents/pgpdf:pg17\n```\n\nThis adds pgpdf to the [Postgres image](https://hub.docker.com/_/postgres) (replace `17` 
with your Postgres server version, and run it the same way).\n\nRun the image in a container.\n\n```sh\ndocker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17\n```\n\nThrough another terminal, connect to the running server (container).\n\n```sh\nPGPASSWORD=pass psql -h localhost -p 5432 -U postgres\n```\n\n\u003e [!WARNING]\n\u003e Reading arbitrary binary data (PDF) into your database can pose security risks.\n\u003e Only use this for files you trust.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflorents-tselai%2Fpgpdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflorents-tselai%2Fpgpdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflorents-tselai%2Fpgpdf/lists"}