https://github.com/florents-tselai/pgpdf
pdf type for Postgres
- Host: GitHub
- URL: https://github.com/florents-tselai/pgpdf
- Owner: Florents-Tselai
- License: gpl-2.0
- Created: 2024-09-22T21:10:10.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-10T10:06:12.000Z (5 days ago)
- Last Synced: 2024-11-10T10:18:39.942Z (5 days ago)
- Topics: documents, pdf, postgresql
- Language: C
- Homepage: https://tselai.com/pgpdf-pdf-type-postgres
- Size: 11.9 MB
- Stars: 31
- Watchers: 1
- Forks: 0
- Open Issues: 1
# pgPDF: `pdf` type for Postgres
[![build](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml/badge.svg)](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml)
This extension for PostgreSQL provides a `pdf` data type and assorted functions.
The actual PDF parsing is done by [poppler](https://poppler.freedesktop.org).
```tsql
SELECT '/tmp/pgintro.pdf'::pdf;
```

```tsql
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
```

Also check the blog posts:
- [Full Text Search on PDFs With Postgres](https://tselai.com/full-text-search-pdf-postgres)
- [pgpdf: pdf type for Postgres](https://tselai.com/pgpdf-pdf-type-postgres)

## Usage
Create a `pdf` value by casting either a `text` path or a `bytea` blob.

```sql
SELECT '/path/to.pdf'::pdf;
SELECT pg_read_binary_file('/path/to.pdf')::bytea::pdf;
```

The examples below use this sample file:
```sh
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
```

### Full-Text Search (FTS)
Because a `pdf` value can be cast to `text`, you can run full-text search (FTS) on it like on any other text.
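
When the same documents are searched repeatedly, casting inside every query means the `tsvector` is computed again each time. A common PostgreSQL pattern is to store a precomputed `tsvector` and index it with GIN. The sketch below assumes a hypothetical table `pdf_docs` with a `pdf` column; the table, column, and index names are illustrative and not part of pgPDF.

```sql
-- Hypothetical table; all names here are illustrative, not part of pgPDF.
CREATE TABLE pdf_docs (
    name text PRIMARY KEY,
    doc  pdf,
    tsv  tsvector  -- precomputed search vector
);

-- Load the sample file by path, using the text-to-pdf cast.
INSERT INTO pdf_docs (name, doc)
VALUES ('pgintro', '/tmp/pgintro.pdf'::pdf);

-- Convert the document to text once and store its tsvector.
UPDATE pdf_docs
SET tsv = to_tsvector('english', doc::text);

-- A GIN index lets @@ searches avoid scanning every row.
CREATE INDEX pdf_docs_tsv_idx ON pdf_docs USING gin (tsv);

-- Search the stored vectors instead of re-casting each document.
SELECT name
FROM pdf_docs
WHERE tsv @@ to_tsquery('english', 'postgres');
```

The one-off queries that follow apply the same `@@` operator directly to a file path.
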
```tsql
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
```

```tsql
?column?
----------
t
(1 row)
```

```tsql
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
```

```tsql
?column?
----------
f
(1 row)
```

### `bytea`
If the PDF file is not on the filesystem but its content is already stored in a `bytea` column,
you can cast the `bytea` value to `pdf`, like so:

```tsql
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::pdf;
```

### Content
```tsql
SELECT '/tmp/pgintro.pdf'::pdf;
```

```tsql
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
```

```tsql
SELECT pdf_title('/tmp/pgintro.pdf');
```

```tsql
pdf_title
-------------------------
PostgreSQL Introduction
(1 row)
```

Getting a subset of pages
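
One way to pull a contiguous range of pages in a single query is to combine `pdf_page` with PostgreSQL's built-in `generate_series`. The sketch below assumes the sample file downloaded earlier and picks page indexes 1 to 3 arbitrarily; the column aliases are illustrative.

```sql
-- Sketch: fetch page indexes 1..3 of the sample file, one row per page.
-- generate_series is built into PostgreSQL; the range 1..3 is an arbitrary choice.
SELECT i AS page_index,
       pdf_page('/tmp/pgintro.pdf'::pdf, i) AS page_text
FROM generate_series(1, 3) AS i
ORDER BY i;
```

The individual calls used here, `pdf_num_pages` and `pdf_page`, are shown next.
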
```tsql
SELECT pdf_num_pages('/tmp/pgintro.pdf');
```

```tsql
pdf_num_pages
---------------
24
(1 row)
```

```tsql
SELECT pdf_page('/tmp/pgintro.pdf', 1);
```

```tsql
pdf_page
------------------------------
Catalog +
PostgreSQL Origin +
Layout +
Features +
Enterprise Class Attribute+
Case
(1 row)
```

```tsql
SELECT pdf_subject('/tmp/pgintro.pdf');
```

```tsql
pdf_subject
-------------
(1 row)
```

### Metadata
The following functions are available:
- `pdf_title(pdf) → text`
- `pdf_author(pdf) → text`
- `pdf_num_pages(pdf) → integer`: total number of pages in the document
- `pdf_page(pdf, integer) → text`: the i-th page as text
- `pdf_creator(pdf) → text`
- `pdf_keywords(pdf) → text`
- `pdf_metadata(pdf) → text`
- `pdf_version(pdf) → text`
- `pdf_subject(pdf) → text`
- `pdf_creation(pdf) → timestamp`
- `pdf_modification(pdf) → timestamp`

```tsql
SELECT pdf_author('/tmp/pgintro.pdf');
```

```tsql
pdf_author
------------
周正中
(1 row)
```

```tsql
SELECT pdf_creation('/tmp/pgintro.pdf');
```

```tsql
pdf_creation
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
```

```tsql
SELECT pdf_modification('/tmp/pgintro.pdf');
```

```tsql
pdf_modification
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
```

```tsql
SELECT pdf_creator('/tmp/pgintro.pdf');
```

```tsql
pdf_creator
------------------------------------
Microsoft® Office PowerPoint® 2007
(1 row)
```

```tsql
SELECT pdf_metadata('/tmp/pgintro.pdf');
```

```tsql
pdf_metadata
--------------
(1 row)
```

```tsql
SELECT pdf_version('/tmp/pgintro.pdf');
```

```tsql
pdf_version
-------------
PDF-1.5
(1 row)
```

## Installation
```sh
sudo apt install -y libpoppler-glib-dev pkg-config
```
```sh
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install
```

```tsql
CREATE EXTENSION pgpdf;
```

> [!WARNING]
> Reading arbitrary binary data (PDF) into your database can pose security risks.
> Only use this for files you trust.
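
Once `CREATE EXTENSION` succeeds, one way to confirm the installation is to look the extension up in the standard `pg_extension` catalog and then call one of the functions listed above. The sketch below assumes the sample file from the Usage section has been downloaded to `/tmp/pgintro.pdf`.

```sql
-- pg_extension is a standard PostgreSQL catalog with one row per installed extension.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pgpdf';

-- Smoke test: parse the sample file and count its pages.
SELECT pdf_num_pages('/tmp/pgintro.pdf');
```
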