https://github.com/papercutsoftware/pdfsearch

A full text search library for PDFs.
https://github.com/papercutsoftware/pdfsearch

Last synced: 7 months ago
JSON representation

A full text search library for PDFs.

Host: GitHub
URL: https://github.com/papercutsoftware/pdfsearch
Owner: PaperCutSoftware
License: other
Created: 2019-06-03T00:49:20.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2020-09-29T05:29:48.000Z (over 5 years ago)
Last Synced: 2025-04-14T07:45:28.948Z (10 months ago)
Language: Go
Homepage:
Size: 252 KB
Stars: 67
Watchers: 33
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Pure Go Full Text Search of PDF Files

This library implements full text search for PDFs.

* The public APIs are in [index_search.go](index_search.go).

The are some command lines programs that demonstrate the library's functionality.

* [examples/pdf_search_demo.go](examples/pdf_search_demo.go) demonstrates the main APIs.

* [examples/index.go](examples/index.go) builds an index over a set of PDFs.

* [examples/search.go](examples/search.go) searches the index build by [examples/index.go](examples/index.go).

Binary versions (executables) of these three programs are available in

[releases](https://github.com/PaperCutSoftware/pdfsearch/releases/tag/v0.0.1).

There are 64-bit binaries for Windows, Mac and Linux. The binaries do not require a UniDoc license.

## Installation

    git clone https://github.com/PaperCutSoftware/pdfsearch

Replace `uniDocLicenseKey` and `companyName` in [unidoc_glue.go](internal/doclib/unidoc_glue.go)

with valid [UniDoc](https://unidoc.io/) license fields.

    cd pdfsearch/examples

    go build pdf_search_demo.go

    go build index.go

    go build search.go

### [examples/pdf_search_demo.go](examples/pdf_search_demo.go)

__Usage__: `./pdf_search_demo  -f  `

__Example__: `./pdf_search_demo  -f PDF32000_2008.pdf cubic Bézier curve`

The example will search `PDF32000_2008.pdf` for _cubic Bézier curve_.

`pdf_search_demo.go` shows how to use the APIs in [index_search.go](index_search.go) to

* create indexes over PDFs,

* search those indexes using full-text search, and

* mark up PDFs with the locations of the search matches on pages.

### [examples/index.go](examples/index.go)

__Usage__: `./index `

__Example__: `./index ~/climate/**/*.pdf`

The example creates an on-disk index over the PDFs in `~/climate/` and its subdirectories.

### [examples/search.go](examples/search.go)

__Usage__: `./search `

__Example__: `./search integrated assessment model`

The example searches the on-disk index created by [examples/index.go](examples/index.go)

for _integrated assessment model_.

## Libraries

[index_search.go](index_search.go) uses [UniDoc](https://unidoc.io/) for PDF parsing and [bleve](http://github.com/blevesearch/bleve) for search.

## Talks about this library

[GopherCon AU 2019](https://docs.google.com/presentation/d/14FDuKAPgWM2z4V1xag0HFEzL3IJfaS4a7Wt0ChxDG6s/edit?usp=sharing)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/papercutsoftware/pdfsearch

Awesome Lists containing this project

README