https://github.com/neuml/txtmarker

Highlight text in documents

Host: GitHub
URL: https://github.com/neuml/txtmarker
Owner: neuml
License: apache-2.0
Created: 2020-12-02T20:59:45.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2023-09-23T13:09:35.000Z (over 1 year ago)
Last Synced: 2024-11-17T07:41:40.288Z (3 months ago)
Topics: highlight, pdf, python, search, text
Language: Python
Homepage:
Size: 834 KB
Stars: 73
Watchers: 4
Forks: 11
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

README

Highlight text in documents

-------------------------------------------------------------------------------------------------------------------------------------------------------

![demo](https://raw.githubusercontent.com/neuml/txtmarker/master/demo.png)

txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document, and creates a modified version with the highlights embedded.

Current file formats supported:

- pdf

## Installation
The easiest way to install is via pip and PyPI:

    pip install txtmarker

You can also install txtmarker directly from GitHub. Using a Python virtual environment is recommended.

    pip install git+https://github.com/neuml/txtmarker

Python 3.8+ is supported.

## Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

### Notebooks

| Notebook | Description | |
|:----------|:-------------|------:|
| [Introducing txtmarker](https://github.com/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) | Overview of the functionality provided by txtmarker | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) |
| [Highlighting with Transformers](https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) | AI-driven highlighting with Transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) |

## Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

### Create a new highlighter

```python
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
```

#### extension
```yaml
extension: string
```

Type of highlighter to create. Currently only `pdf` is supported.

#### Optional constructor arguments:

#### formatter
```yaml
formatter: callable
```

Callable used to format both queries and input text. This helps clean up files that contain many symbols or other noisy content.
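
As a rough illustration, a formatter could be a plain function that takes a string and returns a cleaned-up string. The sketch below assumes that single-argument signature and that `Factory.create` accepts the callable through a `formatter` keyword. Neither detail is confirmed in this section, and `clean_text` is a hypothetical name.

```python
import re


def clean_text(text):
    """Hypothetical formatter: replace symbols with spaces, collapse whitespace, lowercase."""
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)

    # Collapse runs of whitespace so queries and document text line up
    return re.sub(r"\s+", " ", text).strip().lower()


print(clean_text("Hello,   World!"))

# Assumed usage (keyword name is an assumption, see the notebooks to confirm):
# from txtmarker.factory import Factory
# highlighter = Factory.create("pdf", formatter=clean_text)
```

Because the same callable is described as applying to both queries and input text, a normalizing function like this keeps the two sides comparable.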

#### chunks
```yaml
chunks: int
```

Splits each query into multiple chunks. This is intended for matching very long passages of text.
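
To make the idea of chunking concrete, here is a minimal sketch of splitting a long query into word-based pieces. It is not txtmarker's implementation: `split_query` is a hypothetical helper, and it assumes the integer means "number of chunks", which this section does not state.

```python
def split_query(query, chunks):
    """Hypothetical helper: split a query into at most `chunks` word-based pieces."""
    words = query.split()

    # Nothing to split for empty queries or when a single chunk is requested
    if chunks <= 1 or not words:
        return [query]

    # Ceiling division gives the number of words per chunk
    size = -(-len(words) // chunks)

    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


print(split_query("a b c d e f g", 3))
```

Shorter pieces are more likely to match when a long passage is broken up by line wraps or page boundaries in the source document.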

### Highlight text

```python
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
```

#### infile
```yaml
infile: string
```

Full path to input file

#### outfile
```yaml
outfile: string
```

Full path to output file, i.e. the highlighted file
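
Since both arguments are described as full paths, one way to prepare them is with `pathlib`. The helper below is a hypothetical convenience, not part of txtmarker. It derives an output path next to the input file, and the `-highlighted` suffix is an arbitrary choice.

```python
from pathlib import Path


def resolve_paths(infile, suffix="-highlighted"):
    """Hypothetical helper: return absolute (infile, outfile) paths as strings."""
    source = Path(infile).expanduser().resolve()

    # Place the output next to the input, e.g. input.pdf -> input-highlighted.pdf
    target = source.with_name(f"{source.stem}{suffix}{source.suffix}")

    return str(source), str(target)


infile, outfile = resolve_paths("input.pdf")
print(infile, outfile)

# The resolved paths can then be passed to the call shown above:
# highlighter.highlight(infile, outfile, [("name", "text to highlight")])
```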

#### highlights
```yaml
highlights: list of (string, string|regex)
```

List of highlight elements. Each pair has a name (which can be `None`) and a text value. The text can be either a string or a regular expression.
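
The shape of the `highlights` list can be sketched with the standard library alone. The example below assumes that "regular expression" means a compiled `re.Pattern`, which this section does not spell out. `preview` is a hypothetical helper that only shows which entries would find text in a sample string, and it does not reproduce txtmarker's matching logic.

```python
import re

# Each element is a (name, text) pair; the name may be None
highlights = [
    ("title", "Highlight text in documents"),
    (None, re.compile(r"Python 3\.\d+\+")),
]


def preview(text, highlights):
    """Hypothetical helper: report the text each highlight entry would find."""
    results = []
    for name, query in highlights:
        if isinstance(query, re.Pattern):
            # Regular expression entry: take the first match, if any
            match = query.search(text)
            found = match.group(0) if match else None
        else:
            # Plain string entry: exact substring lookup
            found = query if query in text else None

        results.append((name, found))

    return results


sample = "txtmarker: Highlight text in documents. Python 3.8+ is supported."
print(preview(sample, highlights))
```

A list built this way would then be passed as the third argument to `highlighter.highlight`.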