https://github.com/neuml/txtmarker
Highlight text in documents
- Host: GitHub
- URL: https://github.com/neuml/txtmarker
- Owner: neuml
- License: apache-2.0
- Created: 2020-12-02T20:59:45.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-09-23T13:09:35.000Z (over 1 year ago)
- Last Synced: 2024-11-17T07:41:40.288Z (3 months ago)
- Topics: highlight, pdf, python, search, text
- Language: Python
- Size: 834 KB
- Stars: 73
- Watchers: 4
- Forks: 11
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Highlight text in documents

![demo](https://raw.githubusercontent.com/neuml/txtmarker/master/demo.png)
txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document and creates a modified version with the highlights embedded.
Current file formats supported:

- PDF

## Installation
The easiest way to install is via pip and PyPI.

    pip install txtmarker

You can also install txtmarker directly from GitHub. Using a Python Virtual Environment is recommended.

    pip install git+https://github.com/neuml/txtmarker

Python 3.8+ is supported.

## Examples
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
### Notebooks
| Notebook | Description | |
|:----------|:-------------|------:|
| [Introducing txtmarker](https://github.com/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) | Overview of the functionality provided by txtmarker | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) |
| [Highlighting with Transformers](https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) | AI-driven highlighting with Transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) |

## Configuration
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
### Create a new highlighter
```python
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
```

#### extension
```yaml
extension: string
```

Type of highlighter to create (i.e. pdf).

#### Optional constructor arguments:
#### formatter
```yaml
formatter: callable
```

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.

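As an illustration, a formatter might replace symbols with spaces and collapse whitespace. The sketch below assumes a formatter is a callable that takes a string and returns the cleaned string. The name `clean` is hypothetical, and passing it as `formatter=` to `Factory.create` is an assumption based on the argument name, not a confirmed signature.

```python
import re

def clean(text):
    """Hypothetical formatter: replace symbols with spaces, then collapse whitespace."""
    # Replace anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse whitespace runs into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean("Results** (2020):   txtmarker"))  # prints "Results 2020 txtmarker"

# Assumed usage (keyword name taken from the argument described above):
# highlighter = Factory.create("pdf", formatter=clean)
```

Because the same callable is described as applying to both queries and input text, any normalization it performs is applied consistently on both sides of the match.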
#### chunks
```yaml
chunks: int
```

Splits queries into multiple chunks. This is designed for very long text matches.

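To make the idea of chunking concrete, the sketch below splits a long query into at most `chunks` word-aligned pieces. It only illustrates the concept: `split_query` is a hypothetical helper, and the way txtmarker actually splits and matches chunks may differ.

```python
def split_query(text, chunks):
    """Split text into at most `chunks` pieces of roughly equal word count."""
    words = text.split()
    if not words or chunks < 1:
        return []
    # Ceiling division so that no more than `chunks` pieces are produced
    size = -(-len(words) // chunks)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

parts = split_query("one two three four five", 2)
print(parts)  # prints ['one two three', 'four five']
```

Shorter pieces are more likely to match when a long passage is broken up by page layout, which is one plausible motivation for an option like this.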
### Highlight text
```python
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
```

#### infile
```yaml
infile: string
```

Full path to input file

#### outfile
```yaml
outfile: string
```

Full path to output file, i.e. the highlighted file

#### highlights
```yaml
highlights: list of (string, string|regex)
```

List of highlight elements. Each pair has a name (can be None) and text value. The text can either be a string or a regular expression.
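
A small sketch of what such a list can look like, with a check that each element is a (name, text) pair. It assumes regular expressions are supplied as compiled `re.Pattern` objects, which the description above does not confirm. `check_highlights` is a hypothetical helper, not part of txtmarker.

```python
import re

def check_highlights(highlights):
    """Validate that every element is a (name, text) pair as described above."""
    for pair in highlights:
        if len(pair) != 2:
            raise ValueError(f"expected a (name, text) pair, got {pair!r}")
        name, text = pair
        # The name is optional and may be None
        if name is not None and not isinstance(name, str):
            raise TypeError(f"name must be a string or None, got {name!r}")
        # Assumption: regular expressions are passed as compiled patterns
        if not isinstance(text, (str, re.Pattern)):
            raise TypeError(f"text must be a string or regex, got {text!r}")
    return highlights

highlights = check_highlights([
    ("Title", "Highlight text in documents"),   # plain string match
    (None, "text to highlight"),                # name may be None
    ("Version", re.compile(r"Python 3\.\d+")),  # regex (assumed compiled)
])

# The validated list is then passed as the third argument:
# highlighter.highlight("input.pdf", "output.pdf", highlights)
```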