https://github.com/neuml/txtmarker
Highlight text in documents
- Host: GitHub
- URL: https://github.com/neuml/txtmarker
- Owner: neuml
- License: apache-2.0
- Created: 2020-12-02T20:59:45.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-09-23T13:09:35.000Z (over 2 years ago)
- Last Synced: 2024-11-17T07:41:40.288Z (over 1 year ago)
- Topics: highlight, pdf, python, search, text
- Language: Python
- Homepage:
- Size: 834 KB
- Stars: 73
- Watchers: 4
- Forks: 11
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Highlight text in documents
-------------------------------------------------------------------------------------------------------------------------------------------------------

txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.
Current file formats supported:
- pdf
## Installation
The easiest way to install txtmarker is via pip and PyPI.
```
pip install txtmarker
```
Python 3.10+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.
txtmarker can also be installed directly from GitHub to access the latest, unreleased features.
```
pip install git+https://github.com/neuml/txtmarker
```
## Examples
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
### Notebooks
| Notebook | Description | |
|:----------|:-------------|------:|
| [Introducing txtmarker](https://github.com/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) | Overview of the functionality provided by txtmarker | [Open in Colab](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) |
| [Highlighting with Transformers](https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) | AI-driven highlighting with Transformers | [Open in Colab](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) |
## Configuration
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
### Create a new highlighter
Creates a new highlighter instance.
```python
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
```
#### extension
```yaml
extension: string
```
Type of highlighter to create. Currently `pdf` is the only supported value.
#### Optional constructor arguments:
#### formatter
```yaml
formatter: callable
```
Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
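As a rough illustration, a formatter might strip symbols and normalize whitespace. The sketch below assumes the callable receives a single string and returns the cleaned string, and that `Factory.create` accepts `formatter` as a keyword argument; both are assumptions, and the names `clean` and `create_highlighter` are hypothetical.

```python
import re

def clean(text):
    """Hypothetical formatter: assumes a str -> str callable."""
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

def create_highlighter():
    # Imported lazily so clean() can be tried without txtmarker installed
    from txtmarker.factory import Factory
    # Assumption: the optional constructor arguments are accepted by keyword
    return Factory.create("pdf", formatter=clean)
```

Because the formatter is applied to queries as well as to the input text, a formatter like this one would also strip metacharacters from regular expression queries, so it is best suited to plain-text matching.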
#### chunks
```yaml
chunks: int
```
Splits queries into multiple chunks. This is designed for very long text matches.
### Page text
Extracts page text from `infile` and returns it as a generator. This enables analysis of the text exactly as it will appear to the highlighter.
```python
highlighter.pages("input.pdf")
```
#### infile
```yaml
infile: string
```
Full path to input file
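One possible use of the page generator is checking which pages contain a phrase before highlighting. The sketch below assumes each item yielded by `pages` is the text of one page as a string; the helper name `pages_containing` is hypothetical.

```python
def pages_containing(pages, phrase):
    """Return 1-based page numbers whose text contains phrase (case-insensitive)."""
    phrase = phrase.lower()
    # Assumption: each item in pages is the text of a single page as a str
    return [number for number, text in enumerate(pages, 1) if phrase in text.lower()]

# With txtmarker installed, this could be called as:
#   pages_containing(highlighter.pages("input.pdf"), "text to highlight")
```

The helper accepts any iterable of strings, so it can be exercised with a plain list before pointing it at a real document.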
### Highlight text
Highlights `infile` using the provided annotations. The annotated file is written to `outfile`.
```python
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
```
#### infile
```yaml
infile: string
```
Full path to input file
#### outfile
```yaml
outfile: string
```
Full path to output file, i.e. the highlighted file
#### highlights
```yaml
highlights: list of (string, string|regex)
```
List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. To match a string literally, escape any regular expression metacharacters first (e.g. by calling `re.escape`).
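A minimal sketch of building a `highlights` list that mixes literal strings and regular expressions follows. The helper name `literal` and the sample values are hypothetical; only `re.escape` comes from the description above.

```python
import re

def literal(name, text):
    """Build a highlight pair that matches text exactly, not as a pattern."""
    return (name, re.escape(text))

highlights = [
    # Literal string: "$", "." and parentheses are escaped
    literal("price", "$5.00 (USD)"),
    # Regular expression: passed through unchanged
    ("year", r"\b(19|20)\d{2}\b"),
    # Name may be None; this text has no metacharacters to escape
    (None, "introduction"),
]
```

The resulting list can then be passed as the third argument to `highlighter.highlight`.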