https://github.com/impresso/impresso-pipelines

Reusable NLP pipelines: identify language, assess OCR quality, model topics, and extract news‑agency entities from any text.

Host: GitHub
URL: https://github.com/impresso/impresso-pipelines
Owner: impresso
Created: 2025-03-07T11:59:59.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-12-02T09:41:11.000Z (6 months ago)
Last Synced: 2025-12-05T03:17:13.872Z (6 months ago)
Language: Python
Homepage:
Size: 4.36 MB
Stars: 2
Watchers: 6
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md

README

# Python Package: `impresso-pipelines`

[![PyPI](https://img.shields.io/pypi/v/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)

[![Python versions](https://img.shields.io/pypi/pyversions/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)

[![Weekly Downloads](https://img.shields.io/pypi/dm/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)

[![Contributors](https://img.shields.io/github/contributors/impresso/impresso-pipelines)](https://github.com/impresso/impresso-pipelines/graphs/contributors)

[![QA Workflow](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml/badge.svg)](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml)

## Overview

This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:

- **Language Identification Pipeline**: Identifies the language of input text and returns a probability score.

- **OCR QA Pipeline**: Assesses the quality of OCR text by estimating the proportion of recognized vocabulary items (0–1), using efficient language-specific Bloom filters.

- **LDA Topic Modeling Pipeline**: Soft clustering of input texts using LDA-based topic modeling.

- **News Agencies Pipeline**: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.

- **Advertisement Classifier**: Identifies advertisements in historical newspaper content using a fine-tuned XLM-RoBERTa model with rule-based features.

- **Lucene/Solr Normalization Pipeline**: Replicates Solr's language-specific text normalization to show how input text is tokenized and indexed in Impresso.
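To make the idea behind the OCR QA score concrete, here is a minimal, self-contained sketch of a Bloom-filter vocabulary lookup. It is not the package's implementation: the `BloomFilter` class, the `ocr_quality` function, the filter parameters, and the five-word vocabulary are all hypothetical and chosen only for illustration.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter: each item sets k bit positions in a fixed-size bit array."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ocr_quality(text: str, vocabulary: BloomFilter) -> float:
    """Return the share of tokens found in the vocabulary filter (0.0 to 1.0)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Hypothetical mini-vocabulary; a real filter would hold a full lexicon.
vocab = BloomFilter()
for word in ["the", "quick", "brown", "fox", "jumps"]:
    vocab.add(word)

clean_score = ocr_quality("The quick brown fox jumps", vocab)
noisy_score = ocr_quality("Tlie qu1ck brovvn fox jurnps", vocab)
print(clean_score, noisy_score)
```

A Bloom filter trades a small false-positive rate for a compact memory footprint, which is why it suits lookups against large per-language vocabularies. The clean sentence scores higher than the OCR-damaged one because more of its tokens are found in the filter.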

## Installation

### Quick Install (with uv - recommended)

[uv](https://github.com/astral-sh/uv) is an extremely fast Python package installer (10-100x faster than pip):

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package with all dependencies
uv pip install "impresso-pipelines[all]"
```

### Standard Install (with pip)

To install the full package with all submodules:

```bash
pip install "impresso-pipelines[all]"
```

The `[all]` extra installs all dependencies required for each component.

### Install Individual Modules

To install individual modules without unnecessary dependencies, use:

```bash
pip install "impresso-pipelines[langident]"         # Language Identification
pip install "impresso-pipelines[ocrqa]"             # OCR QA
pip install "impresso-pipelines[ldatopics]"         # LDA Topics
pip install "impresso-pipelines[newsagencies]"      # News Agencies
pip install "impresso-pipelines[adclassifier]"      # Advertisement Classifier
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization
```

### Development Setup

For contributors, we support both **uv** (faster) and **Poetry**:

```bash
# Clone the repository
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Using uv (recommended - 3-6x faster)
uv sync --extra all --extra dev

# Option 2: Using Poetry
poetry install --all-extras --with dev

# Or use Make (auto-detects uv or Poetry)
make install-dev
```

See [UV_MIGRATION.md](UV_MIGRATION.md) for more details on using uv.

## Usage

Each pipeline is exposed as a class in its own subpackage. Import the ones you need and instantiate them:

```python
from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.adclassifier import AdClassifierPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
```
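With a partial install, some of these imports may fail. The following sketch checks which subpackages can be imported in the current environment. It uses only the standard library and the module paths listed above. The helper name `available_pipelines` is hypothetical, and it assumes that a missing extra surfaces as an `ImportError` at import time, which may not hold if a subpackage defers its heavy imports.

```python
import importlib

# Extra name -> module path, taken from the install and usage sections.
PIPELINE_MODULES = {
    "langident": "impresso_pipelines.langident",
    "ocrqa": "impresso_pipelines.ocrqa",
    "ldatopics": "impresso_pipelines.ldatopics",
    "newsagencies": "impresso_pipelines.newsagencies",
    "adclassifier": "impresso_pipelines.adclassifier",
    "solrnormalization": "impresso_pipelines.solrnormalization",
}


def available_pipelines() -> dict[str, bool]:
    """Map each extra name to True if its subpackage imports without ImportError."""
    status = {}
    for extra, module_path in PIPELINE_MODULES.items():
        try:
            importlib.import_module(module_path)
            status[extra] = True
        except ImportError:
            # Covers ModuleNotFoundError, e.g. the package or a dependency is absent.
            status[extra] = False
    return status


status = available_pipelines()
for extra, ok in status.items():
    hint = "ok" if ok else f'missing; try: pip install "impresso-pipelines[{extra}]"'
    print(f"{extra:18} {hint}")
```

Running this before a long job gives an early, readable report instead of a stack trace partway through processing.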

## Pipeline Examples

For usage examples, refer to the individual README files:

- [Langident Pipeline](README_langident.md)

- [OCR QA Pipeline](README_ocrqa.md)

- [LDA Topics Pipeline](README_ldatopics.md)

- [News Agencies Pipeline](README_newsagencies.md)

- [Advertisement Classifier](README_adclassifier.md)

- [Solr normalization Pipeline](README_solrnormalization.md)

See also the interactive notebooks for further examples:

- [langident_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb)

- [ocrqa_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ocrqa_pipeline_demo.ipynb)

- [ldatopics_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ldatopics_pipeline_demo.ipynb)

- [newsagencies_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/newsagencies_pipeline_demo.ipynb)

- [solrnormalization_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/solrnormalization_pipeline_demo.ipynb)

## Future Plans

Additional functionality will be added to extend use cases and support further processing tasks.

## Local Development

For contributors and developers who want to test locally before pushing to GitHub:

### Quick Start

```bash
# Clone and install
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Full development install (recommended; auto-detects uv or Poetry)
make install-dev

# Option 2: Pip editable mode (faster for testing changes)
make install-editable-dev

# Run tests
make test

# Run all QA checks (mimics CI)
make qa
```

### Available Commands

```bash
make help             # Show all available commands
make install          # Install package with all extras
make install-dev      # Install with dev dependencies
make test             # Run tests (skipping JVM tests)
make test-all         # Run all tests including JVM tests
make test-ocrqa       # Run only OCRQA tests
make test-cov         # Run tests with coverage report
make lint             # Run linting checks
make format           # Format code with black
make type-check       # Run type checking
make qa               # Run all QA checks
make clean            # Remove build artifacts
```

For detailed development instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).

## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.
