https://github.com/impresso/impresso-pipelines
Reusable NLP pipelines: identify language, assess OCR quality, model topics, and extract news‑agency entities from any text.
- Host: GitHub
- URL: https://github.com/impresso/impresso-pipelines
- Owner: impresso
- Created: 2025-03-07T11:59:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-02T09:41:11.000Z (6 months ago)
- Last Synced: 2025-12-05T03:17:13.872Z (6 months ago)
- Language: Python
- Homepage:
- Size: 4.36 MB
- Stars: 2
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
# Python Package: `impresso-pipelines`
## Overview
This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:
- **Language Identification Pipeline**: Identifies the language of input text and returns a probability score.
- **OCR QA Pipeline**: Assesses the quality of OCR text by estimating the proportion of recognized vocabulary items (0–1), using efficient language-specific Bloom filters.
- **LDA Topic Modeling Pipeline**: Assigns input texts to topics by soft clustering with LDA-based topic modeling.
- **News Agencies Pipeline**: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.
- **Advertisement Classifier**: Identifies advertisements in historical newspaper content using a fine-tuned XLM-RoBERTa model with rule-based features.
- **Lucene/Solr Normalization Pipeline**: Replicates Solr's language-specific text normalization to show how input text is tokenized and indexed in Impresso.
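The OCR QA entry above describes a score between 0 and 1 based on vocabulary lookups in Bloom filters. The idea can be sketched with the standard library alone. This is an illustrative sketch, not the package's implementation: the `BloomFilter` class, the `ocr_quality` function, the tokenization rule, and the toy vocabulary are all assumptions made for the example.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter: each item sets k bit positions in an m-bit array."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by double hashing two slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ocr_quality(text: str, vocab: BloomFilter) -> float:
    """Return the share of tokens found in the vocabulary filter (0.0 to 1.0)."""
    # Hypothetical tokenization: lowercase runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


# Hypothetical toy vocabulary; a real filter would hold a full lexicon per language.
vocab = BloomFilter()
for word in ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]:
    vocab.add(word)

clean_score = ocr_quality("The quick brown fox jumps over the lazy dog", vocab)
noisy_score = ocr_quality("Tbe qu1ck brovvn fox jumps ovcr the lazy d0g", vocab)
print(clean_score, noisy_score)
```

Because a Bloom filter never reports a stored word as missing, the clean sentence scores 1.0, while the sentence with simulated OCR errors scores lower. The trade-off is a small false-positive rate in exchange for a compact, fast lookup structure.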
## Installation
### Quick Install (with uv - recommended)
[uv](https://github.com/astral-sh/uv) is a fast Python package installer; its maintainers report it to be 10-100x faster than pip:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install the package with all dependencies
uv pip install "impresso-pipelines[all]"
```
### Standard Install (with pip)
To install the full package with all submodules:
```bash
pip install "impresso-pipelines[all]"
```
The `[all]` extra installs all dependencies required for each component.
### Install Individual Modules
To install individual modules without unnecessary dependencies, use:
```bash
pip install "impresso-pipelines[langident]" # Language Identification
pip install "impresso-pipelines[ocrqa]" # OCR QA
pip install "impresso-pipelines[ldatopics]" # LDA Topics
pip install "impresso-pipelines[newsagencies]" # News Agencies
pip install "impresso-pipelines[adclassifier]" # Advertisement Classifier
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization
```
### Development Setup
For contributors, we support both **uv** (faster) and **Poetry**:
```bash
# Clone the repository
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines
# Option 1: Using uv (recommended - 3-6x faster)
uv sync --extra all --extra dev
# Option 2: Using Poetry
poetry install --all-extras --with dev
# Or use Make (auto-detects uv or Poetry)
make install-dev
```
See [UV_MIGRATION.md](UV_MIGRATION.md) for more details on using uv.
## Usage
Each pipeline is provided as a class in its own subpackage. Import the class you need, then instantiate it:
```python
from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.adclassifier import AdClassifierPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
```
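Since each subpackage depends on its own extra, it can help to check which pipelines are importable in the current environment before using them. The sketch below uses only the module paths and extra names listed in this README; the `available_pipelines` helper is hypothetical, and it assumes that a missing package or extra surfaces as an `ImportError` at import time, which may not hold for every failure mode.

```python
import importlib

# Extra name -> module path, as listed in the install and import examples above.
SUBPACKAGES = {
    "langident": "impresso_pipelines.langident",
    "ocrqa": "impresso_pipelines.ocrqa",
    "ldatopics": "impresso_pipelines.ldatopics",
    "newsagencies": "impresso_pipelines.newsagencies",
    "adclassifier": "impresso_pipelines.adclassifier",
    "solrnormalization": "impresso_pipelines.solrnormalization",
}


def available_pipelines() -> dict[str, bool]:
    """Try to import each subpackage and record whether the import succeeded."""
    status = {}
    for extra, module_name in SUBPACKAGES.items():
        try:
            importlib.import_module(module_name)
        except ImportError:
            # Assumption: a missing package or extra raises ImportError here.
            status[extra] = False
        else:
            status[extra] = True
    return status


status = available_pipelines()
for extra, ok in status.items():
    hint = "ok" if ok else f'missing, try: pip install "impresso-pipelines[{extra}]"'
    print(f"{extra:<18} {hint}")
```

Importing the modules, rather than only locating them, is a deliberate choice: a subpackage can be present on disk while the third-party dependencies from its extra are not installed.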
## Pipeline Examples
For usage examples, refer to the individual README files:
- [Langident Pipeline](README_langident.md)
- [OCR QA Pipeline](README_ocrqa.md)
- [LDA Topics Pipeline](README_ldatopics.md)
- [News Agencies Pipeline](README_newsagencies.md)
- [Advertisement Classifier](README_adclassifier.md)
- [Solr normalization Pipeline](README_solrnormalization.md)
See also the interactive notebooks for further examples:
- [langident_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb)
- [ocrqa_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ocrqa_pipeline_demo.ipynb)
- [ldatopics_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ldatopics_pipeline_demo.ipynb)
- [newsagencies_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/newsagencies_pipeline_demo.ipynb)
- [solrnormalization_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/solrnormalization_pipeline_demo.ipynb)
## Future Plans
Additional functionality will be added to extend use cases and support further processing tasks.
## Local Development
For contributors and developers who want to test locally before pushing to GitHub:
### Quick Start
```bash
# Clone and install
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines
# Option 1: uv or Poetry via Make (auto-detected; recommended for full development)
make install-dev
# Option 2: Pip editable mode (faster for testing changes)
make install-editable-dev
# Run tests
make test
# Run all QA checks (mimics CI)
make qa
```
### Available Commands
```bash
make help # Show all available commands
make install # Install package with all extras
make install-dev # Install with dev dependencies
make test # Run tests (skipping JVM tests)
make test-all # Run all tests including JVM tests
make test-ocrqa # Run only OCRQA tests
make test-cov # Run tests with coverage report
make lint # Run linting checks
make format # Format code with black
make type-check # Run type checking
make qa # Run all QA checks
make clean # Remove build artifacts
```
For detailed development instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).
## About Impresso
### Impresso project
[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
### Copyright
Copyright (C) 2025 The Impresso team.
### License
This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.
---