# poster2json


Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.






---

## Description

**poster2json** extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema).

The pipeline uses:

- [**Llama-3.1-8B-Poster-Extraction**](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) for JSON structuring
- **Qwen2-VL-7B** for vision-based OCR of image posters
- **pdfalto** for layout-aware PDF text extraction

## Quick Start

### Installation

```bash
pip install poster2json
```

### CLI Usage

```bash
# Extract metadata from a poster
poster2json extract poster.pdf -o result.json

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/
```

### Python API

```python
from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)
```
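The same two calls can be applied to a whole folder. The helper below is a minimal sketch, not part of the package: `extract_folder` and `POSTER_SUFFIXES` are hypothetical names, the list of file extensions is an assumption, and the extraction and validation functions are passed in as arguments so the helper can be tried without a GPU. It assumes `extract_poster` accepts a path string and returns a JSON-serializable dict, and that `validate_poster` returns a truthy value for valid records.

```python
import json
from pathlib import Path

# Assumption: file extensions treated as posters; adjust to your data.
POSTER_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def extract_folder(input_dir, output_dir, extract, validate):
    """Run `extract` on every poster in input_dir and write one JSON file each.

    `extract` and `validate` are injected so stand-ins can be used for testing;
    in real use they would be extract_poster and validate_poster.
    Returns a dict mapping each input file name to its validation outcome.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outcomes = {}
    for path in sorted(input_dir.iterdir()):
        if path.suffix.lower() not in POSTER_SUFFIXES:
            continue  # skip files that are not posters
        result = extract(str(path))
        out_path = output_dir / f"{path.stem}.json"
        out_path.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        outcomes[path.name] = bool(validate(result))
    return outcomes
```

With the package installed, this would be called as `extract_folder("./posters", "./output", extract_poster, validate_poster)`.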

## Output Format

Output conforms to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema) (DataCite-based):

```json
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
```
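As a rough illustration of how a record in this shape can be consumed, the sketch below pulls the title, author names, and section titles out of a dict. `summarize_poster` is a hypothetical helper, not part of the package, and the record is a trimmed copy of the example above rather than real output.

```python
def summarize_poster(record):
    """Collect the title, author names, and section titles from a poster record."""
    sections = record.get("posterContent", {}).get("sections", [])
    return {
        "title": record["titles"][0]["title"],
        "authors": [creator["name"] for creator in record.get("creators", [])],
        "sections": [section["sectionTitle"] for section in sections],
    }


# Trimmed copy of the example record shown above.
example = {
    "creators": [{"name": "Garcia, Sofia", "affiliation": ["University"]}],
    "titles": [
        {"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}
    ],
    "posterContent": {
        "sections": [
            {"sectionTitle": "Abstract", "sectionContent": "..."},
            {"sectionTitle": "Methods", "sectionContent": "..."},
            {"sectionTitle": "Results", "sectionContent": "..."},
        ]
    },
}

summary = summarize_poster(example)
print(summary["sections"])  # prints ['Abstract', 'Methods', 'Results']
```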

## System Requirements

| Requirement | Specification |
| ----------- | -------------------------------- |
| GPU | NVIDIA CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
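A quick pre-flight check against this table might look like the sketch below. `python_ok` and `describe_gpu` are hypothetical helpers; the GPU probe assumes PyTorch is the installed GPU backend and simply returns `None` when it is missing or no CUDA device is visible. It reports memory without judging it, since drivers often report slightly less than the nominal card size.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version from the requirements table


def python_ok(version=None):
    """Return True if the given (or running) Python version is at least 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def describe_gpu():
    """Return (name, total memory in GB) for the first CUDA device, or None.

    Assumption: PyTorch is available; returns None if it is not installed
    or if no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    return props.name, props.total_memory / 1e9


print("Python OK:", python_ok())
```

Calling `describe_gpu()` on the target machine then shows whether the reported memory is in the neighborhood of the 16GB requirement.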

## Performance

Validated on 10 manually annotated scientific posters:

| Metric | Score | Threshold |
| ---------------- | ----- | --------- |
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |

**Pass Rate**: 10/10 (100%)
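The thresholds in the table can be expressed as a small check. This is a sketch under stated assumptions: the metric key names are hypothetical, and it assumes a result passes only when all four metrics fall within their thresholds, which the table suggests but does not state. How the metrics themselves are computed is not covered here.

```python
# Thresholds copied from the table: (lower bound, upper bound or None).
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def failed_metrics(scores):
    """Return the names of metrics whose score falls outside its threshold."""
    failed = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failed.append(name)
    return failed


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(failed_metrics(reported))  # prints []
```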

## Documentation

| Document | Description |
| ------------------------------------ | ------------------------------- |
| [Architecture](docs/architecture.md) | Technical details & methodology |
| [Evaluation](docs/evaluation.md) | Validation metrics & results |

## Development Setup

```bash
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format
```

If you are on Windows and have multiple Python versions installed, you can use the `py` launcher to list them and create the virtual environment with a specific one:

```bash
py -0p # list all python versions

py -3.12 -m venv .venv
```

## License

MIT License - see [LICENSE](LICENSE.md) for details.

## Citation

```bibtex
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}
```

## Acknowledgements

- [FAIR Data Innovations Hub](https://fairdataihub.org/)
- Meta AI for Llama 3.1
- Alibaba Cloud for Qwen2-VL
- Part of the [posters.science](https://posters.science) platform

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.