https://github.com/fairdataihub/poster2json
- Host: GitHub
- URL: https://github.com/fairdataihub/poster2json
- Owner: fairdataihub
- License: mit
- Created: 2026-02-04T20:10:05.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-24T04:52:53.000Z (4 months ago)
- Last Synced: 2026-02-24T10:46:26.909Z (4 months ago)
- Language: Python
- Size: 2.89 MB
- Stars: 7
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md

poster2json
Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.
Documentation · Changelog · Report Bug · Request Feature
---
## Description
**poster2json** extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema).
The pipeline uses:
- [**Llama-3.1-8B-Poster-Extraction**](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) for JSON structuring
- **Qwen2-VL-7B** for vision-based OCR of image posters
- **pdfalto** for layout-aware PDF text extraction
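The component list above implies a two-stage flow: a text-extraction stage chosen by input type, followed by JSON structuring. The sketch below only illustrates that routing and does not reflect the package's internals. The function names and the extension sets are hypothetical, and the stage order is inferred from the descriptions above.

```python
from pathlib import Path

# Assumed extension sets; the formats poster2json actually accepts may differ.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_text_extractor(poster_path: str) -> str:
    """Return the name of the text-extraction stage for a poster file."""
    suffix = Path(poster_path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdfalto"  # layout-aware PDF text extraction
    if suffix in IMAGE_EXTENSIONS:
        return "Qwen2-VL-7B"  # vision-based OCR for image posters
    raise ValueError(f"Unsupported poster format: {poster_path}")


def plan_pipeline(poster_path: str) -> list[str]:
    """List the stages in order: text extraction, then JSON structuring."""
    return [choose_text_extractor(poster_path), "Llama-3.1-8B-Poster-Extraction"]
```

For example, `plan_pipeline("poster.pdf")` names pdfalto followed by the Llama model, while an image path selects Qwen2-VL-7B for the first stage.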
## Quick Start
### Installation
```bash
pip install poster2json
```
### CLI Usage
```bash
# Extract metadata from a poster
poster2json extract poster.pdf -o result.json
# Validate extracted JSON
poster2json validate result.json
# Process multiple posters
poster2json batch ./posters/ -o ./output/
```
### Python API
```python
from poster2json import extract_poster, validate_poster
# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])
# Validate the result
is_valid = validate_poster(result)
```
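To process a folder from Python rather than through the `batch` command, a small wrapper along the following lines may help. It is a sketch under assumptions: `extract_directory` and `POSTER_SUFFIXES` are hypothetical names, the suffix set is a guess, and the extractor is passed in as a callable that is assumed to return a JSON-serializable dict, as the indexing in the example above suggests.

```python
import json
from pathlib import Path
from typing import Callable

# Assumed set of poster file types; adjust to what your installation supports.
POSTER_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def extract_directory(
    input_dir: str,
    output_dir: str,
    extract: Callable[[str], dict],
) -> list[Path]:
    """Run `extract` on every poster in input_dir and write one JSON file each."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for poster in sorted(Path(input_dir).iterdir()):
        if not poster.is_file() or poster.suffix.lower() not in POSTER_SUFFIXES:
            continue  # skip subdirectories and non-poster files
        result = extract(str(poster))
        target = out / f"{poster.stem}.json"
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(target)
    return written
```

Passing `extract_poster` as the third argument, as in `extract_directory("./posters", "./output", extract_poster)`, would write one `.json` file per poster. Keeping the extractor as a parameter also lets the wrapper be tested with a stub, without loading any models.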
## Output Format
Output conforms to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema) (DataCite-based):
```json
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
```
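As an illustration of consuming this structure, the sketch below pulls the title, creator names, section headings, and captions out of a record shaped like the example above. `summarize_poster` is a hypothetical helper, not part of the package. Joining each `captions` list into one string is an assumption about how the label and text are meant to be combined.

```python
# Abbreviated copy of the example record shown above.
SAMPLE = {
    "creators": [
        {"name": "Garcia, Sofia", "givenName": "Sofia", "familyName": "Garcia"}
    ],
    "titles": [
        {"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}
    ],
    "posterContent": {
        "sections": [
            {"sectionTitle": "Abstract", "sectionContent": "..."},
            {"sectionTitle": "Methods", "sectionContent": "..."},
        ]
    },
    "imageCaptions": [{"captions": ["Figure 1.", "ROC curves showing..."]}],
}


def summarize_poster(record: dict) -> dict:
    """Collect the most commonly needed fields from a poster record."""
    titles = [t["title"] for t in record.get("titles", [])]
    sections = record.get("posterContent", {}).get("sections", [])
    return {
        "title": titles[0] if titles else None,
        "creators": [c["name"] for c in record.get("creators", [])],
        "section_titles": [s["sectionTitle"] for s in sections],
        # Assumption: each captions list holds a label followed by its text.
        "figure_captions": [
            " ".join(item["captions"]) for item in record.get("imageCaptions", [])
        ],
    }


summary = summarize_poster(SAMPLE)
```

Using `.get` with empty defaults keeps the helper usable on records where optional arrays such as `imageCaptions` are absent.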
## System Requirements
| Requirement | Specification |
| ----------- | -------------------------------- |
| GPU | NVIDIA CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
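A quick pre-flight check against the table above might look like the following sketch. The helper names are hypothetical. The GPU query relies on the `nvidia-smi` command-line tool being on the `PATH`, and the values it reports are in MiB and can sit slightly below a card's nominal capacity, so the sketch reports memory rather than enforcing the 16 GB figure.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)  # from the requirements table


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def parse_vram_gib(nvidia_smi_output: str) -> list[float]:
    """Convert memory.total lines (MiB, one per GPU) to GiB."""
    return [
        int(line) / 1024 for line in nvidia_smi_output.splitlines() if line.strip()
    ]


def gpu_vram_gib() -> list[float]:
    """Return total memory per visible NVIDIA GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return []
    return parse_vram_gib(proc.stdout)
```

Calling `python_ok()` and `gpu_vram_gib()` before installation gives a rough indication of whether the machine is in range. It does not check system RAM.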
## Performance
Validated on 10 manually annotated scientific posters:
| Metric | Score | Threshold |
| ---------------- | ----- | --------- |
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
**Pass Rate**: 10/10 (100%)
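The thresholds in the table can be expressed as a simple check. This is a sketch that assumes a poster passes only when all four metrics fall within their thresholds. The README does not state the exact pass rule, and the metric key names below are hypothetical.

```python
# Thresholds copied from the table above as (minimum, maximum).
# A maximum of None means the metric has no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall outside their thresholds."""
    failures = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failures.append(name)
    return failures


def poster_passes(scores: dict[str, float]) -> bool:
    """Assumed rule: a poster passes when no metric fails."""
    return not failed_metrics(scores)


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
```

Field Proportion is the only metric with an upper bound, so a value above 2.00 fails just as a value below 0.50 does.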
## Documentation
| Document | Description |
| ------------------------------------ | ------------------------------- |
| [Architecture](docs/architecture.md) | Technical details & methodology |
| [Evaluation](docs/evaluation.md) | Validation metrics & results |
## Development Setup
```bash
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment (Linux/macOS)
source .venv/bin/activate
# On Windows, run this instead:
# .venv\Scripts\activate
# Install poetry
pip install poetry
# Install dependencies
poetry install
# Run tests
poe test
# Format code
poe format
```
If you are on Windows and have multiple Python versions installed, you can list them and pick one for the virtual environment:
```bash
py -0p # list all python versions
py -3.12 -m venv .venv
```
## License
MIT License - see [LICENSE](LICENSE.md) for details.
## Citation
```bibtex
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}
```
## Acknowledgements
- [FAIR Data Innovations Hub](https://fairdataihub.org/)
- Meta AI for Llama 3.1
- Alibaba Cloud for Qwen2-VL
- Part of the [posters.science](https://posters.science) platform
## Contributing
Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.