{"id":50536737,"url":"https://github.com/fairdataihub/poster-sentry","last_synced_at":"2026-06-03T17:01:03.826Z","repository":{"id":347726898,"uuid":"1195096060","full_name":"fairdataihub/poster-sentry","owner":"fairdataihub","description":"Lightweight multimodal scientific poster classifier — text + visual + structural features. Part of posters.science.","archived":false,"fork":false,"pushed_at":"2026-03-29T08:03:15.000Z","size":2058,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-29T10:21:06.079Z","etag":null,"topics":["document-classification","fair-data","multimodal","posters-science","quality-control","scientific-posters"],"latest_commit_sha":null,"homepage":"https://posters.science","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fairdataihub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-29T08:00:01.000Z","updated_at":"2026-03-29T08:03:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fairdataihub/poster-sentry","commit_stats":null,"previous_names":["fairdataihub/poster-sentry"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/fairdataihub/poster-sentry","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fposter-sentry","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fposter-sentry/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fposter-sentry/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fposter-sentry/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fairdataihub","download_url":"https://codeload.github.com/fairdataihub/poster-sentry/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fposter-sentry/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33874679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-classification","fair-data","multimodal","posters-science","quality-control","scientific-posters"],"created_at":"2026-06-03T17:01:02.640Z","updated_at":"2026-06-03T17:01:03.768Z","avatar_url":"https://github.com/fairdataihub.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PosterSentry\n\n**Lightweight multimodal classifier for scientific poster quality control in open repositories.**\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/fairdataihub/poster-sentry)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"PosterSentry.png\" alt=\"PosterSentry\" title=\"This image was generated by AI\" width=\"400\"\u003e\n\u003c/p\u003e\n\nPart of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).\n\nDeveloped by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI2).\n\n## The Problem\n\nOpen repositories like Zenodo and Figshare host tens of thousands of records labeled as scientific posters. However, approximately **20% of these records are mislabeled** — containing multi-page papers, conference proceedings, abstract booklets, slide decks, or other non-poster documents. This label noise is a significant barrier to automated poster processing at scale.\n\n## Architecture\n\nPosterSentry classifies PDFs using three complementary feature channels concatenated into a **542-dimensional** vector:\n\n| Channel | Features | Dimensions | Signal |\n|---------|----------|------------|--------|\n| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |\n| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |\n| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |\n\nA StandardScaler normalizes all features (preventing the 512-d text embedding from drowning out structural/visual signal), then a LogisticRegression classifier produces the final prediction.\n\nThe classifier head is a single linear layer stored as a numpy `.npz` file (**10 KB**). Inference is pure numpy — no GPU or deep learning framework required.\n\n## Performance\n\nValidated on 3,606 real scientific documents (zero synthetic data):\n\n| Metric | Value |\n|--------|-------|\n| **Accuracy** | **87.3%** |\n| F1 (poster) | 87.1% |\n| F1 (non-poster) | 87.4% |\n| Precision (poster) | 88.2% |\n| Recall (poster) | 85.9% |\n| Inference speed | \u003c 1 sec/PDF (CPU) |\n\nApplied to 30,205 PDFs from Zenodo and Figshare, PosterSentry classified **80.2% as true posters** and 19.8% as non-posters, with mean confidence of 0.799.\n\n### Top Discriminative Features\n\n| Feature | Coefficient | Signal |\n|---------|-------------|--------|\n| `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |\n| `page_count` | -5.49 | More pages = not a poster |\n| `file_size_kb` | -5.44 | Multi-page docs are bigger overall |\n| `is_landscape` | +0.98 | Some posters are landscape |\n| `color_diversity` | +0.95 | Posters are visually rich |\n| `edge_density` | +0.79 | More visual edges in posters |\n\n## Quick Start\n\n### Installation\n\n```bash\npip install poster-sentry\n```\n\n### CLI Usage\n\n```bash\n# Classify a single PDF\nposter-sentry classify document.pdf\n\n# Classify multiple PDFs\nposter-sentry classify *.pdf --output results.tsv\n\n# Print model info\nposter-sentry info\n```\n\n### Python API\n\n```python\nfrom poster_sentry import PosterSentry\n\nsentry = PosterSentry()\nsentry.initialize()\n\n# Classify a PDF (uses text + visual + structural features)\nresult = sentry.classify(\"document.pdf\")\nprint(f\"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}\")\n\n# Batch classification\nresults = sentry.classify_batch([\"poster1.pdf\", \"paper.pdf\", \"newsletter.pdf\"])\n\n# Text-only classification (no PDF needed)\nresult = sentry.classify_text(\"Title: My Poster\\nAuthors: ...\")\n```\n\n### Pipeline Position\n\nPosterSentry sits at the front of the posters.science pipeline — it screens incoming PDFs before expensive LLM-based extraction:\n\n```\nPDF Input\n   |\n   v\nPosterSentry          --\u003e  poster2json                     --\u003e  FAIR output\n(classify: poster?)        (Llama 3.1 8B structured extraction)  (poster-json-schema)\n```\n\n## System Requirements\n\n| Requirement | Value |\n|-------------|-------|\n| CPU | Any modern CPU (no GPU needed) |\n| RAM | 4 GB+ |\n| Python | 3.10+ |\n| Model size | 10 KB head + ~60 MB embeddings (downloaded once) |\n\n## Related Resources\n\n| Resource | Description |\n|----------|-------------|\n| [poster-sentry (HuggingFace)](https://huggingface.co/fairdataihub/poster-sentry) | Model weights and config |\n| [poster-sentry-training-data (HuggingFace)](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | Training dataset (3,606 samples) |\n| [poster-sentry-training (GitHub)](https://github.com/fairdataihub/poster-sentry-training) | Training code and replication |\n| [poster2json](https://github.com/fairdataihub/poster2json) | Poster to structured JSON extraction |\n| [posters.science](https://posters.science) | Platform |\n\n## Development\n\n```bash\ngit clone https://github.com/fairdataihub/poster-sentry.git\ncd poster-sentry\npip install -e \".[dev]\"\npytest\n```\n\n## Citation\n\n```bibtex\n@software{poster_sentry_2026,\n  title = {PosterSentry: Multimodal Scientific Poster Classifier},\n  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},\n  year = {2026},\n  url = {https://github.com/fairdataihub/poster-sentry},\n  note = {Part of the posters.science initiative at FAIR Data Innovations Hub}\n}\n```\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n## Acknowledgments\n\n- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI2)\n- [posters.science](https://posters.science) platform\n- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone\n- Funded by [The Navigation Fund](https://doi.org/10.71707/rk36-9x79) — \"Poster Sharing and Discovery Made Easy\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffairdataihub%2Fposter-sentry","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffairdataihub%2Fposter-sentry","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffairdataihub%2Fposter-sentry/lists"}