{"id":13510710,"url":"https://github.com/neuml/paperai","last_synced_at":"2025-10-31T21:26:49.063Z","repository":{"id":45720635,"uuid":"281475256","full_name":"neuml/paperai","owner":"neuml","description":"📄 🤖 Semantic search and workflows for medical/scientific papers","archived":false,"fork":false,"pushed_at":"2025-04-21T17:36:05.000Z","size":1812,"stargazers_count":1394,"open_issues_count":0,"forks_count":108,"subscribers_count":24,"default_branch":"master","last_synced_at":"2025-04-21T18:30:50.068Z","etag":null,"topics":["ai","artificial-intelligence","document-search","machine-learning","medical","nlp","python","scientific-papers","search","txtai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neuml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-21T18:33:30.000Z","updated_at":"2025-04-21T17:36:08.000Z","dependencies_parsed_at":"2023-09-23T17:11:45.084Z","dependency_job_id":"316f46e1-a54a-446d-bbf7-bb2d83ed3125","html_url":"https://github.com/neuml/paperai","commit_stats":{"total_commits":205,"total_committers":1,"mean_commits":205.0,"dds":0.0,"last_synced_commit":"fed05847c089c191068db84e9997d105dd216dee"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuml%2Fpaperai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuml%2Fpaperai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuml%2Fpa
perai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuml%2Fpaperai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neuml","download_url":"https://codeload.github.com/neuml/paperai/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254092789,"owners_count":22013290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","document-search","machine-learning","medical","nlp","python","scientific-papers","search","txtai"],"created_at":"2024-08-01T02:01:51.027Z","updated_at":"2025-10-31T21:26:49.054Z","avatar_url":"https://github.com/neuml.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/neuml/paperai/master/logo.png\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cb\u003eAI for medical and scientific papers\u003c/b\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/neuml/paperai/releases\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/release/neuml/paperai.svg?style=flat\u0026color=success\" alt=\"Version\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/neuml/paperai/releases\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/release-date/neuml/paperai.svg?style=flat\u0026color=blue\" alt=\"GitHub Release Date\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/neuml/paperai/issues\"\u003e\n        \u003cimg 
src=\"https://img.shields.io/github/issues/neuml/paperai.svg?style=flat\u0026color=success\" alt=\"GitHub issues\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/neuml/paperai\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/last-commit/neuml/paperai.svg?style=flat\u0026color=blue\" alt=\"GitHub last commit\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/neuml/paperai/actions?query=workflow%3Abuild\"\u003e\n        \u003cimg src=\"https://github.com/neuml/paperai/workflows/build/badge.svg\" alt=\"Build Status\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://coveralls.io/github/neuml/paperai?branch=master\"\u003e\n        \u003cimg src=\"https://img.shields.io/coverallsCoverage/github/neuml/paperai\" alt=\"Coverage Status\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n-------------------------------------------------------------------------------------------------------------------------------------------------------\n\n`paperai` is an AI application for medical and scientific papers.\n\n![demo](https://raw.githubusercontent.com/neuml/paperai/master/demo.png)\n\n⚡ Supercharge research tasks with AI-driven report generation. A `paperai` application goes through repositories of articles and generates bulk answers to questions backed by Large Language Model (LLM) prompts and Retrieval Augmented Generation (RAG) pipelines.\n\nA `paperai` configuration file enables bulk LLM inference operations in a performant manner. 
Think of it like kicking off hundreds of ChatGPT prompts over your data.\n\n![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture.png#gh-light-mode-only)\n![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture-dark.png#gh-dark-mode-only)\n\n`paperai` can generate reports in Markdown and CSV formats, and can annotate answers directly on PDFs (when available).\n\n## Installation\n\nThe easiest way to install is via pip and PyPI.\n\n```\npip install paperai\n```\n\nPython 3.10+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.\n\n`paperai` can also be installed directly from GitHub to access the latest, unreleased features.\n\n```\npip install git+https://github.com/neuml/paperai\n```\n\nSee [this link](https://neuml.github.io/txtai/install/#environment-specific-prerequisites) to help resolve environment-specific install issues.\n\n### Docker\n\nRun the steps below to build a Docker image with `paperai` and all dependencies.\n\n```\nwget https://raw.githubusercontent.com/neuml/paperai/master/docker/Dockerfile\ndocker build -t paperai .\ndocker run --name paperai --rm -it paperai\n```\n\npaperetl can be added to create a single image that can both index and query content. 
Follow the instructions to build a [paperetl docker image](https://github.com/neuml/paperetl#docker) and then run the following.\n\n```\ndocker build -t paperai --build-arg BASE_IMAGE=paperetl --build-arg START=/scripts/start.sh .\ndocker run --name paperai --rm -it paperai\n```\n\n## Examples\n\nThe following notebooks and applications demonstrate the capabilities provided by `paperai`.\n\n### Notebooks\n\n| Notebook  | Description  |       |\n|:----------|:-------------|------:|\n| [Introducing paperai](https://github.com/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | Overview of the functionality provided by paperai | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) |\n| [Medical Research Project](https://github.com/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) | Research young onset colon cancer | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) |\n\n### Applications\n\n| Application  | Description  |\n|:----------|:-------------|\n| [Search](https://github.com/neuml/paperai/blob/master/examples/search.py) | Search a `paperai` index. Set query parameters, execute searches and display results. |\n\n## Building a model\n\n`paperai` indexes databases previously built with [paperetl](https://github.com/neuml/paperetl). The following shows how to create a new `paperai` index.\n\n1. (Optional) Create an index.yml file\n\n    `paperai` uses the default txtai embeddings configuration when not specified. Alternatively, an index.yml file can be specified that takes all the same options as a txtai embeddings instance. See the [txtai documentation](https://neuml.github.io/txtai/embeddings/configuration) for more on the possible options. 
A simple example is shown below.\n\n    ```\n    path: sentence-transformers/all-MiniLM-L6-v2\n    content: True\n    ```\n\n2. Build embeddings index\n\n    ```\n    python -m paperai.index \u003cpath to input data\u003e \u003coptional index configuration\u003e\n    ```\n\nThe paperai.index process requires an input data path and optionally takes index configuration. This configuration can either be a vector model path or an index.yml configuration file.\n\n## Running queries\n\nThe fastest way to run queries is to start a `paperai` shell.\n\n```\npaperai \u003cpath to model directory\u003e\n```\n\nA prompt will come up. Queries can be typed directly into the console.\n\n## Report schema\n\nThe following steps through an example `paperai` report configuration file and describes each section.\n\n```yaml\nname: ColonCancer\noptions:\n    llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf\n    system: You are a medical literature document parser. You extract fields from data.\n    template: |\n        Quickly extract the following field using the provided rules and context.\n\n        Rules:\n          - Keep it simple, don't overthink it\n          - ONLY extract the data\n          - NEVER explain why the field is extracted\n          - NEVER restate the field name only give the field value\n          - Say no data if the field can't be found within the context\n\n        Field:\n        {question}\n\n        Context:\n        {context}\n\n    context: 5\n    params:\n        maxlength: 4096\n        stripthink: True\n\nResearch:\n    query: colon cancer young adults\n    columns:\n        - name: Date\n        - name: Study\n        - name: Study Link\n        - name: Journal\n        - {name: Sample Size, query: number of patients, question: Sample Size}\n        - {name: Objective, query: objective, question: Study Objective}\n        - {name: Causes, query: possible causes, question: List of possible causes}\n        - {name: 
Detection, query: diagnosis, question: List of ways to diagnose}\n```\n\n### Configuration\n\nThe following shows the top-level configuration options.\n\n| Field  | Description  |\n|:------------ |:-------------|\n| name | Report name |\n| options | RAG pipeline options - set the LLM, prompt templates, max length and more |\n| report | Each unique top-level parameter sets the report name. In the example above, it's called `Research` |\n| query | Vector query that identifies the top n documents |\n| columns | List of columns |\n\n### Standard columns\n\nStandard columns copy fields directly from the article data store metadata into a report. Set the column `name` to one of the values below.\n\n| Field  | Description  |\n|:------------ |:-------------|\n| Id | Article unique identifier |\n| Date | Article publication date |\n| Study | Title of the article |\n| Study Link | HTTP link to the study | \n| Journal | Publication name | \n| Source | Data source name | \n| Entry | Article entry date |\n| Matches | Sections that caused this article to match the report query | \n\n### Generated columns\n\nThe most novel feature of `paperai` is its ability to generate dynamic columns driven by a RAG pipeline. Each field takes the following parameters.\n\n| Parameter  | Description  |\n|:------------ |:-------------|\n| name | Column name |\n| query | Search/similarity query |\n| question | LLM question parameter |\n\nFor each matching article, the `query` sorts that article's sections by relevance to the query. This can be a vector, keyword or hybrid query, as controlled by the embeddings index configuration. The `question` is plugged into the RAG pipeline template along with the top n matching context elements from the query. The generated column is stored as `name` in the report output.\n\n## Building a report file\n\nReports can generate output in multiple formats. 
An example report call:\n\n```\npython -m paperai.report crc.yml 10 csv \u003cpath to model directory\u003e\n```\n\nIn the example above, a file named Research.csv will be created with the top 10 most relevant articles.\n\nThe following report formats are supported:\n\n- Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.\n- CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.\n- Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.\n\nSee the [examples](https://github.com/neuml/paperai/tree/master/examples) directory for report examples. Additional historical report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).\n\n## Tech Overview\n\n`paperai` is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index, a SQLite database with the articles and an LLM. These components are joined together in a [txtai RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/).\n\nEach article is parsed into sections and stored in a data store along with the article metadata. Embeddings are built over the full corpus. The LLM analyzes context-limited requests and generates outputs.\n\nMultiple entry points exist to interact with the model.\n\n- paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. 
There is also a highlights section showing the most relevant results.\n- paperai.query - Runs a single query from the terminal\n- paperai.shell - Allows running multiple queries from the terminal\n\n## Recognition\n\n`paperai` and/or NeuML has been recognized in the following articles.\n\n- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)\n- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)\n- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)\n","funding_links":[],"categories":["Python","python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuml%2Fpaperai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneuml%2Fpaperai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuml%2Fpaperai/lists"}