{"id":25458662,"url":"https://github.com/elias-jhsph/scienceai","last_synced_at":"2025-12-27T11:39:45.864Z","repository":{"id":244910775,"uuid":"816673039","full_name":"elias-jhsph/scienceai","owner":"elias-jhsph","description":"An AI-powered scientific literature search engine that uses OpenAI's language models to analyze research papers. It enables users to extract data, ask complex questions, and perform ad hoc literature reviews, handling hundreds of papers simultaneously without needing metadata.","archived":false,"fork":false,"pushed_at":"2024-06-21T20:11:28.000Z","size":147,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-16T14:51:32.435Z","etag":null,"topics":["ai","data-extraction","dictdatabase","flask","literature-review","llm","openai","pymupdf","research-project","research-tool","scientific-publications","scientific-research"],"latest_commit_sha":null,"homepage":"https://eliastechlabs.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elias-jhsph.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-18T07:34:02.000Z","updated_at":"2025-02-01T15:21:03.000Z","dependencies_parsed_at":"2024-06-20T12:39:01.501Z","dependency_job_id":null,"html_url":"https://github.com/elias-jhsph/scienceai","commit_stats":null,"previous_names":["elias-jhsph/scienceai","elias-jhsph/science-ai"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elias-j
hsph%2Fscienceai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elias-jhsph%2Fscienceai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elias-jhsph%2Fscienceai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elias-jhsph%2Fscienceai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elias-jhsph","download_url":"https://codeload.github.com/elias-jhsph/scienceai/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239401135,"owners_count":19632122,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-extraction","dictdatabase","flask","literature-review","llm","openai","pymupdf","research-project","research-tool","scientific-publications","scientific-research"],"created_at":"2025-02-18T03:20:13.792Z","updated_at":"2025-12-27T11:39:45.846Z","avatar_url":"https://github.com/elias-jhsph.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScienceAI\n\n[![PyPI version](https://badge.fury.io/py/scienceai-llm.svg)](https://badge.fury.io/py/scienceai-llm)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![Tests](https://github.com/elias-jhsph/scienceai/actions/workflows/test.yml/badge.svg)](https://github.com/elias-jhsph/scienceai/actions/workflows/test.yml)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![Code style: 
ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n\n**An AI-Powered Research Assistant for Systematic Literature Analysis**\n\nScienceAI is a Python application that transforms how researchers analyze scientific literature. Unlike a standard LLM chatbot, ScienceAI is specifically designed to handle complex, multi-paper research tasks through an intelligent agent-based architecture that supports GPT-5.2, Claude, and Gemini models.\n\n---\n\n## 🎯 Why ScienceAI vs. a Regular LLM Chatbot?\n\n| Standard LLM Chatbot | ScienceAI |\n|---------------------|-----------|\n| Single conversation context | **Multi-agent system** with specialized analyst agents |\n| Manual upload of each document excerpt | **Automatic processing** of hundreds of PDFs |\n| Limited by context window (~200K tokens) | **Processes entire paper collections** regardless of size |\n| Requires you to extract data manually | **Automated data extraction** with structured schemas |\n| One-off responses | **Persistent analysis** with downloadable results |\n| No systematic validation | **Built-in validation** and provenance tracking |\n| Generic responses | **Evidence-based answers** with source citations |\n\n### The Key Difference: Agentic Architecture\n\nScienceAI employs a **Principal Investigator (PI)** that:\n- Breaks down your research question into manageable sub-tasks\n- Creates specialized **Analyst Agents** for each sub-task\n- Coordinates parallel data extraction across your entire paper collection\n- Synthesizes findings from multiple analysts\n- Provides comprehensive, evidence-backed answers\n\nThis means you can ask: *\"Extract healing times, sample sizes, and intervention types from all these papers\"* and ScienceAI will automatically create the right analysts, define extraction schemas, process all papers, and return structured CSV data—something impossible with a standard chatbot.\n\n---\n\n## 🚀 Main Features\n\n- **📚 Automated Paper 
Processing**: Upload PDFs and let ScienceAI extract text, figures, tables, and metadata automatically\n- **🤖 AI-Driven Multi-Agent Analysis**: The PI delegates tasks to specialized Analyst Agents that work autonomously\n- **📊 Structured Data Extraction**: Define data schemas and extract information systematically across all papers\n- **💬 Interactive Research Discussion**: Ask complex research questions and receive evidence-backed answers\n- **🔍 Provenance Tracking**: Every extracted data point includes source quotes and derivation explanations\n- **📈 Export \u0026 Visualization**: Download extracted data as CSV, export papers with metadata, view analysis results in an interactive interface\n- **🌙 Dark Mode**: Fully supported dark mode for comfortably working in low-light environments, including specialized styling for data viewers.\n- **💾 Project Management**: Save and resume research projects with full checkpoint support\n\n---\n\n## 📦 Installation\n\n**Requirements**: Python 3.11+ and an OpenAI API key\n\n```bash\npip install scienceai-llm\n```\n\n---\n\n## 🎬 Getting Started\n\n### 1. Launch ScienceAI\n\n```bash\nscienceai\n```\n\nThis starts a local web server. Open your browser to:\n```\nhttp://localhost:4242\n```\n\nYou will be prompted to enter your OpenAI API key. This key is used to authenticate requests to the OpenAI API. You can find your API key in your OpenAI account settings.\n\nEnter your project name and click **\"Start\"** to create a new project or load an existing one.\n\n\u003e **Tip**: You can switch between **OpenAI**, **Anthropic (Claude)**, and **Google (Gemini)** models using the \"LLM Provider\" card in the main menu once started. See [Configuration](#-configuration) for setup details.\n\n![Papers Panel - Your Literature Library](images/main_menu.png)\n\n### 2. 
Understanding the Interface\n\n![Papers Panel - Your Literature Library](images/papers_panel.png)\n\n**Papers Panel (Left Side)**: This is your literature library showing all uploaded PDFs with:\n- **Search Bar** at the top to filter papers by title, author, or keywords\n- **Automatically Detected Metadata**: Author, Date, Title, Journal\n- **Paper IDs**: Each paper gets a unique identifier\n- **Analyst Tracking**: Shows which analysts have processed each paper\n- **Add Papers Button**: Upload additional PDFs to your project at any time\n\nYou can upload PDFs individually or as a zip folder during project creation, or add more later via the \"Add Papers\" button.\n\n---\n\n### 3. Chatting with the Principal Investigator\n\n![Science Discussion - Your Research Conversation](images/chat_interface.png)\n\n**Science Discussion Panel (Right Side)**: This is where you interact with the Principal Investigator (PI). The PI:\n- Understands complex research questions\n- Plans multi-step analysis strategies\n- Creates and manages Analyst Agents to accomplish your goals\n- Presents synthesized findings with evidence\n\n**Key Features:**\n- **Message Status**: Messages show \"Processed\" (waiting for your input) or \"Pending\" (PI is working)\n- **\"Show work...\" Links**: Click to see detailed tool calls and PI reasoning (see below)\n- **Timestamps**: Track when each interaction occurred\n- **Brain Indicator 🧠**: A floating emoji shows real-time context (memory) usage. It turns yellow ⚠️ or red 🔴 as the model's memory fills up.\n\n#### 🔍 Transparency: \"Show work...\" Feature\n\n![Show Work Collapsed](images/show_work_collapsed.png)\n\nMessages from the PI include a **\"Show work...\"** link. 
This transparency feature lets you see exactly what the PI is doing behind the scenes.\n\n![Show Work Expanded](images/show_work_expanded.png)\n\n**Click \"Show work...\"** to reveal:\n- **Tool Calls**: Every function the PI called (e.g., `read_paper_chunks`, `create_analyst`, `search_database`)\n- **Arguments**: The exact parameters passed to each tool\n- **Outputs**: Results returned from each operation\n- **Reasoning**: The PI's step-by-step decision-making process\n\nThis is invaluable for:\n- **Understanding** how ScienceAI processes your requests\n- **Debugging** unexpected results\n- **Learning** how to phrase better questions\n- **Trust** through complete transparency\n\nClick **\"Hide work...\"** to collapse the details again.\n\n**Example Questions to Ask:**\n- *\"Extract sample sizes, intervention types, and outcomes from all studies\"*\n- *\"Which papers found significant effects for [specific intervention]?\"*\n- *\"Create a summary table comparing study methodologies\"*\n- *\"What are the outcome measures used across these papers?\"*\n\n\n\n#### 🔄 Resetting the Conversation\n\nIf you wish to start fresh while keeping your uploaded papers, use the **Reset Conversation** button (or the undo arrow icon in the chat interface). This will:\n- Clear the chat history\n- Reset the Principal Investigator's memory\n- Fix any potential database locks\n- **Keep** your uploaded papers and extracted data collections\n\n---\n\n### 4. Working with Analyst Agents\n\n![Analysis Panel - Your Data Extraction Agents](images/analysts_panel.png)\n\n**Analysis Panel (Bottom Section)**: When you request data extraction or specific analyses, the PI creates specialized **Analyst Agents**. 
This panel shows:\n\n- **Analyst Categories**: Different types of analysts (e.g., \"Study Categorization \u0026 Eligibility Analyst\", \"Nonunion and Union Status Analyst\")\n- **Data Collections**: Each analyst creates structured data collections with names like \"NonunionSmokingData2\"\n- **Load Button**: Click to view the extracted data in a table format\n- **Download Button**: Export data as CSV for analysis in Excel, R, or Python\n\nEach analyst autonomously:\n1. Defines an extraction schema based on your request\n2. Processes all relevant papers\n3. Validates extracted data for accuracy\n4. Provides results with source citations\n\n---\n\n### 5. Viewing Extracted Data\n\n**Data Tables**: Click \"Load\" on any data collection to see the extracted data in a structured table format. Each row represents data from a paper, with columns showing:\n\n- **Standard Fields**: Data you requested (e.g., smoking status, healing time, sample size)\n- **Provenance Metadata**: Automatically added by ScienceAI\n  - `_source_quote`: The exact text from the paper supporting this data\n  - `_derivation`: Explanation of how calculated/inferred values were determined\n  - `_source_location`: Where in the paper this data was found\n\n**Key Features:**\n- **Sortable Columns**: Click headers to sort\n- **Download CSV**: Click the download button to export for further analysis\n- **Source Verification**: Every data point links back to the original paper text\n\n#### 👁️ Viewing Raw Data: JSON and CSV Viewers\n\nIn the **Analysis Panel**, each data collection offers multiple view formats:\n\n![JSON Viewer with Syntax Highlighting](images/json_viewer.png)\n\n**JSON Data Eye Icon (👁️)**: Click the eye icon next to \"JSON Data\" to open an interactive JSON viewer featuring:\n- **Syntax Highlighting**: Easy-to-read colored formatting\n- **Collapsible Sections**: Expand/collapse nested objects and arrays\n- **Copy Button**: Copy the entire JSON to clipboard\n- **Raw Format**: See the 
exact data structure as stored\n\n![CSV Viewer with Data Grid](images/csv_viewer.png)\n\n**CSV Data Eye Icon (👁️)**: Click the eye icon next to \"CSV Data\" to open a spreadsheet-style viewer with:\n- **Grid Layout**: See your data in familiar rows and columns\n- **Quick Preview**: View data without downloading\n- **Inspect Format**: Check CSV structure before exporting\n\nThese viewers help you:\n- **Verify** data quality before export\n- **Debug** extraction issues by inspecting raw values\n- **Choose** the best format (JSON vs CSV) for your workflow\n- **Inspect** data structure and field types\n\nClick the **Close** button or press **Esc** to dismiss the viewer.\n\n---\n\n### 6. Exporting Your Work\n\n![Export Menu - Download Papers and Data](images/export_menu.png)\n\n**Export Button (📦)**: Located in the bottom control panel, this opens the Export Papers menu where you can:\n\n**Select Papers to Export:**\n- **All**: Export every paper in your project\n- **User Defined Tag**: Filter by custom tags you've applied\n\n**Customize Filenames** with detected metadata:\n- Choose which fields to include: DOI, Date, First Author, Title, Journal, Tags\n- Set the order of fields in the filename\n- Choose separator (underscore, dash, space)\n- Preview: `2023_Smith_ImplantFailureRates_JBJS.pdf`\n\n**Bottom Control Panel Buttons:**\n- **💾 Checkpoints**: Download auto-generated checkpoint saves that allow you to resume your project at the last saved state or share it with others\n- **📦 Export**: Export papers with custom filenames\n- **📊 Extracted Data**: Combines ALL extracted data into a single CSV file that you can use for analysis and verification of extracted data quality (column names may be very long, so you may want to rename them)\n- **❌ Close**: Return to project selection screen\n\n---\n\n## 💡 Example Use Cases\n\n### 1. 
**Systematic Literature Reviews**\nUpload 100+ papers, ask the PI to categorize them by intervention type, extract study characteristics, and generate summary tables—all automatically.\n\n### 2. **Meta-Analysis Data Extraction**\nRequest extraction of effect sizes, sample sizes, and study parameters. ScienceAI handles the schema definition, extraction, validation, and CSV export.\n\n### 3. **Research Gap Analysis**\nAsk \"What methodologies are under-represented?\" and let analysts scan all papers to identify patterns and gaps.\n\n### 4. **Evidence Synthesis**\n\"Summarize all findings related to [X]\" triggers analysts to extract relevant sections, synthesize findings, and provide citations.\n\n---\n\n## 🐍 Python Library Usage\n\nScienceAI can also be used as a Python library to integrate its capabilities into your own scripts and applications.\n\n### Initialization\n\n```python\nfrom scienceai.client import ScienceAI\n\n# Initialize the client (starts backend automatically)\nclient = ScienceAI(project_name=\"MyResearchProject\")\n```\n\n### Ingesting Papers\nYou can upload papers programmatically and trigger preprocessing.\n\n```python\n# Upload papers and wait for preprocessing to complete\nclient.upload_papers([\"/path/to/paper1.pdf\", \"/path/to/paper2.pdf\"])\n\n# Or upload without immediate preprocessing\nclient.upload_papers([\"/path/to/paper3.pdf\"], trigger_preprocess=False)\n\n# Manually trigger preprocessing later\nclient.preprocess()\n```\n\n### Chatting with the PI\n\nInteract with the Principal Investigator to ask questions or request analyses.\n\n```python\nimport time\n\n# Send a message and wait for the response (blocking)\nresponse = client.chat(\"Summarize the findings of the uploaded papers.\")\nprint(response)\n\n# Non-blocking chat\nclient.chat_background(\"Extract sample sizes from all papers.\")\n\n# Poll for status\nwhile True:\n    result = client.poll()\n    if result:\n        print(\"Response received:\", result)\n        break\n    
print(\"Working...\")\n    time.sleep(1)\n\n# Get full history\nhistory = client.history()\n```\n\n---\n\n## 🏗️ How It Works: Architecture Overview\n\n### The Principal Investigator (PI)\nYour main interface—a conversational AI that:\n- Understands research objectives\n- Plans analysis strategies\n- Creates and manages Analyst Agents\n- Synthesizes multi-agent findings\n- Communicates results clearly\n\n### Analyst Agents\nSpecialized workers created on-demand:\n- Each has a focused research goal\n- Autonomously defines data schemas\n- Extracts, validates, and exports data\n- Provides evidence-backed conclusions\n\n### Data Extraction Engine\n- **Flexible Schemas**: Support for numbers, dates, text blocks, categorical data, and more\n- **Derivation Support**: Extract calculated or inferred values with explanations\n- **Automatic Provenance**: Every data point links to source location and quotes\n- **Validation**: Built-in error checking and re-extraction on failure\n\n### Database \u0026 Storage\n- Persistent project storage\n- Efficient paper and metadata management\n- Data collection tracking\n- Checkpoint and export functionality\n\n---\n\n## 🔧 Configuration\n\n### LLM Provider Selection\n\nScienceAI supports multiple LLM providers with flexible authentication options:\n\n#### Supported Providers\n- **OpenAI** (GPT-4, GPT-5, o4-mini): Default provider\n- **Anthropic** (Claude Sonnet/Opus 4.5): Via direct API or Google Vertex AI\n- **Google** (Gemini 3 Pro): Via API key or Vertex AI service account\n\n#### Setting Up Providers\n\n**OpenAI (Required for Default Setup)**\n```bash\n# Method 1: Interactive setup\nscienceai --setup-keys\n\n# Method 2: Direct key setting\nscienceai --set-key openai YOUR_OPENAI_API_KEY\n\n# Method 3: Environment variable\nexport OPENAI_API_KEY=\"sk-...\"\n```\n\n**Anthropic Claude (Optional)**\n```bash\n# Direct API (recommended for personal use)\nscienceai --set-key anthropic YOUR_ANTHROPIC_API_KEY\n\n# Or via environment 
variable\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n```\n\n**Google Gemini (Optional)**\n```bash\n# Standard API key (simple setup)\nscienceai --set-key google YOUR_GOOGLE_API_KEY\n\n# Or via environment variable\nexport GOOGLE_API_KEY=\"...\"\n# or\nexport GEMINI_API_KEY=\"...\"\n```\n\n#### GCP Service Account for Production/Enterprise\n\nFor production deployments or enterprise use, you can use a GCP service account for both Gemini and Claude on Vertex AI:\n\n**Setup:**\n```bash\nscienceai --gcp-service-account /path/to/service-account.json\n```\n\nThis will:\n1. Validate your service account file\n2. Extract the project ID automatically\n3. **Prompt you interactively**:\n   ```\n   ✓ Valid service account file for project: my-project-123\n     This service account can be used for:\n       1. Google Gemini (native GCP models)\n       2. Claude on Vertex AI (Anthropic partner models)\n\n   Use this service account for Claude on Vertex AI? (y/n):\n   ```\n4. Ask for your preferred Vertex AI region:\n   ```\n   Common Vertex AI regions:\n     - us-east5 (US East)\n     - us-central1 (US Central)\n     - europe-west1 (Europe West)\n   Enter Vertex AI region (default: us-east5):\n   ```\n5. Save the configuration\n\n**Remove GCP Configuration:**\n```bash\nscienceai --remove-gcp-config\n```\n\nThis command allows you to selectively remove Gemini and/or Claude Vertex configurations, reverting to API key authentication.\n\n**Priority Order:**\n- If both GCP service account AND API key are configured for a provider:\n  1. **GCP Service Account** takes priority (recommended for production)\n  2. **API Key** is used as fallback\n\nThis design allows smooth transitions between development (API key) and production (service account) environments.\n\n### Provider Switching\n\nSwitch between providers via the **LLM Provider** card in the menu UI. 
Select:\n- **OpenAI** (GPT models)\n- **Claude** (Anthropic direct API)\n- **Claude on Vertex** (via GCP - if configured)\n- **Gemini** (Google models)\n\nUnavailable providers (missing API keys) are grayed out.\n\n### Validate Your Configuration\n\nTest all configured API keys:\n```bash\nscienceai --validate-keys\n```\n\nOutput:\n```\nValidating configured API keys...\n\n  ✓ openai: Valid (gpt-5.2 accessible)\n  ✓ anthropic: Valid (claude-sonnet-4-5 accessible)\n  ✗ google: Invalid (API key expired)\n\n⚠ Some keys failed validation\n```\n\n### CLI Options Reference\n\n```bash\n# API Key Management\nscienceai --setup-keys                    # Interactive key setup\nscienceai --set-key PROVIDER KEY         # Set a specific key\nscienceai --validate-keys                # Validate all keys\n\n# GCP Service Account\nscienceai --gcp-service-account PATH     # Configure service account\nscienceai --remove-gcp-config            # Remove service account config\n\n# Provider Selection\nscienceai --provider anthropic           # Start with specific provider\n\n# Server Options\nscienceai --port 8080                    # Custom port (default: 4242)\nscienceai --skip-validation              # Skip startup key validation\n\n# Logging\nscienceai -v                             # Verbose (INFO level)\nscienceai --debug                        # Debug logging\nscienceai --log-level WARNING            # Specific log level\n```\n\n### Configuration Files\n\nAPI keys and GCP configuration are stored in:\n```\n~/Documents/ScienceAI/scienceai-keys.json\n```\n\nExample structure:\n```json\n{\n  \"openai\": \"sk-...\",\n  \"anthropic\": \"sk-ant-...\",\n  \"google\": \"AIza...\",\n  \"google_gcp\": {\n    \"service_account_path\": \"/path/to/sa.json\",\n    \"project_id\": \"my-project-123\",\n    \"region\": \"us-east5\"\n  },\n  \"anthropic_vertex\": {\n    \"service_account_path\": \"/path/to/sa.json\",\n    \"project_id\": \"my-project-123\",\n    \"region\": \"us-east5\"\n  
}\n}\n```\n\n---\n\n\n## 📚 Detailed Documentation\n\n### 🧠 Principal Investigator (PI)\nThe **Principal Investigator** (`src/scienceai/principal_investigator.py`) is the central orchestrator of the system. It uses an LLM-driven reasoning loop to:\n\n1.  **Plan Research**: Decomposes user queries into sub-tasks.\n2.  **Delegate**: Spawns **Analyst Agents** using `delegate_research()` to handle specific data extraction or analysis tasks.\n3.  **Execute Code**: Uses `run_python_code()` to perform statistical analysis, generate plots, or manipulate data using Python (pandas, matplotlib, etc.).\n4.  **Synthesize**: Aggregates results from multiple analysts using `reflect_on_delegations()` to provide a cohesive answer.\n5.  **Transparency**: All PI actions are recorded and visible via the \"Show work...\" feature in the UI, exposing tool calls, arguments, and internal reasoning.\n\n### 🕵️ Analyst Agents\n**Analyst Agents** (`src/scienceai/analyst.py`) are specialized, autonomous workers created by the PI. Each analyst has a specific `goal` (e.g., \"Extract patient demographics\") and follows this workflow:\n\n1.  **Paper Selection**: Identifies relevant papers using `get_all_papers()` or filters by criteria.\n2.  **Schema Generation**: Automatically generates a JSON schema for data extraction based on its goal.\n3.  **Concurrent Extraction**: Runs `extract_data()` across all selected papers in parallel.\n4.  **Validation**: Uses `reflect_on_evidence()` to verify that extracted data is supported by the source text.\n5.  
**Data Collection**: Saves structured results into a named collection (e.g., `DemographicsData`) which becomes available to the PI and the user.\n\n### ⛏️ Data Extraction Engine\nThe **Data Extraction Engine** (`src/scienceai/data_extractor.py`) is the core NLP component responsible for turning unstructured PDF text into structured data.\n\n-   **Supported Types**: `number`, `date`, `text_block`, `categorical`, `boolean`, `array`, `object`.\n-   **Provenance Injection**: Automatically adds metadata to every extracted field:\n    -   `_source_quote`: The verbatim text from the paper supporting the data.\n    -   `_source_location`: Page number and context.\n    -   `_derivation`: Logic used to calculate values (e.g., \"Calculated as 15 males + 12 females\").\n-   **Reflection \u0026 Validation**: The `reflect_on_data_extraction()` function acts as a critic, comparing the extracted JSON against the paper's text to catch hallucinations or errors before saving.\n\n### 💾 Database \u0026 Storage\nManaged by `DatabaseManager` (`src/scienceai/database_manager.py`), the system uses a file-based storage approach for portability and simplicity.\n\n-   **Paper Ingestion**: PDFs are hashed (`sha256`) to prevent duplicates. Text, tables, and figures are extracted and stored.\n-   **Storage Format**: Uses `dictdatabase` to store project state, chat history, and data collections as JSON files.\n-   **Checkpoints**: The system supports full project checkpointing. The `save_database()` function creates a zip archive of the project directory, allowing users to backup, share, or resume their work at any time.\n-   **Export**: Data can be exported as CSVs, and papers can be renamed/exported based on their metadata.\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! 
Here's how:\n\n- **Report Bugs**: Open an issue on GitHub with reproduction steps\n- **Feature Requests**: Suggest new capabilities or improvements\n- **Pull Requests**: Fork, develop, and submit PRs for review\n\n---\n\n## 📄 License\n\nSee LICENSE file for details.\n\n---\n\n## 🆘 Troubleshooting\n\n**Papers not processing?**\nCheck that PDFs are valid and not password-protected.\n\n**API errors?**\nVerify your API key or Service Account is valid and has available credits.\n\n**Analyst not completing?**\nCheck the chat panel for error messages—the PI will explain any issues.\n\n**Cannot download data?**\nEnsure analysts have completed their data collections before exporting.\n\n**\"Context Limit Reached\" Warning?**\nThis means the conversation has exceeded the LLM's memory. ScienceAI will automatically compress older messages to free up space. You can also use the **Reset Conversation** feature to clear the history while keeping your uploaded papers.\n\n---\n\n**Ready to transform your literature review workflow? Install ScienceAI and start asking research questions!**\n\n```bash\npip install scienceai-llm\nscienceai\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felias-jhsph%2Fscienceai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felias-jhsph%2Fscienceai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felias-jhsph%2Fscienceai/lists"}