{"id":30110266,"url":"https://github.com/sodascience/biodiversityasset","last_synced_at":"2026-06-09T16:31:46.082Z","repository":{"id":304015009,"uuid":"1017489414","full_name":"sodascience/BiodiversityASSET","owner":"sodascience","description":"LLM-powered analysis of biodiversity-related investment activities in financial reports","archived":false,"fork":false,"pushed_at":"2025-11-27T14:53:02.000Z","size":158,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-30T07:46:19.557Z","etag":null,"topics":["biodiversity","economics","llm","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sodascience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-10T15:58:18.000Z","updated_at":"2025-11-27T14:53:05.000Z","dependencies_parsed_at":"2025-07-10T22:56:12.179Z","dependency_job_id":"f6870356-4ade-457f-973d-d110cc9b6823","html_url":"https://github.com/sodascience/BiodiversityASSET","commit_stats":null,"previous_names":["sodascience/biodiversityasset"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sodascience/BiodiversityASSET","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2FBiodiversityASSET","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2FBiodiversityASSET/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2FBiodiversityASSET/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2FBiodiversityASSET/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sodascience","download_url":"https://codeload.github.com/sodascience/BiodiversityASSET/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2FBiodiversityASSET/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34116457,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biodiversity","economics","llm","nlp"],"created_at":"2025-08-10T04:23:01.304Z","updated_at":"2026-06-09T16:31:46.076Z","avatar_url":"https://github.com/sodascience.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BiodiversityASSET\n\n\u003e **LLM-powered analysis of biodiversity-related investment activities in financial 
reports**\n\n[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://python.org)\n[![OpenAI](https://img.shields.io/badge/OpenAI-Batch%20API-green.svg)](https://platform.openai.com/docs/guides/batch)\n[![uv](https://img.shields.io/badge/uv-package%20manager-purple.svg)](https://github.com/astral-sh/uv)\n\nBiodiversityASSET is a comprehensive pipeline for extracting, classifying, and analyzing biodiversity-related content from investor reports. The system uses LLMs to evaluate paragraphs across three key dimensions:\n\n1. **🌿 Biodiversity relevance** - Identifies content related to biodiversity and environmental impact\n2. **💰 Investment activity** - Classifies paragraphs containing concrete investment activities  \n3. **📊 Assetization characteristics** - Scores content on intrinsic value, cash flow, and ownership/control\n\n## Table of Contents\n\n- [Key Features](#key-features)\n- [Quick Start](#quick-start)\n- [Processing Pipeline](#processing-pipeline)\n- [Batch Job Management](#batch-job-management)\n- [Prompt Customization](#prompt-customization)\n- [Project Structure](#project-structure)\n- [Output Organization](#output-organization)\n- [Documentation](#documentation)\n\n## Key Features\n\n✨ **Modular Architecture**\n- Submit batch jobs and monitor progress independently\n- Resume workflows from any step using batch IDs\n- Cancel running jobs with safety confirmations\n\n🤖 **LLM-Powered Processing**\n- OpenAI Batch API integration for cost-effective analysis (~50% cost reduction)\n- External prompt system for easy customization\n- Support for multiple models and configurations\n\n📁 **Organized Output**\n- Results saved in batch-specific subfolders\n- Clean filenames without ID conflicts\n- Individual chunk processing for large datasets\n\n🔧 **Developer-Friendly**\n- Comprehensive CLI tools with intuitive options\n- Detailed progress monitoring and error handling\n- Flexible configuration and custom prompt support\n\n## Processing 
Pipeline\n\nThe processing pipeline consists of sequential steps, with LLM-powered batch processing for steps 3-4:\n\n| Step | Purpose | Script | Input | Output |\n|------|---------|--------|-------|--------|\n| **1** | Extract paragraphs from PDFs | `extract_pdfs.py` | `data/raw/pdfs/` | `extracted_paragraphs_from_pdfs/` |\n| **2** | Filter biodiversity content | `filter_biodiversity_paragraphs.py` | `extracted_paragraphs_from_pdfs/` | `biodiversity_related_paragraphs/` |\n| **3a** | **Submit** investment classification | `submit_batch_job.py` | `biodiversity_related_paragraphs/` | Returns batch ID |\n| **3b** | **Monitor** batch progress | `check_batch_status.py` | Batch ID | Status updates |\n| **3c** | **Download** investment results | `download_batch_results.py` | Batch ID | `investment_activity_classification/` |\n| **4a** | **Submit** assetization scoring | `submit_batch_job.py` | `investment_activity_classification/` | Returns batch ID |\n| **4b** | **Monitor** batch progress | `check_batch_status.py` | Batch ID | Status updates |\n| **4c** | **Download** assetization results | `download_batch_results.py` | Batch ID | `assetization_features_scoring/` |\n\n\u003e **💡 Key Points:**\n\u003e - Steps 3-4 use OpenAI's Batch API for cost-effective processing\n\u003e - Each batch step can be run independently \n\u003e - **Step 4 requires a completed investment activity classification batch ID**\n\u003e - All results are organized in batch-specific subfolders\n\n## Quick Start\n\n### Prerequisites\n\nEnsure you have [`uv`](https://github.com/astral-sh/uv) installed:\n\n\u003cdetails\u003e\n\u003csummary\u003e📦 Install uv (click to expand)\u003c/summary\u003e\n\n**Windows:**\n```powershell\npowershell -ExecutionPolicy ByPass -c \"irm https://astral.sh/uv/install.ps1 | iex\"\n```\n\n**Linux/MacOS:**\n```bash\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n```\n\u003c/details\u003e\n\n### 🚀 Installation\n\n```bash\n# Clone the repository\ngit clone 
\u003crepository-url\u003e\ncd BiodiversityASSET\n\n# Install dependencies\nuv sync\n```\n\n### ⚙️ Environment Setup\n\n```bash\n# Set your OpenAI API key\nexport OPENAI_API_KEY=\"your-openai-api-key\"\n\n# Or create a .env file\necho \"OPENAI_API_KEY=your-openai-api-key\" \u003e .env\n```\n\n### 📝 Basic Usage\n\n#### 1. Extract paragraphs from PDFs\n```bash\npython scripts/extract_pdfs.py\n```\n\n#### 2. Filter biodiversity-related content\n```bash\npython scripts/filter_biodiversity_paragraphs.py\n```\n\n#### 3. Classify investment activities\n```bash\n# Submit the batch job\npython scripts/submit_batch_job.py --task investment_activity_classification\n\n# Monitor progress (replace \u003cbatch-id\u003e with actual ID)\npython scripts/check_batch_status.py --batch-id \u003cbatch-id\u003e --wait\n\n# Download results\npython scripts/download_batch_results.py --batch-id \u003cbatch-id\u003e\n```\n\n#### 4. Score assetization features\n```bash\n# Submit dependent job (requires investment batch ID)\npython scripts/submit_batch_job.py --task assetization_features_scoring --batch-id \u003cinvestment_batch_id\u003e\n\n# Monitor and download\npython scripts/check_batch_status.py --batch-id \u003cassetization_batch_id\u003e --wait\npython scripts/download_batch_results.py --batch-id \u003cassetization_batch_id\u003e\n```\n\n## Batch Job Management\n\n### 📊 Monitoring Jobs\n\n```bash\n# List all batch jobs with LAST-CHECKED status and timestamps\npython scripts/check_batch_status.py --list-jobs\n\n# Check CURRENT status of a specific job\npython scripts/check_batch_status.py --batch-id \u003cbatch-id\u003e\n\n# Wait for job completion (polls every 30 seconds)\npython scripts/check_batch_status.py --batch-id \u003cbatch-id\u003e --wait\n\n# Custom polling interval\npython scripts/check_batch_status.py --batch-id \u003cbatch-id\u003e --wait --poll-interval 60\n```\n\n### ❌ Canceling Jobs\n\n```bash\n# Cancel a running batch job (requires confirmation)\npython 
scripts/check_batch_status.py --batch-id \u003cbatch-id\u003e --cancel\n```\n\n### 📋 Example Job Listing Output\n\n```\n=== Batch Jobs (3 found) ===\nBatch ID                              Task                      Status          Last Checked      Submitted         Paragraphs  \n-------------------------------------------------------------------------------------------------------------------------------\nbatch_686fc36b2da08190903bc237510c52f5 investment_activity_class completed       07-10 16:55       2025-07-10T15:43  120         \nbatch_686fd9e4f814819088b69150a57753d6 assetization_features_sc  submitted       never             2025-07-10T17:19  3           \nbatch_686fdd5143248190aae3f8185f24a415 investment_activity_class in_progress     07-10 14:30       2025-07-10T14:15  274         \n```\n\n## Prompt Customization\n\nBiodiversityASSET uses external text files for prompts, making them easy to customize without code changes:\n\n### 📁 Default Prompt Files\n\n- **`prompts/investment_activity_classification_system_prompt.txt`** - System prompt for investment activity classification\n- **`prompts/assetization_features_scoring_system_prompt.txt`** - System prompt for assetization features scoring  \n- **`prompts/user_prompt_template.txt`** - User prompt template applied to each paragraph\n\n### 🛠️ Using Custom Prompts\n\n```bash\n# Use custom system prompt\npython scripts/submit_batch_job.py --task investment_activity_classification \\\n    --system-prompt prompts/my_custom_system.txt\n\n# Use both custom system and user prompts\npython scripts/submit_batch_job.py --task investment_activity_classification \\\n    --system-prompt prompts/my_custom_system.txt \\\n    --user-prompt prompts/my_custom_user.txt\n\n# Use different model with custom prompts\npython scripts/submit_batch_job.py --task assetization_features_scoring \\\n    --batch-id \u003cinvestment_batch_id\u003e \\\n    --model gpt-4o \\\n    --max-tokens 750 \\\n    --system-prompt 
prompts/my_custom_system.txt\n```\n\n## Project Structure\n\n```\nBiodiversityASSET/\n├── 📁 data/\n│   ├── 📁 raw/\n│   │   └── 📁 pdfs/                     # 📄 Input: PDF investor reports\n│   ├── 📁 processed/\n│   │   ├── 📁 extracted_paragraphs_from_pdfs/      # Step 1: Extracted paragraphs\n│   │   ├── 📁 biodiversity_related_paragraphs/     # Step 2: Filtered biodiversity content\n│   │   ├── 📁 investment_activity_classification/  # Step 3: Investment classification results\n│   │   │   └── 📁 \u003cbatch_id\u003e/\n│   │   │       ├── 📊 batch_results.jsonl\n│   │   │       ├── 📊 chunk_1.csv\n│   │   │       └── 📊 chunk_2.csv\n│   │   └── 📁 assetization_features_scoring/       # Step 4: Assetization scoring results\n│   │       └── 📁 \u003cbatch_id\u003e/\n│   │           ├── 📊 batch_results.jsonl\n│   │           └── 📊 assetization_features_scored.csv\n│   └── 📁 human_annotations/            # 👥 Manual annotations for evaluation\n├── 📁 prompts/                          # 🤖 LLM prompt templates\n│   ├── 📝 investment_activity_classification_system_prompt.txt\n│   ├── 📝 assetization_features_scoring_system_prompt.txt\n│   └── 📝 user_prompt_template.txt\n├── 📁 results/\n│   ├── 📁 batch_jobs/                   # 📋 Batch job metadata and raw results\n│   │   ├── 📄 \u003cbatch_id\u003e.json\n│   │   ├── 📁 investment_activity_classification_processing/\n│   │   └── 📁 assetization_features_scoring_processing/\n│   └── 📁 evaluation/                   # 📈 Evaluation results (future)\n├── 📁 scripts/                          # 🐍 Python processing scripts\n├── ⚙️ pyproject.toml                    # 📦 Project dependencies\n├── 🔒 uv.lock                           # 🔐 Lock file for dependencies\n├── 📖 README.md                         # 📚 Project documentation\n├── 📖 BATCH_WORKFLOW.md                 # 🔄 Detailed batch processing workflow\n└── 📖 REFACTORING_SUMMARY.md            # 📝 Summary of refactoring changes\n```\n\n## Output Organization\n\nResults are organized in 
batch-specific subfolders to prevent conflicts and enable easy tracking:\n\n### 💼 Investment Activity Classification\n\n```\ndata/processed/investment_activity_classification/\u003cbatch_id\u003e/\n├── 📊 batch_results.jsonl              # Raw API responses\n├── 📊 chunk_1.csv                      # Processed results for chunk 1\n└── 📊 chunk_2.csv                      # Processed results for chunk 2\n```\n\n**Contains:** Investment activity scores, explanations, and original paragraph metadata\n\n### 📈 Assetization Features Scoring\n\n```\ndata/processed/assetization_features_scoring/\u003cbatch_id\u003e/\n├── 📊 batch_results.jsonl              # Raw API responses\n└── 📊 assetization_features_scored.csv # Scored paragraphs with all dimensions\n```\n\n**Contains:** Intrinsic value, cash flow, and ownership/control scores with detailed reasoning\n\n### 🔑 Key Benefits\n\n- **🔒 Conflict-free:** Each batch job gets its own subfolder\n- **🏷️ Clean naming:** Filenames without batch ID suffixes  \n- **📝 Traceable:** Easy to identify which batch produced which results\n- **🔄 Resumable:** Can re-run or reference specific batch outputs\n\n## Documentation\n\n📖 **[BATCH_WORKFLOW.md](BATCH_WORKFLOW.md)** - Detailed step-by-step workflow guide with examples\n\n📝 **[REFACTORING_SUMMARY.md](REFACTORING_SUMMARY.md)** - Complete summary of system architecture and changes\n\n---\n\n## Contributing\n\nWe welcome contributions! 
Please see our contribution guidelines for more information.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Citation\n\nIf you use BiodiversityASSET in your research, please cite:\n\n```bibtex\n@software{biodiversityasset,\n  title={BiodiversityASSET: LLM-powered analysis of biodiversity-related investment activities},\n  author={SoDa},\n  year={2025},\n  url={https://github.com/sodascience/BiodiversityASSET}\n}\n```\n\n## Contact\n\nThis project is developed and maintained by the [ODISSEI Social Data Science (SoDa)](https://odissei-soda.nl/) team.\n\nDo you have questions, suggestions, or remarks? File an [issue](https://github.com/sodascience/BiodiversityASSET/issues) or feel free to contact [Qixiang Fang](https://github.com/fqixiang) or [Catalina Papari](https://github.com/catalinapapari1).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fbiodiversityasset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsodascience%2Fbiodiversityasset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fbiodiversityasset/lists"}