{"id":29240300,"url":"https://github.com/priorlabs/dataset-requirements-guide","last_synced_at":"2025-09-05T21:35:09.483Z","repository":{"id":293703953,"uuid":"984752137","full_name":"PriorLabs/dataset-requirements-guide","owner":"PriorLabs","description":"Dataset requirement guide for the fine tuning program. ","archived":false,"fork":false,"pushed_at":"2025-08-12T16:57:03.000Z","size":32,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-12T18:39:39.799Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PriorLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-16T12:56:21.000Z","updated_at":"2025-06-11T13:55:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"cd59e010-1746-49d4-9082-5c3ed156e0f4","html_url":"https://github.com/PriorLabs/dataset-requirements-guide","commit_stats":null,"previous_names":["priorlabs/ds_requirement_guide","priorlabs/dataset-requirements-guide"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PriorLabs/dataset-requirements-guide","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PriorLabs%2Fdataset-requirements-guide","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PriorLabs%2Fdataset-requirements-guide/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PriorLabs%2Fdataset-requirements-guide/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PriorLabs%2Fdataset-requirements-guide/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PriorLabs","download_url":"https://codeload.github.com/PriorLabs/dataset-requirements-guide/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PriorLabs%2Fdataset-requirements-guide/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273826259,"owners_count":25175232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-03T19:39:36.343Z","updated_at":"2025-09-05T21:35:09.466Z","avatar_url":"https://github.com/PriorLabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dataset Integration Guide\n\nWelcome! This guide explains the best way to share your dataset with us. Our goal is to quickly understand your data and establish a strong performance baseline. This ensures a smooth and effective integration with our machine learning models, like **TabPFN**.\n\nTo do this, we ask that you provide a simple notebook that prepares your data and runs a baseline model, such as XGBoost. 
This helps us align on metrics and validate the data format from the start.\n\nTo ensure your datasets work flawlessly with our pipeline, please follow these requirements.\n\n**Note on Requirements:**\nThroughout this guide, certain items are marked as **(Essential)** or **(Optional)**.\n\n- **(Essential)** — These requirements are **critical** for us to process, evaluate, and integrate your dataset effectively.\n- **(Optional)** — These items are not strictly required, but they are **highly encouraged**. Providing them helps us better understand your data, perform more targeted benchmarking, and produce richer insights.\n\n---\n\n## **Option 1: The Baseline Notebook**\n\nThis is the **standard approach** for most projects. Providing a single Jupyter or Colab notebook ensures a **reproducible, low-friction starting point** that we can immediately build upon.\n\n### **Notebook Requirements (Essential)**\n\nYour notebook should cover the following:\n\n1. **Data Preparation and Splitting**\n   - Define `target_column` and `feature_columns` at the start.\n   - Load data, preprocess, and split into training/testing sets.\n   - For **time-series data**, ensure chronological splitting to avoid data leakage.\n\n2. **Baseline Model and Evaluation**\n   - Train a **standard XGBoost** model (or another agreed-upon baseline).\n   - Clearly **specify the performance metrics that matter most for your use case**.\n     The following are common examples — please confirm or adjust based on your project’s priorities:\n       - **Classification:** AUROC, F1-Score, LogLoss, Confusion Matrix\n       - **Regression:** RMSE, MAE, R²\n   - Report results on the **test set** using the agreed metrics.\n\n3. 
**Key Visualizations (Optional)**\n   - Feature importance chart\n   - ROC curve (classification)\n   - Residuals plot (regression)\n\n---\n\n### **Data Privacy (Essential)**\n\nBefore sharing data, **remove or obfuscate all PII** and sensitive information.\n\n- **Remove Direct Identifiers:** Names, emails, addresses, phone numbers, etc.\n- **Anonymize Quasi-Identifiers:** Features like ZIP code, age, job title that could be combined to re-identify individuals.\n  - Techniques: Group rare categories, add statistical noise to sensitive numerical values.\n\n---\n\n### **Submission**\n\n- Package the final notebook **and** associated data files into a `.zip` archive.\n- Provide a **secure download link**.\n\n---\n\n## **Option 2: Multi Dataset Pipeline Integration (Advanced)**\n\nFor **specialized use cases** — such as large-scale benchmarking across multiple datasets or formal fine-tuning programs — we support **direct integration into our multi-dataset testing and fine-tuning pipeline**.\n\nThis approach leverages the `Multi_Dataset_Integration` module included in the repository, which contains the core structure and example implementations to help you get started quickly.\n\n---\n\n### **Contributor Responsibilities**\n\n- **Implement Data Loader:** Create a `DataModule` class that loads, processes, and serves datasets according to our pipeline specs.\n- **Testing:** Validate implementation with provided scripts (`minimal_example.py`).\n- **Automated Checks:** Ensure all tests pass using `pytest`.\n\n---\n\n### **Detailed Requirements for Pipeline Integration**\n\n#### **Data Structure (Essential)**\n\n- **Format:** Pairs of NumPy arrays:\n  - `X` → Features (`n_samples`, `n_features`)\n  - `y` → Target (`n_samples`,)\n- **Data Types:** Numeric (`float`, `int`)\n- **Organization:** Separate `(X, y)` pairs for training, testing, and optionally validation.\n\n#### **Data Quality (Essential)**\n\n- Fully preprocessed (cleaned, imputed, encoded) — ready for 
training.\n- **No leakage** between training, validation, and test sets.\n- **Shape constraints:**\n  - Test/validation ≤ 10,000 samples × 500 features\n  - Training ≤ 5,000 samples × 500 features\n  - Notify us if your dataset exceeds these limits.\n\n#### **Evaluation (Essential)**\n\n- **Primary Metric:** Define (e.g., AUROC, F1-Score).\n- **Aggregation Strategy:** Specify (e.g., mean AUROC, weighted average).\n- **Performance Target:** State improvement goal (e.g., \"+5% AUROC over baseline\").\n\n---\n\n## **Metadata Submission**\n\nInclude a **YAML** metadata file with your dataset.\nThis consolidates **critical information** about the dataset and evaluation protocol.\n\n**Example:**\n```yaml\n# General Information\ndataset_name: \"Credit Risk Analysis\"\ndescription: \"Predicting loan default risk based on historical borrower information.\"\ntime_series: false\n\n# Evaluation Protocol\nprimary_metric: \"AUROC\"\naggregation_strategy: \"mean\" # Aggregate test set performance by taking the mean AUROC.\ntarget_performance: \"+3 pp AUROC over baseline\"\nkey_datasets: # Optional: list high-priority datasets for evaluation.\n  - \"credit_risk_v1.npy\"\n  - \"mortgage_loans_q3.npy\"\n\n# Data Specification\ndata_format: \"numpy\"\npreprocessing_steps: \"StandardScaler for numerical features, OneHotEncoder for categorical.\"\nfeature_names: # Optional: only if features are consistent across datasets.\n  - \"age\"\n  - \"annual_income\"\n  - \"credit_score\"\n  - \"employment_length_years\"\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriorlabs%2Fdataset-requirements-guide","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpriorlabs%2Fdataset-requirements-guide","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriorlabs%2Fdataset-requirements-guide/lists"}