{"id":15347629,"url":"https://github.com/michaelmior/annotate-schema","last_synced_at":"2025-04-15T04:12:50.318Z","repository":{"id":181087141,"uuid":"666129864","full_name":"michaelmior/annotate-schema","owner":"michaelmior","description":null,"archived":false,"fork":false,"pushed_at":"2025-04-14T17:42:42.000Z","size":9235,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T04:12:43.639Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaelmior.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-13T19:31:29.000Z","updated_at":"2025-04-14T17:42:47.000Z","dependencies_parsed_at":"2023-12-25T20:20:12.125Z","dependency_job_id":"eed779e0-1a51-45da-8723-77a5aee812d6","html_url":"https://github.com/michaelmior/annotate-schema","commit_stats":{"total_commits":184,"total_committers":2,"mean_commits":92.0,"dds":0.08695652173913049,"last_synced_commit":"7286e776b5845041cfb671f251f6f4a98a7accd5"},"previous_names":["michaelmior/annotate-schema"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fannotate-schema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fannotate-schema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fannotate-schema/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/michaelmior%2Fannotate-schema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaelmior","download_url":"https://codeload.github.com/michaelmior/annotate-schema/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249003956,"owners_count":21196793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T11:36:47.218Z","updated_at":"2025-04-15T04:12:50.301Z","avatar_url":"https://github.com/michaelmior.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Schema annotation\n[![CI](https://github.com/michaelmior/annotate-schema/actions/workflows/ci.yml/badge.svg)](https://github.com/michaelmior/annotate-schema/actions/workflows/ci.yml)\n[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/michaelmior/annotate-schema/main.svg)](https://results.pre-commit.ci/latest/github/michaelmior/annotate-schema/main)\n\nThis repository contains scripts which attempt to augment a provided JSON Schema using the power of LLMs.\n\n## Description generation\n\nThis works by generating a prompt for each JSON path in the schema and then executing an LLM to generate a description for each attribute.\nFor example, the schema below has a single property `foo`.\n\n    {\n      \"type\": \"object\",\n      \"properties\": {\n        \"foo\": {\n          \"type\": \"string\"\n        }\n      }\n    }\n\nFor this, we generate a prompt like the following:\n\n    {\n      \"type\": \"object\",\n      
\"properties\": {\n        \"foo\": {\n          \"type\": \"string\",\n          \"description\": \"\n\nNote that the prompt ends after the description is started.\nGeneration continues until an unescaped closing quote is encountered.\nTo run with your own schema, provide it on standard input.\nThe resulting schema with descriptions is written to standard output.\n\n    pipenv run python annotate_schema.py \u003c input_schema.json\n\n### Schema type\n\nWhile the input to this script is always a JSON Schema, multiple possible formats can be used to derive descriptions during code generation.\nIn addition to `jsonschema`, current options include [`zod`](https://zod.dev/) and [`typescript`](https://www.typescriptlang.org/docs/handbook/2/objects.html).\n\n### Models\n\nMultiple possible models can be used for the generation.\nCurrently most models which support `AutoModelForCausalLM` and `AutoTokenizer` should work.\nThe specific model can be specified with the `-m/--model` flag.\nNote that some models may not currently support GPU inference.\nIf a model produces errors, try running again with the `--cpu` flag.\nA few examples are given below.\n\n- `bigcode/santacoder`\n- `bigcode/starcoder`\n- `facebook/incoder-1B`\n- `facebook/incoder-6B`\n- `replit/replit-code-v1-3b`\n- `replit/replit-code-v1_5-3b`\n- `Salesforce/codegen-350M-mono`\n- `Salesforce/codegen-350M-multi`\n- `Salesforce/codegen-6B-mono`\n- `Salesforce/codegen-6B-multi`\n- `Salesforce/codegen-16B-mono`\n- `Salesforce/codegen-16B-multi`\n- `Salesforce/codegen2-1B`\n- `Salesforce/codegen2-7B`\n- `Salesforce/codegen2-16B`\n- `Salesforce/codegen25-7b-mono`\n- `Salesforce/codegen25-7b-multi`\n- `TheBloke/Codegen25-7B-mono-GPTQ` (with `--model-basename gptq_model-4bit-128g`)\n\n## Naming definitions\n\nThe `name_definitions.py` script attempts to generate meaningful names for definitions in a schema provided on standard input.\nSince this makes use of infill, currently it only works with Facebook's InCoder 
models.\nFor example, given a schema containing the definition below, the name `defn0` will be replaced with `person`.\n\n```json\n{\n  \"definitions\": {\n    \"defn0\": {\n      \"type\": \"object\",\n      \"properties\": {\n        \"id\": { \"type\": \"string\" },\n        \"name\": { \"type\": \"string\" },\n        \"age\": { \"type\": \"integer\" },\n        \"gender\": { \"type\": \"string\" },\n        \"email\": { \"type\": \"string\" }\n      }\n    }\n  },\n  ...\n}\n```\n\n### Models\n\nAny model trained with masked language modeling (MLM) should work here.\nIn addition, Facebook's InCoder models are supported.\n\n- `facebook/incoder-1B`\n- `facebook/incoder-6B`\n- `huggingface/CodeBERTa-small-v1`\n- `microsoft/codebert-base`\n- `microsoft/codebert-base-mlm`\n- `neulab/codebert-javascript`\n\n## Selecting relevant keywords\n\nWhen discovering a schema from data, it's possible to generate keywords such as `minLength` for all string properties.\nHowever, not all of those keywords are necessarily relevant for inclusion in the final schema, and some may simply be overfit to the dataset.\nTo solve this problem, you can train a model on real-world schemas to predict whether a keyword should be included.\n\n```bash\n# Download the schemas from JSON Schema Store\n$ ./download_schemas.sh\n\n# Extract training data from the downloaded schemas\n$ pipenv run python extract_keywords.py \u003e extracted.json\n\n# Train the model\n$ pipenv run python embed_training.py extracted.json\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelmior%2Fannotate-schema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaelmior%2Fannotate-schema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelmior%2Fannotate-schema/lists"}