{"id":29704320,"url":"https://github.com/eth-sri/type-constrained-code-generation","last_synced_at":"2025-07-30T21:19:05.581Z","repository":{"id":289527980,"uuid":"944698323","full_name":"eth-sri/type-constrained-code-generation","owner":"eth-sri","description":"Reproduction Package for the paper \"Type-Constrained Code Generation with Language Models\" [PLDI 2025]","archived":false,"fork":false,"pushed_at":"2025-06-11T20:06:44.000Z","size":9365,"stargazers_count":66,"open_issues_count":0,"forks_count":2,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-07-21T22:10:46.760Z","etag":null,"topics":["code-synthesis","constrained-decoding","llm","type-systems"],"latest_commit_sha":null,"homepage":"https://pldi25.sigplan.org/details/pldi-2025-papers/25/Type-Constrained-Code-Generation-with-Language-Models","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eth-sri.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-07T20:07:10.000Z","updated_at":"2025-07-20T23:54:15.000Z","dependencies_parsed_at":"2025-04-23T18:54:55.338Z","dependency_job_id":"05f21828-092d-47b0-bedd-1a155ba90cb9","html_url":"https://github.com/eth-sri/type-constrained-code-generation","commit_stats":null,"previous_names":["eth-sri/type-constrained-code-generation"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/eth-sri/type-constrained-code-generation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-sri%2Ftype-constrained-code-generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-sri%2Ftype-constrained-code-generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-sri%2Ftype-constrained-code-generation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-sri%2Ftype-constrained-code-generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eth-sri","download_url":"https://codeload.github.com/eth-sri/type-constrained-code-generation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-sri%2Ftype-constrained-code-generation/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266691580,"owners_count":23969182,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-synthesis","constrained-decoding","llm","type-systems"],"created_at":"2025-07-23
T14:10:05.321Z","updated_at":"2025-07-23T14:10:06.963Z","avatar_url":"https://github.com/eth-sri.png","language":"Python","readme":"Type-Constrained Code Generation with Language Models\n=====================================================\n[![arXiv](https://img.shields.io/badge/arXiv-2504.09246-b31b1b.svg)](https://arxiv.org/abs/2504.09246)\n[![QA \u0026 Tests](https://github.com/eth-sri/type-constrained-code-generation/actions/workflows/tests.yml/badge.svg)](https://github.com/eth-sri/type-constrained-code-generation/actions/workflows/tests.yml)\n\n\nThis is an implementation of a completion engine that parses type safe programs incrementally, guaranteeing that intermediate outputs can be completed to type-safe programs.\nThe completion enginge can be used to constrain the sampling from an LLM model to only type-safe programs.\nThe implementation currently only handles TypeScript.\n\nMore details on the properties of the completion engine and supported features can be found in the paper [Type-Constrained Code Generation with Language Models](https://arxiv.org/abs/2504.09246).\n\n### Overview\nWhen set-up correctly, the package can be used to sample type-safe TypeScript programs from a language model.\nThe following will incrementally generate the code for a TypeScript merge sort function, while ensuring that the generated code is type-safe:\n\n```python\nfrom typesafe_llm.sampling import sample_constrained\n\nsample_constrained(\n    prompt=\"function merge_sort(x:number[]):number[] {\",\n    max_tokens=100,\n    device=\"cuda\",\n    model_name = \"google/gemma-2-2b-it\",\n    temperature=0,\n    do_sample=False,\n    trace=True,\n)\nprint(\"Generation completed\")\n```\n\nThe project contains two main parts:\n- The sampling algorithm, which is used to sample type-safe TypeScript programs from a language model.\n- The parser, which is used to parse TypeScript programs and check their completability to type-safe programs.\n\n### Setup\n\nTo install the package, we recommend setting up a conda environment using NVIDIA GPUs.\n\n```bash\ngit clone https://github.com/eth-sri/type-constrained-code-generation.git\ncd type-constrained-code-generation  \nconda create -n typesafe_llm python=3.11\nconda activate typesafe_llm\n\n# for LLM inference\n# set up torch\nconda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y\n# install flash-attention\npip install flash-attn==2.7.3 --no-build-isolation\n\n# install package\npip install -e .\n```\n\nIf you only want to use the parser and do not want to sample from a language model, you can skip the installation of `torch` and `flash-attention`.\n\n### Programmatic Usage\n\n#### LLM Sampling\n\nTo sample type-safe TypeScript programs from a language model, you can use the `sample_constrained` function from the `typesafe_llm.sampling` module.\n\n```python\nfrom typesafe_llm.sampling import sample_constrained\n\nsample = sample_constrained(\n    prompt=\"function merge_sort(x:number[]):number[] {\",\n    max_tokens=100,\n    device=\"cuda\",\n    model_name = \"google/gemma-2-2b-it\",\n    temperature=0.1,\n    do_sample=True,\n)\nprint(sample)\n```\n\nIf GPU is available, set device to \"cuda\", on MacBook Pro set device to \"mps\" (when pytorch nightly is installed).\nSetting the device to \"cpu\" always works.\n`trace` controls a debugging output for live debugging of the generation process.\nSet to False for programmatic use.\n\n#### Incremental TypeScript parsing\n\nYou can also independently use the parser to parse 
TypeScript programs and check their completability.\n\n```python\nfrom typesafe_llm.parser.parser_ts import parse_ts_program\n\nstates = parse_ts_program(\"let x:number = 1;x;\")\nprint(list(states))\n# only one accepting state\n\nstates = parse_ts_program('let x:number = \"he')\nprint(list(states))\n# some accepting states, could be saved by y\".length\n\nstates = parse_ts_program('let x:boolean = 1 \u003c \"hey\" +')\nprint(list(states))\n# no states, can not turn \"hey\" + ... into a number, but need number for \u003c operator\n\nstates = parse_ts_program('let x:number = 1;let y')\nprint(list(states))\n# two partial states, one where the second variable has name \"y\" and one where it is not completed yet\n```\n\n### Tests\n\nTo run the tests, you can use the following command:\n\n```bash\npytest test\n```\n\n## Reproducing experiments\n\nIn this section we provide an overview on how to reproduce the experiments presented in our [paper](https://arxiv.org/abs/2504.09246).\n\n### Requirements\n\nTo reproduce our experiments locally, it is required to have higher-end GPUs, e.g. NVIDIA A100 80GB. The package includes setup scripts for all software requirements using miniconda. Required Hardware / Software:\n\n- x86/64 architecture CPUs\n- 80GB GPU VRAM\n- CUDA 12.4 or newer\n\nFurther the Gemma 2 model family requires accepting an EULA. Please create a huggingface account and visit the model websites to accept the EULA.\n- https://huggingface.co/google/gemma-2b-it\n- https://huggingface.co/google/gemma-9b-it\n- https://huggingface.co/google/gemma-27b-it\n\nYou will later be requested for a Hugginface Access Token. Log in with the account with which you accepted the EULA and visit [the Access Token page](https://huggingface.co/settings/tokens) to generate an access token: https://huggingface.co/settings/tokens\n\n### Setup\n\nFollow the installation instructions to install conda and all dependencies for the experiments:\n\n```bash\nbash ./setup_conda.sh\n# Restart your shell\nbash ./setup_env.sh \n# NOTE: Some models are guarded on huggingface, so you will need to visit their model page, accept the EULA and enter the huggingface Access Token to your account when prompted. See section \"Requirements\" for more details.\n```\n\n\u003e Important note: Before running the experiments, you need to download the models and datasets used for the experiments.\n\nWe provide a script to download the required dataset and models for our experiments. This script must be run before starting the experiments.\nYou may specify models to download by passing the `models` paramater.\n\n```bash\npython3 experiments/main/download_models.py --models google/gemma-2-2b-it,google/gemma-2-9b-it\n```\n\nTo download all required models and datasets, run the following command:\n\n```bash\npython3 experiments/main/download_models.py\n```\n\n\n### Warming up\n\nTo warm up, we start by reproducing the result for synthesis of the smallest model (Gemma 2 2B) and the MBPP dataset. To avoid using busy GPUs in a shared setting, use command `nvidia-smi` to check which GPUs are free. Then specify the IDs of GPUs you want to use by setting the `CUDA_VISIBLE_DEVICES` environment variable.  
### Warming up

To warm up, we start by reproducing the synthesis result for the smallest model (Gemma 2 2B) on the MBPP dataset. To avoid using busy GPUs in a shared setting, use the command `nvidia-smi` to check which GPUs are free. Then specify the IDs of the GPUs you want to use by setting the `CUDA_VISIBLE_DEVICES` environment variable.
If you want to use GPUs 0 and 1, run the following command:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 experiments/main/run_experiments_syn_tran.py --models google/gemma-2-2b-it --tasks synth --subsets mbpp
```

This reproduces the results for Gemma 2 2B on the synthesis task on MBPP.
The experiment should finish within approximately 4 hours on a single GPU.
The results of the experiment (and all other results) will be stored in `experiments/main/results` in an appropriately named `jsonl` file. The general schema is `experiments/main/results/<subset>_<model>_s=<seed>_t=<temperature>_<task>_<constrained>.jsonl`. In this concrete example, `experiments/main/results/mbpp_google_gemma-2-2b-it_s=0_t=1_synth_nc.jsonl` and `..._c.jsonl` for the unconstrained and type-constrained variants, respectively.

> The experiment runs can be cancelled at any time; intermediate results are stored in the `jsonl` files. Upon restarting, the script will automatically pick up from the last completed instance and continue from there. It may happen that running tasks daemonize and continue running (check `nvidia-smi`). Make sure to kill them manually before restarting.

Our experiment script automatically distributes jobs over the indicated GPUs.
The script repeatedly checks whether running jobs have completed and GPUs have become available. You will therefore see output like the following:
```
+ CUDA_VISIBLE_DEVICES=0 python3 inference_multiple.py --max-tokens 1000 --timeout 300 --model_name google/gemma-2-2b-it --seed 0 --temp 1 --subset mbpp  --try_top_k 10000000000000000 --constrained False --output_file 'results/mbpp_google_gemma-2-2b-it_s=0_t=1_synth_nc.jsonl'
+ CUDA_VISIBLE_DEVICES=1 python3 inference_multiple.py --max-tokens 1000 --timeout 300 --model_name google/gemma-2-2b-it --seed 0 --temp 1 --subset mbpp  --try_top_k 10000000000000000 --constrained True --output_file 'results/mbpp_google_gemma-2-2b-it_s=0_t=1_synth_c.jsonl'
Total jobs: 2, Running jobs: 2, Remaining jobs: 0. Waiting for running jobs to finish...
```

The following commands reproduce the results for the translation task and the repair task on MBPP; each should take around 4 hours:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 experiments/main/run_experiments_syn_tran.py --models google/gemma-2-2b-it --tasks translate --subsets mbpp
CUDA_VISIBLE_DEVICES=0,1 python3 experiments/main/run_experiments_repair.py --models google/gemma-2-2b-it --subsets mbpp
```
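Since intermediate results are appended to the `jsonl` files as the runs progress, you can check progress with a few lines of Python. The sketch below is illustrative (not part of the repo) and assumes one JSON line per completed instance, which matches how the resume behavior is described above:

```python
# Illustrative progress check: count completed instances per results file,
# assuming one JSON line per completed instance.
from pathlib import Path

results_dir = Path("experiments/main/results")
for path in sorted(results_dir.glob("*.jsonl")):
    with path.open() as f:
        completed = sum(1 for line in f if line.strip())
    print(f"{path.name}: {completed} completed instances")
```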
### Running more experiments

You can run more experiments for synthesis and translation by providing different models (`--models`), tasks (`--tasks`), and benchmarks (`--subsets`).
Remember to use `CUDA_VISIBLE_DEVICES`.
Note that a single 80 GB GPU provides sufficient VRAM to host any model used in our experiments.

```bash
CUDA_VISIBLE_DEVICES=0 python3 experiments/main/run_experiments_syn_tran.py --models google/gemma-2-2b-it,google/gemma-2-9b-it --tasks synth --subsets mbpp,humaneval
CUDA_VISIBLE_DEVICES=0 python3 experiments/main/run_experiments_syn_tran.py --models Qwen/Qwen2.5-32B-Instruct --tasks translate --subsets mbpp
```

You can similarly start the repair task:

```bash
CUDA_VISIBLE_DEVICES=0 python3 experiments/main/run_experiments_repair.py --models google/gemma-2-2b-it,google/gemma-2-9b-it --subsets mbpp,humaneval
CUDA_VISIBLE_DEVICES=0 python3 experiments/main/run_experiments_repair.py --models Qwen/Qwen2.5-32B-Instruct --subsets mbpp
```

Below is the list of all options for these parameters. Running all of them covers all our experiments but can take several days; for the sake of time, reviewers may check a subset they are interested in (a sketch enumerating the full grid follows at the end of this section).

```bash
FLAGS
    --models=MODELS
        Default: google/gemma-2-2b-it,google/gemma-2-9b-it,google/gemma-2-27b-it,deepseek-ai/deepseek-coder-33b-instruct,codellama/CodeLlama-34b-Instruct-hf,Qwen/Qwen2.5-32B-Instruct
    --tasks=TASKS (only for experiments/main/run_experiments_syn_tran.py)
        Default: synth,translate
    --subsets=SUBSETS
        Default: humaneval,mbpp
```

To see the full list of available parameters, run:

```bash
python3 experiments/main/run_experiments_syn_tran.py --help
python3 experiments/main/run_experiments_repair.py --help
```
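The following sketch (illustrative, not part of the repo) enumerates the full experiment grid from the default flag values listed above and prints one command per configuration:

```python
# Illustrative: print one command per (model, task, subset) configuration
# from the default flag values documented above.
from itertools import product

MODELS = [
    "google/gemma-2-2b-it", "google/gemma-2-9b-it", "google/gemma-2-27b-it",
    "deepseek-ai/deepseek-coder-33b-instruct",
    "codellama/CodeLlama-34b-Instruct-hf", "Qwen/Qwen2.5-32B-Instruct",
]
TASKS = ["synth", "translate"]  # repair is covered by run_experiments_repair.py
SUBSETS = ["humaneval", "mbpp"]

for model, task, subset in product(MODELS, TASKS, SUBSETS):
    print(
        "python3 experiments/main/run_experiments_syn_tran.py "
        f"--models {model} --tasks {task} --subsets {subset}"
    )
```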
### Execution Time of Benchmarks

The runtime of our main experiments depends on the choice of datasets, tasks, and models. Generally, larger datasets and larger models result in longer execution times.

Our benchmark features the MBPP and HumanEval datasets, adapted for three tasks: synthesis, translation, and repair.
Taking into account additional instances due to running on several seeds, the experiments can be ordered by increasing runtime as: MBPP-repair, HumanEval-repair, MBPP-{synthesis,translation}, and HumanEval-{synthesis,translation}.

Our evaluation further features six models, in order of increasing parameter count: Gemma 2 2B, Gemma 2 9B, Gemma 2 27B, Qwen 2.5 32B, DeepSeek Coder 33B, and CodeLlama 34B.

Thus, the quickest experiment is computing the performance of Gemma 2 2B synthesis on MBPP, taking approximately 4 hours on a single GPU. The longest experiment is computing the performance of CodeLlama 34B synthesis on HumanEval.

### Recreating Figures

You can run the following command to produce the figures for the paper. You may run this script with partial results, in which case you will receive a printout of the missing results, and their positions in the table will be substituted with "-1".

```bash
bash experiments/main/figures.sh
```

The results map to the corresponding figures in the paper as follows:
- Tables 2 and 3: all models, all tasks, all datasets, i.e., `[mbpp|humaneval]_*_s=[0|1|2|3]_t=1_[synth|translate|repair-all]_[c|nc].jsonl`. Vanilla and Syntax can be computed based on the non-constrained (`nc`) variants.
- Table 4: all models, synthesis, all datasets, i.e., `[mbpp|humaneval]_*_s=[0|1|2|3]_t=1_synth_[c|nc].jsonl`
- Figure 8: Gemma 2 2B, synthesis, HumanEval, i.e., `humaneval_google_gemma-2-2b-it_s=[0|1|2|3]_t=1_synth_[c|nc].jsonl`

Since running the entire pipeline takes several days using 8 GPUs, we have included our raw data in the `experiments/main/results_paper` directory. You can run the figures script directly on the submitted results, without running the experiments, like this:

```bash
bash experiments/main/figures.sh results_paper
```

> Note: Table 4 is a runtime table. You should expect the runtime per instance to differ based on the CPU and GPU used; however, the *runtime increase* should be consistent with our findings.

## Project Structure

The core part of our work is the implementation of a completion engine that incrementally parses type-safe TypeScript programs.
The completion engine can then be used to constrain the sampling from an LLM to only generate type-safe programs.

This project is organized as a Python package.
The relevant code for the implementation of type-constrained decoding and sampling is located in the `typesafe_llm` directory.
The experiments are located in the `experiments` directory.
We further provide a test suite in the `test` directory.
The usage of the latter two is described above.
In the following sections we describe the structure of the `typesafe_llm` package.

### (Constrained) Sampling (Algorithm 1)

The sampling algorithm presented in Section 2.1 of the paper is located in `typesafe_llm/sampling.py`.
It uses the `transformers` library to obtain predictions from a language model and sample from them; if constraining is enabled, it runs a parser in parallel to reject invalid programs (`sample_constrained`).

### Prefix Automaton Definition and Base Automata (Section 3.2)

The prefix automaton is defined in `typesafe_llm/automata/parser_base.py`.
The automaton is implicitly defined by the transition function and acceptance status of each state, subclassing from `IncrementalParserState`.
A state indicates that it is accepting by setting the field `accept` to True.
The transition function is invoked via the method `parse_char` and returns a list of new states that can be reached by parsing the given character.
The file further contains the definitions of concatenation, union, Kleene plus, and terminal automata.

### Identifiers, Literals and Types (Section 3.3)

The automaton for identifiers (`ExistingIdentifierParserState`) is the first automaton defined in `typesafe_llm/automata/parser_ts.py`.
The following automata parse literals (`LiteralParserState` and its subclasses), including more advanced literals such as regular expressions and template strings.

The automaton for types is defined separately in `typesafe_llm/automata/parser_ts_types.py`.

### Expressions (Section 3.4)

The expression automaton is defined in `typesafe_llm/automata/parser_ts.py` in the class `ExpressionParserState`.
It implements the extension logic and the pruning of invalid transitions due to operator precedence and type constraints.
The derivability algorithm is implemented for each state individually in the method `derivable`.
It determines the directly derivable types and calls the reachability algorithm with them.
The type reachability algorithm is implemented in `typesafe_llm/parser/types_ts.py` in the method `reachable`, leveraging `_reachable_bfs`, a straightforward breadth-first search translation of the presented reachability algorithm.
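To give a flavor of the reachability check, here is a minimal, self-contained BFS sketch over a toy type graph. This is illustrative only; the actual `_reachable_bfs` in `typesafe_llm/parser/types_ts.py` operates on TypeScript types and is more involved:

```python
# Illustrative BFS reachability over a toy type graph: edges map a type
# to the types derivable from it in one step (operator, member access, ...).
from collections import deque

def reachable(graph: dict[str, set[str]], sources: set[str], target: str) -> bool:
    queue = deque(sources)
    seen = set(sources)
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for successor in graph.get(current, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False

# e.g. "he".length : number, and number < number : boolean
toy_graph = {"string": {"number"}, "number": {"boolean"}}
print(reachable(toy_graph, {"string"}, "boolean"))  # True
```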
### Statements and the entire Program (Section 3.5)

The automaton for statements is defined in `typesafe_llm/automata/parser_ts.py` in the class `StatementParserState`.
It handles the constraining for valid return types.
The automaton for the entire program is defined in `typesafe_llm/automata/parser_ts.py` in the class `ProgramParserState`.


## FAQ

### Can you reuse compilers for type-constrained decoding?

No. The problem with traditional compilers is that they only provide feedback on a *completed* program. Meanwhile, to guide the LLM effectively during generation, we need feedback on partially generated programs. Therefore, compilers cannot be reused for type-constrained decoding.

### Can you reuse LSPs / Static Analyzers / Tree-Sitter / etc. for type-constrained decoding?

No. These systems are designed to aid humans during development, not for reliable incremental parsing. As such, while LSPs and other systems are helpful and may be able to handle some partial programs, they usually do not guarantee being able to handle *arbitrary* partial programs. For example, LSPs are useful for providing possible members of objects or parameter types for calls, and have been used for this purpose [1,2]; however, they cannot always handle partial syntax trees, may not provide help when they fail to derive an object's type, and cannot predict whether a partial expression can be completed into the type required by the current context. To reliably provide steering and constraints for all partial programs, we had to build our custom incremental parser.

[1] Agrawal et al., "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", NeurIPS 2023 ([link](https://openreview.net/forum?id=qPUbKxKvXq))  
[2] Gvero et al., "Complete Completion using Types and Weights", PLDI 2013 ([link](https://dl.acm.org/doi/10.1145/2499370.2462192))

### Are you aware of any implementation in a language other than TypeScript?

No. As far as we know, such a constraining algorithm has to be implemented manually for every language. As such, we are not aware of any implementations of our method for other languages yet (as of June 2025).