{"id":13572474,"url":"https://github.com/GammaTauAI/opentau","last_synced_at":"2025-04-04T10:30:43.829Z","repository":{"id":62976298,"uuid":"531796125","full_name":"GammaTauAI/opentau","owner":"GammaTauAI","description":"Using Large Language Models for Repo-wide Type Prediction","archived":false,"fork":false,"pushed_at":"2023-12-10T02:41:41.000Z","size":11356,"stargazers_count":92,"open_issues_count":1,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-02T14:12:33.474Z","etag":null,"topics":["ai","llm","openai","rust","type-inference","typescript"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GammaTauAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-09-02T05:41:01.000Z","updated_at":"2024-07-19T05:28:28.000Z","dependencies_parsed_at":"2023-10-12T06:10:25.578Z","dependency_job_id":null,"html_url":"https://github.com/GammaTauAI/opentau","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GammaTauAI%2Fopentau","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GammaTauAI%2Fopentau/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GammaTauAI%2Fopentau/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GammaTauAI%2Fopentau/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GammaTauAI","download_url":"https://codeload.github.com/GammaTauAI/opentau/tar.gz/refs/heads/main","
host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223123767,"owners_count":17091169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm","openai","rust","type-inference","typescript"],"created_at":"2024-08-01T14:01:24.248Z","updated_at":"2024-11-05T05:31:26.208Z","avatar_url":"https://github.com/GammaTauAI.png","language":"Rust","funding_links":[],"categories":["Rust","Applications"],"sub_categories":[],"readme":"# OpenTau: Using Large Language Models for Gradual Type Inference\n\nImplementation for the paper: [Type Prediction With Program Decomposition and Fill-in-the-Type Training. Federico Cassano, Ming-Ho Yee, Noah Shinn, Arjun Guha, Steven Holtzen.](https://arxiv.org/abs/2305.17145)\n\nType inference for gradually-typed languages such as TypeScript and Python has become increasingly prevalent in the field of programming languages.\nHowever, current approaches often struggle with inferring descriptive types in cases in which user-defined type annotations are absent,\nespecially when inferring function signatures.\n\nThis has motivated automated type prediction: given an untyped program, produce a well-typed output program. Large language models (LLMs) are promising for type prediction, but there are challenges: fill-in-the-middle performs poorly, programs may not fit into the context window, generated types may not type check, and it is difficult to measure how well-typed the output program is. We address these challenges by building OpenTau, a search-based approach for type prediction that leverages large language models. 
We propose a new metric for type prediction quality, give a tree-based program decomposition that searches a space of generated types, and present fill-in-the-type fine-tuning for LLMs. We evaluate our work with a new dataset for TypeScript type prediction, and show that 47.4% of files type check (14.5% absolute improvement) with an overall rate of 3.3 type errors per file.\n\nAdditionally, we define two protocols, one for adding languages and one for adding models.\nIn our work, we implement a TypeScript compiler that follows the language protocol and a SantaCoder server that\nfollows the model protocol.\nAn optional OpenAI model endpoint also implements the model protocol, but it is unmaintained and not recommended for use.\nImplementing the respective protocols is relatively straightforward. More information can be found in our [class final project submission](https://github.com/GammaTauAI/opentau/blob/main/docs/final_report.md) (as this work started as a class project for [CS 4100 at Northeastern University](https://www.khoury.northeastern.edu/home/sholtzen/assets/pdf/cs4100-fall22-syllabus.pdf)).\n\n## Cite\n\n```bibtex\n@misc{cassano2023type,\n      title={Type Prediction With Program Decomposition and Fill-in-the-Type Training}, \n      author={Federico Cassano and Ming-Ho Yee and Noah Shinn and Arjun Guha and Steven Holtzen},\n      year={2023},\n      eprint={2305.17145},\n      archivePrefix={arXiv},\n      primaryClass={cs.SE}\n}\n```\n\n## Usage\n\nWe have implemented OpenTau in Rust. It can be used in three ways:\n\n1. As a simple CLI client that will type-infer a given program. (more info in `./client`)\n2. As a library that exposes numerous abstractions for interacting with different compilers, models, and type prediction strategies. (more info in `./client`)\n3. As an evaluation tool, to analyze the performance of the combinations of different models, languages, datasets, and type prediction strategies\n   on the task of type prediction. 
(more info in `./evaluator`)\n\nWe are in the review process for our paper:\n[Type Prediction With Program Decomposition and Fill-in-the-Type Training. Federico Cassano, Ming-Ho Yee, Noah Shinn, Arjun Guha, Steven Holtzen.](https://arxiv.org/abs/2305.17145)\n\n## Requirements\n\n- `rust`\n- Incoder/SantaCoder model requirements:\n  - `torch`\n  - `tokenizers\u003e=0.12`\n  - `transformers`\n- TypeScript compiler requirements:\n  - `ts-node`\n  - `tsc`\n- Python compiler requirements (work in progress):\n  - `mypy` or `pyright` for static type checking\n  - `redbaron` for AST parsing with comments\n- `pandoc` (only needed for building the report)\n\n## Installation\n\nRun `make` from within the repository directory.\n\nThe output binary (symlinked) will be at `/out/client`.\n\n## Example completion\n\nOur system was able to type-infer this program:\n\n```ts\nconst findAllPeople = function (n, meetings, firstPerson) {\n  meetings.sort((a, b) =\u003e a[2] - b[2]);\n  const uf = new UnionFind(n);\n  uf.connect(0, firstPerson);\n  let ppl = [];\n  for (let i = 0, len = meetings.length; i \u003c len; ) {\n    ppl = [];\n    let time = meetings[i][2];\n    while (i \u003c len \u0026\u0026 meetings[i][2] === time) {\n      uf.connect(meetings[i][0], meetings[i][1]);\n      ppl.push(meetings[i][0]);\n      ppl.push(meetings[i][1]);\n      i++;\n    }\n    for (let n of ppl) {\n      if (!uf.connected(0, n)) uf.reset(n);\n    }\n  }\n  let ans = [];\n  for (let i = 0; i \u003c n; ++i) {\n    if (uf.connected(0, i)) ans.push(i);\n  }\n  return ans;\n};\n\nclass UnionFind {\n  arr;\n\n  constructor(n) {\n    this.arr = Array(n).fill(null);\n    this.arr.forEach((e, i, arr) =\u003e (arr[i] = i));\n  }\n  connect(a, b) {\n    this.arr[this.find(a)] = this.find(this.arr[b]);\n  }\n  find(a) {\n    return this.arr[a] === a ? 
a : (this.arr[a] = this.find(this.arr[a]));\n  }\n  connected(a, b) {\n    return this.find(a) === this.find(b);\n  }\n  reset(a) {\n    this.arr[a] = a;\n  }\n}\n```\n\nAnd annotate it with these types:\n\n```ts\nconst findAllPeople: (\n  n: number,\n  meetings: number[][],\n  firstPerson: number\n) =\u003e number[] = function (n, meetings, firstPerson) {\n  meetings.sort((a, b) =\u003e a[2] - b[2]);\n  const uf: UnionFind = new UnionFind(n);\n  uf.connect(0, firstPerson);\n  let ppl: number[] = [];\n  for (let i = 0, len = meetings.length; i \u003c len; ) {\n    ppl = [];\n    let time: number = meetings[i][2];\n    while (i \u003c len \u0026\u0026 meetings[i][2] === time) {\n      uf.connect(meetings[i][0], meetings[i][1]);\n      ppl.push(meetings[i][0]);\n      ppl.push(meetings[i][1]);\n      i++;\n    }\n    for (let n of ppl) {\n      if (!uf.connected(0, n)) uf.reset(n);\n    }\n  }\n  let ans: number[] = [];\n  for (let i = 0; i \u003c n; ++i) {\n    if (uf.connected(0, i)) ans.push(i);\n  }\n  return ans;\n};\n\nclass UnionFind {\n  arr: number[];\n  constructor(n) {\n    this.arr = Array(n).fill(null);\n    this.arr.forEach((e, i, arr) =\u003e (arr[i] = i));\n  }\n  connect(a: number, b: number): void {\n    this.arr[this.find(a)] = this.find(this.arr[b]);\n  }\n  find(a: number): number {\n    return this.arr[a] === a ? 
a : (this.arr[a] = this.find(this.arr[a]));\n  }\n  connected(a: number, b: number): boolean {\n    return this.find(a) === this.find(b);\n  }\n  reset(a: number): void {\n    this.arr[a] = a;\n  }\n}\n```\n\nWhile TypeScript's type inference only managed to infer these types (too many `any`s and loose typing):\n\n```ts\nconst findAllPeople = function (n: number, meetings: any[], firstPerson: any) {\n  meetings.sort((a: number[], b: number[]) =\u003e a[2] - b[2]);\n  const uf = new UnionFind(n);\n  uf.connect(0, firstPerson);\n  let ppl = [];\n  for (let i = 0, len = meetings.length; i \u003c len; ) {\n    ppl = [];\n    let time = meetings[i][2];\n    while (i \u003c len \u0026\u0026 meetings[i][2] === time) {\n      uf.connect(meetings[i][0], meetings[i][1]);\n      ppl.push(meetings[i][0]);\n      ppl.push(meetings[i][1]);\n      i++;\n    }\n    for (let n of ppl) {\n      if (!uf.connected(0, n)) uf.reset(n);\n    }\n  }\n  let ans = [];\n  for (let i = 0; i \u003c n; ++i) {\n    if (uf.connected(0, i)) ans.push(i);\n  }\n  return ans;\n};\n\nclass UnionFind {\n  arr: any[];\n\n  constructor(n: any) {\n    this.arr = Array(n).fill(null);\n    this.arr.forEach(\n      (e: any, i: string | number, arr: { [x: string]: any }) =\u003e (arr[i] = i)\n    );\n  }\n  connect(a: number, b: string | number) {\n    this.arr[this.find(a)] = this.find(this.arr[b]);\n  }\n  find(a: string | number) {\n    return this.arr[a] === a ? a : (this.arr[a] = this.find(this.arr[a]));\n  }\n  connected(a: number, b: number) {\n    return this.find(a) === this.find(b);\n  }\n  reset(a: string | number) {\n    this.arr[a] = a;\n  }\n}\n```\n\nNote that TypeScript's inference type annotated non let-bound arrow functions, while our system didn't. We believe that these functions should be left untyped, as the signature of the function that calls them should be typed, and TypeScript's type-inference should enforce those rules. 
Our system does not fight TypeScript's type inference; it tries to work alongside it. Additionally, our system does not perform any type migration, i.e. it does not change types that are already defined. This further reinforces the cooperation between our system and TypeScript's inference.\n\n#### Another Example: Generics Inference\n\nOur system can also fill in generic types. For example, it takes this program:\n\n```ts\nvar sumFourDivisors = function (nums) {\n  let res = 0;\n\n  for (const e of nums) {\n    const set = helper(e);\n    if (set.size === 4) {\n      for (const i of set) res += i;\n    }\n  }\n\n  return res;\n\n  function helper(num) {\n    const set = new Set();\n    const r = ~~(Math.sqrt(num) + 1);\n    for (let i = 1; i \u003c r; i++) {\n      if (num % i === 0) {\n        set.add(i);\n        set.add(num / i);\n      }\n    }\n    return set;\n  }\n};\n```\n\nand annotates it as:\n\n```ts\nvar sumFourDivisors: (nums: number[]) =\u003e number = function (nums) {\n  let res: number = 0;\n  for (const e of nums) {\n    const set: Set\u003cnumber\u003e = helper(e);\n    if (set.size === 4) {\n      for (const i of set) res += i;\n    }\n  }\n  return res;\n  function helper(num: number): Set\u003cnumber\u003e {\n    const set: Set\u003cnumber\u003e = new Set();\n    const r: number = ~~(Math.sqrt(num) + 1);\n    for (let i = 1; i \u003c r; i++) {\n      if (num % i === 0) {\n        set.add(i);\n        set.add(num / i);\n      }\n    }\n    return set;\n  }\n};\n```\n\nwhereas TypeScript's own inference could not produce an output that type checks:\n\n```\n7:28 - error TS2365: Operator '+=' cannot be applied to types 'number' and 'unknown'.\n\n7       for (const i of set) res += i;\n                             ~~~~~~~~\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGammaTauAI%2Fopentau","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGammaTauAI%2Fopentau","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGammaTauAI%2Fopentau/lists"}