[EMNLP 2023] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
https://github.com/FSoft-AI4Code/TheVault
ai4code dataset multilingual-nlp
Host: GitHub
URL: https://github.com/FSoft-AI4Code/TheVault
Owner: FSoft-AI4Code
License: mit
Created: 2022-10-27T09:12:22.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-02-05T07:36:40.000Z (5 months ago)
Last Synced: 2024-03-12T15:05:31.047Z (4 months ago)
Topics: ai4code, dataset, multilingual-nlp
Language: Jupyter Notebook
Homepage:
Size: 9.44 MB
Stars: 76
Watchers: 4
Forks: 6
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) [![Python 3.8](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-380/) [![arXiv](https://img.shields.io/badge/arXiv-2305.06156-b31b1b.svg)](https://arxiv.org/abs/2305.06156) [![The Vault on HuggingFace datasets](https://img.shields.io/badge/%F0%9F%A4%97%20Datasets-The%20Vault-yellow?style=flat)](https://huggingface.co/datasets/Fsoft-AIC/the-vault-function) 

# The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation



## Table of contents

- [The Vault Dataset](#the-vault-dataset)

  - [Data Summary](#data-summary)

  - [Data Structure](#data-structure)

    - [Data Instance](#data-instances)

    - [Data Fields](#data-fields)

    - [Data Near Deduplication](#data-near-deduplication)

    - [Splitting Train/Eval/Test](#splitting-trainevaltest)

    - [Splitting Train set](#splitting-trainset-into-multiple-subsets)

  - [Download Dataset](#download-dataset)

    - [Load dataset on Huggingface](#load-dataset-on-huggingface)

    - [Download via link](#download-via-link)

- [The Vault toolkit](#the-vault-toolkit)

  - [Getting Started](#getting-started)

  - [Processing Pipeline](#processing-pipeline)

    - [Extracting raw code](#extracting-raw-code)

    - [Filtering extracted code snippet](#filtering-extracted-code-snippet)

    - [Processing Custom Dataset](#processing-custom-dataset)

- [Citing The Vault](#citing-the-vault)

- [Contact Us](#contact-us)

- [License](#license)

___________

# The Vault Dataset

## Data Summary

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.

![The Vault poster](./assets/TheVault-Poster.png)

## Data Structure

### Data Instances

Every sample of The Vault is stored as a JSON object and compressed into a large JSON Lines file. Each sample corresponds to one raw code file. The content of the file is used to extract the function, class, and inline sets; other information (repository name, licenses, etc.) is collected from the source dataset (The Stack).
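As a minimal sketch of how such a file can be consumed, the snippet below reads a JSON Lines stream one sample at a time. The field names `repo` and `code` in the usage example are placeholders for illustration only; see the data fields documentation for the real schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_samples(handle: TextIO) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage with an in-memory stream; a real file opened with open(path) works the same way.
# The field names below are hypothetical.
fake_file = io.StringIO('{"repo": "org/a", "code": "x = 1"}\n{"repo": "org/b", "code": "y = 2"}\n')
for sample in iter_samples(fake_file):
    print(sample["repo"], len(sample["code"]))
```

Reading line by line keeps memory usage flat, which matters for files of this size.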

### Data Fields

See the details of the data fields and an example for each type of set [here](./data/README.md).

### Data Near-Deduplication

We applied both internal and external deduplication.

- **Internal**: Exact deduplication across the full dataset.

- **External**: Near-deduplication against the test sets of CodeSearchNet, HumanEval, and APPS.

*[Near-deduplication](https://chenghaomou.github.io/posts/20230220150602) uses MinHash LSH to cluster samples based on their code, so samples that are close to each other (or are modified versions of one another) can be detected.
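The idea behind MinHash LSH can be sketched with the standard library alone. This is an illustrative sketch, not the pipeline used to build The Vault: the tokenizer, the shingle size `k`, the number of hash functions `num_perm`, and the band count are all assumptions chosen for readability.

```python
import hashlib
import re


def shingles(code: str, k: int = 3) -> set:
    """Return the set of k-token shingles of a code string."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_perm: int = 64) -> list:
    """Keep the minimum hash value per seeded hash function (seeds stand in for permutations)."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures: dict, bands: int = 16) -> dict:
    """Group sample ids whose signatures agree on at least one band."""
    buckets = {}
    for sample_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(sample_id)
    # Only buckets holding more than one sample are duplicate candidates.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


original = "def add(a, b):\n    return a + b"
variant = "def add(a, b):\n    return b + a"
sig_original = minhash_signature(shingles(original))
sig_variant = minhash_signature(shingles(variant))
print(estimated_jaccard(sig_original, sig_variant))
```

Candidate pairs that share a bucket would then be compared directly, and one sample of each confirmed near-duplicate pair dropped.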

### Splitting train/eval/test

We have divided the complete dataset into three distinct sets: a training set, an evaluation set, and a test set, to maintain consistency throughout the experiment.

To avoid data leakage, we allocated all samples from the same repository to a single set. We then subdivided these sets using code tokens as splitting factors. As a result, these subsets mirror the distribution of the full dataset.
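A repository-level split can be sketched as follows. This is a simplified illustration that assigns repositories to splits by hashing their names; it does not reproduce the token-based stratification described above, and the `repo` field name and the ratios are assumptions.

```python
import hashlib


def assign_split(repo: str, eval_ratio: float = 0.01, test_ratio: float = 0.01) -> str:
    """Map a repository name to a split deterministically, so a repository never straddles splits."""
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    if bucket < test_ratio:
        return "test"
    if bucket < test_ratio + eval_ratio:
        return "eval"
    return "train"


def split_by_repo(samples: list) -> dict:
    """Distribute samples into train/eval/test based on their repository only."""
    splits = {"train": [], "eval": [], "test": []}
    for sample in samples:
        splits[assign_split(sample["repo"])].append(sample)
    return splits


# Hypothetical samples: both functions from "org/a" always land in the same split.
samples = [
    {"repo": "org/a", "code": "def f(): pass"},
    {"repo": "org/a", "code": "def g(): pass"},
    {"repo": "org/b", "code": "def h(): pass"},
]
print({name: len(items) for name, items in split_by_repo(samples).items()})
```

Because the assignment depends only on the repository name, re-running the split on new data keeps existing repositories in the same set.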

### Splitting trainset into multiple subsets

Given the substantial size of our dataset, we found it beneficial to provide two smaller subsets of the training set alongside the full set, for ease of experimentation:

- A small training set, which contains 5% of the total data.

- A medium training set, comprising 20% of the full dataset.

- The full training set.

|            | Small set   | Medium set   | Train set  | Validation | Test   | Total         |
|:-----------|------------:|-------------:|-----------:|-----------:|-------:|--------------:|
|Python      |   370,657   |  1,952,110   | 7,772,647  | 30,992     | 21,652 | 7,825,291     |
|Java        |   351,213   |  1,612,366   | 6,629,193  | 22,677     | 15,552 | 6,667,422     |
|JavaScript  |    82,931   |    404,729   | 1,640,416  | 22,044     | 21,108 | 1,683,568     |
|PHP         |   236,638   |  1,155,476   | 4,656,371  | 21,375     | 19,010 | 4,696,756     |
|C           |   105,978   |    381,207   | 1,639,319  | 27,525     | 19,122 | 1,685,966     |
|C#          |   141,090   |    783,166   | 3,305,891  | 24,787     | 19,638 | 3,350,316     |
|C++         |    87,420   |    410,907   | 1,671,268  | 20,011     | 18,169 | 1,709,448     |
|Go          |   267,535   |  1,319,547   | 5,109,020  | 19,102     | 25,314 | 5,153,436     |
|Ruby        |    23,921   |    112,574   |   424,339  | 17,338     | 19,908 |   461,585     |
|Rust        |    35,367   |    224,015   |   825,130  | 16,716     | 23,141 |   864,987     |
|TOTAL       | 1,702,750   |  8,356,097   |33,673,594  |222,567     |202,614 |**34,098,775** |

## Download dataset

### Load dataset on Huggingface

We publish The Vault ([function](https://huggingface.co/datasets/Fsoft-AIC/the-vault-function)/[inline](https://huggingface.co/datasets/Fsoft-AIC/the-vault-inline)/[class](https://huggingface.co/datasets/Fsoft-AIC/the-vault-class)) on Huggingface dataset hub.

```python
from datasets import load_dataset

# Load the full function-level dataset
# (use "the-vault-class" or "the-vault-inline" for the other levels)
dataset = load_dataset("Fsoft-AIC/the-vault-function")

# Load the function-level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

# Load the "small" (or "medium", "full") function-level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

# Load a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["Python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in data["train"]:
    print(sample)
```

### Download via link

Alternatively, download The Vault directly from Azure Blob Storage. The download links follow this pattern:

> https://ai4code.blob.core.windows.net/thevault/v1/{function,class,inline}/{python,java,javascript,go,cpp,c_sharp,c,rust,ruby,php}.zip

For example, to download the *class*-level set for *Python*:

> https://ai4code.blob.core.windows.net/thevault/v1/class/python.zip

Or download using the script [`download_dataset.py`](./resources/download_dataset.py):

```bash

python download_dataset.py "" --set "function" # or class/inline

```

_Note: the blob link currently contains only the function-level version._
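The link pattern above can also be expanded programmatically. The helper below is a hypothetical convenience, not part of the toolkit; it only builds URLs from the level and language names listed in the pattern and performs no download.

```python
BASE_URL = "https://ai4code.blob.core.windows.net/thevault/v1"
LEVELS = ("function", "class", "inline")
LANGUAGES = ("python", "java", "javascript", "go", "cpp", "c_sharp", "c", "rust", "ruby", "php")


def build_url(level: str, language: str) -> str:
    """Build the archive URL for one level/language pair, rejecting unknown names."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{level}/{language}.zip"


# List every function-level archive.
for language in LANGUAGES:
    print(build_url("function", language))
```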

# The Vault Toolkit

## Getting Started

To set up the environment and install the dependencies via `pip`:

```bash
pip install -r requirements.txt
```

Install the `codetext` parser, which extracts code using [tree-sitter](https://tree-sitter.github.io/tree-sitter/), via `pip`:

```bash

pip install codetext

```

Or build `codetext` manually from source; see the [`Codetext` repo](https://github.com/FSoft-AI4Code/CodeText-parser) for more details:

```bash
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git
cd CodeText-parser
pip install -e .
```

## Processing Pipeline

Our toolkit takes raw source code files as input and streamlines the extraction and generation of code-text pairs, as illustrated in the figure below.

There are 3 separate processes:

1. **Extracting Raw code:** A Tree-sitter extractor identifies the function/class/line nodes inside a raw file and obtains their metadata.

2. **Extracting Docstring Style:** A docstring style parser captures every informative section or parameter description inside a docstring.

3. **Filtering Docstring:** The docstrings gathered in the previous process are the main factor for filtering quality samples (removing empty docstrings, uninformative docstrings, etc.). See more about the cleaning methodologies in our [paper](https://arxiv.org/abs/2305.06156).

The source code is separated into multiple steps, one for each process. Alternatively, you can run the full pipeline by following [this tutorial](#processing-custom-dataset).

![Extracting pipeline](./assets/pipeline.png)

### Extracting Raw code

From raw code, you can extract functions and classes using [`process_raw_node()`](./src/utils/utils.py#L138). An example structure of a raw code snippet is shown in the figure below. A node contains an identifier, a parameter or argument list, code, and a comment (docstring).

```python
from codetext.utils import parse_code
from codetext.parser import PythonParser

# Assumed import path, based on the file linked above (run from the repository root)
from src.utils.utils import process_raw_node, get_line_definitions

code_snippet = """
def sum2num(a: int, b: int):
  '''
  :param a: first number
  :param b: second number
  '''
  return a + b # result
"""

# The snippet is Python, so parse it with the Python grammar
code_tree = parse_code(code_snippet, 'python')

# Extract function/class nodes and their metadata
res = process_raw_node(
    tree=code_tree,
    blob=code_snippet,
    language_parser=PythonParser(),
    metadata={'repo': 'test'}  # Optional
)

# or extract inline comments
res = get_line_definitions(
    tree=code_tree,
    blob=code_snippet,
    language_parser=PythonParser(),
    source_metadata={'repo': 'test'}  # Optional
)
```

To extract raw inline comments, use [`get_line_definitions()`](./src/utils/utils.py#L279), which extracts each line comment and returns its parent code block together with the previous and next context (i.e. code blocks).

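```python
from codetext.utils import parse_code
from codetext.parser import PythonParser

# Assumed import path, based on the file linked above (run from the repository root)
from src.utils.utils import get_line_definitions

code_snippet = """
def sum2num(a: int, b: int):
  '''
  :param a: first number
  :param b: second number
  '''
  return a + b # result
"""

# The snippet is Python, so parse it with the Python grammar
code_tree = parse_code(code_snippet, 'python')

# Extract inline comments together with their surrounding context
res = get_line_definitions(
    tree=code_tree,
    blob=code_snippet,
    language_parser=PythonParser(),
    source_metadata={'repo': 'test'}  # Optional
)
```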

![Raw node structure](./assets/raw-node.png)

### Filtering Extracted code snippet

Given the function or class nodes and their metadata extracted in the previous process, run [`get_node_definitions()`](./src/utils/utils.py#L238) to filter samples based on their docstrings. Heuristic rules remove samples that do not meet the minimum requirements (described in detail in our [paper](https://arxiv.org/abs/2305.06156)).
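To give a feel for docstring-based filtering, here is a minimal sketch of two such heuristics: dropping empty docstrings and dropping very short ones. It is not the toolkit's rule set; the `docstring` field name and the `min_words` threshold are assumptions made for illustration.

```python
import re
from typing import Optional


def is_informative(docstring: Optional[str], min_words: int = 3) -> bool:
    """Reject missing, empty, or very short docstrings."""
    if docstring is None:
        return False
    text = docstring.strip()
    if not text:
        return False
    # Count alphabetic words so that punctuation-only docstrings are rejected.
    words = re.findall(r"[A-Za-z]+", text)
    return len(words) >= min_words


def filter_samples(samples: list) -> list:
    """Keep only the samples whose docstring passes the heuristic."""
    return [sample for sample in samples if is_informative(sample.get("docstring"))]


samples = [
    {"identifier": "sum2num", "docstring": "Add two numbers and return the result."},
    {"identifier": "todo", "docstring": "..."},
    {"identifier": "empty", "docstring": ""},
]
print([sample["identifier"] for sample in filter_samples(samples)])
```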

![](./assets/docstring-style.png)

Lastly, to extract the docstring style, we implement a function called [`extract_docstring()`](./src/utils/utils.py#L514) that takes a docstring (as a string) as input and returns metadata about the docstring style, as demonstrated in the figure above (e.g. a parameter's docstring and type, the return value's docstring, etc.).
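The kind of metadata a docstring style parser produces can be illustrated with a small standalone parser for the reST-style fields used in the `sum2num` example. This sketch is independent of `extract_docstring()` and does not mirror its output format; it handles single-line fields only and treats every other non-empty line as summary text.

```python
import re

FIELD_RE = re.compile(r"^:(?P<kind>param|type|returns?|rtype)(?:\s+(?P<name>\w+))?:\s*(?P<text>.*)$")


def parse_rest_docstring(docstring: str) -> dict:
    """Collect the summary, per-parameter descriptions/types, and return info."""
    summary_lines = []
    result = {"params": {}, "returns": None, "rtype": None}
    for raw_line in docstring.strip().splitlines():
        line = raw_line.strip()
        match = FIELD_RE.match(line)
        if match is None:
            if line:
                summary_lines.append(line)
            continue
        kind, name, text = match["kind"], match["name"], match["text"]
        if kind == "param" and name:
            result["params"].setdefault(name, {})["docstring"] = text
        elif kind == "type" and name:
            result["params"].setdefault(name, {})["type"] = text
        elif kind in ("return", "returns"):
            result["returns"] = text
        elif kind == "rtype":
            result["rtype"] = text
    result["summary"] = " ".join(summary_lines)
    return result


docstring = """
Add two numbers.

:param a: first number
:type a: int
:param b: second number
:return: the sum of a and b
"""
print(parse_rest_docstring(docstring))
```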

### Processing Custom Dataset

We use a `.yaml` file to define which fields to load when processing data. Usually only the source code is needed, but any additional information about the raw code can be added through the `.yaml` file.

For example, `CodeSearchNet` stores its data in the following structure:

```yaml
# CodeSearchNet jsonline format
# https://github.com/github/CodeSearchNet#data-details
code: original_string # raw code
repo: repo # additional info
path: path # additional info
language: language # additional info
```
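Conceptually, such a format file is a mapping from the field names the pipeline expects to the field names found in the raw data. The sketch below applies that mapping to a single record; it mirrors the CodeSearchNet example as a plain dictionary and is an illustration of the idea, not the toolkit's loader.

```python
import json

# Same mapping as the YAML example: pipeline field -> field in the raw record.
DATA_FORMAT = {
    "code": "original_string",
    "repo": "repo",
    "path": "path",
    "language": "language",
}


def apply_format(record: dict, data_format: dict) -> dict:
    """Rename the mapped fields and drop everything the format does not mention."""
    return {
        target: record[source]
        for target, source in data_format.items()
        if source in record
    }


# One hypothetical line of a CodeSearchNet-style .jsonl file.
raw_line = json.dumps({
    "original_string": "def sum2num(a, b):\n    return a + b",
    "repo": "test",
    "path": "src/math.py",
    "language": "python",
    "sha": "unused-field",
})
print(apply_format(json.loads(raw_line), DATA_FORMAT))
```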

Inside `processing.py`, we merged the raw code extraction, docstring filtering, and docstring style extraction functions into one simple pipeline for quickly extracting a dataset from raw source data. You can use `processing.py` as follows:

```bash
python -m codetext.processing 
--save_path   # path to the save directory
--load_from_file  # load from a file instead of the dataset cache
--language Python  # or Java, JavaScript, ...
--data_format './data/format/codeparot-format.yaml'  # raw data format to load
--n_split 20  # split the original dataset into N subsets
--n_core -1  # number of worker processes (default 1; -1 uses all cores)
```

Arguments list:

```
positional arguments:
  data_path             data folder contain file.jsonl or huggingface dataset cache

options:
  -h, --help            show this help message and exit
  --save_path SAVE_PATH
                        Processed data save path
  --level LEVEL         Extract function/class/inline level or all
  --language LANGUAGE   Declare processing language (e.g: Python, Java)
  --data_format DATA_FORMAT
                        Path to file .yaml contains data format
  --load_from_file      Load from .json or .jsonl
  --cons_from_raw       Continues from raw .jsonl (pass folder path to data)
  --raw_only
  --filtered_only
  --extracted_only
  --n_split N_SPLIT     Split all the raw data into N file and feed into process pool
  --n_core N_CORE       Number of maximum process to create
  --debug
```

# Citing The Vault

More details can be found in our [paper](https://arxiv.org/abs/2305.06156). 

If you're using The Vault or the toolkit in your research or applications, please cite using this BibTeX:

```bibtex
@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
```

# Contact us

If you have any questions, comments or suggestions, please do not hesitate to contact us.

- Website: [fpt-aicenter](https://www.fpt-aicenter.com/ai-residency/)

- Email: [email protected]

# License

[MIT License](LICENSE)