{"id":30705076,"url":"https://github.com/mradovic38/tiny-starcoder-evaluation","last_synced_at":"2026-06-25T02:31:51.199Z","repository":{"id":266879349,"uuid":"870273542","full_name":"mradovic38/tiny-starcoder-evaluation","owner":"mradovic38","description":"Evaluation of the Tiny Starcoder Fill-in-the-Middle code completion model, utilizing dataset generation, manual scoring, and various automatic metrics to assess performance.","archived":false,"fork":false,"pushed_at":"2025-01-14T02:02:49.000Z","size":193,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-02T18:11:41.781Z","etag":null,"topics":["bleu-score","code-completion","code-generation","fill-in-the-middle","fim","gpt","huggingface","huggingface-transformers","nlp","rouge","transformers"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mradovic38.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-09T18:35:48.000Z","updated_at":"2025-01-14T02:02:53.000Z","dependencies_parsed_at":"2024-12-06T18:39:49.314Z","dependency_job_id":null,"html_url":"https://github.com/mradovic38/tiny-starcoder-evaluation","commit_stats":null,"previous_names":["mradovic38/tiny-starcoder-evaluation"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mradovic38/tiny-starcoder-evaluation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mradovic38%2Ftiny-starcoder-evaluation","tags_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mradovic38%2Ftiny-starcoder-evaluation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mradovic38%2Ftiny-starcoder-evaluation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mradovic38%2Ftiny-starcoder-evaluation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mradovic38","download_url":"https://codeload.github.com/mradovic38/tiny-starcoder-evaluation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mradovic38%2Ftiny-starcoder-evaluation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34757353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bleu-score","code-completion","code-generation","fill-in-the-middle","fim","gpt","huggingface","huggingface-transformers","nlp","rouge","transformers"],"created_at":"2025-09-02T18:07:19.419Z","updated_at":"2026-06-25T02:31:51.175Z","avatar_url":"https://github.com/mradovic38.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tiny Starcoder FIM Code Completion Model Evaluation\n\n## [Generating the dataset](data_fetcher.py)\nTo generate the 
dataset, I first cloned the public repository of my [Football Analysis](https://github.com/mradovic38/football_analysis) project using the `clone_repo()` function. This function clones the repository into a specified directory, ensuring the code is available locally. After cloning, I used the `collect_python_files()` function to gather all Python files (except `__init__.py` files) from the repository into a target directory. This function searches for `.py` files and stores them in a designated folder for further processing.\n\n## [Splitting scripts into prefix, middle and suffix](split_generator.py)\nTo prepare the dataset for the model, I implemented the `SplitGenerator` class, which splits Python files into three sections: **prefix**, **middle**, and **suffix**. The prefix and suffix sections each contain 200 tokens, providing ample context for predicting the middle segment, which is 40 tokens long.\n\nThe process works as follows:\n- **Tokenization**: The code is tokenized using a provided tokenizer.\n```python\ntokens = self.tokenizer.tokenize(code)\n```\n- **Splitting**: The tokens are divided into prefix, middle, and suffix, ensuring that each part fits the specified lengths.\n```python\nprefix = tokens[current_position:current_position + self.prefix_length]\nmiddle = tokens[current_position + self.prefix_length:current_position +\n                           self.prefix_length + self.middle_length]\nsuffix = tokens[current_position + self.prefix_length + self.middle_length:current_position +\n                         self.prefix_length + self.middle_length + self.suffix_length]\n```\n- **Splits Dataset Generation**: I used the `generate()` method to create a CSV file with 40 examples, each containing a filename, prefix, middle, and suffix.\n\n## [Running Tiny Starcoder](tiny_starcoder_evaluation.ipynb)\nThe `get_completion` function generates predictions for the middle part of a 
code snippet. It takes the prefix and suffix as inputs, tokenizes them, and prepares them for the model. The model generates a completion based on the input, which is then decoded to extract the middle portion of the text.\nWe apply this function to each row in our DataFrame to obtain predictions for the dataset.\n\n## [Manual scoring](manual_reviewer.py)\nThe `ManualReviewer` class enables the manual evaluation of code completion examples. It displays each example to the reviewer, allowing them to assign a label (0 for correct, 1 for partially correct, and 2 for incorrect) and provide comments on why the example got the given score. The reviews are stored in a DataFrame, which can be saved to a CSV file for further analysis.\n\n## [Proposing some automatic metrics](tiny_starcoder_evaluation.ipynb)\nI have proposed several automatic metrics to evaluate the performance of the Fill-in-the-Middle (FIM) code completion model. The selected metrics include Exact Match, chrF, BLEU and some ROUGE metrics, each providing unique insights into the quality of the generated outputs.\n\n1. **Exact Match**\nThis metric evaluates whether the predicted output matches the reference output exactly. It returns a binary score (True or False).\n\n2. **chrF (Character F-score)**\nchrF measures the F-score at the character level, allowing for sensitivity to small variations in the output, such as syntax and formatting differences. It could be particularly beneficial for code completion tasks, where minor changes can significantly affect functionality, thus providing a more nuanced evaluation.\n\n3. **BLEU** (Bilingual Evaluation Understudy)\nBLEU measures the overlap of n-grams between the predicted output and reference completions. 
It ranges from 0 to 1, where higher scores indicate better alignment with the references.\n - **Why did I choose BLEU?**: BLEU could be useful for assessing the precision of code snippets, because it identifies how closely the model's outputs match expected patterns.\n\n4. **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation)\nROUGE calculates the recall of n-grams between the reference and predicted texts, with several variants (ROUGE-1, ROUGE-2, ROUGE-L) measuring unigram, bigram, and longest common subsequence matches, respectively.\n - **Why did I choose ROUGE?**: It could provide insights into the completeness and relevance of the generated code, making it effective for understanding how well the model captures key elements of the reference completions.\n\n## [Calculating which metrics align the most with manual evaluations](tiny_starcoder_evaluation.ipynb)\nTo see which automatic metrics best track the manual evaluation of the Tiny Starcoder model, I computed the Pearson and Spearman correlations between the manually created labels and each metric.\n### Insights:\nMost metrics exhibited negative correlations, which can be attributed to the labeling system, where 0 indicates correctness and 2 indicates incorrectness.\n1. **Exact Match has no meaningful correlation with the manual labels**, since there are no exact matches.\n\n2. **chrF has a moderate correlation**, indicating that this metric aligns reasonably well with manual evaluations.\n\n3. **BLEU demonstrates the strongest correlation with manual labels**, making it the most reliable metric in this evaluation.\n\n4. **ROUGE-1, ROUGE-2 and ROUGE-L also show moderate positive correlations**, indicating moderate agreement with manual labels.\n\n\n## 📖 Resources\n* [Tiny Starcoder](https://huggingface.co/bigcode/tiny_starcoder_py)\n* Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. 
In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25-26, 2004.\n* Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmradovic38%2Ftiny-starcoder-evaluation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmradovic38%2Ftiny-starcoder-evaluation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmradovic38%2Ftiny-starcoder-evaluation/lists"}