{"id":13471131,"url":"https://github.com/ZhangShurong/rebucket","last_synced_at":"2025-03-26T13:30:50.050Z","repository":{"id":136099970,"uuid":"157214012","full_name":"ZhangShurong/rebucket","owner":"ZhangShurong","description":"ReBucket – A Method for Clustering Duplicate Crash Reports based on Call Stack Similarity","archived":false,"fork":false,"pushed_at":"2021-03-28T03:02:49.000Z","size":3861,"stargazers_count":30,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-30T02:58:22.660Z","etag":null,"topics":["microsoft","rebucket","stacktrace"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZhangShurong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-12T12:58:34.000Z","updated_at":"2024-10-09T05:12:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"c8a25d71-cfdb-44dc-84cf-1fe656e96c80","html_url":"https://github.com/ZhangShurong/rebucket","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhangShurong%2Frebucket","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhangShurong%2Frebucket/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhangShurong%2Frebucket/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhangShurong%2Frebucket/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/ZhangShurong","download_url":"https://codeload.github.com/ZhangShurong/rebucket/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245662744,"owners_count":20652073,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["microsoft","rebucket","stacktrace"],"created_at":"2024-07-31T16:00:40.319Z","updated_at":"2025-03-26T13:30:50.039Z","avatar_url":"https://github.com/ZhangShurong.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# rebucket\nimplements rebucket algorithm for research.\n\n## How To Use? 
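\nUsage:\n```\ncd rebucket\nmkdir build\ncd build\ncmake ..\nmake\npython test.py -d ../../dataset/Firefox/df_mozilla_firefox.json\n```\n\n## Dataset\nhttps://github.com/logpai/bugrepo\n\n## Implementation\nTodo\n- [x] implement the ReBucket algorithm in C++\n- [ ] data structure\n------\nThe detailed description follows.\n# ReBucket algorithm implementation\n\nFor the algorithm itself, see the ReBucket paper; this document covers only project-specific details.\n## Project structure\nrebucket  \n|  \n|---- dataset, the processed datasets  \n|  \n|---- rebucket, the C++ implementation of ReBucket  \n|  \n|---- generate_dataset.py, script that generates the datasets  \n|  \n|---- test.py, test script  \n|  \n|---- rebucket.py, algorithm script  \n\n## Dataset processing\n**Why the dataset needs processing**  \nNot every record in the original bugrepo dataset contains a stack trace, so the stack traces have to be extracted to generate a usable dataset. The script generate_dataset.py does this and writes its output to the dataset directory.  \nThe extraction algorithm is described in:  \n\nhttp://groups.csail.mit.edu/pag/pubs/bettenburg-msr-2008.pdf\n\n**Dataset format**  \nBecause the data volume is small, and for compatibility with other projects, the dataset is stored as JSON strings in the following format:\n```\n{\n    \"stack_id\":\"stack ID\",\n    \"duplicated_stack\":\"ID of the duplicated stack\",\n    \"stack_arr\":[stack contents, represented as an array]\n}\n```\n## Validating the algorithm\nThe original paper already provides detailed evaluation metrics, so this section only briefly describes how the number of misclassifications is counted.  \nSuppose the correct clustering is\n```\n{[1,2,3],[4,5,6],[7,8]}\n```\nbut, for whatever reason, the algorithm misclassifies and produces the following result:\n```\n{[1,2],[3],[4,5,6,7,8]}\n```\nThe miss count here is 1: stacks 7 and 8 were both placed in the bucket of 4, 5, 6, so **one class** of error is not reflected in the result. Put differently, in a production environment one class of error would go unreported.  \nComputing the miss count is straightforward: compare the clustering result with the ground truth and find which classes were not separated out. The relevant code is the wrong function in rebucket.py.\n\n## How to run the C++ code?\nEnter the rebucket directory:\n```\nmkdir build\ncd build\ncmake ..\nmake\n```\nThe build directory now contains the shared library and test.py. Run:\n```\npython test.py -d ../../dataset/Firefox/df_mozilla_firefox.json\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZhangShurong%2Frebucket","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FZhangShurong%2Frebucket","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZhangShurong%2Frebucket/lists"}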