Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/wyatt-avilla/sunbird
neural network based decompiler
- Host: GitHub
- URL: https://github.com/wyatt-avilla/sunbird
- Owner: wyatt-avilla
- Created: 2024-09-04T01:07:04.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-23T01:07:07.000Z (4 months ago)
- Last Synced: 2025-01-19T18:52:14.729Z (12 days ago)
- Topics: decompilation, machine-translation, neural-network, python, pytorch, tree-sitter, x86-64
- Language: Jupyter Notebook
- Homepage:
- Size: 60.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# sunbird 🐦🔥
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
[![python](https://img.shields.io/badge/Python-3.12-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
[![pytorch](https://img.shields.io/badge/PyTorch-2.4.1-EE4C2C.svg?style=flat&logo=pytorch)](https://pytorch.org)
![Tree Sitter](https://img.shields.io/badge/tree_sitter-0.23.0-7E8F31)

## Overview
This project focuses on translating x86-64 assembly back into C code using a
machine learning model trained on a dataset of C code snippets. Each snippet is
compiled with multiple optimization levels across different compilers, and the
resulting assembly code is tokenized for use in training.

## Dataset
- the model was trained on an augmented version of
[this dataset](https://www.kaggle.com/datasets/shirshaka/c-code-snippets-and-their-labels)
- each snippet of C code is compiled (by default) with the first four
optimization levels of GCC and Clang, yielding 8 unique assembly code snippets
for each element in the initial dataset, totaling 2.5 million snippets (see the
sketch after the download command below)

If `kaggle` is in your path, the original dataset can be downloaded with:
```sh
kaggle datasets download -d shirshaka/c-code-snippets-and-their-labels && \
unzip -d dataset c-code-snippets-and-their-labels.zip
```
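For reference, the augmentation described above is just the Cartesian product of the two compilers and the first four optimization levels. The sketch below only enumerates those combinations; the project's actual compiler invocations live in `compilation.py`, described in the next section:

```python
from itertools import product

# Default expansion described above: 2 compilers x 4 optimization levels
# yields 8 assembly variants for every C snippet in the original dataset.
compilers = ["gcc", "clang"]
opt_levels = ["-O0", "-O1", "-O2", "-O3"]

variants = list(product(compilers, opt_levels))
print(len(variants))  # 8
# ('gcc', '-O0'), ('gcc', '-O1'), ..., ('clang', '-O3')
```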
### Generation

- compilation is performed as needed when calling `DatasetIterator.take(n)`
- compilation settings, including optimization levels and compiler choices, are
specified in the arguments to this method call
- the exact flags passed into the compilation subprocesses are specified in the
`.compile()` methods in `compilation.py` (a rough sketch follows this list)
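As a rough sketch of what one of these compilation steps could look like (a hypothetical helper for illustration only, not the project's actual `.compile()` implementation or flags), a snippet can be handed to a compiler with `-S` so that it stops after emitting assembly:

```python
import subprocess
import tempfile
from pathlib import Path


def compile_to_asm(c_source: str, compiler: str = "gcc", opt_level: str = "-O0") -> str:
    """Compile a C snippet to x86-64 assembly text (illustrative helper only)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "snippet.c"
        asm = Path(tmp) / "snippet.s"
        src.write_text(c_source)
        # -S stops after code generation and writes assembly instead of an object file
        subprocess.run([compiler, opt_level, "-S", str(src), "-o", str(asm)], check=True)
        return asm.read_text()


print(compile_to_asm("int add(int a, int b) { return a + b; }", compiler="clang", opt_level="-O2"))
```

In the project itself this happens lazily: `DatasetIterator.take(n)` compiles only the snippets it is about to yield, using whichever compilers and optimization levels were requested in its arguments.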
## Tokenization

C and assembly code snippets are tokenized semantically using the tree-sitter
library. Each token includes raw text paired with its symbolic identity, e.g.,
`(variable, 42)`.
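A minimal sketch of this style of tokenization for C, assuming the `tree-sitter` and `tree-sitter-c` Python bindings (the project's own tokenizer may relabel node types, e.g. mapping identifiers to `variable`, and also covers assembly):

```python
import tree_sitter_c
from tree_sitter import Language, Parser

parser = Parser(Language(tree_sitter_c.language()))


def tokenize_c(source: str) -> list[tuple[str, str]]:
    """Return (node_type, raw_text) pairs for every leaf node in the parse tree."""
    tree = parser.parse(source.encode())

    def leaves(node):
        if node.child_count == 0:
            yield (node.type, node.text.decode())
        for child in node.children:
            yield from leaves(child)

    return list(leaves(tree.root_node))


print(tokenize_c("int x = 42;"))
# e.g. [('primitive_type', 'int'), ('identifier', 'x'), ('=', '='),
#       ('number_literal', '42'), (';', ';')]
```

Pairing each leaf's symbolic node type with its raw text yields tuples in the same spirit as the `(variable, 42)` example above.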