
# 🤖 Readable.ai

🤖 An AI deobfuscator that translates minified/obfuscated JavaScript back into **human-readable code**.

Have you ever viewed the source of a website, only to be met by a wall of meaningless, machine-generated code?

```javascript
function _0x5dcf(_0x1a2b, _0x3c4d) {
  var _0x7e8f = _0x1a2b["data"][0];
  var _0x9b1a = _0x1a2b["key"];
  if(_0x3c4d > _0x7e8f) {
    for(var _0x5f8d = 0; _0x5f8d < _0x3c4d; _0x5f8d++) {
      console.log(_0x9b1a + _0x5f8d);
    }
  }
  return _0x7e8f;
}
```

This is a nightmare for debugging and analysis. **Readable.ai** is the answer.

-----

## 🔮 The Mission: Translate Chaos into Clarity

Our mission is simple: we use large language models (LLMs) to **reverse-engineer** this digital nightmare, restoring logic and human-readable meaning to obfuscated code.

We treat this as a "machine translation" problem:

* **Source Language:** "Minified" or "obfuscated" code.
* **Target Language:** The clean, original code.

**Our goal is to turn the cryptic block above into its logical equivalent:**

```javascript
function checkThreshold(config, limit) {
  var firstValue = config["data"][0];
  var prefixKey = config["key"];
  if(limit > firstValue) {
    for(var index = 0; index < limit; index++) {
      console.log(prefixKey + index);
    }
  }
  return firstValue;
}
```
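
Because deobfuscation is framed as translation, each training example is just a source/target pair. The exact dataset schema isn't shown here, so the snippet below is only a sketch of how one pair might be serialized, reusing the prompt template from the inference example in Step 5:

```python
# Hypothetical record layout; the field names are assumptions, not the
# project's published schema.
pair = {
    "input": 'function _0x5dcf(_0x1a2b, _0x3c4d) { /* obfuscated */ }',
    "output": 'function checkThreshold(config, limit) { /* readable */ }',
}

# Serialized with the same prompt template the inference example uses:
training_text = f"### Input:\n{pair['input']}\n\n### Output:\n{pair['output']}"
print(training_text)
```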

-----

## 🛠️ The Arsenal: Our Methodology

We aren't training a model from scratch. We fine-tune an existing pre-trained model to specialize in this one task.

* **Base Model:** `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (a compact 1.1-billion-parameter chat model).
* **Training Technique:** **LoRA** (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique. It lets us teach the base model a new skill by training only a tiny fraction (< 2%) of its parameters; see the config sketch after this list.
* **Fuel (The Dataset):** The model is trained on a **large-scale, custom-built dataset** of hundreds of thousands of real-world obfuscated-to-clean code pairs.
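
For the curious, here is a minimal sketch of what this LoRA setup looks like with the `peft` library. The hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the project's published values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 2% of all weights
```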

-----

## 📈 Status & Roadmap

This project is trained in **multiple stages** due to dataset size and compute limitations:

* **Stage 1 (In Progress):** Train on the **first 500,000 samples** of the dataset.
  * **Result:** `adapter_v1` (initial learning on a substantial portion of the data)
* **Stage 2 (Upcoming):** Load `adapter_v1` and continue training on the **next segment** of the dataset (e.g., samples 500,001 to 1,000,000); see the resume sketch after this list.
  * **Result:** `adapter_v2`
* **Stage 3 (Upcoming):** Load `adapter_v2` and train on the **subsequent segment**, repeating until the full dataset is processed.
  * **Result:** `adapter_vX` (the final adapter after processing all data)
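
Each stage resumes from the previous adapter rather than starting over. Here is a minimal sketch of that resume step (paths are placeholders; the training loop itself is omitted):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# is_trainable=True keeps the adapter weights unfrozen so training can continue.
model = PeftModel.from_pretrained(base, "path/to/adapter_v1", is_trainable=True)

# ...fine-tune on the next dataset segment, then save the new checkpoint:
model.save_pretrained("path/to/adapter_v2")
```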

-----

## 🚀 How to Use

Your fine-tuned model consists of two parts: the **base model** (`TinyLlama-1.1B-Chat-v1.0`) and the **adapter** (your trained knowledge). To run inference, you must load the base model, then apply your adapter on top of it.

This is the standard 5-step procedure:

### 1. Install Dependencies

Ensure you have the necessary libraries installed:

```bash
pip install transformers peft accelerate torch
```

### 2. Load Base Model & Tokenizer

First, load the original `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model from Hugging Face. This is the foundation your adapter will be applied to.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
# Llama-family tokenizers ship without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
```

### 3. Load Your Trained Adapter

Using `PeftModel`, load your trained weights (from the `adapter_model.safetensors` file) and "graft" them onto the base model.

```python
# Provide the path to your trained adapter directory
ADAPTER_PATH = "/path/to/your/trained_adapter/"
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
```

### 4. Merge for Inference Speed

This is a worthwhile optimization step. The `merge_and_unload()` call permanently fuses the adapter's low-rank weights into the base model's weight matrices. This produces a single, self-contained model and **speeds up inference** by removing the adapter's extra computation from every forward pass.

```python
model = model.merge_and_unload()
print("✅ Adapter merged. Model is ready!")
```

### 5. Run Inference

The `model` is now ready. You must format your request as a **prompt** that matches the structure the model was trained on, then call `model.generate()`.

```python
# 1. Define the obfuscated code
obfuscated_code = 'function _0x5dcf(_0x1a2b, _0x3c4d){var _0x7e8f=_0x1a2b["data"][0];var _0x9b1a=_0x1a2b["key"];if(_0x3c4d>_0x7e8f){for(var _0x5f8d=0;_0x5f8d<_0x3c4d;_0x5f8d++){console.log(_0x9b1a+_0x5f8d);}}return _0x7e8f;}'

# 2. Format the prompt
# **NOTE:** The prompt structure must match the one used during training.
prompt = f"### Input:\n{obfuscated_code}\n\n### Output:\n"

# 3. Tokenize and run generation
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# 4. Decode only the newly generated tokens (the raw output otherwise
# echoes the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

print("--- Deobfuscation Result ---")
print(response_text)
```

-----

## ⚡ Performance & Resource Requirements

Understanding the resource needs is critical. Here is the breakdown based on our tests.

* **Training (LoRA):** The fine-tuning process is highly efficient. Using LoRA, `fp16`, and a memory-efficient optimizer, a training session (such as Stage 1 or 2) requires **approximately 10-12 GB of VRAM**. This fits a high-end Colab GPU (such as a V100 or A100) or a mid-range local GPU.

* **Inference (After Merging):** This is the key benefit. After `merge_and_unload()` (Step 4), the adapter is fused into the base model:
  * The merged model (base + adapter) requires the **same VRAM as the original `TinyLlama-1.1B` base model** (approx. 2.4 GB in `fp16`).
  * You pay **no extra VRAM cost** for the adapter's new knowledge.
  * Inference speed is **identical to the original base model**, since it is a single, unified model.

* **Portability:** The merged model can be saved and deployed as a standard Transformers model, with no `peft` dependency at inference time, as shown in the sketch below.
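
For example, the merged model can be exported and later reloaded with plain `transformers` (the directory name is a placeholder):

```python
# Export the merged model and tokenizer.
model.save_pretrained("./readable-ai-merged")
tokenizer.save_pretrained("./readable-ai-merged")

# Reload anywhere with plain transformers; no peft required.
from transformers import AutoModelForCausalLM, AutoTokenizer

standalone = AutoModelForCausalLM.from_pretrained("./readable-ai-merged")
standalone_tokenizer = AutoTokenizer.from_pretrained("./readable-ai-merged")
```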

-----

## ❤️ Support the Project

If you find this project useful, want to support server costs, or just want to buy me a coffee, donations are appreciated.

**BSC (Binance Smart Chain) Address:**
`0x1f7fa6d01f02583b48e0343a9e42cbd408ef3bfb`

-----

## 📄 License

This project is licensed under the MIT License.

-----