
# 🤖 Readable.ai

🤖 An AI deobfuscator that translates minified/obfuscated JavaScript back into **human-readable code**.

Have you ever viewed the source of a website, only to be met by a wall of meaningless, machine-generated code?

```javascript
function _0x5dcf(_0x1a2b, _0x3c4d) {
  var _0x7e8f = _0x1a2b["data"][0];
  var _0x9b1a = _0x1a2b["key"];
  if(_0x3c4d > _0x7e8f) {
    for(var _0x5f8d = 0; _0x5f8d < _0x3c4d; _0x5f8d++) {
      console.log(_0x9b1a + _0x5f8d);
    }
  }
  return _0x7e8f;
}
```

This is a nightmare for debugging and analysis. **Readable.ai** is the answer.

-----

## 🔮 The Mission: Translate Chaos into Clarity

Our mission is simple: we use large language models (LLMs) to **reverse-engineer** this digital nightmare, restoring logic and human-readable meaning to obfuscated code.

We treat this as a "machine translation" problem:

* **Source Language:** "Minified" or "obfuscated" code.
* **Target Language:** The clean, original code.

**Our goal is to turn the cryptic block above into its logical equivalent:**

```javascript
function checkThreshold(config, limit) {
  var firstValue = config["data"][0];
  var prefixKey = config["key"];
  if(limit > firstValue) {
    for(var index = 0; index < limit; index++) {
      console.log(prefixKey + index);
    }
  }
  return firstValue;
}
```
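
Because deobfuscation is framed as translation, each training example is just a source/target pair. The exact dataset schema isn't shown here, so the snippet below is only a sketch of how one pair might be serialized, reusing the prompt template from the inference example in Step 5:

```python
# Hypothetical record layout; the field names are assumptions, not the
# project's published schema.
pair = {
    "input": 'function _0x5dcf(_0x1a2b, _0x3c4d) { /* obfuscated */ }',
    "output": 'function checkThreshold(config, limit) { /* readable */ }',
}

# Serialized with the same prompt template the inference example uses:
training_text = f"### Input:\n{pair['input']}\n\n### Output:\n{pair['output']}"
print(training_text)
```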

-----

## 🛠️ The Arsenal: Our Methodology

We aren't training a model from scratch. We fine-tune an existing pre-trained model to specialize in this one task.

* **Base Model:** `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (a compact 1.1-billion-parameter chat model).
* **Training Technique:** **LoRA** (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique. It lets us teach the base model a new skill by training only a tiny fraction (< 2%) of its parameters; see the config sketch after this list.
* **Fuel (The Dataset):** The model is trained on a **large-scale, custom-built dataset** of hundreds of thousands of real-world obfuscated-to-clean code pairs.
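
For the curious, here is a minimal sketch of what this LoRA setup looks like with the `peft` library. The hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the project's published values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 2% of all weights
```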

-----

## 📈 Status & Roadmap

This project is trained in **multiple stages** due to dataset size and compute limitations:

* **Stage 1 (In Progress):** Train on the **first 500,000 samples** of the dataset.
  * **Result:** `adapter_v1` (initial learning on a substantial portion of the data)
* **Stage 2 (Upcoming):** Load `adapter_v1` and continue training on the **next segment** of the dataset (e.g., samples 500,001 to 1,000,000); see the resume sketch after this list.
  * **Result:** `adapter_v2`
* **Stage 3 (Upcoming):** Load `adapter_v2` and train on the **subsequent segment**, repeating until the full dataset is processed.
  * **Result:** `adapter_vX` (the final adapter after processing all data)
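
Each stage resumes from the previous adapter rather than starting over. Here is a minimal sketch of that resume step (paths are placeholders; the training loop itself is omitted):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# is_trainable=True keeps the adapter weights unfrozen so training can continue.
model = PeftModel.from_pretrained(base, "path/to/adapter_v1", is_trainable=True)

# ...fine-tune on the next dataset segment, then save the new checkpoint:
model.save_pretrained("path/to/adapter_v2")
```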

-----

## 🚀 How to Use

Your fine-tuned model consists of two parts: the **base model** (`TinyLlama-1.1B-Chat-v1.0`) and the **adapter** (your trained knowledge). To run inference, you must load the base model, then apply your adapter on top of it.

This is the standard 5-step procedure:

### 1. Install Dependencies

Ensure you have the necessary libraries installed:

```bash
pip install transformers peft accelerate torch
```

### 2. Load Base Model & Tokenizer

First, load the original `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model from Hugging Face. This is the foundation your adapter will be applied to.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
# Llama-family tokenizers ship without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
```

### 3. Load Your Trained Adapter

Using `PeftModel`, load your trained weights (from the `adapter_model.safetensors` file) and "graft" them onto the base model.

```python
# Provide the path to your trained adapter directory
ADAPTER_PATH = "/path/to/your/trained_adapter/"
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
```

### 4. Merge for Inference Speed

This is a worthwhile optimization step. The `merge_and_unload()` call permanently fuses the adapter's low-rank weights into the base model's weight matrices. This produces a single, self-contained model and **speeds up inference** by removing the adapter's extra computation from every forward pass.

```python
model = model.merge_and_unload()
print("✅ Adapter merged. Model is ready!")
```

### 5. Run Inference

The `model` is now ready. You must format your request as a **prompt** that matches the structure the model was trained on, then call `model.generate()`.

```python
# 1. Define the obfuscated code
obfuscated_code = 'function _0x5dcf(_0x1a2b, _0x3c4d){var _0x7e8f=_0x1a2b["data"][0];var _0x9b1a=_0x1a2b["key"];if(_0x3c4d>_0x7e8f){for(var _0x5f8d=0;_0x5f8d<_0x3c4d;_0x5f8d++){console.log(_0x9b1a+_0x5f8d);}}return _0x7e8f;}'

# 2. Format the prompt
# **NOTE:** The prompt structure must match the one used during training.
prompt = f"### Input:\n{obfuscated_code}\n\n### Output:\n"

# 3. Tokenize and run generation
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# 4. Decode only the newly generated tokens (the raw output otherwise
# echoes the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

print("--- Deobfuscation Result ---")
print(response_text)
```

-----

## ⚡ Performance & Resource Requirements

Understanding the resource needs is critical. Here is the breakdown based on our tests.

* **Training (LoRA):** The fine-tuning process is highly efficient. Using LoRA, `fp16`, and a memory-efficient optimizer, a training session (such as Stage 1 or 2) requires **approximately 10-12 GB of VRAM**. This fits a high-end Colab GPU (such as a V100 or A100) or a mid-range local GPU.

* **Inference (After Merging):** This is the key benefit. After `merge_and_unload()` (Step 4), the adapter is fused into the base model:
  * The merged model (base + adapter) requires the **same VRAM as the original `TinyLlama-1.1B` base model** (approx. 2.4 GB in `fp16`).
  * You pay **no extra VRAM cost** for the adapter's new knowledge.
  * Inference speed is **identical to the original base model**, since it is a single, unified model.

* **Portability:** The merged model can be saved and deployed as a standard Transformers model, with no `peft` dependency at inference time, as shown in the sketch below.
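
For example, the merged model can be exported and later reloaded with plain `transformers` (the directory name is a placeholder):

```python
# Export the merged model and tokenizer.
model.save_pretrained("./readable-ai-merged")
tokenizer.save_pretrained("./readable-ai-merged")

# Reload anywhere with plain transformers; no peft required.
from transformers import AutoModelForCausalLM, AutoTokenizer

standalone = AutoModelForCausalLM.from_pretrained("./readable-ai-merged")
standalone_tokenizer = AutoTokenizer.from_pretrained("./readable-ai-merged")
```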

-----

## ❤️ Support the Project

If you find this project useful, want to support server costs, or just want to buy me a coffee, donations are appreciated.

**BSC (Binance Smart Chain) Address:**
`0x1f7fa6d01f02583b48e0343a9e42cbd408ef3bfb`

-----

## 📄 License

This project is licensed under the MIT License.

-----