Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/dispatchcode/x64-instruction-decoder

An x86/x64 instruction disassembler written in C
https://github.com/dispatchcode/x64-instruction-decoder

architectures assembly c disassembler disassembler-library instruction-decoding instruction-set low-level machine-code reverse-engineering x64 x86

Last synced: 4 days ago
JSON representation

An x86/x64 instruction disassembler written in C

Awesome Lists containing this project

README

        

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

# x64ID ~ x64 Instruction Decoder

A x86/x64 machine code decoder. It is useful to get instructions' length and identify each of its fields.

Here some scenarios where x64ID can be used:

- write your disassembler from scratch
- as a base for a VM protection [[1]](#user-content-res1).
- reverse engineering scenarios
- swapping instructions with others (eg. substitution like `MOV EAX, 0` with `XOR EAX, EAX`)
- get mnemonic rapresentation (*currently not implemented*)
- others (as ideas will come to mind...)

___

- [x64ID ~ x64 Instruction Decoder](#x64id--machine-code-analyzer)
* [Supported architectures and features](#supported-architectures-and-features)
+ [Features on development](#features-on-development)
* [API](#api)
+ [Instruction struct](#instruction-struct)
- [`REX` union](#rex-union)
- [`ModRm` union](#modrm-union)
- [`SIB` union](#sib-union)
- [`vex_info` struct](#vex_info-struct)
* [Examples](#examples)
- [A practical example: sum of two vectors using SIMD instruction](#a-practical-example-sum-of-two-vectors-using-simd-instruction)
- [Another example: architecture x64, VEX prefix with YMM register](#another-example-architecture-x64-vex-prefix-with-ymm-register)
* [Enabling / Disabling features](#enabling--disabling-features)
* [Function Length detection](#find-function-length) 🌟*New❗*🌟
* [Tests](#tests)
* [Useful resources](#useful-resources)
* [Notes](#notes)

## Supported architectures and features

**Architectures**:

βœ… x86

βœ… x64

**Opcodes**:

βœ… 1-byte OPs

βœ… 2-byte OPs

βœ… 3-byte OPs, 0x38 and 0x3A

**Fields**:

βœ… prefixes

βœ… VEX prefix (0xC4, 0xC5)

βœ… ModRm

βœ… REX prefix

βœ… SIB

βœ… Imm

βœ… Disp

❌ XOP prefix

**Instruction Set**:

βœ… x86 & x64

βœ… SIMD extension

βœ… AVX extension

❌ AVX-512 (EVEX prefix)

❌ 3DNow!

### Features on development

🎯 XOP support

🎯 AVX-512 (EVEX prefix)

🎯 Machine code to assembly mnemonics

🎯 Others (as ideas will come to mind...)

## API

x64ID exposes only one function and some structs to complete its goal:

```C
int x64id_decode(struct instruction *instr, enum supported_architecture arch, char *data_src, int offset);
```

| Parameter | Type | Explanation | Required |
|----------------|:--------:|-------------|:--------:|
| `instr` | `struct instruction` | A reference to the struct that will contain the analysis result. See [below](#instruction-struct) for more informations. | YES |
| `arch` | `enum supported_architecture` | The achitecture type. Use `1` for `x86` and `2` for `x64` | YES |
| `data_source` | `char*` | A data buffer with the data to be analyzed | YES |
| `offset` | `int` | An offset to be added to the starting address of `data_buffer` | NO |

**Return**:

`x64id_decode` returns the length of the decoded instruction. Its value can also be accessed from `instr.length`.

> :information_source: **Notes**
> Internally, x64ID does not use dynamic allocation to avoid overhead.

___

### Instruction struct

Here below how you can use the struct. More infos and the other structs can be found in the [header file](https://github.com/DispatchCode/Machine-Code-Analyzer/blob/master/src/x64id.h#L355).

| Field Name | Type | Description |
|-------------------|:-----------------:|-------------|
| `prefixes` | `uint8_t[4]` | Store the prefixes of the instrucion, like Segment Override, Address Size and 2 and 3-byte escape opcodes (0x0f, 0x38, 0x3A) |
| `rex` | `union` | Union of five fields: `value`, `rex_b`, `rex_r`, `rex_x`, `rex_w` (see [below](#rex-union)). |
| `op` | `uint8_t` | The opcode of the instruction |
| `modrm` | `union` | Union of four fields: `value`, `rm`, `reg` and `mod` (see [below](#modrm-union)). |
| `disp` | `uint64_t` | Displacement field |
| `imm` | `uint64_t` | Immediate value (a number) |
| `label` | `uint32_t` | Address of Jcc/JMP, if present |
| `_vex` | `struct vex_info` | Available only if [`_ENABLED_VEX_INFO`](#enabling--disabling-features) is defined. Described [below](#vex_info-struct). |
| `instr` | `uint8_t[15]` | Available only if [`_ENABLE_RAW_BYTES`](#enabling--disabling-features) is defined. |
| `sib` | `union` | Union of four fields: `value`, `base`, `index` and `scaled` (see [below](#sib-union)). |
| `vex` | `uint8_t[3]` | 0xC4 or 0xC5 followed by 1 or 2 bytes |
| `length` | `int` | The instruction length (in bytes) |
| `disp_len` | `int` | The displacement size (in bytes) |
| `imm_len` | `int` | The imm size |
| `vex_cnt` | `int8_t` | Count how many VEX prefixes are available |
| `prefix_cnt` | `int8_t` | Count how many prefixes are available |
| `set_prefix` | `uint16_t` | A field against which is possible to check if a determined prefix (belonging to `prefixes` enum) is present. |
| `set_field` | `uint16_t` | A field against which is possible to check if a determined feature (belonging to `instruction_feature` enum) is available (e.g. FPU, SIB, DISP,...) |
| `jcc_type` | `uint8_t` | The type of jump: Jcc or JMP with 1 or 2-bytes (refer to jmp_type enum)

___

#### `REX` union

| Field Name | Type | Description |
|-------------------|:----------:|-------------|
| `rex.value` | `uint8_t` | The `rex` prefix if present (x64 only) |
| `rex.bits.rex_b` | `uint8_t` | `rex_b` field |
| `rex.bits.rex_x` | `uint8_t` | `rex_x` field |
| `rex.bits.rex_r` | `uint8_t` | `rex_r` field |
| `rex.bits.rex_w` | `uint8_t` | `rex_w` field |

For more information on REX prefix, refer to section *2.2.1 REX Prefixes* of the Intel Developer Manual Vol.2 [[2]](#user-content-res2).

___

#### `ModRm` union

| Field Name | Type | Description |
|-------------------|:----------:|-------------|
| `modrm.value` | `uint8_t` | The ModRm value |
| `modrm.bits.rm` | `uint8_t` | The `rm` part of ModRm |
| `modrm.bits.reg` | `uint8_t` | The `reg` part of ModRm |
| `modrm.bits.mod` | `uint8_t` | The `mod` part of ModRm. When mod=11b source and destination are registers, otherwise one of the operands involves memory access (displacement field) |

More information on ModRm field can be found at the section *2.1.3 ModR/M and SIB Bytes* of the Intel Developer Manual Vol.2 [[2]](#user-content-res2).

___

#### `SIB` union

| Field Name | Type | Description |
|-------------------|:----------:|-------------|
| `sib.value` | `uint8_t` | If present, is the Scaled Index Base |
| `sib.bits.base` | `uint8_t` | `base` field |
| `sib.bits.index` | `uint8_t` | `index` field |
| `sib.bits.scaled` | `uint8_t` | `scaled` field |

For more information refer to section *2.1.5 Addressing-Mode Encoding of ModR/M and SIB Bytes* of the Intel Developer Manual Vol.2 [[2]](#user-content-res2).

___

#### `vex_info` struct

| Field Name | Type | Description |
|----------------------|:-----------------:|-------------|
| `type` | `uint8_t` | `0xC4` used when 3-byte prefix is present or `0xC5` used when 2-byte prefix is present |
| `vexc5b` | `struct` | |
| `_vex.val5` | `uint8_t` | The byte after `0xC5` with its filds described below |
| `_vex.vexc5b.vex_pp` | `uint8_t` | Equivalent to a SIMD prefix: `00`: none, `01`: 0x66, `02`: 0xF3, `03`: 0xF2 |
| `_vex.vexc5b.vex_l` | `uint8_t` | 0 for 128-bit vector or 1 for 256-bit vector |
| `_vex.vexc5b.vex_v` | `uint8_t` | An additional operand for the instruction |
| `_vex.vexc5b.vex_r` | `uint8_t` | |
| `_vex.val4` | `uint16_t` | |
| `_vex.vexc4b.vex_pp` | `uint8_t` | |
| `_vex.vexc4b.vex_l` | `uint8_t` | |
| `_vex.vexc4b.vex_v` | `uint8_t` | |
| `_vex.vexc4b.vex_r` | `uint8_t` | |
| `_vex.vexc4b.vex_m` | `uint8_t` | Values: 00001: implied 0F leading opcode byte, 00010: implied 0F 38 leading opcode bytes, 00011: implied 0F 3A leading opcode bytes. Other values will #UD. |
| `_vex.vexc4b.vex_b` | `uint8_t` | |
| `_vex.vexc4b.vex_x` | `uint8_t` | |
| `_vex.vexc4b.vex_r` | `uint8_t` | |

For all the details about VEX prefix look at section **2.3.5 The VEX Prefix** of the Intel Developer Manual Vol.2 [[2]](#user-content-res2).
___

## Examples

#### A practical example: sum of two vectors using SIMD instruction

Lets have a pratical example, the sum of two vectors (using inline assembly):

```C
// ... omitted code ...
int vect1[LEN] = {1,2,3,4,5,6,7,8,9,10,11,12};
int vect2[LEN] = {1,2,3,4,5,6,7,8,9,10,11,12};
int res_vect1[LEN];

__asm
{
lea eax, vect1
lea ebx, vect2
xor ecx, ecx

_while:
cmp ecx, LEN * 4
jge _end

movups xmm0, [eax + ecx]
movups xmm1, [ebx + ecx]
addps xmm0, xmm1
movups [res_vect1 + ecx], xmm0
add ecx, 4
jmp _while

_end:
}
// ... omitted code ...
```

Compiling through MS Compiler (with `/Ot` flag), the result will be what follows:

```Assembly
CPU Disasm
Address Hex dump Command Comments
008910BC |. C785 68FFFFFF 00000000 MOV DWORD PTR SS:[LOCAL.38],0
008910C6 |. 8D45 CC LEA EAX,[LOCAL.13]
008910C9 |. 8D5D 9C LEA EBX,[LOCAL.25]
008910CC |. 33C9 XOR ECX,ECX
008910CE |> 83F9 30 /CMP ECX,30
008910D1 |. 7D 18 |JGE SHORT 008910EB
008910D3 |. 0F100408 |MOVUPS XMM0,DQWORD PTR DS:[ECX+EAX]
008910D7 |. 0F100C0B |MOVUPS XMM1,DQWORD PTR DS:[ECX+EBX]
008910DB |. 0F58C1 |ADDPS XMM0,XMM1
008910DE |. 0F11840D 6CFFFFFF |MOVUPS DQWORD PTR SS:[ECX+EBP-94],XMM0
008910E6 |. 83C1 04 |ADD ECX,4
008910E9 |.^ EB E3 \JMP SHORT 008910CE

```

We can write a sample code that uses x64ID to read and print the instructions.

```C
int offset = 0x4bc;
int parse_bytes = 0x2a;
int byte_reads = 0;

while(byte_reads <= parse_bytes) {
struct instruction instr;
x64id_decode(&instr, arch, (char*)data_buffer, offset);

for(int i=0; i :information_source: **Notes**
> Jump Table are not handled; be careful when you use switch case and compiling with MSVC (GCC/MinGw seems use other techniques).
> Handling jmp table require heuristics (eg. as IDA do and other tools) and more info on the target.
## Tests

After googling for a better solution, I came back with one of the first things I was thinking: assembly.

Tests have been written using NASM and must be compiled using the "bin" flag:

```sh
nasm -f bin
```

The tested instructions are the following:

* x86: `1-byte OP`, `2-byte OP`, `3-byte OP` and `2-byte OP with VEX prefix`
* x64: `1-byte OP`, `2-byte OP`, `3-byte OP`; some of which have `VEX prefix`

Tests have been written by hand using the Intel Developer Manual book [[2]](#user-content-res2).
I can't guarantee a 100% coverage, however all the opcodes have been tested.

## Useful resources

- [X86-64 Instruction Encoding](https://wiki.osdev.org/X86-64_Instruction_Encoding)
- [x86_64 Instruction Table](https://c9x.me/x86/)
- [Compiler Explorer](https://godbolt.org/)

## Notes

[1] By VM protection is meant a code obscator that converts x86/x64 machine code into "virtual opcodes" that are understandable by a VM. Two commercial examples can be [VMProtect](https://vmpsoft.com/) and [CodeVirtualizer](https://www.oreans.com/codevirtualizer.php)

[2] [Intel Developer Manual (2nd book)](https://software.intel.com/content/dam/develop/public/us/en/documents/334569-sdm-vol-2d.pdf)

___

_Crafted with ❀ by DispatchCode. Documentation created along with [Alexander Cerutti](https://github.com/alexandercerutti)_