https://github.com/noloader/aes-intrinsics

AES encryption function using Intel, ARMv8 and Power8 intrinsics

aes-intrinsics aes-power8 armv8 c crypto cryptography power8 powerpc x86 x86-64


# AES-Power8

This is a test implementation of Power 8's in-core crypto using xlC and GCC built-ins.

The test implementation sidesteps key scheduling by using a pre-expanded "golden" key from FIPS 197, Appendix B. The golden key is the big-endian byte array `2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c`, and it produces the key schedule hard-coded in the program.
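For readers who want to check the hard-coded schedule, the following is a minimal portable sketch of the AES-128 key expansion from FIPS 197. It is not taken from `fips197-p8.c`; the names `gf_mul`, `sbox` and `expand_key_128` are hypothetical, and the S-box is computed on the fly (slowly) instead of using a lookup table so the example stays short.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t hi = a & 0x80;
        a = (uint8_t)(a << 1);
        if (hi)
            a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box: multiplicative inverse (brute force) then the affine transform. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 0; /* 0 has no inverse and maps to 0 */
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Expand a 16-byte key into 11 subkeys (176 bytes), in big-endian byte order. */
void expand_key_128(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 0x01;
    for (size_t i = 0; i < 16; i++)
        rk[i] = key[i];
    for (size_t i = 16; i < 176; i += 4) {
        uint8_t t[4] = { rk[i - 4], rk[i - 3], rk[i - 2], rk[i - 1] };
        if (i % 16 == 0) {
            /* RotWord, SubWord, then XOR in the round constant. */
            uint8_t t0 = t[0];
            t[0] = (uint8_t)(sbox(t[1]) ^ rcon);
            t[1] = sbox(t[2]);
            t[2] = sbox(t[3]);
            t[3] = sbox(t0);
            rcon = gf_mul(rcon, 2);
        }
        for (size_t j = 0; j < 4; j++)
            rk[i + j] = (uint8_t)(rk[i - 16 + j] ^ t[j]);
    }
}
```

Feeding the golden key to `expand_key_128` should give `a0 fa fe 17` as the first four bytes of the second subkey, which matches the key expansion example in FIPS 197, Appendix A.1.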

The GCC Compile Farm (http://gcc.gnu.org/wiki/CompileFarm) offers two test machines. To test on a Power 8 little-endian machine use GCC112. To test on a big-endian machine use GCC119.

According to data from GCC112, the naive implementation provided by `fips197-p8.c` achieves about 6 cycles-per-byte (cpb). It is mostly dull, but it's still better than 20 to 30 cpb for C and C++. Running 4 or 8 blocks in parallel will increase performance to around 1 to 1.5 cpb.

## Compiling

To compile the source file using GCC:

```
gcc -std=c99 -mcpu=power8 fips197-p8.c -o fips197-p8.exe
```

To compile the source file using IBM XL C/C++:

```
xlc -qarch=pwr8 -qaltivec fips197-p8.c -o fips197-p8.exe
```

## Decryption

The decryption routines are mostly a copy and paste of the encryption routines using the appropriate inverse function. However, you must build the key table using the algorithm discussed in FIPS 197, Sections 5.3.1 through 5.3.4 (pp. 20-23). You cannot use the "Equivalent Inverse Cipher" from Section 5.3.5 (p. 23).

If you use the same key table as built for encryption, then you should index the subkey table in reverse order. That is, start with index `rounds`, then `rounds-1`, ..., then index `1`, and finally index `0`. (Remember, there are `N+1` subkeys for `N` rounds of AES).
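The reverse indexing can be sketched as a small helper. This assumes the subkey table is a flat array of 16-byte subkeys in encryption order; the name `decrypt_subkey` and that layout are hypothetical and not taken from the test program.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the subkey to use at decryption step `step` (0 through `rounds`)
 * when `subkeys` holds the rounds+1 encryption subkeys in encryption order.
 * Step 0 maps to index `rounds`, the last step maps to index 0.
 */
const uint8_t *decrypt_subkey(const uint8_t *subkeys, size_t rounds, size_t step)
{
    return subkeys + 16 * (rounds - step);
}
```

For AES-128 (`rounds` is 10) the first decryption step reads the subkey at byte offset 160 and the final step reads the subkey at offset 0.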

## Byte Order

The VSX unit only operates on big-endian data. However, on a little-endian machine the CPU loads a VSX register in little-endian format by default, so each 16-byte buffer must be byte-reversed on its way into the register. Conversely, on little-endian machines the data must be stored in little-endian format when moving from a VSX register to memory.

You have two options for reversing the data so that it is properly loaded into, or saved from, a VSX register. First, you can reverse the in-memory byte buffer. Second, you can load the byte buffer and then permute the vector.

A derivative of the test program used the first strategy for the subkey table. The subkey table is converted to big endian once so each subkey does not need a permute after loading. It was an optimization that benefited multiple encryptions under the same key. The test program used the second strategy on user data like input and output buffers.
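The two strategies can be modeled in portable C as below. This is only a sketch: the function names are hypothetical, and `permute16` merely emulates in scalar code what a single vector permute would do in a register on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Strategy 1: reverse a 16-byte block in place, in memory. */
void reverse16(uint8_t block[16])
{
    for (size_t i = 0; i < 8; i++) {
        uint8_t t = block[i];
        block[i] = block[15 - i];
        block[15 - i] = t;
    }
}

/* Strategy 1 applied to a whole subkey table, done once after key setup. */
void reverse_subkeys(uint8_t *table, size_t count)
{
    for (size_t i = 0; i < count; i++)
        reverse16(table + 16 * i);
}

/* Strategy 2 (emulated): out[i] = in[mask[i]], as a byte permute would do. */
void permute16(uint8_t out[16], const uint8_t in[16], const uint8_t mask[16])
{
    for (size_t i = 0; i < 16; i++)
        out[i] = in[mask[i] & 0x0f];
}

/* Mask that reverses all 16 bytes when used with permute16. */
static const uint8_t REVERSE_MASK[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};
```

Both routes produce the same bytes. The in-memory reversal pays its cost once per buffer, which suits a long-lived subkey table, while the permute fits data that is loaded, used and discarded.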

For general reading on byte ordering, see "Targeting your applications - what little endian and big endian IBM XL C/C++ compiler differences mean to you" (http://www.ibm.com/developerworks/library/l-ibm-xl-c-cpp-compiler/index.html).

## Optimizations

There are at least two optimizations your program should make. The first is to perform the byte reversal of the subkey table once, after it is built, on little-endian machines. You will still need to perform the endian conversions on user-supplied input and output buffers as the data is streamed into the program.

The second optimization is to run 4 or 8 blocks of encryption or decryption in parallel. The VSX unit has 32 full-size registers, so you should be able to raise the number of simultaneous transformations to 12 if desired.

As an example, instead of a single loop operating on a single block:

```
VectorType s = VectorLoad(input);
VectorType k = VectorLoadKey(subkeys);

s = VectorXor(s, k);
for (size_t i=1; i