Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/noloader/aes-intrinsics
AES encryption function using Intel, ARMv8 and Power8 intrinsics
- Host: GitHub
- URL: https://github.com/noloader/aes-intrinsics
- Owner: noloader
- Created: 2017-09-16T09:24:16.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-05T13:38:49.000Z (9 months ago)
- Last Synced: 2024-04-16T00:40:15.186Z (8 months ago)
- Topics: aes-intrinsics, aes-power8, armv8, c, crypto, cryptography, power8, powerpc, x86, x86-64
- Language: C
- Size: 24.4 KB
- Stars: 43
- Watchers: 5
- Forks: 10
- Open Issues: 3
Metadata Files:
- Readme: README-p8.md
README
# AES-Power8
This is a test implementation of Power 8's in-core crypto using xlC and GCC built-ins.
The test implementation sidesteps key scheduling by using a pre-expanded "golden" key from FIPS 197, Appendix B. The golden key is the big-endian byte array `2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c`, and it produces the key schedule hard-coded in the program.
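For readers who want to see where the hard-coded schedule comes from, the following is a minimal sketch of the AES-128 key expansion from FIPS 197 in portable C. It is not part of the test program. The names `gf_mul`, `sub_byte`, `sub_word` and `aes128_expand_key` are hypothetical, and the S-box is computed on the fly rather than taken from a table, which is slow but keeps the example short.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* AES S-box: multiplicative inverse (x^254, with 0 mapping to 0), then the affine map. */
static uint8_t sub_byte(uint8_t x)
{
    uint8_t inv = 1, s;
    for (int i = 0; i < 254; i++)
        inv = gf_mul(inv, x);
    s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Apply the S-box to each byte of a 32-bit word. */
static uint32_t sub_word(uint32_t w)
{
    return ((uint32_t)sub_byte((uint8_t)(w >> 24)) << 24) |
           ((uint32_t)sub_byte((uint8_t)(w >> 16)) << 16) |
           ((uint32_t)sub_byte((uint8_t)(w >> 8)) << 8) |
            (uint32_t)sub_byte((uint8_t)w);
}

/* Expand a 16-byte key into 44 big-endian words, i.e. 11 round keys. */
void aes128_expand_key(const uint8_t key[16], uint32_t w[44])
{
    uint8_t rcon = 0x01;

    for (int i = 0; i < 4; i++)
        w[i] = ((uint32_t)key[4 * i] << 24) | ((uint32_t)key[4 * i + 1] << 16) |
               ((uint32_t)key[4 * i + 2] << 8) | (uint32_t)key[4 * i + 3];

    for (int i = 4; i < 44; i++) {
        uint32_t t = w[i - 1];
        if (i % 4 == 0) {
            /* RotWord, SubWord, then XOR with the round constant. */
            t = sub_word((t << 8) | (t >> 24)) ^ ((uint32_t)rcon << 24);
            rcon = gf_mul(rcon, 2);
        }
        w[i] = w[i - 4] ^ t;
    }
}
```

Called with the golden key, `w[4]` through `w[7]` should hold the first derived round key, which can be compared against the key expansion example in FIPS 197.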
The GCC Compile Farm (http://gcc.gnu.org/wiki/CompileFarm) offers two test machines. To test on a Power 8 little-endian machine use GCC112. To test on a big-endian machine use GCC119.
According to data from GCC112, the naive implementation provided by `fips197-p8.c` achieves about 6 cycles-per-byte (cpb). It is mostly dull, but it's still better than the 20 to 30 cpb of C and C++ implementations. Running 4 or 8 blocks in parallel will increase performance to around 1 to 1.5 cpb.
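As a reminder of how a cycles-per-byte figure is derived, here is a small sketch. The function name `cycles_per_byte` and its parameters are hypothetical; the clock frequency and elapsed time are assumed to come from your own measurement, and the sketch does not describe how the numbers above were collected.

```c
#include <stddef.h>

/* cpb = (elapsed seconds * clock frequency in Hz) / bytes processed.
 * Returns 0.0 when no bytes were processed to avoid dividing by zero. */
double cycles_per_byte(double seconds, double clock_hz, size_t bytes)
{
    if (bytes == 0)
        return 0.0;
    return (seconds * clock_hz) / (double)bytes;
}
```

For instance, processing 500,000,000 bytes in one second on a core assumed to run at 3 GHz works out to 6 cpb.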
## Compiling
To compile the source file using GCC:
    gcc -std=c99 -mcpu=power8 fips197-p8.c -o fips197-p8.exe
To compile the source file using IBM XL C/C++:
    xlc -qarch=pwr8 -qaltivec fips197-p8.c -o fips197-p8.exe
## Decryption
The decryption routines are mostly a copy and paste of the encryption routines using the appropriate inverse function. However, you must build the key table using the algorithm discussed in FIPS 197, Sections 5.3.1 through 5.3.4 (pp. 20-23). You cannot use the "Equivalent Inverse Cipher" from Section 5.3.5 (p. 23).
If you use the same key table as built for encryption, then you should index the subkey table in reverse order. That is, start with index `rounds`, then `rounds-1`, ..., then index `1`, and finally index `0`. (Remember, there are `N+1` subkeys for `N` rounds of AES).
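One way to picture the reversed indexing is to copy the encryption subkeys into a second table in reverse order, so that a decryption loop can walk it forward. The sketch below is only an illustration in portable C. The function name `reverse_subkey_table` is hypothetical and the test program may simply index backwards instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy rounds+1 subkeys of 16 bytes each from enc to dec in reverse order:
 * dec[0] = enc[rounds], dec[1] = enc[rounds-1], ..., dec[rounds] = enc[0]. */
void reverse_subkey_table(const uint8_t enc[][16], uint8_t dec[][16], size_t rounds)
{
    for (size_t i = 0; i <= rounds; i++)
        memcpy(dec[i], enc[rounds - i], 16);
}
```

For AES-128, `rounds` is 10, so both tables hold 11 subkeys.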
## Byte Order
The VSX unit only operates on big-endian data. However, on a little-endian machine the CPU loads a VSX register in little-endian format by default, so each 16-byte buffer must be byte-reversed. Conversely, on little-endian machines the data must be stored in little-endian format when moving from a VSX register to memory. You have two options for reversing the data so that it is properly loaded into, or saved from, a VSX register. First, you can reverse the in-memory byte buffer. Second, you can load the byte buffer and then permute the vector.
A derivative of the test program used the first strategy for the subkey table. The subkey table is converted to big endian once so each subkey does not need a permute after loading. It was an optimization that benefited multiple encryptions under the same key. The test program used the second strategy on user data like input and output buffers.
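The two strategies can be modelled in portable C as follows. This is a sketch under stated assumptions, not the code from the test program: all function names are hypothetical, and `permute16` only imitates a byte permute on a plain array, whereas real code would use the compiler's vector permute built-in on a vector register.

```c
#include <stddef.h>
#include <stdint.h>

/* Runtime check: returns 1 on a little-endian host, 0 on a big-endian host. */
int is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Reverse a 16-byte block in place. */
void reverse16(uint8_t block[16])
{
    for (size_t i = 0; i < 8; i++) {
        uint8_t t = block[i];
        block[i] = block[15 - i];
        block[15 - i] = t;
    }
}

/* Strategy 1: reverse each 16-byte subkey in memory once, after the table
 * is built, and only on little-endian hosts. */
void subkeys_to_big_endian(uint8_t *subkeys, size_t count)
{
    if (!is_little_endian())
        return;
    for (size_t i = 0; i < count; i++)
        reverse16(subkeys + 16 * i);
}

/* Strategy 2 (model): out[i] = in[mask[i]], a stand-in for permuting a
 * vector after it has been loaded. in and out must not overlap. */
void permute16(const uint8_t in[16], const uint8_t mask[16], uint8_t out[16])
{
    for (size_t i = 0; i < 16; i++)
        out[i] = in[mask[i] & 0x0f];
}
```

With the mask `{15, 14, ..., 1, 0}`, `permute16` gives the same result as `reverse16`, which is why either strategy can be used for a given buffer.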
For general reading on byte ordering, see "Targeting your applications - what little endian and big endian IBM XL C/C++ compiler differences mean to you" (http://www.ibm.com/developerworks/library/l-ibm-xl-c-cpp-compiler/index.html).
## Optimizations
There are at least two optimizations available that your program should take. The first optimization is to perform the byte reversal on little-endian machines for the subkey table once after it is built. You will still need to perform the endian conversions on user-supplied input and output buffers as the data is streamed into the program.
The second optimization your program should take is to run 4 or 8 blocks of encryption or decryption in parallel. The VSX unit has 32 full size registers, so you should be able to raise the number of simultaneous transformations to 12 if desired.
As an example, instead of a single loop operating on a single block:
```
VectorType s = VectorLoad(input);
VectorType k = VectorLoadKey(subkeys);
s = VectorXor(s, k);
for (size_t i=1; i