https://github.com/foo123/code-optimization-methods

A summary of code optimization methods

## Code Optimization Methods

*A summary of various code optimization methods*

### Translations

* [Turkish](/tr.md) (by [@umutphp](https://github.com/umutphp))

### Contents

* [General Principles](#general-principles)
* [Low-level](#low-level)
* [Language-dependent optimization](#language-dependent-optimization)
* [Language-independent optimization](#language-independent-optimization)
* [Databases](#databases)
* [Web](#web)
* [References](#references)

__A note on the relation between Computational Complexity of Algorithms and Code Optimization Techniques__

Both computational complexity theory [2](#r2 "Computational complexity theory, wikipedia") and code optimization techniques [1](#r1 "Code optimization, wikipedia") have the common goal of solving problems efficiently. While they are related to each other and share some concepts, the difference lies in what each one emphasizes.

Computational complexity theory studies performance with respect to the size of the input data. It tries to design algorithmic solutions whose cost depends as weakly as possible on the data size, regardless of the underlying architecture. Code optimization techniques, on the other hand, focus on the architecture and on the specific constants that enter those computational complexity estimations.

Systems that operate in real time are examples where both factors can be critical (e.g. [Real-time Image Processing in JavaScript](https://github.com/foo123/FILTER.js), yes, I know, I am the author :) ).

### General Principles

* __Keep it `DRY` and Cache__ : The general concept of caching involves avoiding the re-computation / re-loading of a result when it is not necessary. This can be seen as a variation of the Don't Repeat Yourself principle [3](#r3 "DRY principle, wikipedia"). Even dynamic programming can be seen as a variation of caching, in the sense that it stores intermediate results, saving re-computation time and resources.

* __`KISS` it, simpler can be faster__ : Keeping it simple [4](#r4 "KISS principle, wikipedia") makes various other techniques easier to apply and modify. (A plethora of Software Engineering methods can help in this [34](#r34 "Software development philosophies, wikipedia"), [35](#r35 "97 Things every programmer should know"), [55](#r55 "Stream processing, wikipedia"), [56](#r56 "Dataflow programming, wikipedia").)

* __Sosi ample free orginizd, So simple if re-organized__ : Don't hesitate to re-organize if needed. Many times something can be re-organized or re-structured in a much simpler / faster way while retaining its original functionality (concept of isomorphism [22](#r22 "Isomorphism, wikipedia"), change of representation [23](#r23 "Representation, wikipedia"), [24](#r24 "Symmetry, wikipedia"), [36](#r36 "Data structure, wikipedia")), yet providing other advantages. For example, the expression `(10+5*2)^2` is simply the constant `400`; another example is the transformation from `infix` expression notation to `prefix` (Polish) notation, which can be parsed faster, in a single pass.

* __Divide into subproblems and Conquer the solution__ : Subproblems (or smaller, trivial, special cases) can be easier / faster to solve, and their solutions can then be combined into the global solution. Sorting algorithms are great examples of that [5](#r5 "Divide and conquer algorithm, wikipedia"), [sorting algorithms](https://github.com/foo123/SortingAlgorithms).

* __More helping hands are always welcome__ : Spread the workload: subdivide, share, parallelize if possible [6](#r6 "Parallel computation, wikipedia"), [54](#r54 "Heterogeneous computing, wikipedia"), [68](#r68 "A Practical Wait-Free Simulation for Lock-Free Data Structures"), [69](#r69 "A Highly-Efficient Wait-Free Universal Construction"), [70](#r70 "A Methodology for Creating Fast Wait-Free Data Structures").

* __United we Stand and Deliver__ : Having data together in a contiguous chunk, instead of scattered around here and there, makes it faster to load and process as a single block, instead of (fetching and) accessing many smaller chunks (e.g. cache memory, vector / pipeline machines, database queries) [51](#r51 "Locality of reference, wikipedia"), [52](#r52 "Memory access pattern, wikipedia"), [53](#r53 "Memory hierarchy, wikipedia").

* __A little Laziness never hurt anyone__ : So true: each time a program is executed, only some of its data and functionality are used. Delaying the loading and initialization of data and functionality until they are actually needed (being lazy) can go a long way [7](#r7 "Lazy load, wikipedia").
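
The *Keep it DRY and Cache* principle can be sketched with a small memoization wrapper. The helper name `memoize` is hypothetical (not a library API), and it assumes a function of a single argument that can serve as a `Map` key. Applied to the naive recursive Fibonacci function, each value ends up computed only once, which is also the dynamic-programming view of caching mentioned above.

```javascript
// Hypothetical helper: caches the results of a single-argument function in a Map.
function memoize(fn)
{
    const cache = new Map();
    return function (x) {
        if (cache.has(x)) return cache.get(x); // re-use, don't re-compute
        const result = fn(x);
        cache.set(x, result);
        return result;
    };
}

let calls = 0; // counts how many times the underlying computation actually runs

// naive recursive Fibonacci; the recursive calls go through the memoized wrapper
const fib = memoize(function (n) {
    calls++;
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
});

console.log(fib(50), calls); // 12586269025 51
```

Without the cache the same recursion would make billions of calls for `n = 50`; with it, each of the 51 values `fib(0)..fib(50)` is computed once, at the cost of keeping them in memory.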

__Further Notes__

Before trying to optimize, one has to measure and identify what needs to be optimized, if any. Blind "optimization" can be as good as no optimization at all, if not worse.

That being said, one should always try to optimize and produce efficient solutions. A non-efficient "solution" can be as good as no solution at all, if not worse.

**Pre-optimisation is perfectly valid given pre-knowledge**. For example, knowing in advance that instantiating a whole `class` or `array` is slower than just returning an `integer` with the appropriate information.

Some of the optimization techniques can be automated (e.g. in compilers), while others are better handled manually.

Sometimes there is a trade-off between space and time resources: increasing speed might increase space / memory requirements (__caching__ is a classic example of that).
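
A small JavaScript illustration of trading space for time: a precomputed table of 256 bytes lets the number of set bits of a 32-bit integer be found with four lookups instead of a 32-iteration loop. The names are illustrative, and whether the table version is actually faster depends on the engine and on cache behaviour, so it should be measured.

```javascript
// Precompute (once) the number of set bits for every byte value 0..255.
const BITS_IN_BYTE = new Uint8Array(256);
for (let i = 1; i < 256; i++)
    BITS_IN_BYTE[i] = (i & 1) + BITS_IN_BYTE[i >> 1]; // i >> 1 < i, already computed

// Table version: 4 lookups, one per byte of the 32-bit value.
function popcount32(x)
{
    return BITS_IN_BYTE[x & 0xff] + BITS_IN_BYTE[(x >>> 8) & 0xff] +
           BITS_IN_BYTE[(x >>> 16) & 0xff] + BITS_IN_BYTE[(x >>> 24) & 0xff];
}

// Reference version without the table: one iteration per bit.
function popcountLoop(x)
{
    let count = 0;
    for (let k = 0; k < 32; k++) count += (x >>> k) & 1;
    return count;
}

console.log(popcount32(0xF0F0), popcountLoop(0xF0F0)); // 8 8
```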

The `90-10` (or `80-20` or other variations) rule of thumb states that __`90` percent of the time__ is spent on __`10` percent of the code__ (e.g. a loop). Optimizing this part of the code can result in great benefits (see for example Knuth [8](#r8 "An empirical study of Fortran programs")).
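
Finding that hot `10` percent means measuring. A minimal timing sketch, assuming a JavaScript runtime that provides the global `performance.now()` (modern browsers, Node.js 16 or later); `timeIt` and `sumLoop` are illustrative names. Micro-benchmarks like this are noisy, so a profiler is the better tool for locating hot spots in a real program.

```javascript
// Hypothetical helper: average wall-clock time (in ms) of fn over several runs.
function timeIt(fn, runs = 10)
{
    fn(); // warm-up run, to reduce the influence of one-off costs (e.g. JIT compilation)
    const start = performance.now();
    for (let r = 0; r < runs; r++) fn();
    return (performance.now() - start) / runs;
}

// example workload: sum of the integers 0..999999
function sumLoop()
{
    let s = 0;
    for (let i = 0; i < 1e6; i++) s += i;
    return s;
}

console.log("sumLoop: " + timeIt(sumLoop).toFixed(3) + " ms per run");
```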

One optimization technique (e.g. simplification) can lead to the application of another optimization technique (e.g. constant substitution) and this in turn can lead back to the further application of the first optimization technique (or others). Doors can open.

__References:__ [9](#r9 "Compiler optimizations, wikipedia"), [11](#r11 "Compiler Design Theory"), [12](#r12 "The art of compiler design - Theory and Practice"), [46](#r46 "Optimisation techniques"), [47](#r47 "Notes on C Optimisation"), [48](#r48 "Optimising C++"), [49](#r49 "Programming Optimization"), [50](#r50 "CODE OPTIMIZATION - USER TECHNIQUES")

### Low-level

#### Generalities [44](#r44 "What Every Programmer Should Know About Floating-Point Arithmetic"), [45](#r45 "What Every Computer Scientist Should Know About Floating-Point Arithmetic")

__Data Allocation__

* Disk access is slow (Network access is even slower)
* Main Memory access is faster than disk
* CPU Cache Memory (if exists) is faster than main memory
* CPU Registers are fastest

__Binary Formats__

* Double precision arithmetic is slow
* Single-precision floating-point arithmetic is faster than double precision
* Long integer arithmetic is faster than floating-point
* Short integer and fixed-point arithmetic is faster than long integer arithmetic
* Bitwise arithmetic is fastest

__Arithmetic Operations__

* Exponentiation is slow
* Division is faster than exponentiation
* Multiplication is faster than division
* Addition/Subtraction is faster than multiplication
* Bitwise operations are fastest
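
These orderings are rules of thumb; actual costs depend on the hardware, the compiler and the language runtime. A typical application is replacing multiplication, division and remainder by a power of two with shifts and masks. The JavaScript sketch below assumes non-negative integers that fit in 32 bits, since JavaScript's bitwise operators convert their operands to 32-bit integers; compilers and JIT engines often perform this rewrite on their own.

```javascript
// For non-negative 32-bit integers, arithmetic by powers of two maps to bit operations.
const x = 1234;

const times8 = x << 3;        // same as x * 8 (as long as the result fits in 32 bits)
const div8   = x >> 3;        // same as Math.floor(x / 8)
const mod8   = x & 7;         // same as x % 8, for x >= 0 only
const isEven = (x & 1) === 0; // same as x % 2 === 0

console.log(times8, div8, mod8, isEven); // 9872 154 2 true
```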

#### Methods

* __Register allocation__ : Since registers are the fastest place to keep heavily used data, it is desirable (e.g. in compilers, real-time systems) to allocate some data optimally to the CPU registers during a heavy-load operation. There are various algorithms (based on the graph coloring problem) which automate this kind of optimization. Other times a programmer can explicitly declare that a variable should be kept in a CPU register during some part of an operation (e.g. with C's `register` keyword, which is only a hint to the compiler) [10](#r10 "Register allocation, wikipedia")

* __Single Atom Optimizations__ : This involves various operations which optimize one CPU instruction (atom) at a time. For example, some operands in an instruction can be constants, so their values can be substituted for the variables. Another example is replacing exponentiation by a power of `2` (e.g. `x^2`) with a multiplication (`x*x`), etc.

* __Optimizations over a group of Atoms__ : Similar to the previous one, this kind of optimization involves examining the control flow over a group of CPU instructions and re-arranging it so that the functionality is retained while using simpler / fewer instructions. For example, complex `IF THEN` logic depending on parameters can sometimes be simplified to a single `Jump` instruction, and so on.
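
These rewrites normally happen at the instruction level inside compilers, but a high-level analogue is easy to show in JavaScript: a chain of `if` tests on a small integer can be replaced by a single indexed table lookup, similar in spirit to the jump tables compilers may generate for dense `switch` statements. The function names are illustrative, and the sketch assumes a non-leap year and a valid month number `1..12`.

```javascript
// Branching version: up to 5 comparisons (and branches) per call.
function daysInMonthBranchy(m)
{
    if (m === 2) return 28;
    if (m === 4 || m === 6 || m === 9 || m === 11) return 30;
    return 31;
}

// Table version: the whole if/else logic becomes a single indexed load.
const DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
function daysInMonthTable(m)
{
    return DAYS_IN_MONTH[m - 1];
}

console.log(daysInMonthBranchy(9), daysInMonthTable(9)); // 30 30
```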

### Language-dependent optimization

* Check carefully the **documentation and manual** for the underlying mechanisms the language uses to implement specific features and operations, and use them to estimate the cost of a certain piece of code and of the alternatives provided.

### Language-independent optimization

* __Re-arranging Expressions__ : More efficient code for the evaluation of an expression (or the computation of a process) can often be produced if the operations occurring in the expression are evaluated in a different order. This works because re-arranging the expression / operations changes what gets added or multiplied to what, including the relative number of additions and multiplications, and thus the overall computational cost. In fact, this is not restricted to arithmetic operations: any operations whatsoever can be re-arranged, using symmetries of the process / operators, to produce the same result while having other advantages (commutative, associative and distributive laws, where they indeed hold, are actually examples of arithmetic operator symmetries). That is it, so simple. Classic examples are Horner's Rule [13](#r13 "Horner rule, wikipedia"), Karatsuba Multiplication [14](#r14 "Karatsuba algorithm, wikipedia"), fast complex multiplication [15](#r15 "Fast multiplication of complex numbers"), fast matrix multiplication [18](#r18 "Strassen algorithm, wikipedia"), [19](#r19 "Coppersmith-Winograd algorithm, wikipedia"), fast exponentiation [16](#r16 "Exponentiation by squaring, wikipedia"), [17](#r17 "Fast Exponentiation"), fast gcd computation [78](#r78 "A Binary Recursive Gcd Algorithm"), fast factorials/binomials [20](#r20 "Comments on Factorial Programs"), [21](#r21 "Fast Factorial Functions"), fast Fourier transform [57](#r57 "Fast Fourier transform, wikipedia"), fast Fibonacci numbers [76](#r76 "Fast Fibonacci numbers"), sorting by merging [25](#r25 "Merge sort, wikipedia"), sorting by powers [26](#r26 "Radix sort, wikipedia").

* __Constant Substitution/Propagation__ : Many times an expression evaluates to the same constant in all cases; the constant value can then be substituted for the more complex and slower expression (sometimes compilers do that).

* __Inline Function/Routine Calls__ : Calling a function or routine involves many operations on the part of the CPU: it has to push the current program state onto the stack and branch to another location, and then do the reverse procedure. This can be slow when used inside heavy-load operations; inlining the function body can be much faster, without all this overhead. Sometimes compilers do that, other times a programmer can declare or annotate a function as `inline` explicitly (which compilers may treat only as a hint). [27](#r27 "Function inlining, wikipedia")

* __Combining Flow Transfers__ : `IF/THEN` instructions and logic are, in essence, CPU `branch` instructions. Branch instructions involve changing the program counter (instruction pointer) and going to a new location. This can be slower if many `jump` instructions are used. However, re-arranging the `IF/THEN` statements (factorizing common code, using De Morgan's rules for logic simplification, etc.) can result in *isomorphic* functionality with fewer and more efficient logic and, as a result, fewer and more efficient `branch` instructions.

* __Dead Code Elimination__ : Most times compilers can identify code that is never accessed and remove it from the compiled program. However not all cases can be identified. Using previous simplification schemes, the programmer can more easily identify "dead code" (never accessed) and remove it. An alternative approach to "dead-code elimination" is "live-code inclusion" or "tree-shaking" techniques.

* __Common Subexpressions__ : This optimization involves identifying subexpressions which are common to various parts of the code, evaluating them only once and using the value in all subsequent places (sometimes compilers do that).

* __Common Code Factorisation__ : Many times the same block of code is present in different branches, for example when the program has to do some common functionality and then something else depending on some parameter. This common code can be factored out of the branches, thus eliminating unneeded redundancy, latency and size.

* __Strength Reduction__ : This involves transforming an operation (e.g. an expression) into an equivalent one which is faster. Common cases involve replacing `exponentiation` with `multiplication` and `multiplication` with `addition` (e.g. inside a loop). This technique can result in great efficiency because the simpler but equivalent operations take fewer CPU cycles than their more complex counterparts (some of which, such as general exponentiation, are even implemented in software) [28](#r28 "Strength reduction, wikipedia")

* __Handling Trivial/Special Cases__ : Sometimes a complex computation has some trivial or special cases which can be handled much more efficiently by a reduced / simplified version of the computation (e.g. computing `a^b` can handle the special cases `a,b = 0,1,2` with a simpler method). Trivial cases occur with some frequency in applications, so simplified special-case code can be quite useful [42](#r42 "Three optimization tips for C"), [43](#r43 "Three optimization tips for C, slides"). Similar to this is the handling of common / frequent computations (depending on the application) with fine-tuned or faster code, or even by hardcoding the results directly.

* __Exploiting Mathematical Theorems/Relations__ : Sometimes a computation can be performed in an equivalent but more efficient way by using some mathematical theorem, transformation, symmetry [24](#r24 "Symmetry, wikipedia") or knowledge (e.g. Gauss's method of solving systems of linear equations [58](#r58 "Gaussian elimination, wikipedia"), the Euclidean Algorithm [71](#r71 "Euclidean Algorithm"), or both [72](#r72 "Gröbner basis"), Fast Fourier Transforms [57](#r57 "Fast Fourier transform, wikipedia"), Fermat's Little Theorem [59](#r59 "Fermat's little theorem, wikipedia"), Taylor-Maclaurin series expansions, trigonometric identities [60](#r60 "Trigonometric identities, wikipedia"), Newton's Method [73](#r73 "Newton's Method"), [74](#r74 "Fast Inverse Square Root"), etc. [75](#r75 "Methods of computing square roots")). This can go a long way. It is good to refresh your mathematical knowledge every now and then.

* __Using Efficient Data Structures__ : Data structures are the counterpart of algorithms (in the space domain): each efficient algorithm needs an associated efficient data structure for the specific task. In many cases using an appropriate data structure (representation) can make all the difference (e.g. database designers and search engine developers know this very well) [36](#r36 "Data structure, wikipedia"), [37](#r37 "List of data structures, wikipedia"), [23](#r23 "Representation, wikipedia"), [62](#r62 "Dancing links algorithm"), [63](#r63 "Data Interface + Algorithms = Efficient Programs"), [64](#r64 "Systems Should Automatically Specialize Code and Data"), [65](#r65 "New Paradigms in Data Structure Design: Word-Level Parallelism and Self-Adjustment"), [68](#r68 "A Practical Wait-Free Simulation for Lock-Free Data Structures"), [69](#r69 "A Highly-Efficient Wait-Free Universal Construction"), [70](#r70 "A Methodology for Creating Fast Wait-Free Data Structures"), [77](#r77 "Fast k-Nearest Neighbors (k-NN) algorithm")
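
A few of the techniques above, sketched in JavaScript with illustrative function names: Horner's rule (re-arranging expressions), exponentiation by squaring with the trivial exponents handled separately (a mathematical relation plus special cases), a common subexpression computed once, and strength reduction of a multiplication into a running addition. Optimizing compilers and JIT engines may already perform the last two automatically, so any gain should be measured.

```javascript
// Re-arranging expressions: Horner's rule.
// c[0] + c[1]*x + c[2]*x^2 + ... + c[n]*x^n is evaluated as
// c[0] + x*(c[1] + x*(c[2] + ...)): n multiplications, n additions, no powers.
function horner(c, x)
{
    let result = 0;
    for (let k = c.length - 1; k >= 0; k--)
        result = result * x + c[k];
    return result;
}

// Exponentiation by squaring, with the trivial special cases handled first.
// Assumes an integer exponent 0 <= b < 2^31 (the bitwise operators work on 32-bit ints);
// uses about log2(b) squarings instead of b - 1 multiplications.
function power(a, b)
{
    if (b === 0) return 1;
    if (b === 1) return a;
    if (b === 2) return a * a;
    let result = 1;
    while (b > 0)
    {
        if (b & 1) result *= a; // lowest bit of the exponent is set: multiply it in
        a *= a;                 // next power a^(2^k)
        b >>= 1;                // move on to the next bit of the exponent
    }
    return result;
}

// Common subexpression: (x - y) is computed once instead of three times.
function cseBefore(x, y) { return (x - y) * (x - y) + 3 * (x - y); }
function cseAfter(x, y)  { const d = x - y; return d * d + 3 * d; }

// Strength reduction: the multiplication i * step inside the loop
// is replaced by a running sum that is incremented by step.
function multiplesMul(n, step)
{
    const out = [];
    for (let i = 0; i < n; i++) out.push(i * step);
    return out;
}
function multiplesAdd(n, step)
{
    const out = [];
    for (let i = 0, value = 0; i < n; i++, value += step) out.push(value);
    return out;
}

console.log(horner([1, 2, 3], 2), power(3, 5), cseAfter(5, 2)); // 17 243 18
```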

__Loop Optimizations__

Perhaps the most important code optimization techniques are the ones involving loops.

* __Code Motion / Loop Invariants__ : Sometimes code inside a loop is independent of the loop index (it is a loop invariant); it can then be moved out of the loop and computed only once. This results in the loop doing fewer operations (sometimes compilers do that) [29](#r29 "Loop invariant, wikipedia"), [30](#r30 "Loop-invariant code motion, wikipedia")

__example:__

```javascript

// hypothetical input data, for illustration only
const b = [7];
const a = new Array(1000);
let invariant;

// this can be transformed
for (let i = 0; i < 1000; i++)
{
    invariant = 100*b[0]+15; // this is a loop invariant, not depending on the loop index etc..
    a[i] = invariant+10*i;
}

// into this
invariant = 100*b[0]+15; // now this is out of the loop
for (let i = 0; i < 1000; i++)
{
    a[i] = invariant+10*i; // loop executes fewer operations now
}

```

* __Loop Fusion__ : Sometimes two or more loops over the same range can be combined into a single loop (provided their bodies do not depend on each other's results), thus reducing the number of test and increment instructions executed.

__example:__

```javascript

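// hypothetical output arrays, for illustration only
const a = new Array(1000), b = new Array(1000);

// 2 loops here
for (let i = 0; i < 1000; i++)
{
    a[i] = i;
}
for (let i = 0; i < 1000; i++)
{
    b[i] = i+5;
}

// one fused loop here (same result, half the loop overhead)
for (let i = 0; i < 1000; i++)
{
    a[i] = i;
    b[i] = i+5;
}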

```

* __Unswitching__ : Sometimes a loop containing a condition that does not change inside the loop can be split into two or more loops, of which only one needs to be executed at any time.

__example:__

```javascript

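// hypothetical data, for illustration only
const a = new Array(1000), b = new Array(1000);
const X = 2, Y = 1; // X and Y do not change inside the loop

// only one of the two branches is ever taken, yet the test runs on every iteration
for (let i = 0; i < 1000; i++)
{
    if (X > Y) // this is executed every time inside the loop
        a[i] = i;
    else
        b[i] = i+10;
}

// loop split in two here
if (X > Y) // now executed only once
{
    for (let i = 0; i < 1000; i++)
    {
        a[i] = i;
    }
}
else
{
    for (let i = 0; i < 1000; i++)
    {
        b[i] = i+10;
    }
}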

```

* __Array Linearization__ : This involves handling a multi-dimensional array in a loop as if it were a (simpler) one-dimensional array. Most times multi-dimensional arrays (e.g. `2D` arrays `NxM`) use a linearization scheme when stored in memory. The same scheme can be used to access the array data as if it were one big `1`-dimensional array. This results in using a single loop instead of multiple nested loops [31](#r31 "Array linearisation, wikipedia"), [32](#r32 "Vectorization, wikipedia"), [61](#r61 "The NumPy array: a structure for efficient numerical computation")

__example:__

```javascript

// N = M = 20 (N rows, M columns)
// total size = NxM = 400, stored in a single flat array
const N = 20, M = 20;
const a = new Array(N*M);

// nested loop
for (let i = 0; i < N; i++)
{
    for (let j = 0; j < M; j++)
    {
        // element (i, j) lives at a[i*M + j] in a row-major layout
        // (a column-major layout would use a[i + j*N] instead);
        // in most cases linearization is straight-forward
        a[i*M + j] = 0;
    }
}

// array linearized single loop
for (let k = 0; k < N*M; k++)
    a[k] = 0; // equivalent to previous with just a single loop


```

* __Loop Unrolling__ : Loop unrolling involves reducing the number of executions of a loop by performing the computations corresponding to two (or more) loop iterations in a single loop iteration. This is partial loop unrolling; full loop unrolling eliminates the loop completely and performs all the iterations explicitly in the code (e.g. for small loops where the number of iterations is fixed). Loop unrolling results in the loop (and, as a consequence, all the overhead associated with each loop iteration) executing fewer times. In processors which allow pipelining or parallel computations, loop unrolling can have an additional benefit: the next unrolled iteration can start while the previous unrolled iteration is still being computed or loaded, without waiting for it to finish. Thus loop speed can increase even more [33](#r33 "Loop unrolling, wikipedia")

__example:__

```javascript

// hypothetical input data, for illustration only
const n = 1000;
const b = new Array(n).fill(2), c = new Array(n).fill(3), a = new Array(n);

// "rolled" usual loop
for (let i = 0; i < n; i++)
{
    a[i] = b[i]*c[i];
}

// partially unrolled loop (half the iterations)
for (let i = 0; i < n; i += 2)
{
    a[i] = b[i]*c[i];
    // unrolled the next iteration into the current one and increased the loop step to 2
    a[i+1] = b[i+1]*c[i+1];
}

// sometimes special care is needed to handle cases
// where the number of iterations is NOT an exact multiple of the number of unrolled steps;
// this can be solved by adding the remaining iterations explicitly in the code, after or before the main unrolled loop

```
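
The remainder handling described in the last comment can be sketched as follows. The helper `multiplyUnrolled4` is a hypothetical name; it unrolls by a factor of `4` and finishes the last `n % 4` elements with an ordinary loop, so it works for any length. Compilers and JIT engines may already unroll such loops themselves, so any gain should be measured.

```javascript
// Hypothetical helper: a[i] = b[i]*c[i] for i = 0..n-1,
// unrolled by a factor of 4 with a clean-up loop for the remainder.
function multiplyUnrolled4(a, b, c, n)
{
    const limit = n - (n % 4); // largest multiple of 4 that is <= n
    let i = 0;
    for (; i < limit; i += 4)
    {
        a[i]   = b[i]   * c[i];
        a[i+1] = b[i+1] * c[i+1];
        a[i+2] = b[i+2] * c[i+2];
        a[i+3] = b[i+3] * c[i+3];
    }
    for (; i < n; i++) // clean-up loop: the remaining 0 to 3 iterations
        a[i] = b[i] * c[i];
    return a;
}

// usage: 7 elements = one unrolled step of 4 + 3 remainder iterations
const b = [1, 2, 3, 4, 5, 6, 7], c = [2, 2, 2, 2, 2, 2, 2];
console.log(multiplyUnrolled4(new Array(7), b, c, 7).join(", ")); // 2, 4, 6, 8, 10, 12, 14
```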

### Databases

#### Generalities

Database access can be expensive, so it is usually better to fetch the needed data using as few DB connections and calls as possible.

#### Methods

* __Lazy Load__ : Avoiding DB access unless necessary can be efficient, provided that during the application life-cycle there is a significant frequency of cases where the extra data are not needed or requested.

* __Caching__ : Re-using previously fetched results for the same query, if the data are not critical and a slightly delayed update is tolerable.

* __Using Efficient Queries__ : For relational DBs, the most efficient queries are those that locate the data through an index (or a set of indexes), ideally one by which the data are uniquely indexed in the DB [66](#r66 "10 tips for optimising Mysql queries"), [67](#r67 "Mysql Optimisation").

* __Exploiting Redundancy__ : Adding more helping hands (DBs) to handle the load instead of just one. In effect this means replicating the data in multiple places (creating redundancy), so that the total load can be subdivided and handled independently.
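
A minimal sketch of the *Caching* idea for query results, with a time-to-live so that slightly stale data are served for at most `ttlMs` milliseconds. `cachedQuery`, `runQuery` and the fake database are hypothetical stand-ins for a real DB client; the cache key here is just the query text, and a real implementation would also need to account for query parameters and to invalidate entries on writes.

```javascript
// Hypothetical sketch: cache query results for ttlMs milliseconds.
// `runQuery` stands in for a real (expensive) database call; `now` is injectable for testing.
function cachedQuery(runQuery, ttlMs, now = Date.now)
{
    const cache = new Map(); // query text -> { value, expires }
    return function (sql)
    {
        const hit = cache.get(sql);
        if (hit && hit.expires > now()) return hit.value; // still fresh: skip the DB
        const value = runQuery(sql);
        cache.set(sql, { value, expires: now() + ttlMs });
        return value;
    };
}

// Example with a fake database and a fake clock.
let dbCalls = 0, clock = 0;
const fakeDb = (sql) => { dbCalls++; return [{ id: 1, name: "row for " + sql }]; };
const query = cachedQuery(fakeDb, 5000, () => clock);

query("SELECT * FROM users WHERE id = 1"); // goes to the DB
query("SELECT * FROM users WHERE id = 1"); // served from the cache
clock = 6000;                              // pretend 6 seconds have passed
query("SELECT * FROM users WHERE id = 1"); // entry expired: goes to the DB again
console.log(dbCalls); // 2
```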

### Web

* __Minimal Transactions__ : Data over the internet (and generally over a network) take some time to be transmitted, more so if the data are large. Therefore it is best to transmit only the necessary data, and even these in a compact form. That is one reason why `JSON` largely replaced the verbose `XML` for encoding arbitrary data on the web.

* __Minimum Number of Requests__ : This can be seen as a variation of the previous principle. It means that not only should each request transmit only the necessary data in a compact form, but also that the number of requests should be minimized. This can include combining (and minifying) `.css` files into one `.css` file (even embedding images if needed), combining (and minifying) `.js` files into one `.js` file, etc. This can sometimes produce large files; however, coupled with the next tip, it can result in better performance.

* __Cache, cache and then cache some more__ : This can include everything, from whole pages to `.css` files, `.js` files, images etc.. Cache in the server, cache in the client, cache in-between, cache everywhere..

* __Exploiting Redundancy__ : For web applications, this is usually implemented by exploiting some [cloud architecture](http://en.wikipedia.org/wiki/Cloud_computing) in order to store (static) files, which can be loaded (through the cloud) from more than one location. Other approaches include [Load balancing](http://en.wikipedia.org/wiki/Load_balancing_%28computing%29) (having redundancy not only for static files, but also for servers).

* __Make application code faster/lighter__ : This draws from the previous principles about code optimization in general. Efficient application code can save both server and user resources. There is a reason why Facebook created `HipHop VM` ..

* __Minimalism is an art form__ : Having web pages and applications with tons of html, images, (not to mention a ton of advertisements) etc, etc.. is not necessarily better design, and of course makes page load time slower. Therefore having minimal pages and doing updates in small chunks using `AJAX` and `JSON` (that is what `web 2.0` was all about), instead of reloading a whole page each time, can go a long way. This is one reason why [Template Engines](http://en.wikipedia.org/wiki/Template_engine) and [MVC Frameworks](http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller) were created. Minimalism does not need to sacrifice the artistic dimension, [Minimalism](http://en.wikipedia.org/wiki/Minimalism) __IS__ an art form.
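
A minimal sketch of *Minimum Number of Requests* applied to data requests: individual lookups are queued and then sent together in a single call. `createBatcher`, `fetchMany` and the fake endpoint are hypothetical names; in a real page `flush()` would typically be triggered by a short timer or at the end of the current event-loop tick, and `fetchMany` would perform one HTTP request.

```javascript
// Hypothetical sketch: collect individual lookups and send them as ONE request.
// `fetchMany(ids)` stands in for a real endpoint returning an object keyed by id.
function createBatcher(fetchMany)
{
    let pending = new Map(); // id -> callbacks waiting for that id
    return {
        load(id, callback)
        {
            if (!pending.has(id)) pending.set(id, []);
            pending.get(id).push(callback);
        },
        flush()
        {
            if (pending.size === 0) return;
            const batch = pending;
            pending = new Map();
            const results = fetchMany([...batch.keys()]); // one round trip for all ids
            for (const [id, callbacks] of batch)
                for (const cb of callbacks) cb(results[id]);
        }
    };
}

// Example with a fake endpoint that counts requests.
let requests = 0;
const fakeFetchMany = (ids) => {
    requests++;
    const out = {};
    for (const id of ids) out[id] = "item " + id;
    return out;
};

const batcher = createBatcher(fakeFetchMany);
const received = [];
batcher.load(1, (v) => received.push(v));
batcher.load(2, (v) => received.push(v));
batcher.load(1, (v) => received.push(v)); // duplicate id is fetched only once
batcher.flush();
console.log(requests, received.join(", ")); // 1 item 1, item 1, item 2
```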

[το λακωνίζειν εστί φιλοσοφείν (i.e. to be laconic in speech and deeds, simple and to the point, is the art of philosophy)](https://en.wikipedia.org/wiki/Laconic_phrase)

[![Zen Circle](/zen-circle.jpg)](http://en.wikipedia.org/wiki/Ens%C5%8D)

### References

1. Code optimization, wikipedia
2. Computational complexity theory, wikipedia
3. DRY principle, wikipedia
4. KISS principle, wikipedia
5. Divide and conquer algorithm, wikipedia
6. Parallel computation, wikipedia
7. Lazy load, wikipedia
8. An empirical study of Fortran programs
9. Compiler optimizations, wikipedia
10. Register allocation, wikipedia
11. Compiler Design Theory
12. The art of compiler design - Theory and Practice
13. Horner rule, wikipedia
14. Karatsuba algorithm, wikipedia
15. Fast multiplication of complex numbers
16. Exponentiation by squaring, wikipedia
17. Fast Exponentiation
18. Strassen algorithm, wikipedia
19. Coppersmith-Winograd algorithm, wikipedia
20. Comments on Factorial Programs
21. Fast Factorial Functions
22. Isomorphism, wikipedia
23. Representation, wikipedia
24. Symmetry, wikipedia
25. Merge sort, wikipedia
26. Radix sort, wikipedia
27. Function inlining, wikipedia
28. Strength reduction, wikipedia
29. Loop invariant, wikipedia
30. Loop-invariant code motion, wikipedia
31. Array linearisation, wikipedia
32. Vectorization, wikipedia
33. Loop unrolling, wikipedia
34. Software development philosophies, wikipedia
35. 97 Things every programmer should know
36. Data structure, wikipedia
37. List of data structures, wikipedia
38. Cloud computing, wikipedia
39. Load balancing, wikipedia
40. Model-view-controller, wikipedia
41. Template engine, wikipedia
42. Three optimization tips for C
43. Three optimization tips for C, slides
44. What Every Programmer Should Know About Floating-Point Arithmetic
45. What Every Computer Scientist Should Know About Floating-Point Arithmetic
46. Optimisation techniques
47. Notes on C Optimisation
48. Optimising C++
49. Programming Optimization
50. CODE OPTIMIZATION - USER TECHNIQUES
51. Locality of reference, wikipedia
52. Memory access pattern, wikipedia
53. Memory hierarchy, wikipedia
54. Heterogeneous computing, wikipedia
55. Stream processing, wikipedia
56. Dataflow programming, wikipedia
57. Fast Fourier transform, wikipedia
58. Gaussian elimination, wikipedia
59. Fermat's little theorem, wikipedia
60. Trigonometric identities, wikipedia
61. The NumPy array: a structure for efficient numerical computation
62. Dancing links algorithm
63. Data Interface + Algorithms = Efficient Programs
64. Systems Should Automatically Specialize Code and Data
65. New Paradigms in Data Structure Design: Word-Level Parallelism and Self-Adjustment
66. 10 tips for optimising Mysql queries
67. Mysql Optimisation
68. A Practical Wait-Free Simulation for Lock-Free Data Structures
69. A Highly-Efficient Wait-Free Universal Construction
70. A Methodology for Creating Fast Wait-Free Data Structures
71. Euclidean Algorithm
72. Gröbner basis
73. Newton's Method
74. Fast Inverse Square Root
75. Methods of computing square roots
76. Fast Fibonacci numbers
77. Fast k-Nearest Neighbors (k-NN) algorithm
78. A Binary Recursive Gcd Algorithm