Preliminaries

"God made the integers, all the rest is the work of man."
Leopold Kronecker

Our computers are really good at doing math. When we write programs, we can just throw some equations at the computer and it will dutifully calculate them very fast. But is it fast enough? How can we make it faster? Our computers really are good at math, and they really are fast, but they are better and faster at some things than others. What I have to say here applies mostly to the smaller processors we tend to use on our robots and other embedded devices: AVR (Arduino) processors, PICs, and ARM chips are typical. It also applies to modern PC-type computers, but not as much.

I will be using some terms that you may not be very familiar with, so let's explain those first. First is the "clock cycle," or just "cycle" or "clock." I will use this a lot. Every (well, almost every, but that's another story) processor has a clock, or oscillator, that times all its actions. The faster the clock on any given processor, the faster it can get work done. On older processors it normally took from 2 to 16 clock cycles to perform a single assembly language instruction. Most modern processors like the AVR and ARM can perform most instructions in a single clock cycle, though there are almost always a few that take more than one. A 20 MHz AVR chip has 20 million clock cycles in one second. A 50 MHz LPC1114 has 50 million clock cycles in one second. A 3 GHz PC has 3 billion clock cycles in one second. Performance on different processors varies widely, so counting clock cycles is usually the best way to describe how long something will take, and I will do that frequently.

The next terms are "math unit," "integer math unit," and "floating point math unit," the last abbreviated "FPU." I may also use the term "ALU," for "arithmetic logic unit." All the processors we use have one or more parts inside that actually do the math. Some are very simple and can't do much more than a simple addition. Some are very complex and can do complicated math, like calculating trig functions (sine, cosine, tangent), without help. The processors we will be discussing tend to fall toward the lower end. What arithmetic the processor can do on its own, without help, plays a big part in how we write our code to make it efficient. Now that we have the preliminaries out of the way, let's get to it.

Integers vs Floating Point

This first point is the single most important one to remember. Integer math is almost always faster than floating point, usually a whole lot faster, even on processors that have floating point math units. And most of the processors we use don't have floating point units at all. An ATmega chip has an integer math unit that can add, subtract, and multiply. It can add and subtract 16 bits at a time, and it can multiply two 8 bit numbers, usually in two clock cycles. The 16 bit multiply needed for a normal int variable takes 4 of those multiply instructions plus two or more additions and some other overhead instructions, for a total of about 20 to 30 clock cycles. A 32 bit multiply takes 16 8-bit multiplies and a similar amount of overhead, for a total of about 100 clock cycles. An ATtiny, or any other processor that has no multiply instruction at all, must use a routine that takes about 20 clock cycles just to do a single 8 bit multiply.

For floating point it becomes quite a bit more complex. First, what is a floating point number? There is a standard, IEEE 754, that is almost universally used now. The two most common formats are 32 bit single precision and 64 bit double precision. Floating point numbers work a lot like scientific notation, but instead of powers of ten they use powers of two, which is more convenient for the computer. Some of the bits in each format (24 in 32 bit, 53 in 64 bit) represent the number itself, known as the "mantissa." One bit represents the sign of the entire number. The rest form the exponent: the power of two, positive or negative, that the mantissa is scaled by. The mantissa is stored normalized, shifted so that its most significant bit is a 1. Because that bit is always a 1, there is no need to store it, which gives one "extra" bit of precision but makes processing more complex. Before doing almost any arithmetic on floating point numbers, the parts must be separated and the hidden 1 bit put back into the mantissa. The arithmetic is done, then the parts must be put back together and "normalized," meaning adjusted so the most significant bit is again a 1, and then that bit is removed again. Because of all that aligning and shifting, floating point addition and subtraction often take as long as, or longer than, multiplication. So, basically, we have about four times as many operations to do a floating point calculation as the same size integer calculation, plus more overhead. It typically takes four or more times as long to do floating point math on a processor that does not have an FPU. Even on a powerful 32 bit microcontroller like the NXP LPC1114, which can do a 32 bit integer multiply in a single clock cycle, a 32 bit floating point multiply takes much longer.
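If you are curious, here is a small sketch that pulls a 32 bit IEEE 754 float apart into those three pieces. It is meant to be compiled and run on a PC just to see the layout, not on a microcontroller:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 6.5f;                     /* 6.5 = 1.101b * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* copy the raw bits, no conversion */

    uint32_t sign     = bits >> 31;                              /* 1 bit:  0 = positive */
    int32_t  exponent = (int32_t)((bits >> 23) & 0xFF) - 127;    /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;                         /* 23 stored bits, leading 1 implied */

    printf("sign=%u exponent=%d mantissa=0x%06X\n",
           (unsigned)sign, (int)exponent, (unsigned)mantissa);
    /* prints: sign=0 exponent=2 mantissa=0x500000  (.101... plus the hidden 1 is 1.101b = 1.625) */
    return 0;
}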

Do you remember learning about scientific notation in school? You probably found it a bit more difficult and confusing than working with "normal" numbers. It's the same for the computer. One more thing worth noting: the library code that performs floating point math on our simple processors is rather large, often several kilobytes. If you don't perform ANY floating point math, that code never gets linked in and the space is available for other uses.

Avoid Division Whenever Possible

How about division? Do you remember learning that in school? Addition, subtraction, and multiplication were fairly simple: they have straightforward rules you follow to get the answer. But division took some guesswork, and chances are you still pull out a calculator whenever you need to divide large numbers. Again, the same is true for the computer. The process the computer uses to divide is much like the way we do it, and it even involves some trial and error. As a result, division is more complex and normally takes longer, even on processors that have a hardware divide instruction. On those processors it is not unusual for a divide to take from two to 32 times as long as a multiply, even for integers. And many processors that do have a multiply instruction don't have a divide instruction at all, which makes it take much longer still: the division has to be done by a separate routine. Floating point division is even worse. It might take twice as long as a multiply on a machine with hardware floating point, and far longer on machines without it; a software floating point divide is a rather complex program all by itself. On top of that, it is full of chances to introduce inaccurate results.

It is often rather easy, and almost always much faster, to avoid division while your program is running by doing the dividing ahead of time, especially with floating point numbers. How do we do that? One thing I see a lot is code to convert the number of microseconds an ultrasonic module takes to receive an echo into a distance in millimeters. The code will usually look something like this:

distance_mm = microseconds / 171.0;

That code causes the program to (obviously) divide the number of microseconds by 171.0 (the .0 on the end makes sure the compiler does a floating point division). But what if we put this line up at the start of our program so it only runs once:

float distance_divisor = 1 / 171.0;

and use this line instead:

distance_mm = microseconds * distance_divisor;

We calculate the division only once and replace the division that gets done many times with a multiply! That will speed up our code. The result will be the same (or at least close enough), so we don't lose anything except wasted time. As a bonus, when the compiler sees that 1 / 171.0, it knows all the values. With integers it would do the division BEFORE your program ever runs, while it is compiling. Compilers are more cautious about doing that with floating point, for reasons I won't go into here. This is called "constant folding," in case you care.
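Here is the whole pattern as a minimal sketch. The function name, and the idea that the echo gets timed somewhere else, are just placeholders for whatever your own code does:

/* worked out once, before any measurements are taken */
static const float distance_divisor = 1.0 / 171.0;

float echo_to_mm(unsigned long microseconds)
{
    /* one multiply per reading instead of one divide per reading */
    return microseconds * distance_divisor;
}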

With integers it is a bit harder to convert division into multiplication, but there are some nifty tricks we can use. First, why is it harder? Well, as we saw above, dividing by a number n is the same as multiplying by 1/n. For any integer n bigger than 1, 1/n is less than 1, which in integer arithmetic can only be 0. But how about we use a form of number scaling to accomplish the same thing? Let me explain. What happens when you take a decimal number, say 110, and drop off the rightmost digit, moving (or shifting) the other digits one place to the right? You end up with 11, which happens to be 110 divided by the base of our number system, ten. So an easy way to divide by ten is to "shift" the number right one place. Shift more places and you can divide by any power of ten: 10 (one place), 100 (two), 1000 (three), and so on. What happens if you are using base 2 (binary), the number system used in computers? Well, let's have a look. Since computers deal with fixed numbers of bits, I will write the binary numbers that way. Take the number 27 in binary:
27 = 00011011b
shift all the bits one to the right and fill the empty space on the left with a 0:
00001101b = 13
We lose the 1 that drops off the right and end up with 27/2 with the remainder dropped, or 13. We can divide by any power of 2 that way: 2, 4, 8, 16, 32, 64, etc. Well, that's handy if we need to divide by a power of 2. But what about other numbers? Sometimes a power of 2 is "close enough" and we can just use the nearest one. But what if it isn't?
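Before we answer that, here is the power of 2 case as a quick C sketch, using unsigned values (shifting negative signed numbers does not round quite the same way a divide does):

unsigned int x = 27;

unsigned int half      = x >> 1;   /* 13, same as 27 / 2  */
unsigned int quarter   = x >> 2;   /*  6, same as 27 / 4  */
unsigned int sixteenth = x >> 4;   /*  1, same as 27 / 16 */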

For instance, let's look at that ultrasonic calculation from earlier: distance_millimeters = microseconds / 171; The two closest powers of 2 are 128 and 256. Neither is very close, and probably not "close enough." What to do? Let's go back to how we did it with floating point. Start by dividing 1/171. My calculator gives me 5.8479532 x 10^-3, or 0.0058479532. Not much good for integer arithmetic. But let's multiply it by a convenient power of 2, say 1024 (2^10). I get 5.988 and some more digits that won't matter. Round that off to the nearest integer, 6, which is pretty close. Now, since we have multiplied by 1024, we can use this calculation:
distance_millimeters = microseconds * 6 >> 10;
We multiply by 6. The ">>" operator in many languages, including C, C++, C#, and Java, means "shift right" by the number of bits that follows, in this case 10. We saw above that this is division by 2^10, or 1024. We multiplied by 1024 ahead of time to get the 6, then we divide by 1024 in the code, so the two 1024s cancel in the end. We are left with some nice integer arithmetic that is probably close enough. But what did we gain? We traded a division for a shift. Did that help? Yes. It turns out that pretty much any processor can perform a shift in one clock cycle per shifted bit; in this case, ten clock cycles. Sometimes, say on an 8 bit processor shifting values larger than 8 bits, it may take 2 or 4 clocks per shifted bit, but that is as slow as it gets. A ten bit shift of a sixteen bit int on an Arduino will take less than 20 clocks. Much better than the several dozen for a divide. Other processors, like most ARM chips, have special hardware called a "barrel shifter" that can shift any number of bits, up to 32, in a single clock! That is a huge improvement, so much so that the barrel shifter is a very common addition to processors, and most compilers will try to convert any division by a power of 2 into a shift.
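Here is that conversion wrapped up as a small sketch (the function name is just for illustration). One detail worth a comment: on an 8 or 16 bit processor an int is often only 16 bits, and an echo time of a few tens of thousands of microseconds times 6 would overflow it, so the multiply is widened to 32 bits first:

unsigned int echo_us_to_mm(unsigned int microseconds)
{
    /* multiply by 6, then shift right 10 places to divide by 1024;
       the cast to unsigned long keeps microseconds * 6 from overflowing a 16 bit int */
    return (unsigned int)(((unsigned long)microseconds * 6UL) >> 10);
}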

This morning I was writing some code to convert millimeters to inches. The "correct" way is to divide by 25.4. But since 1/25.4 is 0.03937, I multiplied that number by 1024 to get 40.31488, which I rounded to 40. Then I did the calculation with this:
inches = mm * 40 >> 10; // multiply by 40 and divide by 1024
which is equivalent to multiplying by 0.0390625, or dividing by 25.6. Close enough. That little calculation was what prompted me to write this tip.
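Wrapped up as a function, with the same widening trick so mm * 40 cannot overflow a 16 bit int, it might look like this:

unsigned int mm_to_inches(unsigned int mm)
{
    /* multiply by 40 = round(1024 / 25.4), then divide by 1024 with a shift */
    return (unsigned int)(((unsigned long)mm * 40UL) >> 10);
}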

Never Put off Until Runtime What You Can Do at Compile Time (or Earlier)

When we write a program in C, C++, Arduino's bastardized mix of the two, Java, or many other languages, it must be "compiled" before we can run it. Modern compilers are great and will do a lot to make our code more efficient. They can't read our minds, though, and sometimes we need to give them a little help, as we've seen above. I mentioned earlier that if you have an equation with all integer constants, the compiler will often perform that calculation at "compile time," when it is translating your code into machine language. That way, it doesn't have to perform the calculation every time your program runs, over and over. Compilers are more cautious about doing that with floating point numbers, because folding them can change rounding and introduce calculation errors. But often we can be reasonably sure those errors won't be a problem, so we can do some of those calculations ourselves ahead of time. Or we can use integers where appropriate, and simplify our equations so that the constants are together and available for the compiler to calculate ahead of time. Sometimes I see people write stuff like this:
distance = microseconds / 86 / 2.0;
If that were an all-integer calculation the compiler would probably do the 86 / 2 ahead of time and only have to do one division at runtime, when the program is actually running. With floating point it may not, and it can end up doing BOTH divisions. We can help out a lot by simply doing this:
distance = microseconds / 172.0;
Then the compiler only has to do one division. As we saw above, there are better ways for that particular calculation, but similar opportunities pop up a lot. If you absolutely must use the more complex calculations, try to make them as simple for the computer as you can.
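Putting these ideas together, here is a minimal sketch of doing the arithmetic yourself ahead of time (the names are just for illustration). In C, a constant initializer like this one must be worked out when the program is built, so nothing is left to compute at runtime except the multiply:

/* 86 * 2.0 folded by hand, then turned into a reciprocal */
static const float us_to_distance = 1.0f / (86.0f * 2.0f);

float echo_to_distance(unsigned long microseconds)
{
    /* one multiply at runtime instead of two divides */
    return microseconds * us_to_distance;
}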

The End

I have been at this game for a very long time. I learned to program on 2 MHz Z-80s and 1 MHz 6502s that took several clocks to perform the simplest instructions. They could usually only add or subtract 8 bit numbers; anything more complex took loads of code to accomplish. Often the programs were written in assembly language, and the few compilers available were very primitive and didn't do much to make your code better. Learning to program efficiently was simply a necessity. Now our development tools are much better, but we can still help the compiler out and make our programs run better, or accomplish more in any given amount of time.

I hope what I've written here is helpful. There are lots of little tricks one can do to make programs more efficient. I have only scratched the surface and addressed things I often see in code on the web. Thanks for reading and good luck.