Hi there – I have a conundrum rattling around my head.
I once read that an MCU is most efficient when it's working with its natural word width. On one level this makes sense. If I have an 8-bit MCU, then performing an add on two 8-bit words takes 1 clock, adding two 16-bit words takes 2 clocks, adding two 32-bit words takes 4 clocks, etc.
Also, using a 16-bit or 32-bit value where an 8-bit one would do requires more memory – not much of a problem with one variable – a bigger problem if you have a large array of the littler scamps.
Now consider a 16-bit CPU. Obviously it will be more efficient when adding two 16-bit words together because it can do so in 1 clock cycle. Similarly, a 32-bit CPU can add two 32-bit words in 1 clock cycle. But here's the question: will the 16-bit and 32-bit CPUs be LESS EFFICIENT when working with 8-bit variables, for example, or will those 8-bit values just get promoted to 16 bits or 32 bits at execution time?
Suppose I have a loop that cycles from 0 to 100. On an 8-bit machine it makes sense to say:
for (int8_t i = 0; i < 100; i++)
How about on a 32-bit machine? Would it make more sense to use:
for (int32_t i = 0; i < 100; i++)
Or is it better to use the int8_t just because this better reflects the size of the values we’re working with?
There is more to consider than simply the width of the internal architecture of the CPU. The first thing to look at is the width of the data bus. If this is only 16 bits wide (which is often the case), then a 32-bit calculation will require at least 4 clock cycles just to get both operands into the ALU. Also, Princeton or Harvard? And so on.
Another consideration is the compiler, which may not be optimized for a particular type of operation. If you don't mind delving into some inline assembly, there are extremely efficient procedures tailored for many architectures, but each performs only a single function - 8x8, 8x16, 16x16, etc. - so you have to know the size of the variables and the size of the result. Most compilers will reserve 16 bits for an 8x8 operation that only requires an 8-bit answer. The list goes on. As a quick summary, I would tentatively suggest that the fastest operations are ones that remain inside the width of the data bus, as transferring data in and out of memory is still a big time bandit.
Howdy. You've got two things clouding a quality answer: the chip family (CPU core) and which compiler. When writing assembler, you'll know what's best or necessary - no compiler conventions or assumptions apply. I think you'll find no consistent pattern other than native size being what it's best at, as you said. I suspect some cores are better at alternate sizes than others.
I'm just learning the STM8 series, and while it's limited in some respects, having two 16-bit registers with some math operations is way handy. I've written for the NXP 908/9S08 since the late '90s, and anything over 8 bits wasn't pretty.
Curious on other opinions. G.H. <<<)))
How about using unsigned int and (signed) int, unless you have a specific need to use a particular size? That's precisely their purpose, as I recall.
E.g., in the loop examples, use unsigned int in both cases, and let the (optimizing) compiler figure it out for you.
The problem here is that an int is a 16-bit quantity on an Arduino Uno but a 32-bit quantity on an Arduino Due. Suppose I wrote the program using int on the Due and actually expected to see values outside the range -32,768 to +32,767, and then ported the code to an Uno with its 16-bit ints -- things would go pear-shaped quickly?
I think this is what the C99 "fast" integers are for. Use uint_fast8_t when 8 bits of dynamic range are sufficient. Use uint_fast16_t if you need at least 16 bits, and uint_fast32_t for 32 bits.
Are they "fast" because the compiler stores them in 32-bit words even though they are 8-bit and 16-bit values?
The answer to your question and more is nicely explained by Nigel Jones in his blog on Embedded Gurus
This is awesome -- thanks so much for sharing it with me -- Max
So what is the range then: 0 to 100 or -32768 to 32767?
In many cases, the size is determined by something outside the processor, so there you should use the appropriate size-specific type.
E.g., if the I/O register map provided by the Uno uses 8-bit registers, but that used by the Due uses 32-bit registers, then in the former case you must use uint8_t to represent the I/O registers, but in the latter case uint32_t.
By contrast, if the 0 to 100 is the result of a data-driven design where the data that drives the design is stored in an array of structs, and you need to step through the array elements in order to handle (for example) inputs from the user, then I would use unsigned int or (better yet) size_t to represent the array index (and I would use sizeof to determine the number of elements in the array).
The C99 "fast" integer approach is a good suggestion (in fact, all of the other replies I see right now strike me as very helpful), but even with that approach, my question remains: what determines the range, such that you can choose the appropriate (smallest) C99 "fast" integer?
There is nothing to be gained in speed or power by using an int8 on a 16-bit or 32-bit processor. That processing width has already been paid for, so you might as well use it. Depending on optimization, the compiler may pack multiple unrelated int8s into a single 16- or 32-bit-wide memory location to save space, but it would require compiler or processor smarts to increment an int8 that is stored in the middle of a 32-bit register.
The CCS C compiler for PICs is unusual in that it has an int1 that often takes zero RAM space, as the compiler will pack several of them into the unused register bits of the particular PIC being used. But it is an old compiler born in the days when PICs might have as little as 16 bytes of RAM, so getting a few free booleans was a significant bonus. On an Arduino, one boolean takes a byte of RAM and two booleans take 2 bytes.
And I assume you know that your little loop
for (int8_t i = 0; i < 100; i++)
is quite inefficient. If you use instead
for (int8_t i = 100; i > 0; i--)
Then the processor does not have to do a subtraction to test i against 100 on each iteration. The auto-decrement will also set the zero flag for free, so no explicit test is needed. You just have to use your i values backwards.
Depends on the chip. First, the bus width is one thing, but the internal architecture is another. Back in the old days, the 8088 was pretty much an 8086 with an 8-bit bus. Thus, as long as you worked with registers only, 16 bits was as fast as 8 bits. But if you had to go out to memory, fetching a 16-bit word cost you an extra clock cycle. Same deal with the Motorola 68008.
Cortex-M from ARM is a 32-bit architecture. Interestingly, 32-bit and 8-bit memory operations cost you the same, but 16-bit half-words can cause all sorts of performance issues. Furthermore, the ARM cannot do 16-bit and 8-bit operations on registers; everything is 32-bit internally. This means that operations on 8-bit variables, even ones that happen to be in registers, can incur performance penalties, because the compiler needs to generate instructions that mask the result to 8 bits.
So you really need to know the chip you are working with to figure out what size for a particular variable under the particular circumstances is "optimal".
Donald Knuth, the author of The Art of Computer Programming, said:
“People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like.
Otherwise the programs they write will be pretty weird.”
Although the astonishing evolution of hardware and computer architectures has made this statement somewhat of an exaggeration, as we can see, the underlying hardware still has to be considered.
In general, my thinking is more in line with @DKWatson's. The main time bandit is memory bandwidth, and chip manufacturers are making big efforts to deal with it.
I remember that on the LPC line of NXP ARM-based processors, despite being a 32-bit architecture, memory reads fetched 128 bits at a time in order to reduce the performance hit from limited memory bandwidth.
And, of course, there are other issues, like @DKWatson said.
There is also a 'money' consideration. If your project budget can accommodate a processor fast enough, and with sufficient memory, that these issues are irrelevant, then apply one of Murphy's laws: "Don't force it, use a bigger hammer."
But, if you are constrained by a tight budget, perhaps it's better to stick with Knuth's statement.
P.S.: Isn't there a typo in your loop statements?
OMG There were THREE typos in each one (there aren't any now) -- for other readers, the way it was when I just went to look was:
for (int8_t i = 0, i < 100, I ++)
-- I'd used ',' separators instead of ';'
-- The editor had changed 'i' to 'I' at the end
-- There was a space before the '++'
Of course it should have been as follows:
for (int8_t i = 0; i < 100; i++)
I'm not sure if the space before the '++' is a problem (I'll have to check that).
I can't believe no one else spotted this (or maybe they were just too kind to say anything LOL)
... Or maybe it's not the point of the question and it's not worthwhile to lose time with it. :-)
I think in general you should stick to signed or unsigned int, and let the compiler figure it out. If you get to the end of the project and need a few more bytes of RAM or more CPU cycles, then try to optimize it.
There's another Knuth quote that "premature optimization is the root of all evil".
I agree with what someone said below; that is, if the size matters, e.g. for an I/O port or a register in a chip, then explicitly define it. Otherwise, let the compiler do with it as it will.
Hi Jmford94 -- I think you offer sage advice -- I'm so glad I posted this question because the various responses have helped shape the way I go forward.