IAR ARM Cortex-M compiler does not align stack on 8-byte boundary

Started by StateMachineCOM September 18, 2022
ARM ABI says that the stack should be 8-byte aligned, but I see cases where the stack is aligned only to 4-byte boundary.

For example, I have the following simple busy-delay function:

<pre>
void delay(int iter) {
   int volatile counter = 0;
   while (counter < iter) { // delay loop
       ++counter;
   }
}
</pre>

This compiles with IAR EWARM 9.10.2 on ARM Cortex-M to the following disassembly:

<pre>
SUB SP, SP, #0x4
...
ADD SP, SP, #0x4
BX LR
</pre>

The problem is that after SUB SP,SP,4 the stack is misaligned (is aligned only to 4-byte boundary).

Why is this happening? Is this compliant with the ARM ABI? Are there any compiler options to control that?
On 9/18/22 4:26 PM, StateMachineCOM wrote:
> ARM ABI says that the stack should be 8-byte aligned, but I see cases where the stack is aligned only to 4-byte boundary.
>
> For example, I have the following simple busy-delay function:
>
> <pre>
> void delay(int iter) {
>    int volatile counter = 0;
>    while (counter < iter) { // delay loop
>        ++counter;
>    }
> }
> </pre>
>
> This compiles with IAR EWARM 9.10.2 on ARM Cortex-M to the following disassembly:
>
> <pre>
> SUB SP, SP, #0x4
> ...
> ADD SP, SP, #0x4
> BX LR
> </pre>
>
> The problem is that after SUB SP,SP,4 the stack is misaligned (is aligned only to 4-byte boundary).
>
> Why is this happening? Is this compliant with the ARM ABI? Are there any compiler options to control that?

I think that as long as the function doesn't call another function, it doesn't need to respect that ABI, since it knows it isn't going to do the operations that need the 8-byte alignment. If it isn't *I*nterfacing with anything, the ABI doesn't apply.
Yes, the simple delay() function does not call anything. But still, interrupts can preempt it, which is quite likely because a function like this runs for a long time by design (and consumes a significant percentage of the CPU time).

In fact, I've checked it, and an interrupt preempting delay() must re-align the stack by using the "stack aligner". So the simple (no FPU) Cortex-M exception stack frame of 8 registers (32 bytes) grows to 9 words (36 bytes), the extra word being padding. Please note that the Cortex-M CPU deals with it just fine and the program runs. But an RTOS or other assembly code that deals with interrupts could break the system if it makes assumptions about the stack alignment. I thought that compatibility with interrupts was the primary reason why the ARM ABI stipulates 8-byte stack alignment.
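
To make the 32-versus-36-byte arithmetic concrete, here is a small host-side C model of exception entry without FPU context. It is only a sketch of the ARMv7-M behavior as I understand it, assuming the STKALIGN bit in the Configuration and Control Register is set; the names `exc_entry_t` and `exc_entry` are made up for illustration and are not part of any vendor header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BASIC_FRAME_BYTES 32u /* R0-R3, R12, LR, PC, xPSR */

typedef struct {
    uint32_t frame_sp;   /* SP value after the hardware has stacked the frame */
    uint32_t bytes_used; /* stack consumed by exception entry */
    bool     padded;     /* true if a padding word was inserted; the hardware
                            records this in bit 9 of the stacked xPSR */
} exc_entry_t;

/* Model of exception entry: sp is the SP of the preempted code,
   stkalign models the CCR.STKALIGN setting (assumed, see above). */
static exc_entry_t exc_entry(uint32_t sp, bool stkalign)
{
    exc_entry_t e;

    assert((sp & 3u) == 0u); /* SP is always at least word aligned */

    e.padded   = stkalign && ((sp & 4u) != 0u);
    e.frame_sp = sp - BASIC_FRAME_BYTES;
    if (e.padded) {
        e.frame_sp &= ~4u;   /* force 8-byte alignment of the frame */
    }
    e.bytes_used = sp - e.frame_sp;
    return e;
}
```

With an 8-byte-aligned SP such as 0x20001000 the model consumes 32 bytes. With SP at 0x20001004, as it would be inside delay() after SUB SP, SP, #0x4, it consumes 36 bytes and the frame lands at 0x20000FE0. Handler code that locates the frame through the current SP is unaffected; only code that assumes a fixed distance back to the preempted SP has to look at the padding flag.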

Also, I've just checked ARM/KEIL Compiler 6 (based on LLVM), and that compiler generated 8-byte aligned code for delay():

<pre>
SUB SP, SP, #0x8
...
ADD SP, SP, #0x8
BX LR
</pre>

Now, I don't have the time to investigate all compilers and various optimization levels. I thought that standards, like the ARM ABI, are supposed to settle things like that. I'm just a bit perplexed and couldn't find much information about that.
(Please get a real newsreader and a real newsserver, rather than using 
the google groups crapware.  Google groups is fine for searching old 
posts, but makes a mess of posts - it ruins line endings, code 
formatting, attributions, and generally breaks every Usenet posting 
convention it can.  If you /must/ use google groups, please make the 
effort to get attributions right and to quote appropriate parts of the 
earlier posts.  And if you are including code snippets, fix the line 
endings of your post.  news.eternal-september.org is a free newsserver, 
and Thunderbird is one of many free newsreaders.)

On 19/09/2022 02:10, StateMachineCOM wrote:
> Yes, the simple delay() function does not call anything. But still, interrupts can preempt it, which is quite likely because a function like this runs for a long time by design (and consumes a significant percentage of the CPU time).
>
> In fact, I've checked it, and an interrupt preempting delay() must re-align the stack by using the "stack aligner". So the simple (no FPU) Cortex-M exception stack frame of 8 registers (32 bytes) becomes the bigger stack frame of 9 registers (36 bytes). Please note that the Cortex-M CPU deals with it just fine and the program runs. But in the case of RTOS or some other assembly code dealing with interrupts could break the system by making assumptions about the stack alignment. I thought that the compatibility with interrupts is the primary reason why the ARM ABI stipulates 8-byte stack alignment.

The hardware has to be able to cope with interrupts occurring while stacks are not 8-byte aligned. It's possible that it is marginally slower or results in a bigger stack frame, but it has to work.

The key reason for stack alignment is efficiency. It makes a bigger difference when you have caches and big internal buses, and an even bigger difference when this is combined with multiple cores. It's also possible that some vector and SIMD units require higher alignments.

For embedded Cortex-M devices, it would not have made much difference (I believe the old EABI required 4-byte alignment), but requiring 8-byte alignment is a very minor cost that makes future compatibility much simpler. Getting it right early on avoids the kind of dog's dinner you see in the x86 world where the 64-bit Windows stack alignment is too small for the needs of SIMD instructions.

> Also, I've just checked ARM/KEIL Compiler 6 (based on LLVM), and that compiler generated 8-byte aligned code for delay():
>
> <pre>
> SUB SP, SP, #0x8
> ...
> ADD SP, SP, #0x8
> BX LR
> </pre>
>
> Now, I don't have the time to investigate all compilers and various optimization levels. I thought that standards, like the ARM ABI, are supposed to settle things like that. I'm just a bit perplexed and couldn't find much information about that.

A leaf function can be fine with 4-byte stack alignment. A quick test shows gcc aligns on 8 bytes, while clang aligns at 4 bytes for a leaf function.

An extremely useful tool for investigating this kind of thing is the online compiler at <https://godbolt.org>. It does not include many commercial compilers (though it has MSVC), but supports C, C++, and lots of languages on a very wide range of compilers and targets. Here you can see your code compiled for gcc and clang Cortex-M4: <https://godbolt.org/z/cc6bf6oGe>
Hi David,
Thanks for your help.

> Please get a real newsreader and a real newsserver...
I'd like to do this, but I use this newsgroup so infrequently that I don't want to buy and install anything special. Is there some online tool you'd recommend?
> An extremely useful tool for investigating this kind of thing is the online compiler
Yes, thank you. It does indeed seem like a useful tool for a quick look at the generated assembly.

But regarding the stack alignment requirements, the "ARM Procedure Call Standard for the ARM Architecture" (ARM IHI 0042E) says in Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The stack must at all times be aligned at word boundary". Then Section 5.2.1.2 "Stack constraints at a public interface" strengthens the requirement to: "SP mod 8 = 0. The stack must be double-word aligned".

So the question now is: what do they mean by "public interface"?
On 19/09/2022 18:09, StateMachineCOM wrote:
> Hi David,
> Thanks for your help.
>
>> Please get a real newsreader and a real newsserver...
>
> I'd like to do this, but I use this newsgroup so infrequently that I don't want to buy and install anything special. Is there some online tool you'd recommend?

Thunderbird is free, as are any of a dozen different newsreaders, depending on preferences and OS. Many other email programs also support Usenet.

There are several free Usenet servers, at least for non-binary groups like those in comp.*; news.eternal-september.org is a popular one. Your ISP might also provide the service, as it used to be a standard part of any internet access package.

I don't know of any free online interfaces other than google groups, which is barely worth the price (although as always with google, it's good for searching). There are several paid-for services, mostly targeting binary groups (which used to be a popular way to spread pirated software and media, before bittorrent).

Technical groups are all text posts, and most have relatively few posts. Even if you start your newsreader once a month, it will take no more than a few seconds to download all posts in comp.arch.embedded to bring it up to date.

>> An extremely useful tool for investigating this kind of thing is the online compiler
>
> Yes, thank you. It seems indeed as a useful tool for a quick look at the generated assembly.

I use it all the time, for looking at code on different targets, comparing different options, checking complicated syntax (such as testing C++ features in the latest standards, newer than the compilers I have online), comparing the output of different compilers, sharing code with others via links, checking if the code I write gives exactly the assembly I want, amongst other things.

> But regarding the stack alignment requirements, The "ARM Procedure Call Standard for the ARM Architecture" (ARM IHI 0042E) says in Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The stack must at all times be aligned at word boundary". Later in the next Section 5.2.1.2 "Stack constraints at a public interface" it strengthens the requirements to: "SP mod 8 = 0. The stack must be double-word aligned".
>
> So the question now is: what do they mean by "public interface"?

I guess that means when calling code, or being called from code, that is independently compiled. When it is within the same compiled code, you don't have to follow the standard ABI at all - you (meaning "the compiler") can make your own rules regarding parameter passing, volatile / non-volatile registers, etc.
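
The two constraints quoted from the AAPCS can be written down as predicates, which makes it easy to see where the IAR output for delay() stands. This is only an illustrative host-side sketch; the function names are invented here and the SP values are arbitrary example addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AAPCS 5.2.1.1 "Universal stack constraints": SP mod 4 = 0 at all times. */
static bool sp_ok_universal(uint32_t sp) { return (sp % 4u) == 0u; }

/* AAPCS 5.2.1.2 "Stack constraints at a public interface": SP mod 8 = 0. */
static bool sp_ok_public(uint32_t sp) { return (sp % 8u) == 0u; }

/* Model of `SUB SP, SP, #bytes` on a full-descending stack. */
static uint32_t sp_after_sub(uint32_t sp, uint32_t bytes)
{
    assert(bytes % 4u == 0u); /* only whole words are allocated */
    return sp - bytes;
}
```

Starting from an 8-byte-aligned entry SP such as 0x20001000, `sp_after_sub(sp, 4u)` still satisfies `sp_ok_universal` but not `sp_ok_public`. That is acceptable under the reading above as long as the function does not reach a public interface (that is, call independently compiled code) while SP is in that state.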
On 9/19/22 2:16 PM, David Brown wrote:
> On 19/09/2022 18:09, StateMachineCOM wrote:
>> But regarding the stack alignment requirements, The "ARM Procedure Call Standard for the ARM Architecture" (ARM IHI 0042E) says in Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The stack must at all times be aligned at word boundary". Later in the next Section 5.2.1.2 "Stack constraints at a public interface" it strengthens the requirements to: "SP mod 8 = 0. The stack must be double-word aligned".
>>
>> So the question now is: what do they mean by "public interface"?
>
> I guess that means when calling code, or being called from code, that is independently compiled. When it is within the same compiled code, you don't have to follow the standard ABI at all - you (meaning "the compiler") can make your own rules regarding parameter passing, volatile / non-volatile registers, etc.

Yes, the standard ABI defines what functions are allowed to presume when they are called by "unknown" code. That is what applies at a "public interface": being public, anyone can call it. Since routines are allowed to assume they are entered with a stack pointer aligned to a multiple of 8, the caller needs to ensure that (at least if the stack pointer was properly aligned when the caller itself was entered through a public interface).

The purpose of this is that some common instructions require their source/destination to be so aligned, and it is a bit awkward to write a subroutine that might be called with a misaligned stack pointer and has to align it itself (it typically costs a register to hold the old SP), so the ABI requires the stack to be so aligned.

If a piece of code doesn't call any outside routines, then this isn't a problem, so the ABI doesn't restrict the stack pointer at those times. This is important, as it isn't uncommon to want to temporarily push a single word onto the stack for a bit, and if the stack pointer needed to be kept at an alignment of 8, that operation would need to use up extra stack memory.
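
The stack cost of keeping SP 8-byte aligned everywhere can be sketched with a simple round-up. The helper name `frame_bytes` is hypothetical, and real compilers decide frame sizes from more than just the size of the locals, so treat this as arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/* Round the space needed for locals up to the alignment that SP
   must keep inside the function. align must be a power of two. */
static uint32_t frame_bytes(uint32_t locals_bytes, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (locals_bytes + align - 1u) & ~(align - 1u);
}
```

For the single `int volatile counter` in delay(), `frame_bytes(4u, 4u)` gives 4, matching the `SUB SP, SP, #0x4` seen from IAR, while `frame_bytes(4u, 8u)` gives 8, matching the `SUB SP, SP, #0x8` seen from ARM/KEIL Compiler 6.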