Niklas Holsti wrote:
> David Brown wrote:
>> Niklas Holsti wrote:
>>> David Brown wrote:
>>>> Niklas Holsti wrote:
>>>>> David Brown wrote:
>>>>>> Niklas Holsti wrote:
>>>>>>> [ Quotations edited severely but hopefully without misattribution.]
>>>>>>>
>>>>>>>>>>> brOS wrote:
>>>>>>>>>>>>> On Mon, 23 Nov 2009 14:19:14 -0600
>>>>>>>>>>>>> "brOS" <bogdanrosandic@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does anybody know how to force the compiler to use a
>>>>>>>>>>>>>> CALL instruction instead of BR (branch) for a function
>>>>>>>>>>>>>> call? It is extremely important for me that a specific
>>>>>>>>>>>>>> function is called using CALL instead of the branch
>>>>>>>>>>>>>> that the compiler always emits.
>>>>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>> This is why I need it....
>>>>>>>>>>>> The function I'm calling looks something like this:
>>>>>>>>>>>> void Spin(void){
>>>>>>>>>>>> for(;;){}
>>>>>>>>>>>> }
>>>>>>>>>>>> So if it is compiled as a CALL, then before entering Spin
>>>>>>>>>>>> the PC will be saved on the stack, and it will point to
>>>>>>>>>>>> the instruction after the call to Spin.... So I want to
>>>>>>>>>>>> use that PC to save the context, so that when my scheduler
>>>>>>>>>>>> schedules that task again it will not continue spinning in
>>>>>>>>>>>> that forever loop but will jump to the next instruction
>>>>>>>>>>>> after the Spin call.....
>>>>>>>>>>>> A branch doesn't push the PC to the stack, so that's my
>>>>>>>>>>>> problem ;)
>
> [ snip ]
>
> Niklas Holsti wrote:
>>>>> Agreed. For most embedded compilers, though, anything in a separate
>>>>> compilation is considered "foreign". But as you say, there is no
>>>>> guarantee in general.
>
> David Brown wrote:
>>>> These days, full program optimisation is not uncommon. Even gcc
>>>> (despite its critics' opinions) can do reasonable full program
>>>> optimisation by compiling all the C modules in one shot.
>
> Niklas Holsti wrote:
>>> Sure, but in that case it would not be "separate compilation".
>
> And David Brown wrote:
>> Fair enough.
>>
>> What about gcc 4.5 with -flto ? Then you can compile C modules
>> separately into object files, but the object files hold a copy of the
>> internal trees as well as generated object code. When you link these
>> object files, the trees are used for link-time optimisation, including
>> inlining across modules.
>
> That is interesting info on gcc, and new to me (I don't follow gcc
> development that closely). Thanks, David, this is definitely something
> one needs to know about (future) gcc.
>
See <http://gcc.gnu.org/gcc-4.5/changes.html> and
<http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto-796>
(and also <http://gcc.gnu.org/> more generally).
It was news to me too that the LTO (link time optimisation) branch of
gcc had been merged with the main development line. I've known of its
existence for a long time, but for many years it has been (or appeared
to be) a bit of a blue-sky project with a lot of ideas and limited
working code.
It will still take a while before LTO rolls down to the gcc compilers
popular in c.a.e. gcc 4.5 is in stage 3 (no new features, only bug
fixes and testing) - it will be early next year before we can expect a
first release. It will take a while from then for CodeSourcery to qualify
and verify it thoroughly on their targets, and the 32-bit embedded gcc
suppliers will pick it up from there. Smaller ports, such as avr-gcc,
will take longer - they have fewer developers and resources. For
out-of-tree ports such as the msp430, it depends entirely on what the
developers want to prioritise.
At the moment (gcc 4.3), using -combine and -fwhole-program can get you
quite a lot of these effects for C programming. Basically, it treats
all the files in the program as a single big C file with everything
declared "static".
LTO will give several advantages on top of that. Files can be
individually compiled - useful for large projects, when files are in
different directories, or when you want different compiler options.
Libraries can also have LTO information. You can use languages other
than C (for example, C++), and mix them together. And gcc 4.5 has a
number of new optimisations that are only relevant for whole-program
compilation.
>> You lose all clarity in the definitions of "compile", "link", and
>> "separate compilation".
>
> Yes indeed, at least for the purpose of ensuring a standard calling
> sequence is used for a given function.
>
>> But that is a digression, especially since the msp430 gcc port is not
>> (yet) updated to gcc 4.5, which is itself not yet released.
>
> And the OP specifically asked about the IAR MSP430 compiler, anyway.
>
>>> Interesting question, though: Is there a standard way in a C
>>> environment to ensure that the standard calling sequence is used for
>>> an extern function, with no C-calling-C optimizations?
>>>
>>
>> I think the only way is by being sure that the compiler can't access
>> the code for a function declared as "extern". It should not be hard
>> to do, but you may have to do it explicitly. For example, if you use
>> a compiler's IDE and project manager, you might have to go out of your
>> way to force true separate compilation.
>
> I was afraid that would be the answer, as far as it goes. Yuck.
>
It is like a lot of things in C development - finding a general,
portable, standards-compliant solution is hard, even though making it
work in a real life project is typically very easy. The trick is to
find a balance for a solution that is general enough without being
overly complicated.
>>>>> I'm not so ready to call this "bad design" without knowing more
>>>>> about the OP's requirements and design. The code generated for Spin
>>>>> is exactly the kind of tight eternal loop that you often find in a
>>>>> kernel where the kernel has no ready tasks and waits for an
>>>>> interrupt. I haven't tried it, but it seems to me that writing this
>>>>> loop as a conditional, flag-checking one could increase (by a
>>>>> little) the latency for resuming the right task when an interrupt
>>>>> happens, compared to resuming the task directly from the interrupt
>>>>> handler and simply abandoning the tight loop.
>>>>>
>>>>
>>>> Nah, the loop overhead to continually read a flag would be a few
>>>> cycles at most. The interrupt function overhead to figure out
>>>> return addresses from the stack will be much, much worse.
>
> There are several things under discussion here:
>
> - Whether it makes sense to use a routine Spin, containing a loop
> (whether conditional or unconditional) as the last thing a thread should
> call in its time-slice, such that threads are always suspended and
> resumed only at this point, that is, within the call of Spin.
>
> David, I think you have more or less agreed that this is a workable
> design for a non-preemptive (in my definition) time-sliced system that
> does not schedule other threads to use the slack left over in one
> thread's time-slice. I won't say more to defend it at this point.
>
OK.
> - How difficult or time-consuming is it for an interrupt handler that
> interrupts the loop in Spin to find the return address of that call of
> Spin?
>
> Assuming that
>
> o the thread calls Spin using the normal calling sequence, in
> which the return address is left on top of the stack,
>
You can't assume that (otherwise the OP would never have asked the
question in the first place...), although we can probably assume this
can be forced in some way.
> o the code in Spin does not push more data on the stack, and
>
It may make sense for the end-of-time-slice function to do more than
just spin. Then it may have to make a separate call to Spin.
> o the handler is written in MSP430 assembly language,
>
That's a big requirement, and totally unnecessary except as a way to
implement this bad idea.
> then this is just one POP.W instruction, executed after the interrupt
> handler has popped the saved status register and saved interrupt-point
> PC from the stack. (I'm not very familiar with the MSP430 instruction
> set and its interrupt handling, so this may be a bit optimistic. But the
> MSP430 instruction set is claimed to be strong on stack accesses, so it
> should not be much harder.)
>
That is mostly true (you can't just pop the top of the stack, because
the interrupt function must first preserve a register or two - but as
you say the msp430 has good stack access instructions), given your
assumptions. But as I noted above, the assumptions are not reasonable,
IMHO.
> Thus, getting the return address of Spin (under the above assumptions)
> is quick and well-defined.
>
> In fact, the tick interrupt handler could do it smartly as follows:
>
> 1) Pop the saved status register.
> 2) Pop the saved interrupt-point PC, check that it points to the loop in
> Spin, and then discard it.
> 3) Push back the saved status register.
>
Once you have taken into account saving a working register or two (easy
enough), then that's a fairly elegant implementation of a very ugly hack.
> This makes the two top words on the stack be the resumption PC (the
> return address for the Spin call) and the saved status register, exactly
> the state needed for a future RETI to resume this thread. It is not even
> necessary to get and manipulate the return address for the Spin call.
> (I'm assuming that each thread has its own stack area.)
>
> - Whether the Spin routine can or should be written in C.
>
> If the C compiler generates code for Spin and for the calls to Spin that
> satisfies the above assumptions, it can be written in C. But David is
> right to say that it is hard to be sure that the assumptions do hold,
> and will continue to hold, if Spin is written in C. So let's assume that
> we write Spin in assembly language, which lets us be sure that the
> assumptions hold.
>
> - Whether the thread-resumption latency can be shorter if the Spin loop
> is unconditional, and the return address of Spin is saved and used as
> the resumption point (case A), compared to the latency when the Spin
> loop polls a flag, the address in the interrupted loop is saved and used
> as the resumption point, and the interrupt handler sets the flag to make
> the loop terminate (case B).
>
> I comment on that below.
>
>>> Let's consider what the kernel has to do, in my guess of the OP's
>>> design, considering the two cases of (A) an unconditional "eternal"
>>> loop and (B) a flag-checking loop.
>>>
>>> The kernel knows which thread is running.
>>>
>>> When the thread finishes its job in this time-slice, it calls Spin,
>>> expecting to be resumed at the next instruction after the Spin call,
>>> say instruction R.
>>>
>>> The Spin function loops, eating up the rest of the time-slice.
>>>
>>> The tick interrupt comes in.
>>>
>>> The tick interrupt handler saves the context of the interrupted
>>> thread. By comparing its PC to the address of the Spin loop, it can
>>> check that the thread has not overrun its time-slice. At this point:
>>>
>>
>> Note that a sensible Spin function would tell the kernel that it is
>> finished and entering the spin loop,
>
> Why would the kernel need that information, if it is not going to
> schedule another thread for the rest of this time-slice?
>
If this system is for scheduling important or time-critical tasks, and
there is no prioritised pre-emption, it is very likely that you need to
track when the task's work is done for testing and verification, or for
tracking errors.
But you are correct that the kernel might not /need/ that information.
>> rather than leaving the interrupt handler to figure it out in this
>> fragile way.
>
> I don't see much fragility in it. It is beautifully simple: if the
> thread finished what it had to do, it is in Spin; otherwise not. (I hope
> I am lauding the OP here, not my own guess about the design.)
>
Working around the compiler in this way /is/ fragile. It is a hack: it
depends on details of the compiler, the processor, and the stack
layout; it requires assembly for what should be simple C code; and it
hinders the compiler's optimiser.
However, I have to agree that you have come up with a simple
implementation of this design (and I am lauding /your/ implementation
here, not the OP's bad design, or our guesses about it).
> And perhaps the OP's kernel actually has an "I_am_done" kernel call,
> which just ends up in Spin.
>
>>> - For (A) the handler gets the return address of Spin
>>> by a POP and stores this return address as the resumption
>>> point for the thread to be suspended.
>>>
>>
>> Assuming, of course, that you've figured out a way to do that safely
>> and reliably....
>
> See above: a POP instruction. It is safe and reliable, if Spin is
> written in assembly language.
>
>>> In case (B), the thread is resumed in the middle of the flag-checking
>>> loop. It still has to read the flag, branch out of the loop, and
>>> execute a return instruction (effectively a POP from the stack),
>>> before instruction R is reached.
>>>
>>
>> Yes, you can expect it to take about 3 or 4 instructions before
>> getting to R. That would still be a lot less time than you spend
>> messing around getting the address of R in case (A), so case (B) wins
>> here in time.
>
> Getting the address of R takes one POP in case (A). Probably faster than
> these 3-4 instructions, at least not much slower.
>
Don't forget the comparisons to check that you are in the spin loop.
>> Remember that all ideas about how case (A) could feasibly be
>> implemented are based on hobbling the compiler.
>
> No, it is based only on putting the right code in Spin, and ensuring
> that Spin is called with the standard calling sequence that leaves a
> return address on top of stack. This is readily and normally done by
> writing Spin in assembly language. The compiler is not hobbled in the C
> code parts.
>
A function as simple as the case (B) Spin would often be inlined into the
calling function (either explicitly, or via whole-program optimisation).
Code like that is smaller as well as faster when inlined.
>> Write the code correctly (case B), and you can let the optimiser do
>> its job - that will overwhelm any conceivable time advantage case A
>> might have had.
>
> It can hardly overwhelm it if Spin contains just the loop -- not much to
> optimise there.
The code calling Spin can be better optimised if Spin is a proper C
function, and the compiler knows its definition.
> Or do you mean to write the tick interrupt handler in C?
> I know that some C compilers claim that you can use them to write
> interrupt handlers, but to me this seems more fragile than writing Spin
> in C. Especially for an interrupt handler that is meant to switch
> threads, not just manage some peripheral device and return to the
> interrupted thread.
>
The majority of interrupt handlers in embedded systems are written in C
these days. For general interrupts, if you are not happy to trust your
compiler to generate good and safe interrupt code, get a better
compiler! But thread switching code will almost certainly need some
inline assembly at least.
> However, the OP said that Spin in the OP's kernel contains some other
> things, too, so it's hard to say what the optimiser could do.
>
>> Among other things, Spin() could be inlined in its calling function
>> and remove most of the overhead.
>
> What "overhead" are you talking about, David?
I am referring to the check of the exit flag that you think makes my
Spin function too slow compared to your version.
> If Spin's profile is as
> simple as the OP showed (void Spin (void)), inlining would directly save
> only one call or branch instruction per time-slice. Perhaps the
> optimiser could let the thread keep more local data in registers over
> the (in-lined) Spin call, avoiding some store/load instructions. The IAR
> MSP430 compiler defines R12-R15 as scratch registers (caller-save) and
> R4-R11 as preserved registers (callee-save), so an inlined function
> could increase the available registers from 8 to 12; hard to say if that
> would be significant.
>
It is, as you say, hard to be sure - especially if Spin contains other code.
> If Spin is inlined we lose the ability to check for time-slice overruns
> by checking that the interrupted thread is in the unique and only Spin,
Yes, but you don't /need/ to check call or return addresses if you write
proper C code...
> so I think that Spin should not be inlined. Perhaps you consider this to
> be one of the "fragile" aspects of this overrun-checking method, but it
> is not difficult to make sure that Spin is not inlined.
>
>> Even making the great leap of faith that there is a reliable way to
>> get the desired return address,
>
> Wow, this makes me feel like a preacher. But "raise your eyes to the
> text above", and believe! :-)
>
I think we'll just have to agree to disagree on this one.
> There can hardly be a more reliable aspect of a standard calling
> convention than the presence and location of the return address.
>
>> and then making a second leap of faith that
>> case A is faster, the concept is /still/ wrong.
>
> That sounds rather dogmatic. If it works and is reliable, why is it
> "wrong"? Too heretical?
>
Yes, it's against my religion :-) You don't write hacked code based on
lying to the compiler, assembly code, and stack manipulation tricks when
there are perfectly safe, efficient and reliable ways to do the same job
with C. It's about writing legible, maintainable, portable code that is
clear in its purpose and easy to verify. Even if this is nothing more
than a simple test program, you should maintain a certain level of
development quality.
I am not saying that /all/ such hacks are a bad thing - just that you
have to have very good reason for using them. Shaving off a few
processor cycles (if you are correct and you /do/ save time) is very
seldom a good enough reason.
Just because code is accepted by the compiler, and works in practice,
does not mean it cannot be *wrong*. And yes, I know I am pontificating
- doesn't that beat preaching?
>> There is no way that shaving a few cycles off the latency
>> could justify using this horrible hack.
>
> Although I still think that case A is a bit faster (and feel I have
> given good reasons above and in my preceding posting) it isn't the main
> point in favour of case A, the unconditional loop. Given this design of
> a Spin function in which the threads are suspended and resumed, a
> flag-polling loop is logically unnecessary: after the tick interrupt,
> the thread that gets to poll the flag is *the* scheduled thread, so
> polling the flag is superfluous.
>
> I think the design is a neat solution to specific, limited requirements.
> It is a bit tricky, but interrupt-handling and thread-switching are
> often tricky. I mean "tricky" in the sense of "a trick", not in the
> sense of "difficult".
>