EmbeddedRelated.com
Forums
The 2024 Embedded Online Conference

Low Dhrystone score on Tricore TC1796

Started by Unknown July 15, 2007
Hello,

I am doing the Dhrystone benchmark on a TC1796A microcontroller from
Infineon with different compilers (TASKING, Hightec-RT GNU GCC). Both
compilers are set to produce highest speed optimized code, still I
only get about 15 DMIPs with GNU GCC and about 20 DMIPs with TASKING.
CPU is running at 150MHz, 75MHz system clock, programs run from
external SRAM; no difference (as it should be) with or without
connected Lauterbach Debugger.
TASKING probably performs better because of its optimized libc. Still,
the difference is big, and both results seem very low in DMIPs per MHz
(about 0.1) compared to other microcontrollers, e.g. Freescale's
MPC555, where DMIPs per MHz is over 1.

In fact, my readings are so low that I think I have either badly
misconfigured something or my board design is broken.

Has anyone done similar tests or has any comments?

Thanks,
   Bernhard

In comp.arch.embedded,
bfroemel@gmail.com <bfroemel@gmail.com> wrote:
> Hello,
>
> I am doing the Dhrystone benchmark on a TC1796A microcontroller from
> Infineon with different compilers (TASKING, Hightec-RT GNU GCC). Both
> compilers are set to produce highest speed optimized code, still I
> only get about 15 DMIPs with GNU GCC and about 20 DMIPs with TASKING.
> CPU is running at 150MHz, 75MHz system clock, programs run from
> external SRAM; no difference (as it should be) with or without
> connected Lauterbach Debugger.
> TASKING performs probably better because of the optimized libc, still:
> the difference is big and it seems overall very low compared in DMIPs
> per MHz (0.1 DMIPs per MHz) to other microcontrollers: e.g. Freescales
> MPC555 where DMIPs per MHz are at least over 1.
>
> In fact my readings are so low that I think I either misconfigured
> something profoundly wrong or my board design is broken.
>

How fast is this external SRAM and how have you set up your access to it?
Does this CPU have caches (I don't know this chip) and are they enabled?
If so, does your loop fit in the cache?

--
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

> How fast is this external SRAM and how have you set-up your access to it?

It's a CY7C1041BV33, a 12 ns asynchronous 16-bit SRAM. Two chips form
1 MByte of memory on the 32-bit bus running at 75 MHz.
( http://www.chipcatalog.com/Cypress/CY7C1041BV33.htm )
I played around with wait states and other access parameters, but it
seems they were already well chosen by the manufacturer (TTTech.com).

> Does this cpu have caches (I don't know this chip) and are they enabled?

There is an instruction cache of 16 KByte, no data cache.

> If so, does your loop fit in the cache?

I checked the loops - they should fit, but the difference with or
without the instruction cache is negligible (maybe the impact is
greater when running from flash).

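As a rough way to put numbers on the external bus, the helpers below estimate its peak fetch rate. This is only a back-of-the-envelope sketch: the function names are made up for illustration, the number of bus cycles per access is a hypothetical parameter (the actual wait-state settings are not given in the thread), and instruction cache, bursts and data traffic are ignored.

```c
/* Peak number of bus-width words the external bus can deliver per CPU
 * clock. bus_cycles_per_access is an assumed value: 1 would mean no
 * wait states at all, larger values model wait states. */
static double words_per_cpu_clock(double bus_hz, double cpu_hz,
                                  unsigned bus_cycles_per_access)
{
    return (bus_hz / (double)bus_cycles_per_access) / cpu_hz;
}

/* Peak bandwidth in MByte/s (10^6 bytes) for a bus of the given width. */
static double peak_mbytes_per_sec(double bus_hz, unsigned bus_width_bytes,
                                  unsigned bus_cycles_per_access)
{
    return bus_hz * (double)bus_width_bytes
           / (double)bus_cycles_per_access / 1e6;
}
```

With the figures from the post (75 MHz bus, 150 MHz CPU, 32-bit bus), even an ideal single-cycle access gives `words_per_cpu_clock(75e6, 150e6, 1)` = 0.5, i.e. at most one 32-bit word every two CPU clocks, and every additional bus cycle per access lowers that further.
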
bfroemel@gmail.com wrote:
> Hello,
>
> I am doing the Dhrystone benchmark on a TC1796A microcontroller from
> Infineon with different compilers (TASKING, Hightec-RT GNU GCC). Both
> compilers are set to produce highest speed optimized code, still I
> only get about 15 DMIPs with GNU GCC and about 20 DMIPs with TASKING.
> CPU is running at 150MHz, 75MHz system clock, programs run from
> external SRAM; no difference (as it should be) with or without
> connected Lauterbach Debugger.
> TASKING performs probably better because of the optimized libc, still:
> the difference is big and it seems overall very low compared in DMIPs
> per MHz (0.1 DMIPs per MHz) to other microcontrollers: e.g. Freescales
> MPC555 where DMIPs per MHz are at least over 1.
>
> In fact my readings are so low that I think I either misconfigured
> something profoundly wrong or my board design is broken.
>
> Has anyone done similar tests or has any comments?
>

Can you do a comparison with the code (at least the core loops) running from internal RAM? As you know, there are several distinct banks of internal RAM, having different properties. Some execute faster than others.

> Can you do a comparison with the code (at least the core loops) running
> from internal RAM? As you know, there are several distinct banks of
> internal RAM, having different properties. Some execute faster than others.

Okay, I did that, not expecting much of a difference. But my DMIPS
suddenly jump to 45.5 when I place all of .text into on-chip RAM.
Stack and heap are still in external SRAM, so there could be room for
improvement. Thanks a lot! :)

So, is there something wrong with my external SRAM configuration, or
is this "normal" for this kind of microcontroller? So far I have more
experience with ARM, LEON3 and NIOS2, which all had very small on-chip
but large off-chip RAM, and there the external SRAM was the fastest memory.

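For comparing the figures quoted so far, the normalisation is simple arithmetic. The sketch below uses illustrative function names that are not from the thread; the divisor 1757 is the conventional VAX 11/780 reference score used to turn Dhrystones per second into DMIPS.

```c
/* Convert a raw Dhrystone result (iterations per second) to DMIPS,
 * using the conventional VAX 11/780 reference of 1757 Dhrystones/s. */
static double dmips_from_dhrystones(double dhrystones_per_sec)
{
    return dhrystones_per_sec / 1757.0;
}

/* Normalise a DMIPS figure by the CPU clock in MHz. */
static double dmips_per_mhz(double dmips, double cpu_mhz)
{
    return dmips / cpu_mhz;
}
```

At the 150 MHz CPU clock given in the original post, `dmips_per_mhz` yields 0.1 for the 15 DMIPS GCC result, about 0.13 for the 20 DMIPS TASKING result, and about 0.30 for the 45.5 DMIPS obtained with .text in on-chip RAM.
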
bfroemel@gmail.com writes:

>> Can you do a comparison with the code (at least the core loops) running
>> from internal RAM? As you know, there are several distinct banks of
>> internal RAM, having different properties. Some execute faster than others.
> Okay, I did that - not expecting much of a difference; So suddenly my
> DMIPS jump to 45.5, when I place all the .text into on-chip RAM -
> stack/heap are still in external SRAM, so there could be room for
> improvement. Thanks a lot! :)
> So, there is something wrong with my external SRAM configuration, or
> is this "normal" for those kind of microcontrollers? Currently, I have
> more experience with ARM, LEON3, NIOS2 and they had all very small on-
> chip but large off-chip RAM, whereas SRAM has been the fastest.

Don't know the chip; but are you *sure* you have enabled the
instruction cache? On e.g. ARM720, it is not easy and not done by
default.

--
John Devereux

bfroemel@gmail.com wrote:
>> Can you do a comparison with the code (at least the core loops) running
>> from internal RAM? As you know, there are several distinct banks of
>> internal RAM, having different properties. Some execute faster than others.
> Okay, I did that - not expecting much of a difference; So suddenly my
> DMIPS jump to 45.5, when I place all the .text into on-chip RAM -
> stack/heap are still in external SRAM, so there could be room for
> improvement. Thanks a lot! :)
> So, there is something wrong with my external SRAM configuration, or
> is this "normal" for those kind of microcontrollers? Currently, I have
> more experience with ARM, LEON3, NIOS2 and they had all very small on-
> chip but large off-chip RAM, whereas SRAM has been the fastest.
>

It's *very* normal for the TC1796! Some of those internal banks have
paths 8 words (256 bits) wide. See the context-management instructions
for an example of this (you have, of course, put the context list in the
proper RAM bank: there's just one intended for contexts.)

You may get further speed-ups if you relocate code to the internal
flash, & turn on code caching (if it isn't on already).

Afaik, TC1796 is really intended to run all its code either in the
internal flash, or in the so-called "scratchpad" RAM area. The real
purpose of RAM-resident code (in this chip) is to enable you to
self-reprogram the code flash: like most flash, it is inaccessible while
a program/erase cycle is running.

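One common way to move only selected hot functions into a fast memory with a GCC-style toolchain is a section attribute. This is a minimal sketch, not taken from the thread: the section name `.text.fastcode`, the `FAST_CODE` macro and the example function are all hypothetical, and the linker script still has to map that section to the scratchpad RAM (with the startup code copying it there if it is loaded from flash).

```c
#include <stdint.h>

/* Hypothetical section name: the linker script must place
 * ".text.fastcode" in on-chip scratchpad RAM for this to have any
 * effect. Guarded so the code still builds with other compilers. */
#if defined(__GNUC__) && defined(__ELF__)
#define FAST_CODE __attribute__((section(".text.fastcode"), noinline))
#else
#define FAST_CODE
#endif

/* Example hot loop tagged for the fast section. */
FAST_CODE uint32_t checksum32(const uint32_t *data, uint32_t count)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; ++i) {
        sum += data[i];   /* unsigned addition wraps modulo 2^32 */
    }
    return sum;
}
```

Tagging individual functions this way keeps the rest of the image where the default linker script puts it, which can matter when the fast RAM is too small for all of .text.
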
> It's *very* normal for the TC1796! Some of those internal banks have
> paths 8 words (256 bits) wide. See the context-management instructions
> for an example of this (you have, of course, put the context list in the
> proper RAM bank: there's just one intended for contexts.)
> You may get further speed-ups if you relocate code to the internal
> flash, & turn on code caching (if it isn't on already).
> Afaik, TC1796 is really intended to run all its code either in the
> internal flash, or in the so-called "scratchpad" RAM area. The real
> purpose of RAM-resident code (in this chip) is to enable you to
> self-reprogram the code flash: like most flash, it is inaccessible while
> a program/erase cycle is running.

Thanks for clearing this up! The guy who wrote those default linker
scripts either had no worries about execution speed or expected, like
me, more performance from the external bus.
CSA lists must be put into on-chip LDRAM on the TC1796; I didn't try
otherwise. I'll certainly heed your advice to use on-chip memories.

By the way, David, through my searches (there are still not many posts
about Tricore), I've seen that I missed a request from you about the
sources of the GNU Hightec-RT toolchain back in December 2006. Now
they are finally offered on:
http://www.hightec-rt.com/index.php?option=com_docman&task=cat_view&gid=30&Itemid=50
Back then, I was given only a password-protected FTP URL, which
stopped working a few days later.

Greetings,
   Bernhard

bfroemel@gmail.com wrote:
>> It's *very* normal for the TC1796! Some of those internal banks have
>> paths 8 words (256 bits) wide. See the context-management instructions
>> for an example of this (you have, of course, put the context list in the
>> proper RAM bank: there's just one intended for contexts.)
>> You may get further speed-ups if you relocate code to the internal
>> flash, & turn on code caching (if it isn't on already).
>> Afaik, TC1796 is really intended to run all its code either in the
>> internal flash, or in the so-called "scratchpad" RAM area. The real
>> purpose of RAM-resident code (in this chip) is to enable you to
>> self-reprogram the code flash: like most flash, it is inaccessible while
>> a program/erase cycle is running.
>
> Thanks for clearing this issue! The guy who wrote those default linker
> scripts either had no worries about execution speed or expected, like
> me, more performance from the external bus.
> CSA lists must be put into on-chip LDRAM on the TC1796, I didn't try
> otherwise. I'll certainly heed your advice to use on-chip memories.
>
> By the way, David, through my searches (there are still not many posts
> about Tricore), I've seen that I missed a request from you about the
> sources of the GNU Hightec-RT toolchain back in December 2006. Now,
> they are finally offered on:
> http://www.hightec-rt.com/index.php?option=com_docman&task=cat_view&gid=30&Itemid=50
> Back then, I've been given only a password protected FTP URL which
> stopped to work a few days later.
>

Many thanks, Bernhard! I've hunted high & low for those, & not (till now) been able to get them. Thanks again.
Back to the subject of making sure the code cache is on...  where are
you linking your code to?  If you link to 0xA....... (e.g. 0xA2000000
for external SRAM, say) then this isn't cached.  If you link to
0x8....... (e.g. 0x82000000) then this accesses it with the cache
enabled.  Check out page 9-6 of the user manual.  There's no need to
do this for internal SPRAM (0xd4000000).

Cheers
Richard
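
The cached and non-cached views described in this post differ only in the top address nibble, which can be sketched as below. The macro and function names are illustrative, and the segment values are taken solely from the two example addresses in the post (0x8....... cached, 0xA....... non-cached); the user manual remains the authority on the actual memory map.

```c
#include <stdint.h>
#include <stdbool.h>

#define SEG_MASK       0xF0000000u  /* top nibble selects the segment  */
#define SEG_CACHED     0x80000000u  /* 0x8.......: cached view         */
#define SEG_NONCACHED  0xA0000000u  /* 0xA.......: non-cached view     */

/* True if the address lies in the cached segment. */
static bool is_cached_alias(uint32_t addr)
{
    return (addr & SEG_MASK) == SEG_CACHED;
}

/* Map a non-cached 0xA....... address to its cached 0x8....... alias.
 * Addresses in any other segment (e.g. 0xD4000000) are returned
 * unchanged. */
static uint32_t to_cached_alias(uint32_t addr)
{
    if ((addr & SEG_MASK) == SEG_NONCACHED) {
        return (addr & ~SEG_MASK) | SEG_CACHED;
    }
    return addr;
}
```

For example, `to_cached_alias(0xA2000000u)` gives `0x82000000u`, matching the link addresses mentioned above. In practice the change is made in the linker script's memory regions rather than at run time.
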

