A new benchmark suitable for small systems: stdcbench

Started by Philipp Klaus Krause February 7, 2018
For benchmarking C implementations, there are a few benchmarks, but
they all have their problems. Many benchmarks have memory requirements
that are far too high or need functionality not necessarily available.
Some are quite one-sided in what they measure (e.g. Whetstone,
Dhrystone, Coremark).

So, I decided to write a new benchmark, stdcbench. I wanted it to be
suitable for small systems (4KB of RAM, about 32 KB of Flash). There is
a trade-off here, since all the data and code will fit easily into
caches on bigger systems, but IMO it is worth it.

The current version consists of 2 modules, which on typical systems
should contribute about equally to the score.

c90base:
It benchmarks a commonly-implemented subset of what the standard
requires for freestanding implementations of C90. It consists of three
submodules:
1) Huffman/RLE decompression (adapted from real-world code)
2) Integer matrix multiplication (synthetic)
3) Insertion sort (adapted from real-world code)
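
To give a feel for the kind of code the third submodule exercises, here is a minimal insertion sort written in C90 style. It is only an illustrative sketch: the function name and signature are assumptions, and this is not the code used in stdcbench.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only, not taken from stdcbench.
   Sorts a[0..n-1] in ascending order, in place.
   C90 style: declarations at the start of the block. */
void insertion_sort(int *a, size_t n)
{
    size_t i, j;
    int key;

    assert(a != NULL || n == 0);

    for (i = 1; i < n; i++) {
        key = a[i];
        /* Shift larger elements one slot to the right to open a gap. */
        for (j = i; j > 0 && a[j - 1] > key; j--)
            a[j] = a[j - 1];
        a[j] = key;
    }
}
```

The routine needs no library support beyond the freestanding headers and only a few bytes of stack, which is the sort of footprint a 4 KB RAM target can afford.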

c90lib:
Benchmarks the standard library.
It consists of two submodules:
1) Computation of lnlc-width (adapted from real-world code).
2) Peephole optimizer (simplified from real-world code).


C99 features (e.g. bool, restrict) are used where available, but are not
required.
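
One common way to use C99 features only where available is to test `__STDC_VERSION__` and fall back to C90 equivalents. The sketch below shows that pattern; the names `BENCH_RESTRICT`, `bench_bool` and `copy_bytes` are hypothetical and not taken from stdcbench, whose actual mechanism may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdbool.h>
#define BENCH_RESTRICT restrict
typedef bool bench_bool;
#else
/* C90 fallback: no restrict qualifier, and a small integer as boolean. */
#define BENCH_RESTRICT
typedef unsigned char bench_bool;
#endif

/* Copies n bytes from src to dst; the regions must not overlap.
   Returns nonzero if at least one copied byte was nonzero. */
bench_bool copy_bytes(unsigned char *BENCH_RESTRICT dst,
                      const unsigned char *BENCH_RESTRICT src,
                      size_t n)
{
    size_t i;
    bench_bool seen_nonzero = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    for (i = 0; i < n; i++) {
        dst[i] = src[i];
        if (src[i] != 0)
            seen_nonzero = 1;
    }
    return seen_nonzero;
}
```

The same source then builds with a C90-only compiler while still giving a C99 compiler the aliasing information that restrict provides.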

So far, stdcbench seems to achieve the goals: benchmark a wide range of
important standard C functionality, without giving too much emphasis to
any particular aspect.

Scores are reported for each module and as total.

Example output from an i7-7500U-based system (benchmark compiled with GCC
7.2.0 using -O2 -march=native):

stdcbench 0.2
stdcbench c90base score: 7827
stdcbench c90lib score: 6548
stdcbench final score: 14375

Example output from an STM8AF5288 at 16 MHz (benchmark compiled with SDCC
3.6.9 using -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.2
stdcbench c90base score: 6
stdcbench c90lib score: 6
stdcbench final score: 12

Future plans for the benchmark:

1) Come up with module(s) for floating-point performance. What matters
for embedded systems? How should correctness be verified for floating-point?
2) Find out why the c90lib module hangs on C8051F120 (possible compiler
bug).
3) State run/reporting rules.
4) Benchmark a few interesting systems


I am looking forward to comments from you.

http://stdcbench.org/

Philipp
On Wednesday, 7 February 2018, 17:50:05 UTC+2, Philipp Klaus Krause wrote:
> [original post quoted in full; snipped]
Nice.

One observation though: the STM8 score seems too low. I mean, it would be difficult to compare systems that have scores like that (11, 12, 15 etc.). I know the STM8 and I know it's quite powerful. I even use these (and some AVRs) at a much lower frequency (5 MHz).

What I'm trying to say is that the score for such a system (STM8/AVR8/16 MHz) should be in the 100-1000 range.

So please scale the scores up! (But take care that the least significant digits are not noise!)

I don't care if the PC scores would be in the millions...
On 09.02.2018 at 07:36, raimond.dragomir@gmail.com wrote:
> Nice.
> One observation though: the STM8 score seems too low. I mean, it would
> be difficult to compare systems that have scores like that (11, 12, 15 etc.).
> I know the STM8 and I know it's quite powerful. I even use these (and some
> AVRs) at a much lower frequency (5 MHz).
>
> What I'm trying to say is that the score for such a system (STM8/AVR8/16 MHz)
> should be in the 100-1000 range.
>
> So please scale the scores up! (But take care that the least significant
> digits are not noise!)
>
> I don't care if the PC scores would be in the millions...

I agree. The previous resolution was often insufficient to even see the
effect of compiler optimizations. In version 0.3, I did a bit of
rebalancing and rescaling of scores.

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
--opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
--opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 107
stdcbench c90lib score: 87
stdcbench final score: 194

Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
--stack-auto --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 96
stdcbench final score: 96

Philipp

P.S.: The reason the c90lib module is not enabled for the C8051F120 is
that it runs out of stack space.
Philipp Klaus Krause <pkk@spth.de> writes:
> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large

Was that really supposed to say 98 mhz?

Can you say the code size for the different compiler outputs?

Could you do the AVR8 and the MSP430 with gcc, if you happen to have
those available? Would the ARM Cortex M0 be getting outside the
intended range of this benchmark?

Thanks!
On 2018-02-09 Paul Rubin wrote in comp.arch.embedded:
> Philipp Klaus Krause <pkk@spth.de> writes:
>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>
> Was that really supposed to say 98 mhz?

No, I think he meant to say 98 MHz:
https://www.silabs.com/products/mcu/8-bit/c8051f12x-f13x/device.c8051f120

Yes, those 8051s have progressed a bit since the 12 MHz, 12-cycle
instruction devices of some 25 years ago. ;-)

--
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

Many hands make light work.
-- John Heywood
On 09.02.2018 at 22:28, Paul Rubin wrote:
> Philipp Klaus Krause <pkk@spth.de> writes:
>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>
> Was that really supposed to say 98 mhz?

Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
The C8051 is rated at 100 MHz.

> Can you say the code size for the different compiler outputs?

I'll report exact numbers when I have a bigger range of results. But for
now, it seems that code size on the MCS-51 is about twice that of the STM8
when using the same features (i.e. c90lib module enabled or disabled for
both targets).

> Could you do the AVR8 and the MSP430 with gcc, if you happen to have
> those available? Would the ARM Cortex M0 be getting outside the
> intended range of this benchmark?

The M0 definitely falls into the intended range. However, I don't have
any around at the moment.

I intend to do a few more benchmarks with what I have, probably next
weekend or during the week after:

* STM8AF5288 @ 16 MHz using SDCC 3.5.0, 3.6.0, 3.7.0, some IAR and
  Cosmic compilers and various optimization settings
* C8051F120 @ 98 MHz using SDCC 3.5.0, 3.6.0, 3.7.0 and various
  optimization settings
* STM8S208 @ 24 MHz
* Z80 @ 3.58 MHz (in the Sega Master System II or Sega Mark III)
* CY7C68013A @ 48 MHz (an 8051 derivative from Cypress)

I also intend to get a few more boards to compare (at least Cortex M0
and RISC-V).

Philipp
On 10.2.18 19:47, Philipp Klaus Krause wrote:
> On 09.02.2018 at 22:28, Paul Rubin wrote:
>> Philipp Klaus Krause <pkk@spth.de> writes:
>>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>>
>> Was that really supposed to say 98 mhz?
>
> Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
> The C8051 is rated at 100 MHz.

4 * 24 MHz = 96 MHz.

--
-TV
On 10.02.2018 at 20:32, Tauno Voipio wrote:
> On 10.2.18 19:47, Philipp Klaus Krause wrote:
>> On 09.02.2018 at 22:28, Paul Rubin wrote:
>>> Philipp Klaus Krause <pkk@spth.de> writes:
>>>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>>>
>>> Was that really supposed to say 98 mhz?
>>
>> Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
>> The C8051 is rated at 100 MHz.
>
> 4 * 24 MHz = 96 MHz.

Yes. Sorry for the mistake. The C8051 internal oscillator frequency is
24.5 MHz, so 4 * 24.5 MHz = 98 MHz.

Philipp
On Friday, 9 February 2018, 21:48:34 UTC+2, Philipp Klaus Krause wrote:
> [stdcbench 0.3 results quoted in full; snipped]
Now it's better :)

This kind of benchmark is very interesting. Without it you can only have a "feeling" about the power of an architecture, and only if you have a lot of experience with it. And of course it depends a lot on the application.

For example, it seems that the STM8S at 16 MHz performs better than the C8051 at 100 MHz. This is not a surprise to me. I have worked with the 8051 for a long time and I know very well what it is capable of. For example, an 8051 is almost unbeatable for small control applications of under 8K program size and max. 256 bytes of internal RAM. But if you cross that line and your program gets bigger, and especially if you need more RAM and start to use the XRAM, the efficiency goes down rapidly. The 8051 just doesn't scale well in the addressing range. In the 8K/256 range it is probably the best 8-bitter; in the 64K/64K range it is probably the worst :)

Here the program size is not a direct factor; it usually depends on how much RAM you need. My experience is that you can "grow" your program up to 8K and still use only the internal 256 bytes of RAM.

But of course, this benchmark is not supposed to reveal this kind of thing...
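
The per-clock comparison behind that observation can be made explicit by dividing each score by the clock frequency. The helper below is a hypothetical sketch (the function name is an assumption) that uses integer arithmetic so it would also run on a small target. Only the c90base scores are compared, since c90lib was disabled on the C8051F120.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
   Returns the score per MHz, scaled by 100 and truncated,
   so that 681 stands for 6.81 points per MHz. */
unsigned long score_per_mhz_x100(unsigned long score, unsigned long mhz)
{
    assert(mhz > 0);
    return score * 100UL / mhz;
}
```

With the stdcbench 0.3 c90base numbers from this thread, `score_per_mhz_x100(109UL, 16UL)` gives 681 for the STM8AF5288 and `score_per_mhz_x100(96UL, 98UL)` gives 97 for the C8051F120, i.e. roughly a factor of seven per clock cycle in favour of the STM8 for this workload and these compiler settings.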
Here is a small comparison of STM8 results with various current
compilers (all done on the STM8AF5288).

SDCC 3.7.0 RC1 with optimization for code size (-mstm8 --opt-code-size
--max-allocs-per-node 100000), binary size 20953 B:

stdcbench 0.3
stdcbench c90base score: 106
stdcbench c90lib score: 87
stdcbench final score: 193

SDCC 3.7.0 RC1 with optimization for code speed (-mstm8
--opt-code-speed --max-allocs-per-node 100000), binary size 21083 B:

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

IAR 3.10.1.201 with optimization for code size, binary size 24288 B:

stdcbench 0.3
stdcbench c90base score: 117
stdcbench c90lib score: 71
stdcbench final score: 188

IAR 3.10.1.201 with optimization for code speed, binary size 27268 B:

stdcbench 0.3
stdcbench c90base score: 197
stdcbench c90lib score: 100
stdcbench final score: 297

Cosmic 4.4.4 with optimization for code size:

stdcbench 0.3
stdcbench c90base score: 116
stdcbench final score: 116

Cosmic 4.4.4 with optimization for code speed:

stdcbench 0.3
stdcbench c90base score: 123
stdcbench final score: 123

For Cosmic 4.4.4, the c90lib module was disabled, since Cosmic 4.4.4
doesn't provide qsort() in the standard library. The Raisonance compiler
was not included in the comparison due to difficulties getting an
evaluation license.
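
For readers unfamiliar with the dependency: qsort() is the generic sorting routine declared in <stdlib.h>, and it takes a comparison callback. The sketch below shows a typical call; `compare_ints` and `sort_ints` are hypothetical names, and this is not the code from the c90lib module.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback in the form qsort() requires:
   negative, zero or positive for less, equal or greater. */
static int compare_ints(const void *lhs, const void *rhs)
{
    int a = *(const int *)lhs;
    int b = *(const int *)rhs;

    return (a > b) - (a < b);
}

/* Hypothetical wrapper, for illustration only:
   sorts count ints in ascending order using the library qsort(). */
void sort_ints(int *values, size_t count)
{
    assert(values != NULL);
    qsort(values, count, sizeof values[0], compare_ints);
}
```

A library without qsort() cannot link code like this, which is why a module that benchmarks the standard library has to be disabled on such a toolchain.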

These results are quite interesting when compared to Dhrystone and
Coremark (see http://www.colecovision.eu/stm8/compilers.shtml). In
particular, while SDCC is ahead in Dhrystone and Coremark scores, it
apparently falls behind in stdcbench scores. On the other hand, SDCC
seems to do better in code size for stdcbench.

Philipp
