For benchmarking C implementations, there are a few benchmarks, but they all have their problems. Many have memory requirements that are far too high or need functionality that is not necessarily available. Some are quite one-sided in what they measure (e.g. Whetstone, Dhrystone, Coremark).

So I decided to write a new benchmark, stdcbench. I wanted it to be suitable for small systems (4 KB of RAM, about 32 KB of Flash). There is a trade-off here, since all the data and code will easily fit into the caches of bigger systems, but IMO it is worth it.

The current version consists of 2 modules, which on typical systems should contribute about equally to the score.

c90base:
It benchmarks a commonly implemented subset of what the standard requires for freestanding implementations of C90. It consists of three submodules:
1) Huffman/RLE decompression (adapted from real-world code)
2) Integer matrix multiplication (synthetic)
3) Insertion sort (adapted from real-world code)

c90lib:
It benchmarks the standard library. It consists of two submodules:
1) Computation of lnlc-width (adapted from real-world code)
2) Peephole optimizer (simplified from real-world code)

C99 features (e.g. bool, restrict) are used where available, but are not required.

So far, stdcbench seems to achieve its goals: benchmark a wide range of important standard C functionality without giving too much emphasis to any particular aspect.

Scores are reported for each module and as a total.

Example output from an i7-7500U-based system (benchmark compiled with GCC 7.2.0 using -O2 -march=native):

stdcbench 0.2
stdcbench c90base score: 7827
stdcbench c90lib score: 6548
stdcbench final score: 14375

Example output from an STM8AF5288 at 16 MHz (benchmark compiled with SDCC 3.6.9 using -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.2
stdcbench c90base score: 6
stdcbench c90lib score: 6
stdcbench final score: 12

Future plans for the benchmark:

1) Come up with module(s) for floating-point performance. What matters for embedded systems? How should correctness be verified for floating-point?
2) Find out why the c90lib module hangs on the C8051F120 (possible compiler bug).
3) State run/reporting rules.
4) Benchmark a few interesting systems.

I am looking forward to comments from you.

http://stdcbench.org/

Philipp
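To give an idea of the kind of code a freestanding C90 benchmark module can exercise, here is a minimal sketch of an integer matrix multiplication. It is not taken from stdcbench; the name matmul and the size DIM are assumptions chosen for illustration. It uses no library functions and unsigned arithmetic, so overflow wraps instead of being undefined.

```c
/* Hypothetical sketch, not taken from stdcbench: a C90 integer matrix
   multiplication that needs no library functions. */
#define DIM 4 /* assumed size, small enough for a few hundred bytes of RAM */

/* Computes c = a * b for DIM x DIM matrices of unsigned int.
   Unsigned arithmetic wraps on overflow, so the result is well defined. */
void matmul(unsigned int c[DIM][DIM],
            unsigned int a[DIM][DIM],
            unsigned int b[DIM][DIM])
{
    int i, j, k;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            unsigned int sum = 0;

            /* Dot product of row i of a with column j of b. */
            for (k = 0; k < DIM; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```

With 16-bit int, the three 4x4 matrices together occupy 96 bytes, which fits comfortably into the 4 KB RAM target. The output matrix c must not alias a or b.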
A new benchmark suitable for small systems: stdcbench
Started by ●February 7, 2018
Reply by ●February 9, 2018
On Wednesday, 7 February 2018, 17:50:05 UTC+2, Philipp Klaus Krause wrote:
> [...]
> Example output from an STM8AF5288 at 16 MHz (benchmark compiled with SDCC
> 3.6.9 using -mstm8 --opt-code-speed --max-allocs-per-node 10000):
>
> stdcbench 0.2
> stdcbench c90base score: 6
> stdcbench c90lib score: 6
> stdcbench final score: 12
> [...]

Nice.
One observation though: the STM8 score seems too low. I mean, it would be difficult to compare systems that have scores like that (11, 12, 15 etc.). I know the STM8 and I know it's quite powerful. I even use these (and some AVRs) at a much lower frequency (5 MHz).

What I'm trying to say is that the score for such a system (STM8/AVR8 at 16 MHz) should be in the 100-1000 range.

So please scale the scores up! (But take care that the least significant digits are not noise!)

I don't care if the PC scores would be in the millions...
Reply by ●February 9, 2018
On 09.02.2018 at 07:36, raimond.dragomir@gmail.com wrote:
> Nice.
> One observation though: the STM8 score seems too low. I mean, it would
> be difficult to compare systems that have scores like that (11, 12, 15 etc.).
> I know the STM8 and I know it's quite powerful. I even use these (and some
> AVRs) at a much lower frequency (5 MHz).
>
> What I'm trying to say is that the score for such a system (STM8/AVR8 at 16 MHz)
> should be in the 100-1000 range.
>
> So please scale the scores up! (But take care that the least significant digits are not noise!)
>
> I don't care if the PC scores would be in the millions...

I agree. The previous resolution was often insufficient to even see the effect of compiler optimizations. In version 0.3, I did a bit of rebalancing and rescaling of scores.

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8 --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 107
stdcbench c90lib score: 87
stdcbench final score: 194

Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large --stack-auto --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 96
stdcbench final score: 96

Philipp

P.S.: The reason the c90lib module is not enabled for the C8051F120 is that it runs out of stack space.
Reply by ●February 9, 2018
Philipp Klaus Krause <pkk@spth.de> writes:
> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large

Was that really supposed to say 98 mhz?

Can you give the code size for the different compiler outputs?

Could you do the AVR8 and the MSP430 with gcc, if you happen to have those available? Would the ARM Cortex M0 be getting outside the intended range of this benchmark?

Thanks!
Reply by ●February 9, 2018
On 2018-02-09 Paul Rubin wrote in comp.arch.embedded:
> Philipp Klaus Krause <pkk@spth.de> writes:
>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>
> Was that really supposed to say 98 mhz?

No, I think he meant to say 98 MHz:
https://www.silabs.com/products/mcu/8-bit/c8051f12x-f13x/device.c8051f120

Yes, those 8051s have progressed a bit since the 12 MHz, 12-cycle instruction devices of some 25 years ago. ;-)

-- 
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

Many hands make light work.
-- John Heywood
Reply by ●February 10, 2018
On 09.02.2018 at 22:28, Paul Rubin wrote:
> Philipp Klaus Krause <pkk@spth.de> writes:
>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>
> Was that really supposed to say 98 mhz?

Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL. The C8051 is rated at 100 MHz.

> Can you give the code size for the different compiler outputs?

I'll report exact numbers when I have a bigger range of results. But for now, it seems that code size on the MCS-51 is about twice that of the STM8 when using the same features (i.e. c90lib module enabled or disabled for both targets).

> Could you do the AVR8 and the MSP430 with gcc, if you happen to have
> those available? Would the ARM Cortex M0 be getting outside the
> intended range of this benchmark?

The M0 definitely falls into the intended range. However, I don't have any around at the moment.

I intend to do a few more benchmarks with what I have, probably next weekend or during the week after:

* STM8AF5288 @ 16 MHz using SDCC 3.5.0, 3.6.0, 3.7.0, some IAR and Cosmic compilers and various optimization settings
* C8051F120 @ 98 MHz using SDCC 3.5.0, 3.6.0, 3.7.0 and various optimization settings
* STM8S208 @ 24 MHz
* Z80 @ 3.58 MHz (in the Sega Master System II or Sega Mark III)
* CY7C68013A @ 48 MHz (an 8051 derivative from Cypress)

I also intend to get a few more boards to compare (at least Cortex M0 and RISC-V).

Philipp
Reply by ●February 10, 2018
On 10.2.18 19:47, Philipp Klaus Krause wrote:
> On 09.02.2018 at 22:28, Paul Rubin wrote:
>> Philipp Klaus Krause <pkk@spth.de> writes:
>>> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
>>
>> Was that really supposed to say 98 mhz?
>
> Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
> The C8051 is rated at 100 MHz.

4 * 24 MHz = 96 MHz.

-- 

-TV
Reply by ●February 10, 2018
On 10.02.2018 at 20:32, Tauno Voipio wrote:
> On 10.2.18 19:47, Philipp Klaus Krause wrote:
>> Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
>> The C8051 is rated at 100 MHz.
>
> 4 * 24 MHz = 96 MHz.

Yes. Sorry for the mistake. The C8051 internal oscillator frequency is 24.5 MHz, so multiplying by 4 via the PLL gives the 98 MHz stated.

Philipp
Reply by ●February 11, 2018
On Friday, 9 February 2018, 21:48:34 UTC+2, Philipp Klaus Krause wrote:
> [...]
> Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
> --opt-code-speed --max-allocs-per-node 10000):
>
> stdcbench 0.3
> stdcbench c90base score: 109
> stdcbench c90lib score: 88
> stdcbench final score: 197
>
> Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
> --opt-code-size --max-allocs-per-node 10000):
>
> stdcbench 0.3
> stdcbench c90base score: 107
> stdcbench c90lib score: 87
> stdcbench final score: 194
>
> Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
> --stack-auto --opt-code-size --max-allocs-per-node 10000):
>
> stdcbench 0.3
> stdcbench c90base score: 96
> stdcbench final score: 96
>
> Philipp
>
> P.S.: The reason the c90lib module is not enabled for the C8051F120 is
> that it runs out of stack space.

Now it's better :)
This kind of benchmark is very interesting. Without it you can only have a "feeling" about the power of an architecture, and only if you have a lot of experience with it. And of course it depends very much on the application.

For example, it seems that the STM8S at 16 MHz performs better than the C8051 at 100 MHz. This is not a surprise to me. I have worked with the 8051 for a long time and I know very well what it is capable of. For example, an 8051 is almost unbeatable for small control applications of under 8K program size and at most 256 bytes of internal RAM. But if you cross this line and your program gets bigger, and especially if you need more RAM and start to use XRAM, the efficiency goes down rapidly. The 8051 just doesn't scale well with the addressing range. In the 8K/256 range it is probably the best 8-bitter; in the 64K/64K range it is probably the worst :)

Here the program size is not a direct factor; it usually depends on how much RAM you need. My experience is that you can "grow" your program up to 8K and still use only the internal 256 bytes of RAM.

But of course, this benchmark is not supposed to reveal this kind of thing...
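
One way to put a number on the STM8 versus C8051 comparison is to normalize the scores by clock frequency. The following is only a sketch: score_per_mhz_x100 is a hypothetical helper and not part of stdcbench, and score per MHz is an assumed metric, not one the benchmark defines. It uses integer arithmetic only, so it would also run on the small targets discussed here.

```c
/* Hypothetical helper, not part of stdcbench: benchmark score per MHz,
   scaled by 100 so that integer arithmetic suffices (668 means 6.68).
   The clock is given in kHz so that 24.5 MHz * 4 = 98000 kHz is exact. */
unsigned long score_per_mhz_x100(unsigned long score, unsigned long clock_khz)
{
    /* score / (clock_khz / 1000) * 100 == score * 100000 / clock_khz.
       With a 32-bit unsigned long this does not overflow for scores
       below about 42000; the division truncates. */
    return score * 100000UL / clock_khz;
}
```

Using the c90base scores reported with --opt-code-size, score_per_mhz_x100(107, 16000) gives 668 for the STM8AF5288 and score_per_mhz_x100(96, 98000) gives 97 for the C8051F120, i.e. roughly a factor of seven per MHz in favour of the STM8 on this module.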
Reply by ●February 11, 2018
Here is a small comparison of STM8 results with various current compilers (all done on the STM8AF5288).

SDCC 3.7.0 RC1 with optimization for code size (-mstm8 --opt-code-size --max-allocs-per-node 100000), binary size 20953 B:

stdcbench 0.3
stdcbench c90base score: 106
stdcbench c90lib score: 87
stdcbench final score: 193

SDCC 3.7.0 RC1 with optimization for code speed (-mstm8 --opt-code-speed --max-allocs-per-node 100000), binary size 21083 B:

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

IAR 3.10.1.201 with optimization for code size, binary size 24288 B:

stdcbench 0.3
stdcbench c90base score: 117
stdcbench c90lib score: 71
stdcbench final score: 188

IAR 3.10.1.201 with optimization for code speed, binary size 27268 B:

stdcbench 0.3
stdcbench c90base score: 197
stdcbench c90lib score: 100
stdcbench final score: 297

Cosmic 4.4.4 with optimization for code size:

stdcbench 0.3
stdcbench c90base score: 116
stdcbench final score: 116

Cosmic 4.4.4 with optimization for code speed:

stdcbench 0.3
stdcbench c90base score: 123
stdcbench final score: 123

For Cosmic 4.4.4, the c90lib module was disabled, since Cosmic 4.4.4 doesn't provide qsort() in the standard library. The Raisonance compiler was not included in the comparison due to difficulties getting an evaluation license.

These results are quite interesting when compared to Dhrystone and Coremark (see http://www.colecovision.eu/stm8/compilers.shtml). In particular, while SDCC is ahead in Dhrystone and Coremark scores, it apparently falls behind in stdcbench scores. On the other hand, SDCC seems to do better in code size for stdcbench.

Philipp
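
For readers wondering what the missing qsort() means in practice, here is a minimal sketch of how the C90 library function is typically called. It is not taken from stdcbench, and the thread does not say how c90lib uses qsort(); compare_int and sort_ints are hypothetical names for illustration.

```c
#include <stdlib.h>

/* Comparison callback for qsort(): returns a negative, zero or positive
   value for a < b, a == b, a > b. It avoids subtraction, which could
   overflow for operands of opposite sign. */
static int compare_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x > y) - (x < y);
}

/* Hypothetical wrapper: sorts n ints in ascending order using the
   standard library's qsort(). */
void sort_ints(int *values, size_t n)
{
    qsort(values, n, sizeof values[0], compare_int);
}
```

A toolchain whose library lacks qsort() cannot link code like this, so any benchmark module that depends on it has to be disabled or given a hand-written replacement, which would then no longer measure the vendor's library.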