Reply by Tim Wescott, June 26, 2014
On Thu, 26 Jun 2014 10:31:16 +0200, Noob wrote:

> On 25/06/2014 16:01, Noob wrote:
>
>> Profiling shows that merely decoding one HD channel (audio and video)
>> pegs the system CPU to 50%, which is unexpected, because all the heavy
>> lifting is done elsewhere.
>
> After /much/ digging around, it turns out that some moron on the team
> decided to disable compiler optimizations for the support libraries.
>
> Changing optimization level back to -Os drops the system CPU load to
> 27%.
>
>> If I disable the audio, the load drops to 25%... even though audio
>> tasks were far from taking 25%. When audio is disabled, the system CPU
>> spends less time in ALL other parts of the software.
>
> Disabling audio on the optimized build drops the CPU load to 14.5%. The
> phenomenon I described still occurs (the system spends less time in
> video related tasks) but it is much less of a factor than I had
> originally assumed.
>
> Also someone suggested that disabling audio means no longer needing to
> perform audio/video synchronization, which lowers system load even
> further.
>
> Sorry for the noise.
Hey, sometimes you just have to publicly admit that you're stumped before the answer jumps out at you.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by Mel Wilson, June 26, 2014
On Thu, 26 Jun 2014 10:31:16 +0200, Noob wrote:

> Sorry for the noise.
Good example of problem-solving at least.

Mel.
Reply by Noob, June 26, 2014
On 26/06/2014 00:16, Theo Markettos wrote:
> In comp.arch Noob wrote:
>> Hmmm, there is a "ram" event, but its only a counting event, so no cigar.
>> Perhaps using cache misses?
>> pfi and pfo (Pipeline Freeze due to cache miss Instruction/Operand)
>>
>> But the problem is not really SHARING the memory, but merely accessing it.
>> In the limit, each processor could have its own little private part of
>> RAM, but only one processor can access RAM at the same time. But that
>> would still impact the latency of the pipeline freeze on a miss.
>> (Sorry for thinking out loud, I'm really in the dark here.)
>
> What's the cache architecture? Is there only one cache, shared between all
> these processors doing different things, or does each processor have its own
> cache(s)?
The system CPU has its own cache hierarchy (L1+L2). I don't know much about the other processors on the SoC. Anyway, data shared by several processors are placed in non-cached memory.

Also, cf. my other message: the main problem came from someone disabling optimizations for the build :-(

Regards.
Reply by Noob, June 26, 2014
On 25/06/2014 16:01, Noob wrote:

> Profiling shows that merely decoding one HD channel (audio and video)
> pegs the system CPU to 50%, which is unexpected, because all the
> heavy lifting is done elsewhere.
After /much/ digging around, it turns out that some moron on the team decided to disable compiler optimizations for the support libraries.

Changing optimization level back to -Os drops the system CPU load to 27%.
> If I disable the audio, the load drops to 25%... even though audio
> tasks were far from taking 25%. When audio is disabled, the system
> CPU spends less time in ALL other parts of the software.
Disabling audio on the optimized build drops the CPU load to 14.5%. The phenomenon I described still occurs (the system spends less time in video-related tasks) but it is much less of a factor than I had originally assumed.

Also, someone suggested that disabling audio means no longer needing to perform audio/video synchronization, which lowers system load even further.

Sorry for the noise.
Reply by Theo Markettos, June 25, 2014
In comp.arch Noob <root@127.0.0.1> wrote:
> Hmmm, there is a "ram" event, but its only a counting event, so no cigar.
> Perhaps using cache misses?
> pfi and pfo (Pipeline Freeze due to cache miss Instruction/Operand)
>
> But the problem is not really SHARING the memory, but merely accessing it.
> In the limit, each processor could have its own little private part of
> RAM, but only one processor can access RAM at the same time. But that
> would still impact the latency of the pipeline freeze on a miss.
> (Sorry for thinking out loud, I'm really in the dark here.)
What's the cache architecture? Is there only one cache, shared between all these processors doing different things, or does each processor have its own cache(s)?

Theo
Reply by Noob, June 25, 2014
[ NB: cross-posted to comp.arch.embedded and comp.arch ]

Hello everyone,

I'm currently working on a "typical" set-top box project (digital TV).

The system can be considered a "heterogeneous computing" system,
with various "processing elements" for different tasks:

- an SH4 (ST40) "system" CPU, where the app runs on top of a mini OS
- a micro-controller for watchdog and low-power/stand-by functions
- a co-processor for audio decoding
- another co-processor and/or ASIC for video decoding
(the media decoders are not well documented)
- a few DMA engines
- a blitter gizmo for UI whiz-bang
- a crypto co-processor
- stuff I don't even know about

All of these access a shared resource: RAM
(through a shared bus??)

The ODM provides minimal profiling tools (instruction pointer sampling,
and a post-processing script to parse the symbol table, matching IP
with the corresponding function).

Problem is, these tools only profile the "system" CPU. The rest of
the system is a giant black-box to me.

Profiling shows that merely decoding one HD channel (audio and video)
pegs the system CPU at 50%, which is unexpected, because all the heavy
lifting is done elsewhere.

If I disable the audio, the load drops to 25%... even though audio
tasks were far from taking 25%. When audio is disabled, the system
CPU spends less time in ALL other parts of the software.

This would seem to point to bus contention for some shared
resource, and I'm thinking main memory.

Drop audio decoding => bus contention drops => everything runs smoother.

Does this theory make sense/hold water?

More importantly, how would I validate/invalidate it?

For this to be a credible explanation, the system CPU would have to
spin when it needs RAM while the bus is held by another entity,
instead of switching to a different task.

I'm thinking maybe I can use the perfcounters to highlight the CPU
twiddling its thumbs while waiting for RAM access?
http://www.stlinux.com/devel/debug/perfcounters/modes

Hmmm, there is a "ram" event, but it's only a counting event, so no cigar.
Perhaps using cache misses?
pfi and pfo (Pipeline Freeze due to cache miss Instruction/Operand)

But the problem is not really SHARING the memory, but merely accessing it.
In the limit, each processor could have its own little private part of
RAM, but only one processor can access RAM at the same time. But that
would still impact the latency of the pipeline freeze on a miss.
(Sorry for thinking out loud, I'm really in the dark here.)

Anyway, I'm open to suggestions / advice / warnings / etc.

Regards.