
Looking for ARM system with RTOS

Started by Bruce Varley December 31, 2012
Tim Wescott <tim@seemywebsite.com> wrote:
> Yes, I know you can run audio through it -- but I'm not sure how _good_ > of audio you can run through it, how much of the audio is working because > it's really real time and how much is just deep FIFO's pasted over a > bunch of problems, or how far away you can get from the optimized-for- > audio software paths in the OS before things break down.
I would claim that most consumer USB audio gear still adjusts the DAC clock rate based on the 1ms USB frame timer. The audio devices that use one of the standard-specified feedback methods are considered better and probably even are, but the simple way is good enough for most use. Buffer-wise, double 1ms buffers are normal on the device side. Host-side buffering requirements depend on a variety of factors, but e.g. on Windows you can get as low as 1-2ms worth with properly written ASIO drivers and a reasonably powerful machine. -a
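To put rough numbers on the 1ms framing and the double 1ms buffers described above, here is a minimal sketch in C. It is not how any particular device or driver works; the function names `samples_for_next_frame` and `double_buffer_bytes` are made up for illustration, and it assumes plain PCM carried in 1ms frames. At 44.1kHz the average is 44.1 samples per frame, so the sender has to alternate between 44 and 45 samples to keep the long-run rate right.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of samples to place in the next 1 ms frame.
 * *acc carries the fractional remainder (in units of 1/1000 sample) from
 * frame to frame, so the long-run average equals rate_hz / 1000. */
static uint32_t samples_for_next_frame(uint32_t rate_hz, uint32_t *acc)
{
    assert(acc != 0);
    *acc += rate_hz;                 /* samples owed, scaled by 1000 */
    uint32_t n = *acc / 1000u;       /* whole samples for this frame */
    *acc -= n * 1000u;               /* keep the remainder for later */
    return n;
}

/* Hypothetical helper: bytes needed for a double 1 ms buffer,
 * sized for the largest frame that can occur at this sample rate. */
static uint32_t double_buffer_bytes(uint32_t rate_hz,
                                    uint32_t channels,
                                    uint32_t bytes_per_sample)
{
    uint32_t max_samples = (rate_hz + 999u) / 1000u;  /* ceil(rate / 1000) */
    return 2u * max_samples * channels * bytes_per_sample;
}
```

Called ten times in a row at 44100 Hz with `*acc` starting at zero, the first helper hands out 441 samples in total, and a 16-bit stereo double buffer at that rate comes to 360 bytes. That gives a feel for how little data the device side actually holds.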
On Thu, 3 Jan 2013 19:30:20 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <c9ece8pp43md4q71b1047451sejrfabfes@4ax.com>, >jonk@infinitefactors.org says... >> ><<SNIP>> >> >> I'll give another example of my mindset. The current spate of >> multi-GHz x86 processors from Intel are fabricated with >> feature sizes and GTL technology (unless they've got >> something still newer since I last looked) that would permit >> the production of a VERY LOW power 100MHz laptop that could >> easily run for quite some time using nothing more than a few >> AA batteries. Nothing special. Just cheap Costco alkalines. > >OK, so you can run the CPU with a few milliwatts. Can you do >anything other than a reflective LCD display? Lighting >up even an 11" display could suck those AA cells dry >pretty quickly.
The HP Omnibook 300 with Win 3.1 in ROM would run on 4 AA batteries for about 2-3 weeks of regular use. In the early 1990s. Using OLD tech. Today? An 80386 die space would be practically invisible, would have near 100% yield, and would use lots less power still. The static RAM retention would be much less power, as well. And that was the main loss of battery power when the unit was closed. (It would retain SRAM for almost two months.) It also included a capacitor, so that you would have about 10 minutes to replace the AA batteries.
>> (In fact, it was done once with the HP Omnibook 300/Win >> 3.1... but with older feature sizes.) The current technology >> would wipe the floor with that older HP Omnibook, which >> itself put Windows completely in ROM (no boot from secondary >> storage) and would run for weeks on AA batteries available >> anywhere in the world. I need nothing more than a 80386 using >> those feature sizes -- no FP -- and running at 66MHz to >> 100MHz for word processing. The nice thing about that >> specific Omnibook (and none of the others) is that there was >> no special battery technology, it weighed almost nothing, >> included a wonderful pop-out mouse built in, and required >> nothing special when you closed it. It just shut off all >> power except and only what was required to retain the static >> ram. So when I opened it, I was exactly where I left off -- >> cursor, etc -- with exactly 0 seconds wait. When someone >> asked me a question, I closed it, answered the question, >> opened the laptop, and just kept on going. Weight was VERY >> low -- lower than any laptop I'm aware of today. > >Sounds sort of like a MacBook Air without the WIFI and >11" LCD screen.
Except with weeks of typical use and months of SRAM retention and absolutely ZERO time delay when opening it up for use, even weeks later. Office and Win 3.1 were both ROM'd.
>> But their is no longer a marketplace for this. So I can only >> get laptops with MUCH MUCH shorter active runtimes, despite >> huge advances in battery technology (for much more cost) and >> despite huge advances in FAB technology (which could be used >> to greatly reduce power consumption from that time.) > >I never did get an estimate on battery life for my OLPC laptop. >I suspect that the onboard wifi contributed about half the >power drain. I also suspect than the older Omnibooks didn't >have either Ethernet or Wifi active most of the time. Those >two alone will suck up a couple of AA cells pretty quickly.
There were a LOT of Omnibooks. Only 1 of them though was anything at all like the 300. That one stood out among the other Omnibooks like an Ostrich stands out in an ant farm. It had nothing similar to any of the other Omnibooks. It was a complete outlier. Unique.
>You can easily run an ARM CM4 on an average power of 15mA. >That should give you at least 100 hours off AA cells---it's >the peripherals that people expect today that kill the batteries.
I still have the Omnibook, by the way. I haven't used it in a while (carefully packed away) but I would guess from memory that the (4) AA batteries give on the order of 100 hours or so of continuous use. The display was NOT color, but grayscale. So that's a difference. Used a 1" hard drive.
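The 15mA / 100 hour figure quoted above is easy to sanity-check. The sketch below is only back-of-envelope arithmetic in C: the helper names are hypothetical, and the 2000mAh cell capacity used in the usage note is an assumed round number for an alkaline AA at light load, not a figure from this thread. Cells in series add voltage, not capacity, so the pack capacity is taken as that of a single cell.

```c
#include <assert.h>

/* Hypothetical helper: average current for a load that is active for a
 * fraction `duty` of the time (0.0 to 1.0) and asleep otherwise. */
static double average_current_ma(double duty, double active_ma, double sleep_ma)
{
    assert(duty >= 0.0 && duty <= 1.0);
    return duty * active_ma + (1.0 - duty) * sleep_ma;
}

/* Hypothetical helper: runtime in hours from a series pack.
 * Ignores voltage droop, regulator losses, and rate-dependent capacity. */
static double runtime_hours(double cell_capacity_mah, double avg_current_ma)
{
    assert(avg_current_ma > 0.0);
    return cell_capacity_mah / avg_current_ma;
}
```

With the assumed 2000mAh cell, `runtime_hours(2000.0, 15.0)` comes out a little over 133 hours, consistent with "at least 100 hours". Adding a peripheral that draws 100mA half the time raises the average by 50mA and cuts the runtime to roughly 30 hours, which is the point being made about peripherals.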
>I've done a lot of low-power stuff---instruments that sit >on oceanographic moorings for a year at a time. Displays >aren't used and couldn't be continuously powered. The big >power suckers are the storage medium---at 200MB per day, >alkaline cells wouldn't cut it. We end up using Lithium >primary cells, which makes shipping units and batteries >a true PITA!
Thing is, AA batteries are the ONLY battery you can be most sure of being able to find anywhere in the world. No special requirements, no unique shape to find, no $100 cost. Do you remember ANY Windows laptop that used AA batteries? Ever?? This one does. And that was 20 years ago. Jon
On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>, >jonk@infinitefactors.org says... >> >> On Thu, 3 Jan 2013 10:32:46 -0800, Mark Borgerson >> <mborgerson@comcast.net> wrote: >> >> >In article <u3eae8d0bt9c51qq0tbp30mucskp1o4csd@4ax.com>, >> >jonk@infinitefactors.org says... >> >> >> >> On Wed, 2 Jan 2013 22:43:15 -0800, Mark Borgerson >> >> <mborgerson@comcast.net> wrote: >> >> >> >> >In article <9289e8p4ecr3qalegrs5avpq9nmk1ap8jb@4ax.com>, >> >> >jonk@infinitefactors.org says... >> >> >> >> >> >> On Wed, 2 Jan 2013 12:52:47 -0800, Mark Borgerson >> >> >> <mborgerson@comcast.net> wrote: >> >> >> >> >> >> >In article <l996e8dcrons0s9d6104r0t500fra0c0t2@4ax.com>, >> >> >> >jonk@infinitefactors.org says... >> >> >> >> >> >> >> >> On Tue, 01 Jan 2013 12:09:40 +0200, upsidedown@downunder.com >> >> >> >> wrote: >> >> >> >> >> >> >> >> >On Tue, 1 Jan 2013 09:54:30 +0800, "Bruce Varley" <bv@NoSpam.com> >> >> >> >> >wrote: >> >> >> >> > >> >> >> >> >>I need: >> >> >> >> >> >> >> >> >> >>o CPU clock 200MHz or higher. >> >> >> >> >> >> >> >> >> >>o 2 serial ports, with access to the logic level lines on at least one (LV >> >> >> >> >>OK). >> >> >> >> >> >> >> >> >> >>o USB support. Socket support also would be nice, not essential. >> >> >> >> >> >> >> >> >> >>o Some sort of file system. >> >> >> >> >> >> >> >> >> >>o Guaranteed turnround of 10mS, even lower would be nice. My ARM Linux >> >> >> >> >>won'd do better than 20. >> >> ><<SNIP>> >> >> >> >> >> >> >> >> 10ms turnaround would be... unacceptable. >> >> >> >> >> >> >> >I'm a bit puzzled here. I usually read '10ms' as 10 milliseconds. >> >> >> >> >> >> As do I. >> >> >> >> >> >> >That seems like a lot of time for most embedded systems RTOS >> >> >> >variants, which have task switch times in the low microseconds >> >> >> >on chips like 160MHz ARM-Cortex STM32s. >> >> >> >> >> >> I was using a 20ns cycle time ADSP-21xx processor (50MHz.) 
>> >> >> It's a DSP with fixed cycle counts (1) for each instruction >> >> >> and a guaranteed interrupt latency that NEVER varies (with >> >> >> certain, inconsequential [to my application] conditions being >> >> >> met.) >> >> >> >> >> >> >10milliseconds would certainly be too long a response time on >> >> >> >many of the instruments I've developed--none of which use >> >> >> >an RTOS. I'm just now starting to play around with >> >> >> >ChiBios and UCoS-II on the STM32 chips. >> >> >> >> >> >> In measurement instruments, which may be used in closed loop >> >> >> control systems, predictability (both in terms of phase delay >> >> >> relative to the sensor observation and also in terms of the >> >> >> variability allowed in that phase delay) is vital. >> >> >> >> >> >> I shoot for (and achieve where it is important) variability >> >> >> that is measured as 0, or if forced in very small integers >0 >> >> >> like 1 or maybe 2, of cycle variation... measurement to >> >> >> measurement... both in sampling the sensor as well as in >> >> >> outputting it via a DAC. (I can't help what happens after.) >> >> >> In the best of all cases, I implement the closed loop control >> >> >> in the instrument, as well, so that there is no variability >> >> >> caused by an external ADC and remaining system. In that case, >> >> >> I drive the 0-100% control with similar attention to >> >> >> precision control of the external device (heater, boule >> >> >> puller, etc.) I also go to the trouble to ensure, where >> >> >> branching code exists, that each branch takes exactly the >> >> >> same number of cycles. >> >> >> >> >> >> I very much dislike, in cases like this, devices with varying >> >> >> interrupt latencies (which is almost guaranteed to happen if >> >> >> the processor has instructions with varying execution time.) >> >> >> I can control my code and the number of cycles each edge of >> >> >> it may take, but the hardware latency is out of my control. 
>> >> >> So I look for processors where it is predictable, if I need >> >> >> that. >> >> >> >> >> >> An STM32 would not qualify in the case I am thinking about. >> >> >> >> >> >IIRC, the Cortex M4 instructions which would cause the greatest >> >> >variation in interrupt latency (load and store multiple and divide) >> >> >are, themselves, interruptible. I would guess that the interrupt >> >> >latency variation would be on the order of 1 to 2 cycle times--- >> >> >or about 12.5 nSec for a 168MHz clock. The overall latency is >> >> >listed as 12 clock cycles or about 60-70nSec. >> >> >> >> I gained some slightly useful benefits by having exactly 0 >> >> cycle variation in the application I'm talking about. One >> >> cycle (20ns in that application) of variation would have made >> >> a difference to me. The fact that I didn't have to add >> >> hardware to gain that tiny advantage ALSO was a useful >> >> benefit. >> >> >> >> In the M4, there is also a pipeline and, if I remember, >> >> "faults" can occur not only in one stage. (I might be wrong >> >> about that.) You have to consider everything -- instruction >> >> faults (memory, etc.) But I admit I'm pretty ignorant of the >> >> M4, too. >> >> >> >> >I can see that multiple-cycle instructions with variable execution time >> >> >inside the interrupt handler could cause phase variations in the output. >> >> >It might requirem more work to eliminate them than would be the case >> >> >with a DSP having only a few rare cases to consider. >> >> > >> >> >If you're using a DAC in the loop and want consistent phase delays, >> >> >does that require a flash DAC? With a successive approximation DAC, the >> >> >delay until you get the desired output would seem to depend on the >> >> >value output unless there is a fast sample-and-hold between the >> >> >DAC and the control system. >> >> >> >> I added the full closed loop control PID into the instrument. >> >> (It didn't have the ability beforehand.) 
In doing so, there >> >> was no DAC involved at that stage, anymore. >> >> >> >> >If you want outputs free of all phase jitter, a sample and hold >> >> >triggered by a hardware clock could solve the problem. The problem >> >> >then becomes what synchronization delays are acceptable. >> >> >> >> Price, size, power, etc., all mattered. Very competitive >> >> marketplace in that case. >> >> >> >> Jon >> > >> >The 50Mhz ADSP-21061KSZ-200-ND is $101.43 qty 1 at Digikey. The >> >168Mhz STM32F407 is about $12. That seems pretty competitive to >> >me ;-) How much work would it take to tune the STM32 and would you >> >sell enough with an $80 lower price to be worth the effort. >> >> The ADSP-21xxx is not even close to the ADSP-21xx and I >> wasn't using the ADSP-21061KSZ. It was an ADSP-2111 and >> ADSP-2105. They were MUCH cheaper at the time (circa early >> 1990's) and the competition elsewhere was effectively zero. >> Since then there are many more options and many more players >> and the ADSP-21xx processors I was using probably aren't even >> available (much, if at all.) If I were doing this today, I'd >> pick something else. >> >> >I suspect the STM32 is lower in power at 168Mhz than the DSP >> >at 50MHz, but I haven't verified that guess. >> >> There was NO floating point on the units I used. A nice >> barrel shifter (combinatorial, one-cycle) though and I used >> it for writing my own floating point. Power consumption was >> quite low --- for the time. >> > >I think all the Cortex M3 and M4s have single cycle barrel shifters and >single-cycle multiply. Integer divides can take a few cycles.
I don't think the M4 has a barrel shifter -- not one that is available to the instruction set. The ADSP-21xx could find the leading bit in a 16 bit word in 1 clock, in a 32-bit word in two clocks (two separate instructions.) But during that time, I could also do two memory moves per cycle, as well. So it normalized and denormalized in 1 to 2 clocks depending on the word size I was using. The number of shifts required (or used) was stored in another register. If you know of the instructions on the M4 that do that, please let me know.
>Such are the advances in electronics that you get all this capability >for less than the cost and power of an 8-bit CPU from 15 years ago.
I love many aspects of today's micros. No question. But some aspects have little market, yet are something I'd use because I have the knowledge to use them. They appear from time to time. It used to be that every programmer was a Ph.D physicist or Ph.D mathematician. The "pyramid" of programmer skills was tiny -- only the apex of today's pyramid existed then because EVERYONE was highly skilled. Now, that pyramid has grown to huge heights with its base including people who have never so much as heard of an ALU. It's a MUCH BIGGER tent, so to speak. But that also means that those making products find they need to cater to the bottom 90% of the pyramid, not the top 10% (which is close enough to zero market to them as to be equivalent.)
>I'm waiting on delivery of one of the Parallela multicore systems >from Adapteva. It has an ARM supervisor running linux and multicore >RISC chips with FPUs. More number crunching power than I should ever >need.
hehe.
>For now, I just appreciate the ability of the CM4 to run fairly >simple IIR and FIR filters using floating point coefficients I >generate with Matlab.
You are on the "time to market" driving side of things where cost per unit is less of an issue. I have similar pressures (speed, number crunching, IIR and FIR filtering, low power, etc.) But I also _may_ have some other pressures that include those and add some more -- such as very low cost, very small size, long term support by vendors, and so on. Jon
On Fri, 4 Jan 2013 04:00:58 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Mark Borgerson <mborgerson@comcast.net> wrote: >> IIRC, the Cortex M4 instructions which would cause the greatest >> variation in interrupt latency (load and store multiple and divide) >> are, themselves, interruptible. I would guess that the interrupt >> latency variation would be on the order of 1 to 2 cycle times--- >> or about 12.5 nSec for a 168MHz clock. The overall latency is >> listed as 12 clock cycles or about 60-70nSec. > >At least on the M3, the interrupt tail-chaining optimization can vary the >latency by up to six cycles if I'm reading the documentation right (if >there is a pending interrupt when the CPU is leaving an ISR, it will skip >unstacking and immediately restacking the CPU registers). I don't know if >there is a way to turn off this feature, and I assume it is also present >on the M4. The newer and faster Cortexes also have various flash >acceleration mechanisms that are growing ever closer to full caches, but >those can at least be turned off (at the expense of increased latency).
Latency is tolerable. Variability not so much. Jon
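For scale, the cycle counts quoted in this exchange can be turned into time with a few lines of C. This is just unit conversion, with a hypothetical helper name; it says nothing about what a given part actually does. It uses integer picoseconds to avoid floating point and rounds down.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a cycle count to picoseconds at a given
 * core clock. Integer math, rounds down. The cycle count is limited so
 * that cycles * 1e12 cannot overflow a 64-bit unsigned value. */
static uint64_t cycles_to_ps(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    assert(cycles <= 18000000u);
    return ((uint64_t)cycles * 1000000000000ull) / clock_hz;
}
```

At 168MHz the quoted 12-cycle latency is about 71.4ns, and a six-cycle tail-chaining difference is about 35.7ns. One cycle on the 50MHz DSP is 20ns, so the six-cycle swing on the faster part is still larger than the single cycle of variation described as significant earlier in the thread.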
In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>, 
jonk@infinitefactors.org says...
> > On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson > <mborgerson@comcast.net> wrote: > > >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>, > >jonk@infinitefactors.org says... > >> > >> On Thu, 3 Jan 2013 10:32:46 -0800, Mark Borgerson > >> <mborgerson@comcast.net> wrote: > >> > >> >In article <u3eae8d0bt9c51qq0tbp30mucskp1o4csd@4ax.com>, > >> >jonk@infinitefactors.org says... > >> >> > >> >> On Wed, 2 Jan 2013 22:43:15 -0800, Mark Borgerson > >> >> <mborgerson@comcast.net> wrote: > >> >> > >> >> >In article <9289e8p4ecr3qalegrs5avpq9nmk1ap8jb@4ax.com>, > >> >> >jonk@infinitefactors.org says... > >> >> >> > >> >> >> On Wed, 2 Jan 2013 12:52:47 -0800, Mark Borgerson > >> >> >> <mborgerson@comcast.net> wrote: > >> >> >> > >> >> >> >In article <l996e8dcrons0s9d6104r0t500fra0c0t2@4ax.com>, > >> >> >> >jonk@infinitefactors.org says... > >> >> >> >> > >> >> >> >> On Tue, 01 Jan 2013 12:09:40 +0200, upsidedown@downunder.com > >> >> >> >> wrote: > >> >> >> >> > >> >> >> >> >On Tue, 1 Jan 2013 09:54:30 +0800, "Bruce Varley" <bv@NoSpam.com> > >> >> >> >> >wrote: > >> >> >> >> > > >> >> >> >> >>I need: > >> >> >> >> >> > >> >> >> >> >>o CPU clock 200MHz or higher. > >> >> >> >> >> > >> >> >> >> >>o 2 serial ports, with access to the logic level lines on at least one (LV > >> >> >> >> >>OK). > >> >> >> >> >> > >> >> >> >> >>o USB support. Socket support also would be nice, not essential. > >> >> >> >> >> > >> >> >> >> >>o Some sort of file system. > >> >> >> >> >> > >> >> >> >> >>o Guaranteed turnround of 10mS, even lower would be nice. My ARM Linux > >> >> >> >> >>won'd do better than 20. > >> >> ><<SNIP>> > >> >> >> >> > >> >> >> >> 10ms turnaround would be... unacceptable. > >> >> >> >> > >> >> >> >I'm a bit puzzled here. I usually read '10ms' as 10 milliseconds. > >> >> >> > >> >> >> As do I. 
> >> >> >> > >> >> >> >That seems like a lot of time for most embedded systems RTOS > >> >> >> >variants, which have task switch times in the low microseconds > >> >> >> >on chips like 160MHz ARM-Cortex STM32s. > >> >> >> > >> >> >> I was using a 20ns cycle time ADSP-21xx processor (50MHz.) > >> >> >> It's a DSP with fixed cycle counts (1) for each instruction > >> >> >> and a guaranteed interrupt latency that NEVER varies (with > >> >> >> certain, inconsequential [to my application] conditions being > >> >> >> met.) > >> >> >> > >> >> >> >10milliseconds would certainly be too long a response time on > >> >> >> >many of the instruments I've developed--none of which use > >> >> >> >an RTOS. I'm just now starting to play around with > >> >> >> >ChiBios and UCoS-II on the STM32 chips. > >> >> >> > >> >> >> In measurement instruments, which may be used in closed loop > >> >> >> control systems, predictability (both in terms of phase delay > >> >> >> relative to the sensor observation and also in terms of the > >> >> >> variability allowed in that phase delay) is vital. > >> >> >> > >> >> >> I shoot for (and achieve where it is important) variability > >> >> >> that is measured as 0, or if forced in very small integers >0 > >> >> >> like 1 or maybe 2, of cycle variation... measurement to > >> >> >> measurement... both in sampling the sensor as well as in > >> >> >> outputting it via a DAC. (I can't help what happens after.) > >> >> >> In the best of all cases, I implement the closed loop control > >> >> >> in the instrument, as well, so that there is no variability > >> >> >> caused by an external ADC and remaining system. In that case, > >> >> >> I drive the 0-100% control with similar attention to > >> >> >> precision control of the external device (heater, boule > >> >> >> puller, etc.) I also go to the trouble to ensure, where > >> >> >> branching code exists, that each branch takes exactly the > >> >> >> same number of cycles. 
> >> >> >> > >> >> >> I very much dislike, in cases like this, devices with varying > >> >> >> interrupt latencies (which is almost guaranteed to happen if > >> >> >> the processor has instructions with varying execution time.) > >> >> >> I can control my code and the number of cycles each edge of > >> >> >> it may take, but the hardware latency is out of my control. > >> >> >> So I look for processors where it is predictable, if I need > >> >> >> that. > >> >> >> > >> >> >> An STM32 would not qualify in the case I am thinking about. > >> >> >> > >> >> >IIRC, the Cortex M4 instructions which would cause the greatest > >> >> >variation in interrupt latency (load and store multiple and divide) > >> >> >are, themselves, interruptible. I would guess that the interrupt > >> >> >latency variation would be on the order of 1 to 2 cycle times--- > >> >> >or about 12.5 nSec for a 168MHz clock. The overall latency is > >> >> >listed as 12 clock cycles or about 60-70nSec. > >> >> > >> >> I gained some slightly useful benefits by having exactly 0 > >> >> cycle variation in the application I'm talking about. One > >> >> cycle (20ns in that application) of variation would have made > >> >> a difference to me. The fact that I didn't have to add > >> >> hardware to gain that tiny advantage ALSO was a useful > >> >> benefit. > >> >> > >> >> In the M4, there is also a pipeline and, if I remember, > >> >> "faults" can occur not only in one stage. (I might be wrong > >> >> about that.) You have to consider everything -- instruction > >> >> faults (memory, etc.) But I admit I'm pretty ignorant of the > >> >> M4, too. > >> >> > >> >> >I can see that multiple-cycle instructions with variable execution time > >> >> >inside the interrupt handler could cause phase variations in the output. > >> >> >It might requirem more work to eliminate them than would be the case > >> >> >with a DSP having only a few rare cases to consider. 
> >> >> > > >> >> >If you're using a DAC in the loop and want consistent phase delays, > >> >> >does that require a flash DAC? With a successive approximation DAC, the > >> >> >delay until you get the desired output would seem to depend on the > >> >> >value output unless there is a fast sample-and-hold between the > >> >> >DAC and the control system. > >> >> > >> >> I added the full closed loop control PID into the instrument. > >> >> (It didn't have the ability beforehand.) In doing so, there > >> >> was no DAC involved at that stage, anymore. > >> >> > >> >> >If you want outputs free of all phase jitter, a sample and hold > >> >> >triggered by a hardware clock could solve the problem. The problem > >> >> >then becomes what synchronization delays are acceptable. > >> >> > >> >> Price, size, power, etc., all mattered. Very competitive > >> >> marketplace in that case. > >> >> > >> >> Jon > >> > > >> >The 50Mhz ADSP-21061KSZ-200-ND is $101.43 qty 1 at Digikey. The > >> >168Mhz STM32F407 is about $12. That seems pretty competitive to > >> >me ;-) How much work would it take to tune the STM32 and would you > >> >sell enough with an $80 lower price to be worth the effort. > >> > >> The ADSP-21xxx is not even close to the ADSP-21xx and I > >> wasn't using the ADSP-21061KSZ. It was an ADSP-2111 and > >> ADSP-2105. They were MUCH cheaper at the time (circa early > >> 1990's) and the competition elsewhere was effectively zero. > >> Since then there are many more options and many more players > >> and the ADSP-21xx processors I was using probably aren't even > >> available (much, if at all.) If I were doing this today, I'd > >> pick something else. > >> > >> >I suspect the STM32 is lower in power at 168Mhz than the DSP > >> >at 50MHz, but I haven't verified that guess. > >> > >> There was NO floating point on the units I used. A nice > >> barrel shifter (combinatorial, one-cycle) though and I used > >> it for writing my own floating point. 
Power consumption was > >> quite low --- for the time. > >> > > > >I think all the Cortex M3 and M4s have single cycle barrel shifters and > >single-cycle multiply. Integer divides can take a few cycles. > > I don't think the M4 has a barrel shifter -- not one that is > available to the instruction set. The ADSP-21xx could find > the leading bit in a 16 bit word in 1 clock, in a 32-bit word > in two clocks (two seperate instructions.) But during that > time, I could also do two memory moves per cycle, as well.
Doesn't the ability to rotate right by 1 to 32 bits in a single cycle imply a barrel shifter? I think the Cortex M4 can find the leading bit in a 32-bit register with the CLZ (Count Leading Zeroes) instruction in a single cycle.
> > So it normalized and denormalized in 1 to 2 clocks depending > on the word size I was using. The number of shifts required > (or used) was stored in another register.
> > If you know of the instructions on the M4 that do that, > please let me know.
The ARM reference suggests a way to normalize a 32-bit word in 2 clocks using the CLZ and shift instructions:

"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the resulting Rd value to normalize the value of register Rm. Use MOVS, rather than MOV, to flag the case where Rm is zero:

    CLZ r5, r9
    MOVS r9, r9, LSL r5"

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html

Of course, if you have an FPU, you would generally let it handle normalization and denormalization. IIRC, the CM4 can convert a 32-bit integer to IEEE-754 floating point with a single instruction. You may have to set some global rounding and saturation flags before that.
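For anyone who wants to see the CLZ-plus-shift normalization in C rather than assembly, here is a portable sketch. The names `clz32` and `normalize32` are made up for this example. The loop-based count is only there so it runs anywhere; on a Cortex-M build one would expect a compiler builtin or the CMSIS `__CLZ` intrinsic to map to the single CLZ instruction instead.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-leading-zeros. Returns 32 for x == 0, which matches
 * the result the ARM CLZ instruction gives for a zero operand. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0u) {
        return 32u;
    }
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical helper: shift x left until bit 31 is set and report the
 * shift count through *shift (the "number of shifts" kept in another
 * register in the DSP description). A zero input stays zero. */
static uint32_t normalize32(uint32_t x, unsigned *shift)
{
    assert(shift != 0);
    unsigned n = clz32(x);
    *shift = n;
    /* Shifting a 32-bit value by 32 is undefined in C, so handle it. */
    return (n < 32u) ? (x << n) : 0u;
}
```

For example, normalizing 0x00F00000 yields 0xF0000000 with a shift count of 8. The explicit zero check plays the role that MOVS plays in the two-instruction sequence, where the flags signal that Rm was zero.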
> > >Such are the advances in electronics that you get all this capability > >for less than the cost and power of an 8-bit CPU from 15 years ago. > > I love many aspects of today's micros. No question. But some > aspects have little market, yet are something I'd use because > I have the knowledge to use them. They appear from time to > time. It used to be that every programmer was a Ph.D > physicist or Ph.D mathematician.
I don't think that's been true since the very early 1960s. In 1968, I took an undergrad university course that used an early BASIC-like language on a time-shared CDC machine. By 1974, PDP-8s were widely used at sea and ashore by oceanographers. There was even a PDP-10 available on the top floor of the oceanography building for free use by grad students. I had a friend back in the early 80's with most of an associate degree who did a lot of Apple II and Macintosh programming. He also went on to work on the math functions in Excel at Microsoft. He was very smart--- but too busy programming to finish a college degree. The PhDs in math and physics that I've worked with all seemed to want to use Linux for everything! ;-)
> The "pyramid" of programmer > skills was tiny -- only the apex of today's pyramid existed > then because EVERYONE was highly skilled. Now, that pyramid > has grown to huge heights with its base including people who > have never so much as heard of an ALU. It's a MUCH BIGGER > tent, so to speak. But that also means that those making > products find they need to cater to the bottom 90% of the > pyramid, not the top 10% (which is close enough to zero > market to them as to be equivalent.) > > >I'm waiting on delivery of one of the Parallela multicore systems > >from Adapteva. It has an ARM supervisor running linux and multicore > >RISC chips with FPUs. More number crunching power than I should ever > >need. > > hehe. > > >For now, I just appreciate the ability of the CM4 to run fairly > >simple IIR and FIR filters using floating point coefficients I > >generate with Matlab. > > You are on the "time to market" driving side of things where > cost per unit is less of an issue. I have similar pressures > (speed, number crunching, IIR and FIR filtering, low power, > etc.) But I also _may_ have some other pressures that include > those and add some more -- such as very low cost, very small > size, long term support by vendors, and so on. >
I do appreciate being on the low-volume end of things. Spending a few extra bucks for CPUs and batteries is not much of an issue when it costs $200K plus $20K per day for 30 days to deploy a dozen instrument moorings on the equator. I'm not even on the top end of that cost curve---the space guys have the oceanographers beat by orders of magnitude. OTOH, they don't have to worry about their equipment being damaged or destroyed by fishermen who find their equipment a handy place to tie up for the night. Mark Borgerson
In article <ip5de8dmnidd23jt44f8il9pl5m3qddk9d@4ax.com>, 
jonk@infinitefactors.org says...
> > On Fri, 4 Jan 2013 04:00:58 +0000 (UTC), > Anders.Montonen@kapsi.spam.stop.fi.invalid wrote: > > >Mark Borgerson <mborgerson@comcast.net> wrote: > >> IIRC, the Cortex M4 instructions which would cause the greatest > >> variation in interrupt latency (load and store multiple and divide) > >> are, themselves, interruptible. I would guess that the interrupt > >> latency variation would be on the order of 1 to 2 cycle times--- > >> or about 12.5 nSec for a 168MHz clock. The overall latency is > >> listed as 12 clock cycles or about 60-70nSec. > > > >At least on the M3, the interrupt tail-chaining optimization can vary the > >latency by up to six cycles if I'm reading the documentation right (if > >there is a pending interrupt when the CPU is leaving an ISR, it will skip > >unstacking and immediately restacking the CPU registers). I don't know if > >there is a way to turn off this feature, and I assume it is also present > >on the M4. The newer and faster Cortexes also have various flash > >acceleration mechanisms that are growing ever closer to full caches, but > >those can at least be turned off (at the expense of increased latency). > > Latency is tolerable. Variability not so much.
Hmmm. If the process requiring minimal variation was the highest priority, it shouldn't have to worry about variations from tail-chaining. Doesn't that only happen when an interrupt of lower or equal priority is triggered and its handler gets executed after the handler of the higher priority interrupt is finished? I'll have to re-read that section of the STM32 user guide. Mark Borgerson
Mark Borgerson <mborgerson@comcast.net> wrote:
> Hmmm. If the process requiring minimal variation was the highest > priority, it shouldn't have to worry about variations from tail- > chaining. Doesn't that only happen with an interrupt of lower > ore equal priority is triggered and whose handler gets executed after > the handler of the higher priority interrupt is finished?
A higher-priority interrupt can arrive during exception return. Quoting section B1.5.12 of the ARMv7-M ARM: "The ARMv7-M architecture does not specify the point at which the processor recognizes any asynchronous exception that arrives during an exception. If the processor recognizes a new exception while it is tail-chaining another exception, and the new exception has higher priority than the exception being tail-chained, then the processor can, instead, take the new exception, using late-arrival preemption. It is IMPLEMENTATION DEFINED what conditions, if any, lead to late arrival preemption." -a
On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>, >jonk@infinitefactors.org says... >> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson >> <mborgerson@comcast.net> wrote: >> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>, >> >jonk@infinitefactors.org says... >> >> >> >> On Thu, 3 Jan 2013 10:32:46 -0800, Mark Borgerson >> >> <mborgerson@comcast.net> wrote: >> >> >> >> >In article <u3eae8d0bt9c51qq0tbp30mucskp1o4csd@4ax.com>, >> >> >jonk@infinitefactors.org says... >> >> >> >> >> >> On Wed, 2 Jan 2013 22:43:15 -0800, Mark Borgerson >> >> >> <mborgerson@comcast.net> wrote: >> >> >> >> >> >> >In article <9289e8p4ecr3qalegrs5avpq9nmk1ap8jb@4ax.com>, >> >> >> >jonk@infinitefactors.org says... >> >> >> >> >> >> >> >> On Wed, 2 Jan 2013 12:52:47 -0800, Mark Borgerson >> >> >> >> <mborgerson@comcast.net> wrote: >> >> >> >> >> >> >> >> >In article <l996e8dcrons0s9d6104r0t500fra0c0t2@4ax.com>, >> >> >> >> >jonk@infinitefactors.org says... >> >> >> >> >> >> >> >> >> >> On Tue, 01 Jan 2013 12:09:40 +0200, upsidedown@downunder.com >> >> >> >> >> wrote: >> >> >> >> >> >> >> >> >> >> >On Tue, 1 Jan 2013 09:54:30 +0800, "Bruce Varley" <bv@NoSpam.com> >> >> >> >> >> >wrote: >> >> >> >> >> > >> >> >> >> >> >>I need: >> >> >> >> >> >> >> >> >> >> >> >>o CPU clock 200MHz or higher. >> >> >> >> >> >> >> >> >> >> >> >>o 2 serial ports, with access to the logic level lines on at least one (LV >> >> >> >> >> >>OK). >> >> >> >> >> >> >> >> >> >> >> >>o USB support. Socket support also would be nice, not essential. >> >> >> >> >> >> >> >> >> >> >> >>o Some sort of file system. >> >> >> >> >> >> >> >> >> >> >> >>o Guaranteed turnround of 10mS, even lower would be nice. My ARM Linux >> >> >> >> >> >>won'd do better than 20. >> >> >> ><<SNIP>> >> >> >> >> >> >> >> >> >> >> 10ms turnaround would be... unacceptable. >> >> >> >> >> >> >> >> >> >I'm a bit puzzled here. I usually read '10ms' as 10 milliseconds. >> >> >> >> >> >> >> >> As do I. 
>> >> >> >> >> >> >> >> >That seems like a lot of time for most embedded systems RTOS >> >> >> >> >variants, which have task switch times in the low microseconds >> >> >> >> >on chips like 160MHz ARM-Cortex STM32s. >> >> >> >> >> >> >> >> I was using a 20ns cycle time ADSP-21xx processor (50MHz.) >> >> >> >> It's a DSP with fixed cycle counts (1) for each instruction >> >> >> >> and a guaranteed interrupt latency that NEVER varies (with >> >> >> >> certain, inconsequential [to my application] conditions being >> >> >> >> met.) >> >> >> >> >> >> >> >> >10milliseconds would certainly be too long a response time on >> >> >> >> >many of the instruments I've developed--none of which use >> >> >> >> >an RTOS. I'm just now starting to play around with >> >> >> >> >ChiBios and UCoS-II on the STM32 chips. >> >> >> >> >> >> >> >> In measurement instruments, which may be used in closed loop >> >> >> >> control systems, predictability (both in terms of phase delay >> >> >> >> relative to the sensor observation and also in terms of the >> >> >> >> variability allowed in that phase delay) is vital. >> >> >> >> >> >> >> >> I shoot for (and achieve where it is important) variability >> >> >> >> that is measured as 0, or if forced in very small integers >0 >> >> >> >> like 1 or maybe 2, of cycle variation... measurement to >> >> >> >> measurement... both in sampling the sensor as well as in >> >> >> >> outputting it via a DAC. (I can't help what happens after.) >> >> >> >> In the best of all cases, I implement the closed loop control >> >> >> >> in the instrument, as well, so that there is no variability >> >> >> >> caused by an external ADC and remaining system. In that case, >> >> >> >> I drive the 0-100% control with similar attention to >> >> >> >> precision control of the external device (heater, boule >> >> >> >> puller, etc.) I also go to the trouble to ensure, where >> >> >> >> branching code exists, that each branch takes exactly the >> >> >> >> same number of cycles. 
>> >> >> >> >> >> >> >> I very much dislike, in cases like this, devices with varying >> >> >> >> interrupt latencies (which is almost guaranteed to happen if >> >> >> >> the processor has instructions with varying execution time.) >> >> >> >> I can control my code and the number of cycles each edge of >> >> >> >> it may take, but the hardware latency is out of my control. >> >> >> >> So I look for processors where it is predictable, if I need >> >> >> >> that. >> >> >> >> >> >> >> >> An STM32 would not qualify in the case I am thinking about. >> >> >> >> >> >> >> >IIRC, the Cortex M4 instructions which would cause the greatest >> >> >> >variation in interrupt latency (load and store multiple and divide) >> >> >> >are, themselves, interruptible. I would guess that the interrupt >> >> >> >latency variation would be on the order of 1 to 2 cycle times--- >> >> >> >or about 12.5 nSec for a 168MHz clock. The overall latency is >> >> >> >listed as 12 clock cycles or about 60-70nSec. >> >> >> >> >> >> I gained some slightly useful benefits by having exactly 0 >> >> >> cycle variation in the application I'm talking about. One >> >> >> cycle (20ns in that application) of variation would have made >> >> >> a difference to me. The fact that I didn't have to add >> >> >> hardware to gain that tiny advantage ALSO was a useful >> >> >> benefit. >> >> >> >> >> >> In the M4, there is also a pipeline and, if I remember, >> >> >> "faults" can occur not only in one stage. (I might be wrong >> >> >> about that.) You have to consider everything -- instruction >> >> >> faults (memory, etc.) But I admit I'm pretty ignorant of the >> >> >> M4, too. >> >> >> >> >> >> >I can see that multiple-cycle instructions with variable execution time >> >> >> >inside the interrupt handler could cause phase variations in the output. >> >> >> >It might requirem more work to eliminate them than would be the case >> >> >> >with a DSP having only a few rare cases to consider. 
>> >> >> > >> >> >> >If you're using a DAC in the loop and want consistent phase delays, >> >> >> >does that require a flash DAC? With a successive approximation DAC, the >> >> >> >delay until you get the desired output would seem to depend on the >> >> >> >value output unless there is a fast sample-and-hold between the >> >> >> >DAC and the control system. >> >> >> >> >> >> I added the full closed loop control PID into the instrument. >> >> >> (It didn't have the ability beforehand.) In doing so, there >> >> >> was no DAC involved at that stage, anymore. >> >> >> >> >> >> >If you want outputs free of all phase jitter, a sample and hold >> >> >> >triggered by a hardware clock could solve the problem. The problem >> >> >> >then becomes what synchronization delays are acceptable. >> >> >> >> >> >> Price, size, power, etc., all mattered. Very competitive >> >> >> marketplace in that case. >> >> >> >> >> >> Jon >> >> > >> >> >The 50Mhz ADSP-21061KSZ-200-ND is $101.43 qty 1 at Digikey. The >> >> >168Mhz STM32F407 is about $12. That seems pretty competitive to >> >> >me ;-) How much work would it take to tune the STM32 and would you >> >> >sell enough with an $80 lower price to be worth the effort. >> >> >> >> The ADSP-21xxx is not even close to the ADSP-21xx and I >> >> wasn't using the ADSP-21061KSZ. It was an ADSP-2111 and >> >> ADSP-2105. They were MUCH cheaper at the time (circa early >> >> 1990's) and the competition elsewhere was effectively zero. >> >> Since then there are many more options and many more players >> >> and the ADSP-21xx processors I was using probably aren't even >> >> available (much, if at all.) If I were doing this today, I'd >> >> pick something else. >> >> >> >> >I suspect the STM32 is lower in power at 168Mhz than the DSP >> >> >at 50MHz, but I haven't verified that guess. >> >> >> >> There was NO floating point on the units I used. 
A nice >> >> barrel shifter (combinatorial, one-cycle) though and I used >> >> it for writing my own floating point. Power consumption was >> >> quite low --- for the time. >> >> >> > >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and >> >single-cycle multiply. Integer divides can take a few cycles. >> >> I don't think the M4 has a barrel shifter -- not one that is >> available to the instruction set. The ADSP-21xx could find >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word >> in two clocks (two separate instructions.) But during that >> time, I could also do two memory moves per cycle, as well. > >Doesn't the ability to rotate right by 1 to 32 bits in a single >cycle imply a barrel shifter?
I suppose. The one in the ADSP-21xx requires much more logic. The ADSP-21xx barrel shifter can do both normalization and denormalization in a single cycle. Lane changes alone are, in my mind, only part of the job. Once you have the ability to do a 0-31 lane change, it's a shame not to add the gates for normalization.
>I think the Cortex M4 can find the leading bit in a 32-bit register >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
If this is a processor with a floating point unit, it's not something I care about. I'd be looking for integer units (as I wouldn't want to waste power on clocking substantial die space when not in use.) A quick google tells me there is an M4 and an M4F, but then looking at the web page you point to below, I see that there is a chapter (3.11) called "Floating-point instructions" underneath the heading of "Cortex-M4 Devices Generic User Guide"... so I don't know if all of them include FP, or if some do and some don't, and which you may be discussing here.
>> So it normalized and denormalized in 1 to 2 clocks depending
>> on the word size I was using. The number of shifts required
>> (or used) was stored in another register.
>
>> If you know of the instructions on the M4 that do that,
>> please let me know.
>
>The ARM reference suggests a way to normalize a 32-bit word
>in 2 clocks using the CLZ and shift instructions:
>
>"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the
>resulting Rd value to normalize the value of register Rm. Use MOVS,
>rather than MOV, to flag the case where Rm is zero:
>    CLZ r5, r9
>    MOVS r9, r9, LSL r5"
>
>http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
Thanks!
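For readers who want to check the arithmetic of that two-instruction sequence off-target, here is a minimal sketch in Python. It only models the result of CLZ followed by a register-controlled left shift on a 32-bit word. The function names `clz32` and `normalize32` are made up for illustration and are not part of any ARM or ADI tool, and nothing here says anything about cycle counts.

```python
def clz32(x):
    """Count leading zeros of a 32-bit unsigned value, as CLZ does.

    Returns 32 when x is zero.
    """
    x &= 0xFFFFFFFF                  # keep only the low 32 bits
    return 32 - x.bit_length()


def normalize32(x):
    """Shift x left until bit 31 is set; return (mantissa, shift_count).

    Models CLZ followed by LSL by the CLZ result. A zero input stays zero
    with a shift count of 32, which is the case the quoted note suggests
    flagging with MOVS.
    """
    shift = clz32(x)
    mantissa = (x << shift) & 0xFFFFFFFF   # discard bits shifted past bit 31
    return mantissa, shift


if __name__ == "__main__":
    m, s = normalize32(0x00012345)
    print(f"mantissa=0x{m:08X} shift={s}")
```

The shift count is what a hand-rolled floating point format would fold into its exponent, which is the role of the "another register" mentioned in the quoted text.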
>Of course, if you have an FPU, you would generally let it handle >normalization and denormalization. IIRC, the CM4 can >convert a 32-bit integer to IEEE-754 floating point with a >single instruction. You may have to set some global rounding >and saturation flags before that.
I do specialized floating point which permits me to optimize for the application. Generic FP is great for generic work. Not great for some things where, for example, dynamic range can be traded for precision or vice versa, or where I know, a priori, that an entire vector will all share the same exponent. Those are just a few real-world examples from applications already fielded. I want the core tools, but I want to write my own microcode (in effect.) And I want small die space (better yield, lower cost, lower power consumption.) Just give me the basic lower level components of FP.
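The "entire vector shares one exponent" case is usually called block floating point. A minimal sketch of the idea in Python follows, assuming signed 16-bit mantissas and a non-empty block. The names `block_exponent` and `to_block_float` are hypothetical, and this is not a reconstruction of the fielded code described above or of any ADSP-21xx instruction.

```python
def block_exponent(samples, width=16):
    """Smallest count of redundant sign bits across a block of signed ints.

    This is how far every sample can be shifted left without overflowing
    a signed `width`-bit word. Assumes `samples` is non-empty and that
    each value already fits in `width` bits.
    """
    def headroom(v):
        mag = v if v >= 0 else ~v          # ~v == -v - 1 for negatives
        return (width - 1) - mag.bit_length()
    return min(headroom(v) for v in samples)


def to_block_float(samples, width=16):
    """Return (mantissas, exponent) with one exponent for the whole block.

    Each original value equals mantissa * 2**exponent.
    """
    shift = block_exponent(samples, width)
    return [v << shift for v in samples], -shift


if __name__ == "__main__":
    mantissas, exponent = to_block_float([100, -50, 7])
    print(mantissas, exponent)
```

The trade is visible in the output: the largest sample uses the full word, while smaller samples in the same block lose precision relative to a per-value exponent. That is acceptable only when the block's values are known to be of similar magnitude.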
>> >Such are the advances in electronics that you get all this capability >> >for less than the cost and power of an 8-bit CPU from 15 years ago. >> >> I love many aspects of today's micros. No question. But some >> aspects have little market, yet are something I'd use because >> I have the knowledge to use them. They appear from time to >> time. It used to be that every programmer was a Ph.D >> physicist or Ph.D mathematician.
>I don't think that's been true since the very early >1960s. In 1968, I took an undergrad university course that >used an early BASIC-like language on a time-shared >CDC machine.
You are right about the time period. My recollection is similar. Doesn't change the point about the pyramid or the programmer marketplace that today is being addressed by chip vendors and software development tool vendors.
>By 1974, PDP-8s were widely used at sea and ashore >by oceanographers. There was even a PDP-10 available >on the top floor of the oceanography building for >free use by grad students.
Makes sense.
>I had a friend back in the early 80's with >most of an associate degree that did a lot of Apple II and >Macintosh programming. He also went on to work on the >math functions in Excel at Microsoft. He was very smart--- >but too busy programming to finish a college degree.
That also makes sense to me!
>The Phds in math and physics that I've worked with all seemed >to want to use Linux for everything! ;-)
Hehe.
>> The "pyramid" of programmer >> skills was tiny -- only the apex of today's pyramid existed >> then because EVERYONE was highly skilled. Now, that pyramid >> has grown to huge heights with its base including people who >> have never so much as heard of an ALU. It's a MUCH BIGGER >> tent, so to speak. But that also means that those making >> products find they need to cater to the bottom 90% of the >> pyramid, not the top 10% (which is close enough to zero >> market to them as to be equivalent.) >> >> >I'm waiting on delivery of one of the Parallela multicore systems >> >from Adapteva. It has an ARM supervisor running linux and multicore >> >RISC chips with FPUs. More number crunching power than I should ever >> >need. >> >> hehe. >> >> >For now, I just appreciate the ability of the CM4 to run fairly >> >simple IIR and FIR filters using floating point coefficients I >> >generate with Matlab. >> >> You are on the "time to market" driving side of things where >> cost per unit is less of an issue. I have similar pressures >> (speed, number crunching, IIR and FIR filtering, low power, >> etc.) But I also _may_ have some other pressures that include >> those and add some more -- such as very low cost, very small >> size, long term support by vendors, and so on. >> >I do appreciate being on the low-volume end of things. Spending a few >extra bucks for CPUs and batteries is not much of an issue >when it costs $200K plus $20K per day for 30 days to deploy >a dozen instrument moorings on the equator. I'm not even on the top >end of that cost curve---the space guys have the oceanographers beat by >orders of magnitude. OTOH, they don't have to worry about >their equipment being damaged or destroyed by fishermen who >find their equipment a handy place to tie up for the night.
Hehe!! Some day I'd really love to hear the stories!! And share some of my own. My only exposure to ocean work was with sound propagation through and between thermal layers (reflections, etc.) Jon
On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
<jonk@infinitefactors.org> wrote:

><snip>
>>Mark B. wrote: >>Doesn't the ability to rotate right by 1 to 32 bits in a single >>cycle imply a barrel shifter? > >I suppose. The one in the ADSP-21xx requires much more logic. >The ADSP-21xx barrel shifter can do both normalization and >denormalization in a single cycle. Lane changes alone is, in >my mind, only part of the job. Once you have the ability to >do a 0-31 lane change, it's a shame to not add the gates for >normalization.
See this:

http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf

It covers the 2100 Family barrel shifter unit, starting on page 2-22 (section 2.4). The overview says,

"The shifter provides a complete set of shifting functions for 16-bit inputs, yielding a 32-bit output. These include arithmetic shift, logical shift and normalization. The shifter also performs derivation of exponent and derivation of common exponent for an entire block of numbers. These basic functions can be combined to efficiently implement any degree of numerical format control, including full floating-point representation."

My kind of barrel shifter module. Wouldn't mind a 32x64. But this is quite tolerable.

For interrupt latency (and I was using the timer here and had complete control over the memory system), see:

http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf

In this case, section 3.4.3.1, page 3-19ff.

"For the timer interrupt on these processors, the latency from when the interrupt occurs to when the first instruction of the service routine is executed is only one cycle. This is shown in Figure 3.3. The single cycle of latency is needed to fetch the instruction stored at the interrupt vector location."

My kind of interrupt latency variability.

Jon
On Fri, 4 Jan 2013 19:19:44 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Mark Borgerson <mborgerson@comcast.net> wrote: >> Hmmm. If the process requiring minimal variation was the highest >> priority, it shouldn't have to worry about variations from tail- >> chaining. Doesn't that only happen when an interrupt of lower >> or equal priority is triggered and whose handler gets executed after >> the handler of the higher priority interrupt is finished? > >A higher-priority interrupt can arrive during exception return. Quoting >section B1.5.12 of the ARMv7-M ARM: >"The ARMv7-M architecture does not specify the point at which the >processor recognizes any asynchronous exception that arrives during an >exception. If the processor recognizes a new exception while it is >tail-chaining another exception, and the new exception has higher priority >than the exception being tail-chained, then the processor can, instead, >take the new exception, using late-arrival preemption. It is >IMPLEMENTATION DEFINED what conditions, if any, lead to late arrival >preemption."
Thanks, Anders. I can see that there is interesting reading ahead should I decide to use this architecture for certain applications.

I don't mind nuance, so long as it is predictable. From your points and the above, if a particular implementation of the core is chosen (a specific part from a specific manufacturer) then would it be possible to establish timer interrupts together with crafted software in order to drive I/O pins with guaranteed known latencies? ("Implementation defined" connotes to me that it may actually be defined for some specific implementation.)

To put the question in concrete terms, assume there is a background task running but that I want to use a timer to trigger an ADC sample and hold circuit, followed by another triggering the ADC conversion start, where an exact number of CPU cycles from one to the other is vital... and do this WITHOUT the use of a timer counter output module designed in hardware? (That isn't a real example. I would normally use the output module's features. But removing that possibility gets at the question I'm asking better without having to describe the real application in detail. So assume no hardware support except for the timer interrupt event.)

Thanks by the way for what you've already added!

Jon