2010-09-24 09:28, David Brown skrev:
> On 24/09/2010 01:48, Ulf Samuelsson wrote:
>> 2010-09-23 23:17, rickman skrev:
>>> On Sep 23, 12:52 pm, Ulf Samuelsson<nospam....@atmel.com> wrote:
>>>> 2010-09-23 09:52, David Brown skrev:
>>>>
>>>>
>>>>
>>>>> On 23/09/2010 08:30, Ulf Samuelsson wrote:
>>>>>> 2010-09-21 13:16, rickman skrev:
>>>>>>> On Sep 20, 4:30 am, Ulf Samuelsson<u...@a-t-m-e-l.com> wrote:
>>>>>>>> David Brown skrev:
>>>>>>>>> A better solution for micros like that is a wider flash design
>>>>>>>>> with an
>>>>>>>>> sram buffer in the flash module - that is certainly how some
>>>>>>>>> manufacturers handle the problem. It is a simpler solution than
>>>>>>>>> a full
>>>>>>>>> instruction cache because you have only a single "tag" (or perhaps
>>>>>>>>> two,
>>>>>>>>> if you have two such buffers), and there are no issues with
>>>>>>>>> coherence or
>>>>>>>>> anything else. The buffer of perhaps 256 bytes gets filled
>>>>>>>>> whenever
>>>>>>>>> you
>>>>>>>>> access a new "page" in the flash, so that the processor then reads
>>>>>>>>> from
>>>>>>>>> the buffer rather than directly from the flash. And if
>>>>>>>>> space/economics
>>>>>>>>> allow, you can have a wider flash-to-buffer bus to keep up a high
>>>>>>>>> bandwidth even with slow flash and a fast processor.
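The buffered-flash idea described above can be sketched in C as a one-line cache with a single tag. This is a minimal illustration, not any vendor's actual design: the 256-byte page size comes from the text, but the structure names and the hit/fill counters are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* One SRAM buffer holding one flash "page", plus a single tag.
 * A read either hits the buffer (fast) or triggers a page fill
 * (one slow, wide flash-array access). */
#define PAGE_SIZE 256u

typedef struct {
    uint32_t tag;        /* page number currently held in the buffer */
    bool     valid;
    unsigned fills;      /* slow flash-array page fills              */
    unsigned hits;       /* fast reads served from the SRAM buffer   */
} flash_buf;

static void fb_read(flash_buf *fb, uint32_t addr)
{
    uint32_t page = addr / PAGE_SIZE;
    if (fb->valid && fb->tag == page) {
        fb->hits++;              /* served from the buffer, no wait state */
    } else {
        fb->tag = page;          /* refill the buffer from the flash array */
        fb->valid = true;
        fb->fills++;
    }
}
```

For a purely sequential instruction stream of word-sized fetches, only one fill per 256-byte page is needed; branches across page boundaries are what cost extra fills.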
>>>>
>>>>>>>> The disadvantage of having a 256 byte wide memory is power
>>>>>>>> consumption.
>>>>>>>> You will have 2048 active sense amplifiers.
>>>>>>>> I don't see that coming soon.
>>>>
>>>>>>> I hope you aren't involved in architecting new MCU designs. I don't
>>>>>>> think anyone said they wanted 2048 sense amplifiers. I would either
>>>>>>> interpret the above to be "256 bits" or I would consider an
>>>>>>> implementation that used a 256 byte cache of some sort. What
>>>>>>> would be
>>>>>>> the utility of a 256 byte wide interface to the Flash? Even the
>>>>>>> fastest CM3 CPUs can't run at nearly that speed.
>>>>
>>>>>>> Rick
>>>>
>>>>>> I am certainly involved in the definition of new MCU designs,
>>>>>> although mostly by providing ideas.
>>>>
>>>>>> He said that he wanted a 256 byte buffer, and I really doubt
>>>>>> that this should be interpreted as bits.
>>>>
>>>>>> He only said that the buffer will be filled when you
>>>>>> accessed a new page, and did not state how many cycles it would take.
>>>>>> From performance point of view, it makes more sense to load it in one
>>>>>> cycle. If you start loading using sequential accesses to the flash,
>>>>>> you will probably waste both cycles and power.
>>>>
>>>>> From the performance viewpoint, loading in a single cycle would be
>>>>> ideal - but from the space and power viewpoint that would be a bad
>>>>> idea.
>>>>> So loading sequentially with a medium-width bus (I suggested 64
>>>>> bit) is
>>>>> likely to be the best compromise.
>>>>
>>>>>> The proposal is already implemented in page mode DRAMs,
>>>>>> so it may make sense at first, unless you know more about flash
>>>>>> internals.
>>>>
>>>>> I know enough about flash internals to know it is a useful idea, and
>>>>> could be a cheap, simple and low-power method to improve flash access
>>>>> speeds. I know enough about chip design and logic design to know that
>>>>> de-coupling the flash access and control logic from the processor's
>>>>> memory bus will simplify some of the logic, and reduce the levels of
>>>>> combination logic that must be completed within a clock cycle. It also
>>>>> allows the processor and the flash module to run at independent
>>>>> speeds.
>>>>
>>>>> I also know that it would complicate other parts of the design, and
>>>>> the
>>>>> extra unnecessary flash reads may outweigh the flash reads spared.
>>>>
>>>>> In effect, my suggestion is a cache front-end to the flash with just
>>>>> one
>>>>> line, but a large line width and perhaps two-way associativity. The
>>>>> ideal balance may be different - half the line width and four-way
>>>>> associativity might be better. It's all a balancing act.
>>>>
>>>>> I also know that I don't know nearly enough detail to judge whether
>>>>> the
>>>>> sums will add up to making this a good idea in practice. It depends on
>>>>> so many factors such as flash design (some incur extra delays when
>>>>> switching pages), access times, power requirements of the different
>>>>> parts, access patterns on the instruction bus, area costs, design
>>>>> times
>>>>> and design costs, etc., and I don't know anything about these.
>>>>
>>>>> I am also fairly sure that the designers who /are/ capable of
>>>>> calculating and balancing these tradeoffs will have thought of doing
>>>>> something like this. There are certainly similar solutions used on
>>>>> many
>>>>> high-speed flash microcontrollers, though they may be much smaller. It
>>>>> could well be that my suggested 256 byte buffer is far too big, and
>>>>> that
>>>>> an 8 or 16 byte buffer is fine when your cpu clock speed is not too
>>>>> much
>>>>> higher than the flash access speed.
>>>>
>>>> I think that the way this is implemented is through an instruction
>>>> queue. This was implemented in early 32 bit chips, like the NS32016
>>>> and the MC68010. The MC68010 even allowed you to loop in the queue.
>>>>
>>>> It is not implemented on the ARM, and I do not think that it
>>>> exists in the Cortex-M3 either. The AVR32 does have a queue
>>>> and will fetch instructions faster than it will execute them,
>>>> and this is one reason why the AVR32 can handle waitstates
>>>> better than the Cortex-M3.
>>>>
>>>> On the AVR32 you lose about 7% due to the waitstate on the first
>>>> access,
>>>> and you only need one waitstate at 66 MHz, the top speed of current
>>>> production parts.
>>>>
>>>> You will not get a 100% hit rate, so your boost will be less than 7%.
>>>> If you do add SRAM, you might be better off adding a branch-target
>>>> cache to get rid of the initial waitstates.
>>>> Once you start running sequential fetches, the wide memory will
>>>> give you a benefit, but even a 128 bit flash can be a hog on power.
>>>>
>>>> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit
>>>> flash at the same frequency when running Thumb mode,
>>>> and it draws much less current.
>>>> The faster flash makes all the difference.
>>>> The LPC2xxx can offset this with a slightly higher clock rate,
>>>> but that will not make power consumption better.
>>>
>>> So many IFs, so little time. Benchmarking is an art, not a science.
>>> Best to run your app and see what is faster for your app.
>>>
>>> Rick
>>
>> If fast is the parameter you are looking for!
>> Many applications need a certain speed, but once it is there,
>> it will not use additional performance.
>>
>> You have a basic selection between speed and code size on the ARM7,
>> but with waitstates the lower memory use of the Thumb instruction set
>> can make it faster than the ARM instruction set.
>>
>
> I think it is interesting to look at the history of instruction sets.
> Long ago, there were two competing ideas - there were CISC designs with
> very varied instruction sets (typically in 8-bit parts), and RISC
> designs that were all consistent and wide (typically 32-bit). It turns out
> that both extremes were "wrong", and the most efficient modern
> instruction sets for small devices are 16-bit wide for most
> instructions, with some 32-bit (or 48-bit) for flexibility. Consistency
> and orthogonality of the architecture is important, but should not be
> taken to extremes. There is a lot to like about the Thumb2 set - I think
> it's a big improvement on the original ARM ISA.
>
> Of course, the 68000 designers at Motorola figured this out about 30
> years ago...
>
I consider the 68000 to be a nightmare compared to a truly orthogonal
machine like the Series 32000...
The Series 32000 did not fix the location of the instruction fields,
though, so you would get a messy instruction decoder.
The immediates were 7, 14 or 30 bits wide, with the size
encoded in either the top bit or the two top bits.
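That width-in-the-top-bits scheme can be sketched as a small decoder in C. The byte ordering and sign handling are simplified assumptions for illustration (values are treated as unsigned here); only the 0/10/11 top-bit selection of 7, 14 or 30 bits follows the text.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one variable-width immediate: the top bit(s) of the first
 * byte select the width.
 *   0xxxxxxx            -> 7-bit value,  1 byte
 *   10xxxxxx xxxxxxxx   -> 14-bit value, 2 bytes
 *   11xxxxxx + 3 bytes  -> 30-bit value, 4 bytes
 * Returns the value; *len receives the number of bytes consumed. */
static uint32_t decode_imm(const uint8_t *p, unsigned *len)
{
    if ((p[0] & 0x80) == 0) {
        *len = 1;
        return p[0] & 0x7F;
    }
    if ((p[0] & 0x40) == 0) {
        *len = 2;
        return ((uint32_t)(p[0] & 0x3F) << 8) | p[1];
    }
    *len = 4;
    return ((uint32_t)(p[0] & 0x3F) << 24) |
           ((uint32_t)p[1] << 16) | ((uint32_t)p[2] << 8) | p[3];
}
```

The cost the text complains about is visible here: the decoder cannot know where the next field starts until it has examined the first byte of this one.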
I think variable size with a fixed location of fields seems to be the
most efficient approach. The National CompactRISC was one of the first
implementations of this idea:
16/32/48 bit instructions, with "quick" 5 bit immediates
and reserved values that indicated that one or two extension words
followed the instruction.
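A sketch of that "quick" immediate scheme in C: most values fit in the 5-bit field, and reserved encodings say "the real value follows in extension words". The particular reserved codes chosen here are assumptions for illustration, not the actual CompactRISC encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed reserved codes in the 5-bit immediate field. */
#define EXT_ONE_WORD  0x1E   /* real value in one 16-bit extension word  */
#define EXT_TWO_WORDS 0x1F   /* real value in two 16-bit extension words */

/* Decode a quick immediate.  ext points at any extension words that
 * follow the 16-bit base instruction; *words receives how many were
 * consumed (0, 1 or 2, matching 16/32/48 bit total instruction size). */
static uint32_t decode_quick(uint8_t field5, const uint16_t *ext,
                             unsigned *words)
{
    if (field5 == EXT_ONE_WORD)  { *words = 1; return ext[0]; }
    if (field5 == EXT_TWO_WORDS) { *words = 2;
                                   return ((uint32_t)ext[0] << 16) | ext[1]; }
    *words = 0;
    return field5;   /* common case: the immediate is the field itself */
}
```

Because the immediate field sits at a fixed position, the decoder can start working on the rest of the instruction before it knows whether extension words follow.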
Internally, the decoder would decode the 16/32/48 bit
instruction directly for a simple pipeline.
For multi-clock functions like interrupts & exceptions
there were 7-8 state machines that could override the
instruction decoder.
Since the datapath was controlled by ~90 signals,
this turned out to be a significant part of the chip:
90 x (9->1 mux) + state logic...
It turned out that this could be simplified further.
The reason for the state machines was that you need to do
operations which are not supported by the instruction set.
When you enter an interrupt, you need to clear the interrupt flag.
This is really an (unsupported) instruction:
AND $IRQMASK, PSR, i.e. AND 0xFF7F, PSR
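Expressed in C, that internal AND 0xFF7F, PSR operation is just a mask of the status register on interrupt entry. The bit position follows from the 0xFF7F mask in the text; the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* The 0xFF7F mask clears bit 7 of the PSR, taken here to be the
 * interrupt-enable flag. */
#define PSR_I_BIT  (1u << 7)
#define IRQMASK    ((uint16_t)~PSR_I_BIT)     /* == 0xFF7F */

/* Interrupt entry: disable further interrupts by clearing the flag. */
static uint16_t enter_interrupt(uint16_t psr)
{
    return psr & IRQMASK;
}
```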
The problem we found was that you need to do operations
on registers which are not normally accessible, and you need
a few more instructions.
The problem was solved by extending each register address
from 4 to 6 bits, allowing all registers (including the PSR) in the CPU
to be directly addressable by all instructions operating on registers.
The opcode was extended by two bits, allowing more instructions
to be handled directly.
The normal instructions were extended from 16 bits to 22 bits,
but user code would only use the normal 16 bits.
If a multi-clock function was needed, the instruction decoder
was fed from a 22 bit wide ROM which ran for a few clock cycles.
The 8 x 90 bit state machines + the 90 x 9->1 multiplexers were replaced by
a 32-64 x 22 bit ROM and 22 x 2->1 muxes.
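The structure of that replacement can be sketched in C: a multi-clock function becomes a short sequence of 22-bit internal instructions read from a small ROM and fed to the ordinary decoder, one per clock. The micro-op encodings below are invented placeholders; only the shape - a small ROM of 22-bit entries replacing per-function state machines - follows the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t uinsn;              /* only the low 22 bits are used */

/* Invented internal sequence for interrupt entry: save state, then
 * the (unsupported) AND 0xFF7F, PSR to clear the interrupt flag. */
static const uinsn irq_entry_rom[] = {
    0x200123,                        /* push PC  (placeholder encoding) */
    0x200456,                        /* push PSR (placeholder encoding) */
    0x2FF7F9,                        /* AND 0xFF7F, PSR (placeholder)   */
};

/* Override the instruction decoder for n clocks, feeding it one 22-bit
 * micro-op per cycle from the ROM.  Returns the cycle count. */
static unsigned run_sequence(const uinsn *rom, size_t n)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++) {
        (void)(rom[i] & 0x3FFFFF);   /* 22-bit micro-op into the decoder */
        cycles++;
    }
    return cycles;
}
```

The win is that the sequencing logic is now data (ROM contents) rather than hand-built state machines, so adding or fixing a multi-clock function means editing a table.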
--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB