
Small CPUs in FPGAs

Started by Rick Collins July 31, 2008
--- In f..., Veronica Merryfield wrote:
>
> Richard
>
> > The ARM processors have multiple stacks.
> >
> The ARMs have operating modes, each with its own register set. The
> software usually has a stack assigned to each mode.
> > One drawback is that they are of fixed size and statically allocated
> > at startup.
> >
> The ARM architecture itself does not dictate anything about the size or
> allocation of a stack.
> An OS might though.
> > I don't see why a memory management unit couldn't solve the whole
> > problem. Memory for stacks could be dynamically allocated in
> > relatively small chunks. There would need to be some kind of page
> > fault detection. Presumably the system level stacks would be large
> > enough to prevent faults during hardware interrupts.
> >
> The ARM memory manager works in 1K, 4K or 64K (IIRC) chunks depending
> on how it is set up and how much space one wants to use for paging
> tables. Some OSes allocate an initial page for a task stack and use
> the page faulting as you suggest. It works well enough for regular
> tasks but not so well on realtime tasks and as you point out, the
> supervisor stack should be fixed. One side issue here is what happens
> in exception handling if you use page faults for resizing a stack and
> the exception handler itself uses that stack.
>
> Veronica
>

I guess I was thinking about the ARM7TDMI devices - microcontrollers
really. Certainly the larger devices with MMUs have much more
capability but, other than at the Linux user level, I haven't spent
much time considering the hardware.

I would imagine that the exception handler on these larger ARMs would
use one of the system stacks. I don't know that for sure but even the
ARM7s have separate stacks for undefined instruction trap, abort
interrupt, FIQ (fast interrupts), IRQ (normal interrupts), supervisor
and user.

Richard



Hi Richard,

rtstofer wrote:
> The ARM processors have multiple stacks. One drawback is that they
> are of fixed size and statically allocated at startup. An RTOS like
> FreeRTOS uses fixed size task stacks.
>
>
A fixed stack is not a problem. I am trying to design a processor with
single-byte instructions: 4 bits for the opcode and 4 bits for addressing
or opcode extensions. This allows the instruction word to be lengthened
from 8 bits to 32 bits, which is what I was referring to as a scalable
design. Keeping the instruction word simple also means that you are
reducing control overhead, and using stack-based operators means that you
don't have to use up opcode field bits specifying registers.
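Something like this in C terms (just a sketch of the decode; the field
names are mine, not part of the design):

#include <stdint.h>

/* Sketch: the upper 4 bits of the byte select the opcode, the lower
   4 bits select an addressing mode or an opcode extension. */
typedef struct {
    uint8_t opcode;   /* bits 7..4 */
    uint8_t mode;     /* bits 3..0: addressing mode or extension */
} decoded_op;

static decoded_op decode(uint8_t instr)
{
    decoded_op d;
    d.opcode = (instr >> 4) & 0x0F;
    d.mode   = instr & 0x0F;
    return d;
}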

> I don't see why a memory management unit couldn't solve the whole
> problem. Memory for stacks could be dynamically allocated in
> relatively small chunks. There would need to be some kind of page
> fault detection. Presumably the system level stacks would be large
> enough to prevent faults during hardware interrupts.
>
>
Certainly memory management came to mind. With the 8 bit instruction
approach the idea was to load and store words from the stack to either
static/global memory via a "common" pointer, or to and from local stack
variables indexed via a frame pointer. Because the instructions are all a
single word, addressing must be done in segments or pages. The global
variables work up from the bottom of a segment whereas the stack
variables work down from the top. There is no reason why global and local
variables should not share the same segment, and the MMU does not need to
allocate an entire segment of memory, as the pointers will wrap around to
use whatever memory is allocated.
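Roughly what I mean, as a C model (the 4K segment size, the register
names and the power-of-two assumption are just for the sketch): globals
are indexed up from the "common" pointer, locals down from the frame
pointer, and the effective address wraps within the segment so both can
share whatever memory is actually mapped.

#include <stdint.h>

#define SEG_SIZE 4096u   /* illustrative power-of-two segment size */

/* Global/static access: index up from the "common" pointer. */
static uint32_t global_addr(uint32_t seg_base, uint32_t common_ptr,
                            uint32_t offset)
{
    return seg_base + ((common_ptr + offset) % SEG_SIZE);
}

/* Local/stack access: index down from the frame pointer.  Unsigned
   wrap-around plus the power-of-two modulus keeps the address inside
   the segment. */
static uint32_t local_addr(uint32_t seg_base, uint32_t frame_ptr,
                           uint32_t offset)
{
    return seg_base + ((frame_ptr - offset) % SEG_SIZE);
}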

On System09, I follow the simple Dynamic Address Translation system
implemented on the old SWTPC computer, which allows 1 MByte of memory to
be mapped in 4 KByte pages. This is done simply with a register lookup
table on the high-order address bits. The CoCo3 computer can, I think,
map 2 MBytes of RAM in 8 KByte pages. You might want something more
advanced than that, something that can detect segment and page faults.
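In software terms the DAT boils down to something like this (a sketch
only, not the System09 VHDL): the top four address bits index a small
register file whose 8-bit output becomes the physical page number.

#include <stdint.h>

/* Model of the SWTPC-style Dynamic Address Translation: the top 4 bits
   of a 16-bit CPU address index 16 mapping registers, each holding an
   8-bit physical page number, giving 4 KByte pages and a 20-bit
   (1 MByte) physical address. */
static uint8_t dat_reg[16];   /* loaded by the operating system */

static uint32_t dat_translate(uint16_t cpu_addr)
{
    uint32_t page   = dat_reg[cpu_addr >> 12];  /* lookup on high-order bits */
    uint32_t offset = cpu_addr & 0x0FFF;        /* offset within the 4K page */
    return (page << 12) | offset;
}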

> How did Burroughs handle stacks in their Algol machines such as the
> B5000? http://en.wikipedia.org/wiki/Burroughs_large_systems
>
>
I'll take a closer look at the page later. I had a brief look but found
it a little difficult trying to figure out the stack arrangement from
the diagram.

> Yes, there is a heap with all the usual complications of allocation,
> release and cleanup. For security, some portion of the system needs
> to clear the heap chunks as they are allocated.
>
>
I hadn't considered cleaning the memory pages, but yes ... for the
integrity of the system, you would want to do that. I think Microsoft's
C# uses a hierarchical memory allocation system to simplify memory
allocation and release.

> Given that the expression could be rewritten in postfix notation, I
> think it could. I believe the stack on the HP48GX is not limited
> other than by the size of memory.
>
Would the expression analyzer need to be rewritten for GCC or LCC ?
That's the big question.
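For what it's worth, the attraction of postfix on a stack machine is that
evaluation is nothing but pushes and pops; a*(b+c) becomes a b c + *.
A trivial C sketch (nothing GCC- or LCC-specific):

/* Evaluate the postfix form of a*(b+c), i.e. "a b c + *",
   on a small data stack -- essentially what a stack machine
   would execute directly. */
static int stack[8];
static int sp;

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

int eval_example(int a, int b, int c)
{
    push(a);
    push(b);
    push(c);
    push(pop() + pop());   /* b c +     */
    push(pop() * pop());   /* a (b+c) * */
    return pop();
}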

> I don't believe I have spent enough time admiring the B5000. This was
> a very clever design!
>
> Richard
>
Yes, the Wikipedia page said it used all the latest approaches to
hardware and software design, circa 1961 :-) Processor design has been
shrunk courtesy of silicon integration, but I don't know that the
architectures have developed much, at least not software-wise. Obviously
a lot of work has gone into pipelining, branch prediction, FPUs, caching
and so on, but from the point of view of memory allocation, frame
pointers and so on, I could be wrong, but I'm inclined to think that if
anything, things have been simplified.

The idea of processors in FPGAs should really be to try to keep the
control overhead to a minimum so that you are focusing on the parallelism
of the design. To implement a von Neumann processor on an FPGA seems like
a horrible waste of resources: you don't get the speed advantage of a
custom ASIC and you don't get the benefit of the parallel logic.

Is Reiner Hartenstein monitoring the list ? :-) I noticed he had a few
papers on the "von Neumann syndrome".

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


Is it possible to design a scalable search processor that performs
parallel searches in data ....
It would certainly make use of the parallel capacity of FPGA hardware.
What sort of features would such a processor have ?
What sort of searches would it perform ?
What sort of applications would it have ?

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


Did you look up CAMs? (Content addressable memories)

Essentially it depends on the type of operations that you want to perform
and what data structures support those operations. It also depends on the
size of the data you are processing.

In each cycle you can access one word per register, one word per
distributed RAM, one (or two) words per BRAM, and a fraction of one
address per external RAM port.

For data that fits in registers there is a very nice architecture for
systolic priority queues. This means that you can sort small data sets as
fast as you can input data, which means that you can find the minimum,
maximum, median, etc.
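(A rough software model of the idea, with the size and names purely
illustrative: each call below does what the whole hardware array would do
in a single cycle, since all the compare/shift cells work in parallel.)

#define N 16                 /* number of cells in the systolic array */
static int cell[N];          /* kept sorted, smallest first */
static int count;

/* In hardware every cell compares and shifts at once, so an insert
   costs one cycle; this loop just models the resulting state. */
void pq_insert(int v)
{
    int i = (count < N) ? count++ : N - 1;   /* drop the largest if full */
    while (i > 0 && cell[i - 1] > v) {
        cell[i] = cell[i - 1];
        i--;
    }
    cell[i] = v;
}

int pq_min(void)    { return cell[0]; }
int pq_max(void)    { return cell[count - 1]; }
int pq_median(void) { return cell[count / 2]; }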

Also, you can search an arbitrarily large string for a pattern that fits
into your FPGA as fast as you can input the string (Smith-Waterman
algorithm). You can even find the optimal non-exact match for rather
complex metrics.

Kolja Sulimma

2008/8/8 John Kent :
> Is it possible to design a scalable search processor that performs
> parallel searches in data ....
> It would certainly make use of the parallel capacity of FPGA hardware.
> What sort of features would such a processor have ?
> What sort of searches would it perform ?
> What sort of applications would it have ?
>
> John.
>
> --
> http://www.johnkent.com.au
> http://members.optushome.com.au/jekent

--
cronologic ohg Frankfurt am Main
HRA 42869 beim Amtsgericht Frankfurt
Telefon 069 38 09 78 254



Hi Kolja,
Kolja Sulimma wrote:
> Did you look up CAMs? (Content addressable memories)
>
> Essentially it depends on the type of operations that you want to perform
> and what data structures support those operations. It also depends on
> the size of the data you are processing.
>
> In each cycle you can access one word per register, one word per
> distributed RAM, one (or two) words per BRAM, and a fraction of one
> address per external RAM port.
>
> For data that fits in registers there is a very nice architecture for
> systolic priority queues. This means that you can sort small data sets
> as fast as you can input data, which means that you can find the
> minimum, maximum, median, etc.
>
>
Back in the 1990s I was working on video median filters. I came up with
a scheme that used an array of magnitude comparators connected to a
parallel shift register array. There was a bus for inserting pixels into
the shift register and a bus for deleting pixels from it. The pixel to
delete was delayed by a delay shift register the length of the sorting
shift register. The idea was that a pixel was inserted into or deleted
from the shift register at the point where the magnitude comparison
changed from greater-than to less-than-or-equal. Only the pixels between
the insert point and the delete point were shifted up or down. I thought
I was rather clever coming up with the design until I discovered that
Arce and Warter et al. had come up with an almost identical design 10
years earlier, which goes to show how difficult it is to come up with a
truly original design.
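In software terms the scheme comes down to roughly this (just a model of
the data flow; the real design does the compare and shift across the
whole window in one clock, and the 9-pixel window is only an example):
keep the window sorted, insert the incoming pixel at its rank, and delete
the pixel that the delay line says is leaving.

#define WIN 9                        /* e.g. a 3x3 neighbourhood */

static unsigned char sorted[WIN];    /* window pixels kept in sorted order */
static unsigned char delayline[WIN]; /* tells us which pixel is leaving */
static int head;

/* Push one pixel in, retire the oldest, return the running median. */
unsigned char median_step(unsigned char in)
{
    unsigned char out = delayline[head];   /* pixel to delete */
    delayline[head] = in;
    head = (head + 1) % WIN;

    /* delete 'out' from the sorted window */
    int i = 0;
    while (sorted[i] != out)
        i++;
    for (; i < WIN - 1; i++)
        sorted[i] = sorted[i + 1];

    /* insert 'in' at its rank */
    i = WIN - 2;
    while (i >= 0 && sorted[i] > in) {
        sorted[i + 1] = sorted[i];
        i--;
    }
    sorted[i + 1] = in;

    return sorted[WIN / 2];
}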

> Also, you can search an arbitrarily large string for a pattern that fits
> into your FPGA as fast as you can input the string (Smith-Waterman
> algorithm). You can even find the optimal non-exact match for rather
> complex metrics.
>
>
I notice in Donald Knuth's book on sorting and searching that the Batcher
sorting network is about the most efficient method of performing a
parallel sort. I would imagine that for non-exact matches you could use a
binary correlator, i.e. XOR the bits with the pattern being searched for
and sum the difference bits. You then use a min/max filter to determine
the best correlation. I see no reason why that would not work on text as
well as other data, although on text you might want to weight some of the
terms.
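A rough C model of the binary correlator idea (the 32-bit pattern width
and the names are just for the example): XOR against the pattern, count
the difference bits, and keep the running minimum.

#include <stdint.h>
#include <stddef.h>

/* Count mismatching bits between a 32-bit data window and the pattern. */
static int mismatch(uint32_t window, uint32_t pattern)
{
    uint32_t x = window ^ pattern;   /* 1 wherever the bits differ */
    int n = 0;
    while (x) {
        n += x & 1;                  /* sum the difference bits */
        x >>= 1;
    }
    return n;
}

/* Slide the pattern over a bit stream (one bit per array element) and
   keep the position with the fewest mismatches -- the "min" filter. */
size_t best_match(const uint8_t *bits, size_t nbits, uint32_t pattern)
{
    uint32_t window = 0;
    int best = 33;                   /* worse than any possible score */
    size_t best_pos = 0;

    for (size_t i = 0; i < nbits; i++) {
        window = (window << 1) | (bits[i] & 1u);
        if (i >= 31) {
            int m = mismatch(window, pattern);
            if (m < best) {
                best = m;
                best_pos = i - 31;   /* start of the best window so far */
            }
        }
    }
    return best_pos;
}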
> Kolja Sulimma
>
I was pondering the nature of the problems that people solve. Systolic
arrays are good for sorting and searching or for Fourier transforms and
so on, but that class of regular geometric processing does not really
pervade conventional personal computing. (A bold, unjustifiable statement
perhaps.) I notice that the i.MX31 ARM chip has a built-in MPEG encoder,
so with multimedia applications becoming more prevalent, there does
appear to be a move to integrate dedicated signal processing hardware
into CPU designs.

For a CPU to really be able to handle sparse data, the data must be
distilled down, using some criterion such as sorting, matching,
transformation or min/max, etc., to a condensed form whereby you can make
an evaluation and decision on it, usually in the human domain. In the
case of a display card, the information is expanded back up from
condensed information to sparse information for display in human-readable
form, using some sort of geometric transformation.

So I guess, if a transformation is used often enough, it gets integrated
into hardware.

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


Philip Freidin wrote:
>
> And to my surprise, my design got mentioned in the August 2008
> issue of Circuit Cellar magazine, in an article by Tom Cantrell
>

The picture therein demonstrates that the R16 has the
crowning element needed to qualify as a "real" computer:

A front panel with switches and blinkenlights !!!

Brian



Brian Davis wrote:
> The picture therein demonstrating the R16 to have the
> crowning element needed to qualify as a "real" computer:
>
> A front panel with switches and blinkenlights !!!
>
>
I'm working on that .... (but not R16)
But alas, switches and blinking lights use a lot of I/O pins.
> Brian
>
Who at the moment is trying to decide whether to build a 12/24-bit CPU
or an 8/16/32-bit CPU, both with 256Kb of memory. Note the 12-bit is a
nicer design, but everything is 8-bit bytes today.



Rick Collins wrote:
>
> today's FPGAs no longer include T-bufs, so multiplexers
> have to be used, increasing the LUT count.
>
My 4005e processor attempt weighed in at about 200 LUTs
and 180 TBUFs in 16 bit mode, including tiny RAM, ROM, and I/O:

http://www.fpga-faq.com/archives/49250.html#49263
http://members.aol.com/fpgastuff/early_y1.tif

I've never managed to find the time to get the documentation and
compiler finished, or the VHDL code tidied up enough to release.
The simple absolute assembler worked well enough for writing
instruction set test cases and small examples.

I finally got the code synthesizing again last Christmas
holiday (it worked in ISE 6.3, but not in ISE 7, 8 or 9, due
to #$%!@#!! XST alias bugs); the 32 bit core is now about
800 LUTs in a Spartan-3.

The unfinished improvements involved adding an SP/FP stack
addressing mode, manual push/pop of the hardware return
stack, cleaner long immediates, and other sundry updates.

see also:
http://tech.groups.yahoo.com/group/fpga-cpu/message/106
http://tech.groups.yahoo.com/group/fpga-cpu/message/118
http://tech.groups.yahoo.com/group/fpga-cpu/message/332
http://tech.groups.yahoo.com/group/fpga-cpu/message/666
Brian



Hi John,

> I was wondering how a stack oriented machine would handle multitasking ...
> Saving the context of the machine and restoring it.
> I'm inclined to think you'd need a supervisory state like the 68000 to
> support the swapping of TOS, NOS, FP, SP and so on.

In JOP I have some 'special' instructions that can access the stack
cache directly. On the thread switch the stack is exchanged - it is
the machine state.

> I was thinking of using the argument and local variable stack for
> working space, but using the same stack for work space would preclude
> the allocation of new local variables, which may be needed in languages

Why? There is no reason not to allocate data on the stack (in C, with
alloca). In Java, stack allocation is not possible.

> like C++. Pascal scoping rules mean that all variables are effectively
> created on the stack (there is the concept of a heap in Pascal too, isn't
yes
> there ?). I assume Java is the same. C and C++ have global or static
> variable space as well as arguments and local variables, which are
> normally referenced above or below a frame pointer.

In Java there is also global/static data, but only primitives. All
objects and arrays are allocated on the heap.

Cheers,
Martin



Hi Martin,
Martin Schoeberl wrote:
> In JOP I have some 'special' instructions that can access the stack
> cache directly. On the thread switch the stack is exchanged - it is
> the machine state.
>
>
I guess it depends on how large the stack is. I would have thought it
would be easier to save the frame and stack pointers, CCs, PC and TOS on
the supervisor stack. Presumably you still have a frame pointer and
stack pointer. Moving an entire stack would make a context switch pretty
time consuming. On the other hand, if you save the stack pointer, you
need a supervisor stack to save it on.
>> I was thinking of using the argument and local variable stack for
>> working space, but using the same stack for work space would preclude
>> the allocation of new local variables, which may be needed in languages
>>
>
> why? There is no reason to not allocate data (in C with alloca) on
> the stack. In Java stack allocation is not possible.
>
>
well ... in a procedure call, arguments are pushed on the stack, then
the return address, then the frame pointer, and the frame pointer is then
loaded with the stack pointer. Local variables are then allocated on the
stack. Local variables are indexed with a negative offset from the frame
pointer, while the arguments are indexed with a positive offset, taking
into account the space taken up by the return address and frame pointer.
If the stack below the local variables is used as working space for
evaluating expressions, then it becomes hard to re-allocate space for any
new local variables. It depends on whether there are any working
variables on the stack when a new variable is declared at the start of a
block.
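To make that concrete, the layout I'm describing looks something like
this (a sketch only; it assumes one word each for the return address and
saved frame pointer, a stack that grows downward, and the macros are just
illustrative):

/* Frame layout after the prologue, stack growing downward:
 *
 *   FP+3 : argument 2
 *   FP+2 : argument 1
 *   FP+1 : return address
 *   FP+0 : saved (caller's) frame pointer
 *   FP-1 : local variable 1
 *   FP-2 : local variable 2
 *   ...
 *   SP   : expression working space grows below the locals,
 *          which is why growing the locals later is awkward.
 */
#define ARG(fp, n)    ((fp)[2 + (n)])    /* arguments: positive offsets */
#define LOCAL(fp, n)  ((fp)[-1 - (n)])   /* locals: negative offsets */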

> In Java there is also global/static data, but only primitives. All
> objects and arrays are allocated on the heap.
>
>
OK
> Cheers,
> Martin
>
Bed time.
Cheers

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent

