Sign in

username:

password:



Not a member?

Search fpga-cpu



Search tips

Subscribe to fpga-cpu



fpga-cpu by Keywords

Altera | CISCifying | IDE | ISA | Java | JHDL | JTAG | LBU | MicroBlaze | PAR | PCI | RISC | SoC | Spartan | Transputers | Verilog | VHDL | Virtex | VLIW | WebPack | Xilinx | Xsoc | YARD-1A

Ads

Discussion Groups

Discussion Groups | FPGA-CPU | Stack Machine & Memory Bandwidth

This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

Stack Machine & Memory Bandwidth - rtstofer - Oct 22 23:17:00 2004


It shouldn't be a surprise that with a stack machine for every
memory fetch there is a consequent stack manipulation. So, we're
looking at 2 memory accesses for every instruction, minimum. This
type of machine screams for a Harvard architecture which,
heretofore, I had considered an oddity. Silly me, I get nearly a
100% speedup with the increased bandwidth.

So, the issue is: which prototype board will allow me to have 2
completely separate memory systems 32 bits wide with a 17 or 18 bit
address.

One board that I am considering is
http://www.easyfpga.com/ez3susb_features.htm This board is quite
compact - it doesn't have anything except a 400k gate Spartan 3 and
a USB interface. That would mean that all I/O would have to come
from somewhere else over a somewhat slow USB link. But that might
be very interesting - a Visual Basic program to handle console
functions, perhaps some debugging information, file transfer and the
like. Interesting...

Another board I am considering is
https://digilent.us/Sales/Product.cfm?Prod=D2FT-DIO5 where there is
a possiblity of connecting 2 memory systems. The combination kit
has some advantages but once I fill up the edge connectors with
memory I won't have a way to get off board for I/O such as the IDE.
I would eventually have to lose the peripheral card but it would be
very useful for debugging.

Any other suggestions for a board with a bunch of edge connections?

I'm not hung up on Xilinx, Altera is certainly in the running if I
can find a board with the required configuration.





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )


Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 22 23:25:00 2004

rtstofer wrote:
>
> It shouldn't be a surprise that with a stack machine for every
> memory fetch there is a consequent stack manipulation. So, we're
> looking at 2 memory accesses for every instruction, minimum. This
> type of machine screams for a Harvard architecture which,
> heretofore, I had considered an oddity. Silly me, I get nearly a
> 100% speedup with the increased bandwidth.
>
> So, the issue is: which prototype board will allow me to have 2
> completely separate memory systems 32 bits wide with a 17 or 18 bit
> address. Most stack machines are often FORTH and you only need a small amount
of stack space say 256 words if you go with that langauge. How ever this
stack could be internal in the FPGA. The same goes for the Forth
return stack as well.
Ben.





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - Arius - Rick Collins - Oct 23 0:25:00 2004

At 12:17 AM 10/23/2004, you wrote: >It shouldn't be a surprise that with a stack machine for every
>memory fetch there is a consequent stack manipulation. So, we're
>looking at 2 memory accesses for every instruction, minimum. This
>type of machine screams for a Harvard architecture which,
>heretofore, I had considered an oddity. Silly me, I get nearly a
>100% speedup with the increased bandwidth.

I'm not clear how you are architecting this. Most stack machines use a
separate on chip memory for the stack, often two separate memories for data
and return stacks. Likewise, most modern FPGAs have enough memory that you
can put the program and data memory on chip as well. This greatly improves
the speed and allow each memory to be separate.

I found with my stack machine that trying to combine memories and stacks
resulted in a lot of muxes on the address busses. This uses a lot of
resources and slows down the cycle. Also, on chip memories can be dual
port which may be of value.
Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - rtstofer - Oct 23 3:29:00 2004



All data, especially data local to a procedure, is dynamically
allocated on a stack when the procedure is invoked and the stack is
cut back when the procedure terminates. If the procedure was
actually a function, the return value remains on the top of the
stack (TOS) for the calling procedure to use. There are a bunch of
exceptions but let's just assume a function returning an integer.

Of course, program calling locations (return addresses) are also
stored on this stack in a structure known as a mark stack frame.
This 5 word data structure contains the return address as well as
the contents of the stack pointer, mark stack pointer, a base
pointer and space for the return value. This implies that when a
procedure returns it is possible to cut the stack back to exactly
the point where it was before the call. The memory used by the
terminated procedure is returned to the system for reuse.

Recursive functions are a good example of how this data stack grows
as the function recurses and gets cut back as the function unwinds.
Imagine how it would grow if the procedure had, say, 100 words of
local variables.

There is another data area known as the heap which is just another
stack where data is allocated dynamically. A linked list, for
example, will grow in this area as new nodes are created. Although
it won't be implemented, this is an area where garbage collection
strategies can be employed to recover space as structures are
released.

One stack grows upward in data memory, one stack grows downward; if
they meet in the middle memory overflow results and the CPU stops.

This is not a small return address stack, everything is allocated
here and recursion is encouraged. The entire Pascal compiler uses
recursive descent as it translates the source. In fact, one of the
best features of Pascal is using recursion while traversing
structures. Sure, it can be done in an indexed loop but it isn't as
pretty or as intuitive.

Code, once loaded into memory, is static. In concept, if there was
a small OS, then there would be a small amount of resident code and
other program code would be allocated space before execution and cut
back when terminated. So the compiler could be loaded using 32K
words, or whatever, and when it terminated with results stored on
disk the code space would be available for executing (interpreting)
the results. Mentally, this can be viewed as a stack where chunks
of code are added and removed as different programs are run.

Let's take a simple integer addition of the top two elements on the
stack that pops the two operands and returns the result as the new
TOS (Top Of Stack). To complicate things, let's assume that the TOS
element and NOS (Next On Stack) are not really in SRAM but are held
in registers connected to the ALU. Pretty simple, the ALU is
already connected to TOS and NOS so the result is clocked into TOS.
Now, the NOS register is trash and it needs to be reloaded with the
element below in SRAM pointed to by SP. So, SP is decremented and a
memory fetch refills NOS. Meanwhile, while the stack fetch is going
on it is also time to fetch the next instruction and we might as
well use the incoming instruction to cause a branch in the FSM to
the proper point to execute the instruction - instruction decode is
just a branch at the end of the fetch when the Instruction Register
is loaded. Oh, and SP might as well be maintained as, mentally, SP-
1 since we will almost always be popping something off the stack.
In fact, there should be an adder creating SP-1 and SP+1 at all
times.

In theory, the addition, the stack fetch, updating SP and
instruction fetch could all occur in parallel in a single cycle. If
this was interpreted in code it would likely take 4 CISC
instructions: pop op1, pop op2, add, push result. Or some similar
scheme using index registers. But, the point is, multiple
instructions and perhaps multiple data fetches have to occur to
complete the operation. Really ugly unless the architecture is
designed to support stack operations.

But I can't overlap instruction fetch with stack cleanup unless I
have two completely separate memory spaces. It's not fatal to the
design, I could still execute the integer add in 2 cycles but that
has cut machine speed in half.

I'm really just thinking out loud. I started playing with the
instruction set, looking at how the FSM might execute instructions
and one of the first things to jump out was the issue of instruction
fetch colliding with stack cleanup.

Now, if I really wanted to speed things up, I would interleave the
data stack so that I could fetch two operands in a single cycle. I
think I'll let that go... I would need 3 separate memory spaces or
about 150 pins... Sure, IBM used to fetch 512 bytes at a time but
they had more money to spend.

There are some other interesting instructions because Pascal is
truly a block structured language and nested blocks are encouraged
to enforce scope rules. So, if a procedure wants to fetch an
operand from its calling procedure's calling procedure (grand
parent) then two levels of indirection are required. This n-level
indirection occurs whenever a non-local variable is used and 'n' is
related to the level of nesting. The instruction format includes
the number of levels to indirect plus the offset within the final
level to the specific variable.

Pascal is a beautiful language but the underlying hardware can be a
little awkward. --- In , Arius - Rick Collins
<dsprelated@a...> wrote:
> At 12:17 AM 10/23/2004, you wrote: > >It shouldn't be a surprise that with a stack machine for every
> >memory fetch there is a consequent stack manipulation. So, we're
> >looking at 2 memory accesses for every instruction, minimum. This
> >type of machine screams for a Harvard architecture which,
> >heretofore, I had considered an oddity. Silly me, I get nearly a
> >100% speedup with the increased bandwidth.
>
> I'm not clear how you are architecting this. Most stack machines
use a
> separate on chip memory for the stack, often two separate memories
for data
> and return stacks. Likewise, most modern FPGAs have enough memory
that you
> can put the program and data memory on chip as well. This greatly
improves
> the speed and allow each memory to be separate.
>
> I found with my stack machine that trying to combine memories and
stacks
> resulted in a lot of muxes on the address busses. This uses a lot
of
> resources and slows down the cycle. Also, on chip memories can be
dual
> port which may be of value. >
> Rick Collins
>
> rick.collins@a...
>
> Arius - A Signal Processing Solutions Company
> Specializing in DSP and FPGA design http://www.arius.com
> 4 King Ave 301-682-7772 Voice
> Frederick, MD 21701-3110 301-682-7666 FAX






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 23 14:39:00 2004

rtstofer wrote:
... { stack stuff removed }
> There are some other interesting instructions because Pascal is
> truly a block structured language and nested blocks are encouraged
> to enforce scope rules. So, if a procedure wants to fetch an
> operand from its calling procedure's calling procedure (grand
> parent) then two levels of indirection are required. This n-level
> indirection occurs whenever a non-local variable is used and 'n' is
> related to the level of nesting. The instruction format includes
> the number of levels to indirect plus the offset within the final
> level to the specific variable.

Have you looked at the very old Burrough's computers like the B5000?
They look to be interesting stack machines, but documents are few on the
internet.





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - Tomasz Sztejka - Oct 23 14:50:00 2004

--- ben franchuk <> wrote:
>
> rtstofer wrote:
> >
> > It shouldn't be a surprise that with a stack machine
> for every
> > memory fetch there is a consequent stack manipulation.
> So, we're
> > looking at 2 memory accesses for every instruction,
> minimum. This
> > type of machine screams for a Harvard architecture
> which,
> > heretofore, I had considered an oddity. Silly me, I
> get nearly a
> > 100% speedup with the increased bandwidth.
> >
> > So, the issue is: which prototype board will allow me
> to have 2
> > completely separate memory systems 32 bits wide with a
> 17 or 18 bit
> > address. > Most stack machines are often FORTH and you only need a
> small amount
> of stack space say 256 words if you go with that
> langauge. How ever this
> stack could be internal in the FPGA. The same goes for
> the Forth
> return stack as well.
> Ben.
>
Well, I do beleve, that 256 words may be not enough for
more complex programs. However knowning the way stack is
used (push, pop, peek and peek few elements below top of
stack) you can relatively easy change internal fixed size
stack to ring like based stack cache. The data are pushed
on stack and if stack overflow appears memory write is
generated to flush the oldest data and make place for new
data. Same for pop - if there is not enough data for
deepest possible peek the memory read cycle is generated.
This can spare a lot of memory bandwidth and gives high
performance gain compared to pure stack-in-shared-memory
solution. This also gives you much more flexibility than
on-chip stack only.
I do beleve that 8 or 16 stack levels ring cache will be
enough - most program grows stack by significant number of
words then creates a lot of tiny jitter and shrinks it back
in one move. The whole 'jitter' stuff will be cached and
won't absorb memory. regards,
Tomasz Sztejka. ___________________________________________________________ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - rtstofer - Oct 23 16:35:00 2004


I am aware that the old Burrough's machines were stack oriented but I
haven't really researched them very much. IIRC they were developed to
run Algol which is a very close ancestor of Pascal.

There is a reason that stack computers aren't popular - I don't think
they make efficient use of memory bandwidth or the addition of
registers. Then too, most machines aren't really designed for block
structured languages. They can be compiled and run, of course, but it
is always less than optimal.

C has a lot of advantages here in that blocks aren't nested and there
are really only two data areas: global or local. In the case of
Pascal with perhaps 10 levels of lexical nesting the situation on
operand fetch is much more complex. Particularly if a procedure has
recursed down a tree.

I still don't have many of the details straight in my mind. I have a
lot more thinking to do.

> Have you looked at the very old Burrough's computers like the B5000?
> They look to be interesting stack machines, but documents are few on the
> internet.






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - rtstofer - Oct 23 17:06:00 2004


The memory system you suggest would be elegant. I'm not smart enough
to build it so I'll let it pass. But I do see where it could solve
the bandwidth issue.

The bookkeeping would be very complex, particularly when it comes to
referencing the data segments of enclosing procedures. Maybe complex
is the wrong word but it could take several cycles to calculate where
the stack frame for the enclosing procedure is located. It doesn't
usually have to be loaded, there are a couple of pointers that are
required to perform the indirect addressing. What would be ugly is
the situation where the local data of a procedure was larger than the
internal stack could hold. Some operands could be fetched from
BlockRAM and others would require stack fetches. Now, the compiler
could be adapted to know this and emit the proper code; but not by me!

Still, the idea of maximizing the use of BlockRAM has great merit. I
was thinking that this could contain the code space for certain
standard procedures such as boot loading, device IO, maybe some
functions of the file system. All I have to do is mirror the BlockRAM
into the code memory space and I am off and running. If I go with the
board having the USB interface then driver code will need to be
included here. Oh, and the interpreter FSM better be able to handle
certain IO extensions that are not part of Pascal.

I have the Televideo 950 emulation in mirrored BlockRAM on my CP/M
machine. I didn't want to give up program execution space by having
the code in the BIOS so I mirror it in and out as needed. Yes, it
slows down the terminal routines but that is a good thing - the screen
scrolls by so fast it can't be read anyway.

--- In , Tomasz Sztejka <sztejkat@y...> wrote:
> --- ben franchuk <bfranchuk@j...> wrote:
> >
> > rtstofer wrote:
> > >
> > > It shouldn't be a surprise that with a stack machine
> > for every
> > > memory fetch there is a consequent stack manipulation.
> > So, we're
> > > looking at 2 memory accesses for every instruction,
> > minimum. This
> > > type of machine screams for a Harvard architecture
> > which,
> > > heretofore, I had considered an oddity. Silly me, I
> > get nearly a
> > > 100% speedup with the increased bandwidth.
> > >
> > > So, the issue is: which prototype board will allow me
> > to have 2
> > > completely separate memory systems 32 bits wide with a
> > 17 or 18 bit
> > > address.
> >
> >
> > Most stack machines are often FORTH and you only need a
> > small amount
> > of stack space say 256 words if you go with that
> > langauge. How ever this
> > stack could be internal in the FPGA. The same goes for
> > the Forth
> > return stack as well.
> > Ben.
> >
> Well, I do beleve, that 256 words may be not enough for
> more complex programs. However knowning the way stack is
> used (push, pop, peek and peek few elements below top of
> stack) you can relatively easy change internal fixed size
> stack to ring like based stack cache. The data are pushed
> on stack and if stack overflow appears memory write is
> generated to flush the oldest data and make place for new
> data. Same for pop - if there is not enough data for
> deepest possible peek the memory read cycle is generated.
> This can spare a lot of memory bandwidth and gives high
> performance gain compared to pure stack-in-shared-memory
> solution. This also gives you much more flexibility than
> on-chip stack only.
> I do beleve that 8 or 16 stack levels ring cache will be
> enough - most program grows stack by significant number of
> words then creates a lot of tiny jitter and shrinks it back
> in one move. The whole 'jitter' stuff will be cached and
> won't absorb memory. > regards,
> Tomasz Sztejka. > ___________________________________________________________ALL-NEW
Yahoo! Messenger - all new features - even more fun!
http://uk.messenger.yahoo.com





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - Arius - Rick Collins - Oct 23 21:22:00 2004

Whew! I forgot that this will not be programmed in Forth... :) At 04:29 AM 10/23/2004, you wrote:
>All data, especially data local to a procedure, is dynamically
>allocated on a stack when the procedure is invoked and the stack is
>cut back when the procedure terminates. If the procedure was
>actually a function, the return value remains on the top of the
>stack (TOS) for the calling procedure to use. There are a bunch of
>exceptions but let's just assume a function returning an integer.

...snip...
Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - Arius - Rick Collins - Oct 23 21:36:00 2004

At 05:35 PM 10/23/2004, you wrote: >I am aware that the old Burrough's machines were stack oriented but I
>haven't really researched them very much. IIRC they were developed to
>run Algol which is a very close ancestor of Pascal.

HP also made stack oriented computers.

>There is a reason that stack computers aren't popular - I don't think
>they make efficient use of memory bandwidth or the addition of
>registers. Then too, most machines aren't really designed for block
>structured languages. They can be compiled and run, of course, but it
>is always less than optimal.

That depends on the hardware. If your stack is separate from memory,
rather part of the CPU, then it is a lot like registers, providing fast
access to variables and uses *NO* memory bandwidth. I don't think this is
good for huge computers with many MB of memory, but for smaller machines it
can be very efficient. >C has a lot of advantages here in that blocks aren't nested and there
>are really only two data areas: global or local. In the case of
>Pascal with perhaps 10 levels of lexical nesting the situation on
>operand fetch is much more complex. Particularly if a procedure has
>recursed down a tree.

I am not aware of a significant difference between C and Pascal in the
usage of stack frames. Don't they both use that for local storage?

Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 23 22:31:00 2004

Arius - Rick Collins wrote:
> Whew! I forgot that this will not be programmed in Forth... :)
Ya! A non FORTH stack machine for a change.
PS. Another virtual stack architecture here - BCPL.
http://www.cl.cam.ac.uk/users/mr/index.html





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - rtstofer - Oct 23 22:56:00 2004


<snipped all over the place!>

> HP also made stack oriented computers.

I didn't know that. More research to do... > >There is a reason that stack computers aren't popular - I don't
think
> >they make efficient use of memory bandwidth or the addition of
> >registers. Then too, most machines aren't really designed for
block
> >structured languages. They can be compiled and run, of course,
but it
> >is always less than optimal.
>
> That depends on the hardware. If your stack is separate from
memory,
> rather part of the CPU, then it is a lot like registers, providing
fast
> access to variables and uses *NO* memory bandwidth. I don't think
this is
> good for huge computers with many MB of memory, but for smaller
machines it
> can be very efficient. The stack on a Pascal machine is HUGE. Perhaps the active stack
reaches 30K words running the compiler and the heap (another stack)
is probably similar. In fact, all memory in use is either on the
stack or in the heap. There is no static memory for all practical
purposes. > >C has a lot of advantages here in that blocks aren't nested and
there
> >are really only two data areas: global or local. In the case of
> >Pascal with perhaps 10 levels of lexical nesting the situation on
> >operand fetch is much more complex. Particularly if a procedure
has
> >recursed down a tree.
>
> I am not aware of a significant difference between C and Pascal in
the
> usage of stack frames. Don't they both use that for local storage? I could be all wet here but lexically there are only two levels in
C: those variables that are local and usually placed on the stack
and those that are global and permanently allocated. Blocks are not
nested in C.

Consider the nesting of a portion of the Pascal compiler:

Program Pascal
..Procedure Block
....Procedure Body
......Procedure Statement
........Procedure Expression
..........Procedure SimpleExpression
............Procedure Term

To get to a variable defined in Procedure Block from Procedure Term
the run time interpreter (or hardware) has to do indirect addressing
through the stack frames until it gets back to the one for Procedure
Block. Then it can calculate an actual address by adding in the
offset from the base of the stack frame to the actual variable.

That's a lot of indirection. And you can't really allocate
statically because of recursion. The value in a variable in the
most recent invocation of a procedure following recursion does not
have to be the same as an encompassing invocation.

Next week I am going to sit down and try to figure out all of the op
codes and how I plan to implement them. Once I have that laid out
in a spreadsheet it may be easier to see what the architecture will
look like.

This is going to be quite a project!

Richard > Rick Collins
>
> rick.collins@a...
>
> Arius - A Signal Processing Solutions Company
> Specializing in DSP and FPGA design http://www.arius.com
> 4 King Ave 301-682-7772 Voice
> Frederick, MD 21701-3110 301-682-7666 FAX





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Stack Machine & Memory Bandwidth - rtstofer - Oct 24 13:57:00 2004


<snip>
> Well, I do beleve, that 256 words may be not enough for
> more complex programs. However knowning the way stack is
> used (push, pop, peek and peek few elements below top of
> stack) you can relatively easy change internal fixed size
> stack to ring like based stack cache. The data are pushed
> on stack and if stack overflow appears memory write is
> generated to flush the oldest data and make place for new
> data. Same for pop - if there is not enough data for
> deepest possible peek the memory read cycle is generated.
> This can spare a lot of memory bandwidth and gives high
> performance gain compared to pure stack-in-shared-memory
> solution. This also gives you much more flexibility than
> on-chip stack only.
> I do beleve that 8 or 16 stack levels ring cache will be
> enough - most program grows stack by significant number of
> words then creates a lot of tiny jitter and shrinks it back
> in one move. The whole 'jitter' stuff will be cached and
> won't absorb memory. > regards,
> Tomasz Sztejka. Tomasz,

Intellectually, I might find it challenging to keep track of what is
in the ring cache and what is in RAM. I think the approach has
great merit - I don't know that, at this stage in my experience, I
can implement it. I know, in concept, how this would work but
writing VHDL to make it happen may be above me.

But, as I look further at the issues, I am beginning to think that I
could implement a 2 element cache, complete with 'dirty' bits for
the top two elements of the stack. This means that the stack
pointer really does point to the real location in memory and I don't
have to keep track of what is in registers and where the real stack
is located. All I have to do is a write-through and maintain
the 'dirty' bits.

I think I will still implement the Harvard architecture for a
separate program space later on. It means I have to change boards
at some point, but not today. For now I will do a separate
instruction fetch at the end of each instruction execution. I will
be able to pull that state out quite easily and implement it as a
parallel process.

An interesting idea about the ring cache...

Richard





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 24 14:19:00 2004

rtstofer wrote:

> But, as I look further at the issues, I am beginning to think that I
> could implement a 2 element cache, complete with 'dirty' bits for
> the top two elements of the stack. This means that the stack
> pointer really does point to the real location in memory and I don't
> have to keep track of what is in registers and where the real stack
> is located. All I have to do is a write-through and maintain
> the 'dirty' bits. But often the compilers for stack machines are very simple you have
alot of stack thrashing. What may be needed is a architecture
that can access to stack machines registers as a regular machine
for special cases of operations.
Ben.





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - Tomasz Sztejka - Oct 25 5:06:00 2004

--- rtstofer <> wrote: > <snip>
> > Well, I do beleve, that 256 words may be not enough for
> > more complex programs. However knowning the way stack
(...)
> Intellectually, I might find it challenging to keep track
> of what is
> in the ring cache and what is in RAM. I think the
> approach has
> great merit - I don't know that, at this stage in my
> experience, I
> can implement it. I know, in concept, how this would
> work but
> writing VHDL to make it happen may be above me.

This is very simple. All you have to do is to stop
thinking about HDL/Verilog and start thinking in terms of
gates, flip-flops and cycles. The HDL is only a langugae -
if you can draw a schematic by hand then coding it in any
HDL should be no problem (that's why I do prefer Verilog -
it has _wires_! )

Let's say we have:

SP - stack pointer, which points to first free element of
stack, where we would push data
N - which is the deepest peek range (or DUP_N in Forth
terms) (a power of two). 0 means only peek is possible, 1
that you can look at element one below element on top of
stack and so on;
M - is a capacity of ring (a power of two)
RP - this is a ring pointer, size necessary to carry M
possitions. Points to free place in ring.
K - actuall used space in ring.

(below pseudocode certailny contains mistakes)

on init:
K = N+1
SP = xxx
RP = 0
- stack contains N+1 trash words, but it does not matter -
you won't ever pop them;

push:
ring[RP] = data
RP++
SP++
K++
if (K==M)
{
generate memory write cycle:
memory[SP-M]=ring[RP]
K=M-1
}

pop:
--RP
--SP
--K
return value = ring[RP]
if (K==N)
{
generate memory read cycle:
ring[RP-N-1]=memory[SP-N-1]
}

stack switch:

push times M
set K = N+1
set SP = target SP + N +1
pop N times

the add operation:

ring[RP-1]=ring[RP-1]+ring[RP-2]
proceed like during pop operation.

As you can see, as long as M-N stack moves are generated no
memory cycle will be inserted. Having ring to be a
dual-port memory (or extra register for ring[RP-1] element)
allows all two arg operations to be executed within more or
less single cycle.

The second subject of this discussion appears to be how
compilers works and how they do generate code. In case of
stack machines it often looks diferrent that others
exposed.

You have following data structures:
- a work stack - this is like registers set in risc/cisc
machines. Local variables are not allocated there.
- a return stack - this stores jump targets, at other
branch bookeeping informations. It can be joined with work
stack but having it separate allows speed boost and more
safety;
- a frame stack - this is a pure software stack - you have
a frame pointer and indirect addressing opcodes. Frame
stack is managed by software and is used to store local
variables. Note, blocks in code ( the C or java { } ) does
not move frame pointer - this is up to compiler to find
right place for local variables within subroutine frame. No
multiple indirections there (there are also so called
'compiled stacks' which does not use indirection at all but
prevents recursion). At the entry to subroutine frame
pointer is incremented by known constant and at return is
decremented to restore previus value;
- a rest of memory, for dynamic variables heap, static
variables and so on.

The more variables compiler can allocate as static or can
find a fixed space in local variables frame the better it
works. I recommend to read JVM specs (brief scan through it
will give you a lot of informations about dos and dont's in
stack machines).
=====
Tomasz Sztejka
POLON ALFA
(work) http://www.polon-alfa.com.pl/
(private) http://www.sztejkat.prv.pl/ ___________________________________________________________ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - Author Unknown - Oct 25 9:31:00 2004

----- Original Message -----
> The more variables compiler can allocate as static or can
> find a fixed space in local variables frame the better it
> works. I recommend to read JVM specs (brief scan through it
> will give you a lot of informations about dos and dont's in
> stack machines). I would like to jump in here. The JVM is a very interesting stack machine. It's a little bit
different from FORTH. If you are interested how a JVM can be implemented in an FPGA
there is a design available at: http://www.jopdesign.com/
I've also a 'not jet published' paper especially on designing a stack architecture for the JVM
inside an FPGA. If you are interested I can send you a copy of it. It will get available on
the website after being published.

Martin





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 25 16:28:00 2004

Tomasz Sztejka wrote:

> This is very simple. All you have to do is to stop
> thinking about HDL/Verilog and start thinking in terms of
> gates, flip-flops and cycles. The HDL is only a langugae -
> if you can draw a schematic by hand then coding it in any
> HDL should be no problem (that's why I do prefer Verilog -
> it has _wires_! )

what about a real ring ... the data moves :)

Q-1[ 0 mulx ]
Q+1[1 ]--[D Q]--

Typical stack element for small rings.
While a poor idea for a FPGA this may
be a better idea if you ever want to port
to real logic in a custom chip.
Any how I am sticking to TTL logic rather
than using FPGA's since all my designs
have a front panel and I have large amount
of bulbs to light up, and a large TTL 4 bit
slice PCB has power to drive the panel.
A CPLD device needs about 128? clb's
and about 75 real i/o pins to replace the
ttl pcb. I suspect you will run into a similar
problem when sombody builds a machine like
the IBM1130.
Ben.






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 25 17:56:00 2004

Arius - Rick Collins wrote: > If you try to build the 1130 out of TTL, you will likely end up with a not
> so portable space heater! Minicomputers of the 70's were all built with
> TTL and they were big and hot! An FPGA is a *MUCH* better way to go since
> you can correct all your mistakes as if they were software, by just
> recompiling and downloading. No soldering iron required!

I use schematic entry for any FPGA work I have done in the past.
Moving to TTL (LS) is a advantage for me since the Logic is visable.
Also if I have a idea 3 am I don't need to turn on the computer and fire
up the software. With a 200 MHZ computer the recompiling took often
several hours before I could download the firmware, for the FPGA board
I had then. The main reason I stopped development work was I could
get a serial rom programmed, and the FPGA I was useing was TOO small
for the CPU design I was doing at the time.

> You can power plenty of light bulbs with the PSU to drive an FPGA. By
> light bulbs, I assume you really mean LEDs, right? I know you are going
> the nostalgia route, but light bulbs can be a PITA with the heat and short
> life. Even so, if you must have bulbs, use a transistor to drive the light
> bulbs from the FPGA output. Typical TTL does not have enough umph to drive
> a 60 mA light bulb.

I plan to use a 74ls06 driver -- ample current ( 30 ma ) for a led or a
bulb.
I might use CPLD's but right now I can allways port TTL to FPGA rather than
the other way around.
Ben.
PS. I think a IBM1130 1/4 size would be neat -- every thing scaled down,
but
the console selectric and the disk drives would be a real pain to have
functional
since nobody has 'mechanical' resources nowdays to machine parts.






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: Stack Machine & Memory Bandwidth - ben franchuk - Oct 25 19:33:00 2004

rtstofer wrote:
>
> --- In , ben franchuk <bfranchuk@j...> wrote:

> Well, if I build the 1130, it won't have any of the 'real' IO
> devices anyway.

The whole point of the IBM1130 Computing envorment
is PUNCHED CARDS - NOISEY LINE PRINTER - MASIVE DISK IO.
Did I also say Blinking lights and other stuff.

> What would be really neat is to have an FPGA with a
> network interface. Now my printer output can go to a networked
> printer.

How ever the printer and screen fonts need to be created for the correct
printing of the characters used in FORTRAN and say APL.

> I guess Disk IO could come from a networked file server and
> the console functions could go to a networked PC running a Visual
> Basic application.

But if you do want to play with a IBM1130 you can get a software
emuator today -- simh http://simh.trailing-edge.com/
Software may be hard to find nowdays.
Good luck with your FPGA projects.
Ben.




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

[Ad] FPGA Boards Massive Sale - Tony Burch - Oct 31 10:55:00 2004


Hi all,

B u r c h E D is having a discount sale. Hopefully many
of the items are of interest to FPGA-CPUers.
If you are already the user of a B u r c h E D system, it's
a great time to consider expanding your system with
some extra plug-on modules, or to upgrade with one of
our "hardware upgrade deals".

The ad for the sale is below.

Best regards,
Tony Burch

-----------------
FPGA Boards Massive Sale!

10% to 25% off the normal price of all items!
Sale ends 30 November 2004.

Download sale flyer at http://www.BurchED.biz/BurchEDSale.pdf
Website at http://www.BurchED.biz

Great savings for a limited time!

* 25% off B5-X300 FPGA Boards, including download cable.
Great FPGA boards for building real-world projects!
Largest number of accessible I/O pins.

* 20% off Super-Value-Packs.
B5-X300 and plug-on modules. Super-Super value!

* 20% off all plug-on modules.
Widest selection, including Switches, LEDS, SRAM, CF, ...

* 15% off all hardware upgrade deals, for current
B u r c h E D customers.

* 10% off Sydney-X1 and Sydney-X1E development systems.

* 10% off all accessories, including the new B5-Easy-Clips!

* 10% off all bulk-saver-packs!
Great value multiplied! Set up your university lab!

If you already have some other FPGA board, which was good
for evaluation or demo, then maybe now you're "past it"
and you need a board for building real-world designs,
or advanced student projects...
B u r c h E D boards have:
* The largest number of accessible I/O pins
* The widest selection of plug-on modules
* Are ideal for prototyping real-world designs
* Are economical for advanced student projects

At these sale prices, it is a great time to get a
B u r c h E D system, or some sets of systems.
Order online now, or ask us for a quote online now -
the sale ends on 30 November 2004!

B u r c h E D
Making FPGA Prototyping Easy
http://www.BurchED.biz




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )