Reply by whygee April 19, 2007
Hello,

Jim Granville wrote:
> whygee wrote:
>> However, off-chip programs are going to exist, and
>> a typical use of the VSP core includes a single (or a couple)
>> SDRAM chip (16-bit wide bus) so your streaming example
>> is easy to translate to SDRAM.
> Yes, but there is a class of design, that could mostly
> simply-stream-code-from-SPI.
> Yes, it is slower than SDRAM, but it is a lot cheaper and simpler too!
I'm too much of a speed freak to consider this ;-)
My early estimates show that at 10MHz core speed, the peak memory bandwidth can reach 3x4x10=120MB/s, so a 100MHz SDRAM chip with a 16-bit wide bus is ok. To compensate for the frequency difference, an 8x clock ratio is possible, so the SDRAM would run at 80MHz for example. To compensate for the access latency, there is also the possibility of using an SMT approach (with 4 independent threads sharing the core, in the straightforward CDC6600 PP fashion), and a DDR SDRAM chip would compensate for the increased bandwidth requirements.
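A quick check of the bandwidth figures above, as a sketch only. Reading "3x4x10" as 3 memory accesses per cycle, 4 bytes each, at 10MHz is an assumption (the post does not spell out the factors), and the SDRAM numbers are theoretical peaks that ignore refresh, row activation and latency. The function and parameter names are made up for illustration.

```python
# Assumption: "3x4x10" = 3 accesses per cycle x 4 bytes per access x 10 MHz.
def peak_core_bandwidth_mb_s(accesses_per_cycle=3, bytes_per_access=4, core_mhz=10):
    """Peak bandwidth requested by the core, in MB/s (10^6 bytes per second)."""
    return accesses_per_cycle * bytes_per_access * core_mhz


def sdram_peak_mb_s(bus_bits=16, clock_mhz=80, transfers_per_clock=1):
    """Theoretical peak of an SDRAM bus; transfers_per_clock=2 models DDR."""
    return (bus_bits // 8) * clock_mhz * transfers_per_clock


core = peak_core_bandwidth_mb_s()
sdr_80 = sdram_peak_mb_s(clock_mhz=80)
ddr_80 = sdram_peak_mb_s(clock_mhz=80, transfers_per_clock=2)
print(f"core needs {core} MB/s, SDR@80MHz gives {sdr_80} MB/s, DDR@80MHz gives {ddr_80} MB/s")
```

Under these assumptions a 16-bit SDR part at 80MHz has some headroom over the core's peak demand, and DDR doubles it, which is in line with the SMT remark.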
> The SPI memory bandwidths are getting fairly good - 150MBd is
> nearly 20Mbyte/sec, which is quite ok for many microcontroller tasks.
> Jumps are more costly, hence this discussion on short-skips.
but 20MB/s is still too slow for most tasks the VSP is meant to perform (streaming data to/from mass storage into/from signal coprocessors).
> In such a system, you might want to lock some small code into BRAM,
> for interrupts, but that could be handled with a simple address
> compare and a simple duplicate of code - the size needed is so small,
> you'd just build-time-copy the BRAM mapped stuff, into FPGA config,
> and also simply leave it in the SPI memory.
> A next-step would be to have this locked INT memory, and add
> a BRAM cache that is less memory fixed, but HW complexity of
> the next-step is higher, and the system is less deterministic.
"Memory is like an orgasm. It's a lot better if you don't have to fake it." (Seymour Cray)
16KB of on-chip fast SRAM is a good start, which also helps reduce the power drain of the SDRAM interface. And for the prototypes (if any), if the FPGA doesn't have enough room, i still have a collection of 17ns 32KB asynch cache SRAMs from old PC motherboards ;-)
>>> I've also seen Conditional RET encoded, which used an otherwise
>>> unused field from the conditional jump variants, and that looked
>>> like a useful idea - esp. for assembler coding.
>>
>> where ?
>
> hmm... I think it was the PicoBlaze that has conditional returns ?
I don't remember.
Note that the VSP's "Q" instruction group has unconditional and conditional versions, which are used both for call and return.
(see http://f-cpu.seul.org/whygee/vspsim/doc/opcode_map.html )
The mechanism is a bit... unusual but worth the exploration :-)
> Another feature to look at on Embedded Controller cores, is a
> Fixed interrupt response time - ie you remove opcode-length jitter,
> so a timer interrupt will be truly time-locked. Typically, this
> just means extending the shortest opcodes INT reaction times, but
> it does not impact the longest INT response times.
By design, each VSP instruction takes the same time so it's not an issue (yes, even for jumps/skips/call/return/whatever, which also explains why it can't be aggressively pipelined and is limited to the tens of MHz ballpark).

But i don't see where a difference of 100 or 200ns in IRQ response time can be critical. If something is so important, I implement it directly in HW :-) And to make small jitter tolerable for data streams, a FIFO usually does the job well. With audio systems (my main target), up to 10µs of jitter is tolerable because FIFOs are everywhere, and delta-sigma converters' latency is often longer. 10µs are enough for 100 instructions at 100ns, an eternity...

Furthermore, the "interrupt response time" is not always a good measure, because it depends on what you count (time to execute the first ISR instruction ?) and many parameters (most are context-specific) play a significant role. For example, the registers usually need to be saved : flushing the register set to memory, loading new values, etc. takes a time proportional to the register set's size. But even that is not always accurate because the interrupt routine could need only a few registers (at least in the beginning) to service the IRQ.

So if acknowledging an IRQ needs 3 registers (just by hypothesis), then 7 instructions are needed (3 save the registers, 3 load new values, and one toggles the acknowledge bit). If the instruction cycle time is 100ns (10MHz), then it takes roughly 800ns (including IRQ signal sampling and associated jitter) to answer. So the jitter is mostly due to the IRQ signal sampling electronics (if the signal comes at the beginning or end of the cycle etc.) Nothing i can reasonably do here.

So i have thought about the interrupts. However, i have not implemented them yet in the JavaScript simulator. The memory system is much more critical.

have a nice day,
> -jg
yg
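
The latency estimate in the post above can be reproduced with a few lines. This is only a sketch: counting one extra cycle for IRQ signal sampling is an assumption chosen to match the "roughly 800ns" figure, and the function names are hypothetical.

```python
def irq_response_ns(regs_needed=3, cycle_ns=100, sampling_cycles=1):
    """Time to acknowledge an IRQ when every instruction takes one fixed cycle.

    Instruction count follows the post: save each register, load each new
    value, then one instruction to toggle the acknowledge bit.
    sampling_cycles is an assumed allowance for IRQ signal sampling.
    """
    instructions = 2 * regs_needed + 1
    return (instructions + sampling_cycles) * cycle_ns


def instructions_within(budget_ns=10_000, cycle_ns=100):
    """How many fixed-length instructions fit in a jitter budget."""
    return budget_ns // cycle_ns


print(irq_response_ns(), "ns to acknowledge;",
      instructions_within(), "instructions fit in 10 us")
```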
Reply by Jim Granville April 18, 2007
whygee wrote:
> Hello,
>
> I am often away for a few days, and work a lot on
> http://f-cpu.seul.org/whygee/vspsim/
> It should now work a bit better (still under Mozilla/Firefox only).
>
> I have not received any personal reply,
> OTOH the posts in this thread are interesting.
> Jim Granville noticed :
>
>> Another benefit of a short-skip opcode, is for a core you
>> wish to feed from Serial memory : SPI Flash is getting faster
>> all the time [Winbond have 150MBd streaming], so the sequential
>> access time is reasonable, but a branch is more costly.
>> That means a skip makes sense, as it does not spawn a new address,
>> and for small distances, that is faster than the jump.
>
> I have never thought about this, because i think that the most used
> instructions will be stored in on-chip SRAMs. SPI Flash would be used
> for bootstrap only, probably with an Alpha-like method (fill the cache
> from external SPI then let the CPU execute from address 0).
>
> However, off-chip programs are going to exist, and
> a typical use of the VSP core includes a single (or a couple)
> SDRAM chip (16-bit wide bus) so your streaming example
> is easy to translate to SDRAM.
Yes, but there is a class of design, that could mostly simply-stream-code-from-SPI. Yes, it is slower than SDRAM, but it is a lot cheaper and simpler too!

The SPI memory bandwidths are getting fairly good - 150MBd is nearly 20Mbyte/sec, which is quite ok for many microcontroller tasks. Jumps are more costly, hence this discussion on short-skips.

In such a system, you might want to lock some small code into BRAM, for interrupts, but that could be handled with a simple address compare and a simple duplicate of code - the size needed is so small, you'd just build-time-copy the BRAM mapped stuff, into FPGA config, and also simply leave it in the SPI memory.

A next-step would be to have this locked INT memory, and add a BRAM cache that is less memory fixed, but HW complexity of the next-step is higher, and the system is less deterministic.
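
To put rough numbers on the skip-versus-jump argument for code streamed from SPI flash, here is a sketch. It assumes a single-bit SPI link at 150 Mbit/s, 16-bit instruction half-words, and that a jump means re-issuing a read command made of 8 command bits, 24 address bits and 8 dummy bits. That framing is an assumption for illustration; real parts differ, so check the datasheet.

```python
SPI_BITS_PER_SEC = 150_000_000  # "150MBd", assumed to be one bit per clock


def stream_mbytes_per_sec(bit_rate=SPI_BITS_PER_SEC):
    """Sequential streaming rate in MB/s (10^6 bytes per second)."""
    return bit_rate / 8 / 1e6


def skip_cost_bits(halfwords):
    """A skip just keeps clocking: the skipped half-words are read and dropped."""
    return 16 * halfwords


def jump_cost_bits(command_bits=8, address_bits=24, dummy_bits=8):
    """A jump restarts the read: new command, new address, dummy cycles (assumed framing)."""
    return command_bits + address_bits + dummy_bits


for n in range(1, 5):
    verdict = "skip wins" if skip_cost_bits(n) < jump_cost_bits() else "jump wins"
    print(f"skip {n} half-word(s): {skip_cost_bits(n)} bit times vs jump {jump_cost_bits()}: {verdict}")
```

With this framing the streaming rate is 18.75 MB/s ("nearly 20Mbyte/sec"), and only the shortest skips beat a jump, which is the "for small distances" qualifier in the argument.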
>> I've also seen Conditional RET encoded, which used an otherwise
>> unused field from the conditional jump variants, and that looked
>> like a useful idea - esp. for assembler coding.
>
> where ?
hmm... I think it was the PicoBlaze that has conditional returns ?

Another feature to look at on Embedded Controller cores, is a Fixed interrupt response time - ie you remove opcode-length jitter, so a timer interrupt will be truly time-locked. Typically, this just means extending the shortest opcodes INT reaction times, but it does not impact the longest INT response times.

-jg
Reply by whygee April 18, 2007
Hello,

I am often away for a few days, and work a lot on
http://f-cpu.seul.org/whygee/vspsim/
It should now work a bit better (still under Mozilla/Firefox only).

I have not received any personal reply,
OTOH the posts in this thread are interesting.

   -~o0o~-

Tim Wescott wrote :
> As long as you have a one-instruction bit set, I can synthesize a carry bit so I'm mostly happy.
I'm not sure I understand. Can you give an example ?
> There are times when I have done assembly language coding that I have found it convenient
> to wait a bit before I checked a condition bit, but I could probably cope with an
> add-skip-no-carry instruction (ASNC -- odd, but it'd do).
With a skip-on-no-carry, you can set a register or memory location to a value that will be checked later. However, most asm i have done uses the carry immediately, often to do something "else". Bresenham-like algorithms come to my mind, and there are others.
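
A minimal model of the add-skip-no-carry idea discussed here, written as a sketch rather than as the VSP's actual semantics. The 32-bit register width and all function names are assumptions. The first helper "synthesizes" a carry bit into a register with one conditional set, so it can be checked later; the second uses the carry immediately to extend an addition to 64 bits.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def add_skip_no_carry(a, b):
    """Return (sum mod 2^32, skip); skip is True when NO carry was generated."""
    total = a + b
    return total & MASK32, total <= MASK32


def add_with_saved_carry(a, b):
    """Keep the carry for later: clear a register, add, then one 'set' that may be skipped."""
    carry_reg = 0
    result, skip = add_skip_no_carry(a, b)
    if not skip:          # the instruction after the add runs only on carry
        carry_reg = 1
    return result, carry_reg


def add64(lo, hi, addend):
    """Use the carry immediately: add a 32-bit value to the 64-bit pair hi:lo."""
    lo, skip = add_skip_no_carry(lo, addend)
    if not skip:
        hi = (hi + 1) & MASK32
    return lo, hi


print(add_with_saved_carry(0xFFFFFFFF, 1))
print(add64(0xFFFFFFFF, 0, 1))
```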
> If you're inventing an instruction set, remember that the PowerPC architecture
> has an EIEIO instruction. Please try to top it.
hmmm i'm not trying to compete with IBM :-) i'm trying to make "something cool, fun and maybe useful" (and it's certainly very instructive, it's at least a great way to learn JavaScript). Have a look at
http://f-cpu.seul.org/whygee/vspsim/doc/opcode_map.html
and tell me what names sound/look weird (or too obscure).

   -~o0o~-

Paul Taylor suggested :
> With regards to SHR, SAR, SHL, ROL, ROR, are ROL and ROR _really_
> necessary? Not trying to discourage you from implementing them. My reason
> for asking is that I playing with a compact 16-bit design, and I looked at
> those and decided that I could sacrifice them.
> Regards,
> Paul.
I can't name a program that I have written that does not use shifts. I even see the absence of a rotation operator in C as a plague.

My VSP (I also discovered later that this name is also used by others, if someone can find a better name, please apply :-P) was designed for interactive/multimedia stream processing (like : ID3 tag parsing) and user I/O (LCD matrix). These applications require a certain amount of bit and byte-level processing. Byte-level is ok (look at the IE group of instructions), SHL provides some necessary functions but i'm still not satisfied when it comes to bit stream insertion/extraction. I'm limited to 2 reads and 1 write (with often the same address).

So yes, these 5 "bit shuffling" instructions are necessary and IMHO not enough. I have probably found an answer in the Cray1 architecture manual, with one clever trick, but I don't know how/if i can implement it here.

   -~o0o~-

Walter Banks remarked :
> You can certainly sacrifice left operations. Right operations will depend a
> lot on the rest of the instruction set. A single barrel rotate can replace them all.
I see ROL/ROR/SHL/SHR as different ways to use a shifter. In the code i have written so far, i have not noticed a preference for a specific direction. I have also examined the possibility of having only one rotation direction but this could create problems at the algorithmic level. The opcode space is still quite comfortable and i have seen no way or reason to remove one of these opcodes.

Walter then added :
> There are quite a few processors that don't have a > condition code register.
Right (MIPS, Alpha come to my mind). That's one of the RISC methodology cornerstones. From my point of view, adding a separate register is a lot of trouble, because new specific instructions must be included.
> For extended math yours is > one approach but you can also use some form of > chained multiprecision math.
chained ? I don't know this method.
> Multiprecision operations > with 32 bit processors probably could be dropped with > very little impact on most applications.
Multiprecision is not the primary purpose. Overflow detection is much more common.
> Conceptually skip and conditional skip are powerful tools
> that can be used in clever combinations. Generally more
> skip conditions can be used than conventional conditional
> branches. A lot of thought needs to be put into what happens
> with sequential skip instructions. Is a skip treated as a
> pre-another instruction or a separate instruction?
I'm not sure about what you mean but here is an example of VSP code :

  ; Addition of R2 to the 64-bit value R0:R1
  adds2 r2 r0  ; r0 = r0+r2
  ; The next instruction is skipped if no carry was generated
  add 1 r1     ; carry : r1 = r1+1 (long form : 2 half-words)

The core computes the address of the next instruction at every cycle. Either it's a whole new address (then the prefetch mechanism is critical), or the skip advances a small counter that addresses the prefetch buffer. My idea is to do the following in parallel, during the same cycle:
- the prefetch buffer automatically advances by 1 or 2 half-words (16 or 32 bits)
- the new pointer into the prefetch buffer is computed in the early stages of the pipeline (add 1 or 2 half-words to the given value, plus 1 because skip 0 is equivalent to no-skip)
- the addition is performed and if a carry does not occur, then the above computed pointer is committed into the buffer instead of the automatically advanced pointer.

But that mechanism will be implemented later, i want to make sure that the instruction set is satisfactory first.

   -~o0o~-

Terran Melconian asked :
> How about for implementing multiplication and division?
This reminds me that the core has no multiplier, because it is not meant to compute stuff, only to move data around. So if multiplies must be implemented, a bit-by-bit version is a good compromise (complexity/latency/size, because a Booth multiply array is obviously overkill).

I have two options : either create "multiply/divide step" instructions, or build a separate, asynchronous unit (accessible through special registers). Both have drawbacks :
- mulstep/divstep instructions would use some amount of program space, and occupy the core. Also, i'm not sure how to implement the instructions.
- a separate, asynchronous unit would allow the core to execute other instructions in parallel. The program would write the 2 operands to the input registers, then poll until the multiplier has finished. The problem ? I intend the VSP to become SMT later. So several threads could compete for the access to this "shared" unit.

Any suggestion is welcome (and will be integrated if it is elegant)
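
For reference, the bit-by-bit multiply mentioned above is the classic shift-and-add scheme. The sketch below shows what one "multiply step" could compute; it is not a proposal for the actual VSP encoding, and `mul_step`, `multiply` and the 16-bit operand width are assumptions for illustration.

```python
def mul_step(acc, multiplicand, multiplier):
    """One step: add the multiplicand if the low multiplier bit is set, then shift both."""
    if multiplier & 1:
        acc += multiplicand
    return acc, multiplicand << 1, multiplier >> 1


def multiply(a, b, width=16):
    """Unsigned multiply of two width-bit values using `width` multiply steps."""
    acc = 0
    for _ in range(width):
        acc, a, b = mul_step(acc, a, b)
    return acc


print(multiply(123, 456))
```

Each step is one test, one optional add and two shifts, so the "test the low bit, skip the add if it is clear" pattern maps naturally onto a conditional skip. A separate unit would run the same loop in hardware while the core does something else.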
> I often use them for serialization and deserialization of I/O data
> streams when that is being done in software.
"Bit banging" is often a major headache. I tried to take this into account.

   -~o0o~-

Jim Granville noticed :
> I think you have a variable-length skip - which is a good idea.
There are good reasons for this, on top of the pure coolness factor. The most important aspect is that the instructions are variable-length too (but they are quite simple, anyway). So the decoding logic has probably not yet read or decoded the next instructions, and may not know how long they are. The assembly software must compute the skip length so i thought, if the core can skip 1 or 2 half-words, why not 3 or 4.

More would create problems, though, and i'll have to make sure that the prefetching mechanisms can prepare instructions fast enough to keep the instruction buffers filled with at least 2+4+2=8 16-bit words, or 16 bytes, or 128 bits... Longer skips would create a fetching penalty so i stick to 2 bits.
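
The pointer arithmetic behind the variable-length skip and the 2+4+2 buffer estimate can be sketched as follows. This is one reading of the description, not the implemented design: encoding the 2-bit field as "skip length minus one" (so field values 0..3 mean 1..4 half-words) is an assumption, as are the function names and the one-half-word length used for the add-with-skip in the usage line.

```python
def next_pointer(ptr, instr_len, skip_field, carry):
    """Next instruction pointer, in half-words.

    instr_len:  length of the current instruction (1 or 2 half-words)
    skip_field: 2-bit field, assumed to encode a skip of skip_field + 1 half-words
    carry:      True if the addition produced a carry (then nothing is skipped)
    """
    sequential = ptr + instr_len              # the automatically advanced pointer
    skipping = sequential + skip_field + 1    # computed in parallel, early in the pipeline
    return sequential if carry else skipping


def worst_case_buffer_halfwords(max_instr_len=2, skip_bits=2):
    """Current instruction + longest skip + longest instruction at the target."""
    max_skip = 1 << skip_bits
    return max_instr_len + max_skip + max_instr_len


# A 1-half-word add-with-skip followed by a 2-half-word instruction (field value 1):
print(next_pointer(0, 1, 1, carry=True), next_pointer(0, 1, 1, carry=False))
print(worst_case_buffer_halfwords(), "half-words =", worst_case_buffer_halfwords() * 16, "bits")
```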
> Another benefit of a short-skip opcode, is for a core you
> wish to feed from Serial memory : SPI Flash is getting faster
> all the time [Winbond have 150MBd streaming], so the sequential
> access time is reasonable, but a branch is more costly.
> That means a skip makes sense, as it does not spawn a new address,
> and for small distances, that is faster than the jump.
I have never thought about this, because i think that the most used instructions will be stored in on-chip SRAMs. SPI Flash would be used for bootstrap only, probably with an Alpha-like method (fill the cache from external SPI then let the CPU execute from address 0). However, off-chip programs are going to exist, and a typical use of the VSP core includes a single (or a couple) SDRAM chip (16-bit wide bus) so your streaming example is easy to translate to SDRAM.
> Some CPUs have conditional fields in the opcodes, which mean they
> can skip. It tends to be wasteful, as this is not often needed, but the
> CC bits come along for the ride anyway.
Condition Code Registers ... what a pain...
> I've also seen Conditional RET encoded, which used an otherwise
> unused field from the conditional jump variants, and that looked
> like a useful idea - esp. for assembler coding.
where ?
> Have you looked at the Lattice Mico8, and PicoBlaze / PacoBlaze
> SoftCPUs - they have some good 'compact' ideas.
I am not trying to make "the most compact code ever". Often, this requires a lot of instruction-specific fields here and there in the instruction word, and their proliferation is harmful for decoding speed and complexity. For example, the VSP uses only one immediate field (16 bits should be enough for most instructions ;-P) OTOH i have not found a way to use a single place for the 2-bit skip length field (it's in bits 6-7 in the ADDSx instructions, but in bits 8-9 for conditional skip instructions). Compromises...

   -~o0o~-

Ulf Samuelsson wrote :
> COP800 , HPC16xxx...
This remark made me check what the COP8 is and i have found an instruction that decrements, then skips if the result is zero. That's used for loops and it's similar to one PIC instruction. So all I did was generalise this idea. cool :-)

Thanks everybody for the read,
Yann Guidon
http://ygdes.com
http://f-cpu.org
Reply by David Brown April 16, 2007
Walter Banks wrote:
> Jim Granville wrote:
>
>> I think you have a variable-length skip - which is a good idea.
>> IIRC the COP8 had a packed jump, that was only 1 byte, and of
>> limited reach (but efficient).
>>
>> Another benefit of a short-skip opcode, is for a core you
>> wish to feed from Serial memory : SPI Flash is getting faster
>> all the time [Winbond have 150MBd streaming], so the sequential
>> access time is reasonable, but a branch is more costly.
>> That means a skip makes sense, as it does not spawn a new address,
>> and for small distances, that is faster than the jump.
>
> The COP8 is a remarkably compact instruction set. (We wrote a
> C compiler for it)
I've done lots of assembly programming for the COP8. It's nice in some ways, and fairly compact, as long as you don't need much speed and can be smart about ram bank switching.
> 1) It used a lot of the instruction space for jumps and calls. 31
> opcodes were used for branches you referred to as was a
> 2 byte in page branch and a 3 byte branch anywhere.
> It had 2 and 3 byte calls
That's nice for the jumps, but it means less space for other features.
> 2) The COP8 has a swap instead of a store (But does have
> a load). The swap saves a lot of temp space operations in
> expression evaluation.
The swap instruction is very useful. However, you end up with a lot of swaps followed by loads to simulate a save - it's not clear to me that this is a win.
> 3) The COP8 was implemented as a bit serial alu which made
> swap a very low cost instruction to implement. Most RMW
> instructions are very low cost bit serial (INC,DEC CLR set to 1 -1
> for example)
These are low cost, assuming you are accessing [B] instead of direct access to specific memory locations. With direct memory access, you have (IIRC) 3 bytes and 4 cycles for such operations - and direct memory access is extremely common. The bit serial nature of the cpu means that every instruction cycle takes 10 clock cycles, so something like a call instruction takes 50 clock cycles.
> 4) The interrupt service in the COP8 is worth looking at. It is implemented
> as a combination of minimum hardware and specialized instructions to create
> a vectored interrupt system. Most of the logic is software.
It's worth looking at, and avoiding. A basic interrupt setup that will save and restore a few critical registers and jump to vectored interrupt routines has an overhead (for the save and restore) of around 80 instruction cycles - that's 800 clock cycles, which is more what you would expect from a PC cpu than a microcontroller. You can get a bit faster if you don't need to save and restore registers.

The COP8 has its advantages as a microcontroller - it's a solid and robust device. But its cpu core is not great.
> 5) Several processors have software I/O devices SX is well known. One that
> should also be looked at is how the Z8 handled its serial port.
>
> w..
Reply by David R Brooks April 16, 2007
Walter Banks wrote:
> Paul Taylor wrote:
>
>> On Sun, 15 Apr 2007 08:07:24 -0400, Walter Banks wrote:
>>
>>> You can certainly sacrifice left operations. Right operations will depend a
>>> lot on the rest of the instruction set. A single barrel rotate can replace
>>> them all.
>> I tentatively decided to have a shrc, shlc and shra - shift right through
>> carry, shift left through carry, and shift right arithmetic. I decided on
>> just those three because I figured that the shift operations don't get
>> used that much - mostly to multiply or divide by two on occasion. However,
>> using this scheme, shifting logically takes two instructions - a clear
>> carry instruction followed by the shrc/shrl instruction. But I can live
>> with that. It means my instruction set is a bit smaller - I have taken
>> this approach through the whole design.
>
> I have found that ASR is more important for general purpose
> computing than either LSR or ROR. I have dealt with many
> processors that did not have shift with carry and some that
> did not. Either is not a particularly big problem for code
> generation.
>
> I did not explain my barrel shift point earlier. Barrel shift or
> rotate is a very effective method of field extraction.
Rotates are also found at the core of many cryptographic algorithms, if you see that as a potential application area.
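
As an illustration of the rotate primitive in question (and of what has to be written by hand in a language without a rotation operator), here is a 32-bit rotate built from two shifts and an OR. The 32-bit width and the function names are assumptions for the example; no particular cipher is implied.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (x is assumed to be in 0..2^32-1)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32


print(hex(rotl32(0x12345678, 8)), hex(rotr32(0x00000003, 1)))
```

On a core with a native rotate this is one instruction; without it, each rotate costs two shifts and an OR, which adds up in round functions that rotate several words per round.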
Reply by Ulf Samuelsson April 16, 2007
> My question :
> Do you know of any processor architecture where the carry
> of the addition is not stored in a condition code register,
> but (instead) the core skips the next instruction(s) ?
>
> I have recently come to this idea because of many self-imposed
> limitations, like the existence of only one write port
> to the register set. And skipping is a nice alternative
> because the carry bit is often used as a condition for a jump,
> so the current solution jumps immediately.
COP800 , HPC16xxx...

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Everett M. Greene April 16, 2007
Walter Banks <walter@bytecraft.com> writes:
> Paul Taylor wrote:
>
> > With regards to SHR, SAR, SHL, ROL, ROR, are ROL and ROR _really_
> > necessary? Not trying to discourage you from implementing them. My reason
> > for asking is that I playing with a compact 16-bit design, and I looked at
> > those and decided that I could sacrifice them.
>
> You can certainly sacrifice left operations. Right operations will depend a
> lot on the rest of the instruction set. A single barrel rotate can replace
> them all.
Except for sign-extended and logical shifts.
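
A small sketch of why the exception matters: a rotate alone wraps bits around, so a logical shift needs an extra mask and a sign-extended shift needs an extra fill of the vacated bits. The 16-bit width and the function names are assumptions for illustration.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def ror(x, n):
    """Rotate right within WIDTH bits."""
    n %= WIDTH
    return ((x >> n) | (x << (WIDTH - n))) & MASK


def lsr_via_ror(x, n):
    """Logical shift right = rotate, then clear the n bits that wrapped to the top."""
    return ror(x, n) & (MASK >> n)


def asr_via_ror(x, n):
    """Arithmetic shift right = logical shift, then fill the top n bits with the sign."""
    fill = (MASK << (WIDTH - n)) & MASK if x & SIGN else 0
    return lsr_via_ror(x, n) | fill


print(hex(ror(0x8001, 1)), hex(lsr_via_ror(0x8001, 1)), hex(asr_via_ror(0x8001, 1)))
```

So a barrel rotate covers the data movement, but the mask or sign fill still has to come from somewhere, either extra instructions or extra hardware around the rotator.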
Reply by robe...@yahoo.com April 15, 2007
On Apr 14, 7:14 pm, whygee <why...@yg.yg> wrote:
> Hello,
>
> so I'm playing with http://f-cpu.seul.org/whygee/vspsim/
> and developing a completely new instruction set,
> along with an architecture, tools etc...
> in JavaScript (before I translate to C and VHDL).
> An overall description of the core is available at http://f-cpu.seul.org/whygee/vspsim/doc/vsp.html
> [note that it is always under construction so some parts don't work]
>
> My question :
> Do you know of any processor architecture where the carry
> of the addition is not stored in a condition code register,
> but (instead) the core skips the next instruction(s) ?
>
> I have recently come to this idea because of many self-imposed
> limitations, like the existence of only one write port
> to the register set. And skipping is a nice alternative
> because the carry bit is often used as a condition for a jump,
> so the current solution jumps immediately.
>
> I know of many ISAs and architectures but I have not seen
> this before. Does anyone know a similar approach ?
> I post this here because this is more likely to be used
> in other small and embedded CPUs,
> rather than the large, server-scale CPUs of comp.arch.
>
> YG (you can reply to the address at the top
> of the first link of this page)
Reply by Terran Melconian April 15, 2007
On 2007-04-15, Paul Taylor <paul_ng_pls_rem@tiscali.co.uk> wrote:
> I figured that the shift operations don't get used that much - mostly > to multiply or divide by two on occasion. However, using this scheme,
How about for implementing multiplication and division? I often use them for serialization and deserialization of I/O data streams when that is being done in software.
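
As a sketch of the serialization use case: software "bit banging" typically shifts a register one position per bit time and sends (or collects) the bit that falls out. The example below models MSB-first transfer of one byte; the function names are hypothetical and no real I/O pin is touched.

```python
def serialize_msb_first(byte):
    """Shift a byte left eight times, collecting the bit shifted out each time."""
    reg = byte & 0xFF
    bits = []
    for _ in range(8):
        reg <<= 1
        bits.append(reg >> 8)   # the bit that left the 8-bit register (the carry)
        reg &= 0xFF
    return bits


def deserialize_msb_first(bits):
    """Shift incoming bits into the low end of a register, MSB first."""
    reg = 0
    for bit in bits:
        reg = ((reg << 1) | (bit & 1)) & 0xFF
    return reg


bits = serialize_msb_first(0xA5)
print(bits, hex(deserialize_msb_first(bits)))
```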
Reply by Paul Taylor April 15, 2007
On Sun, 15 Apr 2007 10:25:48 -0400, Walter Banks wrote:
 
> I did not explain my barrel shift point earlier. Barrel shift or
> rotate is a very effective method of field extraction.
Ah, good point. That was not in my mind earlier.

Regards,

Paul.
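
A minimal sketch of the field-extraction point: rotate the word so the wanted field sits at bit 0, then mask to the field width. The 16-bit word size, the function name and its parameters are assumptions for illustration.

```python
def extract_field(word, lsb, width, word_bits=16):
    """Extract `width` bits starting at bit `lsb` using one rotate and one mask."""
    word_mask = (1 << word_bits) - 1
    rotated = ((word >> lsb) | (word << (word_bits - lsb))) & word_mask  # rotate right by lsb
    return rotated & ((1 << width) - 1)


print(hex(extract_field(0xABCD, 4, 8)), hex(extract_field(0xABCD, 12, 4)))
```

With a barrel rotator the rotate is a single-cycle operation whatever the field position, so extraction costs two instructions instead of a loop of one-bit shifts.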