Reply by Klaus Kragelund December 26, 2018
An STM8 or other 8-bit device can be had for about 15 cents in volume, with 8 kB of flash.

Development of a 1-bit device, with NRE and low volume since you will be the only customer, will cost much more.

Cheers

Klaus
Reply by Philipp Klaus Krause November 11, 2018
On 12.10.18 22:45, upsidedown@downunder.com wrote:
> On Fri, 12 Oct 2018 22:06:02 +0200, Philipp Klaus Krause <pkk@spth.de> > wrote: > >> Am 12.10.2018 um 20:30 schrieb upsidedown@downunder.com: >>> >>> The real issue would be the small RAM size. >> >> Devices with this architecture go up to 256 B of RAM (but they then cost >> a few cent more). >> >> Philipp > > Did you find the binary encoding of various instruction formats, i.e > how many bits allocated to the operation code and how many for the > address field ? > > My initial guess was that the instruction word is simple 8 bit opcode > + 8 bit address, but the bit and word address limits for the smaller > models would suggest that for some op-codes, the op-code field might > be wider than 8 bits and address fields narrower than 8 bits (e.g. bit > and word addressing). >
It is more complicated. Apparently the encoding changed from a 16-bit instruction word used by older types (https://www.mikrocontroller.net/topic/461002#5616813) to a 14-bit instruction word used by newer types (https://www.mikrocontroller.net/topic/461002#5616603). Padauk also dropped and added various instructions at some points (e.g. ldtabh, ldtabl, mul, pushw, popw). Philipp
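Just to make the trade-off concrete: with a fixed-width instruction word, every bit given to the opcode is taken away from the operand field. A toy decoder in C (the 6-bit opcode / 8-bit operand split below is purely an assumption for illustration; the real Padauk encodings use different widths for different instruction groups):

#include <stdint.h>
#include <stdio.h>

/* Toy decode of a 14-bit instruction word. The 6/8 field split is an
   assumption for illustration only, not the documented Padauk encoding. */
static void decode14(uint16_t iw)
{
    iw &= 0x3FFF;                            /* 14-bit instruction word */
    uint8_t opcode  = (uint8_t)(iw >> 8);    /* upper 6 bits */
    uint8_t operand = (uint8_t)(iw & 0xFF);  /* lower 8 bits: RAM/IO address or immediate */
    printf("opcode 0x%02X, operand 0x%02X\n", opcode, operand);
}

int main(void)
{
    decode14(0x2F55);
    return 0;
}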
Reply by Philipp Klaus Krause November 9, 2018
On 8.11.18 23:35, upsidedown@downunder.com wrote:
>>>> And the linker would have to analyze the call graph to always >>>> call the correct function for each thread. >>> >>> Linker for such small target ? >> >> Of course. The support routines the compiler uses reside in some >> library, the linker links them in if necessary. Also, the larger >> variants are not that small, with up to 256 B of RAM and 8 KB of ROM. >> One might want to e.g. have one .c file for handling I²C, one for the >> soft UART, etc. > > A linker is required, if the libraries are (for copyright reasons) > delivered as binary object code only. > > However, if the library are delivered as source files and the > compiler/assembler has even a rudimentary #include mechanism, just > include those library files you need. With a include or macro > processor with parameter passing, just invoke same include file or > macro twice with different parameters for different static variable > instances. > > Of course, linkers are also needed, if very primitive compilation > machines are used, such as floppy based Intellecs or Exorcisers. It > could take a day to compile a large program all the way from sources, > with multiple floppy changes to get the final absolute file to a > single floppy, ready to be burnt into EPROMS for an additional hour or > two. In such environment compiling, linking and burning only the > source file changed would speed up program development a lot. > > When using a modern PC for compilation, there are no such issues. >
Separate compilation and then linking is the normal thing to do, and a common workflow for small devices. This is e.g. how most people use SDCC, a mainstream free compiler targeting various 8-bit architectures. That doesn't mean it is the only way (and since SDCC does not have link-time optimization, it might not be the optimal way either). But it is something people use and expect to work reasonably well. So for anyone designing an architecture it would be wise not to put too many obstacles into that workflow. Philipp
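For illustration, the usual SDCC flow on such a target: each module is compiled to a .rel object file and everything is linked at the end. File and function names below are invented, and STM8 is used only as an example target since that is a backend SDCC already has:

/* soft_uart.c -- one translation unit among several; compiled separately
   and linked afterwards, e.g.:
       sdcc -mstm8 -c soft_uart.c        (produces soft_uart.rel)
       sdcc -mstm8 main.c soft_uart.rel  (links, produces main.ihx)
   All names here are made up for the example. */
#include <stdint.h>

static uint8_t tx_busy;      /* module-local state */

void uart_putc(uint8_t c)
{
    tx_busy = 1;
    /* ... bit-bang the byte out on a GPIO pin here ... */
    (void)c;
    tx_busy = 0;
}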
Reply by upsidedown@downunder.com November 8, 2018
On Thu, 8 Nov 2018 21:56:16 +0100, Philipp Klaus Krause <pkk@spth.de>
wrote:

>Am 08.11.18 um 20:52 schrieb upsidedown@downunder.com: >>> >>> But static memory allocation would require one copy of each function per >>> thread. >> >> For a foreground/background monitor, the worst case would be two >> copies of static data, if both threads use the same rubroutine. >> >>> And the linker would have to analyze the call graph to always >>> call the correct function for each thread. >> >> Linker for such small target ? > >Of course. The support routines the compiler uses reside in some >library, the linker links them in if necessary. Also, the larger >variants are not that small, with up to 256 B of RAM and 8 KB of ROM. >One might want to e.g. have one .c file for handling I²C, one for the >soft UART, etc.
A linker is required if the libraries are (for copyright reasons) delivered as binary object code only.

However, if the libraries are delivered as source files and the compiler/assembler has even a rudimentary #include mechanism, just include those library files you need. With an include or macro processor with parameter passing, just invoke the same include file or macro twice with different parameters to get different static variable instances.

Of course, linkers are also needed if very primitive compilation machines are used, such as floppy-based Intellecs or Exorcisers. It could take a day to compile a large program all the way from sources, with multiple floppy changes to get the final absolute file onto a single floppy, ready to be burnt into EPROMs for an additional hour or two. In such an environment, compiling, linking and burning only the source file that changed would speed up program development a lot.

When using a modern PC for compilation, there are no such issues.
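A rough sketch of that include/macro trick in C, with the "library" routine expanded once per instance under a different name prefix. All names below are invented for the example, and CAT is just a token-pasting helper:

/* Imagine soft_timer.inc is the vendor-supplied source:
 *     static unsigned char CAT(NAME,_ticks);
 *     static void CAT(NAME,_tick)(void) { CAT(NAME,_ticks)++; }
 * Including it once per NAME gives each thread its own static instance.
 * Shown here expanded inline so the example is self-contained. */
#define CAT2(a, b) a##b
#define CAT(a, b)  CAT2(a, b)

#define NAME fg
static unsigned char CAT(NAME, _ticks);
static void CAT(NAME, _tick)(void) { CAT(NAME, _ticks)++; }
#undef NAME

#define NAME bg
static unsigned char CAT(NAME, _ticks);
static void CAT(NAME, _tick)(void) { CAT(NAME, _ticks)++; }
#undef NAME

int main(void)
{
    fg_tick();   /* foreground thread's private copy */
    bg_tick();   /* background thread's private copy */
    return 0;
}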
Reply by Philipp Klaus Krause November 8, 2018
On 8.11.18 20:52, upsidedown@downunder.com wrote:
>> >> But static memory allocation would require one copy of each function per >> thread. > > For a foreground/background monitor, the worst case would be two > copies of static data, if both threads use the same rubroutine. > >> And the linker would have to analyze the call graph to always >> call the correct function for each thread. > > Linker for such small target ?
Of course. The support routines the compiler uses reside in some library, the linker links them in if necessary. Also, the larger variants are not that small, with up to 256 B of RAM and 8 KB of ROM. One might want to e.g. have one .c file for handling I²C, one for the soft UART, etc.
> > With such small processor, just track any dependencies manually.
See above.
> >> Function pointers get complicated. > > Do you really insist of using function pointer with such small > targets? >
I want to have C, and function pointers are part of it.
>> >> Unfortunately, reentrancy becomes even harder with >> hardware-multithreading: > > With two hardware threads, you would need at most two copies of static > data.
Padauk still makes one chip with 8 hardware threads (and it looks to me as if there were more in the past; though they are not currently listed on their website, one can still find them e.g. in their IDE).
> >> TO access the stack, one has to construct a >> pointer to the stack location in a memory location. > > Why would you want to access the stack ?
For reentrancy, so I can use one function implementation for all threads. It would also be useful to be able to dynamically assign threads to hardware threads (so no thread is tied to specific hardware, and some OS schedules them).
> > The stack is usable for handling return addresses, but I guess that a > hardware thread must have its own return address stack pointer.
Each hardware thread has its own flag register (4 bits), accumulator (8 bits), PC (12 bits) and stack pointer (8 bits).
> >> That memory location >> (as any pseudo-registers) is then shared among all running instances of >> the function. So it needs to be protected (e.g. with a spinlock), making >> access even more inefficient. And that spinlock will cause issues with >> interrupts (a solution might be to heavily restrict interrupt routines, >> essentially allowing not much more than setting some global variables). > > Disabling all interrupts for the duration of some critical operations > is often enough, but of course, the number of instructions executed > during interrupt disabled should be minimized.
Disabling interrupts any time a spinlock is held, or a thread is waiting for one, might be too much, especially if there are many threads, so that the spinlock is held often. Philipp
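To make the concern concrete, a sketch of such a spinlock around a shared pseudo-register block. atomic_tas() stands in for whatever atomic test-and-set the hardware would actually offer and is purely hypothetical, as are all the other names:

#include <stdint.h>

/* Shared pseudo-registers used by a reentrant helper: they sit at a fixed
   RAM address, so every hardware thread calling the helper competes for
   them, and access is guarded by a spinlock. atomic_tas() is a placeholder
   for a real atomic test-and-set (returns the previous value of *p and
   sets it to 1); it is not an actual Padauk or SDCC intrinsic. */
static volatile uint8_t lock;            /* 0 = free, 1 = taken */
static uint8_t scratch[2];               /* the shared pseudo-registers */

extern uint8_t atomic_tas(volatile uint8_t *p);   /* hypothetical */

static void spin_lock(volatile uint8_t *l)
{
    while (atomic_tas(l))                /* spin; interrupts stay enabled */
        ;
}

static void spin_unlock(volatile uint8_t *l)
{
    *l = 0;
}

uint8_t helper(uint8_t x)
{
    uint8_t r;
    spin_lock(&lock);
    scratch[0] = x;                         /* an ISR grabbing the same lock */
    scratch[1] = (uint8_t)(x + 1);          /* here would deadlock - hence the */
    r = (uint8_t)(scratch[0] + scratch[1]); /* restrictions on interrupt code */
    spin_unlock(&lock);
    return r;
}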
Reply by upsidedown@downunder.com November 8, 2018
On Thu, 8 Nov 2018 13:53:48 +0100, Philipp Klaus Krause <pkk@spth.de>
wrote:

>Am 12.10.18 um 20:39 schrieb upsidedown@downunder.com: >> On Fri, 12 Oct 2018 10:18:56 +0200, Philipp Klaus Krause <pkk@spth.de> >> wrote: >> >>> Am 10.10.2018 um 03:05 schrieb Clifford Heath: >>>> <https://lcsc.com/product-detail/PADAUK_PADAUK-Tech-PMS150C_C129127.html> >>>> <http://www.padauk.com.tw/upload/doc/PMS150C%20datasheet%20V004_EN_20180124.pdf> >>>> >>>> >>>> OTP, no SPI, UART or I²C, but still... >>>> >>>> Clifford Heath >>> >>> They even make dual-core variants (the part where the first digit in the >>> part number is '2'). It seems program counter, stack pointer, flag >>> register and accumulator are per-core, while the rest, including the ALU >>> is shared. In particular, the I/O registers are also shared, which means >>> some multiplier registers would also be - but currently all variants >>> with integrated multiplier are single-core. >>> Use of the ALU is shared byt he two cores, alternating by clock cycle. >>> >>> Philipp >> >> >> Interesting, that would make it easy to run a multitasking RTOS >> (foreground/background) monitor, which might justify the use of some >> reentrant library routines :-). But in reality, the available memory >> (ROM/RAM) is so small so that you could easily manage this with static >> memory allocations. >> >> > >But static memory allocation would require one copy of each function per >thread.
For a foreground/background monitor, the worst case would be two copies of static data, if both threads use the same subroutine.
>And the linker would have to analyze the call graph to always >call the correct function for each thread.
A linker for such a small target? With such a small processor, just track any dependencies manually.
>Function pointers get complicated.
Do you really insist on using function pointers with such small targets?
> >Unfortunately, reentrancy becomes even harder with >hardware-multithreading:
With two hardware threads, you would need at most two copies of static data.
>TO access the stack, one has to construct a >pointer to the stack location in a memory location.
Why would you want to access the stack? The stack is usable for handling return addresses, but I guess that a hardware thread must have its own return address stack pointer. In fact, many minicomputers from the 1960s did not even have a stack at all. The calling program just stored the return address in the first word of the subroutine and, at the end of the subroutine, performed an indirect jump through the first word of the subroutine to return to the calling program. Of course, this is not re-entrant, and in those days one did not have to worry about multiple CPUs accessing the same routines :-). BTW, who needs a program counter (PC)? Many microprograms run without a PC, with the next instruction address stored at the end of the long instruction word :-)
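A toy model of that linkage in C, for anyone who has not seen it; mem[] stands in for core memory, and the addresses and "program" are invented:

#include <stdio.h>

/* Toy model of "return address in the first word of the subroutine"
   (PDP-8 JMS style). The call writes the return address into mem[SUB];
   the subroutine returns with an indirect jump through that word. */
#define SUB 10                    /* first word of the subroutine */
static int mem[32];

static void call_sub(int return_addr)
{
    mem[SUB] = return_addr;       /* store where to come back to */
    /* ... subroutine body, nominally starting at SUB + 1 ... */
    printf("return via indirect jump through mem[%d] = %d\n", SUB, mem[SUB]);
}

int main(void)
{
    call_sub(100);                /* a second, overlapping call (interrupt or
                                     another CPU) would overwrite mem[SUB],
                                     which is why this is not reentrant */
    return 0;
}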
>That memory location >(as any pseudo-registers) is then shared among all running instances of >the function. So it needs to be protected (e.g. with a spinlock), making >access even more inefficient. And that spinlock will cause issues with >interrupts (a solution might be to heavily restrict interrupt routines, >essentially allowing not much more than setting some global variables).
Disabling all interrupts for the duration of some critical operations is often enough, but of course the number of instructions executed with interrupts disabled should be minimized. In MACRO-11 assembler, the standard practice was to start the comment field with a semicolon, with two semicolons when task switching was disabled and with three semicolons when interrupts were disabled; that made it visually easy to see where interrupts were off and not to mess too much with such code sections.
> >The there is the trade-off of using one such memory location per >function vs. per program (the latter reducing memroy usage, but >resulting in less paralellism). > >The pseudo-registers one would want to use are not so much a problem for >interrupt routines (they would just need saving and thus increase >interrupt overhead a bit), but for hardware parallelism. Essentially all >access to them would again have to be protected by a spinlock. > >All these problems could have relatively easily been avoided by >providing an efficient stack-pointer-relative addressing mode. Having a >few general-purpose or index registers would have somewhat helped as well. > >Philipp
Reply by Philipp Klaus Krause November 8, 2018
On 8.11.18 14:08, Tauno Voipio wrote:
> > > And you'll end up with a low-end Cortex ... >
A low-end Cortex would still be far heavier than a Padauk variant with an sp-relative addressing mode or a few registers added. I think even a more multithreading-friendly variant of the Padauk would still be simpler than an STM8. But one could surely create a nice STM8-like processor (with a few STM8 weaknesses fixed) with hardware multithreading. Philipp
Reply by Tauno Voipio November 8, 2018
On 8.11.18 14:53, Philipp Klaus Krause wrote:
> Am 12.10.18 um 20:39 schrieb upsidedown@downunder.com: >> On Fri, 12 Oct 2018 10:18:56 +0200, Philipp Klaus Krause <pkk@spth.de> >> wrote: >> >>> Am 10.10.2018 um 03:05 schrieb Clifford Heath: >>>> <https://lcsc.com/product-detail/PADAUK_PADAUK-Tech-PMS150C_C129127.html> >>>> <http://www.padauk.com.tw/upload/doc/PMS150C%20datasheet%20V004_EN_20180124.pdf> >>>> >>>> >>>> OTP, no SPI, UART or I²C, but still... >>>> >>>> Clifford Heath >>> >>> They even make dual-core variants (the part where the first digit in the >>> part number is '2'). It seems program counter, stack pointer, flag >>> register and accumulator are per-core, while the rest, including the ALU >>> is shared. In particular, the I/O registers are also shared, which means >>> some multiplier registers would also be - but currently all variants >>> with integrated multiplier are single-core. >>> Use of the ALU is shared byt he two cores, alternating by clock cycle. >>> >>> Philipp >> >> >> Interesting, that would make it easy to run a multitasking RTOS >> (foreground/background) monitor, which might justify the use of some >> reentrant library routines :-). But in reality, the available memory >> (ROM/RAM) is so small so that you could easily manage this with static >> memory allocations. >> >> > > But static memory allocation would require one copy of each function per > thread. And the linker would have to analyze the call graph to always > call the correct function for each thread. Function pointers get > complicated. > > Unfortunately, reentrancy becomes even harder with > hardware-multithreading: TO access the stack, one has to construct a > pointer to the stack location in a memory location. That memory location > (as any pseudo-registers) is then shared among all running instances of > the function. So it needs to be protected (e.g. with a spinlock), making > access even more inefficient. And that spinlock will cause issues with > interrupts (a solution might be to heavily restrict interrupt routines, > essentially allowing not much more than setting some global variables). > > The there is the trade-off of using one such memory location per > function vs. per program (the latter reducing memroy usage, but > resulting in less paralellism). > > The pseudo-registers one would want to use are not so much a problem for > interrupt routines (they would just need saving and thus increase > interrupt overhead a bit), but for hardware parallelism. Essentially all > access to them would again have to be protected by a spinlock. > > All these problems could have relatively easily been avoided by > providing an efficient stack-pointer-relative addressing mode. Having a > few general-purpose or index registers would have somewhat helped as well. > > Philipp
And you'll end up with a low-end Cortex ...

--
-TV
Reply by Philipp Klaus Krause November 8, 2018
On 12.10.18 20:39, upsidedown@downunder.com wrote:
> On Fri, 12 Oct 2018 10:18:56 +0200, Philipp Klaus Krause <pkk@spth.de> > wrote: > >> Am 10.10.2018 um 03:05 schrieb Clifford Heath: >>> <https://lcsc.com/product-detail/PADAUK_PADAUK-Tech-PMS150C_C129127.html> >>> <http://www.padauk.com.tw/upload/doc/PMS150C%20datasheet%20V004_EN_20180124.pdf> >>> >>> >>> OTP, no SPI, UART or I²C, but still... >>> >>> Clifford Heath >> >> They even make dual-core variants (the part where the first digit in the >> part number is '2'). It seems program counter, stack pointer, flag >> register and accumulator are per-core, while the rest, including the ALU >> is shared. In particular, the I/O registers are also shared, which means >> some multiplier registers would also be - but currently all variants >> with integrated multiplier are single-core. >> Use of the ALU is shared byt he two cores, alternating by clock cycle. >> >> Philipp > > > Interesting, that would make it easy to run a multitasking RTOS > (foreground/background) monitor, which might justify the use of some > reentrant library routines :-). But in reality, the available memory > (ROM/RAM) is so small so that you could easily manage this with static > memory allocations. > >
But static memory allocation would require one copy of each function per thread. And the linker would have to analyze the call graph to always call the correct function for each thread. Function pointers get complicated.

Unfortunately, reentrancy becomes even harder with hardware-multithreading: to access the stack, one has to construct a pointer to the stack location in a memory location. That memory location (as any pseudo-registers) is then shared among all running instances of the function. So it needs to be protected (e.g. with a spinlock), making access even more inefficient. And that spinlock will cause issues with interrupts (a solution might be to heavily restrict interrupt routines, essentially allowing not much more than setting some global variables).

Then there is the trade-off of using one such memory location per function vs. per program (the latter reducing memory usage, but resulting in less parallelism).

The pseudo-registers one would want to use are not so much a problem for interrupt routines (they would just need saving and thus increase interrupt overhead a bit), but for hardware parallelism. Essentially all access to them would again have to be protected by a spinlock.

All these problems could have relatively easily been avoided by providing an efficient stack-pointer-relative addressing mode. Having a few general-purpose or index registers would have somewhat helped as well. Philipp
Reply by Philipp Klaus Krause November 5, 2018
On 12.10.2018 09:44, David Brown wrote:
> On 12/10/18 08:50, Philipp Klaus Krause wrote: >> Am 12.10.2018 um 01:08 schrieb Paul Rubin: >>> upsidedown@downunder.com writes: >>>> There is a lot of operations that will update memory locations, so why >>>> would you need a lot of CPU registers. >>> >>> Being able to (say) add register to register saves traffic through the >>> accumulator and therefore instructions. >>> >>>> 1 KiB = 0.5 KiW is quite a lot, it is about 10-15 pages of commented >>>> assembly program listing. >>> >>> It would be nice to have a C compiler, and registers help with that. >>> >> >> Looking at the instruction set, it should be possible to make a backend >> for this in SDCC; the architecture looks more C-friendly than the >> existing pic14 and pic16 backends. But it surely isn't as nice as stm8 >> or z80. >> reentrant functions will be inefficent: No registers, and no sp-relative >> adressing mode. On would want to reserve a few memory locations as >> pseudo-registers to help with that, but that only goes so far. >> > > It looks like the lowest 16 memory addresses could be considered > pseudo-registers - they are the ones that can be used for direct memory > access rather than needing indirect access. >
Considering the multi-core variants of the Padauk µCs: those addresses are shared across all cores. Each core only has its own A, SP, F and PC. How do we handle local variables?

Option 1: Make functions non-reentrant. This requires duplication of code (we need per-thread copies of functions) and link-time analysis to ensure that each thread only calls the function implementation meant for it. Function pointers get complicated.

Option 2: Use an inefficient combination of thread-local storage and stack.

Since this is a small µC, we need a lot of support functions, which the compiler inserts (e.g. for multiplication); of course those are affected by the same problems. Philipp
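A sketch of what Option 2 could look like for a compiler-inserted helper, with one slot of thread-local storage per core selected by a core index. padauk_core_id() is a stand-in for however the running core would identify itself; it and the other names are hypothetical:

#include <stdint.h>

#define NUM_CORES 2

extern uint8_t padauk_core_id(void);    /* hypothetical: returns 0 or 1 */

/* Per-core scratch slot for the helper's local state ("thread-local
   storage" selected by core index). */
static uint8_t mul_tmp[NUM_CORES];

/* 8x8 -> 8-bit multiply, the kind of support routine a compiler inserts
   on a target without a hardware multiplier. */
uint8_t mul8(uint8_t a, uint8_t b)
{
    uint8_t id = padauk_core_id();
    uint8_t r = 0;

    mul_tmp[id] = a;                    /* this core's private copy of a */
    while (b--)
        r = (uint8_t)(r + mul_tmp[id]);
    return r;
}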