
Pipelined 6502/z80 with cache and 16x clock multiplier

Started by Brett Davis December 19, 2010
In article <4d1f64ec$0$3034$afc38c87@news.optusnet.com.au>,
 <kym@kymhorsell.com> wrote:
>In comp.arch MitchAlsup <MitchAlsup@aol.com> wrote:
>...
>> There was a rev of TOPS-10 that would timeout when accessing a
>> particular memory (OS) structure on the KIs. Either DEC added another
>> level of indirection, or rearranged the memory footprint so that the
>> timer timeout was exposed. I was going to mention this, but thought
>> "just let it go".
>
>I'm not sure what you mean. "A particular memory (OS) structure" -- does
>that mean some specific O/S table had a timeout on it, or does it
>mean that if an indirect was stuck in self ref it would get timed out?
>No matter.
>
>With this nasty @x feature you could easily hook 1000s of
>locations together and have a long loop from one to the next and back to
>the first again. I don't think anyone ever did *that*. But you
>never know who might use such a trick to (e.g.) implement a free list for
>a LISP interpreter.
There is another feature at play here, AFAIR. When the indirect chain is interrupted, the original instruction is stopped, the PC is saved, and a context swap is done. When the interrupt is done and the machine gets back to scheduling the instruction again, the whole thing has to start over and be evaluated from the beginning. With a sufficiently long chain on a sufficiently memory-starved machine, this set of events may never terminate.
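To make that livelock concrete, here is a small, self-contained C model (an editorial sketch with made-up names such as Word and resolve_effective_address, not an exact model of PDP-10 behaviour) of an effective-address walk that follows the indirect bit until it clears, and that has to be restarted from the original instruction word whenever an interrupt arrives mid-walk:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of PDP-10 effective-address calculation.
     * Field layout and names are simplified/hypothetical. */
    typedef struct {
        bool     indirect;  /* the '@' bit: one more level of indirection */
        uint32_t address;   /* address field (index registers omitted) */
    } Word;

    #define MEMSIZE 8
    static Word memory[MEMSIZE];

    static int interrupts_left = 3;          /* pretend interrupts keep arriving */
    static bool interrupt_pending(void) { return interrupts_left-- > 0; }

    /* Follow the indirect chain. Returns true and stores the effective address
     * in *ea; returns false if an "interrupt" arrives mid-walk, in which case
     * the machine described above abandons the partial walk and re-executes
     * the whole instruction after the interrupt is serviced. */
    static bool resolve_effective_address(Word w, uint32_t *ea)
    {
        while (w.indirect) {
            if (interrupt_pending())
                return false;                /* restart from scratch later */
            w = memory[w.address % MEMSIZE]; /* one more level */
        }
        *ea = w.address;
        return true;
    }

    int main(void)
    {
        /* Build a short indirect chain: 0 -> 1 -> 2 -> final address 42. */
        memory[0] = (Word){ true, 1 };
        memory[1] = (Word){ true, 2 };
        memory[2] = (Word){ false, 42 };

        uint32_t ea;
        int attempts = 0;
        while (!resolve_effective_address(memory[0], &ea))
            attempts++;                      /* each failure = full re-execution */

        printf("effective address %u after %d restarts\n", (unsigned)ea, attempts);
        return 0;
    }

If interrupts keep arriving faster than the chain can be walked, the restart counter never stops climbing and the instruction never completes, which is exactly the non-terminating case described above.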
>> robustness
I certainly don't miss the quality issues with hardware from the era before RISC processors, RAID and real networks.

-- mrr
In article <8e86v7-266.ln1@laptop.reistad.name>,
Morten Reistad  <first@last.name> wrote:
>
>>> robustness
>
>I certainly don't miss the quality issues with hardware from the era
>before RISC processors, RAID and real networks.
So you much prefer the current failure modes? Yes, they are much rarer, but typically FAR more evil when they occur - just as with modern versus older automobiles. If you do a proper cost-benefit analysis (i.e. using game theory, not benchmarketing), modern systems aren't as much better as most people think.

Some of that could be improved by proper documentation and not just some recipes to follow when all goes well, and more could be improved by putting more resources into better and more pervasive diagnostics, but some of the degradation is fundamental. Where timing problems were rare and obscure, now they are common and ubiquitous.

Even 40 years ago, it was EXTREMELY rare to have to cancel a whole project because of a failure mode IN PRODUCTION EQUIPMENT which couldn't be located or even reduced to a tolerable level, but nowadays it is merely unusual. In a few decades, it may even become common.

Regards,
Nick Maclaren.
In comp.arch nmm1@cam.ac.uk wrote:
...
> diagnostics, but some of the degradation is fundamental. Where
> timing problems were rare and obscure, now they are common and
> ubiquitous.
>
> Even 40 years ago, it was EXTREMELY rare to have to cancel a whole
> project because of a failure mode IN PRODUCTION EQUIPMENT which
> couldn't be located or even reduced to a tolerable level, but
> nowadays it is merely unusual. In a few decades, it may even
> become common.
...

There is something to this. :)

A couple of decades back embedded work was fairly straightforward. Components may have been trivial and slow, but because of that hooking them together was generally straightforward, and the "mental model" needed to get things to work as expected was simple, too. You didn't need (as now) to rely on masses of very buggy documentation to make progress.

I remember a few projects in the "early days" making microprocessors (Z80, 6809, 68k, and even the odd 8080 in the *very* early days) do things they were never "designed" for, and generally ending up with something that did a job reliably. There were still quite a few "undocumented features" you'd run across, but they maybe tended to provide shortcuts rather than roadblocks.

Just a couple of years back I worked on an embedded system to provide simultaneous data, SMS and multi-channel voice over a 3G network. Not only was the wireless module quirky (I am being charitable), with at least 50% of its executive-summary functionality undocumented and maybe not entirely thought out, but the large multinational responsible seemed uncooperative in getting our product past the prototype stage. If it weren't for some arm twisting from our arm-twisting dept vis-a-vis some regional company rep, the project would have foundered. Timing issues abounded, and the basic design of the module seemed designed to make operation unreliable at best.

After various people assured us the provided documentation was completely up to date, the regional rep managed to send us tantalising photocopies of clearly more recent documentation that described features we needed to co-ordinate operations. Not that it entirely worked as described. :)

We ended up just having to wear the concurrency issues and put in a few "grand mal" resets, numerous sleeps and timeouts with empirically determined max parameters, etc., at judicious points to try to discourage and then recover from various races, deadlocks and starvations.

The development of consumer-level products is largely a matter of stage magic. Provided the end user (or even your supervisor :) doesn't know exactly what your gadget is doing, it can *appear* to work fine. As in the music hall, a bit of misdirection in the form of a "simplified explanation" or two, a few flashing LEDs, and a couple of potted "information messages" can convince observers the product is not only doing its job but miraculously exceeding design specs.

Just -- *please* -- don't look behind the curtain.

--
Generally, an empty answer. Try again.
  -- John Stafford <nhoj@droffats.net>, 08 Dec 2010 10:16:59 -0600
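The "sleeps, timeouts and escalating resets" recovery pattern described in the post above is easy to picture as code. Below is a minimal, self-contained C sketch of that idea; the module functions, the AT command string and the tuning constants are all hypothetical stand-ins, not the actual module's API, and the numbers would have to be found empirically just as in the post.

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>   /* sleep() */

    /* Hypothetical stand-ins for the flaky wireless module; the real
     * module's API is, as the post says, largely undocumented. */
    static int flakiness = 5;   /* pretend the 6th attempt finally succeeds */

    static bool module_send_command(const char *cmd)
    {
        (void)cmd;
        return --flakiness <= 0;            /* fails a few times, then works */
    }
    static void module_soft_reset(void)  { /* reinit UART, resync, etc.      */ }
    static void module_power_cycle(void) { /* toggle the module's power rail */ }

    /* Empirically determined tuning knobs, as in the post. */
    enum {
        SETTLE_SECONDS = 1,   /* quiet time after each attempt               */
        SOFT_RETRIES   = 3,   /* soft resets before escalating               */
        HARD_RETRIES   = 2    /* "grand mal" power cycles before giving up   */
    };

    /* Send a command, recovering from races and hangs by escalating resets. */
    static bool send_with_recovery(const char *cmd)
    {
        for (int hard = 0; hard <= HARD_RETRIES; hard++) {
            for (int soft = 0; soft <= SOFT_RETRIES; soft++) {
                if (module_send_command(cmd))
                    return true;            /* it worked this time           */
                sleep(SETTLE_SECONDS);      /* let the race die down         */
                module_soft_reset();
            }
            module_power_cycle();           /* the "grand mal" reset         */
            sleep(SETTLE_SECONDS * 5);
        }
        return false;                       /* give up, log, carry on        */
    }

    int main(void)
    {
        printf("command %s\n", send_with_recovery("AT+CGATT=1") ? "accepted"
                                                                : "abandoned");
        return 0;
    }

The design point is simply that each escalation level gets its own empirically chosen settle time, and the caller only ever sees "worked" or "gave up".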
In article <ifpugs$v1f$1@gosset.csi.cam.ac.uk>,  <nmm1@cam.ac.uk> wrote:
>In article <8e86v7-266.ln1@laptop.reistad.name>,
>Morten Reistad <first@last.name> wrote:
>>
>>>> robustness
>>
>>I certainly don't miss the quality issues with hardware from the era
>>before RISC processors, RAID and real networks.
>
>So you much prefer the current failure modes? Yes, they are much
>rarer, but typically FAR more evil when they occur - just as with
>modern versus older automobiles. If you do a proper cost-benefit
>analysis (i.e. using game theory, not benchmarketing), modern
>systems aren't as much better as most people think.
But if you do a proper systems analysis, they are. Because they are cheap, you can have multiple systems. With different components.

And we have tools to handle faults.

We can use RAID for disks. And multiple power sources.

Done right, we can afford to throw one out.
>Some of that could be improved by proper documentation and not
>just some recipes to follow when all goes well, and more could be
>improved by putting more resources into better and more pervasive
>diagnostics, but some of the degradation is fundamental. Where
>timing problems were rare and obscure, now they are common and
>ubiquitous.
>
>Even 40 years ago, it was EXTREMELY rare to have to cancel a whole
>project because of a failure mode IN PRODUCTION EQUIPMENT which
>couldn't be located or even reduced to a tolerable level, but
>nowadays it is merely unusual. In a few decades, it may even
>become common.
One PPOE had a principle of _always_ having separate implementations of all critical systems, running as live as possible. I learnt a lot from that. We even found a floating point bug in hardware.

But the point you are making is important. The open hardware movements are important, because we need the transparency.

It is not just that the driver works with Linux. It is that you can actually see what it is doing.

And yes, we have to be a lot more proactive on this front.

-- mrr
In article <4d209861$0$3428$afc38c87@news.optusnet.com.au>,
 <kym@kymhorsell.com> wrote:
>In comp.arch nmm1@cam.ac.uk wrote:
>...
>> diagnostics, but some of the degradation is fundamental. Where
>> timing problems were rare and obscure, now they are common and
>> ubiquitous.
>>
>> Even 40 years ago, it was EXTREMELY rare to have to cancel a whole
>> project because of a failure mode IN PRODUCTION EQUIPMENT which
>> couldn't be located or even reduced to a tolerable level, but
>> nowadays it is merely unusual. In a few decades, it may even
>> become common.
>...
>
>There is something to this. :)
>
>A couple of decades back embedded work was fairly straightforward.
>Components may have been trivial and slow, but because of that hooking
>them together was generally straightforward, and the "mental model"
>needed to get things to work as expected was simple, too. You didn't
>need (as now) to rely on masses of very buggy documentation to make
>progress.
It is evident that this poster never handled SMD or ESMD disks, large x.25 network devices, MAU-based token ring, or pre-internet multiplexing equipment.
>I remember a few projects in the "early days" making microprocessors (Z80,
>6809, 68k, and even the odd 8080 in the *very* early days)
>do things they were never "designed" for, and generally ending up with
>something that did a job reliably. There were still quite
>a few "undocumented features" you'd run across, but they maybe tended to
>provide shortcuts rather than roadblocks.
The 6502 and the other 650x processors had a lot of surprises, and they were not exactly a showcase in terms of documentation.
>Just a couple of years back I worked on an embedded system to provide
>simultaneous data, SMS and multi-channel voice over a 3G network. Not only
>was the wireless module quirky (I am being charitable), with at least 50%
>of its executive-summary functionality undocumented and maybe not entirely
>thought out, but the large multinational responsible seemed uncooperative
>in getting our product past the prototype stage. If it weren't for some
>arm twisting from our arm-twisting dept vis-a-vis some regional company
>rep, the project would have foundered. Timing issues abounded, and the
>basic design of the module seemed designed to make operation unreliable
>at best.
Bad designs exist everywhere. But get the contract right, and they have to deliver, or perish.

The extreme top-down Telco model for implementation never worked. Not then, not now. Read RFC 875 for an ideological handle on it.
>After various people assured us the provided documentation was
>completely up to date, the regional rep managed to send us
>tantalising photocopies of clearly more recent documentation that
>described features we needed to co-ordinate operations. Not that
>it entirely worked as described. :)
>
>We ended up just having to wear the concurrency issues and put
>in a few "grand mal" resets, numerous sleeps and timeouts with
>empirically determined max parameters, etc., at judicious points to try
>to discourage and then recover from various races, deadlocks and
>starvations.
>
>The development of consumer-level products is largely a matter
>of stage magic. Provided the end user (or even your supervisor :)
>doesn't know exactly what your gadget is doing, it can *appear* to work
>fine. As in the music hall, a bit of misdirection in the form
>of a "simplified explanation" or two, a few flashing LEDs,
>and a couple of potted "information messages" can convince observers
>the product is not only doing its job but miraculously exceeding design
>specs.
>
>Just -- *please* -- don't look behind the curtain.
Perhaps you are ready for the internet model of consensus and working systems now?

-- mrr
Morten Reistad wrote:
> In article <e073c9bf-50f4-45e1-97e2-11a5354c980b@g25g2000yqn.googlegroups.com>,
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> On Dec 30, 7:56 am, Terje Mathisen <"terje.mathisen at tmsw.no">
>> wrote:
>>> Morten Reistad wrote:
>>>> I keep thinking what could be done if a classic machine
>>>> like a PDP11 or a PDP10 was made with a modern process, no
>>>> microcode in core instructions, and we substitute L2 cache
>>>> for main memory, and RAM for disk. And hyperchannel for
>>>> I/O.
>>>
>>> Afair, both of them had memory indirect addressing?
>>
>> The PDP-10 had infinite indirect memory addressing--the addressed word
>> from memory contained a bit to indicate if another level of
>> indirection was to be performed.
>
> For the ones who do not know the PDP10 instruction set: The address
> calculation and the instruction execution are totally separate on this
> machine.
>
> You could do stuff like MOVEI A,10(B), which would add 10 to
> the value in register B, placing the result in register A.
That seems _very_ similar to LEA EAX,[EBX+10] on an x86, which also has separate address calc and integer execution paths.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
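For readers who want to see the parallel in compiler-output terms, here is a tiny C sketch (illustrative only; the function name is made up, and whether you actually get an LEA depends on the target and optimisation level): the address-generation path computes base-plus-displacement as plain integer arithmetic, with no memory reference, which is what both MOVEI A,10(B) and LEA EAX,[EBX+10] do.

    #include <stdint.h>
    #include <stdio.h>

    /* The sort of expression the effective-address hardware computes:
     * base + displacement, with no load or store. On x86-64, GCC and Clang
     * will usually compile the body to a single LEA (e.g. lea rax,[rdi+10]);
     * on a PDP-10 the analogue is MOVEI A,10(B). */
    static uintptr_t add_ten(uintptr_t b)
    {
        return b + 10;      /* address arithmetic, no memory access */
    }

    int main(void)
    {
        printf("%lu\n", (unsigned long)add_ten(32));   /* prints 42 */
        return 0;
    }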
Morten Reistad wrote:
> In article <ifpugs$v1f$1@gosset.csi.cam.ac.uk>, <nmm1@cam.ac.uk> wrote:
>> So you much prefer the current failure modes? Yes, they are much
>> rarer, but typically FAR more evil when they occur - just as with
>> modern versus older automobiles. If you do a proper cost-benefit
>> analysis (i.e. using game theory, not benchmarketing), modern
>> systems aren't as much better as most people think.
>
> But if you do a proper systems analysis, they are. Because they
> are cheap, you can have multiple systems. With different components.
The last sentence is the key:

Yes, you _CAN_ have redundant systems with different components, but I have yet to see a single vendor who will certify and/or recommend this! Instead they want you to make sure that the hardware and software are as identical as possible on each node, significantly increasing the risk of a common-mode hardware problem hitting all nodes at the same time.

E.g., NetWare's System Fault Tolerant setup mirrored the state between two servers, so that the slave could take over more or less immediately (i.e. well within the software timeout limits). I always wanted those two servers to use totally separate motherboards, CPUs, disk and network controllers, etc., but was told that the HW had to be identical. :-(
>
> And we have tools to handle faults.
>
> We can use RAID for disks. And multiple power sources.
>
> Done right, we can afford to throw one out.
[snip]
> One PPOE had a principle of _always_ having separate implementations
> of all critical systems, running as live as possible. I learnt a lot
> from that. We even found a floating point bug in hardware.
That's very interesting, I'll have to get the full story from you at some point in time. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
In article <gaj7v7-af2.ln1@laptop.reistad.name>,
Morten Reistad  <first@last.name> wrote:
>>>
>>>>> robustness
>>>
>>>I certainly don't miss the quality issues with hardware from the era
>>>before RISC processors, RAID and real networks.
>>
>>So you much prefer the current failure modes? Yes, they are much
>>rarer, but typically FAR more evil when they occur - just as with
>>modern versus older automobiles. If you do a proper cost-benefit
>>analysis (i.e. using game theory, not benchmarketing), modern
>>systems aren't as much better as most people think.
>
>But if you do a proper systems analysis, they are. Because they
>are cheap, you can have multiple systems. With different components.
>
>And we have tools to handle faults.
>
>We can use RAID for disks. And multiple power sources.
>
>Done right, we can afford to throw one out.
I am afraid that you have completely missed the point. To a very good first approximation, any problem that is localised within a single component is trivial; the hard ones are all associated with the global infrastructure or the interfaces between components. And remember the 80:20 rule - eliminating the 80% of the problems that account for only 20% of the cost isn't a great help.

Even worse, almost all of the tools to handle faults are intended to make it possible for a trained chimpanzee to deal with the 80% of trivial faults, and completely ignore the 20% of nasty ones. In a bad case, the ONLY diagnostic information is through the tool, and it says that there is no problem, that the problem is somewhere it demonstrably isn't, or is similarly useless.

Let me give you just one example. A VERY clued-up colleague had a RAID controller that went sour, so he replaced it. Unfortunately, the dying controller had left the system slightly inconsistent, so the new controller refused to take over and wanted to reinitialise all of the disks. Yes, he had a backup, but it would have taken a week to do a complete reload (which was why he was using a fancy RAID system in the first place).

He solved the problem by mounting each disk, cleaning it up using an unrelated 'fsck', manually fiddling a few key files, and then restarting the controller. Damn few people CAN do that, because none of the relevant structure was documented, and the whole process was unsupported.

I have several times had a problem where even the vendor admitted defeat, and where a failure to at least bypass the problem would mean that a complete system would have had to be written off before going into production. Each took me over a hundred hours of hair-tearing.
>One PPOE had a principle of _always_ having separate implementations
>of all critical systems, running as live as possible. I learnt a lot
>from that. We even found a floating point bug in hardware.
Well, yes. I regret not having access to a range of systems any longer - inter alia, it makes it hard to check code for portability.
>But the point you are making is important. The open hardware
>movements are important, because we need the transparency.
>
>It is not just that the driver works with Linux. It is that you
>can actually see what it is doing.
>
>And yes, we have to be a lot more proactive on this front.
I fully agree with that.

Regards,
Nick Maclaren.
In article <ifscp5$5l3$1@gosset.csi.cam.ac.uk>,  <nmm1@cam.ac.uk> wrote:
>In article <gaj7v7-af2.ln1@laptop.reistad.name>,
>Morten Reistad <first@last.name> wrote:
>>>>
>>>>I certainly don't miss the quality issues with hardware from the era
>>>>before RISC processors, RAID and real networks.
>>>
>>>So you much prefer the current failure modes? Yes, they are much
>>>rarer, but typically FAR more evil when they occur - just as with
>>>modern versus older automobiles. If you do a proper cost-benefit
>>>analysis (i.e. using game theory, not benchmarketing), modern
>>>systems aren't as much better as most people think.
>>
>>But if you do a proper systems analysis, they are. Because they
>>are cheap, you can have multiple systems. With different components.
>>
>>And we have tools to handle faults.
>>
>>We can use RAID for disks. And multiple power sources.
>>
>>Done right, we can afford to throw one out.
>
>I am afraid that you have completely missed the point. To a very
>good first approximation, any problem that is localised within a
>single component is trivial; the hard ones are all associated with
>the global infrastructure or the interfaces between components.
>And remember the 80:20 rule - eliminating the 80% of the problems
>that account for only 20% of the cost isn't a great help.
If you want real redundancy you need sufficient separation between systems, and transparency in the failover methods. Today this rules out tightly coupled systems, like RAID controllers, "intelligent" switches and fancy hardware failovers.

RAID controllers, EtherChannel, multiple power supplies and separate processors are still used, but the performance issues are as important as the redundancy. The redundancy, or really the extra uptime, these bring is "nice to have", but not something to depend on.

For real redundancy you need separate power, as in feeds from different mains stations at least, a separate network, and physical separation.
>Even worse, almost all of the tools to handle faults are intended
>to make it possible for a trained chimpanzee to deal with the 80%
>of trivial faults, and completely ignore the 20% of nasty ones.
>In a bad case, the ONLY diagnostic information is through the tool,
>and it says that there is no problem, that the problem is somewhere
>it demonstrably isn't, or is similarly useless.
>
>Let me give you just one example. A VERY clued-up colleague had
>a RAID controller that went sour, so he replaced it. Unfortunately,
>the dying controller had left the system slightly inconsistent, so
>the new controller refused to take over and wanted to reinitialise
>all of the disks. Yes, he had a backup, but it would have taken a
>week to do a complete reload (which was why he was using a fancy
>RAID system in the first place).
>
>He solved the problem by mounting each disk, cleaning it up using
>an unrelated 'fsck', manually fiddling a few key files, and then
>restarting the controller. Damn few people CAN do that, because
>none of the relevant structure was documented, and the whole process
>was unsupported.
A "trust me" tool, tightly coupled to the system, without transparancy, from a single vendor.
>I have several times had a problem where even the vendor admitted
>defeat, and where a failure to at least bypass the problem would
>mean that a complete system would have had to be written off
>before going into production. Each took me over a hundred hours
>of hair-tearing.
At what point in the deployment did this show up?
>>One PPOE had a principle of _always_ having separate implementations
>>of all critical systems, running as live as possible. I learnt a lot
>>from that. We even found a floating point bug in hardware.
>
>Well, yes. I regret not having access to a range of systems any
>longer - inter alia, it makes it hard to check code for portability.
I keep Linux, FreeBSD and OpenBSD around. And in the cases where we only have Linux support, I deliberately install 64-bit systems in location A and 32-bit in location B.
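As an editorial aside, here is a small, self-contained C example (not from the thread) of the sort of latent bug that running the same code on both a 32-bit and a 64-bit deployment tends to expose: stashing a pointer in an int happens to work on the 32-bit box and silently truncates on the 64-bit one.

    #include <stdint.h>
    #include <stdio.h>

    /* A classic 32/64-bit portability trap: storing a pointer in an int.
     * On a 32-bit (ILP32) build this happens to round-trip; on a 64-bit
     * (LP64) build the cast drops the upper half of the address.
     * Running both builds is what catches it. */
    int main(void)
    {
        int x = 42;

        int cookie = (int)(intptr_t)&x;        /* BAD: may truncate on LP64 */
        int *p_bad = (int *)(intptr_t)cookie;  /* may no longer point at x  */

        intptr_t ok  = (intptr_t)&x;           /* GOOD: intptr_t is wide enough */
        int *p_good  = (int *)ok;

        printf("sizeof(int)=%zu sizeof(void*)=%zu\n", sizeof(int), sizeof(void *));
        printf("good: %d, bad pointer %s the original\n",
               *p_good, (p_bad == &x) ? "matches" : "does NOT match");
        return 0;
    }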
>>But the point you are making is important. The open hardware
>>movements are important, because we need the transparency.
>>
>>It is not just that the driver works with Linux. It is that you
>>can actually see what it is doing.
>>
>>And yes, we have to be a lot more proactive on this front.
>
>I fully agree with that.
-- mrr
On Mon, 03 Jan 2011 11:01:44 +0100, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Morten Reistad wrote:
>> In article <ifpugs$v1f$1@gosset.csi.cam.ac.uk>, <nmm1@cam.ac.uk> wrote:
>>> So you much prefer the current failure modes? Yes, they are much
>>> rarer, but typically FAR more evil when they occur - just as with
>>> modern versus older automobiles. If you do a proper cost-benefit
>>> analysis (i.e. using game theory, not benchmarketing), modern
>>> systems aren't as much better as most people think.
>>
>> But if you do a proper systems analysis, they are. Because they
>> are cheap, you can have multiple systems. With different components.
>
>The last sentence is the key:
>
>Yes, you _CAN_ have redundant systems with different components, but I
>have yet to see a single vendor who will certify and/or recommend this!
One of my customers uses systems capable of running on various HW platforms (both big- and little-endian) on different base operating systems, and there should not be much of a problem implementing the same functionality on different HW. Still, using platform diversity did not create much interest, since, after all, the same application-level software would be used.

The only publicly known truly redundant software that I have heard of is in the US Space Shuttle, with more or less triple (voting) flight control computers and a 4th independent computer, programmed by a different team, capable of (only) landing the Shuttle.
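As a toy illustration of the voting idea (an editorial sketch, not flight software; vote3 and the values are made up), here is a 2-out-of-3 majority voter over three independently computed command values:

    #include <stdio.h>

    /* 2-out-of-3 majority vote over redundant channels. Returns the agreed
     * value and flags a miscompare so the odd channel out can be reported.
     * Only an illustration of the idea, not real flight software. */
    static int vote3(int a, int b, int c, int *disagreement)
    {
        *disagreement = !(a == b && b == c);
        if (a == b || a == c) return a;   /* a is in the majority */
        if (b == c)           return b;   /* a is the odd one out */
        *disagreement = 1;
        return a;                         /* no majority: pick one, raise alarm */
    }

    int main(void)
    {
        int bad;
        /* Channel 3 has gone off in the weeds; the other two outvote it. */
        int cmd = vote3(100, 100, 97, &bad);
        printf("commanded value %d%s\n", cmd, bad ? " (miscompare flagged)" : "");
        return 0;
    }

A real system would also record which channel disagreed so that it can be taken out of the voting set.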
