On 6/8/2015 6:45 AM, Boudewijn Dijkstra wrote:
> Op Thu, 04 Jun 2015 19:34:41 +0200 schreef Don Y <this@is.not.me.com>:
>> On 6/4/2015 6:08 AM, Boudewijn Dijkstra wrote:
>>> Op Wed, 03 Jun 2015 10:15:09 +0200 schreef Don Y <this@is.not.me.com>:
>>>> The tougher issue is testing "live" memory in systems that
>>>> are "up" 24/7/365...
>>>
>>> High-reliability systems often employ Hamming codes (for booleans and
>>> enums) and inverted shadow copies for other values (which are checked
>>> on each access).
>>
>> These are SoC's (augmented with external memory) so ECC isn't usually
>> supported.
>
> I wasn't talking about ECC.  I meant in software.

Which is overkill for most applications.  I use that approach for
(nonvolatile) configuration memory -- primarily as a safeguard against
power collapsing unexpectedly in the middle of an (atomic) update of
one or more configuration parameters.  But, I haven't logged such an
event in many years, leading me to think it is overkill.

[OTOH, as it is only used for moving parameters from the nonvolatile
store into the *working* (configuration) store, it's a one-time hit
that doesn't add much to code size or execution time -- it runs once
during IPL and once again at shutdown... both times where it isn't
really noticeable.]

I need something that will attest to the integrity of the memory
subsystem as a whole, not just the nonvolatile portion of it.
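
(For illustration, a minimal sketch of the inverted-shadow-copy idea
mentioned above, in C.  All of the names here are hypothetical -- this
shows the technique, not anyone's actual implementation.)

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t value;
        uint32_t shadow;    /* always stored as ~value */
    } guarded_u32;

    static void guarded_write(guarded_u32 *g, uint32_t v)
    {
        g->value  = v;
        g->shadow = ~v;     /* written last; see note below */
    }

    /* Returns false if the pair no longer agrees (corruption, or an
       interrupted update). */
    static bool guarded_read(const guarded_u32 *g, uint32_t *out)
    {
        if (g->value != (uint32_t)~g->shadow)
            return false;
        *out = g->value;
        return true;
    }

(Because the shadow is written last, a power failure in the middle of
an update leaves a mismatched pair that the next read will flag --
which is the "interrupted atomic update" safeguard described above.)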
Memory testing
Started by ● June 3, 2015
Reply by ● June 8, 2015
Reply by ● June 9, 2015
Hi Don,

On Friday, June 5, 2015 at 4:46:19 AM UTC-4, Don Y wrote:
> On 6/4/2015 11:42 PM, upsidedown@downunder.com wrote:
>> On Wed, 03 Jun 2015 01:15:09 -0700, Don Y <this@is.not.me.com> wrote:
>>
>>> The tougher issue is testing "live" memory in systems that
>>> are "up" 24/7/365...
>>
>> Of course, POST is done only once at the first (and hopefully the only
>> time) for a few decades.
>
> As stated elsewhere, individual nodes are powered up and down routinely
> within the normal operation of the system.  So, it is possible for POST
> _on_a_specific_node_ to be run often (i.e., as often as power is cycled
> to that particular node).

Wait, WHAT?  The topic began with a 24/7/365 requirement and no mention
of distributed load-sharing systems.

If you can run POST "often", then what new policy are you really
looking for?

The security system must be 100% available, so there you use redundant
systems (maybe 3 with a voting protocol).  Power cycling one of the
three on a schedule doesn't seem too bad.  You want security with high
availability; you can't get by on the cheap.

So I don't think I see the problem.

[]
> This is actually an amusing concept.  Ask folks when they consider
> their ECC memory system to be "compromised" and you'll never get a
> firm answer.  E.g., how many bus errors do you consider as sufficient
> to leave you wondering if the ECC is actually *detecting* all errors
> (let alone *correcting* "some")?  How do you know that (detected) errors
> are completely localized and have no other consequences?
>
> <shrug>
>
> In my case, I treat errors as indicative of a failure.  Most probably
> something in the power conditioning and not a "wear" error in a device.
> Leaving it unchecked will almost certainly result in more errors popping
> up -- some of which I will likely NOT be able to detect.
>
> E.g., a POST error in DRAM causes me to fall back to recovery routines
> that operate out of (internal) SRAM.  A failure in SRAM similarly
> causes DRAM to be used to the exclusion of SRAM.  A failure in both
> means SoL!

... and our sponsor, Duct Tape, would like to remind you: in the long
run, ALL solutions are temporary.

> Regardless, in these degraded modes, the goal is only to *report*
> errors and support some limited remote diagnostics -- not to attempt
> to *operate* in the presence of a known problem.

So mostly not high availability.  Seeking something beyond POST may
just be overkill (except for that security system).

ed
Reply by ● June 10, 2015
Hi Ed,

On 6/9/2015 2:32 PM, Ed Prochak wrote:
> On Friday, June 5, 2015 at 4:46:19 AM UTC-4, Don Y wrote:
>> On 6/4/2015 11:42 PM, upsidedown@downunder.com wrote:
>>> On Wed, 03 Jun 2015 01:15:09 -0700, Don Y <this@is.not.me.com> wrote:
>>>
>>>> The tougher issue is testing "live" memory in systems that are "up"
>>>> 24/7/365...
>>>
>>> Of course, POST is done only once at the first (and hopefully the only
>>> time) for a few decades.
>>
>> As stated elsewhere, individual nodes are powered up and down routinely
>> within the normal operation of the system.  So, it is possible for POST
>> _on_a_specific_node_ to be run often (i.e., as often as power is cycled
>> to that particular node).
>
> Wait, WHAT?  The topic began with a 24/7/365 requirement and no mention
> of distributed load-sharing systems.

The *system* runs 24/7/365:
   "... issue is testing 'live' memory in systems that are 'up' 24/7/365..."
   ------------------------------------------^^^^^^^

> If you can run POST "often", then what new policy are you really
> looking for?

I can't run POST on any particular (i.e., randomly/periodically chosen)
portion of the system at any given time.  I can run POST (or BIST) on
*certain* parts (nodes) of the system at *selected* times.

E.g., if I am not presently "using water" (irrigation, domestic water,
etc.), then the node that is responsible for monitoring and controlling
water use can be commanded to run a POST (or BIST) -- after ensuring
the I/O's are "locked" in some appropriate state(s).

[For example, make sure the main water supply valve is "open",
irrigation valves are "closed", etc.  This lock-test-resume sequence is
sketched below.]

Likewise, if the security cameras covering the back yard are not needed
during daylight hours, those nodes can be powered up, tested, then
powered down until they *will* be needed.

Looking at different (smaller) time intervals, I can probably cheat and
arrange for the HVAC node to be tested IMMEDIATELY AFTER the house has
reached its heat/cool temperature -- on the assumption that the
furnace/ACbrrr will *not* be needed in the N seconds/minutes that it
takes for that test to complete (again, after first locking down the
I/O's to some sort of "safe" state).

OTOH, the database server is the sole repository of persistent data in
the system.  Taking *it* offline means the system AS A WHOLE needs to
be essentially quiescent -- there's no way for a node to inquire of
settings, make changes to settings, respond to changes in the
environment, etc. if the DB server is not responsive.

[And, the DB server has gobs of resources so testing there tends to be
far more time consuming.]

> The security system must be 100% available, so there you use redundant
> systems (maybe 3 with a voting protocol).  Power cycling one of the
> three on a schedule doesn't seem too bad.  You want security with high
> availability; you can't get by on the cheap.
>
> So I don't think I see the problem.

I don't have any explicit redundancy in the system.  E.g., if the door
camera bites the shed, it's gone.  No way to recover that lost
functionality (without the user replacing the node).  Likewise, if the
node that handles water usage/metering craps out, those functions are
gone until the hardware is replaced (e.g., perhaps the water supply
turns *off* when you'd like it to remain on; or, perhaps it is locked
on even in the event of a detectable plumbing failure, etc.).
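
(To make the "lock the I/O's, then test" sequence concrete, a minimal
sketch in C.  Everything here -- the names, the BIST entry point, the
valve API -- is invented for illustration, not an actual
implementation.)

    #include <stdbool.h>

    extern void valve_force(int valve_id, bool open);   /* latch an output */
    extern void valve_resume(int valve_id);             /* back to control */
    extern bool bist_run(void);                         /* true = all pass */
    extern void report_fault(const char *what);

    #define MAIN_SUPPLY   0
    #define IRRIGATION_1  1

    bool test_water_node(void)
    {
        /* 1. Park the I/O's where they can safely stay for the duration. */
        valve_force(MAIN_SUPPLY, true);     /* domestic water stays on */
        valve_force(IRRIGATION_1, false);   /* irrigation stays off */

        /* 2. Run the self-test; the node is "busy, testing" only here. */
        bool ok = bist_run();
        if (!ok)
            report_fault("water node BIST");

        /* 3. Hand the outputs back to the control loop. */
        valve_resume(MAIN_SUPPLY);
        valve_resume(IRRIGATION_1);
        return ok;
    }
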
As there are no "backups" for individual I/O's, I rely on runtime
testing to identify problems *before* they interfere with operation --
to give the user a "heads up" before the system encounters a failure
IN ITS INTENDED USE OF THAT FEATURE.

E.g., turn the security cameras on during the day, verify that the
images returned by each are "nominal" (i.e., the tree that used to
be in the center of the scene is still visible there).  If not, you
can alert the user before the system *requires* those cameras to be
operational (to perform their security monitoring functions).

Consider the irrigation system: it may be days or even weeks before
certain irrigation valves are "needed".  Yet, a wire could get cut or
shorted -- or a valve mechanically inoperative -- at any time while the
valve is "dormant".  Waiting to detect that problem until the system
decides that the valve *must* be energized means you're already too
late: why didn't you fix it two days earlier, when it *failed* (but
wasn't yet NEEDED)?

>> This is actually an amusing concept.  Ask folks when they consider their
>> ECC memory system to be "compromised" and you'll never get a firm answer.
>> E.g., how many bus errors do you consider as sufficient to leave you
>> wondering if the ECC is actually *detecting* all errors (let alone
>> *correcting* "some")?  How do you know that (detected) errors are
>> completely localized and have no other consequences?
>>
>> <shrug>
>>
>> In my case, I treat errors as indicative of a failure.  Most probably
>> something in the power conditioning and not a "wear" error in a device.
>> Leaving it unchecked will almost certainly result in more errors popping
>> up -- some of which I will likely NOT be able to detect.
>>
>> E.g., a POST error in DRAM causes me to fall back to recovery routines
>> that operate out of (internal) SRAM.  A failure in SRAM similarly causes
>> DRAM to be used to the exclusion of SRAM.  A failure in both means SoL!
>
> ... and our sponsor, Duct Tape, would like to remind you: in the long
> run, ALL solutions are temporary.

Permanent Temporary Fixes.

>> Regardless, in these degraded modes, the goal is only to *report* errors
>> and support some limited remote diagnostics -- not to attempt to *operate*
>> in the presence of a known problem.
>
> So mostly not high availability.  Seeking something beyond POST may
> just be overkill (except for that security system).

Availability is a relative concept.

If you wanted to flush a toilet and the water happened to be "off"
because that node was busy doing a self-test, it's not "the end of
the world"... but it would surely be annoying -- and NOTICEABLE.

Likewise, if someone came to the front door and the doorbell didn't
"ring" because *that* node happened to be running a memory test...

Or, missing an incoming telephone call while the phone system was
running diagnostics.

Or, someone opening (and then closing) a door to gain entry to the
premises while the node charged with watching those events was
"preoccupied" with testing.

Etc.

How many of these are "inconveniences" is debatable: if you went to
make a call with your cell phone and found it was "busy, testing",
SHIRLEY you *could* wait a bit while that testing finishes, right?  Are
*all* your phone calls so terribly urgent that they can't wait??
OTOH, even having to wait a second more than normal while it *aborts*
the test (and reloads the application) would probably be noticeable to
you ("Damn phone is ALWAYS 'testing'!").

My goal is to highly integrate this system with day-to-day living (or
"business", etc.).  As such, if "some" component is always (i.e.,
"often") claiming to be 'busy, testing', it can be counterproductive.
(Using the "splash screen" diversion to hide your activities wears
thin, quickly.)

So, being able to "hide" these sorts of activities in ways of which the
user is unaware becomes a significant design goal...
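
(A minimal sketch, in C, of what such "hidden" run-time RAM testing
might look like -- my illustration under assumed constraints, not Don's
actual code: test one small chunk per idle slice, preserving its
contents, so the node never announces 'busy, testing'.  It assumes the
chunk is not touched by an ISR while under test, and that the test
code, stack and save buffer live outside the region being tested.  A
chunk-at-a-time sweep catches stuck bits but not all address-line
faults, so it complements -- rather than replaces -- a full power-on
test.)

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    #define CHUNK_WORDS 64

    static uint32_t save[CHUNK_WORDS];  /* lives OUTSIDE the tested region */

    /* Pattern fill/verify over one chunk; false on first miscompare.
       (A true March test would also interleave reads and writes.) */
    static bool test_chunk(volatile uint32_t *p)
    {
        static const uint32_t patterns[] = { 0x00000000u, 0xFFFFFFFFu,
                                             0xAAAAAAAAu, 0x55555555u };
        bool ok = true;

        memcpy(save, (const void *)p, sizeof save);   /* preserve contents */

        for (size_t i = 0; ok && i < sizeof patterns / sizeof patterns[0]; i++) {
            for (size_t w = 0; w < CHUNK_WORDS; w++)
                p[w] = patterns[i];
            for (size_t w = 0; ok && w < CHUNK_WORDS; w++)
                ok = (p[w] == patterns[i]);
        }

        memcpy((void *)p, save, sizeof save);         /* node never notices */
        return ok;
    }

(Called from the idle loop -- with interrupts masked around each
chunk -- the whole array gets swept every few minutes with no
externally visible downtime.)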
Reply by ● June 11, 2015
On Wednesday, June 10, 2015 at 2:01:53 AM UTC-4, Don Y wrote:
> Hi Ed,
>
> On 6/9/2015 2:32 PM, Ed Prochak wrote:
>> On Friday, June 5, 2015 at 4:46:19 AM UTC-4, Don Y wrote:
>>> On 6/4/2015 11:42 PM, upsidedown@downunder.com wrote:
>>>> On Wed, 03 Jun 2015 01:15:09 -0700, Don Y <this@is.not.me.com> wrote:
>>>>
>>>>> The tougher issue is testing "live" memory in systems that are "up"
>>>>> 24/7/365...
>>>>
>>>> Of course, POST is done only once at the first (and hopefully the only
>>>> time) for a few decades.
>>>
>>> As stated elsewhere, individual nodes are powered up and down routinely
>>> within the normal operation of the system.  So, it is possible for POST
>>> _on_a_specific_node_ to be run often (i.e., as often as power is cycled
>>> to that particular node).
>>
>> Wait, WHAT?  The topic began with a 24/7/365 requirement and no mention
>> of distributed load-sharing systems.
>
> The *system* runs 24/7/365:
>    "... issue is testing 'live' memory in systems that are 'up' 24/7/365..."
>    ------------------------------------------^^^^^^^

But the "system" consists of multiple nodes.  The nodes are where the
memory that you want tested exists.  You can restart a node (triggering
a POST), so FOR THAT NODE (yes, the nodes are not totally
interchangeable) you do not need an elaborate run-time memory test
process.

>> If you can run POST "often", then what new policy are you really
>> looking for?
>
> I can't run POST on any particular (i.e., randomly/periodically chosen)
> portion of the system at any given time.  I can run POST (or BIST) on
> *certain* parts (nodes) of the system at *selected* times.
>
> E.g., if I am not presently "using water" (irrigation, domestic water,
> etc.), then the node that is responsible for monitoring and controlling
> water use can be commanded to run a POST (or BIST) -- after ensuring
> the I/O's are "locked" in some appropriate state(s).
>
> [For example, make sure the main water supply valve is "open",
> irrigation valves are "closed", etc.]

I never suggested the testing had to be random.  Periodic is fine, just
like your scheduled check-up on your car.  ("Hi Joe, I brought in the
car for its 60,000 mile service check.")

> Likewise, if the security cameras covering the back yard are not needed
> during daylight hours, those nodes can be powered up, tested, then
> powered down until they *will* be needed.

A lot of break-ins happen in daylight hours.  But you are muddying the
waters, I think.  Are you writing POST for DRAM in the camera?  Or in
the node that reads the video from the camera?

> Looking at different (smaller) time intervals, I can probably cheat and
> arrange for the HVAC node to be tested IMMEDIATELY AFTER the house has
> reached its heat/cool temperature -- on the assumption that the
> furnace/ACbrrr will *not* be needed in the N seconds/minutes that it
> takes for that test to complete (again, after first locking down the
> I/O's to some sort of "safe" state).

That's doable.  So, another node (or set of nodes?) that doesn't need
fancy run-time memory testing.

> OTOH, the database server is the sole repository of persistent data
> in the system.  Taking *it* offline means the system AS A WHOLE needs
> to be essentially quiescent -- there's no way for a node to inquire of
> settings, make changes to settings, respond to changes in the
> environment, etc. if the DB server is not responsive.

Then you designed a distributed system, but still have a single point
of failure for the system.  Why does the HVAC need to query the DB
continuously?
Even if the settings change periodically (different temps for different
times of day, different days, and even different seasons), that does
not mean the settings change minute by minute.  So what if the temp
setting changes at 5:05 PM instead of 5:00 PM?  How do you do DB
maintenance?

> [And, the DB server has gobs of resources so testing there tends to be
> far more time consuming.]

Agreed, but have you measured how long?

>> The security system must be 100% available, so there you use redundant
>> systems (maybe 3 with a voting protocol).  Power cycling one of the
>> three on a schedule doesn't seem too bad.  You want security with high
>> availability; you can't get by on the cheap.
>>
>> So I don't think I see the problem.
>
> I don't have any explicit redundancy in the system.  E.g., if the door
> camera bites the shed, it's gone.  No way to recover that lost
> functionality (without the user replacing the node).  Likewise, if the
> node that handles water usage/metering craps out, those functions are
> gone until the hardware is replaced (e.g., perhaps the water supply
> turns *off* when you'd like it to remain on; or, perhaps it is locked
> on even in the event of a detectable plumbing failure, etc.).

The topic was memory testing, not the I/O.  Don't muddy your own topic.

> As there are no "backups" for individual I/O's, I rely on runtime
> testing to identify problems *before* they interfere with operation --
> to give the user a "heads up" before the system encounters a failure
> IN ITS INTENDED USE OF THAT FEATURE.
>
> E.g., turn the security cameras on during the day, verify that the
> images returned by each are "nominal" (i.e., the tree that used to
> be in the center of the scene is still visible there).  If not, you
> can alert the user before the system *requires* those cameras to be
> operational (to perform their security monitoring functions).
>
> Consider the irrigation system: it may be days or even weeks before
> certain irrigation valves are "needed".  Yet, a wire could get cut or
> shorted -- or a valve mechanically inoperative -- at any time while
> the valve is "dormant".  Waiting to detect that problem until the
> system decides that the valve *must* be energized means you're already
> too late: why didn't you fix it two days earlier, when it *failed*
> (but wasn't yet NEEDED)?

The topic was memory testing, not the I/O.  Don't muddy your own topic.

Start a new thread for I/O, predictive maintenance.

>>> This is actually an amusing concept.  Ask folks when they consider their
>>> ECC memory system to be "compromised" and you'll never get a firm answer.
>>> E.g., how many bus errors do you consider as sufficient to leave you
>>> wondering if the ECC is actually *detecting* all errors (let alone
>>> *correcting* "some")?  How do you know that (detected) errors are
>>> completely localized and have no other consequences?
>>>
>>> <shrug>
>>>
>>> In my case, I treat errors as indicative of a failure.  Most probably
>>> something in the power conditioning and not a "wear" error in a device.
>>> Leaving it unchecked will almost certainly result in more errors popping
>>> up -- some of which I will likely NOT be able to detect.
>>>
>>> E.g., a POST error in DRAM causes me to fall back to recovery routines
>>> that operate out of (internal) SRAM.  A failure in SRAM similarly causes
>>> DRAM to be used to the exclusion of SRAM.  A failure in both means SoL!
>>
>> ... and our sponsor, Duct Tape, would like to remind you: in the long
>> run, ALL solutions are temporary.
>
> Permanent Temporary Fixes.
>
>>> Regardless, in these degraded modes, the goal is only to *report*
>>> errors and support some limited remote diagnostics -- not to attempt
>>> to *operate* in the presence of a known problem.
>>
>> So mostly not high availability.  Seeking something beyond POST may
>> just be overkill (except for that security system).
>
> Availability is a relative concept.

Well, actually you never provided your availability requirement other
than the vague 24/7/365 quip in the first post.

> If you wanted to flush a toilet and the water happened to be "off"
> because that node was busy doing a self-test, it's not "the end of
> the world"... but it would surely be annoying -- and NOTICEABLE.

But because there is water in the tank, it would work once.

> Likewise, if someone came to the front door and the doorbell didn't
> "ring" because *that* node happened to be running a memory test...

Then I think you have bigger system design problems.  (Over-engineering.)

> Or, missing an incoming telephone call while the phone system was
> running diagnostics.

That is another one, like security, where you cannot get by on the
cheap (single node).

> Or, someone opening (and then closing) a door to gain entry to the
> premises while the node charged with watching those events was
> "preoccupied" with testing.

100% availability requires some redundancy.

> Etc.
>
> How many of these are "inconveniences" is debatable: if you went to
> make a call with your cell phone and found it was "busy, testing",
> SHIRLEY you *could* wait a bit while that testing finishes, right?  Are
> *all* your phone calls so terribly urgent that they can't wait??

The debatable point is this: exactly what is the availability
requirement for each subsystem?  For example, the system I am working
on has a requirement that it is unavailable for clinical use less than
a small number of hours per year (not counting scheduled maintenance).

> OTOH, even having to wait a second more than normal while it *aborts*
> the test (and reloads the application) would probably be noticeable to
> you ("Damn phone is ALWAYS 'testing'!")
>
> My goal is to highly integrate this system with day-to-day living
> (or "business", etc.).  As such, if "some" component is always
> (i.e., "often") claiming to be 'busy, testing', it can be
> counterproductive.  (Using the "splash screen" diversion to hide
> your activities wears thin, quickly.)

You got that right!

> So, being able to "hide" these sorts of activities in ways of which the
> user is unaware becomes a significant design goal...

And needs to be addressed, but it is a different topic.

Have a great day.
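
[An illustrative calculation, with assumed numbers: a node that spends
30 seconds per day "busy, testing" is unavailable 30/86,400 ≈ 0.035% of
the time -- roughly 3 hours per year.  That is already the same order
as a few-hours-per-year clinical budget, before any *real* failure
occurs, which is why *when* a test runs can matter as much as how long
it takes.]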
Reply by ● June 11, 2015
On 6/11/2015 6:17 AM, Ed Prochak wrote:
[attrs elided]
>>>>>> The tougher issue is testing "live" memory in systems that are
>>>>>> "up" 24/7/365...
>>>>>
>>>>> Of course, POST is done only once at the first (and hopefully the
>>>>> only time) for a few decades.
>>>>
>>>> As stated elsewhere, individual nodes are powered up and down
>>>> routinely within the normal operation of the system.  So, it is
>>>> possible for POST _on_a_specific_node_ to be run often (i.e., as often
>>>> as power is cycled to that particular node).
>>>
>>> Wait, WHAT?  The topic began with a 24/7/365 requirement and no mention
>>> of distributed load-sharing systems.
>>
>> The *system* runs 24/7/365:
>>    "... issue is testing 'live' memory in systems that are 'up' 24/7/365..."
>>    ------------------------------------------^^^^^^^
>
> But the "system" consists of multiple nodes.  The nodes are where the
> memory that you want tested exists.  You can restart a node (triggering
> a POST), so FOR THAT NODE (yes, the nodes are not totally
> interchangeable) you do not need an elaborate run-time memory test
> process.

The node and all the I/O's that it services are unavailable during
POST.  As I said previously, POST wants to achieve a balance between
thoroughness and expediency -- any time spent *in* POST increases the
time before the node can be brought on-line for its normal operation.

BIST takes the attitude that testing is the operational mode of the
node -- so, like POST, the node's normal functions are not provided to
the system.

Run-time testing (of all components in a node) attempts to juggle both
criteria -- testing *and* operation.

>>> If you can run POST "often", then what new policy are you really
>>> looking for?
>>
>> I can't run POST on any particular (i.e., randomly/periodically chosen)
>> portion of the system at any given time.  I can run POST (or BIST) on
>> *certain* parts (nodes) of the system at *selected* times.
>>
>> E.g., if I am not presently "using water" (irrigation, domestic water,
>> etc.), then the node that is responsible for monitoring and controlling
>> water use can be commanded to run a POST (or BIST) -- after ensuring
>> the I/O's are "locked" in some appropriate state(s).
>>
>> [For example, make sure the main water supply valve is "open",
>> irrigation valves are "closed", etc.]
>
> I never suggested the testing had to be random.  Periodic is fine, just
> like your scheduled check-up on your car.  ("Hi Joe, I brought in the
> car for its 60,000 mile service check.")

But you (I) can't even guarantee any particular periodicity -- that was
the point of my "randomly/periodically" comment.  "Testing" is just
another workload that has to be scheduled based on its needs and
impositions on (portions of) the system.

>> Likewise, if the security cameras covering the back yard are not needed
>> during daylight hours, those nodes can be powered up, tested, then
>> powered down until they *will* be needed.
>
> A lot of break-ins happen in daylight hours.

Our back yard is protected and "supervised" -- threats would come from
the front of the building.  Some other homeowner (business owner) may
have the exact opposite set of circumstances.  As such, the "testing"
workload has to adapt to the other uses that each particular node is
called on to perform, as defined in *that* particular system (not
something that is known at compile-time).

> But you are muddying the waters, I think.  Are you writing POST for
> DRAM in the camera?
> Or in the node that reads the video from the camera?

I verify that the camera's functionality will be available to the
system.  This means:
- the PTZ mount will respond to motion commands
- the camera will deliver a "video signal"
- the video signal will represent the image of the "scene" before the
  camera (i.e., if there was a tree in the scene the last time the
  camera was verified as operational, that tree should still be there!)
- the memory into which that image will be analyzed (motion detection)
etc.

>> Looking at different (smaller) time intervals, I can probably cheat and
>> arrange for the HVAC node to be tested IMMEDIATELY AFTER the house has
>> reached its heat/cool temperature -- on the assumption that the
>> furnace/ACbrrr will *not* be needed in the N seconds/minutes that it
>> takes for that test to complete (again, after first locking down the
>> I/O's to some sort of "safe" state).
>
> That's doable.  So, another node (or set of nodes?) that doesn't need
> fancy run-time memory testing.
>
>> OTOH, the database server is the sole repository of persistent data in
>> the system.  Taking *it* offline means the system AS A WHOLE needs to
>> be essentially quiescent -- there's no way for a node to inquire of
>> settings, make changes to settings, respond to changes in the
>> environment, etc. if the DB server is not responsive.
>
> Then you designed a distributed system, but still have a single point
> of failure for the system.

The system degrades.  The functionality that user A considers important
may not be the same that user B desires.  If a user wants the DBMS to
be redundantly implemented, he adds another (or several) other
instances to the system.

I've put a lot of effort into eking out every last bit of *system*
reliability from the components as it degrades.  E.g., if external DRAM
dies, a node can degrade to a mode whereby its virtualized I/O's are
serviced by code running on some other node -- possibly one that was
powered up in response to the detected memory failure on that node!
OTOH, if the *user* considers that functionality to be "disposable",
then no other node need "sacrifice" resources to address that failure;
wait for the user to install a replacement!

> Why does the HVAC need to query the DB continuously?  Even if the
> settings change periodically (different temps for different times of
> day, different days, and even different seasons), that does not mean
> the settings change minute by minute.  So what if the temp setting
> changes at 5:05 PM instead of 5:00 PM?

The DB server is the ONLY source of persistent store in the system.  As
such, *everything* gets its marching orders (indirectly) from tables in
the DBMS.  And, as the HVAC *observes* conditions, the only place where
those observations can be *stored* is in the DBMS.

I.e., I don't say "set the temperature to X degrees at time T" but,
rather, "at time T, the user wants the temperature to be X" -- the
system sorts out what it has to do (and when) in order to achieve that
goal.  It does this by learning how the building reacts (e.g., to
outdoor conditions) and how the plant compensates (when commanded).

Additionally, if the HVAC node invokes a service (possibly on another
node) and that service requires something of the DBMS, then you also
have an indirect dependency relationship.  E.g., if the HVAC needs to
load the "evaporative cooling module" (a "module" being a piece of
code), that is fetched from the only PERSISTENT STORAGE in the system:
the DBMS.
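
(To make the "at time T, the user wants the temperature to be X"
distinction concrete -- a hypothetical sketch, not the actual schema or
controller: the record stores the *goal*; a controller, armed with a
learned model of the building, decides when to run the plant.  Heating
case only, for brevity.)

    #include <time.h>

    struct temp_goal {
        time_t when;        /* the moment the condition should HOLD */
        float  degrees_c;   /* desired temperature at that moment */
    };

    extern float  current_temp_c(void);            /* hypothetical sensor */
    extern double learned_degrees_per_hour(void);  /* learned plant authority */

    /* A controller, not a timer: run the plant NOW only if waiting any
       longer would miss the goal. */
    int plant_should_run(const struct temp_goal *g, time_t now)
    {
        double hours_left = difftime(g->when, now) / 3600.0;
        double deficit    = g->degrees_c - current_temp_c();

        return deficit > 0.0 &&
               deficit >= hours_left * learned_degrees_per_hour();
    }
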
If a node is "brought up", the code that runs *in* that node is similarly supplied by the DBMS ("ROMS" just contain bootstraps).> How do you do DB maintenance?The DB isn't visible to the user. Each application that needs access to some particular set of tables/relations accesses and maintains those. How do you maintain the data/tables you have in your product's *RAM*? (Ans: the producers and consumers of those data do the maintenance!)>> [And, the DB server has gobs of resources so testing there tends to be far >> more time consuming] > > Agreed, but have you measured how long?There's 16G of DRAM in the DBMS server along with gobs of spinning media. How long does it take to do a *comprehensive* test of your PC and its components?>>> The Security system musty be 100% available so there you use redundant >>> systems (maybe 3 with a voting protocol) Power cycling one of the three >>> on a schedule doesn't seem too bad. You want security with high >>> availability, you can't get by on the cheap. >>> >>> So I don't think I see the problem. >> >> I don't have any explicit redundancy in the system. E.g., if the door >> camera bites the shed, it's gone. No way to recover that lost >> functionality (without the user replacing the node). Likewise, if the >> node that handles water usage/metering craps out, those functions are gone >> until the hardware is replaced (e.g., perhaps the water supply turns *off* >> when you'd like it to remain on; or, perhaps it is locked on even in the >> event of a detectable plumbing failure, etc.). > > The topic was memory testing, not the I/O. Don't muddy your own topic.Each node is implemented as printed circuit boards. On those boards are components. Some of those components switch coils that gate the flow of water through pipes. Some of those components drive motors that position cameras. Some components sense temperature, humidity, etc. And, SOME STORE DATA (i.e., DRAM). Any component can fail! Testing I/O's is not a "special case" any more than testing *memory* is a "special case". The goal is to ensure the hardware can perform the tasks it will be asked to perform when called upon to perform them.>> As there are no "backups" for individual I/O's, I rely on runtime testing >> to identify problems *before* they interfere with operation -- to give the >> user a "head's up" before the system encounters a failure IN IT'S INTENDED >> USE OF THAT FEATURE. >> >> E.g., turn the security cameras on during the day, verify that the images >> returned by each are "nominal" (i.e., the tree that used to be in the >> center of the scene is still visible there). If not, you can alert the >> user before the system *requires* those cameras to be operational (to >> perform their security monitoring functions). >> >> Consider the irrigation system: it may be days or even weeks for certain >> irrigation valves to be "needed". Yet, a wire could get cut or shorted -- >> or a valve mechanically inoperative -- at any time while the valve is >> "dormant". Waiting to detect that problem until the system decides that >> the valve *must* be energized means you're already too late: why didn't >> you fix it two days earlier when it *failed* (but, wasn't yet NEEDED)? > > The topic was memory testing, not the I/O. Don't muddy your own topic. > > Start a new thread for I/O, predictive maintenance.It's the exact same issue! Components are components. Does a user care if the DRAM in his phone system died vs. a protection network from a lightning strike on the PSTN interface? 
As far as he is concerned, "My phone is broke!"  Letting him know he's
got a potential problem brewing BEFORE he is victimized by it makes for
a friendlier device.  Even if the remedial action he takes is to UNPLUG
the phone interface and connect a WE station set to the lines, directly!

>>>> This is actually an amusing concept.  Ask folks when they consider
>>>> their ECC memory system to be "compromised" and you'll never get a
>>>> firm answer.  E.g., how many bus errors do you consider as sufficient
>>>> to leave you wondering if the ECC is actually *detecting* all errors
>>>> (let alone *correcting* "some")?  How do you know that (detected)
>>>> errors are completely localized and have no other consequences?
>>>>
>>>> <shrug>
>>>>
>>>> In my case, I treat errors as indicative of a failure.  Most probably
>>>> something in the power conditioning and not a "wear" error in a
>>>> device.  Leaving it unchecked will almost certainly result in more
>>>> errors popping up -- some of which I will likely NOT be able to
>>>> detect.
>>>>
>>>> E.g., a POST error in DRAM causes me to fall back to recovery
>>>> routines that operate out of (internal) SRAM.  A failure in SRAM
>>>> similarly causes DRAM to be used to the exclusion of SRAM.  A failure
>>>> in both means SoL!
>>>
>>> ... and our sponsor, Duct Tape, would like to remind you: in the long
>>> run, ALL solutions are temporary.
>>
>> Permanent Temporary Fixes.
>>
>>>> Regardless, in these degraded modes, the goal is only to *report*
>>>> errors and support some limited remote diagnostics -- not to attempt
>>>> to *operate* in the presence of a known problem.
>>>
>>> So mostly not high availability.  Seeking something beyond POST may
>>> just be overkill (except for that security system).
>>
>> Availability is a relative concept.
>
> Well, actually you never provided your availability requirement other
> than the vague 24/7/365 quip in the first post.

Because that is something that the user defines.

I drive very little.  I can tolerate a vehicle being "down" for a week
at a time without noticeably impacting my lifestyle.  My neighbor
drives a *lot*!  He can't tolerate "several hours" without a vehicle
(and gets a loaner any time his car is in for *any* service -- even an
oil change!)

We have lots of citrus trees.  A failure in the irrigation system means
we'd have to drag out a garden hose and manually irrigate if we
couldn't get the system repaired in a few days.  My (other) neighbor
lets his fruit rot on the trees... if HIS irrigation system failed, he
wouldn't even notice!

>> If you wanted to flush a toilet and the water happened to be "off"
>> because that node was busy doing a self-test, it's not "the end of
>> the world"... but it would surely be annoying -- and NOTICEABLE.
>
> But because there is water in the tank, it would work once.

"If you wanted to wash your hands after going to the bathroom..."
"If you wanted to take a shower..."
"If you wanted to do laundry..."
"If you wanted a glass of drinking water..."
"If ..."

:>

>> Likewise, if someone came to the front door and the doorbell didn't
>> "ring" because *that* node happened to be running a memory test...
>
> Then I think you have bigger system design problems.  (Over-engineering.)

So, a doorbell should have DEDICATED wires, transformer and annunciator?
And, if the residents are *deaf*, they should install visual
annunciators in every room of the house (lest they not be able to see
the lamp flashing in the living room while they are located in one of
the bedrooms -- or *asleep*?).
And, if they happen to be out in the back yard, gardening?

If a semi-trailer shows up at a loading dock and "rings the bell" to
gain entry -- but there isn't a *dedicated* attendant just sitting
there all day waiting for deliveries -- should there be bells located
throughout the facility, in every place the attendant might happen to
be (bathroom, front office, stock room, etc.)?

OTOH, if a "system" can notice that "doorbell ring" and notify the
responsible party WHEREVER HE MAY BE, then there is no need to bother
everyone else in the facility with these events (like paging systems
in days of old).

>> Or, missing an incoming telephone call while the phone system was
>> running diagnostics.
>
> That is another one, like security, where you cannot get by on the
> cheap (single node).

But you *can* -- if you can run diagnostics while the node is still
providing its core functionality!  If you require the node to be power
cycled to enter POST -- or commanded to enter BIST -- then you leave
the system without that functionality even though there is no real
*failure* present.  (E.g., if that node *had* a genuine failure, then
you're SoL; but if it doesn't have a catastrophic failure yet is "busy,
testing", you don't want the system to behave as if that node was
"broken/unavailable".)

>> Or, someone opening (and then closing) a door to gain entry to the
>> premises while the node charged with watching those events was
>> "preoccupied" with testing.
>
> 100% availability requires some redundancy.

Do you *own* anything that guarantees 100% availability?  (Cell)phone?
Thermostat?  Vehicle?  PC?  Lightbulb?  Etc.  You tolerate some
potential risk to greatly offset added cost AND COMPLEXITY!

How many folks *don't* do regular backups on their PC's -- despite the
value of that content?  How many hours without power before the
perishables in your refrigerator (or freezer) are "suspect"?  How many
folks pull the failing batteries out of their smoke detectors
(potentially putting their lives at risk) just to silence the annoying
"dying battery" chirp?  Why not keep spare batteries on hand??

People make their own decisions as to where to spend their dollars and
risk.  We have a "wired" station set as a backup to the cordless
phones, and a cell phone as a backup to the land-line.  Yet, we can
still find ourselves without phone service depending on what sort of
"problem" manifests upstream from us.

>> Etc.
>>
>> How many of these are "inconveniences" is debatable: if you went to
>> make a call with your cell phone and found it was "busy, testing",
>> SHIRLEY you *could* wait a bit while that testing finishes, right?  Are
>> *all* your phone calls so terribly urgent that they can't wait??
>
> The debatable point is this: exactly what is the availability
> requirement for each subsystem?

That's up to the user.  It's impractical to offer a system of this
scale with every possible set of priorities, to address every possible
set of constraints that any *potential* user might envision.

Look around your house.  What "backup" do you have for your garage door
opener (imagine if it fails while you are *outside*)?  Doorbell?
Thermostat?  Irrigation system?  Furnace/ACbrrr?  Hot water heater?
TV?  "HiFi"?  Phone?  Alarm system?

All of these things *do* fail.  Yet, how many folks have a "hot spare"
on hand?  Or even a *cold* spare?

The difference is, "failures" are things that users can address -- even
if not desired: time to buy a new <whatever>.
Artificially induced "unavailability" ('busy, testing') has the
potential to be far more frequent than the once-in-a-product's-lifetime
"sorry, this is broken"!

> For example, the system I am working on has a requirement that it is
> unavailable for clinical use less than a small number of hours per
> year (not counting scheduled maintenance).

I have no scheduled maintenance.  Nodes are added by connecting them to
a switch.  Software is updated by adding entries to tables in the DBMS.
As nodes can and do come on-line and off-line regularly, changes and
enhancements seamlessly merge with the existing components.

Reliability is addressed by keeping spares -- for whatever YOU consider
to be important.  But, the system needs to be able to tell you when
those spares are (or may be) needed!  You don't want to watch a tree
start dropping fruit before you discover that the irrigation valve that
services that tree (or, perhaps, the entire irrigation controller!) is
malfunctioning.

>> OTOH, even having to wait a second more than normal while it *aborts*
>> the test (and reloads the application) would probably be noticeable to
>> you ("Damn phone is ALWAYS 'testing'!")
>>
>> My goal is to highly integrate this system with day-to-day living (or
>> "business", etc.).  As such, if "some" component is always (i.e.,
>> "often") claiming to be 'busy, testing', it can be counterproductive.
>> (Using the "splash screen" diversion to hide your activities wears
>> thin, quickly.)
>
> You got that right!
>
>> So, being able to "hide" these sorts of activities in ways of which the
>> user is unaware becomes a significant design goal...
>
> And needs to be addressed, but it is a different topic.

I disagree.  That's the point of run-time (memory, in this case)
testing!

> Have a great day.

Time for bed.