Software design for redundant systems

Started by Unknown January 14, 2005

Is there any good literature out there pertaining to software design of
redundant systems?  I have some ideas, but I am not sure if they are
adequate, or even correct.  I am thinking along the lines of separating
the redundancy logic from the business logic.  In today's world no
software application is an island unto itself.  Software applications
communicate with each other via some sort of IPC, or by accessing some
shared data (e.g. shared memory or files).  Also, software applications
start timers, and a lot of processing is triggered when those timers
expire.

I am thinking of maintaining the notion of redundancy state in the
supporting software, outside the business logic of the application.
What I mean specifically is this: I will create wrapper functions
around IPC system calls, IO calls and timer calls.  Inside those
wrapper functions, I will maintain the notion of redundancy state.  For
example, if the redundancy state is standby, the wrapper functions for
IPC will not send out any messages, the wrapper functions for shared
memory access will not access the shared memory, the wrapper functions
for file access will not access shared files and the wrapper functions
for timers will not set any timers.  The advantage that I see with this
approach is that the business logic of the application is completely
oblivious to the redundancy state.  When the redundancy state switches
to active, lo and behold all these wrapper functions are turned ON, and
they begin to work normally.
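
A minimal sketch of the wrapper idea, assuming Python purely for illustration.  The names RedundancyGate, guarded, send_message and start_timer are hypothetical, and the `sent` list merely stands in for a real IPC channel.  The business logic calls the wrappers unconditionally; only the wrappers consult the redundancy state.

```python
import functools
import threading
from enum import Enum


class RedundancyState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


class RedundancyGate:
    """Holds the redundancy state consulted by every wrapper (hypothetical helper)."""

    def __init__(self, state=RedundancyState.STANDBY):
        self._state = state
        self._lock = threading.Lock()

    def set_state(self, state):
        with self._lock:
            self._state = state

    def is_active(self):
        with self._lock:
            return self._state is RedundancyState.ACTIVE


def guarded(gate, default=None):
    """Decorator: perform the wrapped call only when active, else return `default`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not gate.is_active():
                return default  # standby: silently skip the side effect
            return func(*args, **kwargs)
        return wrapper
    return decorate


gate = RedundancyGate()
sent = []  # assumption: stands in for a real IPC send


@guarded(gate, default=False)
def send_message(payload):
    sent.append(payload)
    return True


@guarded(gate)
def start_timer(seconds, callback):
    timer = threading.Timer(seconds, callback)
    timer.daemon = True
    timer.start()
    return timer
```

On standby, send_message("x") returns False and nothing is sent; after gate.set_state(RedundancyState.ACTIVE) the same call goes through.  One consequence of this sketch is that a timer request dropped on standby is not replayed at switchover, so whatever needs those timers has to re-arm them.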

An alternate approach is to make a call to a function which returns
immediately on the active side, but blocks on the standby side, up
until the redundancy state changes to active.  While easier to
implement, the disadvantage of this approach is that upon switchover,
control will resume only from this point onwards.
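
The blocking alternative could look roughly like this, again assuming Python and hypothetical names (ActiveBarrier, business_logic).  It uses threading.Event so that the call returns immediately once the side is active and blocks otherwise.

```python
import threading


class ActiveBarrier:
    """Blocks callers until this side becomes active (hypothetical helper)."""

    def __init__(self):
        self._active = threading.Event()

    def become_active(self):
        # Called by the redundancy manager at switchover.
        self._active.set()

    def wait_until_active(self, timeout=None):
        # Returns True once active; False if the optional timeout expires first.
        return self._active.wait(timeout)


log = []  # records what the business logic did, for illustration only


def business_logic(barrier):
    barrier.wait_until_active()      # standby side parks here
    log.append("processing started")  # runs only after switchover
```

A standby thread running business_logic stays parked at wait_until_active() until become_active() is called, which illustrates the disadvantage above: everything after the barrier starts from scratch at switchover.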

No discussion on redundancy is complete without a discussion on data
synchronization and the need for checkpointing.  Data synchronization
of persistent data seems to be a lot easier than data synchronization
of memory-resident data.  In the former case we could potentially rely
on external utilities and operating system capabilities (e.g.
timestamps on files) to maintain this synchronization, using some
criterion (e.g. time-based or a number of updates).

For synchronization of memory-resident data, I have the following in
mind.  I "register" a certain region of process memory with a
"memory duplication service".  This service runs on the active and
standby side in its own thread.  Any data that is written anywhere in
this region of memory on the active side gets copied to the standby
side.  Of course the absolute memory addresses inside the two
instances of the application (active and standby) will be
different, but relative offsets within a registered region will be the
same (after all it is the same software that runs in both active and
standby mode).  To duplicate some data from active to standby, you
merely need to provide its offset from the beginning of the region and
its size.  If more than one region of memory is "registered", the
memory region identifier may also need to be provided.
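
A rough sketch of the (region identifier, offset, size) scheme, assuming Python.  RegionRegistry, encode_update, apply_update and the 10-byte header layout are all assumptions made for illustration.  The sketch also assumes the writer announces which bytes changed; detecting writes automatically and transporting the message between hosts are not shown.

```python
import struct

# Hypothetical wire header: region id (2 bytes), offset (4 bytes), size (4 bytes).
HEADER = struct.Struct("!HII")


class RegionRegistry:
    """Tracks registered memory regions on one side (active or standby)."""

    def __init__(self):
        self._regions = {}

    def register(self, region_id, size):
        # A bytearray stands in for a registered region of process memory.
        self._regions[region_id] = bytearray(size)
        return self._regions[region_id]

    def encode_update(self, region_id, offset, size):
        """Active side: package the bytes at [offset, offset + size) for sending."""
        region = self._regions[region_id]
        if offset < 0 or size < 0 or offset + size > len(region):
            raise ValueError("update falls outside the registered region")
        data = bytes(region[offset:offset + size])
        return HEADER.pack(region_id, offset, size) + data

    def apply_update(self, message):
        """Standby side: copy the payload to the same offset in the same region."""
        region_id, offset, size = HEADER.unpack_from(message)
        payload = message[HEADER.size:HEADER.size + size]
        region = self._regions[region_id]
        if len(payload) != size or offset + size > len(region):
            raise ValueError("malformed or out-of-range update")
        region[offset:offset + size] = payload
```

Both sides register the same region identifiers and sizes; the active side calls encode_update after each write and the standby side feeds the resulting message to apply_update.  No absolute addresses ever cross the wire, only offsets.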

I have tried to look far and wide to see if there are any standards for
redundancy management.  The only standard that I have found so far is
X.751 from ITU-T.  However, this standard only deals with the
management aspect of redundancy.  Unfortunately this
document reads like scripture: it is extremely cryptic and takes at
least a few readings before you get it.  For example it took me a long
time to realize that PRIMARY and SECONDARY are roles in the fallback
relationship, while BACKEDUP and BACKUP are roles in the backup
relationship.  I had initially assumed them to be synonymous.

To wrap up, I would appreciate it if someone could provide some software
strategies for building redundant systems.


> To wrap up, I would appreciate if someone could provide some software
> strategies for building redundant systems.
There are a number of design patterns that describe different strategies for redundancy. I would start by checking them out.