Reply by D Yuniskis April 4, 2011
Hi Shane,

On 3/31/2011 5:45 AM, Shane williams wrote:

[much elided]

> Hmm, ok, if the byte count goes wrong as well I guess it could - I
It can change -- or *not*! When you miss a start bit, all bets are off because your receiver is no longer in sync with your data. E.g., if you miss a start bit and are transmitting the value 0xFF (with no parity), then the line just looks COMPLETELY IDLE for one whole character time. OTOH, if you miss the start bit and are sending 0x55, then you could "receive" any of a number of different values in place of that 55...
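A minimal sketch of that effect, assuming 8N1 framing (start bit, 8 data bits LSB first, stop bit) and an idle (marking) line after the character. The function names are made up for illustration, and a real UART oversamples the line rather than seeing clean bits:

```python
def frame(byte):
    """8N1 frame: start bit (0), 8 data bits LSB first, stop bit (1)."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]


def send(byte, miss_start=False, trailing_idle=12):
    """Bits on the wire for one character, followed by an idle line."""
    bits = frame(byte)
    if miss_start:
        bits[0] = 1  # the receiver never sees the start bit
    return bits + [1] * trailing_idle


def receive(line):
    """Decode a bit stream the way a simple UART would.

    Returns a list of (value, stop_bit_ok) tuples.
    """
    out = []
    i = 0
    while i < len(line):
        if line[i] == 1:        # mark: keep hunting for a start bit
            i += 1
            continue
        if i + 10 > len(line):  # not enough bits left for a whole frame
            break
        data = line[i + 1:i + 9]
        value = sum(bit << n for n, bit in enumerate(data))
        out.append((value, line[i + 9] == 1))
        i += 10
    return out


print(receive(send(0xFF, miss_start=True)))  # nothing received: line looks idle
print([(hex(v), ok) for v, ok in receive(send(0x55, miss_start=True))])
```

With an idle line following it, the misframed 0x55 comes out as a different value with a perfectly good stop bit, so the UART itself reports no framing error. If another character follows immediately, its bits get shifted in instead and the result changes again.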
> didn't think of that. The ddcmp protocol actually has a 10 byte > header (can't remember if I mentioned this) with a separate crc for > the header. The count byte for the data is in the header. I suspect > the chance of it morphing into something valid would be pretty low in > our case - e.g. one particular byte must always have the value 0x01.
When you are operating in an environment in which errors are not The Exception, it is hard to make *any* assumptions.
> So the detection of 1,2 and 3 bit errors by crc16-ccitt doesn't allow > for start bit and stop bit errors? I never thought of that either.
Because the start and stop bits are "out of band" (unless missing one puts them *in* band -- for another character time!)
>>>>> However we may end up with 3 ports per node making it a collection of >>>>> rings or a mesh. The loading at the slowest baud rate is approx 10% >> >> Sorry, the subject wasn't clear in my question<:-( >> I mean, if you were to stick with the slowest rate, your >> "10%" number *suggests* you have lots of margin -- why >> push for a higher rate with the potential for more >> problems? > > To get faster request/ response - a shorter propagation delay.
But there are other ways to do that. E.g., passing along the message before it is completely received, etc.
>> So, for each ring, you WON'T receive a message until you have >> transmitted any previous message? Alternatively, you won't >> transmit a message until your receiver is finished? > > This is true for each uart.
But, is it true for each *ring*? I.e., in A->B->C->D->E-> you have stated that D won't be receiving while it is *sending* to E. This implies C won't be sending (to D) in this time. But, that doesn't preclude *B* from sending to C in this time! I.e., can there be more than one message circulating in each ring? If so, and the baud rate can be changed, how can you guarantee that messages don't start "rear ending" the ones ahead of them? I.e., if D downgrades its baudrate, any message that B is sending (to C) looks like it is "speeding"...
>> What prevents two messages from being "in a ring" at the same >> time (by accident)? I.e., without violating the above, it >> seems possible that node 18 can be sending to node 19 (while >> 19 is NOT sending to 20 and 17 is not sending to 18) at the >> same time that node 3 is sending to node 4 (while neither 2 >> nor 4 are actively transmitting). > > I don't follow this. It's not a bus. 18 and 19 can talk to each > other and no-one else hears.
See above.
>> Since this *seems* possible, how can you be sure one message >> doesn't get delayed slightly so that the second message ends >> up catching up to it? (i.e., node 23 has no way of knowing >> that node 24 is transmitting to 25 so 23 *could* start sending >> a message to 24 that 24 fails to notice -- in whole or in >> part -- because 24 is preoccupied with its outbound message) > > I have a feeling there's a misunderstanding here - not sure what > though.
See above. This is where the use of coins/tokens on a graph can be useful -- you can see how the messages can potentially interact with each other.
>> Number the nodes 1 - 10 (sequentially). >> The CW ring has 1 sending to 2, 2 sending to 3, ... 10 sending to 1. >> The CCW ring has 10 sending to 9, 9 sending to 8, ... 1 sending to 10. >> The nodes operate concurrently. > > Yes, they do. > >> So, assume 7 originates a message -- destined for 3. In the CW ring, >> it is routed as 7, 8, 9, 10, 1, 2, 3. In the CCW ring, it is routed >> (simultaneously) as 7, 6, 5, 4, 3. >> >> *If* it progresses node to node at the exact same rates in each >> ring (this isn't guaranteed but "close enough for gummit work"), >> then it arrives in 8 & 6 at the same time, 9 & 5, 10 & 4, 1 & 3, >> 2 & 2 (though different "rings"), 3 & 1, etc. (note I have assumed, >> here, that it continues around until reaching its originator... >> but, that's not important). > > ok - it actually dies at around about the 2&2 , 3&1 stage
So there is no way of a sender knowing that a recipient got a message intended for it?
>> Now, at node 9, if the CW ring decides that the baudrate needs to be >> changed and it thinks "now is a good time to do so" (because it has >> *just* passed its CW message on to node 10), that action effectively >> interrupts any traffic in the CW ring (until the other nodes make >> the similar baudrate adjustment in the CW direction). > > No, the baud rate between any two nodes is independent of any other > two nodes. I'm missing something here.
When the baud rate changes between two particular (adjacent) nodes, there is effectively a discontinuity introduced. As you said, "the baud rate between any two nodes is independent of any other two nodes" so other nodes can be talking at FASTER (or slower) speeds. The time it takes to pass a message between any two nodes can then vary. Time is universally shared among all nodes. If D->E runs at 1200 baud and all other nodes are running at 57600 baud, then a message from A can get to B and then to C and then ... in the time it takes D to push a similarly sized message out to *E*. I.e., C has no way of knowing if D is ready to *listen* to C, yet, since C has no way of knowing if D has finished transmitting to E. C can't rely on the time that was required for it to receive its incoming message (from B) being sufficient for D to have passed *its* message along!
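To put rough numbers on that, here is a small sketch. It assumes 10-bit characters, ignores per-node processing time, and uses the ~100 byte message size guessed at elsewhere in the thread; the function names are hypothetical:

```python
from fractions import Fraction

BITS_PER_CHAR = 10  # start + 8 data + stop


def hop_time(n_bytes, baud):
    """Exact time (in seconds) to clock n_bytes across one link."""
    return Fraction(n_bytes * BITS_PER_CHAR, baud)


def fast_hops_during_slow_hop(n_bytes, fast_baud, slow_baud):
    """How many fast hops fit into the time of a single slow hop."""
    return hop_time(n_bytes, slow_baud) // hop_time(n_bytes, fast_baud)


slow = hop_time(100, 1200)
fast = hop_time(100, 57600)
print(f"slow hop: {float(slow) * 1000:.1f} ms, fast hop: {float(fast) * 1000:.1f} ms")
print("fast hops per slow hop:", fast_hops_during_slow_hop(100, 57600, 1200))
```

In other words, while D is still pushing one message out at 1200 baud, a same-sized message upstream has had time for dozens of hops at 57600 baud, so something has to tell C to hold off.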
>>>> Regarding the "other" way to split the Tx&Rx... have the Tx >>>> always talk to the downstream neighbor and Rx the upstream >>>> IN THE SAME RING. In this case, changes to Tx+Rx baudrates >>>> apply only to a certain ring. So, you can change baudrate >>>> when it is convenient (temporally) for that *ring*. >> >>>> But, now the two rings are potentially operating at different >>>> rates. So, the "other" ring will eventually ALSO have to >>>> have its baudrate adjusted to match (or, pass different traffic) >> >>> I think there must be a misunderstanding somewhere - not sure where. >> >> You can wire two UARTs to give you two rings in TWO DIFFERENT WAYS. >> Look at a segment of the ring with three nodes:
>>
>>    ------> 1 AAAA 1 --------> 1 BBBB 1 --------> 1 CCCC 1 ------->
>>              AAAA               BBBB               CCCC
>>    <------ 2 AAAA 2 <-------- 2 BBBB 2 <-------- 2 CCCC 2 <-------
>>
>> vs.
>>
>>    ------> 1 AAAA 2 --------> 1 BBBB 2 --------> 1 CCCC 2 ------->
>>              AAAA               BBBB               CCCC
>>    <------ 1 AAAA 2 <-------- 1 BBBB 2 <-------- 1 CCCC 2 <-------
>>
>> where the numbers identify the UARTs associated with each signal. > > We do the second case, except sometimes they mis-wire it so that uart > 2 on B connects to uart 2 on C when it should connect to uart 1 on C - > but this doesn't matter (much) currently.
OK, so when you change the baudrate on a UART, you interrupt traffic in *both* rings between that node and its neighbor. E.g., when the *one* UART that connects B to C changes baudrate, then nothing can be flowing from B to C *or* C to B (i.e., *both* rings are involved)
>> [assume tx and rx baudrates are driven by the same baudrate generator >> so there is a BRG1 and BRG2 in each node] > > Yes, there is. > >> In the first case, when you change the baudrate of a UART at some >> particular node, the baudrate for that segment in *the* ring that >> the UART services (left-to-right ring vs right-to-left ring) changes. >> So, you must change *after* you have finished transmitting and you >> will no longer be able to receive until the node upstream from you >> also changes baudrate. >> >> In the second case, when you change the baudrate of a UART at some >> particular node, the baudrate for all communications with that >> particular neighbor (to the left or to the right) changes. So, >> *both* rings are "broken" until that neighbor makes the comparable >> change. > > Yes. > >> Look at each scenario and its consequences while messages are >> circulating (in both rings!). Changing data rates can be a very >> disruptive thing as it forces the ring(s) to be emptied; some >> minimum guaranteed quiescent period to provide a safety factor >> (that no messages are still in transit); the actual change >> to be effected; a quiescent period to ensure all nodes are >> at the new speed; *then* you can start up again. > > Why do all nodes have to be at the same speed?
They don't! But, if they aren't, then it's more difficult to ensure that messages don't "collide". I.e., if someone downstream from you starts operating at a slower rate, then messages that *you* are sending can end up "there" before it is ready for them. You *can* make this work. But, there are lots of ways it can *break*. That was Vladimir's point (elsewhere, up-thread). Especially if you are *expecting* to be operating (even temporarily) at the fringe of reliable communication!
>> While it sounds "child-like", you might find making a drawing >> and moving some coins (tokens) around the rings as if they were >> messages. It helps to picture what changes to the rings' >> operation you can make and *when*. >> >> Either try NOT to change baudrates *or* change them at times >> that can be determined a priori.
Reply by Shane williams April 1, 2011
On Apr 2, 7:18 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane, > > On 3/31/2011 4:49 AM, Shane williams wrote: > > >> Furthermore, if errors *are* The Exception, then you can consider > >> running the interface(s) in full duplex mode and starting to pass > >> a packet along to your successor *before* it is completely > >> received. This effectively reduces the size of the message (N) > >> in the above calculation. > > > Yep, we looked at full duplex but we need to allow 2 wire > > connections. We're also considering splitting the longer messages > > into shorter ones. > > I'm confused. Why do you think "full duplex" and "2 wire" are > contradictions?
I'm confused too. Perhaps I haven't explained well enough. We have two "logical rings" and one physical ring. So with your diagram of nodes A,B,C, there are two wires going from B to C and two completely separate wires going from B to A. Each device receives its own transmission. Between B and C, only one device can transmit at a time i.e. half duplex. B's transmissions to C are for the CCW ring and C's transmissions to B are for the CW ring.
> > I am not telling you to add or change any wiring/hardware. > Each node has two inputs (one for the CW ring and another for > the CCW ring) and two outputs (ditto). > > What I am saying is that you start propagating an incoming packet > *before* it is completely received! >
[snip]
> > You are running Tx and Rx at the same time but not really > in the traditional application of "full duplex". You overlap > transmission (propagation) of the message with its reception. >
The hardware transmits and receives a whole message at a time for us, which saves a lot of interrupt overhead, so we can't start transmitting early with the current hardware.
> >>> Some of the devices that connect to the multi-drop network are old and > >>> low-powered and a token ring was too much overhead at the time. Also > >>> we require redundancy which needs 4 wires for multi-drop but only 2 > >>> wires for the ring. > > >> How do you mean "4 wires" vs. "2 wires"? You can run a 485 network > >> with a single differential pair. A 232-ish approach requires a Tx > >> and Rx conductor (for each ring). So, you could implement two > >> 485 busses for the same conductor count as your dual UART rings. > > > With the ring, the system still operates when there is a break or > > short somewhere. > > With a SINGLE RING, you can't make that claim -- since there > is no way to get messages "across" the break. I.e., if there > is a break between nodes 3 & 4, then 1 can talk to 2 and 2 can > talk to 3 -- but 3 can't talk to 4 *and* 2 can't REPLY to 1 > nor can 3 reply to 2 (or 1), etc.
Messages go in both directions around the ring even though it's only 2 wire. If there's a break between 3 and 4, messages still get from 3 to 4 because they go round the other way as well. Every message placed on the ring goes in both the CW and CCW directions, except when two nodes are having a conversation with each other.
> > You only get continued operation if *both* rings are "wired" and > a break is confined to a single ring (you can support some breaks > in *both* rings if you allow messages to transit from one ring to > the other -- but this gets hokey) > > > With multidrop and 2 wires, a short takes down the whole bus > > It takes out that *one* bus. But, you have a second -- using > a second pair of conductors (same number of wires that your > dual ring requires!)
No, our "dual" ring only requires 2 wires.
> > > so we have 4 wires between devices with each pair supposed > > to be routed on a different path. > > You can run the second multidrop bus "on a different path" > just as well as you can run the CCW ring's cabling on that > same "different path". I don't see why you think multidrop > is more vulnerable or takes more wires/hardware? > > > The signal is duplicated on both > > pairs of wire, except for when we're checking the integrity of each > > pair individually. > > Huh???
Sorry, we have a special board that we drive from a single uart. It duplicates the tx onto each pair of wires, and for rx it combines the two signals to give the uart rx a single character stream.
Reply by D Yuniskis April 1, 2011
Hi Shane,

On 3/31/2011 4:49 AM, Shane williams wrote:
> On Mar 31, 7:14 pm, D Yuniskis<not.going.to...@seen.com> wrote:
>> If we assume N is size of packet (in "characters"), then RTT is >> 64 * [(N / 5760) + P] where P is time spent processing message >> packet on each node before passing it along. >> >> 1.8s/64 = [(N / 5760) + P] = ~30ms >> >> Guessing at a message size of ~100 bytes (characters) suggests the >> "transmission time" component of this is ~17ms -- leaving 13ms as >> a guess at the processing time, P. > > No. The request/ response we need to speed up has perhaps 30 bytes > out-going and 200 bytes returned. > I estimated 10 milliseconds per node for the out-going message and 40 > ms for the return message. 32 times 10 plus 32 times 40 is approx 1.6 > seconds plus some extra processing time at each end. My estimate is > probably a little bit low.
As message size goes up (transmission time increases), you have all the more incentive to pass the message along before it is completely received. If, e.g., you can reduce the effective "hold-over" time at each node to 10 bytes (from 200), then you can trim 30ms from that 40 you have estimated (190 bytes at 5760 bytes/sec). Since this savings happens at each node, your RTT drops by almost a second (30 ms * 32 nodes).
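The same arithmetic as a quick sketch, using the thread's figures (5760 characters/sec, a 200 byte reply, a 10 byte hold-over, 32 nodes). The function names are hypothetical, and processing time is left out:

```python
CHARS_PER_SEC = 5760  # 57600 baud with 10-bit characters


def hold_time_ms(n_bytes):
    """Time a node spends waiting for n_bytes to arrive, in milliseconds."""
    return 1000.0 * n_bytes / CHARS_PER_SEC


def rtt_savings_s(msg_bytes, held_bytes, nodes):
    """Total time saved when each node holds held_bytes instead of msg_bytes."""
    per_node_ms = hold_time_ms(msg_bytes) - hold_time_ms(held_bytes)
    return nodes * per_node_ms / 1000.0


print(f"per node: {hold_time_ms(200) - hold_time_ms(10):.1f} ms")
print(f"over 32 nodes: {rtt_savings_s(200, 10, 32):.2f} s")
```

Unrounded, the per-node saving is a little over 30 ms, so the total lands right around one second.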
>> If you cut this to ~0, then you achieve your 1 sec goal (almost >> exactly). >> >> This suggests that eliminating/simplifying any error detection >> so that incoming messages can *easily* be propagated is a goal >> to pursue. If you can improve the reliability of the comm link >> so that errors are the *exception* (i.e., unexpected), then >> you can simplify the effort required to "handle" those errors. >> >> Furthermore, if errors *are* The Exception, then you can consider >> running the interface(s) in full duplex mode and starting to pass >> a packet along to your successor *before* it is completely >> received. This effectively reduces the size of the message (N) >> in the above calculation. > > Yep, we looked at full duplex but we need to allow 2 wire > connections. We're also considering splitting the longer messages > into shorter ones.
I'm confused. Why do you think "full duplex" and "2 wire" are contradictions? I am not telling you to add or change any wiring/hardware. Each node has two inputs (one for the CW ring and another for the CCW ring) and two outputs (ditto).

What I am saying is that you start propagating an incoming packet *before* it is completely received! So, instead of (effectively):

    count = 0
    do {
        buffer[count++] = get_byte()
    } while (count < MESSAGE_SIZE)
    // have now gobbled up entire incoming message!
    if (message_is_for_me(buffer)) {
        process_message(buffer, MESSAGE_SIZE)
    } else {
        // not for me so pass the whole message on...
        transmit_message(buffer, MESSAGE_SIZE)
    }

do something like:

    count = 0
    do {
        buffer[count++] = get_byte()
    } while (count < HEADER_SIZE)
    // now have JUST the header/routing portion of the message
    if (message_is_for_me(buffer)) {
        // gather up the rest of this message as it belongs to me!
        do {
            buffer[count++] = get_byte()
        } while (count < MESSAGE_SIZE)
        // have now gobbled up entire incoming message so deal with it!
        process_message(buffer, MESSAGE_SIZE)
    } else {
        // not intended for me so pass what I have, so far, along
        transmit_message(buffer, HEADER_SIZE)
        do {
            // and, pass along each subsequent byte as it is received
            transmit(get_byte())
        } while (++count < MESSAGE_SIZE)
    }

[this is written poorly just to illustrate what should be happening]

You are running Tx and Rx at the same time but not really in the traditional application of "full duplex". You overlap transmission (propagation) of the message with its reception.
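For anyone who wants to play with the idea, here is a runnable sketch of the second variant. It is only an illustration under assumptions not stated in the thread: a fixed message length, a destination address in byte 0 of the header, and made-up sizes and names (`relay`, `HEADER_SIZE`, `MESSAGE_SIZE`):

```python
HEADER_SIZE = 4    # hypothetical header length
MESSAGE_SIZE = 12  # hypothetical fixed message length


def relay(rx, my_address, transmit, process):
    """Handle one message arriving on the byte iterator rx.

    Byte 0 of the header is assumed to hold the destination address.
    """
    # Hold back only the header/routing portion.
    header = bytes(next(rx) for _ in range(HEADER_SIZE))
    if header[0] == my_address:
        # Mine: gather the rest, then deal with the whole message.
        body = bytes(next(rx) for _ in range(MESSAGE_SIZE - HEADER_SIZE))
        process(header + body)
    else:
        # Not mine: pass along what has arrived so far...
        for b in header:
            transmit(b)
        # ...then forward each later byte as soon as it is received.
        for _ in range(MESSAGE_SIZE - HEADER_SIZE):
            transmit(next(rx))


if __name__ == "__main__":
    message = bytes([200]) + bytes(range(1, MESSAGE_SIZE))
    forwarded = []
    relay(iter(message), my_address=3,
          transmit=forwarded.append, process=print)
    print(bytes(forwarded) == message)
```

The point to notice is in the `else` branch: the first byte goes out after only `HEADER_SIZE` bytes have come in, so the hold-over per node no longer grows with the message length.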
>> E.g., if you can hold just *10* bytes of the message before >> deciding to pass it along, then the "transmission time" >> component drops to 1.7ms. Your RTT is then 0.1 sec!
>>>>> We actually already do have an RS485 multi-drop version of this >>>>> protocol but it's non-deterministic and doesn't work very well. I >>>>> don't really want to go into that... >> >>>> It's relatively easy to get deterministic behavior from a 485 >>>> deployment. And, depending on the *actual* operating conditions >>>> of the current ring implementation, could probably achieve >>>> lower latencies at lower baudrates. >> >>> Some of the devices that connect to the multi-drop network are old and >>> low-powered and a token ring was too much overhead at the time. Also >>> we require redundancy which needs 4 wires for multi-drop but only 2 >>> wires for the ring. >> >> How do you mean "4 wires" vs. "2 wires"? You can run a 485 network >> with a single differential pair. A 232-ish approach requires a Tx >> and Rx conductor (for each ring). So, you could implement two >> 485 busses for the same conductor count as your dual UART rings. > > With the ring, the system still operates when there is a break or > short somewhere.
With a SINGLE RING, you can't make that claim -- since there is no way to get messages "across" the break. I.e., if there is a break between nodes 3 & 4, then 1 can talk to 2 and 2 can talk to 3 -- but 3 can't talk to 4 *and* 2 can't REPLY to 1 nor can 3 reply to 2 (or 1), etc. You only get continued operation if *both* rings are "wired" and a break is confined to a single ring (you can support some breaks in *both* rings if you allow messages to transit from one ring to the other -- but this gets hokey)
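A small sketch of the reachability argument, assuming nodes numbered 1..N, one link cut in both directions, and messages that are simply passed from neighbor to neighbor. The function name and its parameters are hypothetical:

```python
def reachable(src, n_nodes, broken_link, directions):
    """Nodes that a message from src can reach on a ring of n_nodes.

    broken_link is a pair of adjacent nodes whose link is cut.
    directions lists the rings in use: +1 for CW, -1 for CCW.
    """
    cut = frozenset(broken_link)
    seen = set()
    for step in directions:
        node = src
        for _ in range(n_nodes - 1):
            nxt = (node - 1 + step) % n_nodes + 1
            if frozenset((node, nxt)) == cut:
                break  # the message cannot cross the break
            seen.add(nxt)
            node = nxt
    return seen


# Single (CW) ring, break between 3 and 4:
print(sorted(reachable(1, 10, (3, 4), [+1])))      # 1 reaches only 2 and 3
print(sorted(reachable(2, 10, (3, 4), [+1])))      # 2 cannot get back to 1
# Both directions in use, same break:
print(sorted(reachable(3, 10, (3, 4), [+1, -1])))  # 3 reaches everyone
```

With only one direction available, nodes on the far side of the break are cut off and replies can't return; with both directions, a single break still leaves every node reachable the long way around.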
> With multidrop and 2 wires, a short takes down the whole bus
It takes out that *one* bus. But, you have a second -- using a second pair of conductors (same number of wires that your dual ring requires!)
> so we have 4 wires between devices with each pair supposed > to be routed on a different path.
You can run the second multidrop bus "on a different path" just as well as you can run the CCW ring's cabling on that same "different path". I don't see why you think multidrop is more vulnerable or takes more wires/hardware?
> The signal is duplicated on both > pairs of wire, except for when we're checking the integrity of each > pair individually.
Huh???
Reply by ChrisQ April 1, 2011
Shane williams wrote:

> > Off the top of your head, do you have any idea what the execution time > to do huffman compression of 200 bytes of text would be on the 32C87? >
We did profile the code at one stage, using a scope on a port line that was triggered on entry and exit from the function, but don't have the results to hand, only that it was fast enough... Regards, Chris
Reply by Shane williams April 1, 2011
On Mar 31, 5:26 am, ChrisQ <m...@devnull.com> wrote:
> Shane williams wrote:
> > I don't know much about compression but it sounds too CPU intensive > > for the 68302? What micro are you using? > > The project used the Renesas 32C87 series from Hitachi. Not such an > elegant arch as 68k, but almost certainly faster than a '302. Depending > on the data, simple compression like huffman encoding can work quite > well, but another way might be to simplify / reorganise the frame format > or data within it, so you can send fewer bytes... >
Off the top of your head, do you have any idea what the execution time to do huffman compression of 200 bytes of text would be on the 32C87?
Reply by Shane williams March 31, 2011
On Mar 31, 6:14 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane, > > On 3/30/2011 5:12 AM, Shane williams wrote: > > > >> E.g., a start bit error is considerably different than a > >> *data* bit error (think about it). > > > Hmm. I forgot about that. A start or stop bit error means the whole > > message is rejected which is good. > > My point was that if you *miss* a start bit, then you have -- at > the very least -- missed the "first" bit of the message (because, > if it was MARKING, the UART just ignored it and, if it was SPACING, > the UART thought *it* was the start bit). If you are pushing > bytes (characters) down the wire at the maximum data rate (minimal > stop time between characters), then you run the risk of part of > the *next* character being "shifted" into this "misaligned" first > character. I.e., it gets really difficult to figure out *if* > your code will be able to detect an error (because the received > byte "looks wrong") or if, BY CHANCE, the bit patterns can conspire > to look like a valid "something else".
Hmm, ok, if the byte count goes wrong as well I guess it could - I didn't think of that. The ddcmp protocol actually has a 10 byte header (can't remember if I mentioned this) with a separate crc for the header. The count byte for the data is in the header. I suspect the chance of it morphing into something valid would be pretty low in our case - e.g. one particular byte must always have the value 0x01. So the detection of 1,2 and 3 bit errors by crc16-ccitt doesn't allow for start bit and stop bit errors? I never thought of that either.
> > >>> However we may end up with 3 ports per node making it a collection of > >>> rings or a mesh. The loading at the slowest baud rate is approx 10% > > >> [scratches head] then why are you worrying about running at > >> a higher rate? > > > Because not all sites can wire as a mesh. The third port is optional > > but helps the propagation delay a lot. > > Sorry, the subject wasn't clear in my question <:-( > I mean, if you were to stick with the slowest rate, your > "10%" number *suggests* you have lots of margin -- why > push for a higher rate with the potential for more > problems?
To get faster request/ response - a shorter propagation delay.
> > >> Latency might be a reason -- assuming you > >> don't circulate messages effectively as they pass *through* > >> a node. But, recall that you only have to pass through > >> 32 nodes, worst case, to get *a* copy of a message to any > >> other node... > > >>> for 64 nodes. If we decide to allow mixed baud rates, each node will > >>> have the ability to tell its adjacent nodes to slow down when its > >>> message queue gets to a certain level, allowing it to cope with a > >>> brief surge in messages. > > >> Depending on how you chose to allocate the Tx&Rx devices in each > >> link -- and, whether or not your baudrate generator allows > >> the Tx and Rx to run at different baudrates -- you have to: > >> * make sure your Tx FIFO (hardware and software) is empty before > >>   changing Tx baudrate > >> * make sure your "neighbor" isn't sending data to you when you > >>   change your Rx baudrate (!) > > > This is assured. It's half duplex and the hardware sends a whole > > message at a time. > > So, for each ring, you WON'T receive a message until you have > transmitted any previous message? Alternatively, you won't > transmit a message until your receiver is finished?
This is true for each uart.
> > What prevents two messages from being "in a ring" at the same > time (by accident)? I.e., without violating the above, it > seems possible that node 18 can be sending to node 19 (while > 19 is NOT sending to 20 and 17 is not sending to 18) at the > same time that node 3 is sending to node 4 (while neither 2 > nor 4 are actively transmitting).
I don't follow this. It's not a bus. 18 and 19 can talk to each other and no-one else hears.
> > Since this *seems* possible, how can you be sure one message > doesn't get delayed slightly so that the second message ends > up catching up to it? (i.e., node 23 has no way of knowing > that node 24 is transmitting to 25 so 23 *could* start sending > a message to 24 that 24 fails to notice -- in whole or in > part -- because 24 is preoccupied with its outbound message)
I have a feeling there's a misunderstanding here - not sure what though.
> > >> Consider that a link (a connection to *a* neighbor) that "gives you > >> problems" will probably (?) cause problems in all communications > >> with that neighbor (Tx & Rx). So, you probably want to tie the > >> Tx and Rx channels of *one* device to that neighbor (vs. splitting > >> the Rx with the upstream and Tx with the downstream IN A GIVEN RING) > > > Not sure I follow but a single uart does both the tx and rx to the > > same neighbor. > > >> [this may seem intuitive -- or not! For the *other* case, see end] > > >> Now, when you change the Rx baudrate for the upstream CW neighbor, > >> you are also (?) changing the Tx baudrate for the downstream CCW > >> neighbor (the "neighbor" is the same physical node in each case). > > > Yes > > >> Also, you have to consider if you will be changing the baudrate > >> for the "other" ring simultaneously (so you have to consider the > >> RTT in your switching calculations). > > > What is RTT? > > Round Trip Time (sorry :< ) I.e., you (each of your nodes) has > to be aware of the time it takes a message to (hopefully) make > it around the ring. > > >> Chances are (bet dollars to donuts?), the two rings are in different > >> points of their message exchange (since the distance from message > >> originator to that particular node is different in the CW ring > >> vs. the CCW ring). I.e., this may be a convenient time to change > >> the baudrate (thereby INTERRUPTING the flow of data around the ring) > >> for the CW ring -- but, probably *not* for the CCW ring. > > > I'm lost here. > > Number the nodes 1 - 10 (sequentially). > The CW ring has 1 sending to 2, 2 sending to 3, ... 10 sending to 1. > The CCW ring has 10 sending to 9, 9 sending to 8, ... 1 sending to 10. > The nodes operate concurrently.
Yes, they do.
> > So, assume 7 originates a message -- destined for 3. In the CW ring, > it is routed as 7, 8, 9, 10, 1, 2, 3. In the CCW ring, it is routed > (simultaneously) as 7, 6, 5, 4, 3. > > *If* it progresses node to node at the exact same rates in each > ring (this isn't guaranteed but "close enough for gummit work"), > then it arrives in 8 & 6 at the same time, 9 & 5, 10 & 4, 1 & 3, > 2 & 2 (though different "rings"), 3 & 1, etc. (note I have assumed, > here, that it continues around until reaching its originator... > but, that's not important).
ok - it actually dies at around about the 2&2 , 3&1 stage
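The 10-node walk-through quoted above can be reproduced with a few lines. This is only a sketch that assumes both rings advance one hop per time step; the function names are made up:

```python
def route(src, dst, n_nodes, step):
    """Nodes visited going from src to dst; step is +1 (CW) or -1 (CCW)."""
    path = [src]
    while path[-1] != dst:
        path.append((path[-1] - 1 + step) % n_nodes + 1)
    return path


def lockstep(src, n_nodes, hops):
    """(CW node, CCW node) holding the message after each of the first hops."""
    return [((src - 1 + k) % n_nodes + 1, (src - 1 - k) % n_nodes + 1)
            for k in range(1, hops + 1)]


print(route(7, 3, 10, +1))  # CW path from 7 to 3
print(route(7, 3, 10, -1))  # CCW path from 7 to 3
print(lockstep(7, 10, 6))   # where the two copies are after each hop
```

The two copies meet at the node directly opposite the originator (node 2 here), which is about where the message is said to die.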
> > Now, at node 9, if the CW ring decides that the baudrate needs to be > changed and it thinks "now is a good time to do so" (because it has > *just* passed its CW message on to node 10), that action effectively > interrupts any traffic in the CW ring (until the other nodes make > the similar baudrate adjustment in the CW direction).
No, the baud rate between any two nodes is independent of any other two nodes. I'm missing something here.
> > But, there is a message circulating in the CCW ring -- it was just > transmitted from node 5 to 4 (while 9 was sending to 10). It will > eventually be routed to node 9 as it continues its way around the > CCW ring. But, *it* is moving at the original baudrate (in the CCW > ring) while node 9 is now operating at the *new* baudrate (in the > CW ring). So, any new traffic in the CW ring will run around > that ring at a different rate than the CCW traffic. If you only > allow one message to be active in each ring at any given time, then > this will "resolve itself" one RTT later. But, if the "other" > ring never decides to change baudrates... ? > > And, if it *does* change baudrates at the same time as the "first" > ring, then you have to wait for the CW message to have been > completely propagated *and* the CCW message as well before making > the change. I.e., you have to let both rings go idle before > risking the switch (or, take considerable care to ensure that > a switch doesn't happen DOWNstream of a circulating message) > > >> [recall, changing baudrate is probably going to result in lots > >> of errors for the communications to/from the affected neighbor(s)] > > >> So, you really have to wait for the entire ring to become idle > >> before you change baudrates -- and then must have all nodes do > >> so more or less concurrently (for that ring). If you've split the > >> Tx and Rx like I described, then this must also happen on the > >> "other" ring at the same time. > > >> Regarding the "other" way to split the Tx&Rx... have the Tx > >> always talk to the downstream neighbor and Rx the upstream > >> IN THE SAME RING. In this case, changes to Tx+Rx baudrates > >> apply only to a certain ring. So, you can change baudrate > >> when it is convenient (temporally) for that *ring*. > > >> But, now the two rings are potentially operating at different > >> rates. So, the "other" ring will eventually ALSO have to > >> have its baudrate adjusted to match (or, pass different traffic) > > > I think there must be a misunderstanding somewhere - not sure where. > > You can wire two UARTs to give you two rings in TWO DIFFERENT WAYS. > Look at a segment of the ring with three nodes:
>
>    ------> 1 AAAA 1 --------> 1 BBBB 1 --------> 1 CCCC 1 ------->
>              AAAA               BBBB               CCCC
>    <------ 2 AAAA 2 <-------- 2 BBBB 2 <-------- 2 CCCC 2 <-------
>
> vs.
>
>    ------> 1 AAAA 2 --------> 1 BBBB 2 --------> 1 CCCC 2 ------->
>              AAAA               BBBB               CCCC
>    <------ 1 AAAA 2 <-------- 1 BBBB 2 <-------- 1 CCCC 2 <-------
>
> where the numbers identify the UARTs associated with each signal.
We do the second case, except sometimes they mis-wire it so that uart 2 on B connects to uart 2 on C when it should connect to uart 1 on C - but this doesn't matter (much) currently.
> > [assume tx and rx baudrates are driven by the same baudrate generator > so there is a BRG1 and BRG2 in each node]
Yes, there is.
> > In the first case, when you change the baudrate of a UART at some > particular node, the baudrate for that segment in *the* ring that > the UART services (left-to-right ring vs right-to-left ring) changes. > So, you must change *after* you have finished transmitting and you > will no longer be able to receive until the node upstream from you > also changes baudrate. > > In the second case, when you change the baudrate of a UART at some > particular node, the baudrate for all communications with that > particular neighbor (to the left or to the right) changes. So, > *both* rings are "broken" until that neighbor makes the comparable > change.
Yes.
> > Look at each scenario and its consequences while messages are > circulating (in both rings!). Changing data rates can be a very > disruptive thing as it forces the ring(s) to be emptied; some > minimum guaranteed quiescent period to provide a safety factor > (that no messages are still in transit); the actual change > to be effected; a quiescent period to ensure all nodes are > at the new speed; *then* you can start up again.
Why do all nodes have to be at the same speed?
> > While it sounds "child-like", you might try making a drawing > and moving some coins (tokens) around the rings as if they were > messages. It helps to picture what changes to the rings' > operation you can make and *when*. > > Either try NOT to change baudrates *or* change them at times > that can be determined a priori.
Reply by Shane williams March 31, 2011
On Mar 31, 7:14 pm, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane, > > On 3/30/2011 3:56 PM, Shane williams wrote: > > [8<] > > >>> I'm trying to improve the propagation delay of messages around the > > >> What sort of times are you seeing, presently? At which baudrates? > >> How much *better* do they need to be (or, would you *like* them > >> to be)? > > > We've limited the ring to 32 nodes up till now at 57600 baud. The > > nominal target now is 64 nodes for which I've calculated a worst case > > request response time of approx 1.8 seconds with no retries. I would > > like it to be about one second. > > OK, back-of-napkin guesstimates... > > Assume 10b character frames transmitted "flat out" (no time between > end of stop bit and beginning of next start bit). So, 5760 characters > per second is data rate.
Yes.
> > If we assume N is size of packet (in "characters"), then RTT is > 64 * [(N / 5760) + P] where P is time spent processing message > packet on each node before passing it along. > > 1.8s/64 = [(N / 5760) + P] = ~30ms > > Guessing at a message size of ~100 bytes (characters) suggests the > "transmission time" component of this is ~17ms -- leaving 13ms as > a guess at the processing time, P.
No. The request/ response we need to speed up has perhaps 30 bytes out-going and 200 bytes returned. I estimated 10 milliseconds per node for the out-going message and 40 ms for the return message. 32 times 10 plus 32 times 40 is approx 1.6 seconds plus some extra processing time at each end. My estimate is probably a little bit low.
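The arithmetic in this estimate can be checked with a short sketch. The per-node figures (10 ms and 40 ms), the hop count of 32, and the message sizes (about 30 and 200 bytes) come from the post above; the 10-bit character frame is the assumption made earlier in the thread, and the function names are invented here for illustration.

```python
BITS_PER_CHAR = 10   # start + 8 data + stop, as assumed earlier in the thread
BAUD = 57600


def wire_time_s(chars, baud=BAUD, bits_per_char=BITS_PER_CHAR):
    """Seconds needed just to clock `chars` characters onto the line."""
    return chars * bits_per_char / baud


def request_response_s(hops, outbound_per_node_s, return_per_node_s):
    """Total time for a request to travel `hops` nodes out and the reply back."""
    return hops * outbound_per_node_s + hops * return_per_node_s


if __name__ == "__main__":
    total = request_response_s(32, 0.010, 0.040)
    print(f"estimated request/response time: {total:.2f} s")
    print(f"wire time for  30 chars: {wire_time_s(30) * 1000:.1f} ms")
    print(f"wire time for 200 chars: {wire_time_s(200) * 1000:.1f} ms")
```

Under these assumptions the raw wire time is about 5.2 ms of the 10 ms outbound figure and about 34.7 ms of the 40 ms return figure, so the estimate leaves roughly 5 ms per node for everything other than clocking the bits out.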
> > If you cut this to ~0, then you achieve your 1 sec goal (almost > exactly). > > This suggests that eliminating/simplifying any error detection > so that incoming messages can *easily* be propagated is a goal > to pursue. If you can improve the reliability of the comm link > so that errors are the *exception* (i.e., unexpected), then > you can simplify the effort required to "handle" those errors. > > Furthermore, if errors *are* The Exception, then you can consider > running the interface(s) in full duplex mode and starting to pass > a packet along to your successor *before* it is completely > received. This effectively reduces the size of the message (N) > in the above calculation.
Yep, we looked at full duplex but we need to allow 2 wire connections. We're also considering splitting the longer messages into shorter ones.
> > E.g., if you can hold just *10* bytes of the message before > deciding to pass it along, then the "transmission time" > component drops to 1.7ms. Your RTT is then 0.1 sec! > > Alternatively, you can spend a few ms processing in each node > and still beat your 1 sec goal -- *or*, drop the data rate by > a factor of 5 or 6 and still hit the 1 sec goal! > > [remember, this is back-of-napkin calculation so I don't claim > it accurately reflects *your* operating environment. Rather, it > puts some options in perspective...] > >
[snip]
> > >>> We actually already do have an RS485 multi-drop version of this > >>> protocol but it's non-deterministic and doesn't work very well. I > >>> don't really want to go into that... > > >> It's relatively easy to get deterministic behavior from a 485 > >> deployment. And, depending on the *actual* operating conditions > >> of the current ring implementation, could probably achieve > >> lower latencies at lower baudrates. > > > Some of the devices that connect to the multi-drop network are old and > > low-powered and a token ring was too much overhead at the time. Also > > we require redundancy which needs 4 wires for multi-drop but only 2 > > wires for the ring. > > How do you mean "4 wires" vs. "2 wires"? You can run a 485 network > with a single differential pair. A 232-ish approach requires a Tx > and Rx conductor (for each ring). So, you could implement two > 485 busses for the same conductor count as your dual UART rings.
With the ring, the system still operates when there is a break or short somewhere. With multidrop and 2 wires, a short takes down the whole bus so we have 4 wires between devices with each pair supposed to be routed on a different path. The signal is duplicated on both pairs of wire, except for when we're checking the integrity of each pair individually.
Reply by Shane williams March 31, 2011
On Mar 31, 9:04 pm, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane, > > On 3/30/2011 3:39 PM, Shane williams wrote: > > >>>> *And exactly where the frame is removed from the ring is a design > >>>> question. Often the sender removes frames it had sent, when they make > >>>> their way back around, in which case the critical item for removal is > >>>> the source address (and usually in that case the destination node sets > >>>> a "copied" bit in the trailer, thus verifying physical transmission of > >>>> the frame to the destination). > > >>> In our case, there are two logical rings and each message placed on > >>> the ring is sent in both directions. When the two messages meet up > >>> approximately half-way round they annihilate each other - but if the > >>> annihilation fails, the sender removes them as well. > > >> How does the sender *recognize* them as "his to annihilate"? > >> I.e., if the data can be corrupted, so can the "sender ID"! > >> The problem with a ring is that it has no "end" so things have > >> the potential to go 'round and 'round and 'round and... > > > The message would have to get corrupted undetected every time around > > the ring to go round forever. > > No. Once a frame is corrupted, it is no longer recognizable as > its original *intent*. (see below)
I'm not following. If a message is corrupted but still looks like a valid message then the corrupted message will still get removed from the ring after it's been around once because each node will find it in its list of "recent" messages.
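As an illustration of the "recent messages" list described in this thread (source address, one-byte sequence number, and received time for the last X messages), here is a minimal sketch. The class name, the table size, and the time window are assumptions made for the example; the thread does not give the real values or data structure.

```python
import time
from collections import OrderedDict


class RecentMessages:
    """Remembers (address, sequence) pairs seen within a time window."""

    def __init__(self, max_entries=16, window_s=2.0, clock=time.monotonic):
        self.max_entries = max_entries   # the "last X messages" (X is assumed)
        self.window_s = window_s         # the "certain time" (value is assumed)
        self.clock = clock
        self._seen = OrderedDict()       # (address, seq) -> time received

    def should_remove(self, address, seq):
        """Return True if this message was seen recently and should leave the ring."""
        now = self.clock()
        key = (address, seq)
        seen_at = self._seen.get(key)
        if seen_at is not None and now - seen_at <= self.window_s:
            return True
        # First sighting (or the old entry expired): record it and forward.
        self._seen[key] = now
        self._seen.move_to_end(key)
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)   # drop the oldest entry
        return False
```

A node would call `should_remove(address, seq)` for each received message and forward it only when the result is False. Note that a message whose address or sequence byte was corrupted without detection produces a key the table has never seen, so every node treats it as new on its first pass; the table only catches it on a later sighting of the same corrupted key, which is the case the objection quoted above is concerned with.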
> > > Each device that puts a message on the ring puts his own address at > > the start plus a one byte incrementing sequence number. Each node > > Right. So what happens if the address gets corrupted? Or the > sequence number? Once it is corrupted, each successive node will > pass along the corrupted version AS IF it was a regular message > (best case, you can detect it as "corrupted" and remove it from > the ring -- but you won't know how to decide *which* message was > then "deleted")
Each message placed on the ring is duplicated so that one copy goes CW and one goes CCW until they meet up. Both copies would have to be lost/ damaged for the message not to get right around. Critical data is actually refreshed in this system.
> > > keeps a list of address/ sequence #/ received time of the last X > > messages received. If it's seen the address/ seq# before within a > > certain time, it removes the message from the ring. > > But you don't know how any of these things will be "corrupted". > You can only opt to remove a message that you are "suspicious of". > And that implies that your error detection scheme is robust enough > that *all* errors (signs of corruption) are detectable. > > If you are setting out with the expectation that you *will* be > operating with a real (non zero) error rate, what can you do > to assure yourself that you are catching *all* errors? > > >> If you unilaterally drop any message found to be corrupted, then > >> you have no way of knowing if it was received by its intended > >> recipient (since you don't know who its recipient is -- or was). > >> If you await acknowledgment (and retry until you receive it), > >> then you run the risk of a message being processed more than > >> once. etc. > > > Every message a node transmits has an incrementing "ack" byte that the > > next node sends back in its next message. If the ack byte doesn't > > come back correctly the message is sent again. > > Again, how do you know that the message isn't corrupted to distort > the ACK and "whatever else"? I.e., so that the message is no longer > recognizable as its original form -- yet looks like a valid message > (or *not*).
Do you mean that the message gets acked but it was actually damaged? The other copy of the message would have to get damaged too.
> > How do you know that this is not a case of the message arriving > correctly but the ACK being corrupted? > > > If the ack is lost and > > a retry is sent, the receiver throws the message away because he's > > already seen it. > > All of these things are predicated on the assumption that > errors are rare. So, the chance of a message (forward or ACK) > being corrupted *and* a followup/reply being corrupted is > "highly unlikely". > > If you're looking at error rates high enough that you are > trying to detect errors as significant as "19 bytes out of 60", > then how much confidence can you have in *any* of the messages?
I'm not sure where I said 19 out of 60 - it was something to do with the Reed Solomon thing I think. I think I was trying to find out how many bytes of overhead there would be for significantly better error detection, however the CPU overhead is too great. A fellow engineer had speculated that we could live with 10% errors at the faster baud rate. My recommendation is now going to be that we have to see no increase in errors (i.e. more or less no errors) to stay at the faster rate - if we do this at all. It's been suggested that the nature of the errors due to running too fast on un-twisted non-shielded cable will make it immediately obvious that we're going too fast. We'll also add some 55 and AA bytes to the "idle" packets exchanged between nodes to help detect errors. I guess if real noise occurs and we're running at the faster rate, we won't know whether to drop back to the slower rate or not.
Reply by D Yuniskis March 31, 2011
Hi Shane,

On 3/30/2011 3:39 PM, Shane williams wrote:

>>>> *And exactly where the frame is removed from the ring is a design >>>> question. Often the sender removes frames it had sent, when they make >>>> their way back around, in which case the critical item for removal is >>>> the source address (and usually in that case the destination node sets >>>> a "copied" bit in the trailer, thus verifying physical transmission of >>>> the frame to the destination). >> >>> In our case, there are two logical rings and each message placed on >>> the ring is sent in both directions. When the two messages meet up >>> approximately half-way round they annihilate each other - but if the >>> annihilation fails, the sender removes them as well. >> >> How does the sender *recognize* them as "his to annihilate"? >> I.e., if the data can be corrupted, so can the "sender ID"! >> The problem with a ring is that it has no "end" so things have >> the potential to go 'round and 'round and 'round and... > > The message would have to get corrupted undetected every time around > the ring to go round forever.
No. Once a frame is corrupted, it is no longer recognizable as its original *intent*. (see below)
> Each device that puts a message on the ring puts his own address at > the start plus a one byte incrementing sequence number. Each node
Right. So what happens if the address gets corrupted? Or the sequence number? Once it is corrupted, each successive node will pass along the corrupted version AS IF it was a regular message (best case, you can detect it as "corrupted" and remove it from the ring -- but you won't know how to decide *which* message was then "deleted")
> keeps a list of address/ sequence #/ received time of the last X > messages received. If it's seen the address/ seq# before within a > certain time, it removes the message from the ring.
But you don't know how any of these things will be "corrupted". You can only opt to remove a message that you are "suspicious of". And that implies that your error detection scheme is robust enough that *all* errors (signs of corruption) are detectable. If you are setting out with the expectation that you *will* be operating with a real (non zero) error rate, what can you do to assure yourself that you are catching *all* errors?
>> If you unilaterally drop any message found to be corrupted, then >> you have no way of knowing if it was received by its intended >> recipient (since you don't know who its recipient is -- or was). >> If you await acknowledgment (and retry until you receive it), >> then you run the risk of a message being processed more than >> once. etc. > > Every message a node transmits has an incrementing "ack" byte that the > next node sends back in its next message. If the ack byte doesn't > come back correctly the message is sent again.
Again, how do you know that the message isn't corrupted to distort the ACK and "whatever else"? I.e., so that the message is no longer recognizable as its original form -- yet looks like a valid message (or *not*). How do you know that this is not a case of the message arriving correctly but the ACK being corrupted?
> If the ack is lost and > a retry is sent, the receiver throws the message away because he's > already seen it.
All of these things are predicated on the assumption that errors are rare. So, the chance of a message (forward or ACK) being corrupted *and* a followup/reply being corrupted is "highly unlikely". If you're looking at error rates high enough that you are trying to detect errors as significant as "19 bytes out of 60", then how much confidence can you have in *any* of the messages?
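For reference, the ack-byte exchange described in the quoted text (an incrementing ack byte echoed back by the next node, a resend when the echo does not match, and the receiver discarding a retry it has already seen) can be sketched as below. The class and method names are hypothetical, and the way the receiver recognises a retry (comparing against the last ack byte it accepted) is an assumption; the thread does not say how the real implementation does it.

```python
class LinkSender:
    """Sends one frame at a time and retries until its ack byte is echoed."""

    def __init__(self):
        self.ack = 0
        self.pending = None            # (ack, payload) awaiting an echo

    def send(self, payload):
        self.ack = (self.ack + 1) & 0xFF   # one-byte incrementing ack
        self.pending = (self.ack, payload)
        return self.pending

    def on_echo(self, echoed_ack):
        """Return None if acknowledged, else the frame to retransmit."""
        if self.pending is not None and echoed_ack == self.pending[0]:
            self.pending = None
            return None
        return self.pending


class LinkReceiver:
    """Delivers each frame once and echoes its ack byte."""

    def __init__(self):
        self.last_ack = None
        self.delivered = []

    def on_frame(self, ack, payload):
        if ack != self.last_ack:       # new frame: accept it
            self.last_ack = ack
            self.delivered.append(payload)
        return ack                     # retries are echoed but not re-delivered
```

This only covers the lost-ack case. It does nothing about a frame that is corrupted into another valid-looking frame, so it relies on the same "errors are rare" assumption discussed above.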
Reply by D Yuniskis March 31, 2011
Hi Shane,

On 3/30/2011 3:56 PM, Shane williams wrote:

[8<]

>>> I'm trying to improve the propagation delay of messages around the >> >> What sort of times are you seeing, presently? At which baudrates? >> How much *better* do they need to be (or, would you *like* them >> to be)? > > We've limited the ring to 32 nodes up till now at 57600 baud. The > nominal target now is 64 nodes for which I've calculated a worst case > request response time of approx 1.8 seconds with no retries. I would > like it to be about one second.
OK, back-of-napkin guesstimates...

Assume 10b character frames transmitted "flat out" (no time between end of stop bit and beginning of next start bit). So, 5760 characters per second is data rate.

If we assume N is size of packet (in "characters"), then RTT is 64 * [(N / 5760) + P] where P is time spent processing message packet on each node before passing it along.

1.8s/64 = [(N / 5760) + P] = ~30ms

Guessing at a message size of ~100 bytes (characters) suggests the "transmission time" component of this is ~17ms -- leaving 13ms as a guess at the processing time, P.

If you cut this to ~0, then you achieve your 1 sec goal (almost exactly).

This suggests that eliminating/simplifying any error detection so that incoming messages can *easily* be propagated is a goal to pursue. If you can improve the reliability of the comm link so that errors are the *exception* (i.e., unexpected), then you can simplify the effort required to "handle" those errors.

Furthermore, if errors *are* The Exception, then you can consider running the interface(s) in full duplex mode and starting to pass a packet along to your successor *before* it is completely received. This effectively reduces the size of the message (N) in the above calculation.

E.g., if you can hold just *10* bytes of the message before deciding to pass it along, then the "transmission time" component drops to 1.7ms. Your RTT is then 0.1 sec!

Alternatively, you can spend a few ms processing in each node and still beat your 1 sec goal -- *or*, drop the data rate by a factor of 5 or 6 and still hit the 1 sec goal!

[remember, this is back-of-napkin calculation so I don't claim it accurately reflects *your* operating environment. Rather, it puts some options in perspective...]
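The model above, RTT = nodes * [(N / chars-per-second) + P], is easy to play with in a few lines. This is only a sketch of that back-of-napkin formula; the function name and parameter names are made up for the example, and the numbers are the guesses from the post, not measurements.

```python
def ring_rtt(nodes, chars_held, baud, per_node_processing_s=0.0, bits_per_char=10):
    """Round-trip time around the ring under the simple model from the post.

    Each node holds `chars_held` characters before passing the message on,
    then spends `per_node_processing_s` seconds handling it.
    """
    chars_per_second = baud / bits_per_char
    return nodes * (chars_held / chars_per_second + per_node_processing_s)


if __name__ == "__main__":
    # Store-and-forward of a ~100-character message, 13 ms processing per node.
    print(f"{ring_rtt(64, 100, 57600, 0.013):.2f} s")
    # Same message with the processing time cut to ~0.
    print(f"{ring_rtt(64, 100, 57600):.2f} s")
    # Forwarding after only 10 characters have arrived.
    print(f"{ring_rtt(64, 10, 57600):.2f} s")
```

Varying `baud` and `chars_held` together shows the trade-off mentioned at the end of the post: holding fewer characters per node buys enough margin to run the link several times slower and still stay under one second.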
> 64 nodes max is a bit arbitrary so > what I'm really trying to do is get the best performance that's > reasonably achievable. Some sites have well over 64 devices but not > all on the same ring. > >>> ring without requiring the customer to fit twisted pair cable >>> everywhere and I'm also trying to improve the error monitoring so we >>> can signal when a connection isn't performing well enough without >>> creating nuisance faults, hence my interest in the error detection >>> capability of crc16-ccitt. >> >>> We actually already do have an RS485 multi-drop version of this >>> protocol but it's non-deterministic and doesn't work very well. I >>> don't really want to go into that... >> >> It's relatively easy to get deterministic behavior from a 485 >> deployment. And, depending on the *actual* operating conditions >> of the current ring implementation, could probably achieve >> lower latencies at lower baudrates. > > Some of the devices that connect to the multi-drop network are old and > low-powered and a token ring was too much overhead at the time. Also > we require redundancy which needs 4 wires for multi-drop but only 2 > wires for the ring.
How do you mean "4 wires" vs. "2 wires"? You can run a 485 network with a single differential pair. A 232-ish approach requires a Tx and Rx conductor (for each ring). So, you could implement two 485 busses for the same conductor count as your dual UART rings.