
forward error correction on ADSP21020

Started by alb March 2, 2012
On Mon, 05 Mar 2012 11:44:43 +0100, alb wrote:

> On 3/2/2012 8:38 PM, Tim Wescott wrote:
>> On Fri, 02 Mar 2012 14:03:07 +0100, alb wrote:
>>> On 3/2/2012 12:52 PM, Stef wrote:
>>>> In comp.arch.embedded, alb <alessandro.basili@cern.ch> wrote:
>>>>> Hi everyone,
>>>>>
>>>>> in the system I am using there is an ADSP21020 connected to an FPGA
>>>>> which is receiving data from a serial port. The FPGA receives the
>>>>> serial bytes and sets an interrupt and a bit in a status register
>>>>> once the byte is ready in the output register (one 'start bit' and
>>>>> one 'stop bit'). The DSP can look at the registers simply by reading
>>>>> from a mapped port and we can choose either polling the status
>>>>> register or using the interrupt.
>>>>>
>>>>> Unfortunately this is just on paper. The real world is much
>>>>> different, since the FPGA receiver is apparently 'losing' bits. When
>>>>> we send a "packet" (a sequence of bytes), what we can observe with
>>>>> the scope is that sometimes the interrupts are not equally spaced in
>>>>> time and there is one byte less w.r.t. what we send. So we suspect
>>>>> that the receiver has started on the wrong 'start bit', hence
>>>>> screwing up everything.
>>>>>
>>>>> The incidence of this error looks dependent on the length of the
>>>>> packet we send, leading us to think that due to some synchronization
>>>>> problem the uart loses sync (maybe timing issues on the fpga).
>>>>>
>>>>> Given the fact that we cannot change the fpga, I came up with the
>>>>> idea of using some forward error correction (FEC) encoding to
>>>>> overcome this issue, but if my diagnosis is correct it looks like
>>>>> the broken sequence of bytes is not only missing some bytes, it will
>>>>> certainly have the bits shifted (starting on the wrong 'start bit')
>>>>> with some bits inserted ('start bit' and 'stop bit' will be part of
>>>>> the data), and I'm not sure if there exists some technique which may
>>>>> recover such a broken sequence.
>>>>>
>>>>> On top of it I don't have any feeling for how much any type of FEC
>>>>> decoding on the DSP would cost (in terms of memory and cpu
>>>>> resources).
>>>>>
>>>>> Any suggestions and/or ideas?
>>>>
>>>> Is this a continuous stream of bits, with no pauses between bytes?
>>>> Looks like the start bit detection does not re-adjust its timing to
>>>> the actual edge of the next start bit. With small differences in
>>>> bitrate, this causes the receiver to fall out of sync as you found.
>>>
>>> Within a "packet" there should be no pause between bytes, I will
>>> check though. There might be a small difference in bitrate, maybe I
>>> would need to verify how much.
>>>
>>>> Obviously, the best solution is to fix the FPGA as it is 'broken'. Is
>>>> there no way to fix it or get it fixed?
>>>
>>> The FPGA is flying in space, together with the rest of the equipment.
>>> We cannot reprogram it, we can only replace the software in the DSP,
>>> with non-trivial effort.
>>>
>>>> Can you change the sender of the data? If so, you can set it to 2
>>>> stop bits. This can allow the receiver to re-sync every byte. If
>>>> possible, I do try to set my transmitters to 2 stop bits and
>>>> receivers to 1. This can prevent trouble like this but costs a little
>>>> bandwidth.
>>>
>>> We are currently investigating it; the transmitter is controlled by an
>>> 8051 and in principle we should have control over it. Your idea is to
>>> use the second stop bit to allow better synching and hopefully not
>>> lose the following start bit, correct?
>>>
>>>> Another option would be to tweak the bitrates. It seems your sender
>>>> is now a tiny bit on the fast side w.r.t. the receiver. Maybe you can
>>>> slow down the clock on your sender by 1 or 2 percent? Try to get an
>>>> accurate measurement of the bitrate on both sides before you do
>>>> anything.
>>>
>>> We can certainly measure the transmission rate. I am not sure we can
>>> tweak the bitrates to that level. The current software on the 8051
>>> supports several bitrates (19.2, 9.6, 4.8, 2.4 Kbaud) but I'm afraid
>>> those options are somehow hardcoded in the transmitter. Certainly it
>>> would be worth having a look.
>>
>> Go over the FPGA code with a fine-toothed comb -- whatever you're
>> doing, it won't help if the FPGA doesn't support it.
>
> Ok, a colleague of mine went through it and indeed the start-bit logic
> is faulty, since it is looking for a negative transition but without
> the signal being synchronized with the internal clock (don't ask me how
> that is possible!).
>
> Given this type of error the 0xFF byte will be lost completely, since
> there is no other start bit to sync on within the byte, while in other
> cases it may resync with a '0' bit within the byte.
This may be your answer -- instead of two stop bits, use a protocol that
sends eight data bits but with the most significant bit always 0. This
will make your life difficult when you go to unwind real 8-bit data, but
it can be done.

While you're at it, if the connection is two-way you might want to
implement a BEC scheme, but if the failures are data-pattern dependent
any correction scheme that doesn't randomize the data is going to cause
you problems.

--
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com
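A minimal sketch (not from the thread, names purely illustrative) of one
way to do the packing Tim suggests: every byte put on the wire carries
seven payload bits, so its most significant bit is always 0, and n data
bytes become ceil(8n/7) wire bytes. Unpacking on the receive side is just
the mirror operation.

#include <stdint.h>
#include <stddef.h>

/*
 * Pack 8-bit data into bytes whose MSB is always 0.  Each output byte
 * carries 7 payload bits, so n input bytes become ceil(8*n/7) output
 * bytes.  Returns the number of bytes written to 'out'.
 */
size_t pack7(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t outlen = 0;
    uint32_t acc = 0;   /* bit accumulator, most recent byte in the low bits */
    int bits = 0;       /* number of unconsumed bits in acc */

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 8) | in[i];
        bits += 8;
        while (bits >= 7) {
            bits -= 7;
            out[outlen++] = (uint8_t)((acc >> bits) & 0x7F);
        }
    }
    if (bits > 0)       /* flush the leftover bits, zero-padded on the right */
        out[outlen++] = (uint8_t)((acc << (7 - bits)) & 0x7F);
    return outlen;
}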
In article <9rjl80FvkaU1@mid.individual.net>,
alb  <alessandro.basili@cern.ch> wrote:
}On 3/3/2012 4:32 AM, Charles Bryant wrote:
.. data loss due to failure to see some start bits ...
}> One possible solution might be to do the UART receive function
}> in software (this depends very much on how the hardware works). By
}> setting the baud rate on the FPGA over 10x the true speed, it sees
}> every bit as either a 0xff or a 0x00 character. If you can react to
}> the interrupt fast enough and read a suitable clock, you can then
}> decode the bits in software. Of course if the FPGA is failing to
}> deliver characters, this is no better.
}
}the 0xFF has a good chance to go completely lost. The method you suggest
}may reduce the problem of recognizing bytes to the problem of delivering
}the bytes. Then extra encoding should be added to recover the loss of bytes.

If you can set the receiver clock fast enough you won't lose any
bytes. For example, if you set it to 16x, and suppose the true
bit-stream is 0000101101 (ASCII 'h'). Then the apparent bit-stream is

0000000000000000000000000000000000000000000000000000000000000000111111
1111111111000000000000000011111111111111111111111111111111000000000000
00001111111111111111 

(wrapped for convenience). Assuming this starts after lots of '1'
bits, and receiving a '0' stop bit is merely reported as a framing
error and doesn't affect synchronisation, this gets interpreted as:

s........Ss........Ss........Ss........Ss........Ss........Ss........S
0000000000000000000000000000000000000000000000000000000000000000111111

__________00________00________00________00________00________00________f8


s........Ss........Ss........Ss........Ss........Ss........Ss........S
1111111111000000000000000011111111111111111111111111111111000000000000

____________________00________e0____________________________3f________00

s........Ss........S
00001111111111111111 

__________f8


When you get an interrupt reporting a character, you note the time
since the last such interrupt (I believe your CPU has a built-in timer
which can count at high speed and which might be useful for this).
Then you work out approximately what bits must have been received
based on both the character and the time. Since each real bit is seen
as sixteen bits, even if one is missed, this only introduces a small
error, so although you don't get an exact match to any valid pattern,
you're much closer to one than any other.

Specifically, if an interrupt is T bit-times since the last one, then
there must have been T-10 one bits, a zero bit, the bits in the character
received, and a stop bit (0 if framing error was reported, 1 otherwise).

When a 0 bit is missed, then there were T-1 ones, two zeros (the
missed zero must be the first of these), the bits in the character,
and the stop bit. But since sixteen of these bits make one real bit,
the difference between T and T-1 is never big enough to flip a real bit.

Having said all that, you might not be able to change the receive
clock without also changing the transmit clock, in which case it won't
help. Similarly, if you can't time the interrupts to sufficient
accuracy, it won't work.
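For what it's worth, a rough sketch in C of that reconstruction step,
assuming the 16x oversampled UART and some way of timing the interrupts;
decode_burst and emit_fast_bit are made-up names. One concrete way to
finish the job is for the caller to group the emitted fast bits sixteen
at a time and majority-vote each group into one real bit, so a one-count
error in the measured time cannot flip a real bit.

#include <stdint.h>

/*
 * Reconstruct the fast-rate (16x) bit sequence implied by one receive
 * interrupt.  't_bits' is the time since the previous interrupt expressed
 * in fast bit periods, 'ch' is the character read from the FPGA and
 * 'framing_err' is the reported stop-bit status.  'emit_fast_bit' is a
 * hypothetical callback that accumulates the fast bits for the caller to
 * regroup into real bits.
 */
void decode_burst(uint32_t t_bits, uint8_t ch, int framing_err,
                  void (*emit_fast_bit)(int bit))
{
    uint32_t i;

    /* T-10 idle '1' bits preceded this character (see above). */
    for (i = 10; i < t_bits; i++)
        emit_fast_bit(1);

    emit_fast_bit(0);                    /* start bit */
    for (i = 0; i < 8; i++)
        emit_fast_bit((ch >> i) & 1);    /* data bits, LSB first */
    emit_fast_bit(framing_err ? 0 : 1);  /* stop bit (0 if framing error) */
}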

}If you plot the number of failed packets [1] with the position in the
}packet which had the problem, you will see an almost linear increasing
}curve, hence the probability to have problems is higher if the packet is
}longer.
}
}At the moment we don't have any re-transmitting mechanism and the rate
}of loss is ~0.5% on a 100 bytes packet. We want to exploit the 4K buffer
}on the transmitter side in order not to add too much overhead, but it
}looks like the rate of loss will be higher with bigger packets.
}
}[1] we send a packet and echo it back and compare the values.

That suggests that the receiver only resynchronises in a gap. The
solution suggested elsewhere of two stop bits sounds very promising
(some UARTs can do 1.5 stop bits and that might be enough). Otherwise
a simple re-transmission scheme tailored to the fault might be good.
Here is an example:

Each packet starts and ends with FF. Unlike typical framing schemes,
the start and end cannot be shared. This guarantees that an error is
confined to one packet.

Other than the start and end bytes, all bytes in the packet are
escaped. (e.g. byte FF becomes CB 34, CB becomes CB 00).
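The escaping rule shown (FF becomes CB 34, CB becomes CB 00) is
consistent with CB acting as an escape prefix followed by the original
byte XORed with CB, since FF^CB = 34 and CB^CB = 00. Taking that reading
as an assumption, a sketch:

#include <stdint.h>
#include <stddef.h>

#define ESC 0xCB   /* assumed escape prefix, inferred from the FF -> CB 34 example */

/* Escape the body so that neither FF (the frame byte) nor CB (the escape
 * byte itself) ever appears inside a packet.  Returns the escaped length. */
size_t escape(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0xFF || in[i] == ESC) {
            out[o++] = ESC;
            out[o++] = (uint8_t)(in[i] ^ ESC);   /* FF -> CB 34, CB -> CB 00 */
        } else {
            out[o++] = in[i];
        }
    }
    return o;
}

/* Reverse the escaping on the receive side. */
size_t unescape(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == ESC && i + 1 < n)
            out[o++] = (uint8_t)(in[++i] ^ ESC);
        else
            out[o++] = in[i];
    }
    return o;
}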

The last two bytes in the packet are a CRC.

The byte before the CRC is an acknowledgement number.

If a packet is at least four bytes, the first byte is the packet
sequence number.

This gives a packet like this:

	FF SS DD DD DD...DD AA CC CC FF
	   ^^^^^^^^^^^^^^^^^^^^^^^^^ these are escaped as necessary

When the receiver gets a packet, if the CRC is bad or the sequence
number is not the next expected, it ignores the packet. Otherwise it
accepts the data or ack.
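A sketch of that receive-side check. The CRC polynomial and byte order
are not specified above, so CRC-16/CCITT and a big-endian CC CC field are
used purely as assumptions, and the buffer is taken to be one de-framed,
un-escaped body (SS DD...DD AA CC CC).

#include <stdint.h>
#include <stddef.h>

/* CRC-16/CCITT (poly 0x1021, init 0xFFFF) -- an assumed choice. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Return 1 if the packet body should be accepted, 0 if it must be
 * ignored.  Ack-only bodies (just AA CC CC) carry no sequence number and
 * would be handled separately. */
int accept_packet(const uint8_t *buf, size_t len, uint8_t expected_seq)
{
    if (len < 4)
        return 0;
    uint16_t rx = (uint16_t)((buf[len - 2] << 8) | buf[len - 1]);
    if (crc16(buf, len - 2) != rx)     /* bad CRC: ignore */
        return 0;
    if (buf[0] != expected_seq)        /* not the next expected sequence: ignore */
        return 0;
    return 1;
}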

The transmitter sends continuously. If it has no data to send, it
sends just ACKs (i.e. packets with just AA CC CC). Otherwise it sends
a packet with this loop:
	1) send the packet (SS DD...DD AA CC CC)
	2) send ack (AA CC CC) until we have received at least X bytes
	and a valid packet
	3) if our sent packet has been acknowledged, this one is done
	4) else goto 1

Step 2 avoids the need for a timer. The value X depends on the round-trip
delay. Since the AA field is at the *end* of a packet, we know when we
receive a packet that it reflects the remote receiver's last packet
as of a time that is a fixed interval in the past. E.g. if the round-trip
time is five bytes (to allow for buffering in the UART etc.), then when
we send the FF framing byte we know that any AA we receive in the next
five bytes could not possibly acknowledge the packet we just sent. If
the remote happened to be just about to send the AA field, we might get
AA CC CC FF, so any packet which ends more than 9 bytes after we sent
the FF of our packet should acknowledge our packet, so we would use a
value of about 10 to make re-transmissions as prompt as possible.
(This depends on the protocol being implemented at a
character-by-character level at each end. If you had a higher-level view
whereby hardware was given a complete packet to send at once, the ACK
no longer benefits from being at the end and the timing is more
complex).
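A sketch of that transmit loop; every helper here (send_data_packet,
send_ack, rx_byte_count, got_valid_packet, peer_acked) is a hypothetical
placeholder for whatever the real driver provides, and X = 10 follows the
round-trip estimate above.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical driver hooks -- placeholders, not an existing API. */
extern void     send_data_packet(uint8_t seq, const uint8_t *d, size_t n);
extern void     send_ack(void);
extern unsigned rx_byte_count(void);     /* total bytes received so far */
extern int      got_valid_packet(void);  /* a CRC-good packet has arrived */
extern int      peer_acked(uint8_t seq); /* the remote AA field covers 'seq' */

/* Transmit loop corresponding to steps 1)-4) above. */
void send_reliably(const uint8_t *data, size_t n, uint8_t seq)
{
    const unsigned X = 10;                /* round-trip allowance, in bytes */

    for (;;) {
        send_data_packet(seq, data, n);   /* 1) FF SS DD..DD AA CC CC FF */
        unsigned start = rx_byte_count();
        do {
            send_ack();                   /* 2) keep acking instead of waiting */
        } while (rx_byte_count() - start < X || !got_valid_packet());
        if (peer_acked(seq))              /* 3) acknowledged: done */
            return;
        /* 4) otherwise loop and retransmit */
    }
}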

Obviously the SS numbers in one direction have AA numbers going in the
opposite direction.

The overhead of this scheme is 129/128p + 4+X for a packet size of p.
It could be made lower by allowing more than one packet to be in
flight at once, though that makes it more complex and costs more when
a re-transmission is needed, unless you add even more complexity and
have selective acknowledgements.

If you have a suitable timer available, the sending of ACKs in step 2
can be omitted (e.g. possibly saving power by not running the UART
continuously).
On 3/5/2012 4:07 PM, Tim Wescott wrote:
[...]
>> The whole design and production process, which consists of several test
>> campaigns on different quality models (Engineering, Qualification and
>> Flight), should have ensured this level of functionality. The reason why
>> it failed is most probably due to a poor level of quality control of the
>> process. Just as an example we are missing test reports of the system,
>> as well as Checksums for the FPGA firmware.
>
> Testing is only the most visible and least effective of all means to
> insure quality. It is a net with whale-sized holes, with which you go
> out and attempt to catch all the minnows in the sea by making repeated
> passes.
Nice imagery, even though I don't see what other tool you would use to ensure functionality other than testing against the specs. Certainly I can understand that a bigger design effort may help cut down the time once you do system integration, but it certainly does not remove the need to do your testing.
> And all too often, quality programs end up being blown off by management
> and/or design personnel as being an unnecessary expense, or a personal
> affront. Or the QA department gets staffed with martinets or rubber-
> stamp bozos or whatever. Because -- as you are seeing -- sometimes
> quality problems don't show up until long after everyone has gotten
> rewarded for doing a good job.
Saying a program is not needed because it is too often compromised by other factors does not prove the program is not needed. On the contrary, if management does not get in the way, a quality program can certainly reduce the level of uncertainty. A quality program does not necessarily mean it has to be enforced by the department of defense: peer reviews and open standards may help a lot here (I understand that they might not always be applicable), and specifically in our case this is how we usually go.

Unfortunately that was not the case with the system I'm dealing with currently, and indeed it was due to an oversight by management, which was too focused on higher-priority tasks that at the time sucked in all the resources.
> Humans just aren't made to build high-reliability systems, so an
> organization really needs to swim upstream to make it happen.
IMO humans have reached a level of reliability which is far beyond imagination, through standards, processes and certainly money.
>> As a side note: IMO the capability to reprogram an FPGA onboard is built
>> when your needs are changing with time, not to fix some stupid UART
>> receiver.
>
> Well, time has marched on, and your needs have certainly changed.
Again I tend to disagree, my needs are exactly the same as 10 years ago, when the specs were laid down and we wanted to have a UART (there was no option saying "better be working").
On Tue, 06 Mar 2012 11:20:29 +0100, alb wrote:

> On 3/5/2012 4:07 PM, Tim Wescott wrote: [...]
>> Testing is only the most visible and least effective of all means to
>> insure quality. It is a net with whale-sized holes, with which you go
>> out and attempt to catch all the minnows in the sea by making repeated
>> passes.
>
> Nice imagery, even though I don't see what other tool you would use to
> ensure functionality other than testing against the specs. [...]
>
> Saying a program is not needed because it is too often compromised by
> other factors does not prove the program is not needed. On the
> contrary, if management does not get in the way, a quality program can
> certainly reduce the level of uncertainty. A quality program does not
> necessarily mean it has to be enforced by the department of defense:
> peer reviews and open standards may help a lot here (I understand that
> they might not always be applicable), and specifically in our case this
> is how we usually go.
>
> [...]
You are mistaking "testing program" for "quality program". A testing program is a _part_ of a quality program, but a quality strategy that states only "test the hell out of it once it's done" is little better than "launch it and find out".

And I wasn't saying that a testing program isn't an essential part of a quality program -- I was saying that the statement "but we tested it" is, in my book, tantamount to "we painted it and polished it". If what you painted and polished is just dried up dog turds, then no matter how shiny the paint is it's still dog turds underneath.

A good quality program is one that comes in many steps, with each step having the goal that the _next_ step isn't going to find any problems. Design reviews at every step, conducted by people who are competent and engaged, prevent far more bugs than testing finds. Unit testing, while being something called "testing", is often not included in a "testing program" that just does black-box testing on the completed system. A comprehensive _quality_ program has all of these and more, and does not equate to just testing.
>> Humans just aren't made to build high-reliability systems, so an
>> organization really needs to swim upstream to make it happen.
>
> IMO humans have reached a level of reliability which is far beyond
> imagination, through standards, processes and certainly money.
Yes, and we still manage to screw up, sometimes. Like in your case.

Clocking a UART on the wrong part of the incoming serial stream is something that the designer shouldn't have done at all. Then it should have been caught in a design review before an FPGA was ever programmed. The fact that it wasn't means that many people were just not on the ball in that case: the designer got it wrong; the team who were supposed to review his work didn't, or didn't do it thoroughly enough; and the _real_ quality program that's supposed to make sure that the design reviews happen correctly didn't. Then the testing of the UART functionality _by itself_ in the FPGA either wasn't thorough enough or wasn't done at all, etc., etc.
>>> As a side note: IMO the capability to reprogram an FPGA onboard is
>>> built when your needs are changing with time, not to fix some stupid
>>> UART receiver.
>>
>> Well, time has marched on, and your needs have certainly changed.
>
> Again I tend to disagree, my needs are exactly the same as 10 years ago,
> when the specs were laid down and we wanted to have a UART (there was no
> option saying "better be working").
Yet here you are, with a crying need for an FPGA mod.

"Perceived need", then, perhaps.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
On 3/6/2012 2:51 AM, Charles Bryant wrote:
[...]
> If you can set the receiver clock fast enough you won't lose any
> bytes. For example, if you set it to 16x, and suppose the true
> bit-stream is 0000101101 (ASCII 'h'). [...]
Unfortunately we cannot change the receiver clock. But I'll stick with you for the sake of discussion.
> > When you get an interrupt reporting a character, you note the time > since the last such interrupt (I believe your CPU has a built-in timer > which can count at high speed and which might be useful for this). > Then you work out approximately what bits must have been received > based on both the character and the time. Since each real bit is seen > as sixteen bits, even if one is missed, this only introduces a small > error, so although you don't get an exact match to any valid pattern, > you're much closer to one than any other.
This approach is nice if and only if the transmitter does not introduce extra gaps between bytes, which would make taking the time into account a bit more complex. And the problem is still not solved, since the receiver is missing the start bit (the high-to-low transition).
> > Specifically, if an interrupt is T bit-times since the last one, then > there must have been T-10 one bits, a zero bit, the bits in the character > received, and a stop bit (0 if framing error was reported, 1 otherwise). > > When a 0 bit is missed, then there were T-1 ones, two zeros (the > missed zero must be the first of these), the bits in the character, > and the stop bit. But since sixteen of these bits make one real bit, > the difference between T and T-1 is never big enough to flip a real bit. > > Having said all that, you might not be able to change the receive > clock without also changing the transmit clock, in which case it won't > help. Similarly, if you can't time the interrupts to sufficient > accuracy, it won't work.
Timing the distance between interrupts may be tricky since at 19.2 Kbaud a 52 us interval is needed for each bit, hence you would need your timer to run faster than that... Considering that the fastest interrupt service routine introduces ~3.5 us of overhead, it looks like you won't have much more time to spend on the rest of the application.
> That suggests that the receiver only resynchronises in a gap. The
> solution suggested elsewhere of two stop bits sounds very promising
> (some UARTs can do 1.5 stop bits and that might be enough). Otherwise
> a simple re-transmission scheme tailored to the fault might be good.
> [...]
>
> Other than the start and end bytes, all bytes in the packet are
> escaped. (e.g. byte FF becomes CB 34, CB becomes CB 00).
I think I lost you here. Why does FF become CB 34?
> The transmitter sends continuously. If it has no data to send, it
> sends just ACKs (i.e. packets with just AA CC CC). Otherwise it sends
> a packet with this loop:
> 	1) send the packet (SS DD...DD AA CC CC)
> 	2) send ack (AA CC CC) until we have received at least X bytes
> 	and a valid packet
> 	3) if our sent packet has been acknowledged, this one is done
> 	4) else goto 1
> [...]
This is an interesting request/acknowledge scheme, but we want to avoid changing the transmitter side to include it, and since the transmitter side has many other activities to perform we are not sure about the impact of this protocol on the overall scheduling. If we find we cannot avoid it, we will certainly go down that path.
> > Obviously the SS numbers in one direction have AA numbers going in the > opposite direction. > > The overhead of this scheme is 129/128p + 4+X for a packet size of p. > It could be made lower by allowing more than one packet to be in > flight at once, though that makes it more complex and costs more when > a re-transmission is needed, unless you add even more complexity and > have selective acknowledgements. > > If you have a suitable timer available, the sending of ACKs in step 2 > can be omitted (e.g. possibly saving power by not running the UART > continuously).
What we have found instead is an encoding scheme which allows us to recover from the synchronization loss of the receiver. There are three levels: a character level, a packet level and a command level. A character is everything between a start and a stop bit, while a packet is a sequence of characters up to a maximum number of characters (fixed by the receiver software buffer).

The byte encoding for the character looks like the following:

                                             ____
____| st | rs | b0 | b1 | b2 | b3 | sh | cb |
^^^^                                         ^^^^
||||                                         ++++ ---> stop bit
++++ ------------------------------------------------> start bit

where the meaning of each bit is the following:

st = 1 (sticky to '1')
rs = resynchronization bit
bn = body
sh = shift
cb = control bit

The shift bit signals to the receiver that the character has to be right-shifted by 2 before being used (as we will see, this happens when the real start bit is lost and the uart syncs on the rs bit). The control bit selects either a 'control' character or a 'data' character. A data character has 4 valid bits which will be transferred to the command level to build the 'telecommand'. A control character has 16 types available (bn encodes the meaning), out of which I can think of three:

bn = '0000': BOP (begin of packet)
bn = '0001': EOP (end of packet)
bn = '1111': NUL (null character)

In the event of a NUL char the rs will be fixed to '1'; in all other cases it will be '0'. This is done to have NUL = 0xFF, which is needed to resynch on the correct start bit after the first desync has occurred. An FF will be dropped by the receiver if a start bit has been missed already, otherwise it will be dropped by the software.

A packet will look like this:

BOP | DAT | DAT | ... | NUL | DAT | DAT | ... | NUL | ... | EOP

where the number of DAT between NULs is kept short enough to eliminate the possibility of two start-bit misses before the NUL. The BOP and EOP help at the packet level to control the transmission, while only DAT is passed to the command level, where we can include a length and a crc to cross-check the integrity of the data.

We found that with a NUL every 8 DAT we can reliably send ~2000 bytes without any loss and with an error recovery of about 40% (fraction of bytes which had to be shifted before use).

When the receiver fails to sync on the first start bit, it has a good chance to sync on the rs bit, hence the stop bit will be received as the shift bit. This means that the character has to be shifted before use. In the event of a NUL character we artificially introduce a long gap which helps synchronizing on a real start bit. When there is no miss of the start bit and the NUL character is received, the shift bit is artificially set to one, which may add an unnecessary shift operation, but then the control character will still be a NUL and discarded.

This mechanism hides the complexity at the character level and packet level, while the command level remains the same as with a fully functional uart. I don't have any other ideas on additional control characters, but lots of possibilities may arise.

At the packet level, if an EOP is lost the software will wait for the next BOP, and the number of NULs only increases the level of confidence that the packet will arrive. Critical information may be heavily redundant (to the extent of a NUL every second character), while less important commanding may be less redundant.

Of course the overhead is not negligible, but that is a price we knew we had to pay somewhere.
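For illustration, the transmitter side could build such a packet along
the following lines. The bit positions (st taken as the most significant
bit of the byte handed to the UART, cb as the least significant) and the
mapping of the nibble onto b0..b3 are only assumptions about the register
layout, and the names are made up.

#include <stdint.h>
#include <stddef.h>

/* Assumed bit layout: | st | rs | b0 b1 b2 b3 | sh | cb |, st in bit 7. */
#define ST   0x80                        /* sticky '1' */
#define RS   0x40                        /* resynchronization bit */
#define B(n) (uint8_t)(((n) & 0xF) << 2) /* 4-bit body */
#define SH   0x02                        /* shift flag (sent as 0) */
#define CB   0x01                        /* control-character flag */

#define DAT(n) (uint8_t)(ST | B(n))                   /* data nibble, rs = 0 */
#define BOP    (uint8_t)(ST | B(0x0) | CB)            /* begin of packet */
#define EOP    (uint8_t)(ST | B(0x1) | CB)            /* end of packet */
#define NUL    (uint8_t)(ST | RS | B(0xF) | SH | CB)  /* = 0xFF, rs forced to '1' */

/* Encode a payload of 4-bit nibbles into a packet, inserting a NUL after
 * every 8 data characters so the receiver can fall back into sync. */
size_t encode_packet(const uint8_t *nibbles, size_t n, uint8_t *out)
{
    size_t o = 0;

    out[o++] = BOP;
    for (size_t i = 0; i < n; i++) {
        out[o++] = DAT(nibbles[i]);
        if ((i + 1) % 8 == 0)
            out[o++] = NUL;
    }
    out[o++] = EOP;
    return o;
}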
On Wed, 07 Mar 2012 15:16:55 +0100, alb wrote:

> On 3/6/2012 2:51 AM, Charles Bryant wrote: [...]
>
> [...]
>
> What we have found instead is an encoding scheme which allows us to
> recover from the synchronization loss of the receiver. There are three
> levels: a character level, a packet level and a command level.
> [...]
>
> Of course the overhead is not negligible, but that is a price we knew
> we had to pay somewhere.
I know it sounds like an oxymoron, but that's a really elegant kludge. That you had to do it at all makes it a kludge -- but it looks like you did a good job with it within the confines of what you had to work with.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
In article <9rp8unF1i9U1@mid.individual.net>,
alb  <alessandro.basili@cern.ch> wrote:
}On 3/6/2012 2:51 AM, Charles Bryant wrote:
}[...]
.. running rx at much higher clock ...
}This approach is nice if and only if the transmitter does not introduce
}extra gaps in between bytes, which may make your job to take time into
}account a little bit more complex. And the problem is still not solved,
}since the receiver is missing the start bit (transition high-low).

Reading your solution, I think I may have been wrong in my assumption
about how the receiver worked. I assumed that when it missed the start
bit, if another zero bit arrived it would see that (i.e.
level-triggered), rather than needing a 1 and subsequent 0, so indeed
my solution would not work (nor would the suggestion of using two
stop bits).

}Timing the distance between interrupt may be tricky since at 19.2 Kbaud
}a 52us interval is needed for each bit, hence you would need your timer
}to run faster than that...

The ADSP-21020 TCOUNT register runs at the processor clock speed, so
at a typical speed of 20MHz it increments every 50ns. And it can be
read in a single cycle.

} Considering that the fastest interrupt
}service routine introduce ~3.5us of overhead it looks like you won't
}have much more time to spend for the rest of the application.

Unless you're running the processor at a very slow speed you should be
able to make an ISR take a lot less time than that.

}The byte encoding for the character looks like the following:
}                                             ____
}____| st | rs | b0 | b1 | b2 | b3 | sh | cb |
.. rest omitted ...

That looks very good. I'm sure that theoretically it would be possible
to design something with lower overhead, but if you can tolerate that
amount of overhead, it is simple enough to see that it obviously
works, while a more complex solution might have an obscure flaw.
On 5 Mar., 11:44, alb <alessandro.bas...@cern.ch> wrote:
> On 3/2/2012 8:38 PM, Tim Wescott wrote:
> [...]
>> Go over the FPGA code with a fine-toothed comb -- whatever you're
>> doing, it won't help if the FPGA doesn't support it.
>
> Ok, a colleague of mine went through it and indeed the start-bit logic
> is faulty, since it is looking for a negative transition but without
> the signal being synchronized with the internal clock (don't ask me how
> that is possible!).
>
> Given this type of error the 0xFF byte will be lost completely, since
> there is no other start bit to sync on within the byte, while in other
> cases it may resync with a '0' bit within the byte.
I'm trying real hard to understand what it is you are saying.

If the uart cannot find the edge of the start bit and then sample 8 bits correctly without more edges to resync, the baudrates would have to differ quite a bit, like several %

-Lasse
On 3/8/2012 4:40 AM, langwadt@fonz.dk wrote:
>> Ok, a colleague of mine went through it and indeed the start-bit logic
>> is faulty [...]
>
> I'm trying real hard to understand what it is you are saying.
>
> If the uart cannot find the edge of the start bit and then sample 8
> bits correctly without more edges to resync, the baudrates would have
> to differ quite a bit, like several %
It depends on the size of the packet (*) that you send: for a 100-byte packet it would be ~0.5%, while with a 2 KB packet it is ~50%. The receiver does not resynchronize the input signal with its internal clock, and the condition for a start bit is met when the negated signal and the clocked signal are both 1. Here is a simplified snippet in vhdl:
> process (clk)
> begin
>   if rising_edge (clk) then
>     input_d <= input;
>   end if;
> end process;
>
> process (clk)
> begin
>   if rising_edge (clk) then
>     start_bit <= not input and input_d;
>   end if;
> end process;
Since 'not input' is not synchronized with the internal clock, the start_bit flip-flop may not have its hold time satisfied, hence the missed start bit.
> -Lasse >
(*) a packet is a continuous stream of characters (**)
(**) a character is what is between a start and a stop bit
On 3/7/2012 7:47 PM, Tim Wescott wrote:
[...]
> > I know it sounds like an oxymoron, but that's a really elegant kludge. > That you had to do it at all makes it a kludge -- but it looks like you > did a good job with it within the confines of what you had to work with.
I agree that it is a shame we had to introduce these additional layers, but if you look at the command level everything looks the same and all the kludge is left in the other levels, where the dirty work is being done. After all, there's always somebody doing the dirty work; in this case we may have found a solution that leaves the dirt down at the bottom.

So far we have not found pitfalls, but we will post it if that turns out to be the case.

Al