Ethernet programming question

Started by sburck March 27, 2012
I have a system based on the Freescale MPC5200B, running VxWorks, which has
a strange problem.

The system has a set of self tests which opens a socket server to the
world.  There is some PC software which opens a socket on the server for
communication, and then the user on the PC can run batteries of tests on
the card.  

The system also has a firmware uploader.  It works in the opposite way -
the PC software opens a socket server, and the embedded client connects to
it, after which the PC sends it new application software which the embedded
software burns.

Now the problem:  On 99% of the cards produced, this works fine.  On a few
of the cards, the uploader works, but the self-test software fails - the
firmware calls accept() and never returns, and never sees the PC trying to
open the sockets.  It is consistent - the cards that don't work, don't
work, and the others do.

You might say that this isn't exactly a software problem, and I'd agree,
but the company I made the software for is asking me:  What device is
failing that they need to replace?  If the uploader is working, the
Ethernet drivers and processor are working.

So, now that the explanation is out of the way, here is the question:  What
can fail in a call to socket "accept()" that is different from the
"connect()" in the opposite direction?  If accept() (blocking) never
returns, what is happening there that could point to the problem on the
'bad' set of cards?

Thanks	   
					
---------------------------------------		
Posted through http://www.EmbeddedRelated.com
On 2012-03-27, sburck <steve@n_o_s_p_a_m.outsourcerers.com> wrote:
> You might say that this isn't exactly a software problem, and I'd agree,
> but the company I made the software for is asking me: What device is
> failing that they need to replace? If the uploader is working, the
> Ethernet drivers and processor are working.
On the contrary, my initial suspicions _are_ in the area of software (or at least network configuration) problems.
> So, now the explanation is passed, and here is the question: What can
> fail in a call to socket "accept()" that is different from the "connect()"
> in the opposite direction? If accept (blocking) never returns, what is
> happening there that could point to the problem on the 'bad' set of cards?
I am assuming, with your use of accept() and friends, that this is a TCP/IP
based socket server and not some lower level or specialist protocol.

What is the network configuration on the failing devices? Does the
Freescale board respond to ping requests? (Assuming you are not dropping
ICMP packets for security reasons.)

My initial investigations would be in determining that the IP address,
subnet and gateway (if any) on the Freescale board all match what you think
they are.

If you are reusing the same IP address on different boards, are you making
sure that the ARP cache on any relevant network device is flushed after
swapping the boards and _before_ trying to connect to the new board?

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
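The address check suggested above can be done by hand, but as a rough
sketch it could also be scripted on the test PC. The addresses and netmask
below are hypothetical placeholders, not values from this thread. With no
gateway configured, the PC and the board can only talk if they sit in the
same subnet.

```python
import ipaddress


def same_subnet(pc_ip: str, board_ip: str, netmask: str) -> bool:
    """Return True if the board address lies in the PC's IPv4 network."""
    # strict=False lets us pass a host address plus mask instead of a
    # network address; the host bits are masked off for us.
    pc_net = ipaddress.ip_network(f"{pc_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(board_ip) in pc_net


# Hypothetical addresses, for illustration only.
print(same_subnet("192.168.1.10", "192.168.1.20", "255.255.255.0"))  # True
print(same_subnet("192.168.1.10", "192.168.2.20", "255.255.255.0"))  # False
```

Since the uploader already works over the same link, a mismatch here would
be surprising, but the check is cheap.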
On Mar 27, 1:22 pm, "sburck" <steve@n_o_s_p_a_m.outsourcerers.com> wrote:
> So, now the explanation is passed, and here is the question: What can
> fail in a call to socket "accept()" that is different from the "connect()"
> in the opposite direction? If accept (blocking) never returns, what is
> happening there that could point to the problem on the 'bad' set of cards?
You are looking at the high level only, while this is likely a problem that
can only be detected at the low level. I have used the 5200B and written
everything for it from scratch, including the SDMA (or BestComm, as they
also call it) microcode, and I can say the part works fine, but there are
plenty of details, known and unknown silicon bugs, etc.

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
On Mar 27, 12:22 pm, "sburck" <steve@n_o_s_p_a_m.outsourcerers.com> wrote:
> So, now the explanation is passed, and here is the question: What can
> fail in a call to socket "accept()" that is different from the "connect()"
> in the opposite direction? If accept (blocking) never returns, what is
> happening there that could point to the problem on the 'bad' set of cards?
accept() waits indefinitely for a connection request from the other end. If
no connection request is received, or if the connection request is
discarded, the call to accept() will never return.

Here are ten reasons why a connection request might be discarded:
http://www-01.ibm.com/support/docview.wss?uid=swg21063711
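To illustrate why accept() depends entirely on the connection request
arriving: on BSD-style socket stacks the TCP handshake is completed by the
stack itself once the socket is listening, and accept() only hands over a
connection that is already established. A minimal sketch in Python on
loopback follows. It is not VxWorks code; the assumption is that the
VxWorks stack behaves like other BSD-derived stacks in this respect.

```python
import socket

# Listening socket on loopback; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

# The client connects BEFORE the server has called accept().
# create_connection() raises if the handshake does not complete,
# so reaching the next line means the stack finished it on its own.
client = socket.create_connection(("127.0.0.1", port), timeout=2.0)
handshake_done_before_accept = True

# accept() now returns at once with the already-established connection.
server.settimeout(2.0)
conn, peer = server.accept()
print("accepted connection from", peer)

conn.close()
client.close()
server.close()
```

Seen this way, an accept() that never returns suggests the connection
request never made it into the listen queue, either because it was not
received or because the stack discarded it.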
Wow - so many responses after a few hours.  I'll respond to them in order:

Simon - Your TCP/IP assumption is correct.  All the boards share a single,
reused, fixed IP address and subnet, with no gateway.  The only point of
the Ethernet connection here is for manufacturing and test purposes - the
board uses an external communication bus (not on the 5200) for operational
communication.  The tester setup is as follows:  Power down the system
(except for the PC running the tester software).  Attach a card to the test
bed and power it up.  Run the uploader and put the test application
software on the card.  Cycle power again, then run the test software on the
PC and connect to run the tests.  The same cards fail every time (after
talking with the people running the tests, it's better than 99% - only two
specific cards are failing out of a batch of 500).  No ARP cache to clear
in this situation.

Dimiter - Yes, I know, I'm just trying to see if there's anything known
about the accept() call that is activating something so I can give the
manufacturing floor a pointer on what is wrong with the specific cards.

Lanarcam - That was the kind of information I think I was looking for, but
the link is no good.  If you could recheck and repost, I'd appreciate it.	 
 
					
On Mar 27, 6:01 pm, "sburck" <steve@n_o_s_p_a_m.outsourcerers.com> wrote:
> Lanarcam - That was the kind of information I think I was looking for, but
> the link is no good. If you could recheck and repost, I'd appreciate it.

The link works for me, but you can Google for:
"TCP/IP discards SYN packet from client IBM"
On 2012-03-27, sburck <steve@n_o_s_p_a_m.outsourcerers.com> wrote:
> Simon - Your TCP/IP assumption is correct. And all the boards share a
> single, reused, fixed IP address, subnet, no gateway. The only point of
> the ethernet connection here is for manufacturing and test purposes - the
> board uses an external communication bus (not on the 5200) for operational
> communication. The tester setup is like such. Power down the system
> (except for the PC running the tester software). Attach a card to the test
> bed and power it up. Run the uploader and put the test application
> software on the card. Cycle Power again, run on the PC the test software
> and connect to run the tests. The same cards (after talking with the
> people running the tests, it's better than 99% - only two specific cards
> are failing out of a batch of 500). No ARP cache to clear in this
> situation.
In theory, there is potentially an ARP cache on the PC as well, which could
come into play as an issue (unless the PC itself is power cycled between
tests) when the PC tries to initiate an outgoing connection to the board.
However, given the numbers involved (only 2 failures out of 500), I now
strongly suspect this has nothing to do with it.

My recommendation now is that you install Wireshark on your test PC (or
maybe a second PC) and watch what happens on the network when your PC based
test program tries to talk to the board. Compare this to what happens when
you run your test program against a known good board.

Wild speculation time: it sounds like you are using a TFTP loader present
on the board to do the initial firmware load. Do you have a way of dumping
the loaded image back to the test PC using the onboard loader so that you
can compare it to the original image? It's always possible that the image
written to the board is bad due to faulty flash on the board.

Do you have any other output (say serial port or LED activity) which
indicates that the loaded image is actually running on the board?

Simon.
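As a quick complement to a Wireshark capture, the test PC can also
classify a single connection attempt. The sketch below is only an
illustration: the function name probe() is made up here, and the address
and port in the commented usage line are hypothetical.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected"   # handshake completed
    except ConnectionRefusedError:
        return "refused"     # peer answered with RST: nothing is listening
    except socket.timeout:
        return "timeout"     # SYN went unanswered within the timeout
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    finally:
        sock.close()


# Hypothetical usage against a board (address and port are placeholders):
# print(probe("192.168.1.20", 5000))
```

A "refused" result would mean the board's stack is alive but nothing is
listening on that port, which points at the test application not running.
A "timeout" would mean the SYN got no answer at all, which fits a request
that is lost or discarded before it reaches the listening socket.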
On 2012-03-27, Simon Clubley <clubley@remove_me.eisner.decus.org-Earth.UFP> wrote:
> In theory, there is potentially an ARP cache on the PC as well which could
> come into play as an issue (unless the PC itself is power cycled between
> tests) when the PC tries to initiate an outgoing connection to the board.
And yes, before anyone comments :-), since the OP is using the same IP
address for the firmware load as when running his program, any ARP cache on
the PC should be updated by the firmware load, so that cannot be it.

My recommendation is to run Wireshark and see if that tells you anything.

Simon.
sburck wrote:
> So, now the explanation is passed, and here is the question: What can
> fail in a call to socket "accept()" that is different from the "connect()"
> in the opposite direction? If accept (blocking) never returns, what is
> happening there that could point to the problem on the 'bad' set of cards?
Use the nonblocking version of accept() and dump ifShow(), arpShow() and
other such when it fails.

-- 
Les Cargill
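The pattern suggested above, waiting with a timeout instead of blocking
forever, can be sketched as follows. This is Python on loopback purely to
show the shape of the logic; on the target it would be C around select()
and accept(), and the helper name accept_with_timeout() is made up for
this example.

```python
import select
import socket


def accept_with_timeout(listener: socket.socket, timeout_s: float):
    """Wait up to timeout_s for a connection; return (conn, addr) or None."""
    listener.setblocking(False)
    # A listening socket becomes readable when a connection is pending.
    readable, _, _ = select.select([listener], [], [], timeout_s)
    if not readable:
        return None  # nothing arrived: the caller can dump diagnostics
    try:
        return listener.accept()
    except BlockingIOError:
        return None  # connection went away between select() and accept()


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

result = accept_with_timeout(listener, 0.5)
if result is None:
    # On the board, this is where ifShow(), arpShow() and similar
    # diagnostics would be called.
    print("no connection within timeout; dump interface and ARP state here")
listener.close()
```

The point of the timeout is that the firmware gets control back on a
failing card and can report the state of the interface, rather than sitting
in accept() with no output.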
More comments on the comments:

First, they called me from downstairs:  Whoever logged the cards logged
them incorrectly; this is only happening on one card, not two.

Lanarcam - found the article, and it is interesting.  Items 1-6 and 8-10
don't seem to apply to what's going on here, but item 7 is very
interesting:

7.  If the stack is unable to allocate storage to represent the connection,
the SYN is discarded. This should not occur.

I'll tell them to replace memories on the card based on that and hope for
the best.

If that fails, we'll start with Wireshark and see if we can see anything.
I don't want to go changing the test firmware for something that is only
part of the manufacturing and test part of the code; the cost of
re-running all the verification tests alone would be too much, not to
mention the headache of reburning all these cards.
					