20040311: Datoo (LSU Noaaport) failure

>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200403111652.i2BGqgrV001903 NOAAPORT ingest MPS300


>The LSU NOAA port failed yesterday requiring an IDD restart to restore 
>operation. I captured the output of the commstats program with hope of 
>understanding why the process failed. Commstats reported:
>(datoo:ldm) 2 % ./commstats -s storm -S mps
>System Report for Server: storm
>mbuflist:     current 951,  min 855,    max     1024
>mcllist:      current 150,  min 132,    max     150
>dblk 1 (64):  current 4,    max 5,      tot     16,     fail  0
>dblk 2(5200): current 1,    max 570,    tot     600,    fail  8909428
>total dblks:  current 5,    max 573,    tot     616,    fail  8909428
>total mblks:  current 5,    max 573,    tot     616,    fail  0
>memory avail: current 63628,min 63628
>ldmadmin watch indicated no data was being received at this point. After 
>restarting ldmadmin, commstat reported:
>(datoo:ldm) 8 % ./commstats -s storm -S mps
>System Report for Server: storm
>mbuflist:     current 912, min  855, max  1024
>mcllist:      current 142, min  132, max  150
>dblk 1 (64):  current 6,   max  7,   tot  16,     fail    0
>dblk 2 (5200):current 566, max  570, tot  600,    fail    8913451
>total dblks:  current 572, max  573, tot  616,    fail    8913451
>total mblks:  current 572, max  573, tot  616,    fail    0
>memory avail: current 63488,min 63488
>(datoo:ldm) 9 %
>We are currently receiving data, but I am concerned about the process 
>stopping, and the fail counter in the commstat report. The fail counter 
>is incrementing even now after the IDD restart.

The fail counter will increase as products come in when they are not
being read out of the unit.  After your restart of the LDM, the
fail numbers should stop increasing _if_ the data reception is good.

>  What is causing the counter to advance? Do we have a satellite signal 
>strength problem?

If your numbers continue to increase significantly even while the
LDM is running and, therefore, hdlcrecv is reading from the MPS300,
then it does indicate that you have some sort of signal problem.
The way you reported the numbers above, however, leads me to think
that the increase in the fail count was due to products coming in
while hdlcrecv was _not_ reading from the box.

>Or is
>this an indication of problems between the MPS300 box and Datoo? My 
>guess is that this concerns the signal quality in the receiver.

Not reading from the box while data is coming in can result in the
fail count increasing.  The fact that your LDM stopped/was hung/whatever
says that hdlcrecv was not reading from the box, so we would expect
the fail counts to increase.

Please check the fail counts every few minutes to see if they are
continuing to increase even though the LDM is running and data
is being ingested.  If they do, then the best guess is a signal
problem, or, perhaps, a cabling problem.
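
If it helps, the comparison can be scripted.  What follows is only a
minimal sketch, not a Unidata-supplied tool: it assumes you have saved
two commstats reports as text (for example, from the
'./commstats -s storm -S mps' invocation shown above, taken a few
minutes apart), and it assumes the "label: ... fail N" line layout seen
in your reports.  The names parse_fail_counts and fail_deltas are
hypothetical helpers made up for this illustration.

```python
import re

# Assumed layout, taken from the reports quoted above, e.g.
#   "dblk 2 (5200):current 566, max  570, tot  600,    fail    8913451"
# An optional leading ">" (mail quoting) is tolerated.
FAIL_LINE = re.compile(r"^>?\s*(?P<label>[^:]+):.*\bfail\s+(?P<fail>\d+)\s*$")


def parse_fail_counts(report):
    """Return {label: fail count} for every report line that has a fail field."""
    counts = {}
    for line in report.splitlines():
        match = FAIL_LINE.match(line)
        if match:
            # Normalize spacing so "dblk 2(5200)" and "dblk 2 (5200)" agree.
            label = re.sub(r"\s*\(", " (", match.group("label"))
            label = " ".join(label.split())
            counts[label] = int(match.group("fail"))
    return counts


def fail_deltas(earlier, later):
    """Return {label: increase in fail count} between two saved reports."""
    old = parse_fail_counts(earlier)
    new = parse_fail_counts(later)
    return {label: new[label] - old[label] for label in new if label in old}


if __name__ == "__main__":
    # The dblk lines from the two reports in the original message.
    BEFORE = (
        "dblk 1 (64):  current 4,    max 5,      tot     16,     fail  0\n"
        "dblk 2(5200): current 1,    max 570,    tot     600,    fail  8909428\n"
    )
    AFTER = (
        "dblk 1 (64):  current 6,   max  7,   tot  16,     fail    0\n"
        "dblk 2 (5200):current 566, max  570, tot  600,    fail    8913451\n"
    )
    for label, delta in fail_deltas(BEFORE, AFTER).items():
        print(label, delta)
```

A delta that stays at zero between samples taken while the LDM is
ingesting would be consistent with good reception; a delta that keeps
growing would point toward the signal or cabling.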

