
[LDM #WKI-561206]: Fwd: Beta LDM to test?



Mike,

> I have a situation on the COD Noaaport servers Gilbert has been testing LDM
> on that I wanted to bring to your attention.  We have two Noaaport
> servers, and to my knowledge each is running the latest beta version
> (looks like 6.13.7.62).

More like "alpha".

> On a downstream server running LDM-6.13.6 I've
> been monitoring bandwidth, and I've been noticing sharp increases in
> traffic coming from Unidata and Wisc.edu idd servers via LDM.  When I look
> at the logs, this downstream server isn't able to connect to either of the
> two Noaaport servers, so it fails over to retrieve data from the other
> sites.  (noaaport1.cod.edu[9551] WARN error.c:236:err_log() Couldn't
> connect to LDM on noaaport1.cod.edu using either port 388 or portmapper; :
> RPC: Remote system error - Connection refused)

A "connection refused" message usually means that there's no appropriate ALLOW 
entry in the upstream LDM's configuration-file.

We've also seen that message due to a firewall or intrusion 
detection/prevention system (IDS, IPS). In such cases, one can try using 
telnet(1) and/or ncat(1):

    telnet noaaport1.cod.edu 388
    ncat noaaport1.cod.edu 388

ncat(1) is called nc(1) on some systems.
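
The same check can be scripted when several upstream hosts need testing. The following is a minimal sketch, not part of LDM: the function name probe_tcp and its return labels are made up for illustration. Like the telnet test, it exercises only the TCP path to port 388, and it separates an outright refusal from a silent timeout, which may help tell a rejected connection from dropped packets.

```python
import socket


def probe_tcp(host, port=388, timeout=5.0):
    """Classify one TCP connection attempt to host:port.

    Hypothetical helper: returns "open", "refused", "timeout",
    or "error: <reason>" for anything else (e.g. a DNS failure).
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote end (or something in between) actively rejected us.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

Calling probe_tcp("noaaport1.cod.edu") from the downstream host would then return one of the labels above.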

> A remake of the product queue and a restart of LDM on the Noaaport servers,
> and the downstream server connects immediately and traffic from the
> fail-over sites ceases.  If I run pqcheck prior to deleting the queue it
> returns status 3, which in my experience is normal / no problems found.

An exit code of 3 indicates that the writer-counter in the queue isn't zero, 
either because a process has the queue open for writing or because a process 
terminated without closing the queue (because it was killed, for example). If 
no process has the queue open for writing, then it's a good idea to execute the 
command "pqcat -l- -s && pqcheck -F -q".
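
For monitoring scripts, the exit status can be turned into a readable message. This is a rough sketch under stated assumptions: the names KNOWN_STATUS, describe_exit and run_and_describe are hypothetical, the meaning of 3 is taken from the explanation above, 0 is assumed to follow the usual "success" convention, and every other status is deliberately left to pqcheck(1).

```python
import subprocess

# Only the statuses discussed here are mapped; this is not a complete table.
KNOWN_STATUS = {
    0: "ok: no problem reported",  # assumption: usual success convention
    3: ("writer-counter is not zero: a process has the queue open for "
        "writing, or one terminated without closing it"),
}


def describe_exit(status):
    """Return a human-readable description of an exit status."""
    return KNOWN_STATUS.get(status, f"status {status}: see pqcheck(1)")


def run_and_describe(argv):
    """Run the command in argv and return (exit status, description)."""
    status = subprocess.run(argv).returncode
    return status, describe_exit(status)
```

Assuming pqcheck is on the PATH and finds the default queue, run_and_describe(["pqcheck"]) would report the status before any decision to delete the queue.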

> It
> looks like there is more verbose logging turned on with these betas so the
> logs are hundreds of megs, sometimes over a gig in size, but after
> filtering out things like decoding and grib2 errors I do see what seems
> like an excessive number of "Gap in packet sequence" messages.  Hard for
> me to tell if this is related or a red herring, but figured I'd mention
> it.

That's a separate issue caused by poor NOAAPort reception.

> Also, it's common that some combination of ldmd, noaaportIngester or
> even rtstats procs hang on 'ldmadmin stop' where I need to forcibly kill
> them in order to restart LDM, and in some cases reclaim disk space from the
> product queue; it's far from cleanly shutting down.

You should give it at least 30 seconds.
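
A restart script can enforce that grace period before resorting to a forced kill. The sketch below is only an illustration: wait_until_stopped and ldmd_running are hypothetical names, and the predicate assumes a pgrep(1) that exits with 0 when a process matches. Similar predicates could be added for the other programs.

```python
import subprocess
import time


def wait_until_stopped(is_running, timeout=30.0, interval=1.0):
    """Poll is_running() until it returns False or timeout seconds elapse.

    Returns True if the processes stopped in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def ldmd_running():
    # Hypothetical predicate: true while any process named exactly "ldmd" exists.
    result = subprocess.run(["pgrep", "-x", "ldmd"], stdout=subprocess.DEVNULL)
    return result.returncode == 0
```

After "ldmadmin stop", a script would call wait_until_stopped(ldmd_running) and consider killing processes only if it returns False.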

> Today was the second time within a week where I've had to restart LDM on
> those servers to get the downstream server to reconnect.  I don't recall
> having this kind of issue prior to testing the beta version.  And since
> I've been watching bandwidth more closely I suspect this has been happening
> pretty regularly in the last month at least maybe longer, but I'm only just
> now connecting the dots of the connection issue, fail-over & the rest.  I
> don't like that I can't depend on either of the Noaaport servers, so if
> this keeps happening I'm probably going to revert to 6.13.6 on at least one
> of them.  I'll let you all know if/when I do.

You might try running 6.13.6 on one of the NOAAPort ingest systems to see if it 
terminates better.

> If anyone has any ideas I'd love to hear them.  Otherwise I'll keep you
> posted.

Appreciate it.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WKI-561206
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.