20050905: Weird connection problem between idd and freshair

>From: Harry Edmon <address@hidden>
>Organization: University of Washington
>Keywords: 200509060344.j863icjo012387 IDD

Hi Harry,

>There seems to have been a connection problem between idd.unidata.ucar.edu
>and freshair.atmos.washington.edu this evening.  Starting at 2346 UTC (1646
>PDT) I start to see this in my log files:
>Sep  5 16:46:05 freshair2 idd[29099] ERROR: Terminating due to LDM failure; Connection to upstream LDM closed
>Then at 2352 I get:
>Sep  5 16:52:30 freshair2 idd[29098] ERROR: Terminating due to LDM failure; Couldn't connect to LDM on idd.unidata.ucar.edu using either port 388 or portmapper; : RPC: Remote system error - Connection timed out
>This continued until 0309 UTC on the 6th (2009 PDT).  What happened?

The machine that hosts the director portion of the IDD relay cluster we
manage got into a weird state at about 17:35.  I got home at about
20:00 and noticed ldmping failures to the cluster and started
investigating.  I called Mike Schmidt and we both tried to log on to the
cluster's director, but all attempts failed.  Mike was forced to drive
into work and manually reboot the director -- it was in some sort of
snit related to networking and was spewing errors to the system
console.  Luckily, the machine rebooted cleanly, and IDD relay was
restored.  The cluster data backends stayed up and had a full set of
data so that when the director came back up LDMs that had not exited
resumed ingestion more or less from where they left off.

We will be installing a network power switch tomorrow so that we can
log in to it and force a reboot of the director by power cycling.

Sorry for the data interruption...


NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.