
Re: 20011204: LDM Failover Issues



Hi Patrick, 

We are working on improving ldmfail.

ldmfail does an ldmadmin stop and then an ldmadmin start without checking
to see if any processes still exist. So if you were in the middle of a
large product when your upstream site went down, the ldm was still
acting on it in some way, and the cron job fired at that time, you may get
an ungraceful exit from ldmfail. In our new release we are putting in
checks for rpc's, and perhaps a sleep 20 between the ldmadmin stop and
the ldmadmin start in ldmfail.
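
As a rough sketch of that stop, check, start sequence (a wrapper written
for illustration, not the actual ldmfail change): only "ldmadmin stop" and
"ldmadmin start" come from the LDM itself. The function names, the use of
pgrep, the process name rpc.ldmd, and the 20-second limit are assumptions;
adjust them to whatever ps shows on your system.

```shell
#!/bin/sh
# Sketch only: restart the LDM, but do not start it until the old
# processes are gone. All names and values below are assumptions.

LDMADMIN=${LDMADMIN:-ldmadmin}     # command used to stop/start the LDM
LDM_PROC=${LDM_PROC:-rpc.ldmd}     # assumed name of the LDM server process
MAX_WAIT=${MAX_WAIT:-20}           # seconds to wait for a clean exit

# Succeeds (exit status 0) while any LDM server process still exists.
ldm_running() {
    pgrep -x "$LDM_PROC" >/dev/null 2>&1
}

# Stop the LDM, poll once a second until it is gone, then start it.
# Returns 1 without starting if processes remain after MAX_WAIT seconds.
safe_restart() {
    $LDMADMIN stop
    waited=0
    while ldm_running; do
        if [ "$waited" -ge "$MAX_WAIT" ]; then
            echo "safe_restart: LDM still running after ${MAX_WAIT}s, start aborted" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    $LDMADMIN start
}
```

Calling safe_restart in place of a bare stop/start pair means the start
only happens once the old server has gone away, instead of relying on a
fixed sleep alone.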


Yup, it sure looks like papagayo and stokes were stepping on each other.

Sorry about the bug in ldmfail; we hope to get the new release out soon.

Timing is everything. I think this was a rare occurrence; we do not get
many of these issues in support, so I hope things will proceed smoothly
until we can remedy ldmfail.


We removed the ^M's from the file yesterday.
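
For reference, one way to check a file for those carriage returns and
strip them is sketched below. The function names are made up for
illustration, and this is not necessarily how we cleaned up your file.

```shell
#!/bin/sh
# Sketch only: find and strip carriage returns (shown as ^M in vi).

CR=$(printf '\r')    # a literal carriage-return character

# Succeeds (exit status 0) if file $1 contains at least one carriage return.
has_cr() {
    grep "$CR" "$1" >/dev/null
}

# Write a copy of file $1 to $2 with every carriage return removed.
# Tabs, which pqact.conf depends on, are left untouched.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

Writing to a second file rather than editing in place leaves the original
available for comparison before it is replaced.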


Thanks,


ps: got a mailer re the STORM project, saw your pic, sounds like you all
are doing some exciting work!

-Jeff
____________________________                  _____________________
Jeff Weber                                    address@hidden
Unidata Support                               PH:303-497-8676 
NWS-COMET Case Study Library                  FX:303-497-8690
University Corp for Atmospheric Research      3300 Mitchell Ln
http://www.unidata.ucar.edu/staff/jweber      Boulder,Co 80307-3000
________________________________________      ______________________

On Fri, 4 Jan 2002, Patrick O'Reilly wrote:

> Hi Jeff,
> 
> When you were on blizzard, things were going okay, as I did reboot it.
> Rebooting it fixes the problem.
> 
> I agree that the ldmfail may fire up the ldm incorrectly.  Look at the
> progress of these email messages.  Looks like the ldmfail runs, decides to
> switch to failover, then tries to restart the ldm while one is running.
> Then, the next time ldmfail runs (20 mins later) it restarts the ldm, but by
> now things have probably gone haywire, correct?  Seems the ldmfail is
> deciding to switch to failover without first stopping the ldm.  Is that what
> I'm seeing?  And why would that be, did I mess up my script somehow?  I'll
> look at it, see what I see.
> 
> 
> > From ldm Wed Jan  2 10:00:05 2002
> > Date: Wed, 2 Jan 2002 10:00:04 -0600 (CST)
> > From: ldm
> > Message-Id: <address@hidden>
> > Subject: Local LDM restarted
> > Content-Length: 29
> >
> >
> > ldmfail: Jan 2 16:00:04 UTC
> >
> >
> > From ldm Wed Jan  2 09:40:20 2002
> > Date: Wed, 2 Jan 2002 09:40:19 -0600 (CST)
> > From: ldm
> > Message-Id: <address@hidden>
> > To: ldm
> > Subject: Output from "cron" command
> > Content-Length: 222
> >
> > Your "cron" job on blizzard
> > bin/ldmfail -p "stokes.metr.ou.edu" -f "papagayo.unl.edu"
> >
> > produced the following output:
> >
> > Jan 2 15:40:19 UTC blizzard.storm.uni.edu : start_ldm: There is another server running, start aborted
> >
> >
> > From ldm Wed Jan  2 09:40:19 2002
> > Date: Wed, 2 Jan 2002 09:40:19 -0600 (CST)
> > From: ldm
> > Message-Id: <address@hidden>
> > Subject: Switched LDM to failover feed
> > Content-Length: 29
> >
> > ldmfail: Jan 2 15:40:19 UTC
> >
> 
> Also, I'm not seeing the ^M's in my pqact.conf.  I opened it using Vi and
> don't see any.  I remember them at one point, but thought I removed all of
> them.
> 
> Thank you......
> 
> Patrick
> 
> ----- Original Message -----
> From: "Jeff Weber" <address@hidden>
> To: "Patrick O'Reilly" <address@hidden>
> Cc: "ldm-support" <address@hidden>
> Sent: Wednesday, January 02, 2002 6:47 PM
> Subject: Re: 20011204: LDM Failover Issues
> 
> 
> > Hi Patrick,
> >
> > We've been on blizzard for a while now....
> >
> > It looks as if things are going OK now..?
> >
> > Is this correct? Did you stop and re-start the ldm around 17Z?
> >
> > We are thinking that ldmfail may fire up the ldm without allowing the ldm
> > to shut down gracefully..
> >
> > Anything you did after you sent this message would be helpful..
> >
> > Also FYI..your editor leaves a ^M at the end of your lines in
> > pqact.conf...this is not good.
> >
> > Keep us posted..
> >
> > Thanks,
> >
> > -Jeff
> >
> > On Wed, 2 Jan 2002, Patrick O'Reilly wrote:
> >
> > > Hi Jeff -
> > >
> > > Hope your holidays were good (if you're not still on 'em).  If so, I
> > > hope they are still good.  Anyway...
> > >
> > > Stokes went kaput today, and the change I made to ldmfail didn't seem to
> > > erase the problem that data stopped being decoded.  Still came in great,
> > > but none showing up in the data directories.  You had me add my ldm paths
> > > right to the path variable in ldmfail.  My ldm home directory is
> > > /usr/local/ldm.  Here's the section I changed in ldmfail:
> > >
> > >
> > > ###############################################################################
> > > # END OF CONFIGURATION SECTION
> > > ###############################################################################
> > > # identify ourselves and set up some extra stuff we will need
> > > $PROGNAME = "ldmfail" ;
> > > $lock_file = "/tmp/.ldmadmin.lck";
> > >
> > > $primary = "missing" ;
> > > $failover = "missing" ;
> > >
> > > # Dependencies:
> > > $ENV{ 'PATH' } =
> > > "/bin:/sbin:/usr/local/bin:/usr/ucb:/usr/bsd:/usr/bin:/usr/local/ldm/bin:/usr/local/ldm/decoders:/usr/etc:/usr/ccs/bin:$ENV{ 'PATH' }" ;
> > >
> > >
> > > And here's a snippet from my ldmd.log when things were awry:
> > >
> > >
> > > Jan 02 17:44:09 blizzard pqact[13027]: pipe_prodput: trying again
> > > Jan 02 17:44:09 blizzard pqact[13027]: pbuf_flush (4) write: Broken pipe
> > > Jan 02 17:44:09 blizzard pqact[13027]: pipe_dbufput: -closedecoders/dcgrib2-ddata/gempak/logs/dcgrib_radar.log-eGEMTBL=/export/home/gem
> > > Jan 02 17:44:09 blizzard pqact[13027]: child 13380 terminated by signal 9
> > > Jan 02 17:44:09 blizzard pqact[13027]: child 13379 terminated by signal 9
> > > Jan 02 17:44:21 blizzard pqact[13027]: pbuf_flush (4) write: Broken pipe
> > > Jan 02 17:44:21 blizzard pqact[13027]: pipe_dbufput: -closedecoders/dcgrib2-ddata/gempak/logs/dcgrib_radar.log-eGEMTBL=/export/home/gem
> > > Jan 02 17:44:21 blizzard pqact[13027]: pipe_prodput: trying again
> > > Jan 02 17:44:21 blizzard pqact[13027]: pbuf_flush (4) write: Broken pipe
> > > Jan 02 17:44:21 blizzard pqact[13027]: pipe_dbufput: -closedecoders/dcgrib2-ddata/gempak/logs/dcgrib_radar.log-eGEMTBL=/export/home/gem
> > > Jan 02 17:44:21 blizzard pqact[13027]: child 13382 terminated by signal 9
> > > Jan 02 17:44:21 blizzard pqact[13027]: child 13381 terminated by signal 9
> > > Jan 02 17:44:23 blizzard pqact[13027]: pbuf_flush (4) write: Broken pipe
> > > Jan 02 17:44:23 blizzard pqact[13027]: pipe_dbufput: -closedecoders/dcgrib2-ddata/gempak/logs/dcgrib_radar.log-eGEMTBL=/export/home/gem
> > > Jan 02 17:44:23 blizzard pqact[13027]: pipe_prodput: trying again
> > > Jan 02 17:44:23 blizzard pqact[13027]: pbuf_flush (4) write: Broken pipe
> > > Jan 02 17:44:23 blizzard pqact[13027]: pipe_dbufput: -closedecoders/dcgrib2-ddata/gempak/logs/dcgrib_radar.log-eGEMTB
> > >
> > >
> > > Any other suggestions would be great.  I think you mentioned putting the
> > > path to the decoders in the cron, but didn't give a specific example.
> > > If you think this would be better, or have other fixes, let me know.  Once
> > > again, thank you from Icy Cornville (Iowa).
> > >
> > > Patrick
> > >
> > > ----- Original Message -----
> > > From: "Jeff Weber" <address@hidden>
> > > To: "Patrick O'Reilly" <address@hidden>
> > > Cc: "ldm-support" <address@hidden>
> > > Sent: Tuesday, December 04, 2001 1:49 PM
> > > Subject: Re: 20011204: LDM Failover Issues
> > >
> > >
> > > > Hello Patrick,
> > > >
> > > > The issue here, I believe, is an environment issue.
> > > >
> > > > ldmfail is a Perl script that will get executed via a Bourne shell.
> > > >
> > > > I suspect you are running in a c-shell (by the sea-shore).
> > > >
> > > > The Bourne shell will not pick up the attributes (paths) that are set
> > > > in your c-shell.
> > > >
> > > > Soooo, we can either place the path for the decoders in the cron (set
> > > > path, blah/blah/blah, run ldmfail) or you can "hack" your ldmfail
> > > > program to include the paths to your decoders.
> > > >
> > > > Check the "Dependencies"
> > > >
> > > > i.e. > from motherlode
> > > >
> > > > ##############################################################################
> > > > # END OF CONFIGURATION SECTION
> > > > ###############################################################################
> > > > # identify ourselves and set up some extra stuff we will need
> > > > $PROGNAME = "ldmfail" ;
> > > > $lock_file = "/tmp/.ldmadmin.lck";
> > > >
> > > > $primary = "missing" ;
> > > > $failover = "missing" ;
> > > >
> > > > # Dependencies:
> > > > $ENV{ 'PATH' } =
> > > > ".:/usr/ccs/bin:/opt/SUNWspro/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/gnu/bin:/usr/openwin/bin:/opt/ldm/bin:/opt/ldm/util:/opt/ldm/decoders" ;
> > > >
> > > >
> > > > and if your install is the same as motherlode this should work.
> > > >
> > > > If your ldm dir tree is different, then the appropriate changes would
> > > > need to be made.
> > > >
> > > >
> > > > on lenny:
> > > > ###############################################################################
> > > > # END OF CONFIGURATION SECTION
> > > > ###############################################################################
> > > > # identify ourselves and set up some extra stuff we will need
> > > > $PROGNAME = "ldmfail" ;
> > > > $lock_file = "/tmp/.ldmadmin.lck";
> > > >
> > > > $primary = "missing" ;
> > > > $failover = "missing" ;
> > > >
> > > > # Dependencies:
> > > > $ENV{ 'PATH' } =
> > > > ".:/bin:/usr/bin:/opt/SUNWspro/bin:/usr/ccs/bin:/usr/local/ldm/bin:/usr/local/ldm/decoders:/usr/local/bin:/usr/etc:/usr/ucb:/usr/local/gnu/bin" ;
> > > >
> > > >
> > > > notice on lenny:/usr/local/ldm/decoders
> > > >
> > > > and on motherlode:/opt/ldm/decoders
> > > >
> > > >
> > > > We are working on a more graceful ldmfail program, but that will be
> > > > months.
> > > >
> > > >
> > > > Hope this sheds some light on the subject.
> > > >
> > > > FYI...did not get your attachment.
> > > >
> > > > Thank you,
> > > >
> > > > -Jeff
> > > >
> > > > On Tue, 4 Dec 2001, Unidata Support wrote:
> > > >
> > > > >
> > > > > ------- Forwarded Message
> > > > >
> > > > > >To: Unidata Support <address@hidden>
> > > > > >From: "Patrick O'Reilly" <address@hidden>
> > > > > >Subject: LDM Failover Issues
> > > > > >Organization: UCAR/Unidata
> > > > > >Keywords: 200112041640.fB4GeeN16636
> > > > >
> > > > > Hi there again!
> > > > >
> > > > > > I have found that when the LDM fails over, whether it is to the
> > > > > > failover host or back to the primary host, my hard drive fills up
> > > > > > with errors, as data is no longer being decoded due to broken pipes,
> > > > > > write errors, etc.  I have attached a clip from a 13MB ldmd.log file
> > > > > > to illustrate these messages.  I have found a support email that
> > > > > > mentions this problem without telling how to fix it
> > > > > > (http://www.unidata.ucar.edu/glimpse/ldm/3301).  The fix actually
> > > > > > mentioned in the support email, I guess, is to comment out ldmfail
> > > > > > in cron, if the primary host is reliable.  Have there been other
> > > > > > reports of this with ldmfail and are there fixes?  Thanks!
> > > > >
> > > > > Patrick
> > > > >
> > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > Patrick O'Reilly             Support Scientist
> > > > > The STORM Project            address@hidden
> > > > > 208 Latham Hall              ph: 319-273-3789
> > > > > University of Northern Iowa
> > > > > Cedar Falls, IA 50614
> > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > >
> > > > > ------- End of Forwarded Message
> > > > >
> > > > >
> > > >
> > >
> > >
> >
> 
>