[Support #WCE-307093]: ldm routing problem and solution



Tom,

> We had a problem with ldm and found a solution for it.  There was really
> nothing wrong with ldm - it was due to the platform which, as usual,
> must be lived with.  Below is a description of the problem we were
> seeing and the resolution.  We saw this with v6.8.0 and also with v6.6.3.
> 
> ldms A1, A2, A3, and A4 feed ldm B, which in turn feeds everything it
> receives to ldm C.
> 
> At C, occasionally a file/product would not be received.  Examination of
> the queue at B showed it was there, i.e., it had been received by B but
> was not passed on to C.
> 
> Looking at the code in pq.c, one thing I could imagine was that if time
> were not sequential (i.e., time ran backwards), then the B feeder of C
> would read something from the queue, thus setting the cursor for the
> next read.  Then the clock is stepped back.  Then a product arrives from
> one of the Ax and is inserted with a time earlier than that of the
> cursor saved by the B feeder of C process.  So the next time the B
> feeder checks the queue, it does not see the just-inserted product.

Yup.  Bad things happen if time runs backwards.
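
The failure mode described above can be illustrated with a toy model.
This is only a sketch under stated assumptions: toy_queue, toy_insert,
toy_read_after, and toy_missed_after_backstep are hypothetical names and
are not the pq.c API.  The model assumes a time-sorted index and a reader
that only looks for entries later than its saved cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a time-ordered queue; NOT the LDM pq API. */
#define QCAP 16

typedef struct {
    long   times[QCAP];  /* insertion times in microseconds, kept sorted */
    size_t count;
} toy_queue;

/* Insert a time while keeping the array sorted, like a time-keyed index. */
static void toy_insert(toy_queue *q, long t)
{
    size_t i = q->count;

    assert(q->count < QCAP);
    while (i > 0 && q->times[i - 1] > t) {
        q->times[i] = q->times[i - 1];
        i--;
    }
    q->times[i] = t;
    q->count++;
}

/* Count entries strictly later than *cursor and advance the cursor past them. */
static size_t toy_read_after(const toy_queue *q, long *cursor)
{
    size_t seen = 0;

    for (size_t i = 0; i < q->count; i++) {
        if (q->times[i] > *cursor) {
            *cursor = q->times[i];
            seen++;
        }
    }
    return seen;
}

/* Replay the scenario; returns how many queued products the reader never saw. */
static size_t toy_missed_after_backstep(void)
{
    toy_queue q = { {0}, 0 };
    long      cursor = 0;
    size_t    delivered = 0;

    toy_insert(&q, 1000);
    toy_insert(&q, 2000);
    delivered += toy_read_after(&q, &cursor);   /* cursor is now 2000 */

    toy_insert(&q, 1500);   /* clock stepped back: stamped before the cursor */
    delivered += toy_read_after(&q, &cursor);   /* finds nothing new */

    return q.count - delivered;
}
```

In this model the product stamped 1500 sits in the queue but lies behind
the reader's cursor, so it is never passed on.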

> Well to make a long story short, I determined this was indeed
> happening.  A small test program that got time, slept a few
> microseconds, and then got time again would sometimes get an earlier
> time the second time.  It would do this whether ntpd was running or not.
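
A check along the lines described above might look like the following.
The original test program was not posted, so this is a reconstruction
under assumptions: count_backward_steps and tv_before are hypothetical
names, and the use of gettimeofday(2) and nanosleep(2) is a guess at what
such a program would call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: true if a is strictly earlier than b. */
static int tv_before(const struct timeval *a, const struct timeval *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_usec < b->tv_usec;
}

/*
 * Sample the wall clock twice per iteration with a short sleep in between
 * and count how often the second reading is earlier than the first.
 * Returns the count, or -1 if the clock could not be read.
 */
static long count_backward_steps(long iterations, long sleep_usec)
{
    struct timeval  before, after;
    struct timespec pause;
    long            backward = 0;

    assert(iterations >= 0 && sleep_usec >= 0 && sleep_usec < 1000000);
    pause.tv_sec = 0;
    pause.tv_nsec = sleep_usec * 1000L;

    for (long i = 0; i < iterations; i++) {
        if (gettimeofday(&before, NULL) != 0)
            return -1;
        (void)nanosleep(&pause, NULL);  /* an early wakeup is harmless here */
        if (gettimeofday(&after, NULL) != 0)
            return -1;
        if (tv_before(&after, &before))
            backward++;                 /* wall-clock time ran backwards */
    }
    return backward;
}
```

A nonzero return from something like count_backward_steps(1000000, 5)
would indicate that the wall clock stepped backwards at least once during
the run.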

Is computer "B" executing ntpdate(8) out of root's crontab(1) in order to
set the clock?  If so, then the solution would be to stop that and run 
ntpd(8) at boot-time.

> I made the change listed below to pq.c and it fixed the problem (or
> should I say hid the problem of dumb clock).  I ignore very large
> backwards time errors for similar reason you mention near the comment
> containing "ASSUMPTION" later in the code, but the threshold of one
> second is arbitrary.
> 
> Our problem is exacerbated by the fact that ldm B is a VMware VM,
> notorious for timekeeping problems.  However, I have heard of backwards
> time adjustment issues on non-vm systems. [I've also heard that
> correcting time in the reverse direction is smartly done by incrementing
> time counter slower than real time - maybe there is some threshold where
> the hard reverse step comes into play].
> 
> One thing that worried me was the added processing to make the tqe_find
> call.  It did not seem to impact the system much.  A better way might be
> to maintain a cursor near the end of the tq, but I wanted to keep
> everything straightforward and mainly demonstrate this was indeed the
> problem.
> 
> The data in question is the Level 2 feed from IRADS.  We connect to 4
> upstream ldms (actually 8 counting alternates).  When we connect to the
> same feed from ERC, we only connect to one ldm, which has the total
> feed.  We never (or so rarely we did not notice) have the problem
> there.  I suspect the added concurrency makes it show up quicker.
> 
> Although we are now fixed internally, we find that there are occasional
> missing files from IRADS.  I'm thinking they may have the same issue
> there.  When the files are missing from IRADS they are often present on
> the ERC feed and vice versa, so it's not a problem originating at the
> source.
> 
> Do you think this patch or something equivalent might be a candidate for
> a future release?

I'll have to think about it.

> Thanks
> Tom Thompson
> 
> The patch:
> 
> status = set_timestamp(&tp->tv);    /* set insertion-time to now */
> [line 591 in pq.c v6.8.0]
> 
> if (status == ENOERR) {
>     tqelem* tpp;
>     tqep_t  update[MAXLEVELS];
>     tqep_t  p = TQ_HEAD;
>     int     k = tq->level;
> 
>     /* PATCH */
> 
>     static tqelem * tqe_find(const tqueue *const tq,
>             const timestampt *const key, const pq_match mt);
>     tpp = tqe_find(tq, &TS_ENDT, TV_LT);
>     if (NULL != tpp) {
>         if (TV_CMP_LT(tp->tv, tpp->tv)) {
>             if (d_diff_timestamp(&tpp->tv, &tp->tv) < 1) {
>                 tp->tv = tpp->tv;
>                 timestamp_incr(&tp->tv);
>                 uwarn("Insert time adjusted because clock went backwards");
>             }
>             else {
>                 /* Ignore very large reverse time changes */
>                 /*  uwarn("Large backwards time change ignored"); */
>             }
>         }
>     }
>     /* END PATCH */
> 
>     tp->offset = offset;            /* tp is tqelem* to insert */
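
The adjustment the patch makes can be shown in isolation.  The following
is a standalone sketch, not code from pq.c: clamp_insertion_time,
tv_less, and tv_diff_seconds are hypothetical names, plain struct timeval
stands in for the LDM timestamp type, and the caller is assumed to
already know the newest insertion time in the queue (in the patch that
value comes from tqe_find).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: true if a is strictly earlier than b. */
static int tv_less(const struct timeval *a, const struct timeval *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_usec < b->tv_usec;
}

/* Hypothetical helper: (later - earlier) in seconds. */
static double tv_diff_seconds(const struct timeval *later,
                              const struct timeval *earlier)
{
    return (double)(later->tv_sec - earlier->tv_sec)
         + (double)(later->tv_usec - earlier->tv_usec) / 1e6;
}

/*
 * If `now` is earlier than `newest` by less than one second, move `now`
 * to one microsecond after `newest`.  Larger backward jumps are left
 * alone, as in the patch.  Returns 1 if `now` was adjusted, else 0.
 */
static int clamp_insertion_time(struct timeval *now,
                                const struct timeval *newest)
{
    assert(now != NULL && newest != NULL);

    if (!tv_less(now, newest))
        return 0;                       /* time did not go backwards */
    if (tv_diff_seconds(newest, now) >= 1.0)
        return 0;                       /* large backward step: ignored */

    *now = *newest;
    if (++now->tv_usec >= 1000000) {    /* carry microseconds into seconds */
        now->tv_usec = 0;
        now->tv_sec++;
    }
    return 1;
}
```

Because this version takes the newest time as an argument, a caller that
remembered the last insertion time would not need to search the queue on
every insert.  Whether that is safe with several processes inserting into
the same queue is not something this sketch addresses.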

I'm impressed.  I hope you didn't hurt your mind reading the "pq" code.  :-)

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WCE-307093
Department: Support LDM
Priority: Normal
Status: Open