
Re: LDM pqinsert/pq_del_oldest() signature not found



> To: address@hidden
> From: "Karen Cooper" <address@hidden>
> Subject: LDM - Linux RedHat 7.3 - pqinsert/pq_del_oldest
> Organization: NOAA/NSSL
> Keywords: 200303311517.h2VFHUEX003415 LDM-5.1.4 pq_del_oldest

Hi Karen,

> The process that is ingesting the data and inserting it into the queue
> routinely fails after a few days. This is only happening on one of my
> many machines.
> 
> The ldmd.log shows:
> 
> Mar 29 11:04:01 twxldm pqing_bdds[3860]: pq_del_oldest: signature 
> 0b2ff0a1543327aa9505e8a74cf251: Not Found
> Mar 29 11:04:01 twxldm pqing_bdds[3860]: pq_insert: Invalid argument
 ...
> I was hoping you might be able to give me some insight into the
> problem.

Whenever a product is inserted in the queue, its MD5 signature is
inserted into a hash table for quickly checking on duplicate products.
Later when it's time to delete the product to make room for a new
product, the signature must also be deleted from the hash table.  In
this case, the signature that was added to the hash table earlier is
not found, so it can't be deleted.  This should never happen, so it
indicates either a bug, a corrupted queue, or a disk or memory error.
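
As a rough illustration of the mechanism described above, here is a toy
sketch in Python.  It is not LDM code: the class ToyQueue, its methods,
and the demo_corruption() helper are all hypothetical names invented for
this example, and the real queue is a memory-mapped C data structure.
The sketch only shows why a signature that changes after insertion can
no longer be found at deletion time.

```python
import hashlib
from collections import OrderedDict


class ToyQueue:
    """Toy stand-in for a product queue with a signature index (not LDM code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.products = OrderedDict()  # slot id -> (signature, data), oldest first
        self.signatures = set()        # index used for duplicate checks
        self.next_slot = 0

    def insert(self, data):
        sig = hashlib.md5(data).digest()
        if sig in self.signatures:
            return False               # duplicate product, rejected
        while len(self.products) >= self.capacity:
            self.del_oldest()          # make room for the new product
        self.products[self.next_slot] = (sig, data)
        self.signatures.add(sig)
        self.next_slot += 1
        return True

    def del_oldest(self):
        slot, (sig, _data) = next(iter(self.products.items()))
        if sig not in self.signatures:
            # The situation reported in the log: the signature recorded
            # with the product no longer matches anything in the index.
            raise KeyError("signature %s: Not Found" % sig.hex())
        self.signatures.remove(sig)
        del self.products[slot]


def demo_corruption():
    """Flip one bit of a stored signature, then force a deletion."""
    q = ToyQueue(capacity=2)
    q.insert(b"product A")
    q.insert(b"product B")
    # Simulate a bad disk or memory read of the oldest product's signature.
    sig, data = q.products[0]
    q.products[0] = (bytes([sig[0] ^ 0x01]) + sig[1:], data)
    try:
        q.insert(b"product C")         # needs room, so the oldest is deleted
    except KeyError as err:
        return err.args[0]
    return None


if __name__ == "__main__":
    print(demo_corruption())
```

In this sketch a single flipped bit is enough to make the deletion fail,
and the insert that triggered the deletion fails along with it, which is
the same pairing of messages as in your log.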

If you are inserting products on all of your machines with the same
software but are seeing this problem on only one of them, that
suggests a disk or memory error on that machine.  The signature hash
table is part of the product queue, so if an MD5 signature in it is
stored or retrieved incorrectly when paged in from disk, "signature
...: Not Found" is exactly the error that will result.  Another
indication of a hardware problem on that machine is that we haven't
seen this error on our ingest machines, which insert millions of
products per week into our product queues.

Would it be possible to run extensive memory or disk diagnostics on
that machine to see if any problem can be found?  Alternatively, could
you just swap it for a different machine and see if the problem goes
away?

Otherwise, we would need to be able to reproduce the problem here.
The only way I can think of to do that would be for you to start with
a new queue, record every product that gets inserted into it, and make
that stream of products available to us after the failure occurs, to
see if we can reproduce the problem.
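
One possible way to record the stream, sketched in Python under the
assumption that your ingest process can hand each product's bytes to a
small logging routine before inserting it.  The functions
record_product() and replay_products() and the length-prefixed file
layout are hypothetical, made up for this example; they are not part of
the LDM distribution.

```python
import io
import struct


def record_product(log, data):
    """Append one product to the log as a 4-byte big-endian length plus the bytes."""
    log.write(struct.pack(">I", len(data)))
    log.write(data)
    log.flush()  # keep the log current in case the process dies


def replay_products(log):
    """Yield the recorded products in their original order."""
    while True:
        header = log.read(4)
        if len(header) < 4:
            return  # clean end of log, or a truncated final record
        (size,) = struct.unpack(">I", header)
        data = log.read(size)
        if len(data) < size:
            return  # last record was cut short; stop here
        yield data


if __name__ == "__main__":
    # In-memory round trip; a real log would be a file opened in binary mode.
    buf = io.BytesIO()
    for product in (b"first product", b"second product"):
        record_product(buf, product)
    buf.seek(0)
    for product in replay_products(buf):
        print(len(product), product)
```

Replaying such a log into a fresh queue of the same size should
exercise the same sequence of insertions and deletions that preceded
the failure.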

--Russ