We isolated the problem to a version of the NetCDF library that has fsync() enabled by default. This caused long delays in processing data. On our testing tier, we've compiled against NetCDF with fsync() calls disabled. That change reduced the I/O operations per second by 90%. The test tier looks a LOT healthier right now.
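For anyone curious why fsync() matters so much, here is a rough, generic sketch in Python. It does not use NetCDF at all, and it assumes a sync after every single write, which may not match what the library actually does. The function names (write_records, compare) and the record sizes are made up for illustration. It writes the same data twice, once forcing each record to disk and once letting the OS buffer it, and reports how many fsync() calls were issued and how long each run took.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each_write):
    """Write records to path; optionally force each one to disk with fsync()."""
    fsync_calls = 0
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if sync_each_write:
                f.flush()             # push Python's buffer down to the OS
                os.fsync(f.fileno())  # block until the OS reports the data is on storage
                fsync_calls += 1
    return fsync_calls


def compare(n_records=200, record_size=4096):
    """Hypothetical comparison: same data, with and without per-write fsync()."""
    records = [b"x" * record_size for _ in range(n_records)]
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for label, sync in (("fsync_each_write", True), ("no_fsync", False)):
            path = os.path.join(d, label + ".bin")
            start = time.perf_counter()
            calls = write_records(path, records, sync)
            elapsed = time.perf_counter() - start
            results[label] = (calls, elapsed, os.path.getsize(path))
    return results


if __name__ == "__main__":
    for label, (calls, elapsed, size) in compare().items():
        print(f"{label}: {calls} fsync calls, {size} bytes, {elapsed:.4f} s")
```

Both runs produce identical files; the only difference is how often the process stops and waits for storage. The timings will vary a lot with the disk and hypervisor, so treat any numbers from this as illustrative only, not as a reproduction of what we measured.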
Meanwhile, I've heard reports that the production side has been a little more stable. These are all virtual machines; by fixing the test tier, we've effectively lopped off a third of the MADIS-related IOPS from the bare metal. It is possible that the production tier has just enough breathing room on bare metal now to be stable. But I wouldn't swear to it.
Barring any other show-stoppers that crop up (and we've had a few), the fix will make it to the production tier in no more than one week.