Implementing Thread-safe Access to the netCDF-C Library

Thread-Safe Access to the netcdf-c API

Initial Draft: 2017-2-21
Last Revised: 2017-5-30
Author: Dennis Heimbigner, Unidata

Table of Contents

Introduction

This document proposes an architecture for implementing thread-safe access to the netcdf-c library. Here, the term "thread-safe" means that multiple threads can access the netcdf-c library safely (i.e. without interference or deadlock or race conditions). This does not mean that the library is itself multi-threaded. Rather, access to the library is serialized so that only one thread at a time is executing the library code.

It is proposed that thread-safe operation is to be implemented such that all calls to the netcdf-c API are protected by a binary semaphore using a lock-unlock protocol. This means that all calls to the API are "serialized" in the sense that each API call is completed before any other call to the API can be executed. This means that in a multi-threaded environment, it is possible for all threads to safely access the netcdf-c library.

This approach comes with some caveats.

  1. If two different threads attempt to access the same file, then interference is still possible.
  2. Using thread-safe access simultaneously with MPI parallelism may not be safe. This is still unresolved

Architectural Considerations

At the moment, the implementation of the netcdf-c API resides in files in the libdispatch directory. Basically, all the code in libdispatch falls into the following categories.

  1. Dispatch functions -- These functions directly invoke methods in the dispatch table and typically have this form.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_check_id(ncid,&ncp);
        if(stat != NC_NOERR) return stat;
        return ncp->dispatch->XXX(...);
    }
    
  2. Auxiliary functions -- These functions just invoke some other function in the API, but possibly with some special values for the arguments of the called function. Here is an example.

    int nc_inq_varname(int ncid, int varid, char *name)
    {
           return nc_inq_var(ncid, varid, name, NULL, NULL, NULL, NULL);
    }
    
  3. Complex functions -- These functions do complex computation including calling a variety of internal functions.

  4. Internal functions -- All other code in libdispatch is considered internal.

Functions in classes 1 and 3 are considered to be part of the API core. The followig Figure shows the notional relationship between the function classes.


Locking Regime

The simplest approach to thread-safety is to surround all calls to API functions with a LOCK/UNLOCK protocol. This is how the HDF5 library operates, for example.

Our proposal is to implement locking using a single, global binary semaphore. This is extremely simple and is well-supported under all versions of ~nix~ (using libpthreads) as well as Windows (built-in).

One consequence of this decision is that there must be no recursive calls to locked functions. If it happens, it will cause a deadlock. This means specifically that core functions and internal functions cannot invoke core functions (directly or transitively).

An example of adding locking to a core function is shown in this example.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_NOERR;
        LOCK();
    if((stat=NC_check_id(ncid,&ncp)) != NC_NOERR) goto done;
        stat = ncp->dispatch->XXX(...);
    done:
        UNLOCK();
        return stat;
    }

The done label is used to provide a single exit to ensure that UNLOCK is always invoked before exiting the function.

Note that we do not need to add locking to our class 1 (Auxiliary) functions since they just invoke a core function (class 2 or 3) that does the actual locking. Because of this, it will pay to try to convert as many API calls as possible to be auxiliary functions. Currently, there are a number of class 2/3 functions that could be converted with small effort by revising the set of core functions.

Note also that we assume that all internal functions will be invoked either by other internal functions or by core API functions that use a locking protocol. Hence these internal functions do not need to use a locking protocol. In fact, if they did, it could cause a deadlock.

Problem 1: Mostly Auxiliary Functions

It turns out that there are a few functions that are mostly auxiliary functions except that they invoke some internal functions to get information not available through the standard netcdfd-c API. One example is the NCDEFAULTgetvars function. It invokes two internal functions: * NCisrecvar * NC_getshape

The solution is to "expose" these internal functions in the core API by providing wrappers for them that use the locking regime. Using this approach, it should be possible to increase the number of auxiliary functions that do not need to directly use locking.

Note, that exposing these functions does not mean that they are part of the public netcdf-c library API; only that they are accessible to our external functions.

Problem 2: Internal Functions calling Core Functions

This is the big problem is implementing thread-safety. It turns out that some internal code invokes core API functions. This mostly occurs inside the libdap2 and libdap4 code. This is a problem because it violates the no recursive call rule and will lead to deadlock.

The simplest solution to this problem is to change all recursive calls from the internal code to the core API code to no longer call the core API. Instead, the direct calls can, in most cases, be changed to call directly into the dispatch layer. The cost is increased complexity in the internal code. To some degree, this complexity can be mitigated by using macros to hide the complexity. In a few cases, some extra internal functions may have to be introduced into the libdispatch code to make this change possible or to simplify the required changes.

Steps to Implementing Proposed Architecture

The key to implementing the proposed architecture is to slowly refactor the code in libdispatch to properly segregate the auxiliary functions from the core API from the internal code.

The following sequence of actions is proposed.

  1. Create two new files: libdispatch/daux.c and libdispatch/dapi.c.
  2. Move auxiliary functions into daux.c and the core api functions into dapi.c.
  3. Add extra functions in dapi.c to expose functions like NCgetshape_ (see above).
  4. Move, where possible, code from dapi.c to daux.c using the exposed functions in #3.
  5. Identify the recursive calls in internal code. This can be accomplished by temporarily renaming the functions in dapi.c and dextend.c and then recompiling. That should flush out all such recursive calls.
  6. Convert the calls identified in #5 to call through the dispatcher instead.
  7. Add locking to dapi.c.
  8. Test and fix the resulting code to look for missed recursive calls.

Conclusion

Assuming the above approach is correct, then we should be able to make the netcdf-c library thread-safe with a straightforward, if tedious, sequences of changes.

Comments:

This might be out of date now, but Microsoft, in their Microsoft.Research.Science.Data.NetCDF4 release, used mutexes in the C# DLL wrapper to make the library thread safe for C#. e.g.

static System.Threading.Mutex m = new System.Threading.Mutex();

public static int nc_close(int ncidp) { m.WaitOne(); var r = NetCDFInterop.nc_close(ncidp); m.ReleaseMutex(); return r; }

Given your comments about the nature of the NetCDF internals, would this have worked?

Cheers,

Nick

Posted by Nick Humphries on May 10, 2018 at 06:41 AM MDT #

Good question. I suspect not because there is other mutable global state that is manipulated by other api calls. The interesting question is: what is the minimum number of calls that need to be protected?

Posted by Dennis Heimbigner on May 10, 2018 at 10:55 AM MDT #

Further to the C# multi-threading discussion, if each thread was to open the NetCDF file individually and obtain it's own ncid, would subsequent calls be treated like calls from separate programs and work correctly?

Posted by Nick Humphries on October 23, 2018 at 01:48 AM MDT #

I'm not sure about the NetCDF C# API, as that is not a thing we maintain here. The core C library is not thread safe; there is one global NCID variable, and it would contain the NCID of the most recent executing thread.

I don't know if the C# team has implemented some sort of thread safety on top of this, so you might contact their support or consult the documentation they provide.

Sorry I can't provide a more concrete answer!

-Ward

Posted by Ward Fisher on October 24, 2018 at 02:22 PM MDT #

Thanks Ward, I guess the best way to find out is to do some testing! When I get around to it I'll let you know.

Nick.

Posted by Nick Humphries on October 25, 2018 at 12:13 PM MDT #

Post a Comment:
Comments are closed for this entry.
Unidata Developer's Blog
A weblog about software development by Unidata developers*
Unidata Developer's Blog
A weblog about software development by Unidata developers*

Welcome

FAQs

News@Unidata blog

Take a poll!

What if we had an ongoing user poll in here?

Browse By Topic
Browse by Topic
« March 2024
SunMonTueWedThuFriSat
     
1
2
3
5
6
7
8
9
10
11
12
13
14
15
16
17
19
20
21
22
23
24
25
26
27
28
29
30
31
      
Today