Implementing Thread-safe Access to the netCDF-C Library

Thread-Safe Access to the netcdf-c API

Initial Draft: 2017-2-21
Last Revised: 2017-5-30
Author: Dennis Heimbigner, Unidata

Introduction

This document proposes an architecture for implementing thread-safe access to the netcdf-c library. Here, the term "thread-safe" means that multiple threads can access the netcdf-c library safely (i.e. without interference or deadlock or race conditions). This does not mean that the library is itself multi-threaded. Rather, access to the library is serialized so that only one thread at a time is executing the library code.

The proposal is to protect every call to the netcdf-c API with a binary semaphore using a lock-unlock protocol. All calls to the API are thereby "serialized": each API call completes before any other call to the API can execute. As a result, all threads in a multi-threaded environment can safely access the netcdf-c library.

This approach comes with some caveats.

  1. If two different threads attempt to access the same file, then interference is still possible.
  2. Using thread-safe access simultaneously with MPI parallelism may not be safe. This issue is still unresolved.

Architectural Considerations

At the moment, the implementation of the netcdf-c API resides in files in the libdispatch directory. Basically, all the code in libdispatch falls into the following categories.

  1. Dispatch functions -- These functions directly invoke methods in the dispatch table and typically have this form.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_check_id(ncid,&ncp);
        if(stat != NC_NOERR) return stat;
        return ncp->dispatch->XXX(...);
    }
    
  2. Auxiliary functions -- These functions just invoke some other function in the API, but possibly with some special values for the arguments of the called function. Here is an example.

    int nc_inq_varname(int ncid, int varid, char *name)
    {
        return nc_inq_var(ncid, varid, name, NULL, NULL, NULL, NULL);
    }
    
  3. Complex functions -- These functions do complex computation including calling a variety of internal functions.

  4. Internal functions -- All other code in libdispatch is considered internal.

Functions in classes 1 and 3 are considered to be part of the API core. The following figure shows the notional relationship between the function classes.


Locking Regime

The simplest approach to thread-safety is to surround all calls to API functions with a LOCK/UNLOCK protocol. This is how the HDF5 library operates, for example.

Our proposal is to implement locking using a single, global binary semaphore. This is extremely simple and is well-supported under all versions of *nix (using libpthread) as well as Windows (built-in).
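As a rough illustration of the single global lock, here is a minimal sketch for the pthreads case only. The names nc_global_lock, demo_api_call, and run_demo are hypothetical, and the LOCK/UNLOCK definitions are an assumption about how such macros might be written, not the library's actual code. A pthread mutex stands in for the binary semaphore.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical global lock; a pthread mutex plays the binary semaphore. */
static pthread_mutex_t nc_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed definitions of the LOCK/UNLOCK protocol used in the text. */
#define LOCK() \
    do { int rc_ = pthread_mutex_lock(&nc_global_lock); assert(rc_ == 0); (void)rc_; } while (0)
#define UNLOCK() \
    do { int rc_ = pthread_mutex_unlock(&nc_global_lock); assert(rc_ == 0); (void)rc_; } while (0)

/* Stand-in for shared library state. */
static long shared_counter = 0;

/* Stand-in for a locked API call: only one thread at a time runs the body. */
int demo_api_call(void)
{
    LOCK();
    shared_counter++;
    UNLOCK();
    return 0;
}

/* Each worker thread makes *arg calls into the "API". */
static void *worker(void *arg)
{
    int ncalls = *(int *)arg;
    for (int i = 0; i < ncalls; i++)
        demo_api_call();
    return NULL;
}

/* Start up to 16 threads, wait for them, and return the final counter,
   or -1 if a thread could not be created. */
long run_demo(int nthreads, int ncalls)
{
    pthread_t tids[16];
    int started = 0;
    long result;

    if (nthreads > 16) nthreads = 16;
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &ncalls) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    result = (started == nthreads) ? shared_counter : -1;
    return result;
}
```

Because every increment happens while the lock is held, no updates are lost regardless of how the threads are scheduled.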

One consequence of this decision is that there must be no recursive calls to locked functions. A binary semaphore cannot be re-acquired by the thread that already holds it, so any such call will cause a deadlock. This means specifically that core functions and internal functions cannot invoke core functions (directly or transitively).
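The recursion hazard can be made visible without actually hanging a program. The sketch below is an illustration only, with hypothetical function names. It uses a POSIX error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), which reports EDEADLK when the owning thread tries to lock it again. A plain mutex or binary semaphore would simply block forever at the same point.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t demo_lock;

/* Create the lock as an error-checking mutex so that a relock by the
   owning thread returns EDEADLK instead of hanging. */
int demo_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(&demo_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical core function: lock, do some work, unlock. */
int demo_core_inner(void)
{
    int rc = pthread_mutex_lock(&demo_lock);
    if (rc != 0) return rc; /* EDEADLK if this thread already holds the lock */
    /* ... work on library state would go here ... */
    return pthread_mutex_unlock(&demo_lock);
}

/* Hypothetical core function that violates the rule by calling another
   core function while it still holds the lock. */
int demo_core_outer(void)
{
    int rc = pthread_mutex_lock(&demo_lock);
    if (rc != 0) return rc;
    int inner = demo_core_inner(); /* recursive locking attempt */
    pthread_mutex_unlock(&demo_lock);
    return inner;
}
```

Calling demo_core_inner on its own succeeds, while demo_core_outer surfaces the error from the nested locking attempt.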

The following example shows locking added to a core function.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_NOERR;
        LOCK();
        if((stat=NC_check_id(ncid,&ncp)) != NC_NOERR) goto done;
        stat = ncp->dispatch->XXX(...);
    done:
        UNLOCK();
        return stat;
    }

The done label is used to provide a single exit to ensure that UNLOCK is always invoked before exiting the function.

Note that we do not need to add locking to our class 2 (Auxiliary) functions since they just invoke a core function (class 1 or 3) that does the actual locking. Because of this, it will pay to try to convert as many API calls as possible to be auxiliary functions. Currently, there are a number of class 1/3 functions that could be converted with small effort by revising the set of core functions.

Note also that we assume that all internal functions will be invoked either by other internal functions or by core API functions that use a locking protocol. Hence these internal functions do not need to use a locking protocol. In fact, if they did, it could cause a deadlock.

Problem 1: Mostly Auxiliary Functions

It turns out that a few functions are mostly auxiliary functions, except that they invoke some internal functions to get information not available through the standard netcdf-c API. One example is the NCDEFAULT_get_vars function. It invokes two internal functions:

  * NC_is_recvar
  * NC_getshape

The solution is to "expose" these internal functions in the core API by providing wrappers for them that use the locking regime. Using this approach, it should be possible to increase the number of auxiliary functions that do not need to directly use locking.

Note that exposing these functions does not mean that they are part of the public netcdf-c library API; only that they are accessible to our external functions.
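The "expose through a locked wrapper" idea might look roughly like the following sketch. Every name here (demo_internal_getshape, demo_exposed_getshape, demo_aux_varsize, the DEMO_ constants) is hypothetical, and the fixed three-dimensional shape is a stand-in for real library state. The point is only the division of labor: the internal function takes no lock, the wrapper takes it, and the auxiliary function relies on the wrapper.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
#define LOCK()   pthread_mutex_lock(&demo_lock)
#define UNLOCK() pthread_mutex_unlock(&demo_lock)

#define DEMO_NOERR   0
#define DEMO_EBADID  (-1)
#define DEMO_NDIMS   3

/* Stand-in for library state: the shape of the only variable (varid 0). */
static const size_t demo_shape[DEMO_NDIMS] = {2, 3, 4};

/* Internal function: no locking; the caller must already hold the lock. */
static int demo_internal_getshape(int varid, size_t *shape)
{
    if (varid != 0) return DEMO_EBADID;
    for (int i = 0; i < DEMO_NDIMS; i++)
        shape[i] = demo_shape[i];
    return DEMO_NOERR;
}

/* Exposed wrapper: behaves like a core function and does the locking. */
int demo_exposed_getshape(int varid, size_t *shape)
{
    int stat;
    LOCK();
    stat = demo_internal_getshape(varid, shape);
    UNLOCK();
    return stat;
}

/* Auxiliary function: takes no lock itself; it only calls the wrapper
   and then computes on its own local copy of the data. */
int demo_aux_varsize(int varid, size_t *sizep)
{
    size_t shape[DEMO_NDIMS];
    size_t n = 1;
    int stat = demo_exposed_getshape(varid, shape);
    if (stat != DEMO_NOERR) return stat;
    for (int i = 0; i < DEMO_NDIMS; i++)
        n *= shape[i];
    *sizep = n;
    return DEMO_NOERR;
}
```

Since the auxiliary function works only on data copied out while the lock was held, it never touches shared state outside the locking regime.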

Problem 2: Internal Functions calling Core Functions

This is the big problem in implementing thread-safety. It turns out that some internal code invokes core API functions. This mostly occurs inside the libdap2 and libdap4 code. This is a problem because it violates the no-recursive-call rule and will lead to deadlock.

The simplest solution to this problem is to change all recursive calls from the internal code to the core API code to no longer call the core API. Instead, the direct calls can, in most cases, be changed to call directly into the dispatch layer. The cost is increased complexity in the internal code. To some degree, this complexity can be mitigated by using macros to hide the complexity. In a few cases, some extra internal functions may have to be introduced into the libdispatch code to make this change possible or to simplify the required changes.
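A minimal sketch of this rewrite follows, using a made-up dispatch table rather than the real netcdf-c one. The types demo_file and demo_dispatch, the DEMO_DISPATCH macro, and all function names are assumptions for illustration. The internal helper runs while the lock is held, so it goes through the dispatch table via the macro instead of calling the locked core function.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical dispatch table with a single method. */
struct demo_file;
typedef struct demo_dispatch {
    int (*inq_ndims)(struct demo_file *f, int *ndimsp);
} demo_dispatch;

typedef struct demo_file {
    const demo_dispatch *dispatch;
    int ndims;
} demo_file;

/* Hypothetical macro that hides the dispatch indirection. */
#define DEMO_DISPATCH(f, method, ...) ((f)->dispatch->method((f), __VA_ARGS__))

static int impl_inq_ndims(demo_file *f, int *ndimsp)
{
    *ndimsp = f->ndims;
    return 0;
}

static const demo_dispatch demo_table = { impl_inq_ndims };

/* Core API function: takes the lock, then dispatches. */
int demo_inq_ndims(demo_file *f, int *ndimsp)
{
    int stat;
    pthread_mutex_lock(&demo_lock);
    stat = DEMO_DISPATCH(f, inq_ndims, ndimsp);
    pthread_mutex_unlock(&demo_lock);
    return stat;
}

/* Internal function: runs with the lock already held, so it must not
   call demo_inq_ndims; it calls the dispatch layer directly instead. */
static int demo_internal_rank_plus_one(demo_file *f, int *outp)
{
    int ndims = 0;
    int stat = DEMO_DISPATCH(f, inq_ndims, &ndims);
    if (stat != 0) return stat;
    *outp = ndims + 1;
    return 0;
}

/* Another core function that relies on the internal helper. */
int demo_core_op(demo_file *f, int *outp)
{
    int stat;
    pthread_mutex_lock(&demo_lock);
    stat = demo_internal_rank_plus_one(f, outp);
    pthread_mutex_unlock(&demo_lock);
    return stat;
}

/* Build a file object wired to the demo dispatch table. */
demo_file demo_open(int ndims)
{
    demo_file f = { &demo_table, ndims };
    return f;
}
```

Had demo_internal_rank_plus_one called demo_inq_ndims instead, demo_core_op would try to take the lock twice on the same thread and deadlock.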

Steps to Implementing Proposed Architecture

The key to implementing the proposed architecture is to slowly refactor the code in libdispatch to properly segregate the auxiliary functions from the core API from the internal code.

The following sequence of actions is proposed.

  1. Create two new files: libdispatch/daux.c and libdispatch/dapi.c.
  2. Move auxiliary functions into daux.c and the core API functions into dapi.c.
  3. Add extra functions in dapi.c to expose functions like NC_getshape (see above).
  4. Move, where possible, code from dapi.c to daux.c using the exposed functions in #3.
  5. Identify the recursive calls in internal code. This can be accomplished by temporarily renaming the functions in dapi.c and dextend.c and then recompiling. That should flush out all such recursive calls.
  6. Convert the calls identified in #5 to call through the dispatcher instead.
  7. Add locking to dapi.c.
  8. Test and fix the resulting code to look for missed recursive calls.

Conclusion

Assuming the above approach is correct, we should be able to make the netcdf-c library thread-safe with a straightforward, if tedious, sequence of changes.
