A fault-tolerant HPC scheduler extension for large and operational ensemble data assimilation in the Red Sea

by H. Toye, S. Kortas, P. Zhan, I. Hoteit
Year: 2018

Bibliography

A fault-tolerant HPC scheduler extension for large and operational ensemble data assimilation in the Red Sea
H. Toye, S. Kortas, P. Zhan, and I. Hoteit
Journal of Computational Science, 2018

Abstract

​A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members’ collapse. This could happen in those situations when the filter update step imposes large corrections on one, or more, of the forecasted ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the assimilation system performances, but obviously increases the risk of members’ collapse. Hardware failure or slow numerical convergence encountered for some members should also occur more frequently. In this context, the manual steering of the whole process appears as a real challenge and makes the implementation of the ensemble assimilation procedure uneasy and extremely time consuming.
This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. Built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflow of dependent jobs in a fault-tolerant environment, we describe the assimilation system implementation and discuss in detail its coupling strategies. Within Decimate, only a few additional lines of Python is needed to define flexible convergence criteria and to implement any necessary actions to the forecast ensemble members, as for instance (i) restarting faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, (iv) replacing members on the fly to enrich the ensemble with new members, etc. We demonstrate the efficiency of the system with numerical experiments assimilating real satellites sea surface height and temperature observations in the Red Sea.

DOI: 10.1016/j.jocs.2018.04.018

Keywords

High Performance Computing Ensemble Data Assimilation Bayesian Filtering Operational Oceanography Red Sea