Publication Date



Technical Report: UTEP-CS-08-24


The massive scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges for fault tolerance. For applications that checkpoint periodically, the choice of checkpoint interval (the period between checkpoints) can have a significant impact on application execution time. Finding the optimal checkpoint interval, the one that minimizes wall-clock execution time, has been a subject of research over the last decade. In an environment where concurrent applications compete for network and storage resources, contention at these shared resources needs to be factored into the choice of checkpoint intervals, in addition to application execution times. In this paper, we analytically model a complementary performance metric: the aggregate number of checkpoint I/O operations. We then show the existence of, and characterize, a range of checkpoint intervals that have the potential to improve application and system performance.
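
The trade-off between checkpoint interval and checkpoint I/O volume can be illustrated with a minimal sketch. It does not reproduce the model developed in this report. It assumes Young's well-known first-order approximation for the execution-time-optimal interval, sqrt(2 x checkpoint cost x MTBF), and a simplified failure-free count of checkpoint writes. The function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the interval that minimizes
    expected execution time: sqrt(2 * checkpoint_cost * mtbf).
    Both arguments must use the same time unit."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_checkpoints(solve_time, interval):
    """Simplified, failure-free count of checkpoint writes issued while
    performing solve_time units of useful work (an assumption; restarts
    and rework after failures are ignored)."""
    return math.floor(solve_time / interval)


# Hypothetical values, all in minutes.
checkpoint_cost = 5.0      # time to write one checkpoint
mtbf = 1440.0              # mean time between failures (24 hours)
solve_time = 30000.0       # useful work to complete (500 hours)

tau = young_interval(checkpoint_cost, mtbf)
print(tau)                                          # 120.0
print(expected_checkpoints(solve_time, tau))        # 250
print(expected_checkpoints(solve_time, 2 * tau))    # 125
```

Under these assumptions, doubling the interval halves the number of checkpoint writes, which is why an interval chosen with shared network and storage resources in mind may differ from the one that minimizes execution time alone.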