Full metadata record

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | 강수용 | - |
| dc.date.accessioned | 2018-02-22T09:32:25Z | - |
| dc.date.available | 2018-02-22T09:32:25Z | - |
| dc.date.issued | 2011-10 | - |
| dc.identifier.citation | IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 22, 10 | en_US |
| dc.identifier.issn | 1045-9219 | - |
| dc.identifier.uri | http://ieeexplore.ieee.org/abstract/document/5710900/ | - |
| dc.identifier.uri | http://hdl.handle.net/20.500.11754/40098 | - |
| dc.description.abstract | This article presents Athanasia, a user-transparent and fault-tolerant system, for parallel applications running on large-scale cluster systems. Cluster systems have been regarded as a de facto standard to achieve multitera-flop computing power. These cluster systems, as we know, have an inherent failure factor that can cause computation failure. The reliability issue in parallel computing systems, therefore, has been studied for a relatively long time in the literature, and we have seen many theoretical promises arise from the extensive research. However, despite the rigorous studies, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. Athanasia is a user-transparent checkpointing system for a fault-tolerant Message Passing Interface (MPI) implementation that is primarily based on the sync-and-stop protocol. Athanasia supports three critical functionalities that are necessary for fault tolerance: a light-weight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. The main features of Athanasia are that it does not require any modifications to the application code and that it preserves many of the high performance characteristics of high-speed networks. Experimental results show that Athanasia can be a good candidate for practically deployable fault-tolerant systems in very-large and high-performance clusters and that its protocol can be applied to a variety of parallel communication libraries easily. | en_US |
| dc.description.sponsorship | The preliminary version of this article was presented at ACM/IEEE SC05. This work was supported by the National Research Foundation (NRF) grant funded by the Korean government (MEST) (No. 2010-0016788). | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1314 USA | en_US |
| dc.subject | User transparency | en_US |
| dc.subject | fault tolerance | en_US |
| dc.subject | message passing interface | en_US |
| dc.subject | parallel systems | en_US |
| dc.subject | Myrinet | en_US |
| dc.subject | InfiniBand | en_US |
| dc.subject | ch_p4 | en_US |
| dc.title | Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications | en_US |
| dc.type | Article | en_US |
| dc.relation.no | 10 | - |
| dc.relation.volume | 22 | - |
| dc.identifier.doi | 10.1109/TPDS.2011.63 | - |
| dc.relation.page | 1653-1668 | - |
| dc.relation.journal | IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | - |
| dc.contributor.googleauthor | Jung, Hyungsoo | - |
| dc.contributor.googleauthor | Han, Hyuck | - |
| dc.contributor.googleauthor | Yeom, Heon Y. | - |
| dc.contributor.googleauthor | Kang, Sooyong | - |
| dc.relation.code | 2011203885 | - |
| dc.sector.campus | S | - |
| dc.sector.daehak | COLLEGE OF ENGINEERING[S] | - |
| dc.sector.department | DEPARTMENT OF COMPUTER SCIENCE | - |
| dc.identifier.pid | sykang | - |
Appears in Collections:
COLLEGE OF ENGINEERING[S](공과대학) > COMPUTER SCIENCE AND ENGINEERING(컴퓨터공학부) > Articles
Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.