Volume 18, No. 12
DECK: Experiences on Delta Checkpointing for Industrial Recommendation Systems
Abstract
In large-scale industrial recommendation systems, model checkpoints are instrumental in maintaining training goodput and numerical correctness during system failures and job preemptions. The increasing prevalence of multi-terabyte models has rendered frequent regular model checkpoints impractical, resulting in substantial lost progress when recovering from failures. As model sizes continue to grow, researchers and practitioners are compelled to investigate more efficient and scalable solutions. This paper presents DECK, a novel approach to delta model checkpointing designed for real-world industrial systems. Specifically, DECK focuses on extracting delta states with near-zero overhead, staging and streaming delta checkpoints without interrupting the training process, and merging delta checkpoints in an optimal and decoupled manner. Experimental results demonstrate that DECK achieves a 12-fold increase in checkpoint frequency while maintaining negligible impact on training throughput, thereby attaining state-of-the-art (SOTA) production performance.
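To make the general idea of delta checkpointing concrete, the following is a minimal sketch and not DECK's implementation, since the abstract does not describe its mechanisms. It assumes a toy embedding table in which only the rows touched since the last checkpoint are saved, and a full state is rebuilt by replaying deltas onto a base checkpoint. The names `EmbeddingTable`, `extract_delta`, and `merge` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

Row = List[float]


class EmbeddingTable:
    """Toy embedding table (hypothetical) that tracks rows changed since the last checkpoint."""

    def __init__(self, num_rows: int, dim: int) -> None:
        self.rows: List[Row] = [[0.0] * dim for _ in range(num_rows)]
        self.dirty: Set[int] = set()  # row ids modified since the last extraction

    def update(self, row_id: int, grad: Row, lr: float = 0.1) -> None:
        # Plain SGD step on a single row; record the row as modified.
        self.rows[row_id] = [w - lr * g for w, g in zip(self.rows[row_id], grad)]
        self.dirty.add(row_id)

    def full_checkpoint(self) -> Dict[int, Row]:
        # Copy every row and reset change tracking.
        self.dirty.clear()
        return {i: list(r) for i, r in enumerate(self.rows)}

    def extract_delta(self) -> Dict[int, Row]:
        # Copy only the modified rows, then reset change tracking.
        delta = {i: list(self.rows[i]) for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def merge(base: Dict[int, Row], deltas: List[Dict[int, Row]]) -> Dict[int, Row]:
    # Apply deltas oldest first; a later write to the same row overrides earlier ones.
    merged = {i: list(r) for i, r in base.items()}
    for delta in deltas:
        for i, r in delta.items():
            merged[i] = list(r)
    return merged
```

In this sketch, each delta is proportional in size to the number of rows updated rather than to the full table, which is the property that makes more frequent checkpoints affordable when updates are sparse.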
PVLDB is part of the VLDB Endowment Inc.