Volume 18, No. 12
Magnus: A Holistic Approach to Data Management for Large-Scale Machine Learning Workloads
Abstract
Machine learning (ML) has become a cornerstone of key applications at ByteDance. As model complexity and data volumes surge, data management for large-scale ML workloads faces substantial challenges, particularly with recent advances in large recommendation models (LRMs) and large multimodal models (LMMs). Traditional approaches exhibit limitations in storage efficiency, metadata scalability, update mechanisms, and integration with ML frameworks. To address these challenges, we propose Magnus, a holistic data management system built upon Apache Iceberg. Magnus integrates optimizations across four areas: resource-efficient storage formats tailored to large wide tables and multimodal data; built-in support for vector and inverted indexes to accelerate data retrieval; scalable metadata planning with Git-like branching and tagging capabilities; and high-performance update/upsert based on lightweight merge-on-read (MOR) strategies. Additionally, Magnus provides native support and specialized enhancements for LRM and LMM training workloads. Experimental results demonstrate significant performance gains in real-world ML scenarios. Magnus has been deployed at ByteDance for over five years, enabling robust and efficient data infrastructure for large-scale ML workloads.
PVLDB is part of the VLDB Endowment Inc.