Abstract: Efficient machine learning (ML) systems translate data into value for decision making. Recent breakthroughs in large language models (GPT-4, Llama 2, ChatGPT) and remarkable outcomes of reinforcement learning (RL) in real-world settings (AlphaGo, AlphaFold, RLHF) have shown that scalable model training on large GPU/TPU clusters is critical to obtaining state-of-the-art performance. This talk aims to answer the question “how can we co-design multiple layers of the software/system stack to improve the scalability and performance of ML systems?”. Specifically, it addresses the challenges of building (1) flexible distributed RL systems that accelerate and parallelize the RL training loop and (2) state management libraries that transparently change the GPU device allocation and multi-dimensional parallelism (i.e., data/model/pipeline parallelism) without affecting the training result. Finally, the talk will discuss the open challenges in designing and building large-scale RLHF systems.
Speaker: Bo Zhao is a tenure-track Assistant Professor in the Department of Computer Science at Aalto University. His research focuses on efficient data-intensive systems at the intersection of scalable machine learning systems and distributed data management systems, as well as compilation-based optimization techniques. Bo’s long-term goal is to explore and understand the fundamental connections between data management and modern machine learning systems (e.g., ML training on quantum computers) to make decision-making more transparent, robust, and efficient. Besides publishing at database and systems conferences (e.g., SIGMOD and USENIX ATC), Bo’s research output has been applied in real-world settings (e.g., smart grid management and public transportation) and integrated into industry software (e.g., the Amazon Redshift cloud data warehouse and the MindSpore ML framework).
Affiliation: Aalto University
Place: Aalto CS building, room T4 (in person) & Zoom