Alibaba Cloud and Peking University Develop GPU-Efficiency System That Cuts Hardware Use by 82%
Published: 10.24.2025
Alibaba Cloud, in collaboration with Peking University, has unveiled a new GPU pooling and scheduling system called Aegaeon, which reportedly reduces the number of high-end GPUs required for LLM inference operations by up to 82%.
The system was detailed in a paper presented at the ACM Symposium on Operating Systems Principles (SOSP) 2025, held this week in Seoul.

According to the research paper, Aegaeon enables multiple AI models to share GPU resources more efficiently by scheduling workloads at the “token level”, which is a more granular approach than traditional request-based scheduling.
In a live beta environment within Alibaba’s Model Studio marketplace, the researchers reported that the system reduced GPU usage from 1,192 Nvidia H20 GPUs to just 213, while maintaining throughput across dozens of models of up to 72 billion parameters.
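The headline 82% figure can be checked against the reported GPU counts. The short calculation below uses only the two numbers quoted above; the variable names are illustrative.

```python
# GPU counts reported for the Model Studio beta deployment.
gpus_before = 1192
gpus_after = 213

# Fraction of GPUs no longer needed after pooling.
reduction = (gpus_before - gpus_after) / gpus_before
print(f"GPU reduction: {reduction:.1%}")  # prints "GPU reduction: 82.1%"
```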
Aegaeon's approach allows multiple models to operate simultaneously on the same GPU, while inactive memory portions are dynamically offloaded to host memory or secondary storage, resulting in higher hardware utilization and lower energy and cooling demands for data centers.
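To illustrate the general difference between request-level and token-level scheduling, here is a toy sketch. It is not Aegaeon's algorithm: the function names, the model names, and the one-time-step-per-token cost model are all assumptions made for illustration, and the sketch ignores the cost of switching between models, which a real system has to manage.

```python
from collections import deque

def request_level(requests):
    """Run each request to completion before starting the next."""
    clock, finish = 0, {}
    for name, tokens in requests:
        clock += tokens          # assumed cost: one time step per token
        finish[name] = clock
    return finish

def token_level(requests):
    """Generate one token per turn, then yield the GPU to the next request."""
    queue = deque(requests)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        clock += 1               # generate a single token
        if remaining > 1:
            queue.append((name, remaining - 1))  # go to the back of the line
        else:
            finish[name] = clock                 # last token produced
    return finish

# Hypothetical workload: (model name, number of tokens to generate).
requests = [("model_a", 6), ("model_b", 2), ("model_c", 1)]
print(request_level(requests))  # {'model_a': 6, 'model_b': 8, 'model_c': 9}
print(token_level(requests))    # {'model_c': 3, 'model_b': 5, 'model_a': 9}
```

In this toy workload, the short requests finish at steps 3 and 5 instead of 8 and 9 when the GPU is shared token by token, while the long request still finishes at step 9. Total work is unchanged; only the order in which it is served differs.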
Alibaba Cloud said the approach is part of its broader strategy to mitigate the impact of global GPU shortages, particularly amid ongoing export restrictions affecting China’s access to advanced AI chips, such as Nvidia’s H20.
While the findings have been verified through peer review, analysts caution that the results may depend heavily on Alibaba’s integrated software and hardware ecosystem, including its elastic RDMA networking infrastructure.
Aegaeon’s introduction could influence future procurement patterns across the data center supply chain. If widely adopted, the technology could slow immediate demand growth for inference GPUs, while increasing demand for high-performance memory, interconnects, and power management systems required to support larger multi-model workloads on fewer accelerators.
This may also redirect investments toward system-level testing, inspection, and automation rather than component-level validation, as data centers prioritize optimization of full-server integration and energy efficiency over raw GPU expansion.