Publications
* indicates co-first author, † indicates corresponding author
2026
- SIGMOD ’26. FaaSBoard: Efficient Graph Processing with a Disaggregated Architecture on Serverless Services. In Proceedings of the 2026 ACM SIGMOD International Conference on Management of Data, 2026
- EuroSys ’26. Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design. In Proceedings of the European Conference on Computer Systems, 2026
- ASPLOS ’26. Towards High-Goodput LLM Serving with Prefill-decode Multiplexing. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2026
- NSDI ’26. Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026
- NSDI ’26. MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026
- HPCA ’26. LEGO: Supporting LLM-enhanced Games with One Gaming GPU. In 2026 IEEE International Symposium on High Performance Computer Architecture, 2026
2025
- arXiv. Efficient Function-as-a-Service for Large Language Models with TIDAL. arXiv preprint arXiv:2503.06421, 2025
- arXiv. Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution. arXiv preprint arXiv:2509.09560, 2025
- arXiv. Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms. arXiv e-prints, 2025
- ASPLOS ’25. Voyager: Input-Adaptive Algebraic Transformations for High-Performance Graph Neural Networks. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2025
- SC ’25. A Sample-Free Compilation Framework for Efficient Dynamic Tensor Computation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
- MLSys ’25. Comet: Fine-grained Computation-Communication Overlapping for Mixture-of-Experts. In Proceedings of the 8th Annual Conference on Machine Learning and Systems (MLSys), 2025
- EuroSys ’25. Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. In Proceedings of the Twentieth European Conference on Computer Systems, 2025
- ATC ’25. Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception. In 2025 USENIX Annual Technical Conference, 2025
- HPCA ’25. VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference. In 2025 IEEE International Symposium on High Performance Computer Architecture, 2025
- TACO ’25. EDAS: Enabling Fast Data Loading for GPU Serverless Computing. ACM Transactions on Architecture and Code Optimization, 2025
- TACO ’25. Taming Flexible Job Packing in Deep Learning Training Clusters. ACM Transactions on Architecture and Code Optimization, 2025
- TACO ’25. ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication. ACM Transactions on Architecture and Code Optimization, 2025
- TACO ’25. Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing. ACM Transactions on Architecture and Code Optimization, 2025
- APPT ’25. DACO: Unlocking Latent Dataflow Opportunities in Edge-Side SIMT Accelerators. In International Symposium on Advanced Parallel Processing Technologies, 2025
2024
- TC ’24. Accelerating Sparse DNNs Based on Tiled GEMM. IEEE Transactions on Computers, 2024
- arXiv. The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving. arXiv preprint arXiv:2405.11299, 2024
- TC ’24. Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. IEEE Transactions on Computers, 2024
2023
- OSDI ’23. Optimizing Dynamic Neural Networks with Brainstorm. In 17th USENIX Symposium on Operating Systems Design and Implementation, 2023
- TC ’23. Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation. IEEE Transactions on Computers, 2023
- SoCC ’23. Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023
- ICPADS ’23. Microless: Cost-efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems, 2023
- CF ’23. AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs. In Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023
2022
- ATC ’22. DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference, 2022
- HPCA ’22. Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. In 2022 IEEE International Symposium on High-Performance Computer Architecture, 2022
- TC ’22. ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-grained Resource Management. IEEE Transactions on Computers, 2022
- ICS ’22. PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences. In Proceedings of the 36th ACM International Conference on Supercomputing, 2022
2021
- TC ’21. Toward QoS-awareness and Improved Utilization of Spatial Multitasking GPUs. IEEE Transactions on Computers, 2021
- ICCD ’21. Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks. In 2021 IEEE 39th International Conference on Computer Design, 2021
- SC ’21. Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021
2020
- TPDS ’20. E^2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. IEEE Transactions on Parallel and Distributed Systems, 2020
- ICDCS ’20. CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs. In 2020 IEEE 40th International Conference on Distributed Computing Systems, 2020
2019
- ICCD ’19. Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. In 2019 IEEE 37th International Conference on Computer Design, 2019
- ICS ’19. Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters. In Proceedings of the ACM International Conference on Supercomputing, 2019