Paper Collection
Total papers collected: 227
Here you can find the complete categorized list of all papers.
LLM
Algorithm
Agent
- 2025-09 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- 2025-05 A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
- 2025-04 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- 2025-03 R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
- 2025-03 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
- 2024-09 Large Language Model-Based Agents for Software Engineering: A Survey
- 2023-12 Retrieval-Augmented Generation for Large Language Models: A Survey
- 2023-12 nips23 Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- 2023-11 iclr23 REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS
- 2023-10 FIREACT: TOWARD LANGUAGE AGENT FINE-TUNING
- 2023-02 nips23 Toolformer: Language Models Can Teach Themselves to Use Tools
- 2022-07 nips22 WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- 2022-01 nips22 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Models
- 2025-08 KIMI K2: OPEN AGENTIC INTELLIGENCE
- 2025-08 LongCat-Flash Technical Report
- 2025-07 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
- 2025-06 dots.llm1 Technical Report
- 2025-04 Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
- 2025-04 DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
- 2025-01 KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS
- 2024-12 DeepSeek-V3 Technical Report
- 2024-03 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
- 2024-01 DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
- 2023-07 Llama 2: Open Foundation and Fine-Tuned Chat Models
Pretrain / SFT
- 2025-03 SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
- 2024-12 UNVEILING THE SECRET RECIPE: A GUIDE FOR SUPERVISED FINE-TUNING SMALL LLMS
- 2024-08 iclr25 INFERENCE SCALING LAWS: AN EMPIRICAL ANALYSIS OF COMPUTE-OPTIMAL INFERENCE FOR LLM PROBLEM-SOLVING
- 2024-05 nips24 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- 2023-12 WHAT MAKES GOOD DATA FOR ALIGNMENT? A COMPREHENSIVE STUDY OF AUTOMATIC DATA SELECTION IN INSTRUCTION TUNING
- 2022-03 Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- 2021-04 ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
RL
- 2025-08 ON-POLICY RL MEETS OFF-POLICY EXPERTS: HARMONIZING SUPERVISED FINE-TUNING AND REINFORCEMENT LEARNING VIA DYNAMIC WEIGHTING
- 2025-05 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- 2025-05 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- 2025-05 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- 2025-04 Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
- 2024-08 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- 2024-07 nips24 HelpSteer2: Open-source dataset for training top-performing reward models
- 2024-05 NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
- 2024-02 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- 2024-01 Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- 2023-10 SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
- 2023-05 nips23 Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- 2022-04 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- 2022-03 Training language models to follow instructions with human feedback
- 2020-09 nips20 Learning to summarize from human feedback
- 2017-12 icml18 RLlib: Abstractions for Distributed Reinforcement Learning
- 2017-08 Proximal Policy Optimization Algorithms
- 2016-01 Mastering the game of Go with deep neural networks and tree search
- 2015-07 Massively Parallel Methods for Deep Reinforcement Learning
- 2013-12 Playing Atari with Deep Reinforcement Learning
Engineering
Attention
- 2025-08 Mixture of Contexts for Long Video Generation
- 2025-05 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training
- 2025-05 FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
- 2025-02 MOBA: MIXTURE OF BLOCK ATTENTION FOR LONG-CONTEXT LLMS
- 2025-02 Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- 2025-02 TREE ATTENTION: TOPOLOGY-AWARE DECODING FOR LONG-CONTEXT ATTENTION ON GPU CLUSTERS
- 2025-01 MiniMax-01: Scaling Foundation Models with Lightning Attention
- 2024-10 SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION
- 2024-07 nips24 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- 2024-04 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
- 2024-03 Jamba: A Hybrid Transformer-Mamba Language Model
- 2023-12 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- 2023-12 Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- 2023-07 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- 2022-07 nips22 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2020-06 icml20 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- 2020-02 icml20 Low-Rank Bottleneck in Multi-head Attention Models
- 2019-11 Fast Transformer Decoding: One Write-Head is All You Need
Compiler
- 2025-06 osdi25 Mirage: A Multi-Level Superoptimizer for Tensor Programs
- 2025-05 ppopp25 FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
- 2025-04 TileLang: A Composable Tiled Programming Model for AI Systems
- 2025-04 TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- 2024-12 FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS
- 2024-10 FLUX: FAST SOFTWARE-BASED COMMUNICATION OVERLAP ON GPUS THROUGH KERNEL FUSION
Inference
KV Cache
- 2025-06 atc25 KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
- 2025-05 isca25 Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression
- 2024-10 Do Large Language Models Need a Content Delivery Network?
- 2024-07 atc24 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- 2024-05 eurosys25 CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- 2024-04 mlsys24 PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE
- 2023-10 sigcomm24 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Low Precision
- 2025-05 Recipes for Pre-training LLMs with MXFP8
- 2025-03 isca25 Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- 2025-01 osdi25 DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- 2024-11 isca25 MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
- 2024-08 ppopp25 MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- 2024-08 isca25 LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- 2024-02 Massive Activations in Large Language Models
- 2024-01 FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- 2023-09 OCP Microscaling Formats (MX) Specification
- 2023-06 mlsys24 AWQ: ACTIVATION-AWARE WEIGHT QUANTIZATION FOR ON-DEVICE LLM COMPRESSION AND ACCELERATION
- 2023-06 FP8 versus INT8 for efficient deep learning inference
- 2023-05 icme24 Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
- 2023-04 Stable and low-precision training for large-scale vision-language models
- 2023-04 isca23 With Shared Microexponents, A Little Shifting Goes a Long Way
- 2022-11 icml23 SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- 2022-09 FP8 FORMATS FOR DEEP LEARNING
- 2022-08 FP8 Quantization: The Power of the Exponent
Speculative Decoding
- 2025-03 EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- 2024-08 MAGICDEC: BREAKING THE LATENCY-THROUGHPUT TRADEOFF FOR LONG CONTEXT GENERATION WITH SPECULATIVE DECODING
- 2024-06 MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- 2024-06 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- 2024-04 Better & Faster Large Language Models via Multi-token Prediction
- 2024-01 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- 2018-11 Blockwise Parallel Decoding for Deep Autoregressive Models
General
- 2025-07 MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- 2025-06 A Survey of LLM Inference Systems
- 2025-06 isca25 LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
- 2025-05 isca25 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- 2025-04 Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
- 2025-01 atc25 Weaver: Efficient Multi-LLM Serving with Attention Offloading
- 2025-01 atc25 QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
- 2024-08 osdi25 NanoFlow: Towards Optimal Large Language Model Serving Throughput
- 2024-07 nips24 SGLang: Efficient Execution of Structured Language Model Programs
- 2024-07 Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- 2024-05 Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
- 2024-05 Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- 2024-05 Preble: Efficient Distributed Prompt Scheduling for LLM Serving
- 2024-01 osdi24 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- 2023-12 nsdi25 SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- 2023-11 STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS
- 2023-09 sosp23 Efficient Memory Management for Large Language Model Serving with PagedAttention
- 2023-08 SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- 2022-11 EFFICIENTLY SCALING TRANSFORMER INFERENCE
- 2022-07 osdi22 Orca: A Distributed Serving System for Transformer-Based Generative Models
- 2019-10 Transformers: State-of-the-Art Natural Language Processing
MoE
- 2025-01 atc25 PopFetcher: Towards Accelerated Mixture-of-Experts Training Via Popularity Based Expert-Wise Prefetch
- 2024-10 EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- 2024-10 MOE++: ACCELERATING MIXTURE-OF-EXPERTS METHODS WITH ZERO-COMPUTATION EXPERTS
- 2024-08 AUXILIARY-LOSS-FREE LOAD BALANCING STRATEGY FOR MIXTURE-OF-EXPERTS
- 2024-04 icml25 Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- 2023-03 emnlp23 Scaling Vision-Language Models with Sparse Mixture of Experts
- 2022-06 mlsys22 TUTEL: ADAPTIVE MIXTURE-OF-EXPERTS AT SCALE
- 2022-02 ST-MOE: DESIGNING STABLE AND TRANSFERABLE SPARSE EXPERT MODELS
- 2022-01 DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- 2021-12 icml22 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
RL
- 2025-08 SeamlessFlow: A Trainer–Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
- 2025-08 rStar2-Agent: Agentic Reasoning Technical Report
- 2025-07 AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
- 2025-07 DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training
- 2025-05 AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- 2025-04 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- 2025-04 StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- 2024-10 eurosys25 HybridFlow: A Flexible and Efficient RLHF Framework
- 2024-09 RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- 2024-05 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
- 2023-03 DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
- 2022-06 atc25 GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing
- 2019-10 SEED RL: SCALABLE AND EFFICIENT DEEP-RL WITH ACCELERATED CENTRAL INFERENCE
Train
- 2025-08 atc25 Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- 2025-06 isca25 MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
- 2025-06 isca25 Scaling Llama 3 Training with Efficient Parallelism Strategies
- 2025-04 nsdi25 ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- 2025-04 nsdi25 SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
- 2025-03 osdi25 WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- 2025-03 osdi25 Understanding Stragglers in Large Model Training Using What-if Analysis
- 2025-02 ppopp25 Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
- 2025-02 Training LLMs with MXFP4
- 2025-02 ppopp25 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
- 2025-01 osdi25 Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- 2025-01 atc25 Obscura: Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation
- 2025-01 atc25 Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
- 2025-01 atc25 FlexPipe: Maximizing Training Efficiency for Transformer-based Models with Variable-Length Inputs
- 2025-01 nsdi25 Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- 2025-01 osdi25 Zen: Empowering Distributed Training with Sparsity-driven Data Synchronization
- 2024-11 nsdi25 Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- 2024-09 osdi25 Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- 2024-08 sigcomm25 DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models
- 2024-07 Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- 2024-06 atc25 Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- 2024-05 nips24 Pipeline Parallelism with Controllable Memory
- 2024-02 nsdi24 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- 2023-11 ZERO BUBBLE PIPELINE PARALLELISM
- 2023-04 PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- 2022-11 mlsys23 ON OPTIMIZING THE COMMUNICATION OF MODEL PARALLELISM
- 2022-05 ppopp22 FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
- 2022-05 mlsys22 PATHWAYS: ASYNCHRONOUS DISTRIBUTED DATAFLOW FOR ML
- 2022-04 jmlr23 PaLM: Scaling Language Modeling with Pathways
- 2021-08 sc21 Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- 2021-07 sc21 Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- 2021-06 iclr22 LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
- 2021-04 sc21 ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- 2021-01 atc21 ZeRO-Offload: Democratizing Billion-Scale Model Training
- 2020-06 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- 2020-03 sc20 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- 2020-03 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- 2019-07 nips19 GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- 2018-06 PipeDream: Fast and Efficient Pipeline Parallel DNN Training
MLSys
Compiler
- 2025-05 osdi25 KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
- 2025-01 osdi25 PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- 2022-07 osdi22 Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
CPU
- 2022-05 Understanding BIOS Configuration for Performance Tuning
- 2022-05 Everything You Need to Know About the CPU Power Management
Framework
- 2022-01 atc22 Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training
GPU
- 2025-07 Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
- 2025-06 Serving Large Language Models on Huawei CloudMatrix384
- 2025-03 osdi25 Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing
- 2025-01 NVIDIA RTX BLACKWELL GPU ARCHITECTURE
- 2024-07 NVIDIA Blackwell Architecture Technical Brief
- 2024-06 isca24 Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms
- 2023-05 rtas23 Hardware Compute Partitioning on NVIDIA GPUs
- 2023-01 vlsi23 A 135 GBps/Gbit 0.66 pJ/bit Stacked Embedded DRAM with Multilayer Arrays by Fine Pitch Hybrid Bonding and Mini-TSV
- 2023-01 ppopp23 Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
- 2022-01 NVIDIA H100 Tensor Core GPU Architecture
- 2020-02 GPU Initiated OpenSHMEM: Correct and Efficient Intra-Kernel Networking for dGPUs
- 2018-04 jpdc18 GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters
- 2018-03 Improving Real-Time Performance with CUDA Persistent Threads (CuPer) on the Jetson TX2
- 2017-05 Offloading communication control logic in GPU accelerated applications
- 2017-04 sigarch17 Locality-Aware CTA Clustering for Modern GPUs
- 2016-04 Optimizing Performance of Recurrent Neural Networks on GPUs
- 2010-01 Demystifying GPU Microarchitecture through Microbenchmarking
- 2009-04 Roofline: An Insightful Visual Performance Model for Multicore Architectures
Networking
- 2025-07 Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- 2025-07 Scale-Up Ethernet Framework Specification
- 2025-04 Introducing UALink 200G 1.0 Specification
- 2025-03 UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
- 2025-01 nsdi25 AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
- 2024-04 asplos24 Scaling Up Memory Disaggregated Applications with SMART
- 2023-07 Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification
- 2022-11 Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async
- 2022-02 Doubling all2all Performance with NVIDIA Collective Communication Library 2.12
- 2020-08 sc20 An In-Depth Analysis of the Slingshot Interconnect
- 2015-07 UCX: An Open Source Framework for HPC Network APIs and Beyond