Paper Collection
Total papers collected: 227
Here you can find the complete categorized list of all papers.
LLM
Algorithm
Agent
- 2025-09 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- 2025-05 A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
- 2025-04 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- 2025-03 R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
- 2025-03 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
- 2024-09 Large Language Model-Based Agents for Software Engineering: A Survey
- 2023-12 Retrieval-Augmented Generation for Large Language Models: A Survey
- 2023-12 nips23 Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- 2023-11 iclr23 REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS
- 2023-10 FIREACT: TOWARD LANGUAGE AGENT FINE-TUNING
- 2023-02 nips23 Toolformer: Language Models Can Teach Themselves to Use Tools
- 2022-07 nips22 WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- 2022-01 nips22 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Models
- 2025-08 KIMI K2: OPEN AGENTIC INTELLIGENCE
- 2025-08 LongCat-Flash Technical Report
- 2025-07 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
- 2025-06 dots.llm1 Technical Report
- 2025-04 Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
- 2025-04 DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
- 2025-01 KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS
- 2024-12 DeepSeek-V3 Technical Report
- 2024-03 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
- 2024-01 DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
- 2023-07 Llama 2: Open Foundation and Fine-Tuned Chat Models
Pretrain / SFT
- 2025-03 SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
- 2024-12 UNVEILING THE SECRET RECIPE: A GUIDE FOR SUPERVISED FINE-TUNING SMALL LLMS
- 2024-08 iclr25 INFERENCE SCALING LAWS: AN EMPIRICAL ANALYSIS OF COMPUTE-OPTIMAL INFERENCE FOR LLM PROBLEM-SOLVING
- 2024-05 nips24 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- 2023-12 WHAT MAKES GOOD DATA FOR ALIGNMENT? A COMPREHENSIVE STUDY OF AUTOMATIC DATA SELECTION IN INSTRUCTION TUNING
- 2022-03 Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- 2021-04 ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
RL
- 2025-08 ON-POLICY RL MEETS OFF-POLICY EXPERTS: HARMONIZING SUPERVISED FINE-TUNING AND REINFORCEMENT LEARNING VIA DYNAMIC WEIGHTING
- 2025-05 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- 2025-05 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- 2025-05 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- 2025-04 Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
- 2024-08 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- 2024-07 nips24 HelpSteer2: Open-source dataset for training top-performing reward models
- 2024-05 NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
- 2024-02 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- 2024-01 Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- 2023-10 SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
- 2023-05 nips23 Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- 2022-04 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- 2022-03 Training language models to follow instructions with human feedback
- 2020-09 nips20 Learning to summarize from human feedback
- 2017-12 icml18 RLlib: Abstractions for Distributed Reinforcement Learning
- 2017-08 Proximal Policy Optimization Algorithms
- 2016-01 Mastering the game of Go with deep neural networks and tree search
- 2015-07 Massively Parallel Methods for Deep Reinforcement Learning
- 2013-12 Playing Atari with Deep Reinforcement Learning
Engineering
Attention
- 2025-08 Mixture of Contexts for Long Video Generation
- 2025-05 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training
- 2025-05 FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
- 2025-02 MOBA: MIXTURE OF BLOCK ATTENTION FOR LONG-CONTEXT LLMS
- 2025-02 Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- 2025-02 TREE ATTENTION: TOPOLOGY-AWARE DECODING FOR LONG-CONTEXT ATTENTION ON GPU CLUSTERS
- 2025-01 MiniMax-01: Scaling Foundation Models with Lightning Attention
- 2024-10 SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION
- 2024-07 nips24 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- 2024-04 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
- 2024-03 Jamba: A Hybrid Transformer-Mamba Language Model
- 2023-12 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- 2023-12 Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- 2023-07 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- 2022-07 nips22 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2020-06 icml20 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- 2020-02 icml20 Low-Rank Bottleneck in Multi-head Attention Models
- 2019-11 Fast Transformer Decoding: One Write-Head is All You Need
Compiler
- 2025-06 osdi25 Mirage: A Multi-Level Superoptimizer for Tensor Programs
- 2025-05 ppopp25 FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
- 2025-04 TileLang: A Composable Tiled Programming Model for AI Systems
- 2025-04 TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- 2024-12 FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS
- 2024-10 FLUX: FAST SOFTWARE-BASED COMMUNICATION OVERLAP ON GPUS THROUGH KERNEL FUSION
Inference
KV Cache
- 2025-06 atc25 KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
- 2025-05 isca25 Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression
- 2024-10 Do Large Language Models Need a Content Delivery Network?
- 2024-07 atc24 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- 2024-05 eurosys25 CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- 2024-04 mlsys24 PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE
- 2023-10 sigcomm24 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Low Precision
- 2025-05 Recipes for Pre-training LLMs with MXFP8
- 2025-03 isca25 Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- 2025-01 osdi25 DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- 2024-11 isca25 MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
- 2024-08 ppopp25 MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- 2024-08 isca25 LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- 2024-02 Massive Activations in Large Language Models
- 2024-01 FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- 2023-09 OCP Microscaling Formats (MX) Specification
- 2023-06 mlsys24 AWQ: ACTIVATION-AWARE WEIGHT QUANTIZATION FOR ON-DEVICE LLM COMPRESSION AND ACCELERATION
- 2023-06 FP8 versus INT8 for efficient deep learning inference
- 2023-05 icme24 Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
- 2023-04 Stable and low-precision training for large-scale vision-language models
- 2023-04 isca23 With Shared Microexponents, A Little Shifting Goes a Long Way
- 2022-11 icml23 SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- 2022-09 FP8 FORMATS FOR DEEP LEARNING
- 2022-08 FP8 Quantization: The Power of the Exponent
Speculative Decoding
- 2025-03 EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- 2024-08 MAGICDEC: BREAKING THE LATENCY-THROUGHPUT TRADEOFF FOR LONG CONTEXT GENERATION WITH SPECULATIVE DECODING
- 2024-06 MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- 2024-06 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- 2024-04 Better & Faster Large Language Models via Multi-token Prediction
- 2024-01 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- 2018-11 Blockwise Parallel Decoding for Deep Autoregressive Models
General
- 2025-07 MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- 2025-06 A Survey of LLM Inference Systems
- 2025-06 isca25 LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
- 2025-05 isca25 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- 2025-04 Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
- 2025-01 atc25 Weaver: Efficient Multi-LLM Serving with Attention Offloading
- 2025-01 atc25 QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
- 2024-08 osdi25 NanoFlow: Towards Optimal Large Language Model Serving Throughput
- 2024-07 nips24 SGLang: Efficient Execution of Structured Language Model Programs
- 2024-07 Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- 2024-05 Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
- 2024-05 Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- 2024-05 Preble: Efficient Distributed Prompt Scheduling for LLM Serving
- 2024-01 osdi24 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- 2023-12 nsdi25 SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- 2023-11 STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS
- 2023-09 sosp23 Efficient Memory Management for Large Language Model Serving with PagedAttention
- 2023-08 SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- 2022-11 EFFICIENTLY SCALING TRANSFORMER INFERENCE
- 2022-07 osdi22 Orca: A Distributed Serving System for Transformer-Based Generative Models
- 2019-10 Transformers: State-of-the-Art Natural Language Processing
MoE
- 2025-01 atc25 PopFetcher: Towards Accelerated Mixture-of-Experts Training Via Popularity Based Expert-Wise Prefetch
- 2024-10 EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- 2024-10 MOE++: ACCELERATING MIXTURE-OF-EXPERTS METHODS WITH ZERO-COMPUTATION EXPERTS
- 2024-08 AUXILIARY-LOSS-FREE LOAD BALANCING STRATEGY FOR MIXTURE-OF-EXPERTS
- 2024-04 icml25 Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- 2023-03 emnlp23 Scaling Vision-Language Models with Sparse Mixture of Experts
- 2022-06 mlsys22 TUTEL: ADAPTIVE MIXTURE-OF-EXPERTS AT SCALE
- 2022-02 ST-MOE: DESIGNING STABLE AND TRANSFERABLE SPARSE EXPERT MODELS
- 2022-01 DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- 2021-12 icml22 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
RL
- 2025-08 SeamlessFlow: A Trainer–Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
- 2025-08 rStar2-Agent: Agentic Reasoning Technical Report
- 2025-07 AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
- 2025-07 DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training
- 2025-05 AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- 2025-04 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- 2025-04 StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- 2024-10 eurosys25 HybridFlow: A Flexible and Efficient RLHF Framework
- 2024-09 RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- 2024-05 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
- 2023-03 DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
- 2022-06 atc25 GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing
- 2019-10 SEED RL: SCALABLE AND EFFICIENT DEEP-RL WITH ACCELERATED CENTRAL INFERENCE
Train
- 2025-08 atc25 Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- 2025-06 isca25 MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
- 2025-06 isca25 Scaling Llama 3 Training with Efficient Parallelism Strategies
- 2025-04 nsdi25 ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- 2025-04 nsdi25 SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
- 2025-03 osdi25 WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- 2025-03 osdi25 Understanding Stragglers in Large Model Training Using What-if Analysis
- 2025-02 ppopp25 Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
- 2025-02 Training LLMs with MXFP4
- 2025-02 ppopp25 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
- 2025-01 osdi25 Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- 2025-01 atc25 Obscura: Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation
- 2025-01 atc25 Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
- 2025-01 atc25 FlexPipe: Maximizing Training Efficiency for Transformer-based Models with Variable-Length Inputs
- 2025-01 nsdi25 Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- 2025-01 osdi25 Zen: Empowering Distributed Training with Sparsity-driven Data Synchronization
- 2024-11 nsdi25 Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- 2024-09 osdi25 Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- 2024-08 sigcomm25 DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models
- 2024-07 Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- 2024-06 atc25 Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- 2024-05 nips24 Pipeline Parallelism with Controllable Memory
- 2024-02 nsdi24 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- 2023-11 ZERO BUBBLE PIPELINE PARALLELISM
- 2023-04 PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- 2022-11 mlsys23 ON OPTIMIZING THE COMMUNICATION OF MODEL PARALLELISM
- 2022-05 ppopp22 FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
- 2022-05 mlsys22 PATHWAYS: ASYNCHRONOUS DISTRIBUTED DATAFLOW FOR ML
- 2022-04 jmlr23 PaLM: Scaling Language Modeling with Pathways
- 2021-08 sc21 Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- 2021-07 sc21 Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- 2021-06 iclr22 LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
- 2021-04 sc21 ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- 2021-01 atc21 ZeRO-Offload: Democratizing Billion-Scale Model Training
- 2020-06 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- 2020-03 sc20 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- 2020-03 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- 2019-07 nips19 GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- 2018-06 PipeDream: Fast and Efficient Pipeline Parallel DNN Training
MLSys
Compiler
- 2025-05 osdi25 KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
- 2025-01 osdi25 PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- 2022-07 osdi22 Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
CPU
- 2022-05 Understanding BIOS Configuration for Performance Tuning
- 2022-05 Everything You Need to Know About the CPU Power Management
Framework
- 2022-01 atc22 Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training
GPU
- 2025-07 Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
- 2025-06 Serving Large Language Models on Huawei CloudMatrix384
- 2025-03 osdi25 Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing
- 2025-01 NVIDIA RTX BLACKWELL GPU ARCHITECTURE
- 2024-07 NVIDIA Blackwell Architecture Technical Brief
- 2024-06 isca24 Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms
- 2023-05 rtas23 Hardware Compute Partitioning on NVIDIA GPUs
- 2023-01 vlsi23 A 135 GBps/Gbit 0.66 pJ/bit Stacked Embedded DRAM with Multilayer Arrays by Fine Pitch Hybrid Bonding and Mini-TSV
- 2023-01 ppopp23 Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
- 2022-01 NVIDIA H100 Tensor Core GPU Architecture
- 2020-02 GPU Initiated OpenSHMEM: Correct and Efficient Intra-Kernel Networking for dGPUs
- 2018-04 jpdc18 GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters
- 2018-03 Improving Real-Time Performance with CUDA Persistent Threads (CuPer) on the Jetson TX2
- 2017-05 Offloading communication control logic in GPU accelerated applications
- 2017-04 sigarch17 Locality-Aware CTA Clustering for Modern GPUs
- 2016-04 Optimizing Performance of Recurrent Neural Networks on GPUs
- 2010-01 Demystifying GPU Microarchitecture through Microbenchmarking
- 2009-04 Roofline: An Insightful Visual Performance Model for Multicore Architectures
Networking
- 2025-07 Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- 2025-07 Scale-Up Ethernet Framework Specification
- 2025-04 Introducing UALink 200G 1.0 Specification
- 2025-03 UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
- 2025-01 nsdi25 AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
- 2024-04 asplos24 Scaling Up Memory Disaggregated Applications with SMART
- 2023-07 Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification
- 2022-11 Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async
- 2022-02 Doubling all2all Performance with NVIDIA Collective Communication Library 2.12
- 2020-08 sc20 An In-Depth Analysis of the Slingshot Interconnect
- 2015-07 UCX: An Open Source Framework for HPC Network APIs and Beyond