NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

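Authors/Affiliation: NVIDIA

Main Contributions

This paper introduces NVIDIA Nemotron Nano 2, a hybrid Mamba-Transformer reasoning model designed to substantially increase throughput for reasoning workloads while achieving state-of-the-art accuracy among models of similar size. Compared with existing similarly sized models such as Qwen3-8B, Nemotron Nano 2 matches or exceeds their accuracy on reasoning benchmarks, and in generation-heavy settings (for example, 8k input / 16k output tokens) it delivers 3x to 6.3x higher inference throughput.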

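Core research goals and innovations:

Architecture and efficiency: Nemotron Nano 2 builds on the Nemotron-H architecture, which replaces most of the self-attention layers of a conventional Transformer with Mamba-2 layers. The hybrid design speeds up inference when generating the long chains of thought that reasoning requires, giving high throughput while preserving accuracy.
Training pipeline: The work starts from a 12-billion-parameter base model (Nemotron-Nano-12B-v2-Base), pre-trained on 20 trillion tokens with an FP8 training recipe and a Warmup-Stable-Decay learning-rate schedule. A subsequent continued-pre-training long-context extension phase lets the model handle contexts of up to 128k tokens without degrading other benchmarks.
Alignment: The model is aligned through a multi-stage post-training pipeline comprising supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and reinforcement learning from human feedback (RLHF). Post-training processes roughly 90 billion tokens in total and focuses on tool use, long-context performance, and conversational ability. About 5% of the data contains deliberately truncated reasoning traces, which gives fine-grained control over the "thinking budget" at inference time.
Compression and deployment: To process a 128k-token context on a single NVIDIA A10G GPU (22 GiB of memory), the authors adopt and extend the Minitron-based compression strategy, using pruning and knowledge distillation to compress the aligned 12B-parameter model to 9B parameters, yielding Nemotron-Nano-9B-v2.
Open release: NVIDIA releases the model family, including the final reasoning model Nemotron-Nano-9B-v2, the pruned base model Nemotron-Nano-9B-v2-Base, and the original base model Nemotron-Nano-12B-v2-Base. It also releases more than 6 trillion tokens of pre-training data and an updated post-training dataset to support the community.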

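Figure 1 | Comparison of Nemotron Nano 2 and Qwen3-8B in accuracy and throughput. Nemotron Nano 2 achieves comparable or better accuracy on complex reasoning benchmarks while delivering up to 6.3x higher throughput on such workloads. Input sequence length is abbreviated as ISL and output sequence length as OSL; throughput is measured on a single A10G GPU in bfloat16 precision.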

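Pre-training

This section discusses the architecture and pre-training of Nemotron-Nano-12B-v2-Base and compares its accuracy with other state-of-the-art models on popular benchmarks.

Method Details

Model Architecture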

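Hybrid Mamba-Transformer architecture. Like the Nemotron-H models [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025], Nemotron-Nano-12B-v2-Base is built from a mix of Mamba-2 layers [22, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, 2024], self-attention layers, and feed-forward network (FFN) layers. The model has 62 layers in total: 6 self-attention layers, 28 FFN layers, and 28 Mamba-2 layers. The self-attention layers are evenly distributed through the model and make up roughly 8% of all layers. The hidden dimension is 5120 and the FFN hidden dimension is 20480. The self-attention layers use Grouped-Query Attention [5, GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, 2023] with 40 query heads and 8 key-value heads. The Mamba-2 layers use 8 groups, a state dimension of 128, a head dimension of 64, an expansion factor of 2, and a convolution window size of 4. The FFN layers use squared ReLU as the activation function [103, Primer: Searching for Efficient Transformers for Language Modeling, 2022]. The architecture also inherits several design choices from Nemotron-H: no positional embeddings, RMSNorm for normalization [124, Root Mean Square Layer Normalization, 2019], separate (untied) embedding and output-layer weights, no dropout, and no bias weights in linear layers.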

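Figure 2 | Layer pattern of Nemotron-Nano-12B-v2-Base. As in the Nemotron-H models, roughly 8% of the layers are self-attention layers, evenly distributed through the model.

Table 1 | Summary of the Nemotron-Nano-12B-v2-Base architecture.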

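Pre-training Data

High-quality pre-training corpus. Nemotron-Nano-12B-v2-Base was pre-trained on a large corpus of high-quality curated and synthetic data.

Curated Data

Multi-source curation pipelines. We built separate curation pipelines for general web crawl data (English and multilingual), math data, and code data. Each is discussed below.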

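English web crawl data. We used the Nemotron-CC dataset [107, Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset, 2025], updated with eight more recent Common Crawl snapshots (CC-MAIN-2024-33 through CC-MAIN-2025-13) processed with the same pipeline. For the synthetic rephrasing stage we mainly switched to Qwen3-30B-A3B (previously Mistral Nemo 12B). To extend the model's knowledge cutoff, we also used CC-NEWS data through April 23, 2025. The CC-NEWS data received only English-language filtering and global fuzzy deduplication; no other filters were applied.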

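Multilingual data. We extracted data in fifteen languages from three Common Crawl snapshots (CC-MAIN-2024-51, CC-MAIN-2025-08, and CC-MAIN-2025-18). The fifteen languages are Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. Because no reliable model-based multilingual quality classifier was available, we applied only heuristic filtering, similar to how the Nemotron-CC pipeline filters low-quality English data, and had to selectively disable some heuristic filters that produced high false-positive rates for particular languages. Deduplication follows Nemotron-CC. In addition, we used Wikipedia and FineWeb-2 [88, Fineweb2: One pipeline to scale them all – adapting pre-training data processing to every language, 2025] data for these fifteen languages.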

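Math data. Mathematical content on the web appears in many formats, including inline and block LaTeX, MathML, Unicode symbols, and custom renderers such as MathJax or KaTeX. We analyzed prior math-specific extraction pipelines in detail, including OpenWebMath [87, OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text, 2023], MegaMath [127, MegaMath: Pushing the limits of open math corpora, 2025], jusText [26, More effective boilerplate removal-the goldminer algorithm, 2013], Trafilatura [11, Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction, 2021], and Resiliparse [15, Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl, 2018], and found that none of them reliably preserves mathematical expressions or code structure. These tools often drop or distort equations and flatten code formatting, which severely limits the usefulness of the extracted content for pre-training.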

High-fidelity math extraction pipeline. To address these problems, we built a new pipeline designed for high-fidelity extraction of mathematical content from Common Crawl. First, we aggregated an extensive list of math-related URLs from prior datasets (InfiMM-WebMath [30, InfiMM-WebMath-40B: Advancing multimodal pre-training for enhanced mathematical reasoning, 2024], OpenWebMath [87, OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text, 2023], FineMath [7, SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model, 2025], and MegaMath [127, MegaMath: Pushing the limits of open math corpora, 2025]) and re-fetched their raw HTML documents from 98 Common Crawl snapshots (2014-2024). Each page was rendered with the text-based browser lynx to preserve layout and mathematical structure. We then applied Phi-4 (14B parameters) [2, Phi-4 technical report, 2024] to remove boilerplate, standardize notation into LaTeX, and correct inconsistencies. A FineMath classifier [7, SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model, 2025] was used to retain high-quality documents, followed by fuzzy deduplication in the NeMo-Curator framework using MinHash-based [17, Identifying and filtering near-duplicate documents, 2000] locality-sensitive hashing (LSH) [38, Approximate nearest neighbors: towards removing the curse of dimensionality, 1998]. Finally, we decontaminated the dataset with LLM Decontaminator [122, Rethinking benchmark and contamination for language models with rephrased samples, 2023].
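To make the fuzzy-deduplication step concrete, here is a minimal pure-Python sketch of MinHash signatures with LSH banding. It is not the NeMo-Curator implementation; the shingle size, number of hash functions, band count, and all function names are illustrative assumptions.

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of a whitespace-normalized string (k is an assumed setting)."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        prefix = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + g.encode("utf-8"), digest_size=8).digest(),
                "little")
            for g in grams))
    return signature

def lsh_band_keys(signature, bands=16):
    """Split the signature into bands; documents sharing any band key are candidate duplicates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose band keys collide would then be compared (for example with `estimated_jaccard`) and all but one member of each near-duplicate cluster dropped.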

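Nemotron-CC-Math dataset. The pipeline produced a 133-billion-token corpus, Nemotron-CC-Math-3+, and a higher-quality 52-billion-token subset, Nemotron-CC-Math-4+, which contains only the highest-scoring samples. When used for pre-training, the dataset yields clear gains on math (MATH-500), code (HumanEval+, MBPP+, MBPP), and general-domain evaluations (MMLU, MMLU-STEM, MMLU-Pro), surpassing all existing open math datasets. See Mahabadi et al. [58, Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025] for details.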

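Code data. As with earlier Nemotron models [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025; 83, Nemotron-4 340B Technical Report, 2024; 86, Nemotron4 15B Technical Report, 2024], we pre-trained Nemotron-Nano-12B-v2-Base on large-scale raw source code. All source code used for training originates from GitHub and passes through a multi-stage processing pipeline. We performed license-based removal with a license detection pipeline similar to that of the BigCode project [56, Starcoder 2 and the stack v2: The next generation, 2024], but accepted a smaller set of licenses (see Appendix A). Deduplication matters especially for source code because many files are copied verbatim across large numbers of repositories, so we applied both exact deduplication (by hashing) and fuzzy deduplication (MinHash LSH). To better understand each file in the dataset, we annotated all files with a range of metrics and filtered on those annotations. We found the heuristic filters from OpenCoder [39, Opencoder: The open cookbook for top-tier code large language models, 2025] very effective and used them to remove files that are of low value, or even harmful, for LLM pre-training.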

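Synthetically Generated Data

STEM (science, technology, engineering, and mathematics) data. We generated synthetic data for STEM subjects (astronomy, biology, chemistry, mathematics, and physics) from 88.6k seed questions collected from multiple sources. Beyond the widely used GSM8K, MATH, and AOPS training sets, we collected more diverse questions from Stemez and from permissively licensed textbooks on OpenStax and the Open Textbook Library. We used Qwen2.5-VL-72B-Instruct [10, Qwen2.5-vl technical report, 2025] to extract questions from the exercise sections of textbooks, with additional instructions such as removing question numbers, ignoring questions that require interpreting an image, and formatting equations in LaTeX. We manually curated the extracted questions to fix occasional OCR errors and removed questions that are not self-contained (for example, those that refer to an example elsewhere in the same chapter).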

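Multi-model, multi-prompt question generation. To increase the number and diversity of questions, we ran three rounds of question generation with four models (Qwen3-30B-A3B and Qwen3-235B-A22B [121, Qwen3 technical report, 2025], both with thinking mode enabled, Deepseek-R1 [24, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025a], and Deepseek V3 [25, DeepSeek-V3 Technical Report, 2025b]) and three prompts: 1. Similar question: create a new question that explores similar concepts but poses a new challenge. 2. Harder question: create a new question that requires more logical steps or involves more advanced concepts. 3. Diverse question: create a new question of a different type from the original. We instructed the models to avoid superficial or trivial modifications and to think through the solution when creating a new question.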

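Post-processing of synthetic data. We removed duplicate and highly similar questions with fuzzy deduplication and generated solutions for the remaining questions using the same models as in the question-generation stage. We converted a subset of the samples into MMLU- or MMLU-Pro-style multiple-choice questions. By concatenating random synthetic samples, we built thousands of few-shot examples.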

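Math data. We also revisited and regenerated the Nemotron-MIND dataset [6, MIND: Math Informed syNthetic Dialogues for Pretraining LLMs, 2024], a synthetic math pre-training corpus originally built on OpenWebMath. In our updated version, we regenerated MIND using Nemotron-CC-Math-4+ (our highest-quality math subset, 52 billion tokens) as the source corpus. Following the original method, we applied seven prompt templates (for example teacher-student, debate, and interview) and generated structured mathematical dialogues with Phi-4. Unlike the original MIND, which relied on 14.7 billion tokens of lower-fidelity data, our version uses substantially higher-quality input and processes it in chunks of 5K tokens. The regeneration produced a 73-billion-token synthetic dataset that improves consistently over the original MIND on math reasoning and general knowledge benchmarks (MMLU, MMLU-Pro, MMLU-Stem), underscoring the importance of input data quality. Full details and results are in Mahabadi et al. [58, Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025].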

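Multilingual data. We generated multilingual Diverse QA data [107, Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset, 2025] from two sources: 1. We translated the English Diverse QA data into fifteen languages (see the multilingual data paragraph above) with Qwen3-30B-A3B [121, Qwen3 technical report, 2025]. 2. We generated synthetic data from Wikipedia articles in these languages using the Diverse QA prompt, instructing the model to write all questions and answers in the target language. In addition, we used Qwen3-30B-A3B to translate a subset of the GSM8K augmentation data (see the STEM data paragraph) into these languages. We post-processed each translated solution by appending a closing sentence meaning "The answer is ..." (for example "La respuesta es ..." in Spanish and "Die Antwort lautet ..." in German), with the final numerical answer extracted from the original English solution.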

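Code data. We generated question-answer (QA) data at scale for 11 programming languages by prompting an LLM to write a question based on a short snippet from our curated source code, asking the model to solve the generated question, and then post-processing the resulting QA pairs with heuristics (for example Python AST parsing). This technique yields diverse, problem-solving-oriented synthetic data containing both natural language and source code. More details are given in the Nemotron-H technical report [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025], where we first used this type of synthetic code data in pre-training.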

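Academic data. In the pre-training dataset of the Nemotron-H models [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025], every document from academic sources (including textbooks and academic papers) was labeled with attributes for educational quality, educational difficulty, and educational subject. Because technical content of high educational difficulty remains challenging for the model, we prioritized improving the model's grasp of such information by generating QA pairs, since this kind of data has been shown to strengthen knowledge storage and extraction in language models [8, Physics of language models: Part 3.1, knowledge storage and extraction, 2024].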

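Academic QA generation pipeline. We first collected all documents of undergraduate and graduate difficulty in technical subject areas, including mathematics, chemistry, biology, physics, and medicine. From this subset, we aimed to find the most relevant text passages to serve as seed contexts for QA generation. We split each document into 512-token chunks, embedded them with the e5-large model [114, Text embeddings by weakly-supervised contrastive pre-training, 2024], and stored them in a Milvus vector database that supports approximate nearest-neighbor search. We then curated documents from a range of complex subject areas (for example, mathematics: real analysis; biology: genetics; statistics: information theory) and, for each query document, retrieved the 250 nearest-neighbor chunks from the Milvus database. The returned chunks serve as seed contexts, which we fed to Qwen-2.5 72B instruct [91, Qwen2.5 Technical Report, 2025] to generate multiple-choice and free-response QA pairs based on the information in each chunk. For every QA pair we additionally generated a justification for the answer.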

SFT-style data. Using SFT (supervised fine-tuning) style data in the later stages of pre-training has been shown to support more comprehensive model learning [36, MiniCPM: Unveiling the potential of small language models with scalable training strategies, 2024]. We therefore synthesized and included SFT-style data across several domains: 1) code SFT data focused on solving coding problems; 2) math SFT data focused on reasoning; 3) MMLU-style SFT data containing diverse QA examples across knowledge topics; and 4) general instruction-following SFT data. We ensured that the SFT-style data covers diverse topics at varying difficulty levels in each of these domains. The detailed synthesis methods and pipelines are described in prior work [111, Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data, 2024; 68, Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025; 13, Llama-nemotron: Efficient reasoning models, 2025a; 14, Llama-nemotron: Efficient reasoning models, 2025b; 4, Opencodeinstruct: A large-scale instruction tuning dataset for code llms, 2025a; 3, Opencodereasoning: Advancing data distillation for competitive coding, 2025b; 59, Genetic instruct: Scaling up synthetic generation of coding instructions for large language models, 2024].

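Fundamental reasoning SFT-style data. Although the SFT-style data above improves an LLM's ability to answer questions on code, math, and general language-understanding benchmarks, it does not help the model pick the correct answer out of a larger set of plausible distractors in deeper reasoning tasks. We propose to mitigate this by synthesizing SFT-style data focused on analytical reasoning, logical reasoning, and reading comprehension. Specifically, we collected existing datasets: 1) the Law School Admission Test (LSAT) dataset from Wang et al. [115, From lsat: The progress and challenges of complex reasoning, 2022] and Zhong et al. [126, Analytical reasoning of text, 2022], which covers three tasks: logical reasoning, reading comprehension, and analytical reasoning; 2) the LogiQA dataset repurposed by Liu et al. [53, Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, 2020], which contains various types of logical reasoning questions collected from the National Civil Servants Examination of China; and 3) the AQuA-RAT dataset of Ling et al. [52, Program induction by rationale generation: Learning to solve and explain algebraic word problems, 2017], which emphasizes algebraic word problems. We then prompted DeepSeek-V3 [25, DeepSeek-V3 Technical Report, 2025b] and Qwen3-30B-A3B [121, Qwen3 technical report, 2025] separately to synthesize more similar questions with corresponding answer options. For each generated question, we prompted DeepSeek-V3 again to produce a chain-of-thought (CoT) process with a final solution. In post-processing, we applied majority voting and kept only the samples whose solution received the most votes. In total, we generated 4 billion tokens from DeepSeek-V3 and 4.2 billion tokens from Qwen3-30B.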

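Data Mixture and Ordering

Data categories and quality tiers. Our data mixture consists of thirteen data categories. The largest is web crawl data, which we split into four subcategories according to the Nemotron-CC quality classification [107, Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset, 2025]: crawl-medium, crawl-medium-high, crawl-high, and syn-crawl-high, denoting medium, medium-high, high, and synthetic-quality crawl data respectively. The remaining categories are math, Wikipedia, code, academic data, crawl++, multilingual, and synthetic SFT-style data, the last of which is further divided into general-sft, stem-sft, and code-sft. Crawl++ consists of web-crawl derivatives such as OpenWebText, BigScience, and Reddit. Our multilingual data covers fifteen languages: Arabic, Danish, German, Spanish, French, Italian, Portuguese, Dutch, Polish, Swedish, Thai, Chinese, Japanese, Korean, and Russian. We designed the mixture so that data sources of similar quality receive similar weights and higher-quality sources receive higher weights than lower-quality ones. For a detailed explanation of dataset quality assessment and the mixture-creation process, see Feng et al. [27, Maximize Your Data’s Potential: Enhancing LLM Accuracy with Two-Phase Pretraining, 2024] and NVIDIA [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025].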

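Three-phase curriculum. We pre-trained Nemotron-Nano-12B-v2-Base with a curriculum based on a three-phase data-mixing approach. In the first phase, we used a mixture that promotes data diversity; in the second and third phases, we used primarily high-quality datasets (for example Wikipedia). We switched to the second phase at 60% of training progress and to the third phase at 90%. The mixture used in each phase is shown in Figure 3.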
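The phase boundaries can be expressed as a small lookup on training progress. This is only an illustrative sketch: the function name is hypothetical, and the 20-trillion-token default is taken from the training horizon stated elsewhere in this summary.

```python
def curriculum_phase(tokens_seen, total_tokens=20e12):
    """Return the data-mixture phase (1, 2, or 3) for a given point in training.

    Phase 2 starts at 60% of training progress and phase 3 at 90%.
    """
    progress = tokens_seen / total_tokens
    if progress < 0.60:
        return 1
    if progress < 0.90:
        return 2
    return 3
```

With a 20T-token horizon, the switches fall at 12T and 18T tokens.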

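Multilingual Data Ablation Study

Evaluating multilingual data sources. Section 2.2 introduced several kinds of multilingual data, both curated and synthetic: 1. Common Crawl: extracted from recent Common Crawl snapshots with our own pipeline. 2. FineWeb-2 [88, Fineweb2: One pipeline to scale them all – adapting pre-training data processing to every language, 2025]. 3. DiverseQA-wiki: generated from multilingual Wikipedia articles using translated Diverse QA prompts. 4. DiverseQA-crawl: translated from the English Diverse QA data. To determine an appropriate mixing ratio among these sources, we first ran ablation experiments comparing the downstream performance of the four kinds of multilingual data.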

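Experimental setup and results. We took a 1B-parameter model checkpoint that had been trained on 350 billion tokens and continued pre-training it for another 100 billion tokens. We allocated 50% of the continued-pre-training data to multilingual data and used our default pre-training mixture for the remaining 50%. We evaluated each model on the Global-MMLU benchmark [101, Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2024a]; results are shown in Table 2. Our curated Common Crawl-based multilingual data performs slightly better than the FineWeb-2-based data, while the synthetic multilingual QA pairs perform far better than the curated multilingual web crawl data. The Diverse QA pairs translated from English Common Crawl achieve the highest average score across the 8 languages we evaluated. Accordingly, when setting our multilingual mixture, we gave DiverseQA-crawl a much higher weight than the other categories.

Table 2 | Comparison of multilingual datasets on the Global-MMLU benchmark.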

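Fundamental Reasoning SFT-Style Data Ablation Study

Validating the FR-SFT data. To demonstrate the effectiveness of the Fundamental Reasoning (FR) SFT-style data introduced in Section 2.2, we took an intermediate checkpoint of Nemotron-H-8B [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025] trained on 14.5T tokens and continued pre-training it for another 100 billion tokens. We assigned 5% of these 100 billion tokens to the newly synthesized FR-SFT data (replacing Common Crawl data) and kept all other data categories at the same proportions as in the phase-3 mixture of Nemotron-H-8B. We compared this model against Nemotron-H-8B trained to the same 14.6T tokens. The evaluation benchmarks are described in Section 2.7, and the comparison is shown in Table 3. The SFT-style data raises the Nemotron-H 8B model's MMLU-Pro score from 44.24 to 56.36 and improves the average MATH score by about 2 points. Although MMLU-Pro is a more challenging benchmark of language understanding, it also requires strong reasoning to select the correct answer from ten options. Our FR-SFT data gives the model the ability to pick the correct answer over the nine distractors. We observed no degradation on the average commonsense reasoning and average code benchmarks.

Table 3 | Ablation study of the Fundamental Reasoning (FR) SFT-style data.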

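FP8 Training Recipe

FP8 training details. Throughout pre-training we used DeepSeek's FP8 training recipe [25, DeepSeek-V3 Technical Report, 2025b]. Specifically, we use the E4M3 format for all tensors, 128x128 quantization blocks for weights, and 1x128 blocks for activations. Unlike Nemotron-H, we keep the model weights natively in E4M3 so that the distributed optimizer's parameter all-gather (across data-parallel replicas) can run in FP8; the master weights are still kept in FP32. One difference from DeepSeek's formulation is that, as with Nemotron-H, we keep the linear layers of the first and last four layers in BF16. Also unlike the DeepSeek-V3 run, we keep all optimizer states in FP32. We observed no training instability from these numerical choices.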
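The block-wise scaling behind this recipe can be illustrated with NumPy. The sketch below only computes per-block scale factors and rescales values into the E4M3 range (largest finite magnitude 448); it does not round to actual FP8 values, and the function names are hypothetical rather than part of any training framework.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def block_scales(x, block_rows, block_cols):
    """One scale per (block_rows x block_cols) tile, mapping the tile's max |value| to E4M3_MAX."""
    rows, cols = x.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    tiles = x.reshape(rows // block_rows, block_rows, cols // block_cols, block_cols)
    amax = np.abs(tiles).max(axis=(1, 3))
    return np.maximum(amax, 1e-12) / E4M3_MAX  # guard against all-zero tiles

def scale_into_fp8_range(x, block_rows, block_cols):
    """Divide every element by its tile's scale; returns (scaled tensor, scales)."""
    scales = block_scales(x, block_rows, block_cols)
    expanded = np.repeat(np.repeat(scales, block_rows, axis=0), block_cols, axis=1)
    return x / expanded, scales

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256))      # weights: 128x128 blocks
activations = rng.standard_normal((4, 256))    # activations: 1x128 blocks
w_scaled, w_scales = scale_into_fp8_range(weights, 128, 128)
a_scaled, a_scales = scale_into_fp8_range(activations, 1, 128)
```

Finer-grained activation blocks (1x128) give each token's slice its own scale, which limits the effect of outlier activations on neighbouring values.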

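Hyperparameters

Training hyperparameters. We trained Nemotron-Nano-12B-v2-Base with a token horizon of 20 trillion tokens. We used a sequence length of 8192 and a global batch size of 736 (6,029,312 tokens per batch). We did not use any batch-size ramp-up. We used a WSD (Warmup-Stable-Decay) learning-rate schedule [36, MiniCPM: Unveiling the potential of small language models with scalable training strategies, 2024] with a "stable" learning rate of $4.5 \cdot 10^{-4}$ and a minimum of $4.5 \cdot 10^{-6}$; the learning rate is decayed over the final 3.6 trillion tokens. Weight decay was set to 0.1, and Adam's $\beta_1$ and $\beta_2$ were set to 0.9 and 0.95.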
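A WSD schedule with these values could look like the sketch below. The stable rate, minimum rate, token horizon, and decay length come from the text; the warmup length and the linear shape of the warmup and decay are assumptions made purely for illustration, since the text does not specify them.

```python
def wsd_learning_rate(tokens_seen,
                      total_tokens=20e12,
                      decay_tokens=3.6e12,
                      stable_lr=4.5e-4,
                      min_lr=4.5e-6,
                      warmup_tokens=8e9):  # assumed value, not from the source
    """Warmup-Stable-Decay learning rate as a function of tokens consumed."""
    if tokens_seen < warmup_tokens:
        # Assumed linear warmup from 0 to the stable rate.
        return stable_lr * tokens_seen / warmup_tokens
    decay_start = total_tokens - decay_tokens
    if tokens_seen < decay_start:
        return stable_lr
    # Assumed linear decay from the stable rate to the minimum rate.
    fraction = min(1.0, (tokens_seen - decay_start) / decay_tokens)
    return stable_lr + (min_lr - stable_lr) * fraction
```

Under these assumptions the rate stays at the stable value until 16.4T tokens and reaches the minimum at 20T.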

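Long-Context Extension

Long-context continued pre-training (CPT). To ensure that Nemotron-Nano-12B-v2-Base can reason over long context windows, we added a long-context phase (Phase LC) after phase 3 of pre-training. In Phase LC we performed CPT with a context length of 524,288 (512k) tokens at a constant learning rate of $4.5 \cdot 10^{-6}$. Although the target context length of Nemotron Nano 2 is 128k, preliminary studies on the Nemotron-H 8B model showed that CPT with a 512k sequence length works better than with 256k or 128k. Our intuition is that longer training sequences reduce the chance that coherent long documents are cut and separated by the Concat & Chunk algorithm used in pre-training data loading. We used 8-way tensor model parallelism and 16-way context parallelism so that training with 512k-token sequences still fits in GPU memory. We used a global batch size of 12 so that the total number of tokens per global batch during long-context CPT stays about the same as in pre-training: roughly 6 million tokens. Phase LC consumed 18.9 billion tokens.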

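Long-context synthetic data generation. We also generated synthetic long-context data to provide more high-quality data for Phase LC. Because academic pre-training datasets are a good source of coherent long documents, we used such documents longer than 32k tokens as seed data. We followed the methods described in the Llama-3 [60, The Llama 3 Herd of Models, 2024] and Qwen-2.5 [91, Qwen2.5 Technical Report, 2025] technical reports to generate long-context document QA data. We split each document into 1024-token chunks and randomly selected 10% of the chunks to feed to Qwen-2.5-72B-Instruct for data synthesis. We asked the generator to produce one QA pair based on the information in each chunk. We concatenated the QA pairs and appended them to the end of the original document to form a long-context document QA sample. Such long-document QA gives the model good material for learning long-range dependencies. Ablation results on Nemotron-H 8B for different training sequence lengths and for the effect of synthetic data are shown in Table 4.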

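Phase LC data mixture. The data mixture used in Phase LC is built on the phase-3 mixture. We scaled the weights of all phase-3 data proportionally down to 80% of their original values and assigned the remaining 20% to the newly added long-context document QA data. We found that this mixture effectively extends the context length of Nemotron-Nano-12B-v2-Base without lowering scores on regular benchmarks.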
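The reweighting amounts to scaling every phase-3 weight by 0.8 and giving the freed 20% to the new category. A minimal sketch follows; the category names and phase-3 weights in the usage line are made up for illustration, and only the 80%/20% split comes from the text.

```python
def build_phase_lc_blend(phase3_weights, lc_share=0.20, lc_name="long_context_doc_qa"):
    """Scale phase-3 weights to (1 - lc_share) of the blend and add the long-context share."""
    total = sum(phase3_weights.values())
    blend = {name: (1.0 - lc_share) * w / total for name, w in phase3_weights.items()}
    blend[lc_name] = lc_share
    return blend

# Hypothetical phase-3 weights, not the actual mixture.
example_blend = build_phase_lc_blend({"math": 0.3, "code": 0.2, "crawl-high": 0.5})
```

Relative proportions among the original phase-3 categories are preserved by construction.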

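Table 4 | Comparison of different training sequence lengths and uses of synthetic data. The ablation was run on Nemotron-H 8B.

Alignment

This section describes the alignment pipeline that turns the base checkpoint into an aligned 12B checkpoint. An overview of the pipeline is shown in Figure 4.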

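Post-training Data

Large-scale SFT data. Alignment begins with a large-scale supervised fine-tuning (SFT) stage that trains the base model on roughly 80 billion tokens of prompt-response pairs. The distribution of data across domains is shown in Table 7.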

Math, science, and coding data. For math [111, Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data, 2024; 68, Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025], science, and coding [4, Opencodeinstruct: A large-scale instruction tuning dataset for code llms, 2025a; 3, Opencodereasoning: Advancing data distillation for competitive coding, 2025b; 59, Genetic instruct: Scaling up synthetic generation of coding instructions for large language models, 2024] data, we generated responses with the open-weight DeepSeek-R1-0528 model [24, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025a], using the same prompts as for training the Nemotron-H-8B and 47B reasoning models [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025]. This training data has been released as part of Nemotron-Post-Training-Dataset-v1.

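Tool-calling data. The tool-calling dataset contains single-turn, multi-turn, and multi-step conversations. For the single-turn case, we sampled prompts from xlam-function-calling-60k, glaive-function-calling-v2, and NVIDIA-When2Call [95, When2Call: When (not) to call tools, 2025] and generated responses with Qwen3-235B-A22B. Inspired by ToolACE [55, Toolace: Winning the points of llm function calling, 2024] and APIGen-MT [90, Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay, 2025], we extended this to multi-turn and multi-step scenarios through simulated conversations in which Qwen3-235B-A22B plays the user agent, the assistant agent, and the API server agent. The user agent reviews the available tools, poses challenging queries, interacts when addressed by the assistant agent, and finally judges whether the task succeeded. Each instance is paired with a random persona from Nemotron-Personas to diversify the queries.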

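Agent roles in tool calling. The assistant agent receives the initial query and the available tools, carries out the task by calling tools, interprets their responses, and interacts with the user agent in single-turn, multi-turn, or multi-step scenarios. The API server agent plays a mock API server that checks parameters and returns either valid output or an error message depending on correctness. A lightweight rule-based tool-call verification layer further improves reliability by ensuring outputs are consistent and verifiable; only successful trajectories are kept.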

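Multilingual data. Our multilingual synthetic post-training data was built by translating existing English post-training data. To address LLM hallucination and quality degradation on long inputs when generating synthetic translations, we implemented a robust quality-assurance pipeline. Inputs are translated line by line to manage complexity, and untranslatable content such as code is skipped. We also enforce a strict bracketed format for reliable extraction and use language identification to filter out off-target translations, ensuring high quality in the final output.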

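Conversational data. For conversational data, we used prompts from the LMSYS dataset [125, Judging llm-as-a-judge with mt-bench and chatbot arena, 2023] and generated responses with the Qwen3-235B-A22B reasoning model [121, Qwen3 technical report, 2025]. We also incorporated prompts from HelpSteer2 and HelpSteer3 and generated responses with the same model. In addition, we used a subset of about 550k prompts from WildChat1M [50, Wildchat: 1m chatgpt interaction logs in the wild, 2024b], again generating reasoning responses with Qwen3-235B-A22B. We also included multi-turn conversations with Deepseek R1, using the multi-turn prompts from NVIDIA (2025) [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025].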

Safety data. We used a mix of harmful and benign prompts from Nemotron Content Safety Dataset V2 [28, AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails, 2025], HarmfulTasks [31, Pruning for protection: Increasing jailbreak resistance in aligned llms without fine-tuning, 2024], RedTeam2K [57, Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks, 2024], and gretel-v1 [1, Gretel synthetic safety alignment dataset, 2024]. Responses were generated with DeepSeek-R1-0528. To ensure safety, we used a two-step approach: initial prompting, followed by filtering with a guardrail model to verify that outputs remain safe.

Table 7 | Distribution of post-training data across domains for our SFT stages.

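Post-training

Stage 1 SFT. As shown in Figure 4, we used three distinct supervised fine-tuning stages. Stage 1 used the full dataset described in Section 3.1, augmented with roughly 10% of prompts paired with outputs whose reasoning traces were stripped. This exposes the model to "empty" traces so that it can produce direct answers with reasoning mode off. To improve efficiency and preserve the long-context ability acquired in pre-training, we concatenated samples into sequences of roughly 128k tokens, which reduces padding overhead and encourages long-range learning.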

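Stage 2 SFT. Stage 2 targets tool calling. Although Stage 1 improved most benchmarks, tool-calling accuracy dropped. We attribute this to sample concatenation at 128k length, which may have disrupted learning of tool-calling patterns. Stage 2 was therefore trained without concatenation, using the full tool-calling dataset together with a representative subsample of the other domains.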

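Stage 3 SFT. Stage 3 strengthens long-context ability. It incorporates long-context data prepared following the recipe used for Nemotron-H [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025], along with augmented samples across domains in which the reasoning trace is abruptly truncated to 1-2k tokens while the final answer is kept. This truncation strategy makes the model more robust under varying inference-time thinking budgets.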

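IFEval RL. To improve instruction following, we sampled 16,000 prompts from the LMSYS chat dataset and augmented them with IFEval-style instructions. A rule-based verifier scores outputs according to how well they satisfy each instruction, producing a reward signal that prioritizes precise instruction following. The IFEval RL experiments substantially improved IFEval performance, while the other benchmarks fluctuated slightly, so checkpoints had to be selected carefully.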

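DPO. In a separate branch of training, we applied the DPO algorithm to improve tool calling. We evaluated with the BFCL v3 benchmark, which extends BFCL v2 with greater emphasis on multi-step (multiple tool calls to achieve one goal) and multi-turn (multiple user-agent interactions) scenarios. To strengthen these capabilities in the Nano V2 aligned model, we used the WorkBench environment, a multi-step verifiable tool-calling setup adapted from Styles [106, Workbench: a benchmark dataset for agents in a realistic workplace setting, 2024]. In each WorkBench task the model must issue a sequence of tool calls over multiple steps, and correctness is verified by comparing database states.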

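Reinforcement learning through DPO. Nano V2 is reinforced in this environment through iterative stages of Direct Preference Optimization. For each candidate checkpoint from the long-context stage, we generate on-policy data for every WorkBench prompt, consisting of positive examples (successful tool calls) and negative examples (failed generations). This keeps the iterative DPO on-policy.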

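RLHF. We used the Arena-Hard benchmark to assess overall helpfulness and chat ability. To improve performance on it, we trained candidate checkpoints from the SFT stage with the GRPO algorithm on English-only contexts from HelpSteer3 [116, Helpsteer3-preference: Open human-annotated preference data across diverse tasks and languages, 2025]. During training we generated responses both with and without thinking traces and used a Qwen-based reward model to judge the rollouts.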

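Model merging. During training we observed a trade-off between reasoning ability and chat ability. To address it, we used checkpoint interpolation [117, Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022] to merge an RL checkpoint with strong reasoning ability and an RL checkpoint with strong chat ability. Checkpoint interpolation linearly interpolates the model weights: $(1 - \alpha) \cdot \text{weights}_1 + \alpha \cdot \text{weights}_2$. We swept $\alpha$ from 0.1 to 0.9 in increments of 0.1 and found that values around 0.5 give a good trade-off.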
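The interpolation formula translates directly into code. The sketch below represents each checkpoint as a dictionary of NumPy arrays; the dictionary layout and function name are assumptions for illustration, not the format of any particular framework.

```python
import numpy as np

def interpolate_checkpoints(weights_1, weights_2, alpha):
    """Return (1 - alpha) * weights_1 + alpha * weights_2 for every parameter tensor."""
    assert weights_1.keys() == weights_2.keys(), "checkpoints must share parameter names"
    return {name: (1.0 - alpha) * weights_1[name] + alpha * weights_2[name]
            for name in weights_1}

# The sweep described above: alpha from 0.1 to 0.9 in steps of 0.1.
alpha_sweep = [round(0.1 * i, 1) for i in range(1, 10)]

# Toy checkpoints standing in for the reasoning-strong and chat-strong models.
reasoning_ckpt = {"layer.weight": np.array([1.0, 2.0]), "layer.scale": np.array([0.0])}
chat_ckpt = {"layer.weight": np.array([3.0, 6.0]), "layer.scale": np.array([1.0])}
merged = interpolate_checkpoints(reasoning_ckpt, chat_ckpt, alpha=0.5)
```

In practice each value of `alpha_sweep` would yield one merged candidate to evaluate on both reasoning and chat benchmarks.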

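Pruning and Distillation

This section describes the pruning and distillation process that compresses the aligned 12B model into the Nano 2 model, with the goal of running longer-context inference (128k sequence length) on an NVIDIA A10G GPU. Note that storing the weights of a 12B-parameter model in bfloat16 alone requires 22.9 GiB, which exceeds the 22 GiB memory capacity of the A10G GPU; compression is therefore necessary.

Extending the Minitron compression framework. Our compression strategy builds on Minitron [69, Compact Language Models via Pruning and Knowledge Distillation, 2024; 104, LLM Pruning and Distillation in Practice: The Minitron Approach, 2024; 109, Efficient hybrid language model compression through group-aware ssm pruning, 2025], a lightweight model-pruning framework for LLMs. Minitron was originally designed to compress pre-trained base models to a user-defined parameter budget; in this work we extend it to compress reasoning models while also accounting for the memory constraint above and throughput-based objectives.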

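Importance Estimation

Lightweight importance estimation. In the importance-estimation stage, we collect importance or sensitivity scores for each model component (such as layers and FFN neurons) and use them to decide which components can be pruned. Sensitivity analysis based on gradient information is usually impractical at the scale of modern LLMs [69, Compact Language Models via Pruning and Knowledge Distillation, 2024], so we rely on a lightweight strategy that uses only forward passes. In this work we use a simplified approach that performed well in ablations: a) pruning layers, and b) pruning FFN hidden dimensions (effectively neurons) and embedding channels. We also experimented with pruning Mamba heads; unfortunately, pruning along this axis caused severe accuracy degradation. We now describe how importance is computed for each layer, embedding channel, FFN neuron, and Mamba head.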

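Layer importance. We compute layer importance iteratively: for each candidate layer, we temporarily remove it from the model and compute the mean squared error (MSE) between the logits of the original model and the logits of the pruned model. This MSE reflects the layer's contribution to the model's predictions: a lower value means a smaller impact. At each pruning step we remove the layer with the lowest MSE, since it affects the final output the least, and we repeat until the desired depth is reached. This ensures that pruning removes first the layers whose absence changes model behavior least. For more details on MSE-based iterative layer importance, see NVIDIA (2025) [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025].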
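The iterative procedure can be sketched on a toy residual network. Everything about the toy model below (residual `tanh` layers, a linear output head, random weights) is a stand-in chosen for illustration; only the loop structure, removing at each step the layer whose removal gives the lowest logit MSE against the original model, follows the description above.

```python
import numpy as np

def forward(x, layers, w_out, skipped=frozenset()):
    """Toy residual stack: each kept layer adds tanh(x @ W); returns logits."""
    for index, weight in enumerate(layers):
        if index not in skipped:
            x = x + np.tanh(x @ weight)
    return x @ w_out

def iterative_layer_pruning(x, layers, w_out, target_depth):
    """Greedily remove layers until target_depth remain; returns removed indices."""
    removed = set()
    reference_logits = forward(x, layers, w_out)  # logits of the original model
    while len(layers) - len(removed) > target_depth:
        scores = {}
        for candidate in range(len(layers)):
            if candidate in removed:
                continue
            logits = forward(x, layers, w_out, skipped=removed | {candidate})
            scores[candidate] = float(np.mean((reference_logits - logits) ** 2))
        removed.add(min(scores, key=scores.get))  # lowest MSE = least important
    return sorted(removed)

rng = np.random.default_rng(0)
toy_layers = [rng.standard_normal((8, 8)) * 0.5 for _ in range(4)]
toy_layers[2] = np.zeros((8, 8))  # a layer that contributes nothing
calibration = rng.standard_normal((16, 8))
head = rng.standard_normal((8, 5))
pruned = iterative_layer_pruning(calibration, toy_layers, head, target_depth=3)
```

Because scores are recomputed after every removal, the ranking adapts to the layers already pruned rather than relying on a single one-shot ordering.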

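FFN and embedding channel importance. An FFN layer consists of two linear operators and a non-linear activation:

$$ \text{FFN}(\mathbf{X}) = \delta \left( \mathbf{X} \cdot \mathbf{W}_{1}^{T} \right) \cdot \mathbf{W}_{2}. $$

Here $X$ denotes the input, and $W_1$ and $W_2$ are the two weight matrices of the FFN layer, with $W_1, W_2 \in \mathbb{R}^{d_{ffn} \times d_{model}}$, where $d_{model}$ and $d_{ffn}$ are the model hidden dimension and the FFN hidden dimension respectively. $\delta(\cdot)$ is the non-linear activation function (squared ReLU in this work).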

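Neuron importance computation. Following the same procedure as Minitron [69, Compact Language Models via Pruning and Knowledge Distillation, 2024], we compute the importance of each neuron in the first linear operator of every FFN layer by examining the set of outputs it produces. We use a small calibration dataset of 1024 samples for this purpose. Formally, the importance score of each neuron is obtained by aggregating its outputs over a given input batch $B$:

$$F_{\text{neuron}}^{(i)} = \sum_{\text{B,S}} \delta\left(\mathbf{X}(\mathbf{W}_1^i)^T\right).$$

Here $W_1^i$ is the $i$-th row of the weight matrix $W_1$, and $\sum_{B,S}$ denotes aggregation along the batch and sequence dimensions. Following the observations in the Minitron paper, we use the mean and l2-norm aggregation functions along the batch and sequence dimensions. For a sequence of scores $S$, mean aggregation is defined as $\frac{1}{N}\sum_{i=1}^{N}|S_i|$ and the l2 norm is $\sqrt{\sum_{i=1}^{N}S_i^2}$. Embedding channel importance is computed analogously by examining the outputs of the LayerNorm layers instead of the FFN layers; see Muralidharan et al. (2024) [69, Compact Language Models via Pruning and Knowledge Distillation, 2024] for details.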
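As a rough illustration of activation-based neuron scoring, the NumPy sketch below applies squared ReLU to the first FFN projection and aggregates the result. The text does not pin down which aggregator goes with which dimension, so applying the mean over the batch and the l2 norm over the sequence is an assumption here, as are the function names.

```python
import numpy as np

def neuron_importance(x, w1):
    """Score each FFN neuron from its activations.

    x:  (batch, seq, d_model) calibration inputs
    w1: (d_ffn, d_model) first FFN weight matrix
    Returns an array of shape (d_ffn,).
    """
    activations = np.maximum(x @ w1.T, 0.0) ** 2         # squared ReLU, (batch, seq, d_ffn)
    batch_mean = np.abs(activations).mean(axis=0)        # mean over batch (assumed order)
    return np.sqrt((batch_mean ** 2).sum(axis=0))        # l2 norm over sequence

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, in ascending index order."""
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
inputs = rng.standard_normal((4, 6, 8))
first_proj = rng.standard_normal((5, 8))
first_proj[3] = 0.0  # a dead neuron, which should receive the lowest score
scores = neuron_importance(inputs, first_proj)
kept = top_k_neurons(scores, k=4)
```

Pruning to `k` neurons then means keeping the rows `kept` of $W_1$ and the matching rows of $W_2$.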

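Mamba importance. A Mamba layer processes its input through several projection matrices ($x_{proj}$, $dt_{proj}$, $A_{proj}$, $B_{proj}$, $C_{proj}$) that produce intermediate representations ahead of the causal convolution and the selective state space model (SSM) update, followed by gated normalization and an output projection ($v_{proj}$). We follow the importance-estimation method of Taghibakhshi et al. (2025) [109, Efficient hybrid language model compression through group-aware ssm pruning, 2025]: on a small calibration dataset of 1024 samples, we apply a nested activation-based scoring strategy similar to FFN importance estimation but adapted to Mamba's group-aware structure. First, we obtain activation scores from the $v_{proj}$ projection, denoted $s \in \mathbb{R}^{m_h \times m_d}$, where $m_h$ is the number of Mamba heads and $m_d$ is the Mamba head channel dimension. For each channel $d$, the score is computed as:

$$s_d = \left\| \sum_{\mathbf{B},\mathbf{S}} s_{:,d} \right\|_2$$

where aggregation is over the batch (B) and sequence (S) dimensions, using both the mean and l2-norm metrics. Next, head scores are computed with the l2 norm over each Mamba head's set of channels: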

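$$ f_h = \left\| s_{h,:} \right\|_2, \quad \forall h \in \{1, \dots, m_h\}, $$

and heads are ranked within each Mamba group $G_g$ to preserve the semantics of group-aware computation:

$$ \mathcal{R}_g = \operatorname{argsort}_{h \in G_g} (f_h). $$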

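This ensures that pruning decisions respect the model's structural constraints and the sequence-modeling capability of the SSM. The lowest-scoring heads are pruned by trimming the corresponding rows from all affected projection, convolution, and SSM parameter matrices. This strategy removes less important Mamba heads while preserving the integrity of the SSM block. As shown by Taghibakhshi et al. (2025) [109, Efficient hybrid language model compression through group-aware ssm pruning, 2025], pruning Mamba heads gives a better accuracy-throughput trade-off than pruning head channels, so we focus on head pruning in this work.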
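The head-scoring and within-group ranking steps can be sketched as follows, starting from a per-channel score matrix of shape (heads, head channels). The assumption that heads are assigned to groups in contiguous blocks, and the function name, are illustrative choices not stated in the text.

```python
import numpy as np

def rank_heads_within_groups(channel_scores, num_groups):
    """Rank Mamba heads inside each group from least to most important.

    channel_scores: (m_h, m_d) activation scores per head and head channel.
    Returns one array of head indices per group, lowest head score first.
    """
    head_scores = np.linalg.norm(channel_scores, axis=1)  # l2 norm over each head's channels
    num_heads = channel_scores.shape[0]
    heads_per_group = num_heads // num_groups             # assumes contiguous, equal-size groups
    rankings = []
    for g in range(num_groups):
        group_heads = np.arange(g * heads_per_group, (g + 1) * heads_per_group)
        rankings.append(group_heads[np.argsort(head_scores[group_heads])])
    return rankings

toy_scores = np.array([[3.0, 4.0],   # head 0, score 5
                       [0.0, 1.0],   # head 1, score 1
                       [1.0, 0.0],   # head 2, score 1
                       [6.0, 8.0]])  # head 3, score 10
toy_rankings = rank_heads_within_groups(toy_scores, num_groups=2)
```

Pruning would then drop the first entries of each group's ranking, so every group loses the same number of heads and the group structure stays intact.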

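Lightweight Neural Architecture Search

Constraints and objectives. We first define the constraints and objectives for the Nano 2 model and then describe our lightweight neural architecture search (NAS) framework, which finds the most promising architecture candidates that satisfy them.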

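Memory constraint. Memory requirements during inference have two distinct components with different scaling behavior. Parameter memory is large but constant, independent of input size. Key-value (KV) cache memory, by contrast, scales linearly with batch size and sequence length and often dominates in long-sequence scenarios. For the Nano 2 model, our goal is to run inference with a sequence length of 128k and a batch size of at least 1 within a memory budget of 19.66 GiB. We arrive at this budget as follows: from the 22.06 GiB of memory available on an NVIDIA A10G GPU, we subtract 5% as a buffer for frameworks such as vLLM and TensorRT-LLM, and a further 1.3 GiB to leave room for a vision encoder.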
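The budget arithmetic, and the linear growth of the KV cache, can be checked with a few lines. The budget inputs come from the text; the KV-cache helper additionally assumes a head dimension of 128 (hidden size 5120 divided by 40 query heads, which the text does not state), bfloat16 storage, and 128k = 131,072 tokens, and it ignores Mamba state memory.

```python
GIB = 2 ** 30

def memory_budget_gib(total_gib=22.06, framework_buffer=0.05, vision_encoder_gib=1.3):
    """Available memory after the framework buffer and the vision-encoder reserve."""
    return total_gib * (1.0 - framework_buffer) - vision_encoder_gib

def kv_cache_gib(num_attention_layers, num_kv_heads=8, head_dim=128,
                 seq_len=131072, batch_size=1, bytes_per_value=2):
    """KV-cache size: keys and values for every attention layer, KV head, and token."""
    values = 2 * num_attention_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / GIB

budget = memory_budget_gib()            # about 19.66 GiB
cache_4_layers = kv_cache_gib(4)        # 2 GiB under the stated assumptions
cache_6_layers = kv_cache_gib(6)        # 3 GiB under the stated assumptions
```

Under these assumptions each attention layer costs 0.5 GiB of cache at 128k context, which shows why the number of attention layers matters so much for the budget.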

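Throughput measurement. Unless stated otherwise, throughput in the following experiments is measured with input and output sequence lengths of 8k and 16k tokens respectively, which we consider representative of a typical reasoning scenario. For this combination, we report vLLM's output-token generation throughput at the largest batch size that fits on an A10G GPU.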

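Candidate Enumeration

Multi-axis combinatorial pruning. Our compression strategy explores several axes within the 19.66 GiB memory budget through combinatorial pruning. The search space includes depth reduction (removing 6-10 layers from the original 62-layer architecture) combined with width pruning of embedding channels (4480-5120), FFN dimension (13440-20480), and Mamba heads (112-128). This multi-axis search space yields hundreds of candidate architectures that satisfy the memory constraint.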

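Finding the Best Architecture

Two-step optimization. Because running knowledge distillation and throughput benchmarks on every candidate architecture is too expensive, we split the problem into two parts: (1) find the best depth for the compressed model, and (2) given that depth, find the best width-pruned architecture.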

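Effect of depth. We compared the accuracy of three candidates obtained from the 12B model by depth pruning, with 52, 54, and 56 layers. To balance KV cache size against long-context performance, we fixed the number of attention layers at 4 for all three variants; prior work suggests that an attention-to-total-layer ratio of 7-8% is reasonable [84, Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025]. Width dimensions were left unchanged in this experiment. Table 9 lists the average reasoning accuracy at each depth after distillation on 6 billion tokens. Consistent with the strong correlation between depth and task performance observed previously [69, Compact Language Models via Pruning and Knowledge Distillation, 2024; 104, LLM Pruning and Distillation in Practice: The Minitron Approach, 2024], reducing depth below 56 layers causes a significant accuracy drop, so we fixed the depth at 56 layers for further width pruning.

Table 9 | Effect of depth on reasoning accuracy. Results are after distillation on 6 billion tokens.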

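Combining depth and width pruning. As described above, we fixed the depth of the target model at 56 layers, 4 of which are attention layers. We distilled this checkpoint on 60 billion tokens (see Section 4.3) and then applied width pruning along the embedding, FFN, and Mamba axes. We enumerated all candidate pruned architectures that satisfy our memory budget and sorted them in descending order of estimated memory consumption at 128k context length and batch size 1. The top 3 candidates from this list were selected for further evaluation: after depth + width pruning, we ran a short knowledge distillation (KD) of 19 billion tokens on each, and we also benchmarked their throughput to choose the final architecture. Table 10 lists the details of the top 3 candidates along with their task performance (after KD) and throughput. As the table shows, Candidate 2 achieves the best accuracy while retaining reasonable runtime performance, so we chose this architecture for Nano 2.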

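Trade-off between FFN and Mamba pruning. Following Taghibakhshi et al. (2025) [109, Efficient hybrid language model compression through group-aware ssm pruning, 2025], we ran an ablation on the number of Mamba heads, considering configurations that keep 87.5% and 93.75% of the original heads. However, because the compression ratio explored in this work is relatively small (under 15% after depth pruning) compared with Taghibakhshi et al. (2025) (about 50%), we found that Mamba head pruning brings limited benefit. In this regime, pruning only the FFN and embedding dimensions after depth pruning is enough to reach the desired compression while preserving accuracy. Candidates 1 and 2 in Table 10 highlight this difference.

Table 10 | Top 3 candidates for architecture selection. Accuracy is the average over reasoning benchmarks after distillation on 19B tokens. The last column shows vLLM output generation throughput (ISL/OSL = 8k/16k, batch size = 8).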

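Retraining with Distillation

Logit-based knowledge distillation. To recover the accuracy lost to pruning, the model must undergo continued training. Recent work shows that distilling knowledge from the original model into the pruned model outperforms conventional fine-tuning [69, Compact Language Models via Pruning and Knowledge Distillation, 2024; 104, LLM Pruning and Distillation in Practice: The Minitron Approach, 2024; 12, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, 2024]. We therefore use logit-based distillation for continued training and rely exclusively on the forward KL divergence loss during accuracy recovery (see Section 3 of the Minitron paper [69, Compact Language Models via Pruning and Knowledge Distillation, 2024] for details of the distillation loss formulation). Based on the candidate selection process described in Section 4.2, we ran an extended phase of continued training on Candidate 2, detailed below, to produce the final Nano 2 reasoning and base models.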
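A forward KL distillation loss compares the teacher's and student's next-token distributions at each position, with the teacher distribution as the reference. The NumPy sketch below shows the computation on raw logits; averaging over tokens and the absence of a temperature are simplifying assumptions, and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def forward_kl_loss(teacher_logits, student_logits):
    """Mean over tokens of KL(p_teacher || p_student)."""
    log_p = log_softmax(teacher_logits)   # teacher (original model)
    log_q = log_softmax(student_logits)   # student (pruned model)
    per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_token.mean())

rng = np.random.default_rng(0)
teacher = rng.standard_normal((2, 5, 11))            # (batch, seq, vocab)
student = teacher + 0.5 * rng.standard_normal((2, 5, 11))
loss = forward_kl_loss(teacher, student)
```

The loss is zero only when the student reproduces the teacher's distribution at every position, so it penalizes the student for assigning low probability to tokens the teacher considers likely.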

Table 11 | Effect of varying the proportion of reasoning data on math accuracy after knowledge distillation on roughly 6 billion tokens.

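Reasoning model. The reasoning model is distilled in stages with progressively longer sequence lengths to strengthen extended reasoning and long-context ability, followed by targeted reinforcement learning (RL), preference optimization, and model merging to retain desired behaviors and ensure robustness across tasks. The stages are:
1. Depth pruning to 56 layers; knowledge distillation (KD) on about 60 billion tokens at sequence length 8,192.
2. Width pruning and KD on:
* about 50 billion tokens at sequence length 8,192.
* about 25 billion tokens at sequence length 49,152.
* about 1 billion tokens at sequence length 262,144.
3. Direct Preference Optimization (DPO).
4. Group Relative Policy Optimization (GRPO).
5. KD on about 0.4 billion tokens at sequence length 262,144 to recover post-RL drops.
6. RLHF for alignment with human preferences.
7. Model merging between steps 5 and 6 via 0.5 linear interpolation.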

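Effect of staged training. More details on DPO, GRPO, and RLHF can be found in Section 3. Figure 6 shows the effect of staged training on model accuracy across reasoning benchmarks. The $x$-axis shows the stages (starting from step 2 above) and the $y$-axis shows the score on each benchmark as training progresses. As shown, DPO and GRPO are essential for improving function calling (BFCL v3) and instruction following (IFEval), although the latter temporarily lowers multi-task understanding (MMLU-Pro), which is recovered in the next step (KD after GRPO). Finally, RLHF improves alignment with human preferences (Arena-Hard) but causes additional benchmark drops, which are then recovered through model merging.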

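Datasets. We observed that a mix of 70% post-training stage 2 data (Section 3.2) and 30% pre-training data (Section 2.2) gives the highest accuracy (Table 11). For KD at sequence length 262,144, we used 100% stage 3 post-training data (Section 3.2).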

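Base model. Distillation proceeds in stages: first depth-only pruning with KD on about 120 billion tokens, then width pruning with KD on about 360 billion tokens (both at sequence length 8,192), and finally KD on about 2.5 billion tokens at sequence length 524,288 to instill long-context ability.

Datasets. Following Sreenivas et al. (2024) [104, LLM Pruning and Distillation in Practice: The Minitron Approach, 2024], we distilled the base model using 100% of the pre-training data described in Sections 2.2 and 2.6 at sequence lengths 8,192 and 524,288 respectively.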

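Experimental Setup

Datasets:
- Math reasoning: GSM8K, MATH (including MATH Level 5), AIME-2024, AIME-2025, GPQA-Diamond.
- Code: HumanEval+, MBPP+.
- General/commonsense reasoning: MMLU, MMLU-Pro, AGIEval English CoT, OpenBookQA, PIQA, Hellaswag, Winogrande, ARC-Challenge.
- Multilingual: MGSM, Global MMLU-Lite.
- Long context: RULER (128k).
- Instruction following: IFEval.
- Tool calling: BFCL v3.
- Scientific coding: SciCode.
- Comprehensive reasoning: Humanity’s Last Exam.
- Chat: ArenaHard.
Model architecture:
- Nemotron-Nano-12B-v2-Base: 62-layer hybrid Mamba-Transformer architecture, model dimension 5120, FFN dimension 20480, 40 query heads, 8 KV heads.
- Nemotron-Nano-9B-v2 (final model): pruned from the 12B model, keeping 56 layers, with embedding channels reduced from 5120 to 4480 and FFN intermediate size reduced from 20480 to 15680.
Hardware:
- Inference/deployment target: a single NVIDIA A10G GPU with 22 GiB of memory.
- Training: the training hardware is not stated explicitly, but the use of 8-way tensor model parallelism and 16-way context parallelism indicates that training ran on a large GPU cluster.
Software:
- Evaluation framework: built on lm-evaluation-harness, with modifications. Math evaluations are scored with Math-Verify. Code tasks use the EvalPlus variants.
- Inference framework: throughput is measured with vLLM. Frameworks such as vLLM and TensorRT-LLM are mentioned.
- Data processing: the NeMo-Curator framework is used for deduplication.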

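Experimental Results

Base Model Evaluation

Nemotron-Nano-12B-v2-Base and its pruned 9B version were compared with Qwen3-8B Base and Gemma3-12B Base on multiple benchmarks.

General and math ability (Table 5): The 12B base model outperforms Qwen3 and Gemma3 on several key benchmarks, including MMLU, MMLU-Pro, GSM8K, MATH, and AIME 2024. The pruned 9B base model retains its advantage over Qwen3-8B on most tasks and stands out on math and code. For example, on MATH Level 5 the 9B model scores 63.64, far above Qwen3's 29.91.
Multilingual ability (Table 6): On Global-MMLU-Lite, Qwen3-8B Base has the highest average score. On the multilingual math reasoning task MGSM, however, both the 9B and 12B Nemotron-Nano base models clearly outperform Qwen3 and Gemma3, with the 9B model achieving the best results in several languages, including Spanish, German, and French.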

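Table 5 | Accuracy of Nemotron-Nano-V2-Base models compared with existing SoTA models. N-Nano-V2 is short for Nemotron-Nano-V2. The distilled N-Nano-V2-9B-Base is compared with Qwen3-8B-Base and Gemma3-12B-Base, and the best score in each row is highlighted.

Table 6 | Accuracy of Nemotron-Nano-V2-Base models compared with existing SoTA models on multilingual benchmarks. N-Nano-V2 is short for Nemotron-Nano-V2. The distilled N-Nano-V2-9B-Base is compared with Qwen3-8B-Base and Gemma3-12B-Base, and the best score in each row is highlighted.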

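Evaluation of Aligned and Pruned Models

Reasoning ability and throughput (Figure 1 and Section 4.4): The final Nemotron-Nano-9B-v2 model was compared end to end with Qwen3-8B. Nemotron achieves comparable or better accuracy on complex reasoning tasks such as AIME24/25, GPQA-D, LiveCodeBench, BFCLv3, and RULER 128K. In generation-heavy settings (8k input / 1k output and 8k input / 16k output), its throughput is 3.3x and 6.3x that of Qwen3-8B respectively, improving both accuracy and efficiency.
Aligned model performance (Table 8): After alignment, the 12B Nemotron model was compared with Qwen3-8B and Qwen3-14B on multiple reasoning benchmarks. Nemotron-Nano-v2-12B outperforms both Qwen3 models on AIME, MATH-500, GPQA-DIAMOND, LiveCodeBench, and RULER @ 128K, demonstrating strong reasoning and long-context ability.

Table 8 | Evaluation results with "reasoning mode" on across reasoning and general-capability benchmarks (Nemotron-Nano-v2-12B, Qwen3-8B, and Qwen3-14B).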

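Per-Stage Analysis of the Distillation Pipeline (Figure 6)

The distillation pipeline changes model performance on different tasks as it progresses.
- During the initial knowledge distillation (KD+LCExt) and Direct Preference Optimization (DPO) stages, performance is stable or improves on most benchmarks.
- The Group Relative Policy Optimization (GRPO) stage substantially improves instruction following (IFEval) but temporarily lowers MMLU-Pro.
- The subsequent KD stage recovers MMLU-Pro performance.
- The RLHF stage improves chat ability (ArenaHard) but lowers some other benchmarks.
- The final model-merging (Merge) step balances the capabilities and recovers the losses caused by RLHF, giving the best overall performance.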

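Thinking Budget Control Evaluation (Figure 5)

By introducing truncated reasoning traces in the SFT stage, the model learns to reason under a limited budget of "thinking tokens".
- Before truncation training (Figure 5a): when the budget is constrained, accuracy drops sharply, and the model tries to "compensate" for the restricted thinking by spending more tokens in the final answer. Well-formedness of the output also degrades markedly at low budgets.
- After truncation training (Figure 5b): the model shows good budget control. Even at lower budgets, accuracy declines smoothly rather than abruptly. The "compensation" effect disappears, and the model produces well-formed answers at every budget. This indicates that the model adapts effectively to different inference-time limits.

Figure 5 | Budget control before (a) and after (b) truncation training. In all plots the x-axis shows the budget allocated to thinking tokens.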

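Conclusion

This report introduced Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer reasoning model. Compared with existing SOTA models such as Qwen3-8B, it achieves comparable or better accuracy with up to 6x higher inference throughput. Nemotron-Nano-9B-v2 was created by first pre-training Nemotron-Nano-12B-v2-Base on 20 trillion tokens with a carefully constructed mix of curated and synthetic data. We then aligned Nemotron-Nano-12B-v2-Base through multi-stage SFT, GRPO, DPO, and RLHF, and finally applied the Minitron compression strategy (pruning and distillation) to produce the final model. Thanks to this compression, Nemotron-Nano-9B-v2 can process contexts of up to 128k tokens in bfloat16 precision on a single NVIDIA A10G GPU with 22 GiB of memory. We have released Nemotron-Nano-9B-v2 on HuggingFace together with its sibling Nemotron-Nano-9B-v2-Base and the parent model Nemotron-Nano-12B-v2-Base, along with most of the pre-training and post-training data.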

Appendix

A. Permissive Source Code Licenses

License list. We removed source code whose license is not on this list:

3Com Microcode 3com-microcode, 3D Slicer License 1.0 [3dslicer-1.0], 4Suite 1.1 [4suite-1.1], AAL [attribution], Abstyles License [abstyles], ACE TAO License [ace-tao], AdaCore Doc License [adacoredoc], ADI BSD [adi-bsd], Adobe Glyph License [adobe-glyph], Adobe Postscript AFM License [apafml], Adobe Source Code License 2006 [adobe-scl], AES-128 3.0 License [aes-128-3.0], AFL 1.1 [afl-1.1], AFL 1.2 [afl-1.2], AFL 2.0 [afl-2.0], AFL 2.1 [afl-2.1], AFL 3.0 [afl-3.0], afmparse License [afmparse], Agere BSD [agere-bsd], Alexisisaac Freeware License [alexisisaac-freeware], Allegro 4 License [allegro-4], Altera License [xnet], Amazon Digital Services License [adsl], AMD Historical License [amd-historical], AMD PLPA License [amdplpa], AMPAS BSD-Style License [ampas], AMSFonts license [ams-fonts], Andre Adrian DFS license [adrian], ANTLR-PD [antlrpd], ANTLR-PD with fallback [antlr-pd-fallback], ANU License [anu-license], Apache 1.0 [apache1.0], Apache 1.1 [apache-1.1], Apache 2.0 [apache-2.0], Apache Patent Provision Exception Terms [apache-patent-exception], App::s2p License [app-s2p], Apple Attribution 1997 [apple-attribution1997], Apple Attribution License [apple-attribution], Apple Example Code License [apple-excl], Apple MIT License [aml], Apple Sample Source Code License [apple-sscl], Aravindan Premkumar Licenase [aravindan-premkumar], ArgoUML License [argouml], ARM LLVM Grant [arm-llvm-sga], Array Input Method Public License [array-input-method-pl], Artistic 1.0 [artistic-1.0], Artistic 1.0 w/clause 8 [artistic-1.0-cl8], Artistic 2.0 [artistic-2.0], Artistic-Perl-1.0 [artistic-perl-1.0], ASMUS License [asmus], ASN.1 Object Dumping Code License [asn1], Atkinson Hyperlegible Font License [atkinson-hyperlegible-font], Baekmuk Fonts License [baekmuk-fonts], Bahyph License [bahyph], BaKoMa Fonts Licence 1995 [bakoma-fonts-1995], Barr TeX License [barr-tex], BEA 2.1 [bea-2.1], Beal Screamer License [beal-screamer], Beer-Ware License [beerware], BERI Hardware-Software 
License v1.0 [beri-hw-sw-1.0], BigDigits License [bigdigits], Bigelow & Holmes Lucida Fonts License [bigelow-holmes], Biopython License [biopython], Bitstream Vera Font License [bitstream], Bitzi-PD [bitzi-pd], BLAS License 2017 [blas-2017], Blue Oak Model License 1.0.0 [blueoak-1.0.0], BOHL0.2 [bohl-0.2], Boost 1.0 [boost-1.0], Boost Original [boost-original], Borceux License [borceux], Boutell libgd declarations 2021 [boutell-libgd-2021], http://bpmn.io License [bpmn-io], Brent Corkum License [brent-corkum], Brian Clapper License [brian-clapper], Brian Gladman 3-Clause License [brian-gladman-3-clause], Brian Gladman Dual BSD-GPL [brian-gladman-dual], Brian Gladman License [brian-gladman], Broadcom CFE License [broadcom-cfe], Broadcom Warranty Disclaimer [broadcom-linux-timer], Brocade Firmware License [brocade-firmware], Bruno Podetti License [brunopodetti], BSD 1988 [bsd-1988], BSD 3-Clause Devine [bsd-3-clause-devine], BSD 3-Clause FDA [bsd-3-clause-fda], BSD 3-Clause jtag [bsd-3-clause-jtag], BSD 3-Clause No Change [bsd-3-clause-nochange], BSD 3-Clause No Nuclear Warranty [bsd-3-clause-no-nuclear-warranty], BSD 3-Clause no trademark [bsd-3-clause-no-trademark], BSD 3-Clause Open MPI variant [bsd-3-clause-open-mpi], BSD 3-Clause Sun [bsd-3-clause-sun], BSD 3-Clause with GPL reference [bsd-top-gpl-addition], BSD Acknowledgment (Carrot2) License [bsd-ack-carrot2], BSD Acknowledgment License [bsdack], BSD Advertising Acknowledgement License [bsd-advertising-acknowledgement], BSD Artwork [bsd-artwork], BSD Atmel License [bsd-atmel], BSD DPT [bsd-dpt], BSD plus modification notice [bsd-plus-mod-notice], BSD Simplified Darwin [bsd-simplified-darwin], BSD Source Code Attribution [bsd-source-code], BSD Unchanged [bsd-unchanged], BSD Unmodified [bsd-unmodified], BSD Zero Clause License [bsd-zero], BSD-1-Clause [bsd-1-clause], BSD-1-Clause Build [bsd-1-clause-build], BSD-2-Clause [bsd-simplified], BSD-2-Clause no disclaimer [bsd-no-disclaimer], BSD-2-Clause no disclaimer 
Unmod [bsd-no-disclaimer-unmodified], BSD-2-Clause Plus Patent [bsd-plus-patent], BSD-2-Clause-plus-advertizing [bsd-2-clause-plus-advertizing], BSD-2-Clause-Views [bsd-2-clause-views], BSD-3-Clause [bsd-new], BSD-3-Clause tcpdump variant [bsd-new-tcpdump], BSD-3-Clause without notice modification [bsd-new-nomod], BSD-3-Clause X11 disclaimer [bsd-x11], BSD-4-Clause with Voices [bsd-original-voices], BSD-4-Clause-Shortened [bsd-4-clause-shortened], BSD-Axis without modification [bsd-axis-nomod], BSD-Credit [bsd-credit], BSD-Derivative [bsd-new-derivative], BSD-Export [bsd-export], BSD-InnoSys [bsd-innosys], BSD-Mylex [bsd-mylex], BSD-Original [bsd-original], BSD-Original-Muscle [bsd-original-muscle], BSD-Original-UC [bsd-original-uc], BSD-Original-UC-1986 [bsd-original-uc-1986], BSD-Simplified Intel [bsd-simplified-intel], BSD-Simplified source [bsd-simplified-source], BSD-Top [bsd-top], BSLA [bsla], BSLA no advertizing [bsla-no-advert], Business Source License 1.0 [bsl-1.0], BYTEmark License [bytemark], bzip2 License 2010 [bzip2-libbzip-2010], Caldera License [caldera], Careware [careware], Carnegie Mellon Contributors [carnegie-mellon-contributors], Carnegie Mellon License [carnegie-mellon], Cavium malloc License [cavium-malloc], CC-BY-1.0 [cc-by-1.0], CC-BY-2.0 [cc-by-2.0], CC-BY-2.0-UK [cc-by-2.0-uk], CC-BY-2.5 [cc-by-2.5], CC-BY-3.0 [cc-by-3.0], CC-BY-3.0-AT [cc-by-3.0-at], CC-BY-3.0-US [cc-by-3.0-us], CC-BY-4.0 [cc-by-4.0], CC-PD [cc-pd], CC-PD Mark 1.0 [cc-pdm-1.0], CC0-1.0 [cc0-1.0], CDLA Permissive 1.0 [cdla-permissive-1.0], CDLA Permissive 2.0 [cdla-permissive-2.0], CeCILL-B License [cecill-b], CeCILL-B License English [cecill-b-en], CERN Attribution 1995 [cern-attribution-1995], CERN Open Hardware Licence v1.2 [cern-ohl-1.2], CERN Open Hardware License v1.1 [cern-ohl-1.1], CERN-OHL-P-2.0 [cern-ohl-p-2.0], CFITSIO License [cfitsio], Checkmk License [checkmk], Chicken Dance License v0.2 [chicken-dl-0.2], Chris Maunder License [chris-maunder], Chris Stoy 
Attribution License [chris-stoy], Clarified Artistic License [artistic-clarified], Classic VB License [classic-vb], Clear BSD 1-Clause License [clear-bsd-1-clause], Clear BSD License [clear-bsd], Click License [click-license], CLIPS License 2017 [clips-2017], CMU Computing Services License [cmu-computing-services], CMU License [cmu-template], CMU MIT-style [cmu-mit], CMU Simple License [cmu-simple], CMU Style [cmu-uc], CNRI Jython License [cnri-jython], CNRI Python 1.6 [cnri-python-1.6], CNRI Python 1.6.1 [cnri-python-1.6.1], Code Credit License v1.0.1 [code-credit-license-1.0.1], Code Credit License v1.1.0 [code-credit-license-1.1.0], CodeGuru Permissions [codeguru-permissions], CodeSourcery 2004 [codesourcery-2004], COIL-1.0 [coil-1.0], Common Lisp LOOP License [loop], CommonJ Timer License [commonj-timer], Compass License [compass], ComponentAce JCraft License [componentacejcraft], compuphase Linking Exception to Apache 2.0 [compuphase-linking-exception], Condor Public License 1.1 [condor-1.1], Copyheart [copyheart], Cornell Lossless JPEG License [cornell-lossless-jpeg], Cougaar Open Source License [cosl], CP/M License 2022 [cpm-2022], CppCoreGuidelines License [cpp-core-guidelines], CRCalc license [crcalc], Creative Commons Attribution 2.5 Australia [cc-by-2.5-au], Creative Commons Attribution 3.0 Germany [cc-by-3.0-de], Creative Commons Attribution 3.0 Netherlands [cc-by-3.0-nl], Crossword License [crossword], Crypto++ License [cryptopp], Crystal Stacker License [crystal-stacker], CSL-1.0 [csl-1.0], CSPRNG [csprng], Cube License [cube], cURL License [curl], CVE ToU [cve-tou], CWE ToU [cwe-tou], CxImage License [cximage], D Zlib [d-zlib], DAMAIL [damail], Dante Treglia License [dante-treglia], DBAD License 1.1 [dbad-1.1], Debian reportbug License [reportbug], Delorie Historical License [delorie-historical], dhtmlab Public License [dhtmlab-public], diffmark License [diffmark], dl-de/by-1-0-de [dl-de-by-1-0-de], dl-de/by-1-0-en [dl-de-by-1-0-en], dl-de/by-2-0-de 
[dl-de-by-2-0-de], dl-de/by-2-0-en [dl-de-by-2-0-en], dmalloc License [dmalloc], DMTF License 2017 [dmtf-2017], Docbook License [docbook], Dom4j License [dom4j], Dotseqn License [dotseqn], Douglas Young License [douglas-young], DRL-1.0 [drl-1.0], DRL-1.1 [drl-1.1], Dropbear License [dropbear], Dropbear-2016 [dropbear-2016], DSDP License [dsdp], Dtree License [dtree], dvipdfm License [dvipdfm], DWTFNMFPL-3.0 [dwtfnmfpl-3.0], Dynamic Drive TOU [dynamic-drive-tou], ECL 1.0 [ecl-1.0], ECL 2.0 [ecl-2.0], EFL 1.0 [efl-1.0], EFL 2.0 [efl-2.0], EFL MIT-Style License [enlightenment], eGenix Public License 1.0.0 [egenix-1.0.0], eGenix Public License 1.1.0 [egenix-1.1.0], EllisLab License [ellis-lab], EMX Library License [emx-library], EnergyPlus BSD-Style License [energyplus-bsd], Enhanced MIT License [emit], enna License [enna], Entessa 1.0 [entessa-1.0], ePaperPress License [epaperpress], EPICS Open License [epics], Eric Glass License [eric-glass], Errbot exception [errbot-exception], Etalab Open License 2.0 [etalab-2.0], Etalab Open License 2.0 English [etalab-2.0-en], EU DataGrid Software License [eu-datagrid], Fabien Tassin License [fabien-tassin], Fair License [fair], FAL 1.3 [free-art-1.3], Far Manager exception to BSD-3-Clause [far-manager-exception], FASTBuild License 2012-2020 [fastbuild-2012-2020], FastCGI DevKit [fastcgi-devkit], FastCGI License for Spec Implementation [openmarket-fastcgi], FatFs License [fatfs], FFTPACK License 2004 [fftpack-2004], Filament Group MIT License [filamentgroup-mit], Flex 2.5 [flex-2.5], Flora License v1.1 [flora-1.1], font-alias License [font-alias], FPLOT LIcense [fplot], Fraunhofer ISO 14496-10 License [fraunhofer-iso-14496-10], FreeBSD Boot [freebsd-boot], FreeBSD Doc License [freebsd-doc], FreeBSD unmodified first lines License [freebsd-first], FreeMarker License [freemarker], FreeTTS License [freetts], FreeType Project License [freetype], Freeware Public License (FPL) [fpl], FSF All Permissive License [fsf-ap], FSF Free Software 
License [fsf-free], FSF Notice [fsf-notice], FSF Unlimited License No Warranty [fsf-unlimited-no-warranty], FSF-Unlimited [fsf-unlimited], Fujion Clinical Exception to Apache 2.0 [fujion-exception-to-apache2.0], Gareth McCaughan License [gareth-mccaughan], Gary S. Brown License [gary-s-brown], GDCL License [gdcl], Generic patent disclaimer [patent-disclaimer], Geoff Kuenning License 1993 [geoffkuenning-1993], Ghostpdl Permissive [ghostpdl-permissive], Glulxe License [glulxe], GLUT License [glut], GLWTPL [glwtpl], Good Boy License [good-boy], Graphics Gems License [graphics-gems], Greg Roelofs License [greg-roelofs], Gregory Pietsch Liberal License [gregory-pietsch], GStreamer Exception (2005) [gstreamer-exception-2005], GTPL-v1 [gtpl-v1], GTPL-v2 [gtpl-v2], GTPL-v3 [gtpl-v3], Haskell Report License [haskell-report], HDF4 License [hdf4], HDF5 License [hdf5], HDPARM License [hdparm], Henry Spencer License 1999 [henry-spencer-1999], Henry Spencer Regexp License [hs-regexp], HIDAPI License [hidapi], Historical Notice - NTP [historical-ntp], Historical Permission Notice and Disclaimer [historical], Homebrewed License [homebrewed], HP 1986 License [hp-1986], HPND sell variant with MIT disclaimer [hpnd-sell-variant-mit-disclaimer], HTML 5 spec License [html5], httpget notice and disclaimer [httpget], Ian Kaplan License [ian-kaplan], Ian Piumarta License [ian-piumarta], IBM AS-IS License [ibm-as-is], IBM DHCP License [ibm-dhcp], IBM NonWarranted Sample Code License [ibm-nwsc], IBM PowerPC Software [ibm-pibs], IBM Sample License [ibm-sample], IBPP License [ibpp], ICANN-Public [icann-public], ICOT Free Software [icot-free], ICU Composite License [ibm-icu], ICU License 58 and later [unicode-icu-58], IDT License Notice [idt-notice], IETF License [ietf], IETF Trust License [ietf-trust], ilmid License [ilmid], ImageMagick License [imagemagick], Independent JPEG Group License - short [ijg-short], Indiana Extreme License 1.1.1 [indiana-extreme], Indiana Extreme License 1.2 
[indiana-extreme-1.2], Infineon Free Software License [infineon-free], Info-Zip License 1997-10 [info-zip-1997-10], Info-Zip License 2001-01 [info-zip-2001-01], Info-Zip License 2002-02 [info-zip-2002-02], Info-Zip License 2003-05 [info-zip-2003-05], Info-Zip License 2004-05 [info-zip-2004-05], Info-Zip License 2005-02 [info-zip-2005-02], Info-Zip License 2007-03 [info-zip-2007-03], Info-Zip License 2009-01 [info-zip-2009-01], Info-Zip License [info-zip], Inno Setup License [inno-setup], Intel ACPI SLA [intel-acpi], Intel BSD - Export Control [intel-bsd-export-control], Intel BSD 2 Clause License [intel-bsd-2-clause], Intel BSD License [intel-bsd], Intel Limited Patent License [intel], Intel OSL 1989 [intel-osl-1989], Intel OSL 1993 [intel-osl-1993], Intel Royalty Free License [intel-royalty-free], ISC License [isc], ISO 14496-10 [iso14496-10], ISO 8879 [iso-8879], ITU License [itu], JA-SiG License [ja-sig], Jam License [jam], Jason Mayes License [jason-mayes], Jasper 1.0 [jasper-1.0], JasPer 2.0 [jasper-2.0], Java App Stub License [java-app-stub], JDBM License v1.00 [jdbm-1.00], JDOM License [jdom], Jetty License [jetty], JGraph License [jgraph], JPEG License [ijg], JPNIC idnkit License [jpnic-idnkit], JPNIC mdnkit License [jpnic-mdnkit], JPython 1.1 [jpython-1.1], jQuery-Tools-PD [jquery-pd], Jscheme License [jscheme], JSFromHell License [jsfromhell], JSON License [json], JSON-js-PD [json-js-pd], JSON-PD [json-pd], Jython License [jython], Kalle Kaukonen License [kalle-kaukonen], Kazlib [kazlib], Keith Rule License [keith-rule], Kerberos License [kerberos], Kevan Stannard License [kevan-stannard], Kevlin Henney License [kevlin-henney], Khronos License [khronos], Knuth CTAN License [knuth-ctan], Kumar Robotics License [kumar-robotics], latex-ec-fonts [ecfonts-1.0], Latex2e License [latex2e], Latex2e with translated notice permission [latex2e-translated-notice], LBNL BSD Variant [lbnlbsd], LCS-Telegraphics License [lcs-telegraphics], Leptonica License [leptonica], 
libgd License 2018 [libgd-2018], libgeoTiff License [libgeotiff], LibMib License [libmib], libmng License 2007 [libmng-2007], Libpng License [libpng], LIbpng License v2 [libpng-v2], libselinux License [libselinux-pd], libsrv License v1.0.2 [libsrv-1.0.2], Lil License v1 [lil-1], LILO License [lilo], Linux Device Drivers [linux-device-drivers], Linux-OpenIB [linux-openib], LinuxBIOS License [linuxbios], linuxhowtos License [linuxhowtos], LLNL [llnl], LLVM Exception to Apache 2.0 [llvm-exception], Logica OSL 1.0 [logica-1.0], LPPL 1.3c [lppl-1.3c], Lucent Public License 1.0 [lucent-pl-1.0], Lucent Public License 1.02 [lucent-pl-1.02], Lucre License [lucre], LZMA SDK License (versions 9.22 and beyond) [lzma-sdk-9.22], LZMA SDK Public Domain [lzma-sdk-pd], M+ Fonts license [m-plus], MakeHuman License [make-human-exception], Markus Kuhn License [markus-kuhn-license], Martin Bergmeier License [martin-birgmeier], Matrix Template Library License [mtll], Matt Gallagher Attribution License [matt-gallagher-attribution], Matt Kruse License [mattkruse], Matthew Kwan License [matthewkwan], MediaInfo(Lib) License [mediainfo-lib], metamail License [metamail], MgOpen Font License [mgopen-font-license], Michael Barr License [michael-barr], Minpack Copyright Notice [minpack], MirOS License [mir-os], MIT (SEI) [vince], MIT 1995 [mit-1995], MIT Acknowledgment License [mit-ack], MIT Addition License [mit-addition], MIT License 1998 [mit-license-1998], MIT License [mit], MIT Modern Variant [mit-modern], MIT Nagy Variant [mit-nagy], MIT no advertising with Export Control [mit-no-advert-export-control], MIT No Commercial Use of Trademarks [mit-no-trademarks], MIT no false attribution License [mit-no-false-attribs], MIT Old Style [mit-old-style], MIT Old Style no advertising [mit-old-style-no-advert], MIT Old Style Spare [mit-old-style-sparse], MIT README License [mit-readme], MIT Synopsys License [mit-synopsys], MIT Taylor Variant [mit-taylor-variant], MIT Veillard Variant 
[mit-veillard-variant], MIT with Export Control [mit-export-control], MIT with Specification Disclaimer [mit-specification-disclaimer], MIT Xfig Variant [mit-xfig], MIT-0-Clause [mit-0], mod_dav License 1.0 [mod-dav-1.0], Modified MIT License for Public Domain software [pd-mit], Motorola Microprocessor License [motorola], Mozilla GC License [mozilla-gc], MPEG SSG License [mpeg-ssg], MPEG-2 NBC MPEG-4 Audio ISO [mpeg-iso], MPICH License [mpich], MS Systems Journal Sample Code License [msj-sample-code], MS WS Routing Specifications License [ms-ws-routing-spec], MS-LPL [ms-lpl], MS-PL [ms-pl], MS-SS-PL [ms-sspl], Mulan PSL v1 [mulanpsl-1.0], Mulan PSL v1.0 (En) [mulanpsl-1.0-en], Mulan PSL v2 [mulanpsl-2.0], Mulan PSL v2.0 (En) [mulanpsl-2.0-en], Mulle Kybernetik License [mulle-kybernetik], Multics License [multics], Mup License [mup], musl attribution exception [musl-exception], MX4J License 1.0 [mx4j], Nara Institute License 2003 [naist-2003], NASA 1.3 [nasa-1.3], NAUMEN Public License [naumen], NBPL-1.0 [nbpl-1.0], NCBI Public Domain Notice [ncbi], NCSA Open Source License [uoi-ncsa], Net SNMP License [net-snmp], Netcat License [netcat], NetCDF License [netcdf], Netron Project License [netron], Newlib Historical License [newlib-historical], Newran License [newran], Newsletr License [newsletr], Nice License [nice], NICTA Public Software Licence 1.0 [nicta-psl], Niels Ferguson License [niels-ferguson], Nilsson Historical License [nilsson-historical], NIST Public Domain Notice [nist-pd], NIST Public Domain Notice with fallback [nist-pd-fallback], NIST Software License [nist-software], NIST SRD License [nist-srd], NLOD-1.0 [nlod-1.0], NLOD-2.0 [nlod-2.0], NLPL [nlpl], Node License [node-js], Non White Heterosexual Male [nwhm], Nonexclusive License [nonexclusive], Nortel DASA License [nortel-dasa], Notre Dame License [notre-dame], NRL License [nrl], NRL permission [nrl-permission], NTLM License [ntlm], NTP Origin License [ntpl-origin], NTP-0 [ntp-0], NVIDIA 2002 License 
[nvidia-2002], NVIDIA License [nvidia], NVIDIA License with Government Qualifications [nvidia-gov], NYSL 0.9982 [nysl-0.9982], NYSL 0.9982 JP [nysl-0.9982-jp], O Young Jong License [o-young-jong], O’Reilly Code Sample Notice [oreilly-notice], O-UDA-1.0 [o-uda-1.0], Oasis WS Security Specification License [oasis-ws-security-spec], Object Form Exception to MIT [object-form-exception-to-mit], ODC-By-1.0 [odc-by-1.0], ODMG License [odmg], OFFIS License [offis], OFL 1.0 [ofl-1.0], OFL 1.0 no Reserved Font Name [ofl-1.0-norfn], OFL 1.0 Reserved Font Name [ofl-1.0-rfn], OFL 1.1 no Reserved Font Name [ofl-1.1-no-rfn], OGC 1.0 [ogc-1.0], OGC Software Notice [ogc], OGL 1.0a [ogl-1.0a], OGL Alberta 2.1 [can-ogl-alberta-2.1], OGL British Columbia 2.0 [can-ogl-british-columbia-2.0], OGL Canada 2.0 [can-ogl-2.0-en], OGL Canada 2.0 Francais [ogl-canada-2.0-fr], OGL Nova Scotia 1.0 [can-ogl-nova-scotia-1.0], OGL Ontario 1.0 [can-ogl-ontario-1.0], OGL Toronto 1.0 [can-ogl-toronto-1.0], OGL-UK-1.0 [ogl-uk-1.0], OGL-UK-2.0 [ogl-uk-2.0], OGL-UK-3.0 [ogl-uk-3.0], OGL-WPD-3.0 [ogl-wpd-3.0], Open Directory License [odl], Open Group Test Suite License [opengroup], Open Publication License 1.0 [openpub], OpenLDAP Public License 1.1 [openldap-1.1], OpenLDAP Public License 1.2 [openldap-1.2], OpenLDAP Public License 1.3 [openldap-1.3], OpenLDAP Public License 1.4 [openldap-1.4], OpenLDAP Public License 2.0 [openldap-2.0], OpenLDAP Public License 2.0.1 [openldap-2.0.1], OpenLDAP Public License 2.1 [openldap-2.1], OpenLDAP Public License 2.2 [openldap-2.2], OpenLDAP Public License 2.2.1 [openldap-2.2.1], OpenLDAP Public License 2.2.2 [openldap-2.2.2], OpenLDAP Public License 2.3 [openldap-2.3], OpenLDAP Public License 2.4 [openldap-2.4], OpenLDAP Public License 2.5 [openldap-2.5], OpenLDAP Public License 2.6 [openldap-2.6], OpenLDAP Public License 2.7 [openldap-2.7], OpenLDAP Public License 2.8 [openldap-2.8], OpenORB Community License 1.0 [openorb-1.0], OpenSAML License v1 [opensaml-1.0], OpenSSH 
License [openssh], OpenSSL License [openssl], OpenSSL/SSLeay License [openssl-ssleay], OPML 1.0 [opml-1.0], OPNL-1.0 [opnl-1.0], OPNL-2.0 [opnl-2.0], Oracle BSD-Style with Nuclear Restrictions [oracle-bsd-no-nuclear], Original SSLeay License [ssleay], Original SSLeay License with Windows Clause [ssleay-windows], Oswego Concurrent License [oswego-concurrent], Other Permissive Licenses [other-permissive], OWTChart License [owtchart], OZPLB 1.0 [ozplb-1.0], OZPLB 1.1 [ozplb-1.1], Paolo Messina 2000 [paolo-messina-2000], ParaView License 1.2 [paraview-1.2], Paul Mackerras Binary License [paul-mackerras-binary], Paul Mackerras License [paul-mackerras], Paul Mackerras New License [paul-mackerras-new], Paul Mackerras Simplified License [paul-mackerras-simplified], Paulo Soares License [paulo-soares], PayPal SDK License 2013-2016 [paypal-sdk-2013-2016], PBM Library License [libpbm], PCRE License [pcre], PD’Programming License [pd-programming], PDDL 1.0 [pddl-1.0], Perl 1.0 [perl-1.0], Peter Deutsch Document License [peter-deutsch-document], Phil Bunce License [phil-bunce], Philippe De Muyter License [philippe-de-muyter], Phorum License 2.0 [phorum-2.0], PHP License 2.0.2 [php-2.0.2], PHP License 3.0 [php-3.0], PHP License 3.01 [php-3.01], Pine License [pine], PngSuite License [pngsuite], Politepix Public License 1.0 [politepix-pl-1.0], PostgreSQL License [postgresql], ppp License [ppp], Protobuf License [protobuf], PS Utilities License [psutils], PSF Python License 3.7.2 [psf-3.7.2], PSF-2.0 [psf-2.0], psfrag License [psfrag], Psytec Free Software License [psytec-freesoft], Public Domain [public-domain], Public Domain Disclaimer [public-domain-disclaimer], Purdue BSD-Style License [purdue-bsd], pybench License [pybench], PyCrypto License [pycrypto], PyGres License 2.2 [pygres-2.2], Python CWI License [python-cwi], Python License 2.0 [python], Python License 2.0.1 [python-2.0.1], Qhull License [qhull], QLogic Microcode [qlogic-microcode], Qpopper License [qpopper], Qualcomm 
Turing License [qualcommturing], Quirksmode Copyright Notice [quirksmode], radvd License [radvd], Rdisc License [rdisc], Red Hat Attribution License [red-hat-attribution], Red Hat BSD-Simplified [red-hat-bsd-simplified], Regexp License [regexp], Repoze License [repoze], RiceBSD [ricebsd], Richard Black License [richardblack], Robert Hubley License [robert-hubley], RSA 1990 [rsa-1990], RSA Cryptoki License [rsacryptoki], RSA Demo License [rsa-demo], RSA-MD4 License [rsa-md4], RSA-MD5 License [rsa-md5], RTools.Util License [rtools-util], Ruby License [ruby], Runtime Library Exception to Apache 2.0 [apple-runtime-library-exception], Rute Users Tutorial and Exposition License 0.8.0 [rute], Ryszard Szopa License [ryszard-szopa], SaaS MIT License [saas-mit], Sash Notice [sash], SATA License [sata], SAX-PD [sax-pd], Saxpath License [saxpath], SBIA Part B [sbia-b], ScanCode acknowledgment [scancode-acknowledgment], scanlogd License [scanlogd-license], ScanSoft Public License 1.2 [scansoft-1.2], SCEA Shared Source License 1.0 [scea-1.0], Scheme Language Report License [schemereport], Scheme Widget Library (SWL) Software License [swl], Scintilla License [scintilla], Scribbles Demos Recognizer Notice [scribbles], Script Asylum License [script-asylum], Secret Labs License 2011 [secretlabs-2011], selinux-nsa-declaration-1.0 [selinux-nsa-declaration-1.0], Sendmail License [sendmail], Service Availability Forum License [saf], Service Component Architecture License [service-comp-arch], SFL License Agreement [sfl-license], SGI CID Font Code Public License 1.0 [sgi-cid-1.0], SGI Free Software License B 1.1 [sgi-freeb-1.1], SGI Free Software License B 2.0 [sgi-freeb-2.0], SGI GLX Public License 1.0 [sgi-glx-1.0], Sglib License [sglib], SGP4 Permission Notice [sgp4], Shital Shah License [shital-shah], SIL Open Font License 1.1 with Reserved Font Name [ofl-1.1-rfn], SimPL 1.1 [simpl-1.1], SNMP++ License [hp-snmp-pp], snprintf License [snprintf], SoftFloat [softfloat], SoftFloat Legal 
Notice 2.0 [softfloat-2.0], softSurfer License [softsurfer], SolderPad Hardware License v0.5 [shl-0.5], Solderpad Hardware License v2.0 [shl-2.0], Solderpad Hardware License v2.1 [shl-2.1], SolderPad Hardware License, Version 0.51 [shl-0.51], Sparky License [sparky], SpeechWorks Public License 1.1 [speechworks-1.1], SQLite Blessing [blessing], Standard ML of New Jersey [standard-ml-nj], Stanford PVRG License [stanford-pvrg], STLport License 2000 [stlport-2000], STLport License 4.5 [stlport-4.5], STREAM Benchmark License [stream-benchmark], Stu Nicholls License [stu-nicholls], Sun RPC License [sun-rpc], Sun source code License [sun-source], SunPro Attribution License [sunpro], Sunsoft License [sunsoft], Supervisor License [supervisor], svndiff License [svndiff], SWIG Library License [swig], Symlinks License [symlinks], Symphonysoft [symphonysoft], Synopsys MIT License [synopsys-mit], Synthesis Toolkit License [synthesis-toolkit], SystemC Open Source License Agreement [accellera-systemc], Taiwan Open Government Data License, version 1.0 [ogdl-taiwan-1.0], Takao Abe License [takao-abe], Takuya OOURA License [takuya-ooura], Talis Community License [ttcl], Tatu Ylonen License [tatu-ylonen], TCG Spec License v1 [tcg-spec-license-v1], TCL/TK License [tcl], TCP Wrappers License [tcp-wrappers], TekHVC License [tekhvc], Term Readkey License [term-readkey], Tested Software License [tested-software], TeX Live License [tex-live], TextTabs+Wrap License [ttwl], TFL [tfl], The Happy Bunny License [happy-bunny], Theodore Ts’o license [tso-license], Things I Made (TIM) Public License [things-i-made-public-license], Tidy License [tidy], Tiger Cryptography License [tiger-crypto], Tigra Calendar 3.2 License [tigra-calendar-3.2], Tigra Calendar 4.0 License [tigra-calendar-4.0], Tim Janik License 2003 [tim-janik-2003], Time::ParseDate License [tpdl], Timestamp Picker License [timestamp-picker], TTYP0 License [ttyp0], TU Berlin License 1.0 [tu-berlin], TU Berlin License 2.0 
[tu-berlin-2.0], Tumbolia Public License [tumbolia], TwistedSNMP License [twisted-snmp], UCAR License [ucar], UnboundID LDAP SDK Free Use License [ldap-sdk-free-use], Unicode DFS 2015 [unicode-dfs-2015], Unicode DFS 2016 [unicode-dfs-2016], Unicode Inc License Agreement [unicode], Unicode Mappings License [unicode-mappings], University of British Columbia License [ubc], University of Michigan OSL [michigan-disclaimer], UNIX Network Programming Book License [unpbook], UnixCrypt License [unixcrypt], Unlicense [unlicense], Unlimited Binary Use Exception [unlimited-binary-use-exception], UPL 1.0 [upl-1.0], US Government Public Domain [us-govt-public-domain], US Government Unlimited Rights [us-govt-unlimited-rights], USRobotics Permissive License [usrobotics-permissive], Utopia Typeface License [utopia], VCalendar License [vcalendar], Vic Metcalfe Public Domain [vic-metcalfe-pd], VIM License [vim], Visual Idiot [visual-idiot], Visual Numerics License [visual-numerics], Vixie Cron License [vixiecron], Vovida Software License 1.0 [vsl-1.0], W3C 3-Clause BSD License [w3c-03-bsd-license], W3C Software Notice and License [w3c], W3C-SOFTWARE-19980720 [w3c-software-19980720], W3C-SOFTWARE-DOC-20150513 [w3c-software-doc-20150513], w3m License [w3m], Westhawk License [westhawk], Whistle Communications License [whistle], Whitecat License [whitecat], WIDE License [wide-license], Wide Open License [wol], Widget Workshop License [widget-workshop], William Alexander License [william-alexander], wingo License [wingo], Wordnet License [wordnet], Wrox Press License [wrox], WS-Addressing Specification License [ws-addressing-spec], WS-Policy Specification [ws-policy-specification], WS-Trust Specification [ws-trust-specification], Wsuipa License [wsuipa], WTFNMFPL-1.0 [wtfnmfpl-1.0], WTFPL 1.0 [wtfpl-1.0], WTFPL 2.0 [wtfpl-2.0], WTHPL 1.0 [wthpl-1.0], wxWidgets Licence [wxwidgets], wxWindows Unrestricted Licence 3.0 [wxwindows-u-3.0], X11 Documentation License [x11-doc], X11 License [x11], 
X11-R5 [x11-x11r5], X11-Style (Acer) [x11-acer], X11-Style (Adobe) [x11-adobe], X11-Style (Adobe-DEC) [x11-adobe-dec], X11-Style (Bitstream Charter) [x11-bitstream], X11-Style (David R. Hanson) [x11-hanson], X11-Style (DEC 1) [x11-dec1], X11-Style (DEC 2) [x11-dec2], X11-Style (DSC Technologies) [x11-dsc], X11-Style (FSF) [x11-fsf], X11-Style (Keith Packard) [x11-keith-packard], X11-Style (Lucent) [x11-lucent], X11-Style (Lucent-variant) [x11-lucent-variant], X11-Style (OAR) [x11-oar], X11-Style (Open Group) [x11-opengroup], X11-Style (OpenGL) [x11-opengl], X11-Style (Quarterdeck) [x11-quarterdeck], X11-Style (Realmode) [x11-realmode], X11-Style (Silicon Graphics) [x11-sg], X11-Style (Stanford University) [x11-stanford], X11-Style (Tektronix) [x11-tektronix], X11-Style (Tiff) [x11-tiff], X11-Style (X Consortium Veillard) [x11-xconsortium-veillard], X11-Style (X Consortium) [x11-xconsortium], Xdebug License v 1.03 [xdebug-1.03], XFree86 License 1.0 [xfree86-1.0], XFree86 License 1.1 [xfree86-1.1], xinetd License [xinetd], XML:DB Initiative Software License 1.0 [xmldb-1.0], XSkat License [xskat], xxd License [xxd], Yale CAS License [yale-cas], Yensdesign License [yensdesign], Zed License [zed], Zend Engine License 2.0 [zend-2.0], ZeusBench notice [zeusbench], ZLIB License [zlib], ZLIB License with Acknowledgment [zlib-acknowledgement], ZPL 1.0 [zpl-1.0], ZPL 1.1 [zpl-1.1], ZPL 2.0 [zpl-2.0], ZPL 2.1 [zpl-2.1], zsh License [zsh], Zuora Software License [zuora-software], Zveno Research License [zveno-research]

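Note on the source of the license list. The list above gives each license's short name (or its full name when no short name exists), followed in square brackets by its key in the ScanCode license dataset, which is available at https://github.com/aboutcode-org/scancode-toolkit/tree/develop/src/licensedcode/data/licenses.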
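The "short name [key]" layout described above can be split back into pairs mechanically. The following is a minimal sketch, not part of the paper: the function name `parse_license_entries` and the assumed key alphabet (letters, digits, `.`, `_`, `-`) are illustrative choices inferred from the keys shown in the list. It anchors on the bracketed key rather than on commas, because some short names themselves contain commas.

```python
import re

# One entry looks like "Short Name [scancode-key]", followed by a comma or the end.
# ASSUMPTION: keys contain only letters, digits, '.', '_' and '-', as in the list above.
ENTRY = re.compile(r"\s*(?P<name>.+?)\s*\[(?P<key>[A-Za-z0-9._-]+)\]\s*(?:,|$)")


def parse_license_entries(text: str) -> list[tuple[str, str]]:
    """Split 'Name [key], Name [key], ...' into (name, key) pairs."""
    # Collapse line wraps so an entry broken across lines is read as one entry.
    flat = " ".join(text.split())
    return [(m.group("name"), m.group("key")) for m in ENTRY.finditer(flat)]


sample = (
    "Boost 1.0 [boost-1.0], "
    "Taiwan Open Government Data License, version 1.0 [ogdl-taiwan-1.0], "
    "ZLIB License [zlib]"
)
pairs = parse_license_entries(sample)
for name, key in pairs:
    print(f"{key}: {name}")
```

Keys recovered this way can then be compared against the file names in the ScanCode dataset directory to spot entries damaged during text extraction.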
