cuNumeric and Legate: How to Create a Distributed, GPU-Accelerated Library

Wonchan Lee | March 23, 2023

Mission: Accessible Accelerated Computing
Legate Strategy
Running Example: cuNumeric
Composability via Implicit Parallelism
Challenges in True Composability
Unified Data Abstraction for True Composability
Abstractions for Productive Programming
Performance
Case Study: TorchSWE
Beyond Dense Array Programming
- Legate Sparse: SciPy-Sparse Implementation in Legate (Features)
- Legate Sparse: SciPy-Sparse Implementation in Legate (Performance)
Status and Plan: cuNumeric
Status and Plan: Legate
Start Legate Today!

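Mission: Accessible Accelerated Computing (Page 2)

Our mission is to deliver accelerated computing without the pain of complex parallel programming. That means making hardware such as GPUs, Grace CPUs, DGX systems, and DGX SuperPODs approachable.
[Figure: examples of accelerated computing hardware]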

Legate Strategy (Page 3)

Legate aims to provide a comprehensive ecosystem and toolkit for both end users and library developers.

For end users:
* Build an ecosystem of composable libraries that scale to any NVIDIA hardware.

For library developers:
* Provide a toolkit for productive library development.
* Bring scalability and composability to libraries built in this framework.

The core architecture of the Legate strategy is layered as follows:
User programs
↓
Legate Libraries
* cuNumeric
* Legate Sparse
* Legate Pandas
* ...
Legate Toolkit
* Productivity Layer
* Composability & Portability Layer
* Runtime for Scalable Execution
↓
NVIDIA Hardware
[Figure: Legate strategy architecture]

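Running Example: cuNumeric (Page 4)

A NumPy implementation in Legate
cuNumeric lets NumPy programs run at scale without extensive code modifications.

Key points:
* NumPy operators have ample parallelism for scalable execution.
* NumPy programs run at scale with minimal or no code changes.
* The same code runs on anything from a single GPU to DGX SuperPODs.

The following conjugate gradient solver is an example: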

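# Conjugate gradient solver.
# Assumes A (a symmetric positive-definite matrix), b (the right-hand side),
# and tolerance (the convergence threshold) are already defined.
import cunumeric as np

x = np.zeros_like(b)
r = b - A.dot(x)        # initial residual
p = r                   # initial search direction
rsold = r.dot(r)
max_iters = b.shape[0]

for i in range(max_iters):
    Ap = A.dot(p)
    alpha = rsold / (p.dot(Ap))   # step length along p
    x = x + alpha * p
    r = r - alpha * Ap
    rsnew = r.dot(r)
    if np.sqrt(rsnew) < tolerance:
        break
    beta = rsnew / rsold
    p = r + beta * p              # next search direction
    rsold = rsnew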
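
To make the solver above easy to try, here is a minimal self-contained sketch. It imports plain NumPy so it runs anywhere; the intent described in the talk is that changing the import to `cunumeric` is the only edit needed, but that swap is not exercised here. The helper `make_spd_system` and the default `tolerance` are illustrative assumptions, not part of the original slide.

```python
import numpy as np  # assumption: replace with `import cunumeric as np` to run under Legate


def conjugate_gradient(A, b, tolerance=1e-10, max_iters=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    if max_iters is None:
        max_iters = b.shape[0]
    for _ in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if np.sqrt(rsnew) < tolerance:
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
    return x


def make_spd_system(n, seed=0):
    """Hypothetical helper: build a well-conditioned SPD matrix and a right-hand side."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)  # M M^T is positive semi-definite; the shift makes it definite
    b = rng.standard_normal(n)
    return A, b


A, b = make_spd_system(20)
x = conjugate_gradient(A, b, max_iters=200)
print(np.linalg.norm(A @ x - b))  # residual norm of the computed solution
```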

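Composability via Implicit Parallelism (Page 5)

The Legate runtime extracts parallelism through dependence analysis, which is what makes libraries composable with one another.

How it works:
1. Legate program: API calls are issued in program order.
2. Libraries: simply translate API calls into tasks.
3. Legate runtime: extracts parallelism via dependence analysis.

Benefits:
* Libraries need no explicit synchronization or data movement, which makes them composable with each other.
* The runtime weaves tasks from multiple libraries into a single execution.
* Precise runtime analysis extracts all available parallelism within and across libraries.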
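
The three steps above can be illustrated with a toy model. The sketch below is not the Legate runtime: it is a deliberately simplified dependence analysis in pure Python, where each task declares the stores it reads and writes, and tasks are grouped into "waves" that could run in parallel. The `Task` class, the store names, and the task names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def analyze_dependences(tasks):
    """Map each task index to the earlier task indices it must wait for."""
    deps = {}
    for i, later in enumerate(tasks):
        deps[i] = set()
        for j in range(i):
            earlier = tasks[j]
            read_after_write = earlier.writes & later.reads
            write_after_read = earlier.reads & later.writes
            write_after_write = earlier.writes & later.writes
            if read_after_write or write_after_read or write_after_write:
                deps[i].add(j)
    return deps


def schedule_waves(tasks):
    """Group tasks into waves; no task depends on another task in its own wave."""
    deps = analyze_dependences(tasks)
    wave_of = {}
    for i in range(len(tasks)):
        # A task runs one wave after the latest task it depends on.
        wave_of[i] = 1 + max((wave_of[j] for j in deps[i]), default=-1)
    waves = [[] for _ in range(1 + max(wave_of.values(), default=-1))]
    for i, wave in wave_of.items():
        waves[wave].append(tasks[i].name)
    return waves


# Tasks issued in program order by two hypothetical libraries.
program = [
    Task("numeric.fill_a", writes={"a"}),
    Task("numeric.fill_b", writes={"b"}),
    Task("sparse.spmv", reads={"a"}, writes={"c"}),
    Task("numeric.add", reads={"b", "c"}, writes={"d"}),
]
print(schedule_waves(program))
```

Neither library synchronizes with the other here; the ordering falls out of the declared reads and writes alone, which is the property the slide attributes to the runtime.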

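Challenges in True Composability (Page 6)

Consider cunumeric.nonzero, which finds the coordinates of nonzero entries in C order and expects its input to be partitioned only on the first dimension.

The challenge is to answer the following questions for any library in the ecosystem that might produce the input:
* Is the input already partitioned the right way?
* Where is each chunk of the input, and how is it fetched?
* Who is producing the data for each chunk?

Legate's solution: Legate solves this "interface problem" at a fundamental level for all Legate developers.

Input partitioning examples:
* These inputs can be reused: both columns show the same partitioning (chunk 0, chunk 1, chunk 2).
* These inputs need repartitioning: the first column shows (chunk 0, chunk 1, chunk 2) while the second shows (chunk 0, chunk 1, chunk 2, chunk 3), so the partitions differ.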
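
The reuse-versus-repartition distinction can be sketched in a few lines. The code below is an illustration only, assuming a partition along the first dimension is represented as a list of `(start, stop)` row intervals; the function names are hypothetical and do not correspond to Legate APIs.

```python
def block_partition(extent, num_chunks):
    """Split range(extent) into num_chunks contiguous, near-equal intervals."""
    base, extra = divmod(extent, num_chunks)
    bounds, start = [], 0
    for i in range(num_chunks):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def can_reuse(producer, consumer):
    """The producer's chunks can be consumed in place only if the partitions match."""
    return producer == consumer


def repartition_plan(producer, consumer):
    """List (producer chunk, consumer chunk, (lo, hi)) for every overlapping row range."""
    plan = []
    for src, (s_lo, s_hi) in enumerate(producer):
        for dst, (d_lo, d_hi) in enumerate(consumer):
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:
                plan.append((src, dst, (lo, hi)))
    return plan


three = block_partition(12, 3)  # producer wrote 3 chunks
four = block_partition(12, 4)   # consumer wants 4 chunks
print(can_reuse(three, three))  # same partitioning on both sides
print(can_reuse(three, four))   # partitions differ
print(repartition_plan(three, four))
```

Each entry in the plan is a piece of data that has to move from one producer chunk to one consumer chunk; answering these questions for every pair of libraries is the bookkeeping the slide says Legate takes over.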

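Unified Data Abstraction for True Composability (Page 7)

To achieve true composability, developers should follow a unified data abstraction.

Developers should:
* Implement their domain-specific containers as Legate stores, a core data abstraction in Legate that provides a global view of data partitioned and managed by the runtime.
* Specify partitioning constraints on Legate stores (next page).

In return, the runtime gives developers:
* Data movement between different partitions.
* Synchronization between tasks, and between tasks and data movement.

Unified data abstraction diagram:
Various data structures (such as np.ndarray, sparse.csr_matrix, pd.DataFrame, and pd.Series) can all be backed by Legate stores and managed by the Legate runtime.
[Figure: unified data abstraction]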

Abstractions for Productive Programming (Page 8)

Legate provides scale-agnostic descriptions of task organization for productive programming.

def nonzero(inp):
    out = [
        legate.create_store(shape=None)  # create a store whose size is determined by the task
        for _ in range(inp.ndim)
    ]
    task = legate.task(cuNumeric.NONZERO)
    task.output(*out)
    task.input(inp)

    # Constrain inp to be partitioned only on the first dimension
    task.broadcast(inp, axes=range(1, inp.ndim))

    task.execute()
    return out

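Properties:
* Scales to any number of processors.
* Offloads performance decisions (partitioning and mapping) to the runtime.
* Lets the task implementation (not shown) focus on single-processor execution.
* Can express a variety of parallel patterns.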
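
The task implementation is not shown in the slide, so the following is only a sketch of why the first-dimension constraint matters, emulated with plain NumPy. When the input is split on axis 0 only, each chunk's nonzero coordinates are already in C order, so concatenating the per-chunk results in chunk order reproduces the global answer. The function names and the use of `np.array_split` are assumptions made for illustration.

```python
import numpy as np


def nonzero_chunk(chunk, row_offset):
    """Single-processor task body: local nonzero, shifted to global row indices."""
    coords = list(np.nonzero(chunk))
    coords[0] = coords[0] + row_offset
    return coords


def partitioned_nonzero(inp, num_chunks):
    """Emulate the launch: split on axis 0 only and run one task body per chunk."""
    chunks = np.array_split(inp, num_chunks, axis=0)
    offsets = np.cumsum([0] + [c.shape[0] for c in chunks[:-1]])
    per_chunk = [nonzero_chunk(c, off) for c, off in zip(chunks, offsets)]
    # Output sizes are only known after each task has run, which is why the
    # slide creates its output stores with shape=None.
    return tuple(
        np.concatenate([coords[d] for coords in per_chunk])
        for d in range(inp.ndim)
    )


rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10, 4))
result = partitioned_nonzero(data, num_chunks=3)
expected = np.nonzero(data)
print(all(np.array_equal(r, e) for r, e in zip(result, expected)))
```

Changing `num_chunks` does not change the task body, which mirrors the scale-agnostic description in the properties listed above.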

Performance (Page 9)

The Legate runtime builds on a decade of research to achieve scalable execution of implicit parallelism, and it shows reasonable weak scaling across different classes of programs.

Benchmarks with nearest neighbor communication: Stencil, Logistic Regression, and CFD Simulation.
Benchmarks with logarithmic communication complexity: Jacobi and Conjugate Gradient Solver.

Both groups of benchmarks plot normalized throughput against the number of GPUs (1 to 1024) and show good weak scaling.
[Figure: performance benchmarks]
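
For readers unfamiliar with weak-scaling plots, the sketch below shows one common way such a curve is computed. The slides do not define "normalized throughput", so this assumes the usual convention: the problem size per GPU is held fixed, and per-GPU throughput is divided by the single-GPU value, which reduces to the ratio of run times. The timings are made-up placeholders, not measurements from the talk.

```python
def weak_scaling_efficiency(timings):
    """timings maps GPU count to seconds per iteration at fixed work per GPU.

    With fixed work per GPU, per-GPU throughput is proportional to 1 / seconds,
    so normalizing by the smallest run gives baseline_seconds / seconds.
    """
    baseline = timings[min(timings)]
    return {gpus: baseline / secs for gpus, secs in sorted(timings.items())}


# Hypothetical timings for illustration only.
hypothetical = {1: 2.0, 8: 2.0, 64: 2.5}
for gpus, eff in weak_scaling_efficiency(hypothetical).items():
    print(gpus, round(eff, 2))
```

A value near 1.0 at every GPU count is ideal weak scaling: the run time stays flat as both the machine and the problem grow.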

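Case Study: TorchSWE (Page 10)

TorchSWE is a GPU-accelerated shallow water equation solver.

The original CuPy+MPI code was ported to cuNumeric, mostly by removing MPI code.
cuNumeric lets domain scientists with no MPI knowledge reach simulation resolutions (about 20B points) that were historically restricted to only a few scientific groups.

The figure shows a 3D visualization of the terrain (bottom) and water level (top), along with weak-scaling performance at 40M grid points per GPU. The performance plot shows normalized throughput scaling well as the GPU count grows from 1 to 1024.
[Figure: TorchSWE case study]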

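Beyond Dense Array Programming

Legate Sparse: SciPy-Sparse Implementation in Legate (Features) (Page 11)

Legate Sparse interoperates seamlessly with cuNumeric.

About 35% API coverage across four sparse matrix formats (CSR, CSC, COO, DIA) was implemented in 3 months.
Data-dependent partitioning constraints played a key role in this rapid development.

The following snippet performs sparse matrix-vector multiplication (SpMV) with Legate Sparse and cuNumeric; the slide also illustrates the partitioning constraints for a CSR matrix.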

# Create a banded diagonal matrix in CSR format
A = legate.sparse.diags(
    [1] * nnz_per_row,
    [x - (nnz_per_row // 2) for x in range(nnz_per_row)],
    shape=(n, n),
    format="csr",
)
# Create a dense vector
x = cunumeric.ones((n,))
# Perform SpMV
y = A.dot(x)
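
The CSR partitioning figure is not reproduced here, so the following NumPy sketch offers one plausible reading of a data-dependent partitioning constraint: once the rows are split into blocks, the slice of column indices and values each block needs is determined by the contents of the row-pointer array. The array names `pos`, `crd`, and `vals`, and all function names, are illustrative assumptions rather than Legate Sparse APIs.

```python
import numpy as np


def banded_csr(n, nnz_per_row):
    """Build the same banded matrix of ones as above, as raw CSR arrays."""
    offsets = [k - nnz_per_row // 2 for k in range(nnz_per_row)]
    pos = np.zeros(n + 1, dtype=np.int64)  # row pointers
    crd, vals = [], []
    for i in range(n):
        cols = [i + o for o in offsets if 0 <= i + o < n]
        crd.extend(cols)
        vals.extend([1.0] * len(cols))
        pos[i + 1] = pos[i] + len(cols)
    return pos, np.array(crd, dtype=np.int64), np.array(vals)


def spmv_rows(pos, crd, vals, x, row_lo, row_hi):
    """Task body for one row block; it only reads crd/vals[pos[row_lo]:pos[row_hi]]."""
    y = np.zeros(row_hi - row_lo)
    for i in range(row_lo, row_hi):
        lo, hi = pos[i], pos[i + 1]
        y[i - row_lo] = vals[lo:hi].dot(x[crd[lo:hi]])
    return y


def partitioned_spmv(pos, crd, vals, x, num_chunks):
    """Split the rows into blocks and concatenate the per-block results."""
    n = len(pos) - 1
    bounds = np.linspace(0, n, num_chunks + 1).astype(int)
    return np.concatenate([
        spmv_rows(pos, crd, vals, x, lo, hi)
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ])


pos, crd, vals = banded_csr(n=10, nnz_per_row=3)
y = partitioned_spmv(pos, crd, vals, np.ones(10), num_chunks=3)
print(y)  # interior rows sum three ones; the first and last rows sum two
```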

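Legate Sparse: SciPy-Sparse Implementation in Legate (Performance) (Page 12)

Benchmarks written in Python perform competitively with PETSc, a state-of-the-art, hand-optimized MPI implementation.
Two charts compare Legate (yellow) and PETSc (gray) across GPU counts from 1 to 192:
* SpMV (sparse matrix-vector multiplication): Legate's normalized throughput is very close to PETSc's.
* Conjugate Gradient Solver: Legate's normalized throughput is likewise competitive with PETSc's.
[Figure: Legate Sparse performance benchmarks]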

Status and Plan: cuNumeric (Page 13)

cuNumeric beta release!
* 60% API coverage.
* Advanced indexing.
* Tensor contraction.
* Multi-dimensional sort.
* 96% of ufuncs.
* 80% of RNGs.
* Ergonomic design.

Planned enhancements in 2023:
* Improved API coverage, including numpy.linalg and numpy.fft.
* Distributed file IO.
* Higher-order operators that accept UDFs.
* Grace support.
* Performance improvements: small data, and better partitioning heuristics (Legate).

Availability:
* Conda package
* Jupyter Notebooks

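Status and Plan: Legate (Page 14)

The Legate roadmap shows the path from the current state to Q4 2023 and the longer-term vision.

Today:
The Legate Toolkit sits at the core, with the Python Productivity Layer on top, supporting the cuNumeric, Legate Sparse, and Legate Pandas libraries.

Q4'23:
* The Legate Toolkit stays the same.
* A C++ Productivity Layer and C++ Legate Libraries are added.
* Language bindings are introduced, with support for libraries in other languages.

Aspirational features (future):
* Cross-library data partitioning and mapping optimization.
* Cross-library task fusion and task graph specialization.
* Developer tools for intelligent performance tuning and debugging.
[Figure: Legate status and plan]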

Start Legate Today! (Page 15)

To end users:
* Try cuNumeric!
* Stay tuned for future Legate library releases (Legate Sparse, Legate Pandas, and others).
* Tell us which libraries you would like to see in the Legate ecosystem.

To library developers:
* Consider writing your next library in Legate!
* Get started with https://github.com/nv-legate/nv-legate/tree/master/examples/hello-world and https://legate.readthedocs.io/en/latest/legate.html.
* If you are already a Legate developer, tell us how we can improve Legate for you.

Send any feedback or questions to:
* legate@nvidia.com
* https://github.com/nv-legate/nv-legate
