Slinky is an open source toolkit developed by SchedMD (now part of NVIDIA) that integrates Slurm workload management with Kubernetes. It allows organizations to run and manage Slurm clusters inside Kubernetes environments on nearly any GPU-accelerated cluster, providing unified scheduling for HPC and cloud-native AI workloads.

What is Slurm Operator in Slinky?

Slurm Operator is a core Slinky component that runs full Slurm clusters on Kubernetes infrastructure. It manages the complete lifecycle of Slurm daemons running as Kubernetes pods, giving workloads access to Slurm features such as job allocation, accounting, job dependencies, fair-share scheduling, and priority scheduling.

What is Slurm Bridge in Slinky?

Slurm Bridge is a Slinky component that brings Slurm scheduling to native Kubernetes workloads. It allows Slurm to act as a Kubernetes scheduler for pods, supporting the co-location of Slurm and Kubernetes workloads on shared infrastructure.

Is Slinky open source?

Yes. Slinky is fully open source and hardware agnostic. The project is available on GitHub at https://github.com/SlinkyProject. Users can deploy Slinky, contribute to its development, and integrate it freely into their infrastructure.

What hardware does Slinky support?

Slinky is designed to run on nearly any GPU-accelerated cluster, including on-premises supercomputers and major cloud providers such as AWS, GCP, and Azure. Its hardware-agnostic architecture allows consistent scheduling policies across heterogeneous data center environments.

What support is available for Slinky?

NVIDIA offers Slurm and Slinky support, training, and consultation services. Organizations can get direct-to-engineering help from NVIDIA experts for implementation and customization. More information is available at https://www.nvidia.com/en-us/software/slurm/slinky/support.

Slinky

Name: Slinky
Author: SchedMD (part of NVIDIA)

Slurm workload management for Kubernetes.

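Overview

Bringing the power of Slurm to Kubernetes

Slinky is an open source project developed by SchedMD (now part of NVIDIA) that provides seamless interoperability between Slurm and Kubernetes. It introduces tools that let users run and manage Slurm clusters in Kubernetes environments built on nearly any GPU-accelerated cluster, with broad hardware support designed for today's heterogeneous data centers. Whether you are managing high-performance computing (HPC) workloads or operating in cloud-native environments, Slinky helps combine the strengths of both for efficient resource management and scheduling.

Get Slinky support

Support, training, and consulting services for Slurm and Slinky are now available from NVIDIA. From implementation to customization, get direct engineering help from experts to make full use of Slinky's capabilities.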

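Run large-scale GPU workloads

Most organizations have spent years building up Slurm job scripts, and moving to Kubernetes raises the question of how to avoid maintaining two separate environments. This blog post explains how Slinky manages Kubernetes environments at scale.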

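What is Slinky?

Slinky is an open source toolkit that integrates Slurm with Kubernetes. It is well suited to hybrid computing scenarios, offering flexibility and ease of use to both HPC and cloud-native AI users.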

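Technology

A closer look at Slinky

The main components of the Slinky toolkit are Slurm Operator and Slurm Bridge. Slurm Operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods. Slurm Bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as the Kubernetes scheduler for pods.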

Slurm Operator

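Slurm Operator is at the core of Slinky's functionality. It manages the scaling of Slurm nodes within Kubernetes. Through Slurm Operator, Slinky exposes the full range of Slurm capabilities, such as job allocation, accounting, job dependencies, fair-share scheduling, and priority scheduling.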

Slurm Bridge

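Slurm Bridge enables fast, intelligent workload scheduling in Kubernetes clusters. Slinky uses Slurm Bridge to support co-locating Slurm and Kubernetes workloads, so that both benefit from Slurm's scheduling and scalability.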

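Download Slinky

Slinky is fully open source and hardware agnostic, providing complete transparency and flexibility in resource management and job scheduling. Deploy Slinky, contribute to its development, and integrate it seamlessly into your infrastructure stack.

View Slinky on GitHub at https://github.com/SlinkyProject and join the community!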

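Benefits

Explore the benefits of Slinky

Slinky is well suited to organizations running AI training and large-scale GPU workloads, scientific simulations or data-intensive tasks, and modern cloud-native applications. By removing the need to maintain separate clusters, it simplifies workload management and improves efficiency.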

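Unified resource management

Run Slurm and Kubernetes workloads on the same node pool without duplicating infrastructure. Slinky removes the need to partition clusters between HPC and cloud-native teams, letting both work on shared hardware under a single scheduling layer.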

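Topology-aware GPU scheduling

Slinky uses Slurm's topology-aware scheduling to place distributed workloads on nodes that are closest to each other in the network fabric. This minimizes communication overhead for large-scale AI training and HPC workloads, where inter-node latency directly affects performance.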
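Slurm's real topology-aware placement (for example its tree topology plugin, configured through `topology.conf`) is considerably more involved, but the core idea can be sketched in a few lines. The Python below is a simplified, hypothetical illustration, not Slinky or Slurm code: it models the fabric as a tree of switches and, for a job that needs `count` nodes, picks the lowest-level switch whose subtree has enough free nodes, preferring the tightest fit among switches at the same level. The names `Switch` and `place`, and the example fabric, are assumptions made for this sketch.

```python
# Simplified, hypothetical sketch of topology-aware placement.
# Not Slurm's or Slinky's implementation.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Switch:
    """A switch in a hypothetical tree-shaped network fabric."""
    name: str
    nodes: list[str] = field(default_factory=list)        # nodes attached directly (leaf switches)
    children: list[Switch] = field(default_factory=list)  # lower-level switches

    def height(self) -> int:
        """0 for a leaf switch; each level above adds one network hop."""
        return 0 if not self.children else 1 + max(c.height() for c in self.children)

    def subtree_nodes(self) -> list[str]:
        """All nodes reachable below this switch, in a stable order."""
        out = list(self.nodes)
        for child in self.children:
            out.extend(child.subtree_nodes())
        return out

    def walk(self) -> Iterator[Switch]:
        """Yield this switch and every switch below it (pre-order)."""
        yield self
        for child in self.children:
            yield from child.walk()


def place(root: Switch, free: set[str], count: int) -> Optional[tuple[str, list[str]]]:
    """Choose `count` free nodes under the lowest switch that can hold them all.

    Ties at the same level go to the switch with the fewest free nodes
    (best fit), which keeps larger groups of nearby nodes for later jobs.
    """
    candidates = []
    for sw in root.walk():
        avail = [n for n in sw.subtree_nodes() if n in free]
        if len(avail) >= count:
            candidates.append((sw.height(), len(avail), sw.name, avail))
    if not candidates:
        return None  # not enough free nodes anywhere in the fabric
    _, _, name, avail = min(candidates, key=lambda c: (c[0], c[1]))
    return name, avail[:count]


# Two leaf switches (s0, s1) joined by one spine switch (s2).
fabric = Switch("s2", children=[
    Switch("s0", nodes=["n0", "n1", "n2", "n3"]),
    Switch("s1", nodes=["n4", "n5", "n6", "n7"]),
])
free_nodes = {"n0", "n1", "n4", "n5", "n6"}
print(place(fabric, free_nodes, 2))  # ('s0', ['n0', 'n1'])
print(place(fabric, free_nodes, 4))  # ('s2', ['n0', 'n1', 'n4', 'n5'])
```

In this toy fabric a 2-node job fits entirely under leaf switch `s0`, while a 4-node job cannot fit under either leaf and must span the spine `s2`, so its traffic crosses an extra network hop. Minimizing that kind of spread is what topology-aware scheduling aims for.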

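Kubernetes-native deployment

Because Slinky runs Slurm inside Kubernetes, clusters benefit from Kubernetes-native tooling for autoscaling, observability, and lifecycle management. Teams can adopt Slurm's world-class scheduling while continuing to use their existing Kubernetes tools and workflows.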

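Broad hardware compatibility

Slinky is designed to run on nearly any GPU-accelerated cluster, from on-premises supercomputers to major cloud providers. This hardware-agnostic approach gives organizations the flexibility to deploy consistent scheduling policies across heterogeneous data center environments and avoid vendor lock-in.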

Next steps

Ready to get started?

Download Slinky on GitHub and join the community!

Slurm and Slinky support

Stay up to date on the latest releases and get direct engineering support.

Slinky documentation

Get release notes and quick start guides for Slinky.
