Kubernetes v1.36: Key Upgrades to Workload-Aware Scheduling – 8 Essential Insights

Kubernetes v1.36 marks a major leap forward for workload-aware scheduling, building on the foundation laid in v1.35. This release focuses on AI/ML and batch workloads, which demand more sophisticated scheduling than simple per-Pod logic. By cleanly separating API concerns, introducing atomic scheduling cycles, and enabling advanced features like topology awareness and preemption, v1.36 equips cluster operators with powerful tools to optimize resource utilization. In this article, we break down the eight most important changes you need to understand.

1. Clean Separation of Workload and PodGroup APIs

In v1.35, pod groups and their runtime states were embedded within the Workload resource, creating tight coupling. v1.36 decouples these concepts: the Workload API now acts as a static template, while the new PodGroup API manages runtime state. This separation streamlines the scheduler’s logic—the kube-scheduler directly reads PodGroup objects without needing to parse the Workload. Controllers define a Workload with podGroupTemplates, then stamp out PodGroup instances. This not only improves performance but also enables per-replica sharding of status updates for better scalability. The APIs reside in the scheduling.k8s.io/v1alpha2 group, completely replacing the previous v1alpha1 version.
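The split might look roughly like this in manifest form. Apart from the API group (`scheduling.k8s.io/v1alpha2`), the two kinds, and the `podGroupTemplates` field named above, every field below is an illustrative assumption about the schema, not the final API:

```yaml
# Sketch only: beyond apiVersion, kind, and podGroupTemplates, field
# names here are assumptions about the shape of the v1alpha2 API.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: train-llm
spec:
  podGroupTemplates:
  - name: workers        # static template; controllers stamp out PodGroups from it
    replicas: 1
---
# A PodGroup instance stamped out by a controller. The kube-scheduler
# watches these objects directly; it never parses the Workload.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: train-llm-workers-0
spec:
  workloadRef:           # assumed back-reference to the owning Workload
    name: train-llm
status:
  scheduled: 0           # runtime state lives here, shardable per replica
```

Because the Workload carries no runtime state, only the small PodGroup objects churn during scheduling, which is what makes the per-replica status sharding possible.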

2. Introduction of a PodGroup Scheduling Cycle

The kube-scheduler now features a dedicated PodGroup scheduling cycle, enabling atomic processing of entire workload groups. Instead of scheduling Pods individually, the scheduler evaluates all members of a PodGroup together, ensuring gang constraints are satisfied before committing any Pod. This cycle is the backbone for future enhancements like advanced bin-packing and multi-topology placement. Administrators benefit from more predictable scheduling behavior, as the scheduler can now guarantee that a minimum number of Pods (defined by minCount) are simultaneously runnable before making a decision.
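In manifest terms, the gang constraint the cycle enforces is the PodGroup's `minCount`. Only `minCount` itself is named in this release; the surrounding shape is an assumed sketch:

```yaml
# Sketch: the scheduler admits the group atomically only once at least
# minCount member Pods can all be placed; until then it commits none.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: batch-sim-workers-0
spec:
  minCount: 8      # gang threshold: 8 Pods runnable together, or nothing
```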

3. Topology-Aware Scheduling for Workloads

v1.36 debuts the first iteration of topology-aware scheduling for PodGroups. This feature allows administrators to optimize placements based on node topology—e.g., ensuring Pods of a gang are distributed across failure domains or concentrated in a single rack for low latency. The scheduler uses the PodGroup’s runtime status and topology constraints to make intelligent decisions. While still in early stages, this paves the way for more granular control over data locality and fault tolerance in AI/ML training jobs and batch processing pipelines.
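A constraint of that kind might be expressed as below. The `topologyConstraint` stanza is purely hypothetical (the feature is in its first iteration and the schema is not final); only the well-known node label is real:

```yaml
# Hypothetical sketch of a topology constraint on a PodGroup.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: train-llm-workers-0
spec:
  minCount: 8
  topologyConstraint:                         # assumed field name
    topologyKey: topology.kubernetes.io/zone  # real well-known node label
    policy: Pack    # e.g. Pack into one domain for low latency,
                    # or Spread across failure domains for resilience
```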

4. Workload-Aware Preemption

Preemption logic gets a major upgrade with workload-aware policies. Previously, the scheduler might evict individual Pods without considering their broader workload context, potentially disrupting gang-scheduled jobs. v1.36 introduces workload-aware preemption, where the scheduler evaluates the impact on all Pods in a group before preempting any. For example, it avoids breaking a gang-scheduling minCount requirement. This reduces job failures and improves overall cluster efficiency, especially under resource contention. The preemption subsystem now incorporates PodGroup status and guarantees atomic evictions.

5. Dynamic Resource Allocation via ResourceClaim Support

Building on the Dynamic Resource Allocation (DRA) framework, v1.36 adds ResourceClaim support for workloads. This enables PodGroups to request specialized hardware (e.g., GPUs, FPGAs, or RDMA devices) abstracted behind ResourceClaims. The scheduler coordinates allocation across all Pods in a group, ensuring consistency—for instance, every replica of a training job gets an identical GPU model. This integration simplifies resource management for AI/ML workloads and reduces boilerplate in YAML manifests. Administrators can now define resource templates at the Workload level and have the scheduler resolve them dynamically.
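On the Pod side, the established DRA plumbing already looks like this; the `resource.k8s.io` fields shown are the existing DRA API, while the Workload-level wiring that fans a claim template out to every group member is new in v1.36 and not shown here. The DeviceClass and image names are placeholders:

```yaml
# A claim template describing the device every group member needs.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: trainer-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # placeholder DeviceClass name
---
# Each worker Pod references the template; with workload-aware scheduling
# the scheduler can coordinate these allocations across the whole group.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: trainer-gpu
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest
    resources:
      claims:
      - name: gpu
```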

6. First Phase of Job Controller Integration

To demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new Workload/PodGroup APIs. Jobs can now directly leverage the scheduling improvements without custom controllers. The Job controller creates a Workload template and delegates runtime management to the PodGroup API. This integration means existing Job-based batch workflows can immediately benefit from gang scheduling, topology awareness, and DRA support. Future releases will deepen this integration, extending to CronJobs and custom batch operators.
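Under this integration an ordinary Job should suffice, since the controller derives the Workload and PodGroups itself. How a Job opts in is not specified here, so treat this as an assumed minimal shape (image name is a placeholder):

```yaml
# An ordinary batch Job. With the v1.36 integration, the Job controller
# creates the Workload template and PodGroups behind the scenes, so these
# 8 Pods can be gang-scheduled without a custom controller.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sim
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sim
        image: registry.example.com/sim:latest
```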

7. Improved Performance and Scalability

The architectural overhaul yields tangible performance gains. Because the PodGroup API allows per-replica sharding of status updates, the scheduler avoids bottlenecks when watching thousands of groups. The Workload object, now a static template, is updated only when the job definition changes, reducing API server load. Moreover, the scheduler no longer needs to watch Workloads—only PodGroups. This decoupling significantly cuts down control plane traffic for large-scale clusters running many concurrent batch jobs. Early benchmarks show a 30% reduction in scheduling latency for workloads with hundreds of Pods.

8. Streamlined Scheduler Logic and Future-Readiness

By reading PodGroup objects directly, the kube-scheduler’s code becomes simpler and more maintainable. All scheduling decisions rely on a single, self-contained object that includes the gang policy, topology constraints, and runtime status. This paves the way for upcoming features like cost-aware scheduling, advanced bin-packing, and multi-cluster workload orchestration. Administrators will find it easier to extend scheduling plugins without touching core logic. v1.36 sets the stage for a new era of workload-aware scheduling in Kubernetes.

In summary, Kubernetes v1.36 dramatically advances workload-aware scheduling by separating API concerns, introducing atomic PodGroup cycles, and enabling topology-aware placement, intelligent preemption, and DRA support. These eight improvements bring production-grade capabilities to AI/ML and batch workloads, making Kubernetes a stronger platform for modern data-intensive applications. As the ecosystem evolves, these foundations will unlock even more sophisticated scheduling strategies.
