Revision as of 21:22, 12 January 2024

Internal

YuniKorn Concepts

Overview

YuniKorn core is a universal scheduler that can be used to assign Application resource Allocations to Nodes that expose resources. Its default implementation allocates Kubernetes pods to Kubernetes nodes, where multiple pods belong to an application and request resources such as memory, cores and GPUs. However, Applications, Allocations and Nodes can be mapped onto an arbitrary domain. The scheduler assumes that different Allocations may have different priorities, and it processes higher-priority Allocations first. The scheduler also has the concept of preemption.

Application

An application is an abstract programmatic entity that requires resources to execute. The application expresses its resource needs by issuing Allocation requests. The scheduler handles each request by trying to find a Node that can accommodate the resources that specific request asks for. In the default Kubernetes implementation, an application is any higher-level workload resource that creates pods: deployments, jobs, etc.

Application Lifecycle

Application states: NEW, ACCEPTED, STARTING, RUNNING, .....

An application gets added as NEW. The application transitions from NEW to ACCEPTED when the first request (Ask) is added to the application. It then moves to STARTING when the Allocation is created. This is the point at which the request (Ask) gets assigned to a node, and it now shows as an Allocation on the application.

If another Ask is added and a second one gets allocated, the application state changes to RUNNING immediately. If there is no other Ask, and thus no second Allocation, the application stays in the STARTING state for a maximum of 5 minutes and then automatically transitions to RUNNING. This is to support state-aware scheduling. It has no impact on the scheduler or on the pods unless state-aware scheduling has been turned on. To configure an application to transition to RUNNING after the first allocation Ask, place the tag "application.stateaware.disable": "true" on the AddApplicationRequest when creating the application.
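The transitions described above can be sketched as a small state machine. This is only an illustration: the type and method names (AppState, App, AddAsk, AddAllocation, ExpireStarting) are hypothetical and are not taken from the YuniKorn code base. The sketch also assumes that the "application.stateaware.disable" tag makes the application skip the STARTING wait altogether.

```go
package main

import "time"

// AppState is a hypothetical enum covering only the states named above.
type AppState int

const (
	New AppState = iota
	Accepted
	Starting
	Running
)

// startingTimeout mirrors the 5-minute maximum described in the text.
const startingTimeout = 5 * time.Minute

// App is an illustrative stand-in, not the YuniKorn application object.
type App struct {
	State             AppState
	StateAwareDisable bool // models the "application.stateaware.disable" tag
}

// AddAsk moves a NEW application to ACCEPTED when its first ask arrives.
func (a *App) AddAsk() {
	if a.State == New {
		a.State = Accepted
	}
}

// AddAllocation applies the transition triggered by a new allocation.
func (a *App) AddAllocation() {
	switch a.State {
	case Accepted:
		if a.StateAwareDisable {
			a.State = Running // assumption: the tag skips the STARTING wait
		} else {
			a.State = Starting
		}
	case Starting:
		a.State = Running // a second allocation ends STARTING immediately
	}
}

// ExpireStarting moves a STARTING application to RUNNING once it has spent
// at least startingTimeout in that state.
func (a *App) ExpireStarting(elapsed time.Duration) {
	if a.State == Starting && elapsed >= startingTimeout {
		a.State = Running
	}
}
```

In this sketch the caller is responsible for tracking how long the application has been in STARTING and passing that duration to ExpireStarting.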

Allocation

Allocation Ask

An allocation ask can be configured with constraints to run on specific nodes.

Allocation Ask Implementation

This is the sequence of operations of an Allocation Ask.

An Allocation Ask update is externally initiated by invoking SchedulerAPI.UpdateAllocation(), which is then turned into an "RM event", which is forwarded to scheduler.Scheduler.
scheduler.Scheduler handles an rmevent.RMUpdateAllocationEvent "update allocation" event in the handleRMEvent() function, which immediately calls into scheduler.ClusterContext#handleRMUpdateAllocationEvent().
scheduler.ClusterContext#handleRMUpdateAllocationEvent() → scheduler.ClusterContext#processAsks().
scheduler.ClusterContext#processAsks() locates the corresponding partition and calls into scheduler.PartitionContext#addAllocationAsk().
scheduler.PartitionContext#addAllocationAsk() locates the corresponding application.
scheduler.PartitionContext#addAllocationAsk() creates a new objects.AllocationAsk instance.
scheduler.PartitionContext#addAllocationAsk() invokes into objects.Application#AddAllocationAsk() with the newly created objects.AllocationAsk instance.
objects.Application#AddAllocationAsk():
- Computes the delta
- If the application is in the "new" or "completing" state, move it to the "running" state.
- Store the ask in requests.
- Update priority.
- Update total pending resources up the queue hierarchy.

The allocation attempt won't be executed on this thread, but by one of the asynchronous periodic scheduling runs.
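The bookkeeping performed by objects.Application#AddAllocationAsk() can be sketched as follows. All types here (Resource, Queue, Ask, Application) are simplified, hypothetical stand-ins. Treating the delta as the difference against a previously stored ask with the same key, and treating the application priority as the highest ask priority seen, are both assumptions made for the illustration.

```go
package main

// Resource maps a resource name (for example "memory") to a quantity.
type Resource map[string]int64

// Queue is a hypothetical queue node that tracks pending resources.
type Queue struct {
	Parent  *Queue
	Pending Resource
}

// addPending adds delta to this queue and to every ancestor up to the root.
func (q *Queue) addPending(delta Resource) {
	for cur := q; cur != nil; cur = cur.Parent {
		if cur.Pending == nil {
			cur.Pending = Resource{}
		}
		for name, v := range delta {
			cur.Pending[name] += v
		}
	}
}

// Ask is a hypothetical allocation ask.
type Ask struct {
	Key      string
	Resource Resource
	Priority int32
}

// Application is an illustrative stand-in for the application object.
type Application struct {
	State    string
	Requests map[string]*Ask
	Priority int32
	Queue    *Queue
}

// AddAllocationAsk follows the five steps listed above.
func (app *Application) AddAllocationAsk(ask *Ask) {
	// 1. Compute the delta against any ask already stored under the same key.
	delta := Resource{}
	for name, v := range ask.Resource {
		delta[name] = v
	}
	if old, ok := app.Requests[ask.Key]; ok {
		for name, v := range old.Resource {
			delta[name] -= v
		}
	}
	// 2. Move a "new" or "completing" application to "running".
	if app.State == "new" || app.State == "completing" {
		app.State = "running"
	}
	// 3. Store the ask in requests.
	if app.Requests == nil {
		app.Requests = map[string]*Ask{}
	}
	app.Requests[ask.Key] = ask
	// 4. Update priority (assumption: keep the highest priority seen).
	if ask.Priority > app.Priority {
		app.Priority = ask.Priority
	}
	// 5. Update total pending resources up the queue hierarchy.
	app.Queue.addPending(delta)
}
```

Note that nothing in this sketch tries to place the ask on a node, which matches the point that the allocation attempt happens later, in a scheduling run.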

Partition

`PartitionContext`

Node

Nodes can be marked to be schedulable or not.

Node Update Implementation

Sequence of operations resulting in a node update:

A node update is externally initiated by invoking SchedulerAPI.UpdateNode() on externally-detected node change, implemented by the Resource Manager client code. The call gets a si.NodeRequest, instantiated by the client code.
SchedulerAPI.UpdateNode() checks whether the Resource Manager ID is correct and then creates a rmevent.RMUpdateNodeEvent event instance, which wraps the si.NodeRequest, to be sent to scheduler.Scheduler for processing.
scheduler.Scheduler handles the rmevent.RMUpdateNodeEvent event in handleRMEvent(), which immediately calls into scheduler.ClusterContext#handleRMUpdateNodeEvent().
scheduler.ClusterContext#handleRMUpdateNodeEvent() calls processNodes().
processNodes() looks at si.NodeInfo.Action and handles CREATE and CREATE_DRAIN.
If the node is to be created, addNode() is invoked. Otherwise updateNode() is invoked.
scheduler.ClusterContext creates the new objects.Node instance.
scheduler.ClusterContext invokes scheduler.PartitionContext#AddNode().
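The dispatch on the node action described in the sequence above can be sketched as follows. The types are simplified, hypothetical stand-ins for si.NodeInfo, objects.Node, scheduler.PartitionContext and scheduler.ClusterContext. The sketch also assumes that a node created with CREATE_DRAIN starts out unschedulable, which the steps above do not state.

```go
package main

// NodeAction is a hypothetical stand-in for the action carried by si.NodeInfo.
type NodeAction int

const (
	Create NodeAction = iota
	CreateDrain
	Other // any action that is not a creation
)

// NodeInfo is a simplified stand-in for si.NodeInfo.
type NodeInfo struct {
	NodeID string
	Action NodeAction
}

// Node is a simplified stand-in for objects.Node.
type Node struct {
	ID          string
	Schedulable bool
}

// Partition is a simplified stand-in for scheduler.PartitionContext.
type Partition struct {
	Nodes map[string]*Node
}

// AddNode registers the node with the partition.
func (p *Partition) AddNode(n *Node) {
	if p.Nodes == nil {
		p.Nodes = map[string]*Node{}
	}
	p.Nodes[n.ID] = n
}

// Cluster is a simplified stand-in for scheduler.ClusterContext.
type Cluster struct {
	Partition *Partition
	Updated   []string // IDs passed to updateNode, kept for illustration
}

// processNodes routes creations to addNode and everything else to updateNode.
func (c *Cluster) processNodes(infos []NodeInfo) {
	for _, info := range infos {
		switch info.Action {
		case Create, CreateDrain:
			c.addNode(info)
		default:
			c.updateNode(info)
		}
	}
}

// addNode creates the node instance and hands it to the partition.
func (c *Cluster) addNode(info NodeInfo) {
	// Assumption: a node created with CreateDrain starts out unschedulable.
	node := &Node{ID: info.NodeID, Schedulable: info.Action == Create}
	c.Partition.AddNode(node)
}

// updateNode only records the node ID in this sketch.
func (c *Cluster) updateNode(info NodeInfo) {
	c.Updated = append(c.Updated, info.NodeID)
}
```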

Resource

Queue


https://yunikorn.apache.org/docs/user_guide/queue_config


https://yunikorn.apache.org/docs/design/scheduler_configuration/#queue-configuration


https://yunikorn.apache.org/docs/user_guide/resource_quota_management

Partition Root Queue

The partition root queue always has its max resource limit set to the sum of the resources of all nodes in the partition.
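As a minimal illustration of that rule, the root queue limit can be computed by summing the per-node resources. The Resource type and the rootMaxResource function are hypothetical names used only for this sketch.

```go
package main

// Resource maps a resource name (for example "memory") to a quantity.
type Resource map[string]int64

// rootMaxResource returns the sum of the resources of all given nodes,
// which is what the root queue's max resource limit is set to.
func rootMaxResource(nodes []Resource) Resource {
	total := Resource{}
	for _, node := range nodes {
		for name, v := range node {
			total[name] += v
		}
	}
	return total
}
```

For two nodes exposing {"memory": 64, "vcore": 8} and {"memory": 32, "vcore": 4, "gpu": 2}, the result is {"memory": 96, "vcore": 12, "gpu": 2}.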

Leaf Queue

Queue Implementation Details

The maximum resources for a queue are given by maxResource.

Priority

Preemption

Scheduler

The scheduler instance scheduler.Scheduler is the initiator of scheduling runs.

Scheduling Run

The scheduler can execute scheduling runs automatically and periodically, by invoking scheduler.ClusterContext#schedule() every 100 milliseconds, or the runs can be triggered manually, if the scheduler is started with the manual schedule option set to true. To schedule manually, start the scheduler with "auto" mode disabled and invoke scheduler.Scheduler.MultiStepSchedule().
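The difference between the two modes can be sketched with a ticker-driven loop and a synchronous stepping method. The Scheduler type, its schedule field, and the Start and MultiStep methods below are hypothetical and do not reproduce the signatures of scheduler.Scheduler. In particular, passing a step count to MultiStep is an assumption.

```go
package main

import "time"

// defaultInterval mirrors the 100 millisecond period mentioned above.
const defaultInterval = 100 * time.Millisecond

// Scheduler is an illustrative stand-in; schedule performs one scheduling run.
type Scheduler struct {
	schedule func()
}

// Start launches the periodic loop unless manual is true. The loop ends when
// the stop channel is closed.
func (s *Scheduler) Start(stop <-chan struct{}, manual bool, interval time.Duration) {
	if manual {
		return // manual mode: runs happen only through MultiStep
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				s.schedule()
			}
		}
	}()
}

// MultiStep performs count scheduling runs synchronously on the caller's
// goroutine.
func (s *Scheduler) MultiStep(count int) {
	for i := 0; i < count; i++ {
		s.schedule()
	}
}
```

Manual mode is mostly useful in tests, where the caller wants to know exactly how many scheduling runs have happened before checking the result.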

Scheduling Run Implementation

If it is a manually initiated scheduling run, trigger it with scheduler.Scheduler.MultiStepSchedule().
scheduler.ClusterContext#schedule() is the main scheduling routine. It processes each partition, walking over each queue and each application to check whether anything can be scheduled.
For a partition, the allocation logic tries in turn: reservations, placeholder replacements and regular pending resource asks allocation.
A regular pending resource ask allocation starts with the partition's root queue and recursively walks down in objects.Queue#TryAllocate(). Allocation logic is only executed for leaf queues; non-leaf queues are just recursively walked down.
For each leaf queue, we get the headroom and then iterate over the sorted applications, skipping those that do not have pending requests.
For each application we try to allocate with objects.Application#tryAllocate().
Understand head room. Compute users' headroom.
Iterate over nodes given by the "node iterator" and invoke objects.Application#tryNodes() that internally iterates on nodes and invokes objects.Application#tryNode() on each.
For the first node that fits the allocation ask, the application creates a new objects.Allocation instance that carries the node's ID.
Call AddAllocation() on the node. This mutates the internal state of the node, by adding the resources asked to allocatedResources and removing them from availableResources.
The call unwinds returning the objects.Allocation instance.
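The walk described in the steps above can be sketched as a recursive descent over the queue tree followed by a first-fit search over the nodes. All types and functions below are simplified, hypothetical stand-ins. Reservations, placeholder replacement, headroom checks, and the sorting of applications and nodes are deliberately left out.

```go
package main

// Resource maps a resource name (for example "memory") to a quantity.
type Resource map[string]int64

// fits reports whether every quantity in ask is available in avail.
func fits(ask, avail Resource) bool {
	for name, v := range ask {
		if avail[name] < v {
			return false
		}
	}
	return true
}

// Node tracks allocated and available resources.
type Node struct {
	ID        string
	Allocated Resource
	Available Resource
}

// Allocation carries the ID of the node the ask was placed on.
type Allocation struct {
	NodeID   string
	Resource Resource
}

// AddAllocation mutates the node: the asked resources are added to Allocated
// and removed from Available.
func (n *Node) AddAllocation(a *Allocation) {
	if n.Allocated == nil {
		n.Allocated = Resource{}
	}
	if n.Available == nil {
		n.Available = Resource{}
	}
	for name, v := range a.Resource {
		n.Allocated[name] += v
		n.Available[name] -= v
	}
}

// tryNodes returns an allocation on the first node that fits the ask, or nil.
func tryNodes(ask Resource, nodes []*Node) *Allocation {
	for _, n := range nodes {
		if !fits(ask, n.Available) {
			continue
		}
		alloc := &Allocation{NodeID: n.ID, Resource: ask}
		n.AddAllocation(alloc)
		return alloc
	}
	return nil
}

// App holds its pending asks in order.
type App struct {
	Pending []Resource
}

// Queue is a leaf when it has no children.
type Queue struct {
	Children []*Queue
	Apps     []*App
}

// TryAllocate walks down non-leaf queues and runs the allocation logic only
// in leaf queues. It returns the first allocation made, or nil.
func (q *Queue) TryAllocate(nodes []*Node) *Allocation {
	if len(q.Children) > 0 {
		for _, child := range q.Children {
			if alloc := child.TryAllocate(nodes); alloc != nil {
				return alloc
			}
		}
		return nil
	}
	for _, app := range q.Apps {
		if len(app.Pending) == 0 {
			continue // skip applications without pending requests
		}
		if alloc := tryNodes(app.Pending[0], nodes); alloc != nil {
			app.Pending = app.Pending[1:]
			return alloc
		}
	}
	return nil
}
```

Returning as soon as one allocation is made mirrors the way the call unwinds with a single objects.Allocation instance. A later scheduling run would pick up the next pending ask.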

