YuniKorn Core Concepts
Overview
YuniKorn core is a universal scheduler that can be used to assign Application resource Allocations to Nodes that expose resources. Its default implementation allocates Kubernetes pods, where multiple pods belong to an application and request resources like memory, cores and GPUs, to Kubernetes nodes. However, Applications, Allocations and Nodes can be mapped onto an arbitrary domain. The scheduler assumes that different Allocations may have different priorities, and performs higher-priority Allocations first. The scheduler also has the concept of preemption.
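To make the vocabulary concrete, the core domain can be pictured with a few simplified Go types. This is a hedged sketch for illustration only; the names and fields are hypothetical and do not mirror the actual structs in yunikorn-core.

package main

import "fmt"

// Resource is a simplified multi-dimensional resource vector (e.g. memory, vcore, GPU).
type Resource map[string]int64

// Node exposes a resource capacity that the scheduler can allocate against.
type Node struct {
    NodeID   string
    Capacity Resource
}

// AllocationAsk is a request for resources issued by an application.
type AllocationAsk struct {
    AskID    string
    Resource Resource
    Priority int32
}

// Allocation is a fulfilled ask: a set of resources bound to a specific node.
type Allocation struct {
    AskID  string
    NodeID string
}

// Application groups asks and allocations, e.g. all pods of one Kubernetes job.
type Application struct {
    AppID       string
    Asks        []AllocationAsk
    Allocations []Allocation
}

func main() {
    app := Application{
        AppID: "job-1",
        Asks:  []AllocationAsk{{AskID: "pod-1", Resource: Resource{"memory": 1024, "vcore": 1}, Priority: 10}},
    }
    fmt.Printf("%s has %d pending ask(s)\n", app.AppID, len(app.Asks))
}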
Application
An application is an abstract programmatic entity that requires resources to execute. The application expresses its resource needs by issuing Allocation requests, which are handled by the scheduler in an attempt to find a Node that can accommodate the resource need of that specific allocation request. In the default Kubernetes implementation, an application is any higher-level workload resource that creates pods: deployments, jobs, etc.
What is a reserved application?
Application Lifecycle
Application states: NEW, ACCEPTED, STARTING, RUNNING, .....
An application gets added as NEW. The application transitions from NEW to ACCEPTED when the first request (Ask) is added to the application. It then moves to STARTING when the Allocation is created, which is the point at which the request (Ask) gets assigned to a node. It now shows as an Allocation on the application.
If another Ask is added and a second one gets allocated, the application state changes to RUNNING immediately. If there is no other Ask, and thus no second Allocation, the application stays in the STARTING state for a maximum of 5 minutes and then automatically transitions to RUNNING. This exists to support state-aware scheduling; it has no impact on the scheduler or on the pods unless state-aware scheduling is turned on. To configure an application to transition to RUNNING after the first allocation Ask, place the tag "application.stateaware.disable": "true" on the AddApplicationRequest when creating the application.
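As an illustration of the tag, a minimal sketch of an application registration carrying it is shown below. The request type here is a hypothetical stand-in for the AddApplicationRequest sent through the scheduler interface, not the real si struct.

package main

import "fmt"

// addApplicationRequest is a hypothetical stand-in for the request that registers
// an application with the scheduler core; the real request travels through the
// scheduler interface (si) as described above.
type addApplicationRequest struct {
    ApplicationID string
    QueueName     string
    Tags          map[string]string
}

func main() {
    req := addApplicationRequest{
        ApplicationID: "app-1",
        QueueName:     "root.blue",
        Tags: map[string]string{
            // Skip the STARTING hold: transition to RUNNING after the first allocation.
            "application.stateaware.disable": "true",
        },
    }
    fmt.Println(req.Tags["application.stateaware.disable"])
}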
Allocation
Allocation Ask
An allocation ask can be configured with constraints to run on specific nodes.
Allocation Ask Implementation
This is the sequence of operations of an Allocation Ask.
- An Allocation Ask update is externally initiated by invoking SchedulerAPI.UpdateAllocation(), which is then turned into an "RM event", which is forwarded to scheduler.Scheduler.
- scheduler.Scheduler handles the rmevent.RMUpdateAllocationEvent "update allocation" event in the handleRMEvent() function, which immediately calls into scheduler.ClusterContext#handleRMUpdateAllocationEvent().
- scheduler.ClusterContext#handleRMUpdateAllocationEvent() → scheduler.ClusterContext#processAsks().
- scheduler.ClusterContext#processAsks() locates the corresponding partition and calls into scheduler.PartitionContext#addAllocationAsk().
- scheduler.PartitionContext#addAllocationAsk() locates the corresponding application.
- scheduler.PartitionContext#addAllocationAsk() creates a new objects.AllocationAsk instance.
- scheduler.PartitionContext#addAllocationAsk() invokes objects.Application#AddAllocationAsk() with the newly created objects.AllocationAsk instance.
- objects.Application#AddAllocationAsk() (see the sketch below):
  - Computes the delta.
  - If the application is in the "new" or "completing" state, it moves to the "running" state.
  - Stores the ask in requests.
  - Updates the priority.
  - Updates total pending resources up the queue hierarchy.
The allocation attempt won't be executed on this thread; it is performed by one of the asynchronous periodic scheduling runs.
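The bookkeeping performed by objects.Application#AddAllocationAsk() can be summarized with the simplified sketch below. The types and field names are hypothetical and only mirror the steps listed above: store the ask, update the application priority, and propagate pending resources up the queue hierarchy.

package main

import "fmt"

type resource map[string]int64

func (r resource) add(o resource) {
    for k, v := range o {
        r[k] += v
    }
}

type ask struct {
    askID    string
    res      resource
    priority int32
}

// queue is a hypothetical queue node used only to show pending-resource propagation.
type queue struct {
    name    string
    parent  *queue
    pending resource
}

type application struct {
    queue    *queue
    requests map[string]ask // keyed by ask ID, mirrors "requests"
    priority int32
}

// addAllocationAsk mirrors the steps listed above. Delta handling (replacing a
// previous ask with the same ID) is omitted for brevity; only the new ask's
// resources are propagated up the queue hierarchy.
func (a *application) addAllocationAsk(newAsk ask) {
    a.requests[newAsk.askID] = newAsk
    if newAsk.priority > a.priority {
        a.priority = newAsk.priority
    }
    for q := a.queue; q != nil; q = q.parent {
        q.pending.add(newAsk.res)
    }
}

func main() {
    root := &queue{name: "root", pending: resource{}}
    leaf := &queue{name: "root.blue", parent: root, pending: resource{}}
    app := &application{queue: leaf, requests: map[string]ask{}}
    app.addAllocationAsk(ask{askID: "ask-1", res: resource{"memory": 512}, priority: 5})
    fmt.Println(root.pending["memory"], leaf.pending["memory"]) // 512 512
}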
Partition
PartitionContext
Node
Nodes can be marked to be schedulable or not.
A node has a "resource capacity", which means total resources contributed by that node, and that is accessible with Node.GetCapacity()
. Internally, that is maintained by the node as totalResource
.
Node Update Implementation
Sequence of operations resulting in a node update:
- A node update is externally initiated by invoking SchedulerAPI.UpdateNode() on an externally detected node change, implemented by the Resource Manager client code. The call gets a si.NodeRequest, instantiated by the client code.
- SchedulerAPI.UpdateNode() checks whether the Resource Manager ID is correct and then creates an rmevent.RMUpdateNodeEvent event instance, which wraps the si.NodeRequest, to be sent to scheduler.Scheduler for processing.
- scheduler.Scheduler handles the rmevent.RMUpdateNodeEvent event in handleRMEvent(), which immediately calls into scheduler.ClusterContext#handleRMUpdateNodeEvent().
- scheduler.ClusterContext#handleRMUpdateNodeEvent() calls processNodes().
- processNodes() looks at si.NodeInfo.Action and handles CREATE and CREATE_DRAIN.
- If the node is to be created, addNode() is invoked. Otherwise updateNode() is invoked.
- scheduler.ClusterContext creates the new objects.Node instance.
- scheduler.ClusterContext invokes scheduler.PartitionContext#AddNode().
- scheduler.PartitionContext#AddNode() invokes scheduler.PartitionContext#addNodeToList().
- scheduler.PartitionContext#AddNode() updates its internal representation of resources contributed by that node (a simplified sketch follows below).
- To continue.
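The last step, updating the partition's view of the resources it has available, can be illustrated with the simplified sketch below. The types are hypothetical stand-ins and only show the idea of folding a node's total capacity into a partition-wide total as the node is registered.

package main

import "fmt"

type resource map[string]int64

func (r resource) add(o resource) {
    for k, v := range o {
        r[k] += v
    }
}

// node carries the capacity reported by the Resource Manager for that node.
type node struct {
    nodeID        string
    totalResource resource // mirrors totalResource / Node.GetCapacity()
    schedulable   bool
}

// partitionContext is a hypothetical stand-in for scheduler.PartitionContext.
type partitionContext struct {
    nodes         map[string]*node
    totalCapacity resource
}

// addNode mirrors AddNode()/addNodeToList(): register the node and fold its
// capacity into the partition-wide total.
func (p *partitionContext) addNode(n *node) {
    p.nodes[n.nodeID] = n
    p.totalCapacity.add(n.totalResource)
}

func main() {
    p := &partitionContext{nodes: map[string]*node{}, totalCapacity: resource{}}
    p.addNode(&node{nodeID: "node-1", totalResource: resource{"memory": 8192, "vcore": 4}, schedulable: true})
    p.addNode(&node{nodeID: "node-2", totalResource: resource{"memory": 4096, "vcore": 2}, schedulable: true})
    fmt.Println(p.totalCapacity) // map[memory:12288 vcore:6]
}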
Resource
Max Resources
Guaranteed Resources
Queue
Organizatorium
The queue configuration can change while the scheduler is running.
The queues defined in the queue configuration are considered managed queues.
Each queue has priority-related state:
- currentPriority - the current scheduling priority
- priorityOffset
- priorityPolicy
- prioritySortEnabled
The queue also caches priorities of its children, in an internal map.
Each queue has a maximum number of running applications it can accommodate, which can be set in configuration. What happens if the number is reached?
Max resource
Guaranteed resource
name: blue
submitacl: "*"
resources:
  guaranteed:
    memory: 1
  max:
    memory: 10
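As an illustration of how the max limit comes into play later during scheduling, the sketch below computes a queue's headroom under the assumption that headroom is the max resource minus what is already allocated, capped by the parent's headroom. The types and helpers are hypothetical and simplified.

package main

import "fmt"

type resource map[string]int64

// sub returns r - o per resource type (no negative clamping, for brevity).
func sub(r, o resource) resource {
    out := resource{}
    for k, v := range r {
        out[k] = v - o[k]
    }
    return out
}

// minRes returns the component-wise minimum; a type missing in b is treated as unlimited.
func minRes(a, b resource) resource {
    out := resource{}
    for k, v := range a {
        if w, ok := b[k]; ok && w < v {
            out[k] = w
        } else {
            out[k] = v
        }
    }
    return out
}

type queue struct {
    parent    *queue
    max       resource // "max" from the configuration above
    allocated resource
}

// headRoom is a hedged sketch: what the queue could still allocate, assuming
// headroom = max - allocated, limited by the parent's headroom.
func (q *queue) headRoom() resource {
    room := sub(q.max, q.allocated)
    if q.parent != nil {
        room = minRes(room, q.parent.headRoom())
    }
    return room
}

func main() {
    root := &queue{max: resource{"memory": 100}, allocated: resource{"memory": 60}}
    blue := &queue{parent: root, max: resource{"memory": 10}, allocated: resource{"memory": 1}}
    fmt.Println(blue.headRoom()) // map[memory:9]; the queue's own max is the tighter limit here
}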
Partition Root Queue
The partition root queue maps over an entire partition, and updates to resources available to the root queue are performed automatically and calculated based on the nodes registered with the partition.
Configuring max and guaranteed resources on the root queue is not allowed.
The partition root queue always has its max resource limit set to the sum of the resources of all nodes in the partition.
Leaf Queue
Queue Configuration
maxapplications
maxapplications is an unsigned integer value that can be used to limit the number of running applications for the configured user or group (it looks like it's per queue).
name: blue
[...]
maxapplications: 1
[...]
For a hierarchy of queues, the parent's maxapplications must be larger than the maxapplications of any of its children, but it does not have to be greater than or equal to the sum of the maxapplications values of all of its children. This means that a parent queue can limit the number of applications running across its children independently of the children's maxapplications values. Internally, the value is maintained as maxRunningApps.
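A hedged sketch of how such a limit could be enforced along the queue hierarchy is shown below. The field name maxRunningApps is taken from the text above; everything else is hypothetical and simplified.

package main

import (
    "errors"
    "fmt"
)

// queue is a hypothetical queue node carrying the running-application limit.
type queue struct {
    name           string
    parent         *queue
    maxRunningApps uint64 // "maxapplications" in the configuration; 0 means unlimited
    runningApps    uint64
}

// tryIncRunningApps walks from the leaf queue up to the root; every queue on the
// path must be below its own limit, so a parent can cap its children as a group.
func tryIncRunningApps(leaf *queue) error {
    for q := leaf; q != nil; q = q.parent {
        if q.maxRunningApps != 0 && q.runningApps >= q.maxRunningApps {
            return errors.New("maxapplications reached in queue " + q.name)
        }
    }
    for q := leaf; q != nil; q = q.parent {
        q.runningApps++
    }
    return nil
}

func main() {
    parent := &queue{name: "root.team", maxRunningApps: 3}
    blue := &queue{name: "root.team.blue", parent: parent, maxRunningApps: 2}
    red := &queue{name: "root.team.red", parent: parent, maxRunningApps: 2}
    fmt.Println(tryIncRunningApps(blue)) // <nil>
    fmt.Println(tryIncRunningApps(blue)) // <nil>
    fmt.Println(tryIncRunningApps(red))  // <nil>
    fmt.Println(tryIncRunningApps(red))  // rejected: the parent limit of 3 is reached first
}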
Queue Implementation Details
Max resources for a queue are given by maxResource.
Priority
Preemption
Scheduler
The scheduler instance scheduler.Scheduler is the initiator of scheduling runs.
Scheduling Run
The scheduler can automatically and periodically execute scheduling runs by invoking scheduler.ClusterContext#schedule() with a periodicity of 100 milliseconds, or it can be triggered manually if it is started with the manual schedule option set to true. To schedule manually, start the scheduler with "auto" mode disabled and invoke scheduler.Scheduler.MultiStepSchedule().
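The periodic trigger can be pictured with a minimal sketch of a 100-millisecond scheduling loop. The schedule function and the stop channel here are placeholders, not the real scheduler.ClusterContext API.

package main

import (
    "fmt"
    "time"
)

// run triggers schedule() every interval until stop is closed, mirroring the
// automatic mode described above; in manual mode this loop would not be started
// and scheduling would instead be driven through MultiStepSchedule().
func run(interval time.Duration, schedule func(), stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            schedule()
        case <-stop:
            return
        }
    }
}

func main() {
    stop := make(chan struct{})
    go run(100*time.Millisecond, func() { fmt.Println("scheduling run") }, stop)
    time.Sleep(350 * time.Millisecond) // roughly three runs
    close(stop)
}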
Scheduling Run Implementation
- If it is a manually initiated scheduling run, trigger it with scheduler.Scheduler.MultiStepSchedule().
- scheduler.ClusterContext#schedule() is the main scheduling routine. It processes each partition, walking over each queue and each application to check if anything can be scheduled.
- For a partition, the allocation logic tries, in turn: reservations, placeholder replacements and regular pending resource ask allocation.
- A regular pending resource ask allocation starts with the partition's root queue and recursively walks down in objects.Queue#TryAllocate(). Allocation logic is only executed for leaf queues; non-leaf queues are just recursively walked down.
- For each leaf queue, we get the headroom and then iterate over the sorted applications, skipping those that do not have pending requests.
- For each application we try to allocate with objects.Application#tryAllocate().
- Understand head room. Compute users' headroom.
- Iterate over the nodes given by the "node iterator" and invoke objects.Application#tryNodes(), which internally iterates over the nodes and invokes objects.Application#tryNode() on each.
- For the first node that fits the allocation ask, the application creates a new objects.Allocation instance that carries the node's ID and the allocated resources (see the sketch after this list).
- Call AddAllocation() on the node. This mutates the internal state of the node by adding the asked resources to allocatedResources and removing them from availableResources.
- The call unwinds, returning the objects.Allocation instance to the partition-level scheduler.PartitionContext#tryAllocate(), which updates its state based on the allocation that has just been performed, in scheduler.PartitionContext#allocate().
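The node-selection step can be summarized with the simplified sketch below: iterate over candidate nodes, pick the first one whose available resources fit the ask, create the allocation, and move the asked resources from available to allocated on that node. The types and helper names are hypothetical; the real logic in objects.Application#tryNodes() handles more cases (for example reservations).

package main

import "fmt"

type resource map[string]int64

// fitsIn reports whether ask fits into the available resources.
func fitsIn(available, ask resource) bool {
    for k, v := range ask {
        if available[k] < v {
            return false
        }
    }
    return true
}

type node struct {
    nodeID             string
    availableResources resource
    allocatedResources resource
}

// addAllocation mirrors AddAllocation(): move the asked resources from
// available to allocated on the node.
func (n *node) addAllocation(ask resource) {
    for k, v := range ask {
        n.availableResources[k] -= v
        n.allocatedResources[k] += v
    }
}

type allocation struct {
    nodeID string
    res    resource
}

// tryNodes is a simplified stand-in for the per-ask node loop: the first node
// that fits the ask wins and the allocation is created against it.
func tryNodes(nodes []*node, ask resource) *allocation {
    for _, n := range nodes {
        if fitsIn(n.availableResources, ask) {
            n.addAllocation(ask)
            return &allocation{nodeID: n.nodeID, res: ask}
        }
    }
    return nil // nothing fits; the ask stays pending
}

func main() {
    nodes := []*node{
        {nodeID: "node-1", availableResources: resource{"memory": 256}, allocatedResources: resource{}},
        {nodeID: "node-2", availableResources: resource{"memory": 4096}, allocatedResources: resource{}},
    }
    if alloc := tryNodes(nodes, resource{"memory": 1024}); alloc != nil {
        fmt.Println("allocated on", alloc.nodeID) // node-2
    }
}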