YuniKorn Core Concepts
Overview
YuniKorn core is a universal scheduler that can be used to assign Application resource Allocations to Nodes that expose resources. Its default implementation allocates Kubernetes pods, which belong to applications and request resources such as memory, cores and GPUs, to Kubernetes nodes. However, Applications, Allocations and Nodes can be mapped onto an arbitrary domain. The scheduler assumes that different Allocations may have different priorities, and performs the higher-priority Allocations first. The scheduler also has the concept of preemption.
Application
An application is an abstract programmatic entity that requires resources to execute. The application expresses its resource needs by issuing Allocation requests, which are handled by the scheduler in an attempt to find a Node that can accommodate the resource need for that specific allocation request. In the default Kubernetes implementation, an application is any higher-level workload resource that creates pods: deployments, jobs, etc.
What is a reserved application?
Application Lifecycle
Application states (stored as objects.applicationState):
- NEW
- ACCEPTED
- STARTING
- RUNNING
- REJECTED
- COMPLETING
- COMPLETED
- FAILING
- FAILED
- EXPIRED
- RESUMING
An application gets added as NEW. The application transitions from NEW to ACCEPTED when the first request (Ask) is added to the application. It then moves to STARTING when the Allocation is created. That is the point at which the request (Ask) is assigned to a node. It now shows as an Allocation on the application.
If another Ask is added and a second one gets allocated, the application state changes to RUNNING immediately. If there is no other Ask, and thus no second Allocation, the application stays in the STARTING state for a maximum of 5 minutes and then automatically transitions to RUNNING. This is to support state-aware scheduling. It has no impact on the scheduler or on the pods unless state-aware scheduling is turned on. To configure an application to transition to RUNNING after the first allocation Ask, place the tag "application.stateaware.disable": "true" on the AddApplicationRequest when creating the application.
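A minimal sketch of what setting this tag could look like at the API level; the import path and the exact si field names (ApplicationID, QueueName, PartitionName, Tags) are assumptions based on the scheduler interface, not verified against a specific release:

package main

import (
	"github.com/apache/yunikorn-scheduler-interface/lib/go/si"
)

// buildAddApplicationRequest sketches registering an application with
// state-aware scheduling disabled, so it moves to RUNNING after the
// first allocation instead of waiting in STARTING.
func buildAddApplicationRequest() *si.AddApplicationRequest {
	return &si.AddApplicationRequest{
		ApplicationID: "app-0001",      // illustrative ID
		QueueName:     "root.default",  // illustrative queue
		PartitionName: "default",
		Tags: map[string]string{
			// skip the STARTING hold described above
			"application.stateaware.disable": "true",
		},
	}
}

func main() { _ = buildAddApplicationRequest() }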
Application Execution Timeout
The maximum amount of time an application can be in a running state.
Allocation
An allocation can be issued by an application, or it can be an independent allocation, which does not belong to any application.
Allocation Ask
An allocation ask can be configured with constraints to run on specific nodes.
Each allocation has a key. The key is used by both the scheduler and the resource manager to track allocations. The key does not have to be the resource manager's internal allocation ID, such as the pod name. If the allocation specifies an application ID, the application must be registered in advance; otherwise the scheduler reports "failed to find application ...". Each allocation ask requests resources, and each allocation ask has a priority. An application goes through a state transition after the first allocation ask, eventually reaching "RUNNING".
What is "MaxAllocations"?
Allocation Ask Implementation
This is the sequence of operations of an Allocation Ask.
- An Allocation Ask update is externally initiated by invoking SchedulerAPI.UpdateAllocation(), which is then turned into an "RM event", which is forwarded to scheduler.Scheduler.
- scheduler.Scheduler handles the rmevent.RMUpdateAllocationEvent "update allocation" event in the handleRMEvent() function, which immediately calls into scheduler.ClusterContext#handleRMUpdateAllocationEvent().
- scheduler.ClusterContext#handleRMUpdateAllocationEvent() → scheduler.ClusterContext#processAsks().
- scheduler.ClusterContext#processAsks() locates the corresponding partition and calls into scheduler.PartitionContext#addAllocationAsk().
- scheduler.PartitionContext#addAllocationAsk() locates the corresponding application.
- scheduler.PartitionContext#addAllocationAsk() creates a new objects.AllocationAsk instance.
- scheduler.PartitionContext#addAllocationAsk() invokes objects.Application#AddAllocationAsk() with the newly created objects.AllocationAsk instance.
- objects.Application#AddAllocationAsk():
  - computes the delta;
  - if the application is in the "new" or "completing" state, moves it into the "running" state;
  - stores the ask in requests;
  - updates the priority;
  - updates the total pending resources up the queue hierarchy.
The allocation attempt won't be executed on this thread, but by one of the asynchronous periodic scheduling runs.
Partition
Each partition has a single root queue.
Partition Configuration
partitions:
  - name: <partition_name>
    placementrules:
      - [...]
    limits:
      - [...]
    nodesortpolicy:
      type: <fair|binpacking>
    preemption:
      enabled: true
    queues:
      - [...]
      - [...]
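For orientation, a minimal concrete instance of the structure above; the values are illustrative, and the nodesortpolicy types (fair, binpacking) come from the YuniKorn queue configuration documentation:

partitions:
  - name: default
    nodesortpolicy:
      type: fair
    preemption:
      enabled: true
    queues:
      - name: root
        queues:
          - name: default
            submitacl: "*"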
Partition Implementation Details
The total "partition resource" is the sum of its nodes' "capacity" (node.GetCapacity()
, which is the node's "total resource").
PartitionContext
Node
Nodes can be marked to be schedulable or not.
A node has a "resource capacity", which means total resources contributed by that node, and that is accessible with Node.GetCapacity()
. Internally, that is maintained by the node as totalResource
.
Node Update Implementation
Sequence of operations resulting in a node update:
- A node update is externally initiated by invoking SchedulerAPI.UpdateNode() on an externally-detected node change, implemented by the Resource Manager client code. The call gets a si.NodeRequest, instantiated by the client code.
- SchedulerAPI.UpdateNode() checks whether the Resource Manager ID is correct and then creates a rmevent.RMUpdateNodeEvent event instance, which wraps the si.NodeRequest, to be sent to scheduler.Scheduler for processing.
- scheduler.Scheduler handles the rmevent.RMUpdateNodeEvent event in handleRMEvent(), which immediately calls into scheduler.ClusterContext#handleRMUpdateNodeEvent().
- scheduler.ClusterContext#handleRMUpdateNodeEvent() calls processNodes().
- processNodes() looks at si.NodeInfo.Action and handles CREATE and CREATE_DRAIN.
- If the node is to be created, addNode() is invoked. Otherwise updateNode() is invoked.
- scheduler.ClusterContext creates the new objects.Node instance.
- scheduler.ClusterContext invokes scheduler.PartitionContext#AddNode().
- scheduler.PartitionContext#AddNode() invokes scheduler.PartitionContext#addNodeToList().
- scheduler.PartitionContext#AddNode() updates its internal representation of resources contributed by that node.
- To continue.
Queue
See: https://yunikorn.apache.org/docs/design/scheduler_configuration/#queue-configuration and https://yunikorn.apache.org/docs/user_guide/resource_quota_management
Explain what a queue is. What can one control? Why should one use it?
Queues provide fine-grained resource management for the resources exposed by nodes.
Queues can be defined in the configuration, and in this case they are referred to as managed queues. Also see unmanaged queues.
The queue configuration can change while the scheduler is running. However, updating the configuration of a queue will not trigger further action. For example, if the maximum amount of resources of a queue or its parent becomes smaller, currently running applications are left running. The scheduler takes the new configuration into account as new applications arrive, and there will be convergence over time.
A managed queue can be removed only by changing the configuration, and only if there are no applications running in it. To remove such a queue, the queue is removed from the configuration and marked as draining; when a managed draining queue is empty, it will be removed.
Fully-Qualified Queue Name
The queues in the hierarchy are uniquely identified by fully-qualified names, with name fragments separated by dots ("."). Each fragment corresponds to a queue name in the hierarchy. Example: "root.companyA.development". A queue name cannot contain dots. As long as no two queues with the same parent share the same name, fragments that are part of different fully-qualified queue names can be identical.
root
├── companyA
│   ├── development
│   └── production
└── companyB
    └── development
A queue that does not have children is referred to as a leaf queue. All other queues, including the root queue, which must have at least one child, are referred to as parent queues.
Application Placement in Queue (Placement Rules)
See: https://yunikorn.apache.org/docs/user_guide/placement_rules/ and https://yunikorn.apache.org/docs/design/scheduler_configuration/#placement-rules-definition
A user can place an application in a queue by specifying the name of the queue upon submission. It is also possible to place an application in a queue dynamically, based on placement rules.
A placement rule uses application details to place the application in a queue. Applying the rule either produces a fully-qualified queue name or fails, in which case the next rule in the list is executed. Rules are executed in the order in which they are defined. The first rule that produces a result ends the execution. If all rules are executed and no match occurs, the application is rejected.
A rule may allow the creation of a queue, and in that case, the rule may return a queue that is not defined in the configuration. The queue will be created then. The scheduler manages creation and deletion of such queues. When a dynamically created queue is empty, it will be removed. Such queues are named unmanaged queues. Also see managed queues.
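An illustrative placement-rules fragment; the rule names (provided, tag), the create flag and the value field are documented rule attributes, while the combination shown is made up:

partitions:
  - name: default
    placementrules:
      - name: provided        # use the queue the user named on submission
        create: false
      - name: tag             # otherwise, derive the queue from the namespace tag
        value: namespace
        create: true          # may create an unmanaged queue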
To continue to process https://yunikorn.apache.org/docs/design/scheduler_configuration/#placement-rules-definition
Organizatorium
Each queue has priority-related state:
- currentPriority - the current scheduling priority
- priorityOffset
- priorityPolicy
- prioritySortEnabled
The queue also caches priorities of its children, in an internal map.
Each queue has a maximum number of running applications it can accommodate, which can be set in configuration. What happens if the number is reached?
Partition Root Queue
The partition root queue maps over an entire partition, and updates to resources available to the root queue are performed automatically and calculated based on the nodes registered with the partition. As such, configuring max and guaranteed resources on the root queue is not allowed. The partition root queue always has its max resource limit set to the sum of resources for all nodes in the partition.
Parent Queue
Each queue, with the exception of the root queue, which does not have a parent, must have exactly one parent. The property of being a "parent" is sometimes referred to as the "queue type".
Apparently the parent "type" can be set via configuration; why would one want to do that? And indeed, if parent is set to true, the queue automatically becomes non-leaf, even if it does not have children.
Leaf Queue
Applications can only be submitted to leaf queues.
Queue Resource Configuration
Maximum and guaranteed resources can be configured for a queue; both are optional configuration parameters.
Resource limits are checked recursively up the queue hierarchy.
Resources that are not specified in resources.max and resources.guaranteed are, respectively, not limited (for max resources) and not guaranteed (for guaranteed resources).
Maximum Resources
resources.max is optional.
The value cannot be set on the root queue (when set, it causes the parsing error "root queue must not have resource limits set").
When set on a non-root queue, it places a hard limit on the cumulative size of all allocations that can be serviced by that queue at any point in time. The difference between the resources actually allocated for applications in this queue and the maximum configured value is called "headroom"; the headroom cannot go below zero. It can be zero, though, which means that all resources permitted by resources.max have been allocated.
If an application is placed into a queue without enough headroom, the application stays in the "Accepted" state and is not scheduled until the maximum limit is updated via configuration (How? objects.Queue#SetMaxResource() seems to be designed to only be used with the root queue.)
If resources.max is set to 0, the resource is not available to the queue, so applications that require that resource cannot be scheduled from this queue.
Guaranteed Resources
resources.guaranteed is optional.
Guaranteed resources (resources.guaranteed) are used in calculating the share of the queue and during allocation. The value is used as one of the inputs for deciding which queue to give the allocation to. Preemption uses the guaranteed resource of a queue as a base below which the queue cannot go.
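An illustrative leaf-queue fragment combining both parameters (queue name and values made up):

queues:
  - name: analytics
    resources:
      guaranteed:
        memory: 2G
        vcore: 2
      max:
        memory: 10G
        vcore: 10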
Queue Resource Usage
The resource usage for a leaf queue is the sum of all allocated resources for that queue. The resource usage for a parent queue is the sum of the usage numbers for all its descendants.
Resources:
  Max: map[A00:10]
  Guaranteed: map[A00:1]
  Actual Guaranteed: map[A00:1]
  Allocated: map[A00:9]
  Pending: map[A00:0]
  Preempting: map[]
Permissions (Access Control Lists)
See: https://yunikorn.apache.org/docs/user_guide/acls/
There are scheduler permissions (access control lists, ACLs) and queue ACLs. A scheduler administrator is not, by default, allowed to submit an application or administer the queues in the system. By default, access control is enabled and access is denied.
Queue ACLs are checked recursively up to the root of the queue tree, starting with the lowest point in the tree. If the permissions on a queue do not allow something, the parent is checked, and so on.
Access control lists give access to the users and groups that have been specified in the list. They do not provide the possibility to explicitly remove or deny access to the users and groups specified in the list.
ACL ::= "*" | userlist [ " " grouplist ]
userlist ::= "" | user { "," user }
grouplist ::= "" | group { "," group }
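A few illustrative values; per the grammar above, a leading space means an empty user list, i.e. a group-only ACL (user and group names are made up):

submitacl: "*"          # wildcard: any user may submit
submitacl: "alice,bob"  # two users, no groups
submitacl: " dev,ops"   # leading space: members of the dev or ops groups only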
Application Submission Permissions
Submission permissions relate to the capability of specific users or groups to submit an application to a specific queue. The users or groups that have permission to submit applications to a specific queue can be specified in the queue configuration.
Must explicitly specify submitacl: "*"; if missing, the application won't be accepted, failing with "application rejected: no placement rule matched".
Queue Administration Permissions
Administrative permissions imply application submission permissions and also allow administrative actions, such as terminating an application or moving an application to a different queue.
Queue Configuration
The queue configuration is dynamic: it can be changed while the scheduler is running, without requiring a scheduler restart. The queue configuration changes after an invocation of the corresponding Go API method, of the REST-based API, or after a change to the configuration file.
name: somequeue
parent: true|false
maxapplications: 1
properties:
adminacl:
submitacl:
resources:
limits:
queues:
- [...]
[...]
name
Specifies the queue name, which will become a fragment of the fully-qualified queue name.
parent
A boolean value. See Parent Queue above.
maxapplications
maxapplications is an unsigned integer value that can be used to limit the number of running applications for the configured user or group (it looks like it's per queue).
For a hierarchy of queues, the parent's maxapplications must be larger than the maxapplications of any of its children, but not necessarily larger than or equal to the sum of the maxapplications values of all of its children. This means that a parent queue can limit the number of applications running across its children independently of the children's maxapplications values. Internally, the value is maintained as maxRunningApps.
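A hypothetical fragment illustrating the constraint described above (queue names and numbers are made up):

queues:
  - name: parent
    maxapplications: 10      # larger than any child's value
    queues:
      - name: child1
        maxapplications: 8
      - name: child2
        maxapplications: 8   # 8 + 8 > 10 is allowed: the parent cap still applies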
properties
TODO: https://yunikorn.apache.org/docs/user_guide/queue_config/#properties
application.sort.policy
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#applicationsortpolicy
application.sort.priority
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#applicationsortpriority
priority.policy
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#prioritypolicy
priority.offset
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#priorityoffset
preemption.policy
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#preemptionpolicy
preemption.delay
See: https://yunikorn.apache.org/docs/user_guide/queue_config/#preemptiondelay
submitacl
See Application Submission Permissions.
adminacl
See Queue Administration Permissions.
resources
Sets queue resource limits. User and group limits are set with limits.
max
See Maximum Resources above.
guaranteed
See Guaranteed Resources above.
limits
Sets user and group Limits below.
queues
Array with recursive declaration of queues.
Queue Implementation Details
The max resources for a queue are given by maxResource.
Priority
Preemption
Scheduler
The scheduler instance scheduler.Scheduler is the initiator of scheduling runs.
Scheduling Run
The scheduler can automatically and periodically execute scheduling runs, by invoking scheduler.ClusterContext#schedule() with a periodicity of 100 milliseconds, or a run can be triggered manually, if the scheduler is started with the manual schedule option set to true. To manually schedule, start the scheduler with "auto" mode disabled and invoke scheduler.Scheduler.MultiStepSchedule().
Scheduling Run Implementation
- If it is a manually initiated scheduling run, trigger it with scheduler.Scheduler.MultiStepSchedule().
- scheduler.ClusterContext#schedule() is the main scheduling routine. It processes each partition; it walks over each queue and each application to check if anything can be scheduled.
- For a partition, the allocation logic tries, in turn: reservations, placeholder replacements and regular pending resource ask allocation.
- A regular pending resource ask allocation starts with the partition's root queue and recursively walks down in objects.Queue#TryAllocate(). Allocation logic is only executed for leaf queues; non-leaf queues are just recursively walked down (see the sketch after this list).
- For each leaf queue, we get the headroom and then iterate over the sorted applications, skipping those that do not have pending requests.
- For each application, we try to allocate with objects.Application#tryAllocate().
- Understand headroom. Compute users' headroom.
- Iterate over the nodes given by the "node iterator" and invoke objects.Application#tryNodes(), which internally iterates over the nodes and invokes objects.Application#tryNode() on each.
- For the first node that fits the allocation ask, the application creates a new objects.Allocation instance that carries the node's ID and the allocated resources.
- Call AddAllocation() on the node. This mutates the internal state of the node, by adding the resources asked to allocatedResources and removing them from availableResources.
- The call unwinds, returning the objects.Allocation instance to the partition-level scheduler.PartitionContext#tryAllocate(), which updates its state based on the allocation that has just been performed, in scheduler.PartitionContext#allocate().
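The recursive descent above lends itself to a compact sketch. The following is a self-contained paraphrase, not the actual yunikorn-core code; the types and the tryNodes callback are simplifications standing in for the node iterator:

package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource maps.
type Resource map[string]int64

// fitsIn reports whether ask fits within the remaining headroom.
func fitsIn(ask, headroom Resource) bool {
	for k, v := range ask {
		if v > headroom[k] {
			return false
		}
	}
	return true
}

type Allocation struct{ NodeID string }

// Queue is a simplified queue node; only leaves hold pending asks.
type Queue struct {
	name     string
	children []*Queue
	pending  []Resource
}

// tryAllocate recursively walks the hierarchy; allocation logic only runs
// on leaf queues, mirroring the behavior described for objects.Queue#TryAllocate().
func (q *Queue) tryAllocate(headroom Resource, tryNodes func(Resource) *Allocation) *Allocation {
	if len(q.children) > 0 {
		for _, child := range q.children {
			if alloc := child.tryAllocate(headroom, tryNodes); alloc != nil {
				return alloc
			}
		}
		return nil
	}
	for _, ask := range q.pending {
		if fitsIn(ask, headroom) {
			if alloc := tryNodes(ask); alloc != nil {
				return alloc // first fitting node wins
			}
		}
	}
	return nil
}

func main() {
	root := &Queue{name: "root", children: []*Queue{
		{name: "leaf", pending: []Resource{{"memory": 1024, "vcore": 1}}},
	}}
	// A trivial "node iterator": pretend node-1 always fits.
	alloc := root.tryAllocate(Resource{"memory": 4096, "vcore": 4},
		func(ask Resource) *Allocation { return &Allocation{NodeID: "node-1"} })
	fmt.Println(alloc.NodeID) // node-1
}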
Identity
An application is submitted under a certain identity, which consists of a user and one or more groups.
TO PARSE:
- https://yunikorn.apache.org/docs/user_guide/usergroup_resolution/
User
A user is defined as the identity that submits an application.
Group
The identity an application is submitted under may be associated with one or more groups.
Limits
User and group limits can be defined on both partitions and queues. See: https://yunikorn.apache.org/docs/user_guide/queue_config/#limits
limits:
- limit: <description>
users:
- <user_name or "*">
- <user_name>
groups:
- <group_name or "*">
- <group_name>
maxapplications: <1..maxint>
maxresources:
<resource_name 1>: <0..maxint>[suffix]
<resource_name 2>: <0..maxint>[suffix]
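A filled-in illustration of the structure above (names and numbers made up):

limits:
  - limit: "default user limit"
    users:
      - "*"
    maxapplications: 10
    maxresources:
      memory: 10G
      vcore: 5
  - limit: "ops group limit"
    groups:
      - ops
    maxapplications: 20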
Resource Manager (RM)
YuniKorn communicates with various implementations of resource management systems (Kubernetes, YARN) via a standard interface defined in the yunikorn-scheduler-interface package.
Gang Scheduling
Gang scheduling style can be "hard" (the application will fail after placeholder timeout) or "soft" (after the timeout the application will be scheduled as a normal application).
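In the Kubernetes shim, the style is selected via the scheduling policy parameters pod annotation; a hedged example (the annotation keys are from the YuniKorn gang scheduling guide, the values are illustrative, and the task-groups definition annotation is omitted for brevity):

apiVersion: v1
kind: Pod
metadata:
  name: task0
  labels:
    applicationId: "app-0001"
  annotations:
    yunikorn.apache.org/task-group-name: "group-a"
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
spec:
  schedulerName: yunikorn
  containers:
    - name: main
      image: busybox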