Selection Problem: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
 
(36 intermediate revisions by the same user not shown)
Line 3: Line 3:
* [[Algorithms#Selection_Problem|Algorithms]]
* [[Algorithms#Selection_Problem|Algorithms]]
* [[Sorting Algorithms#Overview|Sorting Algorithms]]
* [[Sorting Algorithms#Overview|Sorting Algorithms]]
* [[The Partition Subroutine]]
* [[The Median Maintenance Problem]]


=Overview=
=Overview=


<span id='Order_Statistic'></span><span id='Order_Statistics'></span>The i<sup>th</sup> '''order statistic''' of a set of n numbers is the i<sup>th</sup> smallest number in the set.  
<span id='Order_Statistic'></span><span id='Order_Statistics'></span>The i<sup>th</sup> '''order statistic''' of a set of n numbers is the i<sup>th</sup> smallest number in the set. Finding the i<sup>th</sup> order statistic of a set of n '''distinct''' numbers (distinctness is for simplicity) is known as the '''selection problem'''.  


Finding the i<sup>th</sup> order statistic of a set of n ''distinct'' numbers  is known as the '''selection problem'''. Finding the median is a particular case of the selection problem. The selection problem can be resolved generically by sorting the entire set and then selecting the desired element, by [[Reduction#Overview|reducing]] the selection problem to the sorting problem. However, key comparison sorting [[Comparison_Sorting_Algorithms_Complexity|cannot be done more efficiently than Ω(n lgn)]], and more specialized and faster algorithms exist.
Finding the median is a particular case of the selection problem, where i is n / 2. Finding the minimum in a sorted array is finding the 0 order statistics and finding the maximum in a sorted array is finding the (n - 1)<sup>th</sup> order statistic.


The general selection problem can be resolved with a randomized divide-and-conquer algorithm with an expected running time of Θ(n). The algorithm is somewhat similar to the one used by randomized [[Quicksort#Quicksort_and_Selection_Problem|Quicksort]].
The selection problem is exposed as the [[Data_Structures#SELECT|SELECT()]] operation of totally ordered sets.


<font color=darkgray>TODO [[CLRS]] page 213.</font>
The selection problem can be resolved generically by sorting the entire set and then selecting the desired element, by [[Reduction#Overview|reducing]] the selection problem to the [[Sorting_Algorithms#Sorting_Problem|sorting problem]]. However, key comparison sorting [[Comparison_Sorting_Algorithms_Complexity|cannot be done more efficiently than Ω(n log n)]], and more specialized and faster algorithms exist for the selection problem.
 
Fundamentally, selection is an easier problem than sorting, so if only a certain order statistic is required for an unsorted set, we '''do not need to sort the set'''.
 
The general selection problem can be resolved with a randomized divide-and-conquer algorithm with an expected running time of Θ(n). The algorithm is somewhat similar to the one used by randomized [[Quicksort#Quicksort_and_Selection_Problem|Quicksort]]. There is also a [[#Deterministic_Selection|linear time algorithm for selection that does not use randomization]]; the idea is to use the pivot deterministically in a very careful way using a method called "median of medians".
 
<font color=darkkhaki>TODO [[CLRS]] page 213.</font>
 
=Randomized Selection=
RSelect has small constants and works in place, and it has a time complexity of O(n) on average. The pseudocode is:
<font size=-1>
RSelect(array A, length n, order statistics i) {
  if n == 1 return A[0]
  choose pivot p from A, uniformly at random
  j = partition(A, p) <font color=teal># j is the order statistic that p is, use the [[The_Partition_Subroutine|partition subroutine]] for that</font>
  if i == j return A[i]
  if j > i return RSelect(first part of A, length j - 1, i)
  if j < i return RSelect(second part of A, n - j, i - j)
}
</font>
Also see: {{Internal|The_Partition_Subroutine#Overview|The Partition Subroutine}}
 
==Randomized Selection Analysis==
{{External|https://www.coursera.org/learn/algorithms-divide-conquer/lecture/obhKq/randomized-selection-analysis}}
 
=Deterministic Selection=
{{External|https://www.coursera.org/learn/algorithms-divide-conquer/lecture/vtehr/deterministic-selection-algorithm-advanced-optional}}
{{External|https://www.coursera.org/learn/algorithms-divide-conquer/lecture/2wmHr/deterministic-selection-analysis-i-advanced-optional}}
{{External|https://www.coursera.org/learn/algorithms-divide-conquer/lecture/vOjvG/deterministic-selection-analysis-ii-advanced-optional}}
DSelect is a deterministic O(n) selection algorithm that uses the pivot very carefully using a method called the "median of medians".
=Selection in Binary Search Trees=
{{Internal|Binary_Search_Trees#SELECT_Implementation|SELECT Implementation for Binary Search Trees}}

Latest revision as of 19:50, 13 October 2021

Internal

Overview

The ith order statistic of a set of n numbers is the ith smallest number in the set. Finding the ith order statistic of a set of n distinct numbers (distinctness is for simplicity) is known as the selection problem.

Finding the median is a particular case of the selection problem, where i is n / 2. Finding the minimum in a sorted array is finding the 0 order statistics and finding the maximum in a sorted array is finding the (n - 1)th order statistic.

The selection problem is exposed as the SELECT() operation of totally ordered sets.

The selection problem can be resolved generically by sorting the entire set and then selecting the desired element, by reducing the selection problem to the sorting problem. However, key comparison sorting cannot be done more efficiently than Ω(n log n), and more specialized and faster algorithms exist for the selection problem.

Fundamentally, selection is an easier problem than sorting, so if only a certain order statistic is required for an unsorted set, we do not need to sort the set.

The general selection problem can be resolved with a randomized divide-and-conquer algorithm with an expected running time of Θ(n). The algorithm is somewhat similar to the one used by randomized Quicksort. There is also a linear time algorithm for selection that does not use randomization; the idea is to use the pivot deterministically in a very careful way using a method called "median of medians".

TODO CLRS page 213.

Randomized Selection

RSelect has small constants and works in place, and it has a time complexity of O(n) on average. The pseudocode is:

RSelect(array A, length n, order statistics i) {
  if n == 1 return A[0]
  choose pivot p from A, uniformly at random
  j = partition(A, p) # j is the order statistic that p is, use the partition subroutine for that
  if i == j return A[i]
  if j > i return RSelect(first part of A, length j - 1, i)
  if j < i return RSelect(second part of A, n - j, i - j)
}

Also see:

The Partition Subroutine

Randomized Selection Analysis

https://www.coursera.org/learn/algorithms-divide-conquer/lecture/obhKq/randomized-selection-analysis

Deterministic Selection

https://www.coursera.org/learn/algorithms-divide-conquer/lecture/vtehr/deterministic-selection-algorithm-advanced-optional
https://www.coursera.org/learn/algorithms-divide-conquer/lecture/2wmHr/deterministic-selection-analysis-i-advanced-optional
https://www.coursera.org/learn/algorithms-divide-conquer/lecture/vOjvG/deterministic-selection-analysis-ii-advanced-optional

DSelect is a deterministic O(n) selection algorithm that uses the pivot very carefully using a method called the "median of medians".

Selection in Binary Search Trees

SELECT Implementation for Binary Search Trees