Array vs Stream—”to heap or not to heap ”

Prerequsites : 🌇kth smallest element in an array

Question

Design a class to find the kth largest element in a stream. Note that it is the kth largest element in the sorted order, not the kth distinct element.

Implement KthLargest class:

KthLargest(int k, int[] nums) Initializes the object with the integer k and the stream of integers nums.
int add(int val) Appends the integer val to the stream and returns the element representing the kth largest element in the stream.

Example 1:

Input
["KthLargest", "add", "add", "add", "add", "add"]
[[3, [4, 5, 8, 2]], [3], [5], [10], [9], [4]]
Output
[null, 4, 5, 5, 8, 8]

Explanation
KthLargest kthLargest = new KthLargest(3, [4, 5, 8, 2]);
kthLargest.add(3);   // return 4
kthLargest.add(5);   // return 5
kthLargest.add(10);  // return 5
kthLargest.add(9);   // return 8
kthLargest.add(4);   // return 8

Leetcode

Reframing

In the previous questions we were given an array without any values being added from a strea. With an array, you can sort it and return the kth element either from the start or the end, depending on whether you want the minimum or maximum value. The time complexity for that will be O(n(log(n))).Using a method like that with a fixed size array is bad, but it's even worse when dealing with a stream. When a new element is added to an array for finding the kth largest/smallest element, the array needs to be sorted again. This process must be repeated for each new element, which increases the complexity to O(n*nlogn).
This results in multiple values for the kth element. Moreover?

class KthLargest:

    def __init__(self, k: int, nums: List[int]):
        self.k = k
        self.array = nums
       
        

    def add(self, val: int) -> int:        
        self.array.append(val)
        self.array.sort()
       
        kthLargest = len(self.array)-(self.k)
        return self.array[kthLargest]

Example: Adding 23 to an already sorted array requires sorting the entire array again.

This question is very similar to the previous two. In "The greedy merchant—'No one is wise from another man's woe,'" we created a heap of the k most expensive items by adding them to a min heap. We needed a min heap because we had to remove the least valuable item to make room for a new valuable item if it came. We needed the least valuable item at the top, which is only possible with a min heap. Similar to the "Kth smallest element in an array" problem, we can use a max heap instead of a min heap. By removing the largest minimum and replacing it with the smaller minimum from the array.

If you are confused about why we will use a min heap to get the maximums or a max heap to get the minimums, look at the section "A Counterintuitive Mindset?" in The greedy merchant—"No one is wise from another man's woe”.

We can start by initializing a bucket of maximums with the largest K items up until that point. To remove the smallest of these, we'll use a min heap. After initilization we can check whether incoming items will replace any elements in the heap or bucket. If they do, we replace the minimum in the heap with the incoming item. If not, that element will never be part of the top K elements.

# example
topK3 = [1,2,3]  # min heap
toAdd = 4

add 4 at the expense of 1 

topK3 = [2,3,4] 

1 will never be in the topK3 again

Implementation

During initialization, we add all items from nums to a min heap. However, we need the heap to contain only the k largest elements at initialization and there might be more element in the intial array then k.
The minimums from nums will be at the top, so we remove the smaller items from the top of the heap until we have the k largest items from nums in the heap.

def __init__(self, k: int, nums: List[int]):
  self.k = k
  self.heap = []
 

  for i in range(len(nums)):
      heappush(self.heap, nums[i])
  

  if len(nums) > k:
      while len(self.heap) > k:
          heappop(self.heap)

Adding new elements from the stream

First, we need to ensure that the heap contains K items. There may be cases where nums is less than K and a heap smaller than K is initialized. In such cases, we simply need to add the incoming item to the heap.

# not enough ks 
sizeLessThanK = len(self.heap) < self.k

if sizeLessThanK: 
   heappush(self.heap, val)

Example : Test case

input = 
["KthLargest","add"]
[[2,[0]],[-1]]

If we have at least k elements,

If the incoming element from the stream is greater than the (smallest element in the heap) top element, we can replace that item with the incoming number.

To retain the size of the heap as k, the item at the top will be removed in favor of the incoming item.
The item at that was removed will never be part of the top k again.

topK3 = [1,2,3] 

add = 4
add 4 at the expense of 1 

topK3 = [2,3,4] 

1 will never be in the topK3 again

if the incoming item is less than the topmost element it has no power to replace it, so we disregrad it.

topK3 = [1,2,3] 

add = 0

topK3 = [1,3,4] 

0 can't be in the top 3

TopHeapValueWillNeverBeinTopKAgain = not(sizeLessThanK) and val > self.heap[0]

elif TopHeapValueWillNeverBeinTopKAgain:
    heappop(self.heap)
    heappush(self.heap,val)
#else: pass

return self.heap[0]

not(sizeLessThanK) ensures that self.heap[0] is not run, which helps us avoid a KeyError when the heap is empty. This strategy is commonly referred to as Short Circuit Evaluation.

To find the Kth largest item in a heap or array, simply return the item located K positions away from the maximum. Since it is a min heap the maximum will be at the bottom and k positions away from it will be the top most item in the heap.

return self.heap[0]

Code

from heapq import heappop, heappush, heapify
class KthLargest:

    def __init__(self, k: int, nums: List[int]):
        self.k = k
        self.heap = []
       

        for i in range(len(nums)):
            heappush(self.heap, nums[i])
        

        if len(nums) > k:
            while len(self.heap) > k:
                heappop(self.heap)
            
    def add(self, val: int) -> int:        
    
        
        sizeLessThanK = len(self.heap) < self.k
        TopHeapValueWillNeverBeinTopKAgain = not(sizeLessThanK) and val > self.heap[0]
        
        if sizeLessThanK: 
            heappush(self.heap, val)
        
        elif TopHeapValueWillNeverBeinTopKAgain:
            heappop(self.heap)
            heappush(self.heap,val)
        
        return self.heap[0]

Exmaples:

If new data is constantly arriving or old data is leaving, and we ONLY need to return the maximum or minimum values, think of using heaps
consider the following scenarios.

There queue for running tasks. The highest-priority task is processed and removed from the queue once completed, while new tasks with varying priority are added to it.
Sales conducted throughout the day, with the top three sales always available.

The two indicators for the viability of using a heap are the need to find the Kth largest/smallest element and the use of a stream. Since both of these criteria are met in this case, using a heap is a viable solution.

To Heapify or not to Heapify

We push all the elements to the heap, creating it. This operation has a time complexity of nlog(n).

for i in range(len(nums)):
	  heappush(self.heap, nums[i])

We can simply use heapify which cost O(N).

O(N)??? if you are curious why it is O(N), look at the section Create O(N) / heapify in A heap upclose: Simulating a max heap.
Heapify is most effective when the elements are known beforehand. Since we only know the "nums" numbers beforehand, we can use heapify for that case. However, we can't use it effectively for elements in a stream.

self.heap = nums
heapify(self.heap)

Complexities

Time Complexity:

Non-Heapify:

Pushing all items to the heap separately has a time complexity of O(len(nums) * log(len(nums))).

n is the number of items added after initialization. Adding items after initialization has a time complexity of n * log(k).

The overall time complexity is O(len(nums) * log(len(nums)) + n * log(k)),

The theoretical time complexity is O(n * log(k)).

Heapify:

Heapify has a time complexity of O(len(nums)).

n is the number of items added after heapify. Adding items after heapify has a time complexity of n * log(k).

The overall time complexity is O(len(nums) + n * log(k)).

The theoretical time complexity is O(n * log(k)).

The space complexity is O(k).

O(len(nums)) is insignificant compared to O(n * log(k)) as the number of added items increases.

Implementation— Keep adding elements till k + 1

There is another way to write the code that keeps adding items until the total number of elements surpasses k.
Let's imagine we are adding a new element. We add it to the heap, and when the size of the heap is k + 1, we remove the smallest element from the heap to balance it back to k.

k = 3
nums = [1,2,3]

# adding 1,2,3 before size reaches k
heap = [1,2,3]

# adding 4
heap = [1,2,3,4]

# size exceeded, remove the smallest to have the *k* largest elements

# remove 1 (smallest of the kLargestElements)
heap = [2,3,4]

Bonus: Use the add method to add all existing values in nums to the heap, thus reducing the amount of code.
Always add the incoming value to the heap. If the heap contains more than k elements, remove the smallest element by popping the top item.
This will allow us to keep only the k largest elements in the heap. The top element will be the kth largest, which can then be returned.

Code

from heapq import heappop, heappush, heapify
class KthLargest:

    def __init__(self, k: int, nums: List[int]):
        self.k = k
        self.heap = []

        for number in nums:
            self.add(number) # call add instead of heap's built in function



    def add(self, val: int) -> int:        
        
        heappush(self.heap, val)

        if len(self.heap) > self.k:
            heappop(self.heap)

        return self.heap[0]

Complexities

Time Complexity: O(n * log(k))

Space Complexity: O(k)

Dedications

💡

DRY(Don't Repeat Yourself) Java code

To peak or not to peak

Both of the solution’s for add above are doing the same thing, but in the way they do it is a little different.
Keep adding and popping.

Add the incoming value and remove the smallest element (if the heap contains more than k elements). This leaves k largest items in the heap, and since it is a min heap, the smallest amongst them becomes the Kth largest.

Peaking and adding.

We must be careful when both peaking and then adding. Before adding incoming items, we should check if they are useful for determining the Kth largest elements. For example, if K is 3 and the heap contains 2, 3, and 4, then an incoming item of 1 will never be included in the K largest values. Since the heap is a min heap of size K, and the incoming item is smaller than the smallest item in the heap, it is automatically smaller than the top K items and is discarded.

k = 3
heap = [2,3,4] 
incoming stream = 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

For peaking and adding. We need to ensure that we do not pop out items when we still do not have k values in the heap. We don’t have to worry about that in the add method as we keep adding items till k + 1.

k = 2
heap = [0]
incoming stream = 1

#make sure zero is not kicked out in favour of 1 before we have reached the size 2(k)

Add (and pop)

Peak and then add(and pop)

def add(self, val: int) -> int:        
        
	heappush(self.heap, val)
	
	if len(self.heap) > self.k:
	    heappop(self.heap)
	
	return self.heap[0]

   def add(self, val: int) -> int:        
    
 
    sizeLessThanK = len(self.heap) < self.k
    TopHeapValueWillNeverBeinTopKAgain = not(sizeLessThanK) and val > self.heap[0]
    
    if sizeLessThanK: 
        heappush(self.heap, val)
    elif TopHeapValueWillNeverBeinTopKAgain:
        heappop(self.heap)
        heappush(self.heap,val)
    
    return self.heap[0]

But what about the time complexities

They both will theoretically be the same as they both can have different worst cases. But in some application the peak and then add might have some benefits.
Imagine you have an arcade machine and the score distribution is skewed towards smaller scores but you want the top 8 items in the score board. While there are incoming scores.

It is better to peek first and then add, rather than adding and then popping. For items that will eventually be discarded, the logarithmic complexity of adding and popping is O(log(k)), whereas peeking first and not adding it into the heap is O(1).