0295. Find Median From Data Stream¶

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value and the median is the mean of the two middle values.

For example, for arr = [2,3,4], the median is 3.
For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

MedianFinder() initializes the MedianFinder object.
void addNum(int num) adds the integer num from the data stream to the data structure.
double findMedian() returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example 1:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

-105 <= num <= 105
There will be at least one element in the data structure before calling findMedian.
At most 5 * 104 calls will be made to addNum and findMedian.

Follow up:

If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Analysis¶

To find the median, we need to split data into two halves, and then use (first half's last element + second half's first element) / 2 (even) or first half's last (odd) to find the median. The first half's last should be the value that is greater or equal to all the values from the first half of the data, and the second half's first is less than all the elements following. To find such two elements, we can use a priority queue.

Use one queue for finding the large median (max heap), and use the other queue to find the smaller median (min-heap). To make two queues with the size difference at most 1, we need to find a way to balance the two queues.

Complexity: Time: add takes $O(\log{n})$ + find takes $O(1)$ Space: two queues take $O(n)$

xxxx m yyyy

q1: xxxx
q2: yyyy

xxxxm yyyy z // push from odd

xxxxzm yyyy -> xxxxxx yyyy -> xxxxx yyyyx -> xxxxx yyyyy -> m = (x + y) / 2
xxxxm zyyyy -> xxxxm yyyyy -> xxxxx yyyyy -> m = (x + y) / 2

xxxm yyyy z // push from even

xxxzm yyyy -> xxxx yyyyx -> xxxxm yyyy
xxxm yyyyz -> xxxx yyyyy -> xxxxm yyyy

if there are even num of nodes
  then push to large then push the top of large to small // now it's odd
else 
  then push to small then push the top of small to large // now it's even

Code¶

class MedianFinder {
private:
    priority_queue<double> small, large;
    bool even;
public:
    /** initialize your data structure here. */
    MedianFinder() {
        even = true;// 0 is even
    }

    void addNum(int num)
    {
        if (even) { // small.size() == large.size()
            large.push(-num);
            small.push(-large.top());
            large.pop(); 
        } else { // small.size() - large.size() == 1
            small.push(num);
            large.push(-small.top());
            small.pop(); 
        }
        even = !even;
    }

    double findMedian()
    {
        return !even ? small.top() : 0.5 * (small.top() - large.top());
    }
};

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder* obj = new MedianFinder();
 * obj->addNum(num);
 * double param_2 = obj->findMedian();
 */

Solution 2: without even boolean¶

Use two priority queues to store the streaming element.

2
small: 2 -> empty -> 2
large: -2 -> empty
median: 2

2, 3
small: 2 -> 3,2 -> 2
large: empty -> -3 
median: (2-(-3))/2 = 2.5

2, 3, 4
small: 2 -> 4, 2 -> 2 -> 3, 2
large: -3 -> -3, -4 -> -4
median: 3

Code¶

class MedianFinder {
private:
    priority_queue<long> small, large;

public:
    /** initialize your data structure here. */
    MedianFinder() {}

    void addNum(int num) {
        // blindly push in two pq
        small.push(num);
        large.push(-small.top());
        // small is always greater than large in size, at most greater by one, equal is fine
        small.pop(); // leave the new element in large
        if (small.size() < large.size()) {
            small.push(-large.top());
            large.pop(); // large has one more, so move it to small
        }
    }

    double findMedian() {
        return small.size() != large.size() ? small.top()
                                           : 0.5 * (small.top() - large.top());
    }
};

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder* obj = new MedianFinder();
 * obj->addNum(num);
 * double param_2 = obj->findMedian();
 */

Followups¶

With all fall in between 0-100?

Code $O(100)$ ¶

class MedianFinder {
    private:
        int A[101], n;

    public:
        MedianFinder() {
            n = 0;
            memset(A, 0, sizeof A);
        }    
        void addNum(int num) {
            A[num]++;
            n++;
        }

        double findMedian() {
            int count = 0, i = 0;
            while (count < n/2) count += A[i++];
            int j = i;
            while (count < n/2 + 1) count += A[j++];
            return (n % 2 == 1) ? j-1 : (i + j - 2) / 2.0;
        }
};

Further Thoughts: https://leetcode.com/articles/find-median-from-data-stream/

There are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:

Buckets! If the numbers in the stream are statistically distributed, then it is easier to keep track of buckets where the median would land, than the entire array. Once you know the correct bucket, simply sort it find the median. If the bucket size is significantly smaller than the size of input processed, this results in huge time saving. @mitbbs8080 has an interesting implementation here.
Reservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
Segment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always. The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.
Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the $k^{th}$ order element stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice.

GNU libstdc++ users are in luck! Take a look at this StackOverflow answer. ↩

Last update: April 1, 2022

0295. Find Median From Data Stream¶

Analysis¶

Solution 1 [Recommend]: use global odd or even to determine median¶

Code¶

Solution 2: without even boolean¶

Code¶

Followups¶

Code O(100)O(100)¶

Code $O(100)$ ¶