Problem Brief

Approximate Percentiles

Hi there! At the moment, FP doesn’t yet support multi-function challenges—like the ones you see on LeetCode—but we’re actively working on adding that capability. Thanks for your patience!

In this problem, you'll implement a method of approximating metric percentiles.

A metric is a numeric measurement that might reflect the performance, health, or behavior of a system over time. For example, a metric of page_load_ms might measure the milliseconds it takes to load your website. In your code, every time your website is visited, you'd measure how long it took and report it to the page_load_ms metric.

A percentile is a common aggregate statistic used by engineers to summarize certain metrics. The n-th percentile is the value where n percent of data points are less than or equal to that value. For example, if the 90th percentile (denoted p90) of page_load_ms is 800, that means 90% of page loads are less than or equal to 800ms.

Looking at a few percentiles (e.g. p50, p90, and p99) helps understand how your system is performing.

While there are several accepted ways to calculate percentiles, for this problem we'll use the following definition:

Let data be a list of all values reported for a metric so far, sorted in ascending order.

Calculate rank = p * n / 100, where p is the percentile you are trying to calculate, ceil is the ceiling function, and n is the length of data.

Finally, percentile p is equal to data[rank-1].

A common challenge of percentiles is calculating them efficiently for a large number of values. You can calculate percentile p through brute force - just sort the data and follow the procedure described above. While this would work, it's not efficient because you need to resort the list every time as values are added.

In this problem, you'll implement a system to efficiently approximate percentiles using a histogram data structure.

Note: we typically attach timestamps to metrics and calculate percentiles over a specific time period, e.g. the last 24 hours. For this problem, we'll ignore time and consider all values reported.

Approximating percentiles

To approximate percentiles, we can use histograms, which group values into fixed-range buckets and track counts in each bucket. For example, with the inclusive range [1, 100] split into 10 buckets, the first bucket counts values in [1, 10], the second bucket [11, 20], and so on.

Given values: [75, 9, 5, 15, 39, 32, 8, 58, 45, 12, 23]

The histogram buckets would be:

[1, 10]: 3
[11, 20]: 2
[21, 30]: 1
[31, 40]: 2
[41, 50]: 1
[51, 60]: 1
[61, 70]: 0
[71, 80]: 1
[81, 90]: 0
[91, 100]: 0

To estimate p60, we:

Calculate the target position: 60% of 11 values is about the 7th value
Count values in buckets until we reach p
Find this lands in bucket [31, 40]
Estimate result as bucket midpoint: 35.5

We're using the bucket midpoint as a simple heuristic - other heuristics are possible. The actual p60, the 7th value, is 32 - our answer of 35.5 is reasonably close. The benefit of this strategy is that even if we have millions of datapoints, our system will still run quickly.

Important note on approximation

This is an "approximate solution" problem. Your implementation will be required to get close to the real answer, but it's not expected to calculate the exact answer.

We recommend not over-optimizing for accuracy. Start simple and pass the early test cases. You can work on improving accuracy if you're still failing tests.

Your task

Implement the Percentiles class with the following methods:

constructor(limit: int): Initialize the system with the guarantee that metrics will have values in the inclusive range [0, limit].
report_data(metric: string, value: int) → void: Report the metric metric with the value value.
percentile(p: int, metric: string) → float: Calculate the p percentile for the metric metric. You can assume that the metric will have some values reported.

🦋 A special thanks to the friend who shared the source!

1Example 1

Input

limit = 100, metric = "page_load_ms", value = 75, p = 60

Output

35.5

Explanation

Given the values [75, 9, 5, 15, 39, 32, 8, 58, 45, 12, 23], the histogram buckets would be:

[1, 10]: 3
[11, 20]: 2
[21, 30]: 1
[31, 40]: 2
[41, 50]: 1
[51, 60]: 1
[61, 70]: 0
[71, 80]: 1
[81, 90]: 0
[91, 100]: 0

To estimate p60, we calculate the target position as 60% of 11 values, which is about the 7th value. Counting values in buckets until we reach p, we find this lands in bucket [31, 40]. We estimate the result as the bucket midpoint, which is 35.5.