CDI SDK
SDK for transporting chunks of data reliably and with low latency using a polled mode network driver.
Loading...
Searching...
No Matches
t_digest.c File Reference

An implementation of the t-digest percentile estimation algorithm developed by Ted Dunning and Otmar Ertl and available here: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf. More...

#include "t_digest.h"
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "configuration.h"
#include "cdi_logger_api.h"
#include "cdi_os_api.h"
#include "utilities_api.h"

Data Structures

struct  Cluster
 This data structure represents a cluster. Each cluster in the t-digest contains a mean and a weight (number of samples). More...
 
struct  TDigest
 The main structure of the t-digest. More...
 

Macros

#define MAX_POSSIBLE_SAMPLE_VALUE   (UINT32_MAX)
 The maximum value for uint32_t type. Used to initialize the mimimum sample value in a digest.
 
#define TAIL_PERCENT_FOR_SINGLE_SAMPLE   (2)
 The amount of the distribution tail to force to be single-sample clusters.
 
#define MAX_FAILED_MERGE_COUNT   (5)
 The amount of times to retry merging before giving up.
 

Functions

static int TDigestClusterCompare (const void *operand1_ptr, const void *operand2_ptr)
 Function used to compare two nodes by looking at their mean values. This is an input to the qsort function.
 
static uint32_t TDigestInterpolate (const uint32_t input_sample1, const uint32_t input_sample2, int sample_index, int total_interpolation_points)
 Function used to interpolate from input_sample1 to input_sample2 by sample_index/num_samples_of_interest.
 
static void TDigestSort (TDigest *td_ptr)
 Function used to sort a TDigest structure.
 
static int TDigestFindCluster (const TDigest *td_ptr, int desired_sample, int *total_samples_ptr)
 Function used to find the cluster index where a given sample number resides. This function walks through all clusters counting samples in each one until it finds the cluster with the requested sample number.
 
static int TDigestGetClusterLimit (const TDigest *td_ptr, int cluster_index)
 Function used to figure out if a proposed max percentile for a cluster is under the percentile limit for that cluster.
 
static bool TDigestMerge (TDigest *td_ptr)
 Function used to merge all clusters of the t-digest. This function follows Algorithm 1 from the above-referenced white paper by Dunning and Ertl, which provides a means to merge a t-digest with a list of additional samples.
 
static uint32_t TDigestCalculatePercentile (TDigest *td_ptr, int percentile)
 Function to run the calculation for a percentile value.
 
bool TDigestCreate (TDigestHandle *ret_td_handle_ptr)
 Function used to create a new t-digest.
 
void TDigestDestroy (TDigestHandle td_handle)
 Function used to free t-digest memory.
 
void TDigestClear (TDigestHandle td_handle)
 Function used to reset t-digest to begin collecting a new set of statistics. This function initializes all the TDigest data members.
 
void TDigestAddSample (TDigestHandle td_handle, uint32_t value)
 Function used to create a new t-digest.
 
bool TDigestGetPercentileValue (TDigestHandle td_handle, int percentile, uint32_t *value_at_percentile_ptr)
 Function used to get the value at a given percentile.
 
int TDigestGetCount (TDigestHandle td_handle)
 Function used to get the number of samples in the digest.
 

Detailed Description

An implementation of the t-digest percentile estimation algorithm developed by Ted Dunning and Otmar Ertl and available here: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.

This algorithm gathers samples of a given metric and stores them in clusters of samples such that each cluster contains a mean and sample count and no other information. As clusters grow and the number of clusters grows, clusters can be combined or created in order to meet algorithm requirements for the max number of clusters and cluster weight (number of samples per cluster). Clusters near the edges of the distribution of samples are scaled such that they contain less samples and clusters near the center of the distribution are scaled such that they contain more samples. Such scaling has the effect of keeping estimation error low. (Nearly) exact values for a given percentile can be calculated from this set of clusters by interpolating between cluster means.

Function Documentation

◆ TDigestAddSample()

void TDigestAddSample ( TDigestHandle td_handle,
uint32_t value )

Function used to create a new t-digest.

Parameters
td_handleHandle for the TDigest object to use.
valueThe value of the new sample to add to the digest.

◆ TDigestCalculatePercentile()

static uint32_t TDigestCalculatePercentile ( TDigest * td_ptr,
int percentile )
static

Function to run the calculation for a percentile value.

Parameters
td_ptrPointer to the TDigest object to use.
percentileInteger representing the percentile being calculated.
Returns
The calculated value at the given percentile.

◆ TDigestClear()

void TDigestClear ( TDigestHandle td_handle)

Function used to reset t-digest to begin collecting a new set of statistics. This function initializes all the TDigest data members.

Parameters
td_handleHandle for the TDigest object to use.

◆ TDigestClusterCompare()

static int TDigestClusterCompare ( const void * operand1_ptr,
const void * operand2_ptr )
static

Function used to compare two nodes by looking at their mean values. This is an input to the qsort function.

Parameters
operand1_ptrPointer to the first cluster to compare.
operand2_ptrPointer to the second cluster to compare.

◆ TDigestCreate()

bool TDigestCreate ( TDigestHandle * ret_td_handle_ptr)

Function used to create a new t-digest.

Parameters
ret_td_handle_ptrPointer for the handle for the new TDigest object.
Returns
True if successful; false if not.

◆ TDigestDestroy()

void TDigestDestroy ( TDigestHandle td_handle)

Function used to free t-digest memory.

Parameters
td_handleHandle for the TDigest object to use.

◆ TDigestFindCluster()

static int TDigestFindCluster ( const TDigest * td_ptr,
int desired_sample,
int * total_samples_ptr )
static

Function used to find the cluster index where a given sample number resides. This function walks through all clusters counting samples in each one until it finds the cluster with the requested sample number.

Parameters
td_ptrPointer to the TDigest object to use.
desired_sampleThe sample number for which we want to find the host cluster.
total_samples_ptrPointer to the total number of samples in all clusters up to and including the cluster selected by this function.
Returns
The zero-based cluster index that is host to the desired sample.

◆ TDigestGetClusterLimit()

static int TDigestGetClusterLimit ( const TDigest * td_ptr,
int cluster_index )
static

Function used to figure out if a proposed max percentile for a cluster is under the percentile limit for that cluster.

Parameters
td_ptrPointer to the TDigest object to use.
cluster_indexThe current cluster index.
Returns
The maximum samples allowed for the given cluster index.

◆ TDigestGetCount()

int TDigestGetCount ( TDigestHandle td_handle)

Function used to get the number of samples in the digest.

Parameters
td_handleHandle for the TDigest object to use.
Returns
The number of samples in the digest.

◆ TDigestGetPercentileValue()

bool TDigestGetPercentileValue ( TDigestHandle td_handle,
int percentile,
uint32_t * value_at_percentile_ptr )

Function used to get the value at a given percentile.

Parameters
td_handleHandle for the TDigest object to use.
percentileThe desired percentile between 0 and 1, inclusive.
value_at_percentile_ptrPointer to the return value where resulting value of percentile search is stored.
Returns
True if successful; false if not.

◆ TDigestInterpolate()

static uint32_t TDigestInterpolate ( const uint32_t input_sample1,
const uint32_t input_sample2,
int sample_index,
int total_interpolation_points )
static

Function used to interpolate from input_sample1 to input_sample2 by sample_index/num_samples_of_interest.

Parameters
input_sample1The starting sample point for the interpolation.
input_sample2The end sample point for the interpolation.
sample_indexThe numerator of the interpolation ratio (sample_index out of total_interpolation_points).
total_interpolation_pointsThe denominator of the interpolation ratio (sample_index out of total_interpolation_points).
Returns
The resulting value from the interpolation.

◆ TDigestMerge()

static bool TDigestMerge ( TDigest * td_ptr)
static

Function used to merge all clusters of the t-digest. This function follows Algorithm 1 from the above-referenced white paper by Dunning and Ertl, which provides a means to merge a t-digest with a list of additional samples.

Parameters
td_ptrPointer to the TDigest object to use.
Returns
True if merge was successful; false if not.

NOTE: We choose to always loop forward to simplify the algorithm, but the white paper referenced above discusses a potential improvement to error rates near q=0 if looping is alternated between forward and reverse from merge to merge.

◆ TDigestSort()

static void TDigestSort ( TDigest * td_ptr)
static

Function used to sort a TDigest structure.

Parameters
td_ptrPointer to the TDigest object to use.