Distributional's testing strategy not only needs to analyze the behavioral consistency of an application; the results of that analysis also need to be understandable by users, regardless of the volume of logs, number of metrics, or complexity of the app. This is why Distributional created the Similarity Index (Sim Index), an automatically calculated value between 0 and 100 that quantifies the deviation between two time periods, or runs.
This allows a user to decide whether to reject H0 in the following hypothesis test:

H0: the recent run and the baseline run are drawn from the same distribution.
H1: the recent run and the baseline run are drawn from different distributions.
In effect, a lower Sim Index should make the user more likely to reject the null hypothesis. This can be done at the application level, for a grouping of metrics, or for a single metric or eval being logged.
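To make this concrete, here is a minimal sketch of how a user might act on a Sim Index value. The threshold is a user-chosen assumption for illustration, not a value prescribed by Distributional, and the same rule could be applied at the app level, for a group of metrics, or for a single metric.

```python
def flag_for_review(sim_index_value: float, threshold: float = 80.0) -> bool:
    """Treat a Sim Index below the (user-chosen) threshold as evidence
    against H0, i.e., as a reason to inspect the run for behavioral drift."""
    return sim_index_value < threshold

print(flag_for_review(93.4))  # False: behavior looks consistent with the baseline
print(flag_for_review(52.1))  # True: enough deviation to warrant a closer look
```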
In this article, we’ll explore Distributional's strategy behind the Sim Index and how it helps users better understand the behavior of their AI applications.
(For more information on Distributional’s strategy to test for consistent app behavior in production, including an introduction to the H0 hypothesis, see the previous article in this series.)
The core concept for computing a Sim Index starts with a single column of numerical or categorical data: this could be one of the evals computed on the text, or a column of non-text data that the user has provided. Let's refer to this column as m_e for the recent time period and m_b for the baseline time period.
Distributional's strategy for computing the Sim Index on this column, denoted by sim(m_e, m_b), is driven by the goal of producing evidence of dissimilarity between m_e and m_b. Through this evidence, the user can decide if these columns are, in fact, different in a way that is significant or relevant for their needs.
To surface such evidence, we assume that H0 is true, i.e., that m_e and m_b are drawn from the same distribution. We then pool the results from both m_e and m_b to facilitate an independent and identically distributed (iid) bootstrapping process, repeatedly resampling pairs of columns (test pairs) from the pooled data. This gives a reference for what should be happening if the two runs were actually drawn from the same distribution.

From each of these test pairs, the same common statistics computed for m_e and m_b are computed. These include, for example, the 10th percentile or the prevalence of the most common categories. This bootstrapping strategy has the benefit of “normalizing” and accounting for the scale of the quantities present, permitting all subsequent analyses to take place on a fixed scale. These bootstrapped quantities also provide the desired evidence of behavioral deviation that is surfaced to the user.
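As a rough illustration of this pooling-and-resampling step, here is a minimal Python sketch. The function name, the choice of statistic (the 10th percentile), and the number of resamples are assumptions made for illustration, not Distributional's implementation.

```python
import numpy as np

def bootstrap_stat_pairs(m_e, m_b, stat=lambda x: np.percentile(x, 10),
                         n_boot=1000, seed=0):
    """Pool m_e and m_b (assuming H0: both come from the same distribution),
    then repeatedly resample iid pairs of the original sizes and compute a
    statistic on each half of the pair ("test pairs")."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([m_e, m_b])
    pairs = []
    for _ in range(n_boot):
        e_star = rng.choice(pooled, size=len(m_e), replace=True)
        b_star = rng.choice(pooled, size=len(m_b), replace=True)
        pairs.append((stat(e_star), stat(b_star)))
    return np.array(pairs)

# Example: a numeric eval logged in the baseline and recent periods
rng = np.random.default_rng(1)
m_b = rng.normal(0.70, 0.05, size=500)   # baseline run
m_e = rng.normal(0.68, 0.05, size=400)   # recent run
pairs = bootstrap_stat_pairs(m_e, m_b)
# The spread of |stat_e - stat_b| across test pairs shows how much the
# statistic "should" differ under H0; the observed gap is judged against it.
print(np.abs(pairs[:, 0] - pairs[:, 1]).mean())
```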
The final Sim Index value for this column is computed through a weighted average of the differences in these common statistics, as well as nonparametric measures such as the Mann-Whitney U statistic. Future versions of the Sim Index may let users customize the weighting to better align with their sense of behavioral deviation, e.g., weighting tail behavior more heavily than the central tendency of the distribution.
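Below is a hedged sketch of how such a weighted combination might look. The specific statistics, the way each observed difference is ranked against its bootstrapped counterparts, the use of the Mann-Whitney U p-value, and the equal weights are all placeholder assumptions; the actual statistics and weighting Distributional uses are not spelled out here.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def sim_index(m_e, m_b, stats=None, weights=None, n_boot=1000, seed=0):
    """Illustrative Sim Index on a numeric column, scaled to 0-100.

    Each statistic's observed |difference| is ranked against its bootstrapped
    (under-H0) differences, the Mann-Whitney U p-value is added as a
    nonparametric signal, and everything is combined in a weighted average.
    """
    if stats is None:
        stats = {
            "p10": lambda x: np.percentile(x, 10),
            "median": np.median,
            "p90": lambda x: np.percentile(x, 90),
        }
    if weights is None:
        weights = {name: 1.0 for name in list(stats) + ["mwu"]}

    rng = np.random.default_rng(seed)
    pooled = np.concatenate([m_e, m_b])
    scores = {}
    for name, fn in stats.items():
        observed = abs(fn(m_e) - fn(m_b))
        boot = np.array([
            abs(fn(rng.choice(pooled, len(m_e))) - fn(rng.choice(pooled, len(m_b))))
            for _ in range(n_boot)
        ])
        # Fraction of under-H0 differences at least as large as the observed
        # one: near 1.0 means the observed gap is unremarkable under H0.
        scores[name] = np.mean(boot >= observed)

    # Nonparametric, scale-free comparison of the two columns.
    scores["mwu"] = mannwhitneyu(m_e, m_b).pvalue

    total = sum(weights.values())
    return 100 * sum(weights[k] * scores[k] for k in scores) / total
```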
A Sim Index can also be generated for a text column by combining the Sim Index values for all the quantities derived from that column, for example the sentiment or toxicity of the text. That combination is primarily the minimum Sim Index value over the derived metrics; this conservative strategy ensures that any potentially troubling metric deviations are clearly surfaced to users. It also allows users to set thresholds on individual metrics, or even on statistics of those metrics, so they can be notified of any deviations.
A Sim Index value is also generated at the app level in the same fashion, over all the columns in the app. This provides a helpful signal to users as to whether there have been deviations in behavior overall.
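For completeness, here is a small sketch of this conservative roll-up, assuming per-metric Sim Index values have already been computed. The metric names and numbers below are purely illustrative.

```python
def rollup_sim_index(per_metric_sim):
    """Conservative roll-up: take the minimum of the constituent Sim Index
    values so the worst deviation is the one that gets surfaced."""
    return min(per_metric_sim.values())

# Per-metric Sim Index values derived from one text column (illustrative numbers)
text_column = {"sentiment": 92.0, "toxicity": 61.0, "readability": 88.0}
print(rollup_sim_index(text_column))  # 61.0 -> the toxicity shift is surfaced

# App-level value, rolled up over every column's Sim Index in the same way
app_columns = {"response_text": 61.0, "latency_ms": 95.0, "token_count": 97.0}
print(rollup_sim_index(app_columns))  # 61.0
```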
Distributional's approach to AI testing provides a flexible framework that teams can quickly get started with and scale to fit their needs. Through the Sim Index and interpretable metrics, it gives teams immediate signals on shifts in behavior and arms them with the relevant evidence to understand those deviations and judge whether they matter for their app. Ultimately, this provides an automated workflow that AI product teams can use to continuously define, understand, and improve AI application behavior in production.
Learn more about Distributional’s testing strategy by downloading our tech paper on Distributional’s Approach to AI Testing.