# kortical.api.superhuman_calibration

# What is Superhuman Calibration?

Superhuman Calibration is the cornerstone of what we call Superhuman Automation: artificial intelligence which can perform a task at human accuracy or above.

Consider a classification task carried out entirely by humans at 80% accuracy. Having built and optimised a model, we find that its accuracy across the whole dataset is only 70%; this is lower than the current human-driven process. Most businesses are only willing to automate a task if it surpasses human accuracy, whether for regulatory or customer-experience reasons; this is a critical reason why an estimated 85% of AI projects are currently considered failures. Crossing the superhuman boundary, on the other hand, typically leads to 70% - 95% automation of a business function.

Using Superhuman Calibration, we identify a large subset of the data on which the ML classifier achieves an accuracy of 95%, 15 percentage points better than humans. Although the classifier now operates on slightly less data (80% automation of the entire task), it performs at superhuman accuracy, so it is now feasible for businesses to automate. The following diagram represents the situation before and after applying calibration.

*Before calibration (not superhuman, not feasible) vs. after calibration (superhuman and feasible!)*

Having calibrated, we can now introduce automation into the process: 80% of all classifications are made by the model, and only 20% of classifications are manual (the subset of data that ML could not address confidently). Wherever the model does make predictions, it does so with better accuracy than humans. Furthermore, Superhuman Automation will be possible for a higher proportion of cases as more training data is collected over time.
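The trade-off described above can be sketched with a toy example (illustrative only, not the Kortical API): keep only predictions above a confidence threshold and compare the accuracy on that automated subset with the accuracy over all rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: 1000 rows, the model outputs a confidence (yhat) per prediction.
n = 1000
yhat = rng.uniform(0.5, 1.0, n)

# In this toy setup, higher-confidence predictions are more likely correct.
correct = rng.uniform(0.0, 1.0, n) < yhat

threshold = 0.9                      # only automate above this confidence
automated = yhat >= threshold

automation_rate = automated.mean()             # share of rows automated
automated_accuracy = correct[automated].mean() # accuracy on automated rows

print(f"automation: {automation_rate:.0%}, "
      f"accuracy on automated rows: {automated_accuracy:.0%}, "
      f"accuracy on all rows: {correct.mean():.0%}")
```

The automated subset is smaller than the full dataset, but its accuracy is markedly higher than the accuracy over all rows, which is exactly the effect calibration exploits.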

Quite simply, Superhuman Calibration allows us to improve ML by identifying a subset of tasks where a model can perform at superhuman levels. This key technology is wrapped with various tools and processes to form Kortical's Superhuman Automation approach, which allows experienced domain experts to keep improving their AI over time and help it adapt to new types of data.

# How does it work?

Many people treat yhat_probs as a synonym for true probability; while the two are certainly correlated, we cannot simply choose a target confidence of 0.7 and expect to be right 70% of the time. We also have to account for the density of observations at various yhat values, the non-linearities this introduces, and the interplay with the other classes and their thresholds. In this optimisation step we find individual thresholds for each class that allow us to best achieve the overall goal.

Our approach is to split the test set in two, creating a calibration set and a new test set. We then run an optimiser to find the thresholds for the various classes that allow us to hit the desired target accuracy level. We then evaluate these thresholds on the test set to prove that they generalise to new data, and to understand the likely error bounds. Wherever the thresholds are not met, we return a not automated or for review class instead. This may be an existing class or a new class that we introduce in this step.
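The split-and-optimise procedure can be sketched as follows (a minimal single-class illustration, not the Kortical implementation, which tunes per-class thresholds jointly): fit the lowest threshold that meets the target accuracy on the calibration half, then verify it on the held-out test half.

```python
import numpy as np

def fit_threshold(yhat, correct, target_accuracy):
    """Grid-search the lowest confidence threshold whose automated
    subset meets the target accuracy on the calibration split."""
    for t in np.linspace(0.5, 1.0, 101):
        mask = yhat >= t
        if mask.sum() and correct[mask].mean() >= target_accuracy:
            return t
    return None  # target unachievable on this data

rng = np.random.default_rng(1)
n = 2000
yhat = rng.uniform(0.5, 1.0, n)
correct = rng.uniform(0.0, 1.0, n) < yhat

# Split the held-out data into a calibration half and a test half.
calib = np.arange(n) < n // 2
test = ~calib

t = fit_threshold(yhat[calib], correct[calib], target_accuracy=0.95)

# Check that the threshold generalises: accuracy on the unseen test half.
test_mask = (yhat >= t) & test
print(f"threshold={t:.2f}, test accuracy={correct[test_mask].mean():.0%}")
```

Rows below the fitted threshold would receive the not automated / for review class rather than a model prediction.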

# calibrate

This function calibrates thresholds such that the overall automated accuracy on the calibration set is as close to the target accuracy as possible. Where important classes are given, the precision of each of these classes is also brought as close to the target accuracy as possible.

Inputs

  • df_fit - The dataset on which the thresholds should be fit.

  • targets - A string, the name of the target to be predicted, or a list of strings in the multi-label setting.

  • target_accuracy - The desired overall (automated) accuracy that should be achieved upon tuning thresholds, as well as the desired (class-specific) precision(s) that should be achieved for the important class(es). Should be given as a float between 0 and 1.

  • non_automated_class (= None) - The label which should represent the decision not to automate a row, given as a string. This may or may not be a label that appears in the training set. This parameter must be specified in the case of a binary classification problem, and must be one of the two labels for the corresponding target.

  • important_classes (= None) - A string or list of strings representing important classes that should have individual thresholds tuned for them. These thresholds are tuned before the generic thresholds to ensure that the important classes have the correct accuracy associated with them. If given as a list, the important classes should be given in order of most important to least important (a determining factor in the order the thresholds are tuned and how certain predictions are handled, depending upon the important_classes_strategy).

  • important_classes_strategy - An enum specifying a strategy to use to handle predictions where two or more important classes meet their thresholds.

Returns

A dictionary containing all the information needed to apply these thresholds to data. It should be passed as the second argument to superhuman_calibration.apply.

Raises

ValueError: Raised whenever df_fit represents a binary classification problem but non_automated_class is not specified (it must be set to one of the two classes seen in the target column).

```python
from kortical.api import superhuman_calibration
# Note: the import path for ImportantClassesStrategy is assumed here.
from kortical.api.superhuman_calibration import ImportantClassesStrategy

calibration_data = superhuman_calibration.calibrate(
    df_fit,
    targets,
    target_accuracy,
    non_automated_class=None,
    important_classes=None,
    important_classes_strategy=ImportantClassesStrategy.max_yhat_in_important_classes)
```

# apply

This function applies thresholds (tuned in the calibrate function) to a dataset, yielding predictions utilising the given important class strategy.

Inputs

  • df - The dataframe on which to apply thresholds. This should contain the relevant yhat_probs columns.

  • calibration_data - The thresholds that you wish to apply; this is the dictionary returned from superhuman_calibration.calibrate (see above).

  • in_place (= False) - Determines whether to modify the input dataframe (df) in place. By default, a new dataframe is returned (leaving the original input dataframe unmodified).

Returns

A dataframe identical to df but with an updated predicted_{target} column for each target, each one being the result of applying the thresholds and strategy specified by calibration_data to the appropriate yhat_probs columns.

Raises

Exception: Raised whenever an invalid important classes strategy is specified in the calibration_data dict.

```python
from kortical.api import superhuman_calibration

new_df = superhuman_calibration.apply(df, calibration_data, in_place=False)
```

# score

This function prints automation and accuracy statistics for the predicted_{target} columns contained in df, as well as an automated f1 table, both overall and at a class-specific level. It also returns a dictionary containing more extensive information for further processing by the user.

Inputs

  • df - The dataframe containing the predictions which should be scored. This should contain the relevant predicted_{target} column(s).

  • calibration_data - The thresholds that were used for calibration; this is the dictionary returned from superhuman_calibration.calibrate (see above).

Returns

A dictionary containing overall automation metrics, class-specific automation metrics, f1-metrics over the entire dataset (both overall and class-specific) as well as f1-metrics over the subset of the dataset that has been automated (again both overall and class-specific).

```python
from kortical.api import superhuman_calibration

metrics = superhuman_calibration.score(df, calibration_data)
```

The returned dictionary is structured as follows:

```python
{
  'automation_overall': {
    'automation': 0.9,
    'accuracy': 0.8,
    'for_review': 0.1
  },
  'automation_per_class': {
    'class1': {
      'automation': 0.9,
      'accuracy': 0.8,
      'for_review': 0.1
    },
    # ... all classes that can be automated
  },
  'f1_for_automated_rows': {
    'weighted_average': {
      'precision': 0.9,
      'recall': 0.8,
      'f1_score': 0.8,
      'count': 1000
    },
    'classes': {
      'class1': {
        'precision': 0.9,
        'recall': 0.8,
        'f1_score': 0.8,
        'count': 1000
      },
      # ... all classes that can be automated
    }
  },
  'f1_for_all_rows': {
    'weighted_average': {
      'precision': 0.9,
      'recall': 0.8,
      'f1_score': 0.8,
      'count': 1000
    },
    'classes': {
      'class1': {
        'precision': 0.9,
        'recall': 0.8,
        'f1_score': 0.8,
        'count': 1000
      },
      # ... all classes including non-automated
    }
  }
}
```
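As a sketch of consuming this dictionary, the headline figures can be pulled out as follows (key names taken from the example above; the literal values stand in for a real result from score):

```python
# Example values mirroring the documented structure; in practice this
# dictionary comes from superhuman_calibration.score(df, calibration_data).
metrics = {
    'automation_overall': {'automation': 0.9, 'accuracy': 0.8, 'for_review': 0.1},
    'f1_for_automated_rows': {
        'weighted_average': {'precision': 0.9, 'recall': 0.8,
                             'f1_score': 0.8, 'count': 1000},
    },
}

overall = metrics['automation_overall']
summary = (f"Automated {overall['automation']:.0%} of rows at "
           f"{overall['accuracy']:.0%} accuracy; "
           f"{overall['for_review']:.0%} left for review")
print(summary)  # Automated 90% of rows at 80% accuracy; 10% left for review
```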