# Cross-validation

Cross-validation is the practice of estimating the performance of a model (or, in our case, an end-to-end ML solution) on a problem through one or more rounds of splitting the available data into train and validation folds (non-overlapping partitions of the data). In each round, the model is fitted on the train fold, predictions are made on the validation fold, and an evaluation metric is applied to give a view of performance.

The fundamental aim of the method is to make the most efficient use of known data to get the best estimate of how a solution will perform on unknown data. The more rounds of cross-validation used, the better the estimate of the solution's performance on new data, at the expense of training time.
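The round-based procedure described above can be sketched with scikit-learn. This is purely illustrative (Kortical runs its cross-validation internally); the dataset and model here are stand-ins:

```python
# Minimal k-fold cross-validation sketch: split the data into folds,
# fit on the train portion, score on the held-out validation portion,
# and average the per-fold scores to estimate out-of-sample performance.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

scores = []
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])           # fit on the train fold
    preds = model.predict(X[val_idx])               # predict on the validation fold
    scores.append(accuracy_score(y[val_idx], preds))

mean_score = sum(scores) / len(scores)              # estimate of performance on new data
```

More folds (a higher `n_splits`) tighten the estimate but multiply the number of model fits, which is the training-time trade-off noted above.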

Prior to cross-validation, Kortical performs an initial split into a train and test set (more details here), then splits the train set further into three train/validation folds. Performance on the validation folds helps Kortical AutoML search the space of possible machine learning solutions to try, while performance on the test data helps rank the solutions already found. The combined performance on the validation and test splits is fed into a scoring function to produce what we call the balanced score (the main score you will see on the leaderboard and elsewhere). The balanced score introduces a penalty for solutions whose score on the validation folds differs from their score on the test set.
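Kortical's exact balanced-score formula is not given here, but the idea of penalising disagreement between validation and test scores can be sketched as follows. The function name, the averaging, and the `penalty_weight` parameter are all illustrative assumptions, not Kortical's actual implementation:

```python
# Hypothetical balanced score: combine validation and test performance,
# then subtract a penalty proportional to how much they disagree.
# A solution that generalises consistently outranks one whose
# validation and test scores diverge, even if their averages match.
def balanced_score(validation_score: float, test_score: float,
                   penalty_weight: float = 1.0) -> float:
    combined = (validation_score + test_score) / 2
    penalty = penalty_weight * abs(validation_score - test_score)
    return combined - penalty

consistent = balanced_score(0.90, 0.90)   # no disagreement, no penalty
divergent = balanced_score(0.95, 0.85)    # same average, but penalised
```

Here both solutions average 0.90, but the divergent one is pushed down the ranking because its validation and test scores disagree.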

To adjust the number of cross-validation (sometimes known as crossval) folds, add the cross_validation_folds argument to the data_set key of the model specification. For example, if you wanted 5 folds, you could add:

```
- ml_solution:
  - data_set:
    ...
    - cross_validation_folds: 5
    ...
```

Kortical has two distinct flavours of cross-validation, depending on whether a date_index is present in the data.

# Stratified Cross-validation

In the case where there is no date_index, Kortical employs Stratified K-Fold Cross-validation, which splits the data so that each fold receives a balanced sample of the target classes for classification problem types, or randomly samples without replacement for regression problems.
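The class-balancing behaviour can be seen with scikit-learn's `StratifiedKFold` (a sketch of the same technique; Kortical's internal implementation may differ):

```python
# Stratified k-fold on an imbalanced binary target: each validation
# fold preserves the 90/10 class proportions of the full dataset.
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))                    # features are irrelevant to the split
y = np.array([0] * 90 + [1] * 10)         # imbalanced target: 90 zeros, 10 ones

for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # With 5 folds of 20 samples each, every validation fold
    # gets 18 zeros and 2 ones, mirroring the overall 90/10 ratio.
    print(Counter(y[val_idx]))
```

Without stratification, a plain random split could easily produce a fold with no positive examples at all, making the metric on that fold meaningless.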

# Cross-validation Through Time

Where a date_index is present, Kortical instead uses Cross-validation Through Time, which ensures that each validation fold contains observations later in time than those used to train the model. This results in unequal fold sizes but provides a much better estimate of performance on future data.
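The same idea can be sketched with scikit-learn's `TimeSeriesSplit` (illustrative only; Kortical's date_index-based splitting may differ in detail). Note how the training window grows across folds, which is why the fold sizes are unequal:

```python
# Cross-validation through time: every validation fold lies strictly
# after its training data, so the model is always evaluated on the
# "future" relative to what it was trained on.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, ordered by date

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # The latest training index always precedes the earliest
    # validation index; the training set expands each round.
    assert train_idx.max() < val_idx.min()
    print("train:", train_idx, "validate:", val_idx)
```

Stratified splitting would be wrong here: shuffling across time would let the model train on observations from after the validation period, leaking future information and inflating the performance estimate.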