# Cross-validation
Cross-validation is the practice of estimating the performance of a model (or, in our case, an end-to-end ML solution) with respect to a problem through one or more rounds of splitting the available data into `train` and `validation` folds (non-overlapping partitions of the data), fitting the model on the `train` fold, making predictions on the `validation` fold, and applying an evaluation metric to give a view on performance.

The fundamental aim of the method is to make the most efficient use of known data to get the best estimate of how a solution will perform on unknown data. The more rounds of cross-validation we use, the better our estimate of the solution's performance on new data, at the expense of training time.
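The round-based procedure above can be sketched with scikit-learn (an illustrative example only, not Kortical's internal code): split the data into k non-overlapping folds, fit on all but one fold, score on the held-out fold, and average the metric across rounds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy dataset standing in for the user's known data.
X, y = make_classification(n_samples=200, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    # Fit on the train fold, predict on the held-out validation fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

# Averaging across rounds gives the estimate of performance on unseen data.
estimate = np.mean(scores)
```

More folds means more rounds (and more training time) but a lower-variance estimate.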
Prior to cross-validation, Kortical performs an initial split into a `train` and `test` set (more details here), and then splits the `train` set further into three `train`/`validation` folds. The performance on the `validation` folds helps Kortical AutoML search through the space of possible new Machine Learning solutions to try, while the performance against the `test` data is used to help rank the solutions already found. The combined performance on the `validation` + `test` splits is fed into a scoring function to provide what we call the balanced score (the main score you will see on the leaderboard, etc.). This balanced score introduces a penalty for solutions where the score on the `validation` folds differs from the score on the `test` fold.
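The exact form of Kortical's scoring function is not shown here, but the penalty idea can be sketched with a hypothetical function that docks the combined score by the gap between the `validation` and `test` results (the function name and penalty weight below are illustrative assumptions, not the real implementation):

```python
def balanced_score(validation_score, test_score, penalty_weight=1.0):
    """Hypothetical illustration of a balanced score.

    Averages the validation and test scores, then subtracts a penalty
    proportional to how far they diverge. Kortical's actual scoring
    function is not documented in this section.
    """
    mean_score = (validation_score + test_score) / 2
    return mean_score - penalty_weight * abs(validation_score - test_score)
```

Under a scheme like this, a solution scoring 0.90 on both splits ranks above one scoring 0.95 on validation but only 0.85 on test, even though their averages are equal, because the divergence suggests overfitting to one split.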
To adjust the number of cross-validation (sometimes known as crossval) folds, simply add the `cross_validation_folds` argument to the `data_set` key of the model specification. For example, if you wanted 5 folds, you could add:
```yaml
- ml_solution:
  - data_set:
      ...
      - cross_validation_folds: 5
      ...
```
Kortical has two distinct flavours of cross-validation, depending on whether a `date_index` is present in the data.
## Stratified Cross-validation

In the case where we have no `date_index`, Kortical will employ Stratified K-Fold cross-validation, which samples the target in such a way that each fold gets a balanced sample of classes for classification problem types, or randomly samples without replacement in the case of regression problems.
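The classification case can be illustrated with scikit-learn's `StratifiedKFold` (an illustrative sketch, not Kortical's internal code): each validation fold reproduces the overall class ratio, so a rare class is never missing from a fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary target: 80% class 0, 20% class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # placeholder features; the stratified split depends only on y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Count the classes in each validation fold.
fold_counts = [np.bincount(y[val_idx]) for _, val_idx in skf.split(X, y)]
# Every fold of 20 samples preserves the 80/20 ratio: 16 of class 0, 4 of class 1.
```

Without stratification, a plain random split could easily leave a fold with almost no minority-class examples, making the metric on that fold meaningless.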
## Cross-validation Through Time

Where we do have a `date_index`, Kortical will revert to Cross-validation Through Time, which ensures that each `validation` fold contains observations later in time than those used to train the model. This results in unequal fold sizes but provides a much better estimate of performance on future data.
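The same idea can be sketched with scikit-learn's `TimeSeriesSplit` (again only an illustration, not Kortical's implementation): each round trains on an expanding window of earlier observations and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten observations, already ordered by time

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in splits:
    # Every validation index is strictly later than every training index,
    # so the model never trains on the future it is asked to predict.
    assert train_idx.max() < val_idx.min()
```

Because the training window grows each round while the validation block stays at the end, fold sizes are unequal, which matches the behaviour described above.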