# Language
The Kortical language is used to define code that drives all aspects of the AutoML process. It allows data scientists using the platform to control training as much or as little as they need.
# Default code
For a given dataset, we generate default code using the metadata extracted from the dataset's columns. This is Kortical's best initial estimate of the configuration to use for training with that dataset. For example, below is the default code generated for the Titanic dataset.
```
- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name
      - Ticket
    - date
  - models:
    - one_of:
      - xgboost
      - linear
      - random_forest
      - extra_trees
      - decision_tree
      - deep_neural_network
      - lightgbm
```
The default code only contains basic information about the dataset, its features and the model types to try. This view of the Kortical AutoML pipeline is deliberately simplified - the details relating to data preprocessing, feature creation and model parameter selection are omitted. The AutoML fills in these blanks by expanding the code and including these choices in its intelligent optimisation process that runs throughout training. This means you start with a broad search to find the best solution while only providing minimal input.
# Customising the code
The power of the language comes from the ability to customise the code to control the solution search space. This allows data scientists to constrain the search space for a variety of reasons, such as wanting to increase iteration speed, or experiment with the impact of particular parameter changes.
For example, after a couple of iterations of feature engineering and training with the default broad search, we might find that certain parts of the solution aren't that sensitive to the feature changes we are making. Rather than have the AutoML figure out the whole solution from scratch each time, we can define the parts we already know in the model code, whether that's the ballpark of the solution we're looking for or specific preprocessing steps that work well. Kortical will then fix these parts of the solution and find the best solution within the constrained search space. This can massively reduce the dimensionality of the solution search space, meaning Kortical produces high-performing models faster and allows for more rapid iteration.
For example, in the code below we:
- Fix the numeric column `Fare` to have a log transform
- Fix the column `Name` to use TF-IDF for feature creation
- Fix the model type we want to generate to Deep Neural Networks
```
- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare:
        - preprocessing:
          - log
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name:
        - create_features:
          - tf-idf
      - Ticket
    - date
  - models:
    - deep_neural_network
```
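To make these fixed choices more concrete, here is a minimal scikit-learn sketch of roughly what a log transform on `Fare` and TF-IDF feature creation on `Name` amount to. It is purely illustrative and not Kortical's internal implementation; the file path `titanic.csv` is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("titanic.csv")  # placeholder path to the Titanic training data

# "preprocessing: log" on the Fare column (log1p guards against zero fares)
fare_log = np.log1p(df["Fare"].fillna(0))

# "create_features: tf-idf" on the Name column
tfidf = TfidfVectorizer()
name_features = tfidf.fit_transform(df["Name"].fillna(""))

print(fare_log.head())
print(name_features.shape)
```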
Taking this further, we can get as detailed as we like, even controlling the momentum decay on the gradient descent solver we’re using for the Deep Neural Network, or setting a range for max features for TF-IDF.
```
- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare:
        - preprocessing:
          - log
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name:
        - preprocessing:
          - lower_case: True
          - strip_special_characters:
            - remove: -'`,±§:;
        - create_features:
          - tf-idf:
            - use_idf: False
            - smooth_idf: False
            - sublinear_tf: True
            - max_features: range(30000, 50000)
      - Ticket
    - date
  - models:
    - deep_neural_network:
      - hidden_layer_width: 12
      - number_of_hidden_layers: 3
      - hidden_layer_width_attenuation: 0.5
      - activation: relu
      - solver: adam
      - beta_1: 0.85
      - beta_2: 0.945
      - learning_rate_init: 0.12
```
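As a rough guide to what these parameters control, the sketch below maps them onto the like-named scikit-learn arguments. The mapping is an assumption for illustration, not Kortical's implementation, and since `max_features` is given as a search range, a single value from that range is used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# TF-IDF settings from the Name column above; max_features would normally
# be searched over range(30000, 50000), so one value is picked here.
tfidf = TfidfVectorizer(
    lowercase=True,
    use_idf=False,
    smooth_idf=False,
    sublinear_tf=True,
    max_features=40000,
)

# hidden_layer_width: 12, number_of_hidden_layers: 3 and
# hidden_layer_width_attenuation: 0.5 give layer widths of 12, 6 and 3.
mlp = MLPClassifier(
    hidden_layer_sizes=(12, 6, 3),
    activation="relu",
    solver="adam",
    beta_1=0.85,
    beta_2=0.945,
    learning_rate_init=0.12,
)
```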
The AutoML writes out the full expanded code of the solution, so it’s easy for you to copy from what it creates and use this as the basis for refining your customisations. Essentially, Kortical’s patented way of using code to interact with AutoML fills in the blanks that you don’t care to specify with a close-to-optimal solution.
# Base configuration with overrides
The language also allows you to provide a base configuration that applies to a whole section (i.e. every child deeper in the tree), while also setting overrides for specific children. For example, below we are configuring a default handling strategy for missing numeric values for all numeric columns, while using a different strategy for a single column.
```
- ml_solution:
  ...
  - features:
    - numeric:
      # Change default strategy for all columns to fill_mean
      - preprocessing:
        - handle_nan: fill_mean
      - Column1
      - Column2
      - Column3:
        # Override the strategy for Column3 to fill with the column's max value instead
        - preprocessing:
          - handle_nan: fill_max
```
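As an illustration of what these two strategies do, here is a small pandas sketch. The column names are the placeholders from the example above, and this is not how Kortical applies the configuration internally.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [1.0, None, 3.0],
    "Column2": [None, 2.0, 4.0],
    "Column3": [5.0, None, 9.0],
})

# Default strategy for numeric columns: fill missing values with the mean
for col in ["Column1", "Column2"]:
    df[col] = df[col].fillna(df[col].mean())

# Override for Column3: fill missing values with the column's max instead
df["Column3"] = df["Column3"].fillna(df["Column3"].max())

print(df)
```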
# Major sections
# Dataset
The `data_set` section of the code defines all top-level information about the dataset, the problem type it represents and the dataset-related parameters for solving that problem. Below is an example dataset configuration from the fully expanded code of a trained Titanic model.
```
- data_set:
  - target_column: ['Survived']
  - problem_type: classification
  - evaluation_metric: area_under_roc_curve
  - exclude_test_data_from_final_model: False
  - fraction_of_data_set_to_use: 1.0
  - fraction_of_data_to_use_as_test_set: 0.2
  - fix_test_set_boundary_when_downsampling: False
  - cross_validation_folds: 3
  - select_features: 1.0
  - shuffle_data_set: True
  - data_set_random_seed: 1
  - modelling_random_seed: 1536400856
```
The most important parameters are typically the target variable, the problem type and the evaluation metric. However, we also provide the specific random seeds used at the different stages of training that would be required to reproduce the same solution, as well as parameters controlling how much data to use and what proportions go into the training and test sets.
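For intuition, the sketch below shows roughly what the split and cross-validation parameters describe, using scikit-learn. The mapping is an assumption for illustration only, and `X` and `y` are random placeholders standing in for the prepared Titanic features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(891, 5)        # placeholder features (the Titanic training set has 891 rows)
y = np.random.randint(0, 2, 891)  # placeholder binary target (Survived)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # fraction_of_data_to_use_as_test_set: 0.2
    shuffle=True,      # shuffle_data_set: True
    random_state=1,    # data_set_random_seed: 1
)

cv = KFold(n_splits=3)  # cross_validation_folds: 3
```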
Note
Whilst the original dataset in this example contains only 11 columns, once all the preprocessing and feature creation steps have been run, the solution actually uses 1207 columns.
# Features
The `features` section of the code defines the columns to use in training and their type (numeric, categorical, text or date). Within each column, it's also possible to define specific parameters for preprocessing and feature creation. The expanded example below shows how each feature was configured for training this particular ML solution.
```
- features:
  - numeric:
    - Age:
      - preprocessing:
        - handle_nan: fill_median
        - handle_inf: clamp_to_range
        - robust_scale:
          - with_centering: True
          - with_scaling: True
          - quantile_range: [25.0, 75.0]
    - Fare:
      - preprocessing:
        - handle_nan: fill_mean
        - handle_inf: clamp_to_range
        - remove_outliers:
          - sigma_threshold: 3
          - on_removal: clamp_to_range
        - standard_scale:
          - with_mean: True
          - with_std: True
    - Pclass:
      - preprocessing:
        - handle_nan: fill_zero
        - handle_inf: clamp_to_range
        - robust_scale:
          - with_centering: True
          - with_scaling: True
          - quantile_range: [25.0, 75.0]
  - categorical:
    - Sex:
      - create_features:
        - one_hot_encode
    - Embarked:
      - create_features:
        - one_hot_encode
  - text:
    - Name:
      - preprocessing:
        - lower_case: True
        - strip_special_characters:
          - enabled: False
      - create_features:
        - glove:
          - sentence_representation:
            - name: mean_of_words
          - size: 255
          - window: 4
          - min_count: 5
          - batch_size: 304
          - l_rate: 0.0552
    - Ticket:
      - preprocessing:
        - lower_case: True
        - strip_special_characters:
          - enabled: True
          - replace_with_space: |\t\r\n/\\*(){}[]"#
          - remove: -'`,±§:;
          - wrap_in_spaces: @&£$€?!^~<>.
      - create_features:
        - word2vec:
          - sentence_representation:
            - name: mean_of_words
          - size: 174
          - window: 7
          - min_count: 16
```
In this example, we can see different preprocessing configurations for different numeric and text columns, as well as different feature creation choices for the text columns (`Name` is using a GloVe embedding whereas `Ticket` is using Word2Vec).
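The numeric preprocessing above also maps fairly directly onto pandas and scikit-learn operations. The sketch below of the `Age` pipeline is illustrative only: it assumes Kortical's options mirror the like-named scikit-learn parameters, and infinity handling is approximated with a simple clip.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

age = pd.Series([22.0, 38.0, None, 35.0, np.inf])

# handle_nan: fill_median
age = age.fillna(age.median())

# handle_inf: clamp_to_range (clip infinities to the finite min/max)
finite = age[np.isfinite(age)]
age = age.clip(lower=finite.min(), upper=finite.max())

# robust_scale with centering, scaling and the quantile range from the code above
scaler = RobustScaler(with_centering=True, with_scaling=True,
                      quantile_range=(25.0, 75.0))
age_scaled = scaler.fit_transform(age.to_frame())
print(age_scaled)
```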
TIP
By looking at your best model candidates on the leaderboard, you can find patterns around which preprocessing and feature creation options are working best for your problem. When training new model versions, you might want to restrict the search to the options you already know work well for this problem.
# Models
The `models` section of the code defines both the type and parameters of the models trained as part of the AutoML solution. The default models configuration is shown below - it directs the AutoML to try models of all of these types and search for the best model parameter choices for each one.
```
- models:
  - one_of:
    - xgboost
    - linear
    - random_forest
    - extra_trees
    - decision_tree
    - deep_neural_network
    - lightgbm
```
From a fully expanded code sample, we can see the specific parameters chosen for a trained model:
```
- models:
  - deep_neural_network:
    - hidden_layer_width: 12
    - number_of_hidden_layers: 3
    - hidden_layer_width_attenuation: 0.5
    - activation: identity
    - alpha: 0.093107
    - max_iter: 383
    - solver: adam
    - beta_1: 0.99
    - beta_2: 0.9450000000000001
    - epsilon: 1e-08
    - batch_size: auto
    - shuffle: False
    - learning_rate_init: 0.0651
    - early_stopping: True
    - tol: 0.0001
    - validation_fraction: 0.1
    - verbose: True
    - warm_start: True
```
Here we can see:
- We used a deep neural network with three hidden layers containing 12, 6 and 3 nodes respectively (the width attenuation arithmetic is sketched after this list)
- We used an identity activation function
- We used the adam solver
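The layer widths follow from the width, depth and attenuation parameters in the code above. Below is a minimal sketch of that arithmetic, assuming the attenuation factor is applied repeatedly from layer to layer (which is consistent with the numbers shown, though the exact rounding rule is an assumption).

```python
def hidden_layer_sizes(width, n_layers, attenuation):
    """Derive per-layer widths by repeatedly applying the attenuation factor."""
    sizes = []
    for _ in range(n_layers):
        sizes.append(int(round(width)))
        width *= attenuation
    return tuple(sizes)

print(hidden_layer_sizes(12, 3, 0.5))  # -> (12, 6, 3)
```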