# Language

The Kortical language is used to define code that drives all aspects of the AutoML process. It allows data scientists using the platform to control training as much or as little as they need.

# Default code

For a given dataset, we generate default code using the metadata extracted from the dataset's columns. This is Kortical's best initial estimate of the configuration to use for training with that dataset. For example, below is the default code generated for the Titanic dataset.

- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name
      - Ticket
    - date
  - models:
    - one_of:
      - xgboost
      - linear
      - random_forest
      - extra_trees
      - decision_tree
      - deep_neural_network
      - lightgbm

The default code only contains basic information about the dataset, its features and the model types to try. This view of the Kortical AutoML pipeline is deliberately simplified - the details relating to data preprocessing, feature creation and model parameter selection are omitted. The AutoML fills in these blanks by expanding the code and including these choices in its intelligent optimisation process that runs throughout training. This means you start with a broad search to find the best solution while only providing minimal input.

# Customising the code

The power of the language comes from the ability to customise the code to control the solution search space. This allows data scientists to constrain the search space for a variety of reasons, such as wanting to increase iteration speed, or experiment with the impact of particular parameter changes.

For example, after a couple of iterations of feature engineering and training with the default broad search, we might find that certain parts of the solution aren’t that sensitive to the feature changes we are making. Rather than have the AutoML figure out the whole solution from scratch each time, if we know the ballpark of the solution we’re looking for, or specific preprocessing steps that work well, we can define these in the model code. Kortical will then fix these parts and search for the best solution within the constrained space. This can massively reduce the dimensionality of the solution search space, meaning Kortical produces high-performing models faster and allows for more rapid iteration.

For example, in the code below we apply a log transformation to the Fare column during preprocessing, create TF-IDF features from the Name column, and restrict the model search to a deep neural network:

- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare:
        - preprocessing:
          - log
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name:
        - create_features:
          - tf-idf
      - Ticket
    - date
  - models:
    - deep_neural_network

Taking this further, we can get as detailed as we like, even controlling the momentum decay of the gradient descent solver used for the deep neural network, or setting a range for the maximum number of TF-IDF features.

- ml_solution:
  - data_set:
    - target_column: Survived
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - fraction_of_data_set_to_use: 100%
    - cross_validation_folds: 3
  - features:
    - numeric:
      - PassengerId
      - Age
      - SibSp
      - Fare:
        - preprocessing:
          - log
    - categorical:
      - Pclass
      - Sex
      - Parch
      - Cabin
      - Embarked
    - text:
      - Name:
        - preprocessing:
          - lower_case: True
          - strip_special_characters:
            - remove: -'`,±§:;
        - create_features:
          - tf-idf:
            - use_idf: False
            - smooth_idf: False
            - sublinear_tf: True
            - max_features: range(30000, 50000)
      - Ticket
    - date
  - models:
    - deep_neural_network:
      - hidden_layer_width: 12
      - number_of_hidden_layers: 3
      - hidden_layer_width_attenuation: 0.5
      - activation: relu
      - solver: adam
      - beta_1: 0.85
      - beta_2: 0.945
      - learning_rate_init: 0.12
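
Many of these parameter names will look familiar: they line up closely with scikit-learn's TfidfVectorizer and MLPClassifier arguments. As a rough mental model only (an illustrative sketch, not Kortical's internal implementation), the hand-tuned settings above correspond to something like:

```python
# Illustrative sketch only -- not Kortical's internals. The parameter names in
# the code above map closely onto scikit-learn's TfidfVectorizer and MLPClassifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# TF-IDF feature creation for the Name column. max_features would be a single
# value the AutoML picks from the range(30000, 50000) specified above.
name_vectoriser = TfidfVectorizer(
    lowercase=True,
    use_idf=False,
    smooth_idf=False,
    sublinear_tf=True,
    max_features=40000,  # one point inside the requested range
)

# The deep_neural_network settings: hidden_layer_width 12 with 3 hidden layers
# and an attenuation of 0.5 gives layer widths of (12, 6, 3).
model = MLPClassifier(
    hidden_layer_sizes=(12, 6, 3),
    activation="relu",
    solver="adam",
    beta_1=0.85,
    beta_2=0.945,
    learning_rate_init=0.12,
)
```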

The AutoML writes out the full expanded code of the solution, so it’s easy to copy what it creates and use it as the basis for refining your customisations. Essentially, Kortical’s patented way of using code to interact with AutoML fills in any blanks you choose not to specify with a close-to-optimal solution.

# Base configuration with overrides

The language also allows you to provide a base configuration that applies to a whole section (i.e. every child deeper in the tree), while also setting overrides for specific children. For example, below we are configuring a default handling strategy for missing numeric values for all numeric columns, while using a different strategy for a single column.

- ml_solution:
  ...
  - features:
    - numeric:
        # Change default strategy for all columns to fill_mean
        - preprocessing:
          - handle_nan: fill_mean 
        - Column1
        - Column2
        - Column3:
          # Override the strategy for Column3 to fill with the column's max value instead 
          - preprocessing:
            - handle_nan: fill_max
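
Conceptually this is just a child-wins merge: a column inherits the section's defaults, and any settings it declares itself take precedence. A minimal sketch of that behaviour in plain Python (illustrative only, not Kortical's implementation):

```python
# Minimal sketch of the "section default + per-column override" behaviour.
# Illustrative only -- not Kortical's implementation.
section_default = {"handle_nan": "fill_mean"}

column_settings = {
    "Column1": {},
    "Column2": {},
    "Column3": {"handle_nan": "fill_max"},  # column names taken from the example above
}

effective = {
    column: {**section_default, **overrides}  # the column's own settings win
    for column, overrides in column_settings.items()
}

print(effective["Column1"]["handle_nan"])  # fill_mean (inherited default)
print(effective["Column3"]["handle_nan"])  # fill_max (overridden)
```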

# Major sections

## Dataset

The data_set section of the code defines all top-level information about the dataset, the problem type it represents and dataset-related parameters for solving that problem. Below is an example dataset configuration from the full expanded code of a trained Titanic model.

  - data_set:
    - target_column: ['Survived']
    - problem_type: classification
    - evaluation_metric: area_under_roc_curve
    - exclude_test_data_from_final_model: False
    - fraction_of_data_set_to_use: 1.0
    - fraction_of_data_to_use_as_test_set: 0.2
    - fix_test_set_boundary_when_downsampling: False
    - cross_validation_folds: 3
    - select_features: 1.0
    - shuffle_data_set: True
    - data_set_random_seed: 1
    - modelling_random_seed: 1536400856

The most important parameters are typically the target column, the problem type and the evaluation metric. However, we also provide the specific random seeds used at the different stages of training, which would be required to reproduce the same solution, as well as other parameters controlling how much of the dataset to use and what proportions go into the training and test sets.
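
To make the split parameters concrete: with fraction_of_data_to_use_as_test_set: 0.2 and cross_validation_folds: 3, roughly 20% of the rows are held out as a test set and the remaining data is evaluated with 3-fold cross-validation. A rough scikit-learn analogy (illustrative only; the file path is an assumption and this is not Kortical's internal splitting logic):

```python
# Rough analogy for the dataset parameters above -- not Kortical's internals.
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

df = pd.read_csv("titanic.csv")  # hypothetical path to the training data

# fraction_of_data_to_use_as_test_set: 0.2, shuffle_data_set: True,
# data_set_random_seed: 1
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=1)

# cross_validation_folds: 3 -- candidate solutions are scored on the training portion
cv = KFold(n_splits=3, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(cv.split(train_df)):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```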

Note

Whilst in this example the original dataset only contained 11 columns, once all the preprocessing and feature creation steps have been run we actually have 1207 columns that are used in this solution.

## Features

The features section of the code defines the columns to use in training and their type (numeric, categorical, text or date). Within each column, it's also possible to define specific parameters for preprocessing and feature creation. The expanded example below shows how each feature was configured for training this particular ML solution.

- features:
    - numeric:
      - Age:
        - preprocessing:
          - handle_nan: fill_median
          - handle_inf: clamp_to_range
          - robust_scale:
            - with_centering: True
            - with_scaling: True
            - quantile_range: [25.0, 75.0]
      - Fare:
        - preprocessing:
          - handle_nan: fill_mean
          - handle_inf: clamp_to_range
          - remove_outliers:
            - sigma_threshold: 3
            - on_removal: clamp_to_range 
          - standard_scale:
            - with_mean: True
            - with_std: True
      - Pclass:
        - preprocessing:
          - handle_nan: fill_zero
          - handle_inf: clamp_to_range
          - robust_scale:
            - with_centering: True
            - with_scaling: True
            - quantile_range: [25.0, 75.0]
    - categorical:
      - Sex:
        - create_features:
          - one_hot_encode
      - Embarked:
        - create_features:
          - one_hot_encode
    - text:
      - Name:
        - preprocessing:
          - lower_case: True
          - strip_special_characters:
            - enabled: False
        - create_features:
          - glove:
            - sentence_representation:
              - name: mean_of_words
            - size: 255
            - window: 4
            - min_count: 5
            - batch_size: 304
            - l_rate: 0.0552
      - Ticket:
        - preprocessing:
          - lower_case: True
          - strip_special_characters:
            - enabled: True
            - replace_with_space: |\t\r\n/\\*(){}[]"#
            - remove: -'`,±§:;
            - wrap_in_spaces: @&£$€?!^~<>.
        - create_features:
          - word2vec:
            - sentence_representation:
              - name: mean_of_words
            - size: 174
            - window: 7
            - min_count: 16

In this example, we can see different preprocessing configurations for different numeric and text columns, as well as different feature creation choices for the text columns (Name is using a GloVe embedding whereas Ticket is using Word2Vec).
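
The mean_of_words sentence representation used by both embeddings simply averages the vectors of the words in a value to produce one fixed-length feature vector per row. A rough sketch of that idea with gensim (illustrative only; the file path and whitespace tokenisation are assumptions, and Kortical's implementation may differ):

```python
# Rough sketch of a "mean of word vectors" sentence representation.
# Illustrative only -- the data loading and tokenisation are assumptions.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

tickets = pd.read_csv("titanic.csv")["Ticket"].astype(str).str.lower()  # hypothetical path
tokenised = [ticket.split() for ticket in tickets]

# Mirrors the word2vec settings above: size 174, window 7, min_count 16
# (gensim 4.x calls the embedding size "vector_size").
w2v = Word2Vec(tokenised, vector_size=174, window=7, min_count=16)

def mean_of_words(tokens, model):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

ticket_features = np.vstack([mean_of_words(tokens, w2v) for tokens in tokenised])
print(ticket_features.shape)  # (number_of_rows, 174)
```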

TIP

By looking at your best model candidates on the leaderboard, you can spot patterns in which preprocessing and feature creation options work best for your problem. When training new model versions, you might then restrict the search to the options you already know perform well.

## Models

The models section of the code defines both the type and parameters of the models trained as part of the AutoML solution. The default models configuration is shown below - it directs the AutoML to try models of all of these types and search for the best model parameter choices for each one.

  - models:
    - one_of:
      - xgboost
      - linear
      - random_forest
      - extra_trees
      - decision_tree
      - deep_neural_network
      - lightgbm

From a fully expanded code sample, we can see the specific parameters chosen for a trained model:

  - models:
    - deep_neural_network:
      - hidden_layer_width: 12
      - number_of_hidden_layers: 3
      - hidden_layer_width_attenuation: 0.5
      - activation: identity
      - alpha: 0.093107
      - max_iter: 383
      - solver: adam
      - beta_1: 0.99
      - beta_2: 0.9450000000000001
      - epsilon: 1e-08
      - batch_size: auto
      - shuffle: False
      - learning_rate_init: 0.0651
      - early_stopping: True
      - tol: 0.0001
      - validation_fraction: 0.1
      - verbose: True
      - warm_start: True

Here we can see:

- We used a deep neural network with three hidden layers containing 12, 6 and 3 nodes respectively (see the width calculation below)
- We used the identity activation function
- We used the adam solver
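
The hidden layer widths follow directly from the attenuation parameter: each layer is the previous layer's width multiplied by hidden_layer_width_attenuation. A quick check of that arithmetic (assuming the widths are simply scaled and truncated to integers, which matches the 12, 6 and 3 above):

```python
# Hidden layer widths implied by hidden_layer_width: 12,
# number_of_hidden_layers: 3 and hidden_layer_width_attenuation: 0.5.
# Assumes each layer is the previous width scaled by the attenuation factor.
def hidden_layer_widths(width, n_layers, attenuation):
    return [int(width * attenuation ** i) for i in range(n_layers)]

print(hidden_layer_widths(12, 3, 0.5))  # [12, 6, 3]
```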