# Add Data

The first step in using Kortical is to add a suitable dataset for analysis. Currently, data must be uploaded to the platform in CSV format, but in the future we intend to support other data sources, such as databases.

# Upload dataset

# Data page

The first page you see after logging in is the data page:

It is also available on the side menu.

# First upload

After clicking on Add Dataset, select the CSV file you want to upload (from disk or another data source):

The file will be uploaded to the platform. You can pause and resume uploads, and an upload also recovers automatically if an intermittent internet outage occurs while it is in progress. If this is not your first upload, you will need to upload additional datasets using the Data Selector.

# Dataset processing

Once the dataset has been uploaded to the platform, we do some initial analysis to extract basic metadata about each column.

# Preparing a dataset for Kortical

In order to train a machine learning model, we need a file to have certain properties and to essentially look like a table. Each row in the file should be a single observation (measurement) of an event and should contain a set of features and a target.

We expect the first row to be a set of column headers that contain the feature names, so ensure no blank rows are present at the top of your file. The file must also be in CSV format (we aim to support more formats very soon).
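The requirements above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; `check_csv_text` is a hypothetical helper written for illustration, not part of Kortical, and the "every row has as many cells as the header" check is our reading of "look like a table" rather than a documented platform rule.

```python
import csv
import io


def check_csv_text(text: str) -> list[str]:
    """Return a list of problems found in CSV text (empty list if none)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    # A blank line parses as [], and a line of only commas as ['', '', ...].
    if not any(cell.strip() for cell in header):
        return ["first row is blank; column headers must be on the first row"]

    problems = []
    # Assumption: a table-like file has one cell per header in every row.
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip completely empty lines below the header
        if len(row) != len(header):
            problems.append(
                f"row {line_number} has {len(row)} cells, expected {len(header)}"
            )
    return problems


# Illustrative values only, using the column names from the example in this section.
sample = "Height,Eye Colour,Weight\n180,Blue,75\n165,Brown,60\n"
print(check_csv_text(sample))
```

To check a file on disk, read it with `open(path, newline="", encoding="utf-8").read()` and pass the result to the helper.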

As an example of a simple case, consider the following dataset:

Let’s say that here we are trying to predict Weight from Height and Eye Colour. Our target would then be Weight and our input features are Height and Eye Colour.

Not every feature will necessarily be relevant for predicting the target, so Kortical also offers feature selection in the form of MPLE, which can help tell you which features you can drop (maybe Eye Colour isn’t that useful here!).

This general pattern extends to pretty much every machine learning problem. An input feature can be a number, such as Height, or a category, such as Eye Colour, but it could equally be free text, such as the contents of an email.

The more features (columns) you add, the more observations (rows) you need for machine learning to work properly. There is no fixed ratio for this as it is highly problem-dependent, but a good rule of thumb is to aim for at least 10 times as many rows as columns.

If you are solving a multi-class or multi-label problem, you should probably multiply this again by the number of classes. So if you had 10 columns and were predicting 10 different classes, you would want at least 1,000 observations (10 x 10 x 10) if observations are uniformly distributed between classes, and more if they are imbalanced.
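The rule of thumb above is simple arithmetic, sketched below. The function name and parameters are hypothetical, and the result is only the rough lower bound described in the text, not a guarantee that a model will train well.

```python
def suggested_min_rows(n_columns: int, n_classes: int = 1, rows_per_column: int = 10) -> int:
    """Rough lower bound on observations: 10 rows per column, times the number of classes."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Treat anything below one class (e.g. a non-classification target) as a factor of 1.
    return rows_per_column * n_columns * max(n_classes, 1)


print(suggested_min_rows(10))                # 100
print(suggested_min_rows(10, n_classes=10))  # 1000
```

Imbalanced classes need more rows than this estimate, because the rarest class still has to be represented by enough observations.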

# Data view

When the dataset analysis is complete, you will see a sample of the data in tabular form and you'll be asked to select one or more target columns to be predicted.

# Select target columns

# Single target

To find the column that you would like to use as the target, you can either scroll across until you locate it or you can use the column filter box. In this case, we are looking for the SalePrice column:

Clicking on the target column opens a detailed view of the column metadata, which you can use to confirm that the target looks as you would expect. Clicking Set as target will add the column to the target list (as shown in the smaller modal at the top).

In this example, we are only selecting a single target. Therefore, you can now click Done on the target list modal. This will automatically take you to the Explore tab and begin the ML Data Prep process.

# Multiple targets

If you want to analyse a multi-label classification problem, you may select more than one target column. At present, only binary columns may be selected for multi-label use.

To achieve this, instead of clicking Done as in the single target case, close the detailed column view for your first target and click on the next desired target column. For example, in the Titanic dataset you could add both Survived and Sex as targets. After selecting Survived as the first target, opening the detailed view for the Sex column would show:

Because one target is already selected, you can now access additional charts to compare this column to the current target (see vs Target and vs Target (proportional) links on the bottom right).

Click Set as target for the second column, and then continue the above process until all desired targets have been chosen. When you're ready to explore the dataset with these targets, click Done on the target list modal.
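Because only binary columns may be selected for multi-label use, it can be handy to list the candidate columns before uploading. The sketch below is an illustration only: `binary_columns` is a hypothetical helper, and it assumes "binary" means exactly two distinct non-empty values, which may not match how the platform itself classifies columns.

```python
import csv
import io


def binary_columns(text: str) -> list[str]:
    """Return the names of columns that hold exactly two distinct non-empty values."""
    reader = csv.DictReader(io.StringIO(text))
    distinct: dict[str, set[str]] = {name: set() for name in reader.fieldnames or []}
    for row in reader:
        for name in distinct:
            value = (row.get(name) or "").strip()
            if value:  # ignore missing or empty cells
                distinct[name].add(value)
    return [name for name, values in distinct.items() if len(values) == 2]


# Made-up rows that reuse the Titanic column names mentioned above.
sample = "Survived,Sex,Age\n1,female,29\n0,male,40\n1,male,35\n"
print(binary_columns(sample))  # ['Survived', 'Sex']
```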

# Analysing other columns

Aside from selecting target columns, you can also click on any column in the data view to see its detailed metadata and understand its basic relationship to the current target. Further information about each column is available once the ML Data Prep process has run.

# Data and Explore tabs

Once your first dataset has been uploaded, the Data and Explore tabs will appear on the data page. You can use these at any time to switch between the data view and the exploration report for the selected dataset. However, you must have a target selected to use the Explore tab.

# Data selector

After you have uploaded and explored your dataset, you will likely want to upload further datasets and switch between them. To do this, click the Data Selector button in the top right corner of the page (regardless of whether you're on the Data or Explore tab). This will open the Data Selector:

From here, you can:

  • Add more datasets using the Add Dataset button
  • Search through existing datasets with the search box
  • Download any dataset in the list using the download action on the right-hand side of each row
  • Select a different dataset to analyse by clicking on the relevant row