On the surface it might seem like rank hypocrisy that the majority of data-scientists reject attempts to bring automation into their roles. Datarobot, Dataiku and others have all faced this challenge, and in response they have focused, quite successfully, on the non data-scientist market. There’s boundless enthusiasm from those without the ability to deliver machine learning (ML) themselves for an AutoML platform that gives them access to ML capability they wouldn’t otherwise have. So why are data-scientists, the very implementers of automation and machine learning, so averse to these tools? And is this just emblematic of a mass wave of rejection of Artificial Intelligence (AI) and automation that will play out across all industries and deflate the AI hype?
Looking at the landscape of AI in general, our clients across a broad range of industries have managed to build automation with superhuman performance. This shows that the Return On Investment (ROI) is there, and if the ROI is there you can bet AI is here to stay. So what’s different about automating data-science?
Data-science is highly skilled: part science, in understanding how machine learning works, and part art, in applying data to a domain problem. There are some problems where you can point AutoML at raw data, select a column to predict and presto, you have an ML solution, but this is the exception, not the rule. For any non-trivial problem a data-scientist needs to work with the automation solution, and this is where most of these tools fall down. The main criticisms of AutoML solutions are:
1. Control - can't alter generated solutions
2. It doesn’t do enough - most of the work is elsewhere
3. Quality of results - users don't want to be held back
4. Iteration is slow - rapid iteration is key
5. Collaboration / Reuse / Repeatability - teams build solutions
6. Black-box - limited visibility into how or what has been produced
Let’s go through these in more depth.
Control is invariably the major criticism. A cookie-cutter solution is great as long as you want star-shaped cookies; as soon as the data-scientist wants to get in and tweak the solution, they have to start from scratch outside the platform. Most platforms have tried to counter this by providing lots of advanced User Interface (UI) controls to adapt the solution, but being UI based it’s pretty clunky and quite far from how data-scientists usually work.
Kortical has plenty of talented data-scientists who chafed against this constraint, and we have addressed it with a high level language for data-science that can concisely specify any machine learning solution down to the lowest level of detail. This gives data-scientists full control over the solution, but data-scientists writing code isn’t AutoML. Where AutoML comes into play is in letting the data-scientist write as little or as much of the code as they like, even starting from a blank page and using AutoML to fill in the blanks to create the best solution possible given the constraints. In this way the data-scientist has the best of both worlds: full control, full automation, or anything in between.
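Kortical’s language itself isn’t shown here, but the "fill in the blanks" idea can be sketched in plain scikit-learn: the data-scientist pins down the parts they care about and leaves the rest to an automated search. This is an analogy only, not Kortical’s actual syntax.

```python
# Illustrative sketch only: plain scikit-learn, not Kortical's modelling
# language. The data-scientist fixes the preprocessing and model family;
# the "blanks" are filled in by an automated hyperparameter search.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fixed by the data-scientist: scaling and the model family.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Left blank, to be filled in automatically.
blanks = {
    "model__n_estimators": [50, 100, 200],
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(pipeline, blanks, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

The point of the analogy is the division of labour: anything written explicitly stays under the data-scientist’s control, while everything left unspecified is searched automatically.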
It doesn’t do enough
Some platforms, like Azure, really only select the model and the hyperparameters. This means the data-scientist still has to do the bulk of the work in the data preparation and cleaning stage, which is often where the majority of the time is spent. Others, like Datarobot, have created templates that include much of the standard cleaning and encoding needed to get to a good ML model quickly, but each template includes a fixed set of components and is inflexible if it’s not exactly right. Kortical’s high level language covers the entire ML model solution end to end: data cleaning approaches, preprocessing, feature creation, feature selection, ML model building, model tuning and testing. Because it’s a language, all of these components can be mixed and matched to build bespoke ML solutions for any dataset from scratch, using AutoML, very quickly. And because it’s code, if the data-scientist doesn’t like anything about the automated model solution, they can change and adapt it easily.
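As a rough sketch of what "end to end" means here, the stages listed above can be composed into a single pipeline. Again this is plain scikit-learn on a toy dataset, standing in for what an ML-solution language has to cover; the column names are made up for illustration.

```python
# Sketch of the end-to-end stages: cleaning, preprocessing, feature
# selection, model building and testing, composed as one pipeline.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with the usual mess: a missing value and a categorical column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 47, 51, 38, 29, 44],
    "income": [30e3, 45e3, 52e3, 80e3, 75e3, 60e3, 33e3, 70e3],
    "region": ["n", "s", "s", "n", "e", "e", "s", "n"],
    "churned": [0, 0, 1, 1, 1, 0, 0, 1],
})
X, y = df.drop(columns="churned"), df["churned"]

# Cleaning and preprocessing, per column type.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

solution = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),            # feature selection
    ("model", RandomForestClassifier(random_state=0)),  # model building
])

# Hold out data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)
solution.fit(X_train, y_train)
accuracy = solution.score(X_test, y_test)
print(accuracy)
```

Every stage here is an explicit, swappable component, which is the property the article is arguing for: nothing is locked inside a fixed template.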
Quality of results
Platforms such as Kortical, H2O and Datarobot can get high scores on certain data-science competitions and datasets on Kaggle, automatically beating thousands of data-scientists. These competitions, however, come from that limited set where the problem fits the platform perfectly and requires no additional domain knowledge. The wider set of competitions requires the flexibility for the data-scientist to iterate rapidly and easily. Platforms such as Datarobot and H2O have missions to democratize data-science for everyone, and that mission can drive decisions that trade the quality of results for ease of use. Kortical’s mission is to create the best tool for professional data-scientists, which means never compromising on the data-scientist’s ability to get the best results: letting them leverage cloud scale distributed AI to rip through the solution space at a speed orders of magnitude beyond human capability, while giving them the control to get the solution and results they want.
Iteration is slow
Other platforms require you to re-upload the dataset any time you want to make a minor change, like removing a feature that leaks the class variable. Each time you modify the data, training starts from scratch, which leads to long iteration cycles, especially when the data-scientist is applying their knowledge to the problem by making minor feature changes to improve model performance. Using the Kortical language, data-scientists can limit the solution space to target specific models, techniques or parameters. This allows for much tighter, more targeted iteration and much more rapid progress, even as they add to and change the dataset.
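The two moves described above, dropping a leaky feature and narrowing the search space, are each a one-line change when the solution lives in code. A hedged scikit-learn sketch (the leak is planted deliberately for illustration):

```python
# Sketch: targeted iteration in code, with no dataset re-upload.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X = np.column_stack([X, y])  # column 6 deliberately leaks the class variable

# Iteration 1 of the fix: drop the leaky feature in code.
X_clean = np.delete(X, 6, axis=1)

# Iteration 2: narrow the solution space to one model family and a few
# values, rather than re-running a broad search from scratch.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_clean, y)
print(search.best_params_)
```

Because both changes are edits to a script, each retraining run only explores the small space the data-scientist has chosen, which is what keeps the iteration loop tight.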
Collaboration / Reuse / Repeatability
For platforms with complex UIs, telling a colleague how to recreate your solution, or even recreating it yourself, can be difficult: imagine listing every control, every value, and instructions for how to find them. By using a high level data-science language, Kortical avoids these pitfalls, making it easy to send a solution to your colleagues, merge the efforts of two people working on the same problem, and gain all the other advantages of working in code. With multiple team members iterating on a complex problem, strong collaboration is a must. Being able to dust off an old solution from source control and get back exactly to where you were is also very important, and problematic without a code based solution.
Black-box
Even if a platform lets you figure out exactly what it’s doing at every stage, you have to navigate myriad UI screens, check every setting and hold it all in your head, which can make the output solution seem very opaque. In a simple, concise high level language that uses AutoML to fill in the blanks, with every detail and parameter present, it’s easy to see the full solution and interact with it.
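The contrast with UI screens can be made concrete: when a solution is code, its full specification, including every default, is inspectable in one call. A small scikit-learn analogy:

```python
# Sketch: a code-based solution exposes every parameter at once,
# rather than spreading settings across UI screens.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

solution = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression(C=0.5))])

# The complete specification, down to every default, in one place.
for name, value in sorted(solution.get_params().items()):
    print(f"{name} = {value}")
```

Nothing is hidden: both the values the author set explicitly (like `model__C`) and every untouched default appear side by side.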
In our experience, the crux of the issue data-scientists have with AutoML and automation tools is not the automation itself, but that they’re being asked to work with tools that limit their ability to apply their knowledge, get the best results or work in an effective way, muting much, if not all, of the benefit of automation for them.
We think a simple yet powerful high level language for data-science, with AutoML that writes code, avoids a lot of these criticisms. Kortical is by data-scientists, for data-scientists, and our goal is to create a tool that data-science professionals can love. So while others focus on making a tool for everyone, we’re focused on being the best AI as a service platform for data-scientists.
Want to try it out?
Join our hackathon!
Or get in touch below, we'd love to hear from you.