The Myth of the "Citizen Data-Scientist"

What it really takes to deliver an AI / ML solution

Andy Gray

Co-Founder & CEO

There still seems to be a lot of mysticism about what actually goes into creating a world-class Artificial Intelligence (AI) / Machine Learning (ML) model, so I'm going to elaborate on that here. Once we've done that, the idea of the citizen data-scientist should sound about as plausible as the tooth fairy.

There is a lot of process involved in getting from raw data to production-ready AI, but for the purpose of this article we're covering just the model-building part of the process. This is what we call the Proof Of Value (POV). The basic premise is that we want to understand, as fast as possible, the expected return if we were to go ahead with the project. Typically this process takes 4 weeks to build and validate the model. The model is expected to be very high performing, tested robustly on historic data and ready to go into production.

At a high level there are a few key stages to the POV: figuring out what to build, sourcing the data, understanding the data and the problem, building the first model, validating and iterating on the model, and presenting the results. The main pitfalls one encounters when running an ML POV are: not solving a high-value problem, not having the data ready in time, miscommunication on the brief, backtracking on great results due to errors, and clients derailing the project when, having seen data insights they've never had before, they request further analysis that stalls progress towards the ultimate objective. We're not going to focus on how the process mitigates each of these issues, but they provide some context for why we run things the way we do.

Before the data-science even begins there is usually a strategic process to determine which data-science problems to go after and in what order. Our approach involves mapping the business and its biggest revenue and cost drivers, and identifying how AI can be applied to help. We then plot the business value of each solution against its complexity to create a roadmap that delivers the most Return On Investment (ROI) as fast as possible. This takes half a day, and you want in the room: key business stakeholders who can map and explain the business; data stakeholders who know what data is available and what state it is in; AI specialists who can understand the business, suggest where and how AI can be applied given the available data, and score the complexity of the various solutions; and finally the stakeholder with sign-off on the roadmap. A citizen data-scientist would probably struggle to know how data can be used to solve real business problems, and what would be likely to work.
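To make the prioritisation concrete, here's a minimal sketch of turning value-versus-complexity scores into an ordered roadmap. The candidate use-cases, the scores and the value-per-unit-of-complexity heuristic are all invented for illustration:

```python
# Illustrative only: rank hypothetical AI use-cases so that quick wins
# (high value, low complexity) surface at the top of the roadmap.
candidates = [
    {"name": "Churn prediction", "value": 9, "complexity": 3},
    {"name": "Demand forecasting", "value": 8, "complexity": 5},
    {"name": "Document classification", "value": 5, "complexity": 2},
    {"name": "Dynamic pricing", "value": 10, "complexity": 9},
]

# Simple heuristic: highest business value per unit of complexity first.
roadmap = sorted(candidates, key=lambda c: c["value"] / c["complexity"], reverse=True)

for rank, c in enumerate(roadmap, start=1):
    print(f"{rank}. {c['name']} (value={c['value']}, complexity={c['complexity']})")
```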

Sourcing data is often trickier than people expect, even in organisations that already expect it to be difficult. Ensuring that the data is in place before committing other resources is often a good cost saver. It is usually the role of the data-scientist, working with domain experts, to specify what data should be used to solve a given problem.

Once the data is ready it's time to get stuck in. We advise clients to hold back a portion of the data so they can independently validate the results of the machine learning model at the end of the POV. For a forecasting problem we send through predictions for data we haven't seen; for a decision-making problem, decisions on data where we haven't been given the answers. This helps validate and build confidence in the ML outputs. Once the data-scientists receive the portion of the data to train the model on, they start exploring it, plotting and graphing it in different ways to understand the "shape" of the data and gathering stats on all the various fields. This will often throw up questions about how the data maps to the domain, what things mean, etc. We schedule half-hour calls twice weekly: this leaves enough time between calls for work to get done and something new to discuss, while being regular enough to draw on the domain expertise as needed. Once this is complete the data-scientists will have a good idea of how complete the dataset is and how it maps to the domain. Where the data is not complete, they'll advise on what data is needed to plug the gaps.
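As a minimal sketch of these first steps, assuming a hypothetical CSV dataset, the holdout split and initial exploration might look something like this:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full dataset (the file name is a placeholder).
df = pd.read_csv("client_data.csv")

# Hold back 20% for the client's independent validation at the end of the POV;
# the data-scientists only ever see the training portion.
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
holdout_df.to_csv("holdout_for_client.csv", index=False)

# First pass at understanding the "shape" of the data.
print(train_df.describe(include="all"))  # per-field summary stats
print(train_df.isna().mean())            # fraction of missing values per field
train_df.hist(figsize=(12, 8))           # distributions of the numeric fields
plt.show()
```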

As soon as the data-scientists have a base dataset and understanding to work from, we try to go end to end: uploading the data to Kortical, selecting which features to use and what evaluation metric makes sense for the business case, building the model, and seeing how it performs to guide the client conversations. This takes minutes in Kortical, whereas most citizen data-scientists would have just uploaded the raw data, used the default options and not been able to fully interpret the output.
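Kortical handles this pass in a few clicks rather than code, but purely as a rough open-source analogue, a first end-to-end pass might look like the following in scikit-learn, assuming the train_df from the earlier sketch and a hypothetical binary "churned" target:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Keep the first pass simple: numeric features only, target excluded.
features = [c for c in train_df.columns if c != "churned"]
X = train_df[features].select_dtypes("number")
y = train_df["churned"]

# The evaluation metric should match the business case; for an imbalanced
# churn problem, ROC AUC is a more honest yardstick than raw accuracy.
model = HistGradientBoostingClassifier(random_state=42)  # handles missing values natively
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
print(f"Baseline ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```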

While building the whole business case, including the most basic end-to-end machine learning model, the data-scientist documents all assumptions and gaps in the data, and works with the client to reach a shared understanding that the data-science is representative of the real-world problem. Doing this early in the process uncovers missing data and required business decisions with enough time left to act, which at this point is more valuable than the data-scientist spending a long time building a highly accurate model that is carefully tuned to do the wrong thing. Once the first end-to-end ML solution is built and all assumptions verified, the data-scientists have to do a meticulous internal review, as one misplaced multiplication could make the numbers very wrong. Using a lot of tried-and-tested tooling drastically reduces the scope for errors, but this is still one of the big differences between data-science and normal code development: if code is wrong it usually doesn't work, whereas data-science errors are generally more insidious.
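One cheap defence against the misplaced multiplication is pinning key business calculations down with hand-computed checks. A hypothetical example:

```python
import math

def expected_saving(n_cases: int, automation_rate: float, cost_per_case: float) -> float:
    """Projected saving from automating a fraction of manual cases."""
    return n_cases * automation_rate * cost_per_case

# Hand-computed check: 10,000 cases, 60% automated at 5.00 per case = 30,000.
assert math.isclose(expected_saving(10_000, 0.6, 5.0), 30_000.0)
```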

Once the assumptions and key metrics have all been validated, which typically takes about 2 weeks but can vary by a week either way, the data-scientist will typically be in one of two scenarios: either the model seems almost too good to be true, or the results are a bit underwhelming, and this informs the next steps. In the case where the results seem almost too good to be true, the emphasis is heavily on trying to validate the results with the client, often through domain-expert review of the outputs and of the model's decision-making drivers, but also sometimes by sourcing fresh, as-yet-unseen data to corroborate the findings. In the case where the results are underwhelming, it's less likely that the results are wrong and it's time to start iterating on the model.
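Continuing the earlier scikit-learn sketch, corroborating suspiciously strong results on genuinely fresh data might look like this; the file name and column names are assumptions:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Fit on all of the training data, then score on data that arrived after the
# training set was cut, so there is no way it leaked into the model.
model.fit(X, y)
fresh = pd.read_csv("fresh_data.csv")
fresh_preds = model.predict_proba(fresh[X.columns])[:, 1]
print("ROC AUC on fresh, unseen data:", roc_auc_score(fresh["churned"], fresh_preds))
```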

To do this the data-scientist will often look at the overall feature importances to find the main drivers, and at row-by-row explanations to understand the segments. This is built into the platform and takes just a few clicks. Once they can see the insight they will probably exclude the lowest-performing features from the model, but the real brainpower is applied to the highest-performing features. If we see that postcode is highly predictive, do we need to add geo-demographic data for the area? If time of year matters, is that a poor proxy for weather data, and should they add that? There are also feature-transformation approaches: salaries, for example, might work better using the log of the value, or after outlier removal. More complex features, like a vector space of relative house prices in an area, can often bring huge value but change the shape of the data substantially, and as such a different model type, say deep neural networks, might work better where previously xgboost performed best. Using AutoML the data-scientist is focused mainly on intelligently mapping the domain data to ML, and switching algorithms is seamless, so they can rapidly iterate to better models.
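Kortical surfaces these importances in a few clicks; outside the platform, a rough equivalent of that investigation, plus the log transform mentioned above, might look like this (continuing the earlier sketch, with a hypothetical "salary" column):

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Rank the main drivers of the fitted model's predictions.
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")

# Example transformation: heavy-tailed fields such as salary often behave
# better on a log scale (log1p handles zero values safely).
train_df["log_salary"] = np.log1p(train_df["salary"])
```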

This process of rapidly iterating on features can go on indefinitely, but with a good platform, after a week or two the value added per iteration tends towards fractions of a percent. By this stage hopefully we're around 4 weeks in and we've got a world-class, superhuman-performing ML model that's fully validated to solve the business case. The data-scientist will run the predictions for the held-back data and the client can review and verify the model performance.
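Closing the loop on the earlier holdout split, producing that deliverable might look like this:

```python
import pandas as pd

# Predictions for the held-back rows, which the client scores against the
# answers only they hold.
holdout = pd.read_csv("holdout_for_client.csv")
holdout["prediction"] = model.predict_proba(holdout[X.columns])[:, 1]
holdout.to_csv("predictions_for_client.csv", index=False)
```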

At this point, if the model was built in a platform that can meet production SLAs, it is ready to use in anger to solve business problems and generate ROI. Often, though, historic data validation doesn't hold enough weight with senior leaders, so there will be a period of shadowing to show that the model really performs as expected on live data before it is allowed to go prime time.
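A minimal sketch of the shadow-period comparison, assuming live predictions and their eventual real outcomes are logged to a file with "prediction" and "outcome" columns:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Compare what the model said against what actually happened during shadowing.
shadow_log = pd.read_csv("shadow_log.csv")
print("Shadow-period ROC AUC:",
      roc_auc_score(shadow_log["outcome"], shadow_log["prediction"]))
```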

If you've got this far, hopefully you can see that data-science is a real expertise with quite a human element, and that using an AI-as-a-service / AutoML platform like Kortical can massively accelerate delivery and let the data-scientist spend more time on the complex, interesting work that taxes them mentally and generates value. Even something as simple as selecting the right evaluation metric for a model is something a non data-scientist is likely to get wrong.
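To see why, consider a toy example: on a dataset that is 99% one class, a "model" that never predicts the minority class scores 99% accuracy while catching nothing, which a metric like F1 exposes immediately:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 99 negatives and 1 positive; the "model" always predicts the negative class.
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))       # 0.99, looks great
print("F1:", f1_score(y_true, y_pred, zero_division=0))  # 0.0, tells the truth
```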

I love the image one of our competitors paints of mom-and-pop cucumber farmers in Japan using AI to automatically sort their produce, but they gloss over the fact that the son who set it all up went to MIT. Data-science tools are not a replacement for talent; the best tools magnify and accelerate that talent. It's like an F1 car: if you don't know how to start an F1 car, you're still not going anywhere. Sure, some people will figure it out, do OK and perpetuate the myth, but if you want to deliver real, meaningful value from AI and ML quickly, forget the citizen data-scientist and get some real talent onboard.
