Oliver Bisseker, Author

Hello everyone, I’m an ML and AI Infrastructure specialist at KPMG UK. I’ve designed, and helped deliver, many solutions, from small-scale proofs of concept (POCs) to large multi-million-pound projects. This article is the first in an ongoing “bite-size” series answering your questions on Machine Learning Cloud Infrastructure. Today’s question: what is our role, as Cloud Engineers, when working on Machine Learning (ML) projects?

ML projects follow the same lifecycle as almost any other project you’ll work on: design, implement, support. There are, of course, many sub-phases (pre-sales, for example) that we can also help with.

The most important question to ask yourself during each phase is: how does this help my data science engineers (DSEs)? They are, after all, the primary customers of our ML platforms. There may well be a business client who wants fancy dashboards full of enriched data, but it’s your DSEs who will labour over the client’s data and ML models to deliver that business value.

Design

We play a critical role in the design phase of any system, and never more so than in ML projects. These projects can be as simple or as complex as you make them; with the right technologies, you can build simple, elegant solutions. Reference architectures help here, and Microsoft publishes some fantastic ones for Azure on its reference architecture site.

From our point of view, ML is an Extract, Transform, Load (ETL) pipeline with extra emphasis on the Transform, so let’s break it down that way. First, our DSEs need data to work with; next, they need an area in which to experiment with and transform that data; finally, they need somewhere to put the enriched data. Keep in mind that these steps will eventually be automated, with minimal DSE input.
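To make the ETL framing concrete, here is a minimal Python sketch of the three stages as plain functions. Everything in it is illustrative: the function names (`extract`, `transform`, `load`, `run_pipeline`), the in-memory list standing in for a reporting database, and the two example steps (`to_thousands`, plus `toy_model` as a stand-in for a real ML model) are hypothetical and not part of any Azure service.

```python
from typing import Callable, Iterable

# Hypothetical record shape: one row of client data as column -> value.
Record = dict[str, float]
Step = Callable[[Record], Record]


def extract(source: Iterable[Record]) -> list[Record]:
    """Stage 1: copy the raw data somewhere the DSEs can work with it."""
    return [dict(record) for record in source]


def transform(records: list[Record], steps: list[Step]) -> list[Record]:
    """Stage 2: the 'extra Transform' (cleaning, feature engineering, scoring)."""
    for step in steps:
        records = [step(record) for record in records]
    return records


def load(records: list[Record], sink: list[Record]) -> int:
    """Stage 3: write the enriched records out; a list stands in for a reporting DB."""
    sink.extend(records)
    return len(records)


def run_pipeline(source: Iterable[Record], steps: list[Step], sink: list[Record]) -> int:
    """Chain the three stages; returns the number of records loaded."""
    return load(transform(extract(source), steps), sink)


# Hypothetical transform steps: a cleaning step and a toy "model" adding a score.
def to_thousands(record: Record) -> Record:
    return {**record, "revenue": record["revenue"] / 1000}


def toy_model(record: Record) -> Record:
    return {**record, "score": 0.5 * record["revenue"]}


if __name__ == "__main__":
    client_rows = [{"revenue": 2000.0}, {"revenue": 4000.0}]
    reporting_db: list[Record] = []
    loaded = run_pipeline(client_rows, [to_thousands, toy_model], reporting_db)
    print(loaded, reporting_db)  # 2 [{'revenue': 2.0, 'score': 1.0}, {'revenue': 4.0, 'score': 2.0}]
```

Keeping each stage behind its own function mirrors the point about automation: once the DSEs are happy with their transform steps, the whole chain can be scheduled and run without them.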

Implement

Next, we implement the solution, using whichever tools you prefer. I like Terraform, with either GitHub Actions or Azure DevOps. To save on cloud costs, I highly recommend implementing the solution in phases:

Phase 1: get the data and store it somewhere the DSEs can access it. Ideally, automate this stage; for example, pull data from the client database nightly.

Phase 2: create a development environment for the DSEs. Once they are ready to start outputting data for reporting, move on to the next phase.

Phase 3: create an area for the enriched data to live (most likely a reporting database).
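The automated first phase (a nightly pull from a client database) could look something like the sketch below. It assumes Python’s built-in `sqlite3` stands in for the client database and a local directory stands in for the storage account or data lake your DSEs read from. The function name `nightly_extract` and the date-partitioned CSV layout are hypothetical choices; in practice you would trigger something like this from a scheduled GitHub Actions workflow or Azure DevOps pipeline, or replace it with a managed ingestion service.

```python
import csv
import sqlite3
import tempfile
from datetime import date
from pathlib import Path


def nightly_extract(source: sqlite3.Connection, table: str, landing_dir: Path, run_date: date) -> Path:
    """Copy one table from the client database into a date-partitioned CSV file.

    The table name is interpolated into the SQL, so it must come from a trusted,
    hard-coded list of tables, never from user input.
    """
    cursor = source.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]

    # One file per table per night, e.g. <landing_dir>/sales/2024-01-31.csv
    target = landing_dir / table / f"{run_date.isoformat()}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)  # header row
        writer.writerows(cursor)  # stream rows straight from the cursor
    return target


if __name__ == "__main__":
    # An in-memory SQLite database stands in for the client's database.
    client_db = sqlite3.connect(":memory:")
    client_db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    client_db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

    with tempfile.TemporaryDirectory() as landing:
        path = nightly_extract(client_db, "sales", Path(landing), date.today())
        print(path.read_text())
```

Writing each run to its own dated file keeps earlier extracts intact, so a failed night can be re-run without overwriting data the DSEs are already working with.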

Support

Yes, even ML projects require support. Depending on your scenario, you may need to allowlist DSE IP addresses as they change, which, with so many people working from home, happens more often than ever. Or you may need to increase the compute size of the DSEs’ machines when they need more power. And if you’ve “brought your own compute”, you’ll need to maintain those machines yourself.
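For the IP allowlisting chore, a small helper can work out exactly which rules to add and which to remove. This is a hypothetical sketch using Python’s standard `ipaddress` module; it does not call any Azure API, and the assumption is that you would feed its output into your Terraform variables or firewall rules. The example addresses come from the reserved documentation ranges.

```python
import ipaddress


def plan_allowlist_update(current: set[str], requested: set[str]) -> tuple[list[str], list[str]]:
    """Return (to_add, to_remove) so the allowlist matches the DSEs' current addresses.

    Entries are validated and normalised to CIDR form, so "203.0.113.7" and
    "203.0.113.7/32" count as the same rule. Invalid entries raise ValueError.
    """

    def normalise(entries: set[str]) -> set[str]:
        return {str(ipaddress.ip_network(entry)) for entry in entries}

    current_rules = normalise(current)
    requested_rules = normalise(requested)
    to_add = sorted(requested_rules - current_rules)
    to_remove = sorted(current_rules - requested_rules)
    return to_add, to_remove


if __name__ == "__main__":
    # Addresses from the reserved documentation ranges (RFC 5737).
    in_firewall = {"203.0.113.7", "198.51.100.20/32"}
    dse_home_ips = {"203.0.113.7/32", "192.0.2.44"}
    to_add, to_remove = plan_allowlist_update(in_firewall, dse_home_ips)
    print("add:", to_add)        # add: ['192.0.2.44/32']
    print("remove:", to_remove)  # remove: ['198.51.100.20/32']
```

Removing stale addresses matters as much as adding new ones: a home IP address that a DSE no longer uses may later be assigned to someone else.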

An Afterword

I hope this article has answered some of your questions and helped you think about how to approach ML projects. Please leave a comment with any questions you’d like me to answer in future articles. Good luck and have fun!