Implementing an End-to-End Machine Learning Workflow with Azure Data Factory
From collecting data to sending results, ADF constructs the right MLOps Lifecycle on one screen
The impression I had of implementing Machine Learning until about three years ago was that of building a model in Python and deploying the project through an automated CI/CD pipeline. While that solved the basic requirement of producing predictions, it could never be called an end-to-end workflow: data storage and reporting, two significant components, were missing and had to be handled separately. In this article, I will walk through an entire Machine Learning Operations cycle and show how to establish every step of the way using Azure Data Factory (ADF). Yes, it is possible, easy, and extremely reliable. As a bonus, it also automatically sets you up to receive alerts for any data anomalies occurring throughout the process, so you do not have to monitor the workflow manually.
How would you define an End-to-End (E2E) ML Workflow?
My definition of a complete workflow starts from the point that data starts flowing into the system and ends when the end report is ready to be put out in front of the intended audience. The process needs data collection, storage, backend engineering, middleware, and frontend engineering. That’s when a product is ready to ship. And we will achieve this entire workflow with Azure Data Factory.
- Data Engineering: Data acquisition, Preparation, Pre-processing, Managing Storage.
- ML Model Engineering: Model Building, Training, Testing, Refining and Publishing.
- Code Engineering: Integrating the ML model into the final product backend (Service Fabric APIs, etc.)
- Reporting: Data Visualization and Storytelling.
A perfect representation of the Machine Learning cycle from start to end | Image source: MLOps, published under the Creative Commons Attribution 4.0 International Public License and used with attribution (“INNOQ”)
What is Azure Machine Learning?
Machine learning is a data science technique that falls under the larger Artificial Intelligence umbrella and allows computers to use historical data to forecast future behaviors, outcomes, and trends. By using machine learning or AI, computers learn to perform tasks without being explicitly programmed. Forecasts or predictions from machine learning can make apps and devices smarter. For example, when you shop online, machine learning helps recommend other products you might want based on what you have bought in the last six months. Or when your credit card is swiped, machine learning compares the transaction to a database of transactions and helps detect fraud.
Azure Machine Learning is a service offered by Microsoft Azure that can be used for all kinds of machine learning tasks, from classical ML to deep, supervised, unsupervised, and reinforcement learning. Either by using Python/R SDK to write code or by using ML Studio to implement a low-code/no-code model, you can build, train, and track machine learning and deep-learning models in Azure Machine Learning Workspace.
Azure allows you to begin training on your local machine and then scale out to the cloud. The service also interoperates with popular open-source deep learning and reinforcement learning tools such as PyTorch, TensorFlow, scikit-learn, and Ray RLlib.
Walking through the Workflow Step-by-Step
In the five steps detailed below, we will perform all required tasks from data collection, processing, modeling, training to building the predictive analytics reports for the customer. Before we start digging into the details, I have represented all the steps with brief descriptions in the diagram below.
Flow Diagram of the complete Machine Learning System | Image by Author
1. Azure Function to create a trigger for the Pipeline
We start our workflow by defining a function that evaluates when the pipelines should start running. This is generally called a ‘Trigger’: an event that starts the workflow. Every data pipeline needs data to start working with. This data can come as structured streams (.ss files), text files (.csv, .tsv, etc.), or Spark streams coming from Azure Event Hubs. In our scenario, let’s assume we have structured streams of data that get dumped into a Cosmos location (Cosmos being Azure’s NoSQL store) every 6–12 hours. To inform the Machine Learning service that upstream files are available and new raw data is ready to process, the Azure Function runs every hour to check whether new files have arrived. The remainder of the workflow executes only if this Azure Function returns True, indicating that new data is present.
Step 1. Adding an Azure Function to check availability of raw data streams we consume | Image by Author.
a. Azure Function runs every hour to check if new data is available.
b. If new data is available, it returns a True boolean value, which in turn triggers the remainder of the workflow.
c. If no new files are present, it returns False and nothing happens. It runs again after an hour to perform the same check.
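The decision logic of steps a–c can be sketched in plain Python. This is a minimal stand-in, not the real Azure Function: the function name and the timestamp-based check are illustrative assumptions, and an actual implementation would be a timer-triggered function listing files in the Cosmos location.

```python
from datetime import datetime

def new_data_available(file_timestamps, last_run):
    """Return True when at least one upstream file is newer than the last successful run."""
    return any(ts > last_run for ts in file_timestamps)

# Example: upstream dumps arrived at 02:00 and 09:00; the last run finished
# at 06:00, so the 09:00 file counts as new data and the workflow triggers.
files = [datetime(2023, 5, 1, 2, 0), datetime(2023, 5, 1, 9, 0)]
print(new_data_available(files, datetime(2023, 5, 1, 6, 0)))  # True
```

If nothing is newer than the last run, the function returns False and the pipeline simply waits for the next hourly check.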
2. Azure Databricks for Data Preprocessing and Storing to Data Lakes
ADF supports all modern data structures, both structured and unstructured streams of data, input through storage services like Data Lakes and Warehouses. But the best way to process data is to integrate ADF with Azure Databricks notebooks. A Databricks notebook is a typical Python environment that runs on top of a workspace created in Azure** and can perform every Machine Learning and data processing activity that Python is capable of.
Once the Azure Function signals that new data is available to be processed, the Databricks cluster is activated and this notebook starts running. We import the data using Python and perform initial cleaning, sorting, and other preprocessing jobs, such as imputing null and invalid inputs and rejecting data that will not be used by the Machine Learning model. This processed, clean data, now ready to be sent to the Machine Learning pipeline, is stored securely in an ADLS (Azure Data Lake Storage) Gen2 location. The data lake is a fast storage option for temporary and permanent storage needs and can be accessed directly by the ML activity.
Step 2. Build the Data Processing Python Notebook that cleans raw data, runs pre-processing scripts, and stores the output in an ADL location | Image by Author
a. Once the Azure Function returns True, the Databricks Python notebook starts collecting new data.
b. This data is cleaned and processed by the notebook and made ready for the ML model.
c. It is then pushed to a Data Lake Storage service, from where the ML activity can fetch it and run the model.
** Detailed steps for creating a Databricks workspace and performing data processing with Python using this workspace are explained in this tutorial — Transform Data using Databricks Notebook.
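As a rough sketch of the cleaning step described above, shown in plain Python for readability (the actual notebook would use Spark or pandas on the Databricks cluster, and the `id`/`value` fields are hypothetical):

```python
# Toy preprocessing routine: reject unusable rows, impute invalid values
# with the batch mean, and sort the output for the ML stage.
def preprocess(rows):
    valid_values = [r["value"] for r in rows if isinstance(r["value"], (int, float))]
    mean_value = sum(valid_values) / len(valid_values)  # imputation value
    cleaned = []
    for r in rows:
        if r.get("id") is None:                # reject rows the model cannot use
            continue
        value = r["value"]
        if not isinstance(value, (int, float)):
            value = mean_value                 # impute null/invalid inputs
        cleaned.append({"id": r["id"], "value": value})
    return sorted(cleaned, key=lambda r: r["id"])

raw = [{"id": 2, "value": 10}, {"id": 1, "value": None}, {"id": None, "value": 5}]
print(preprocess(raw))  # [{'id': 1, 'value': 7.5}, {'id': 2, 'value': 10}]
```

The cleaned batch written by the real notebook would then land in the ADLS Gen2 location described above.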
3. Build and Train models from the Stored Data using Azure Machine Learning
We now arrive at the core of the Machine Learning workflow — the place where the magic happens! Azure Machine Learning offers two ways of building ML pipelines.
You have the option of either using the Python or R SDK to build, train, and test ML models, or using Azure ML Studio to create a code-free drag-and-drop pipeline.
If you would rather start from scratch and follow the steps of creating training and testing runs for your ML model, a detailed example is shown in the repository below.
This repo shows an E2E training and deployment pipeline with Azure Machine Learning’s CLI. For more info, please visit…
Whichever method you choose, ADF treats the pipeline the same. The pipeline reads data from the ADLS storage account, runs its training and prediction scripts on the new data, and refreshes the model on every run to fine-tune the trained algorithm. The output of this machine learning pipeline is a structured dataset stored as a daily output file in Azure Blob Storage.
Step 3. Read data from Step 2 ADL Location and run the Machine Learning Model on it | Image by Author
a. Connect to ADL Storage account to fetch processed data.
b. Run the ML model on the data.
c. Send the output and prediction metrics to Azure Blob Storage.
** An alternative to using Machine Learning pipelines is implementing MLflow directly in the Databricks environment. Detailed tutorials of this implementation can be found here — Track Azure Databricks ML experiments with MLflow
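To make the “refreshes the model at every run” idea concrete, here is a toy stand-in: a one-feature linear model refit from scratch on each new batch using ordinary least squares. A real pipeline would train with Azure ML and a proper framework such as scikit-learn; every name below is illustrative only.

```python
# Refit a simple y = slope * x + intercept model on each new batch.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Each pipeline run would call fit() on the freshly processed batch from ADLS,
# then write predict() outputs to the daily file in Blob Storage.
model = fit([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear batch: y = 2x
print(predict(model, 5))                  # 10.0
```

The point of the sketch is only the cadence: every run re-reads the new data, refits, and emits a fresh structured output dataset.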
4. Push the Final Output into Azure Blob Storage
Step 4. Push the structured schema-based output to Azure Blob Storage | Image by Author
Azure Blob Storage is a cloud-based service that helps you create data lakes for your analytics needs and provides storage for building powerful cloud-native and mobile apps. Storage is scalable, and pricing is tiered so that the per-gigabyte cost decreases as usage grows. Premium blob storage is built on SSD-based storage architectures and is recommended for use in serverless applications like Azure Functions. The main reason I use Blob Storage is its seamless connectivity with Power BI. A complete tutorial for this can be found here.
a. Store analyzed data securely in dedicated blobs.
b. Connect with Power BI to retrieve data for data visualization.
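One small, concrete piece of this step is how the daily output file is addressed inside the storage account. The container name and path layout below are assumptions for illustration; the actual upload would use the `azure-storage-blob` SDK (e.g. `BlobClient.upload_blob`), which is omitted here to keep the sketch self-contained.

```python
from datetime import date

# Hypothetical daily naming scheme for the ML pipeline's output blob.
def daily_blob_path(run_date, container="ml-output"):
    return f"{container}/predictions/{run_date:%Y/%m/%d}/output.csv"

print(daily_blob_path(date(2023, 5, 1)))  # ml-output/predictions/2023/05/01/output.csv
```

A date-partitioned path like this keeps each run's output separate and makes it easy for Power BI to pick up the latest file.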
5. Access the Blob Storage from Power BI to build Reporting Dashboards
The best part of staying within the Microsoft Azure ecosystem is the availability of Power BI within the same stack. Power BI is an extremely powerful tool for building reports and presenting insights drawn from a model. Hence, I personally end my workflow by refreshing my Power BI dashboard every time the ADF pipeline completes a run. Power BI offers scheduled refresh, updating the data every day at a given time, so all insights, tables, metrics, and visualizations stay up to date and in sync with the data processing workflow.
Step 5. Connect your Azure Blob Storage to retrieve data in Power BI | Image by Author
“Storytelling” is the most important part of any analytics process.
The client or consumer of your workflow will never be interested in the code that brings the predictions to the table. They are only interested in answers to the questions they asked when submitting product requirements. Therefore, in my book, a workflow is complete only once the deliverable is in hand. The deliverable here is the report that shows what the Machine Learning model predicted, what it has been predicting every day, how metrics have improved, and what enhancements are needed. This is the art of Decision Science that Machine Learning gives us the power to practice.
a. Connect Power BI with the Azure Blob Storage to fetch data.
b. Set Power BI to refresh every day, so all the data is updated.
Alerts, Monitoring, and Anomaly Detection
A very tedious, yet extremely important, task of every DevOps or MLOps workflow is alerting and monitoring. These monitors are required to keep the data in check, avoid data staleness, and notify you of any anomaly or unexpected result at any step of the analysis. Azure Data Factory (ADF) conveniently allows us to set up several alerts on this pipeline right from the ADF monitoring dashboards.
Examples of alerts that can be set up:
- Staleness in data received from upstream.
- Failure in the pipeline or execution taking longer than a given threshold.
- Results stored in the Data Lake (ADLS) deviate significantly from past data.
- File sizes received from upstream or generated by the workflow are abnormally large or small.
Alerts can be set up to notify an email group alias, a security group or create incidents/bugs in your DevOps portal whenever an anomaly or failure is detected.
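As an illustration of the last alert condition in the list above, a simple z-score check can flag abnormally large or small files. In practice ADF's built-in metric alerts would handle this; the three-standard-deviation threshold and the function below are assumptions for the sketch.

```python
from statistics import mean, stdev

# Flag a new file whose size deviates abnormally from recent history.
def size_is_anomalous(history_bytes, new_bytes, z_threshold=3.0):
    mu, sigma = mean(history_bytes), stdev(history_bytes)
    if sigma == 0:
        return new_bytes != mu
    return abs(new_bytes - mu) / sigma > z_threshold

history = [100_000, 102_000, 98_000, 101_000, 99_000]
print(size_is_anomalous(history, 100_500))  # False: within the normal range
print(size_is_anomalous(history, 5_000))    # True: abnormally small file
```

A check like this, wired to an alert action, is what lets the workflow run unattended: anomalies surface as notifications rather than silent bad data.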