Take a Data Science Pipeline to Production

The book for learning how to turn your Machine Learning models into a production-ready Data Science Pipeline


The following is a review of the book Data Science in Production: Building Scalable Model Pipelines with Python by Ben Weber, one of the best texts for learning to build a robust Data Science Pipeline in Python.

What is a Data Science Pipeline?

The first step towards understanding why this book is so valuable is knowing what a Data Science Pipeline is.

A Data Science project, for example a Lead Scoring or Demand Forecasting model, is only valuable if it is put into production and used to improve or replace an existing process. For this to happen, the model has to be integrated into the technological infrastructure of the organisation it is being built for.

These projects also have many phases: Data Analysis, Feature Extraction, Model Building, Prediction, and Output. Each phase can be carried out in a different environment, and the final model has to be tuned along the way. There is also a clear distinction between model training and model application.

All of this can be tricky to understand. Knowing how to manage each step efficiently and, once the analytical phases are done, how to get your model ready for its application is fundamental to the success of an ML/DS project.

Because of this, Data Scientists need to know how to implement Data Science Pipelines, which carry out, in an organised manner, the sequence of steps that should happen whenever we want to use our model: receiving or fetching the data, preparing it for prediction, making the prediction, and then sending the output of this prediction to wherever it needs to go.
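
As a rough illustration of those four steps, here is a minimal sketch in plain Python. Everything in it is hypothetical: the `fetch_leads`, `prepare`, `predict` and `deliver` names, the lead fields, and the hand-set weights standing in for a trained lead-scoring model are assumptions made for this example, not code from the book.

```python
import json
import math

# Hypothetical weights standing in for a model that was trained offline.
WEIGHTS = {"visits": 0.4, "emails_opened": 0.8}
BIAS = -2.0


def fetch_leads():
    """Step 1: receive or fetch the data (hard-coded here for illustration)."""
    return [
        {"id": "a1", "visits": 5, "emails_opened": 2},
        {"id": "b2", "visits": 0, "emails_opened": None},
    ]


def prepare(lead):
    """Step 2: turn a raw record into model features, treating missing values as 0."""
    return {name: float(lead.get(name) or 0) for name in WEIGHTS}


def predict(features):
    """Step 3: score the features with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def deliver(results):
    """Step 4: send the output onwards (serialised to JSON here)."""
    return json.dumps(results)


def run_pipeline():
    """Chain the four steps for every fetched record."""
    results = [
        {"id": lead["id"], "score": round(predict(prepare(lead)), 3)}
        for lead in fetch_leads()
    ]
    return deliver(results)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real project each function would be swapped for the production equivalent (a database query, a feature library, a serialised model, a message queue), but the overall shape would likely stay close to this.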

Review of Data Science in Production

By learning how to build and deploy scalable model pipelines, data scientists can own more of the model production process and more rapidly deliver data products.

This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and Scikit-learn, and will focus on scaling up prototype models to production.

This is something that is sometimes left to Data Engineers, but Data Scientists should feel comfortable with it in order to create end-to-end applications. We think it makes the difference between an average data scientist or machine learning engineer and an elite one.

Awesome, let's see what the book contains!

Contents

The contents of Data Science in Production: Building Scalable Model Pipelines with Python are the following:

  • Chapter 1: Introduction – This chapter introduces Python and the discipline of applied data science, presents the data sets, models, and cloud environments used throughout the book, and provides an overview of automated feature engineering.
  • Chapter 2: Models as Web Endpoints – This chapter shows how to use web endpoints for consuming data and how to host machine learning models as endpoints using Flask (a web application framework) and Gunicorn (a WSGI HTTP server), with Scikit-Learn and Keras used for the models themselves.
  • Chapter 3: Models as Serverless Functions – This chapter will build upon the previous chapter and show how to set up model endpoints as serverless functions using AWS Lambda and Google Cloud Platform Cloud Functions.
  • Chapter 4: Containers for Reproducible Models – This chapter will show how to use containers for deploying models with Docker. We’ll also explore scaling up with ECS and Kubernetes, and building web applications with Plotly Dash. Docker and Kubernetes are an industry standard for deploying Machine Learning models, so you should definitely learn about them.
  • Chapter 5: Workflow Tools for Model Pipelines – This chapter focuses on scheduling automated workflows using Apache Airflow. We’ll set up a model that pulls data from BigQuery, applies a model, and saves the results.
  • Chapter 6: PySpark for Batch Modeling – This chapter will introduce readers to PySpark using the community edition of Databricks. We’ll build a batch model pipeline that pulls data from a data lake, generates features, applies a model, and stores the results to a NoSQL database. Python pipelines that handle large data sets are often implemented with a mix of PySpark and standard Python, so this chapter is also highly relevant.
  • Chapter 7: Cloud Dataflow for Batch Modeling – This chapter will introduce the core components of Cloud Dataflow and implement a batch model pipeline for reading data from BigQuery, applying an ML model, and saving the results to Cloud Datastore.
  • Chapter 8: Streaming Model Workflows – This chapter will introduce readers to Kafka and PubSub for streaming messages in a cloud environment. After working through this material, readers will learn how to use these message brokers to create streaming model pipelines with PySpark and Dataflow that provide near real-time predictions.

As you can see, Data Science in Production leaves no stone unturned: it touches on a wide variety of Data Science Pipeline tools and carefully walks through all the different pipeline steps.
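
To give a flavour of what Chapter 2 covers, the sketch below exposes a model as a web endpoint. It deliberately swaps Flask for a bare WSGI callable built with the standard library so that it runs without extra dependencies; this is our own illustration, not the book's code. The `score` function, its weights, the `call_app` helper and the JSON field names are all hypothetical.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def score(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 0.5 * features.get("visits", 0) + 1.0 * features.get("purchases", 0)


def app(environ, start_response):
    """WSGI application: read a JSON body and return a JSON prediction."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length) or b"{}")
        body = json.dumps({"prediction": score(features)}).encode("utf-8")
        status = "200 OK"
    except (ValueError, AttributeError, TypeError):
        # Malformed JSON, a non-object payload, or non-numeric feature values.
        body = json.dumps({"error": "invalid request"}).encode("utf-8")
        status = "400 Bad Request"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call_app(payload):
    """Invoke the app in-process, as a WSGI server would, without opening a socket."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    })
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    response = b"".join(app(environ, start_response))
    return captured["status"], json.loads(response)


if __name__ == "__main__":
    print(call_app({"visits": 4, "purchases": 1}))
```

Flask applications are themselves WSGI applications, so the same request-in, prediction-out shape carries over, and a WSGI server such as Gunicorn would sit in front of `app` in production.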

Summary

Data Science in Production: Building Scalable Model Pipelines with Python is a text that was much needed in the Machine Learning and Data Science landscape.

Many books, such as Python Machine Learning, Hands-On Machine Learning with Scikit-Learn & TensorFlow, or An Introduction to Statistical Learning, all of which we highly recommend for learning the theory and practice behind Machine Learning algorithms, focus heavily on those topics and leave aside the actual end-to-end implementation or ‘production’ phase that all these models have to go through to provide value.

If you go through Data Science in Production when you are already versed in how Machine Learning algorithms work and how to train and evaluate them, you will greatly stand out from other Data Scientists around you. In that sense, this book is more similar to Building Machine Learning Powered Applications than to the other ones we’ve mentioned, and we think it is an essential piece of the library of any hardcore Machine Learning engineer.

You can find it on Amazon here:

Data Science in Production: Building Scalable Model Pipelines with Python
  • Weber, Ben G (Author)
  • English (Publication Language)
  • 234 Pages - 01/01/2020 (Publication Date) - Independently published (Publisher)

As always, we hope you liked the review. If you have any comments or want us to review a specific text, send us a message at howtolearnmachinelearning@gmail.com. Also, don’t forget to follow us on Twitter and have a great day!


Tags: Data Science Pipeline, Data Science in Production, Python Data Science Pipeline, Machine Learning Books.