Designed for ML and Data Engineers who want to architect the data infrastructure required to support scalable machine learning in production environments

You will:

Build and monitor production services for capturing high-quality data for model training and for serving data computed in your data warehouse

Design batch data pipelines for training models that integrate diverse data sources, avoid data leakage, and run on-time and on-budget

Learn how to make the leap from batch to streaming pipelines to support real-time model features, model evaluation, and even model training

You should:

Be proficient at creating data pipelines in SQL/Python using a cloud data warehouse (Snowflake / Databricks / BigQuery)

Be comfortable with building simple web APIs and working with key-value stores like Redis or DynamoDB

Be familiar with core database join strategies such as hash joins and sort-merge joins

Watch Josh's recent live chat with Eric Sammer on Stream Processing Systems

Meet Your


Josh Wills

Josh Wills has built and led data engineering and data science teams at Slack, Cloudera, and Google. As an individual contributor, he was the technical lead for Slack’s search indexing pipeline and Google’s ad auction and experimentation library. Josh has also consulted on data pipeline design and machine learning systems at companies like Spotify, Airtable, Apple, and Capital One. He is the co-author of Advanced Analytics with Apache Spark and has given numerous popular talks and lectures about the practice of data science and engineering over the past decade.

Chris Rider

About Josh's Live Cohort

Over the past decade, leading technology companies have developed and deployed hundreds of machine learning models to help optimize their products. While there is ample material available about training individual models on a static data set, far less information is available about the data engineering best practices required to support production models. During this live course, we will dive into these data engineering best practices and the value they can bring to your ML systems. Specifically, we will answer:

How do you design the data collection systems needed to support robust and reliable machine learning models?

How do you scale from managing features for a single production machine learning model to many?

How do you track the performance of your data pipelines to balance reliability and cost?

When and how do you transition from batch to streaming data pipelines for model evaluation?

Session 1 - Data Collection for Machine Learning Systems

November 8th
1 pm (PST)

In our first session we will discuss how to design the data collection system that you need to support production machine learning. Specifically, you will learn how to:

Evaluate the technical options for data serialization and transport, and choose the combination that is right for your business.

Design a set of data contracts and associated data pipelines that can quickly and accurately map events from your production systems into queryable records in your data warehouse.

Integrate real-time data quality checks and automated alerts into your data collection system to catch and handle problems as soon as they happen.
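As a rough illustration of the data contract and quality check ideas listed above (a sketch, not course material), a contract can be modeled as a set of required fields and types that every incoming event is checked against before it is loaded into the warehouse. The event shape, the field names, and the `validate_event` helper are all hypothetical.

```python
from typing import Any

# Hypothetical contract for a "click" event: field name -> required type.
CLICK_CONTRACT: dict[str, type] = {
    "user_id": str,
    "item_id": str,
    "ts": float,  # event time, seconds since the epoch
}


def validate_event(event: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = []
    for field, expected in contract.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"bad type for {field}: {type(event[field]).__name__}")
    return errors


# Illustrative events only; real events would come from production systems.
good = {"user_id": "u1", "item_id": "i9", "ts": 1699477200.0}
bad = {"user_id": "u1", "ts": "yesterday"}

print(validate_event(good, CLICK_CONTRACT))
print(validate_event(bad, CLICK_CONTRACT))
```

In a real collection system, a non-empty violation list would typically route the event to a quarantine location and trigger an alert rather than just being printed.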

Session 2 - Data Modeling for Machine Learning Systems

November 10th
1 pm (PST)

In our second session we will explore how to architect the data models and pipelines that empower machine learning engineers to rapidly build trustworthy datasets for training and testing ML models. Specifically, you will learn how to:

Compose data from multiple sources and time scales into coherent datasets that are designed to avoid the most common sources of error in model training.

Evaluate the tradeoffs between using SQL and custom code in Python/Scala for pipeline steps based on the needs/skills of your stakeholders and the complexity of the problem you are trying to solve.

Evolve your data models from supporting a single ML use case into a shared knowledge resource that lets your company bring machine learning everywhere that it is needed.
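One common source of training error mentioned in this session's goals is data leakage, where a training row accidentally sees a feature value recorded after the label's timestamp. A minimal sketch of a point-in-time ("as of") lookup that guards against this is shown here; the feature history, the labels, and the `feature_as_of` helper are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, value), sorted by timestamp.
feature_history = [(10, 0.2), (20, 0.5), (30, 0.9)]


def feature_as_of(history, label_ts):
    """Return the latest feature value observed at or before label_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical labels: (label timestamp, label value).
labels = [(5, 0), (20, 1), (25, 0)]

# Each training row only uses feature values known at the label's timestamp.
training_rows = [(feature_as_of(feature_history, ts), y) for ts, y in labels]
print(training_rows)
```

In a warehouse, the same constraint is usually expressed as a join condition on timestamps rather than a per-row lookup.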

Session 3 - Monitoring and Observability for Modeling Pipelines

November 15th
1 pm (PST)

In our third session we will discuss how to design data pipelines that treat observability as a first-class concern to minimize cost and maximize throughput for machine learning use cases. Specifically, you will learn how to:

Adapt classic application performance monitoring (APM) techniques to the needs of data pipelines.

Judge when it is time to move from standard APM tools to more specialized tooling and infrastructure.

Session 4 - Tools and Pipeline Design for Model Evaluation

November 17th
1 pm (PST)

In our last session we will cover pipeline design and serving patterns for the data we collect to evaluate the performance of production models. Specifically, you will learn how to:

Design data models to explore how models perform on different subsets of your data.

Invent techniques for performing fast and accurate evaluation of ranking and recommender models against your core metrics.

Create streaming data pipelines to aid in rapid model evaluation and debugging.

Still have questions?

We’re here to help!

Do I have to attend all of the sessions live in real-time?

You don’t! We record every live session in the cohort and make each recording and the session slides available on our portal for you to access anytime.

Will I receive a certificate upon completion?

Each learner receives a certificate of completion, which is sent to you upon completion of the cohort (along with access to our Alumni portal!). Additionally, ScholarSite is listed as a school on LinkedIn, so you can display your certificate in the Education section of your profile.

Is there homework?

Throughout the cohort, there may be take-home questions that pertain to subsequent sessions. These are optional, but they allow you to engage more with the instructor and other cohort members!

Can I get the course fee reimbursed by my company?

While we cannot guarantee that your company will cover the cost of the cohort, we are accredited by the Continuing Professional Development (CPD) Standards Office, meaning many of our learners are able to expense the course via their company or team’s L&D budget. We even provide an email template you can use to request approval.

I have more questions. How can I get in touch?

Please reach out to us via our Contact Form with any questions. We’re here to help!

Book a time to talk with the ScholarSite team