Enroll now to join this live cohort-based course led by Josh Wills
Master the data engineering best practices required to support reliable & scalable production models.
Format: 4 x 2 hr live workshops (+ recordings of each)
Dates: 1 pm PST; November 8, 10, 15, and 17
Price: $1000 per seat (increasing Nov 4th; expense through L&D)
For engineers who want to architect the data infrastructure required to support scalable machine learning in production environments
Build and monitor production services for capturing high-quality data for model training and for serving data computed in your data warehouse
Design batch data pipelines for training models that integrate diverse data sources, avoid data leakage, and run on time and on budget
Learn how to make the leap from batch to streaming pipelines to support real-time model features, model evaluation, and even model training
Be proficient at creating data pipelines in SQL/Python using a cloud data warehouse (Snowflake / Databricks / BigQuery)
Be comfortable with building simple web APIs and working with key-value stores like Redis or DynamoDB
Be familiar with core database join strategies such as hash joins and sort-merge joins
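If you'd like a quick refresher on the join strategies mentioned above, here is a minimal pure-Python sketch of both (function and variable names are illustrative, not part of the course materials): a hash join builds an in-memory index on one input, while a sort-merge join sorts both inputs and walks them in lockstep.

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key` by hashing the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]

def sort_merge_join(left, right, key):
    """Join two lists of dicts on `key` by sorting both sides and merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        # Advance the right-side cursor past keys smaller than the left key.
        while j < len(right) and right[j][key] < l[key]:
            j += 1
        # Emit every right row that matches the current left key.
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})
            k += 1
    return out

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
events = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]
assert hash_join(users, events, "id") == sort_merge_join(users, events, "id")
```

Cloud warehouses choose between these strategies automatically; knowing what each one does helps you reason about pipeline cost and performance.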
Watch Josh's recent live chat with Eric Sammer on
Stream Processing Systems
Josh Wills has built and led data engineering and data science teams at Slack, Cloudera, and Google. As an individual contributor, he was the technical lead for Slack’s search indexing pipeline and Google’s ad auction and experimentation library. Josh has also consulted on data pipeline design and machine learning systems at companies like Spotify, Airtable, Apple, and Capital One. He is the co-author of Advanced Analytics with Apache Spark and has given numerous popular talks and lectures about the practice of data science and engineering over the past decade.
What you get out of this course
Over the past decade, leading technology companies have developed and deployed hundreds of machine learning models to help optimize their products. While ample material exists about training individual models on a static data set, far less is available about the data engineering best practices required to support production models. During this live course, we will dive into these data engineering best practices and the value they can bring to your ML systems. Specifically, we will cover:
Session 1 - Data Collection for Machine Learning Systems
In our first session we will discuss how to design the data collection system that you need to support production machine learning. Specifically, you will learn how to:
Evaluate the technical options for data serialization and transport, and choose the combination that is right for your business.
Design a set of data contracts and associated data pipelines that can quickly and accurately map events from your production systems into queryable records in your data warehouse.
Integrate real-time data quality checks and automated alerts into your data collection system to catch and handle problems as soon as they happen.
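To give a flavor of the data contracts and inline quality checks covered in this session, here is a minimal sketch; the event type, field names, and thresholds are all illustrative, not the course's reference implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PageViewEvent:
    """A minimal data contract for one tracked event."""
    user_id: str
    page_url: str
    ts_ms: int  # client timestamp, epoch milliseconds

def validate(event: PageViewEvent, now_ms: int) -> list:
    """Return a list of data-quality violations (empty means the event is OK)."""
    problems = []
    if not event.user_id:
        problems.append("missing user_id")
    if not event.page_url.startswith(("http://", "https://")):
        problems.append("malformed page_url")
    # Clock-skew check: reject events "from the future" or older than one day.
    if event.ts_ms > now_ms or now_ms - event.ts_ms > 86_400_000:
        problems.append("timestamp out of range")
    return problems

good = PageViewEvent("u42", "https://example.com/home", 1_700_000_000_000)
bad = PageViewEvent("", "example.com/home", 0)
assert validate(good, 1_700_000_001_000) == []
assert len(validate(bad, 1_700_000_001_000)) == 3
```

In a real collection system, the violation list would feed an alerting pipeline rather than a simple assert.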
Session 2 - Data Modeling for Machine Learning Systems
In our second session we will dig into how to architect the data models and pipelines that empower machine learning engineers to rapidly build trustworthy datasets for training and testing ML models. Specifically, you will learn how to:
Compose data from multiple sources and time scales into coherent datasets that are designed to avoid the most common sources of error in model training.
Evaluate the tradeoffs between using SQL and custom code in Python/Scala for pipeline steps based on the needs/skills of your stakeholders and the complexity of the problem you are trying to solve.
Evolve your data models from supporting a single ML use case into a shared knowledge resource that lets your company bring machine learning everywhere that it is needed.
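One of the most common sources of error mentioned above is data leakage: training on feature values that were not yet known at prediction time. A standard guard is a point-in-time ("as of") join, sketched here in plain Python with illustrative data (tuples of user id, timestamp, and value):

```python
def point_in_time_join(labels, features):
    """For each label row, attach the latest feature value already known
    at label time.

    labels:   (user_id, label_ts, y) tuples
    features: (user_id, feature_ts, value) tuples
    Returns:  (user_id, label_ts, y, value_or_None) tuples
    """
    out = []
    for user_id, label_ts, y in labels:
        # Keep only feature values known at or before the label timestamp.
        known = [
            (feature_ts, value)
            for uid, feature_ts, value in features
            if uid == user_id and feature_ts <= label_ts
        ]
        value = max(known)[1] if known else None  # latest known value
        out.append((user_id, label_ts, y, value))
    return out

labels = [(1, 5, 0), (1, 20, 1), (2, 10, 0)]
features = [(1, 1, 3), (1, 15, 9), (2, 12, 5)]
# User 2's only feature value arrives at t=12, after the label at t=10,
# so it must not leak into training: the joined value is None.
assert point_in_time_join(labels, features) == [
    (1, 5, 0, 3), (1, 20, 1, 9), (2, 10, 0, None),
]
```

The same idea is expressed in SQL warehouses with a timestamp-bounded join or window function; the course discusses when each formulation fits.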
Session 3 - Monitoring and Observability for Modeling Pipelines
In our third session we will discuss how to design data pipelines that treat observability as a first-class concern to minimize cost and maximize throughput for machine learning use cases. Specifically, you will learn how to:
Adapt the classic application performance monitoring (APM) techniques to the needs of data pipelines.
Judge when it is time to move from standard APM tools to more specialized tooling and infrastructure.
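As a taste of what "APM techniques for data pipelines" means in practice, here is a minimal sketch that borrows one APM idea, per-step latency and throughput counters, for a pipeline step. The `metrics` dict stands in for whatever monitoring backend you actually emit to; all names are illustrative.

```python
import time
from functools import wraps

metrics = {}  # stand-in for a real metrics backend

def instrumented(step_name):
    """Decorator that records rows produced and wall-clock time per step."""
    def wrap(fn):
        @wraps(fn)
        def inner(rows):
            start = time.perf_counter()
            result = fn(rows)
            elapsed = time.perf_counter() - start
            m = metrics.setdefault(step_name, {"rows": 0, "seconds": 0.0})
            m["rows"] += len(result)
            m["seconds"] += elapsed
            return result
        return inner
    return wrap

@instrumented("dedupe")
def dedupe(rows):
    # Preserve order while dropping duplicates.
    return list(dict.fromkeys(rows))

dedupe(["a", "b", "a", "c"])
```

From these two counters you can derive throughput (rows per second) per step and alert when it degrades, which is exactly the kind of adaptation the session explores.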
Session 4 - Tools and Pipeline Design for Model Evaluation
In our last session we will cover pipeline design and serving patterns for the data we collect to evaluate the performance of production models. Specifically, you will learn how to:
Design data models to explore how models perform on different subsets of your data.
Apply techniques for fast and accurate evaluation of ranking and recommender models against your core metrics.
Create streaming data pipelines to aid in rapid model evaluation and debugging.
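To illustrate the streaming-evaluation idea above: a common building block is a metric computed over a sliding window of the most recent predictions, so model quality can be watched in near real time. A minimal sketch (class name and window size are illustrative):

```python
from collections import deque

class SlidingAccuracy:
    """Track accuracy over a sliding window of the most recent predictions."""

    def __init__(self, window=1000):
        # deque with maxlen silently evicts the oldest outcome when full.
        self.window = deque(maxlen=window)

    def record(self, predicted, actual):
        self.window.append(predicted == actual)

    def accuracy(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

monitor = SlidingAccuracy(window=3)
for pred, actual in [(1, 1), (0, 1), (1, 1), (1, 1)]:
    monitor.record(pred, actual)
# Only the 3 most recent outcomes count: [False, True, True] -> 2/3
```

In a production streaming pipeline the same logic would run inside a stream processor keyed by model version, feeding dashboards and alerts.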
Still have questions?
We’re here to help!
Do I have to attend all of the sessions live?
You don’t! We record every live session in the cohort and make each recording and the session slides available on our portal for you to access anytime.
Will I receive a certificate upon completion?
Yes! Each learner receives a certificate of completion at the end of the cohort (along with access to our Alumni portal). Additionally, ScholarSite is listed as a school on LinkedIn, so you can display your certificate in the Education section of your profile.
Is there homework?
Throughout the cohort, there may be take-home questions that pertain to subsequent sessions. These are optional, but allow you to engage more with the instructor and other cohort members!
Can I get the course fee reimbursed by my company?
While we cannot guarantee that your company will cover the cost of the cohort, we are accredited by the Continuing Professional Development (CPD) Standards Office, meaning many of our learners are able to expense the course via their company or team’s L&D budget. We even provide an email template you can use to request approval.