



Data Engineering for Machine Learning
Master the data engineering best practices required to support reliable and scalable production models, taught in a live, interactive course by Josh Wills.
Access to: session recordings, curated resources and exclusive events
Designed to help make organizations
Data Driven
Product Leaders
Program and product managers who focus on metrics such as growth and revenue, and on prioritization decisions
Data Scientists
Data Scientists and Data Science managers who help map strategic decisions to actionable experimental designs and then interpret the results in a trustworthy manner



Engineering Leaders
Engineering managers, directors, VPs, and CTOs who want to make their organizations data-driven with metrics and A/B tests



Designed for ML and Data Engineers
who want to architect the data infrastructure required to support scalable machine learning in production environments
- Build and monitor production services for capturing high-quality data for model training and for serving data computed in your data warehouse
- Design batch data pipelines for training models that integrate diverse data sources, avoid data leakage, and run on-time and on-budget
- Learn how to make the leap from batch to streaming pipelines to support real-time model features, model evaluation, and even model training
- Be proficient at creating data pipelines in SQL/Python using a cloud data warehouse (Snowflake/Databricks/BigQuery)
- Be comfortable with building simple web APIs and working with key-value stores like Redis or DynamoDB
- Be familiar with core database join strategies such as hash joins and sort-merge joins
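As a quick refresher on the join strategies named above, here is a minimal Python sketch of an inner equi-join done two ways. The function names, the list-of-dicts row format, and the sample `users` and `events` data are assumptions made for illustration; real warehouse engines implement these strategies with far more machinery (spilling, partitioning, parallelism).

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner equi-join: build a hash table on `right`, then probe it with `left`."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    # Probe phase: each left row is matched against the bucket for its key.
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Inner equi-join: sort both inputs on the key, then merge with two cursors."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical sample data.
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
events = [
    {"user_id": 2, "event": "click"},
    {"user_id": 1, "event": "view"},
    {"user_id": 2, "event": "buy"},
    {"user_id": 3, "event": "scroll"},
]
joined = hash_join(users, events, "user_id")
```

Both functions return the same set of joined rows. The hash join needs memory for one side's hash table, while the sort-merge join pays for sorting but then streams through both inputs in order.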
Live Cohort
Over the past decade, leading technology companies have developed and deployed hundreds of machine learning models to help optimize their products. While there is ample material available about training individual models on a static data set, far less information is available about the data engineering best practices required to support production models. During this live course, we will dive into these data engineering best practices and the value they can bring to your ML systems. Specifically, we will answer:
- How do you design the data collection systems needed to support robust and reliable machine learning models?
- How do you scale from managing features for a single production machine learning model to many?
- How do you track the performance of your data pipelines to balance reliability and cost?
- When and how do you transition from batch to streaming data pipelines for model evaluation?
In traditional software development, CI/CD automates many tasks, including testing, building and deploying software. But CI/CD for ML is a different beast. Testing and deployment of ML can be triggered by many event types, and observability and logging requirements are materially different for ML. Today, no single tool can facilitate end-to-end CI/CD for ML. The process of testing, building and deploying ML requires a symphony of tools and glue code to create an integrated CI/CD system. To offer an entry point that many data scientists and engineers are familiar with, we’ll teach you how to integrate GitHub with other ML tools to build custom CI/CD automations for ML that will increase your engineering efficiency and prevent errors from being released to production.
- Create the flexible and evolvable data ingestion systems required to support streaming data pipelines and ML use cases.
- Analyze the tradeoffs between row-oriented and column-oriented data formats for use during data ingestion, analysis, model training, and model serving.
- Solve a broad class of common ML problems by building tools for moving large datasets from your data warehouse into a low-latency serving system in your production environment.
- Compose data from multiple sources and time scales into coherent datasets that are designed to avoid the most common sources of error in model training.
- Evolve your data models beyond supporting a single ML use case into a shared knowledge resource that lets your company bring machine learning everywhere it is needed.
- Create a data platform for feature evaluation and model training that enables data scientists and ML researchers to easily trade off speed, flexibility, and compute costs.
- Create tools for linking data profiling and quality checks from model training into your production model deployments.
- Understand the benefits and the limitations of using standard application performance monitoring (APM) tools for data and ML monitoring problems.
- Balance the need for comprehensive and thorough data quality checks with the cost and performance overhead required to perform those checks in both the data warehouse and the production environment.
- Understand the unique constraints and opportunities for evaluating ML models in an online serving environment beyond normal A/B testing.
- Design streaming data pipelines for performing rapid evaluation of models for recommendations, ranking, and classification problems.
- Create the data infrastructure required to support reinforcement learning and contextual bandits in order to support ML models that can learn in real time.
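To make the goal of composing data across time scales without leakage more concrete, here is a minimal sketch of a point-in-time join: each training label is paired only with the latest feature value observed at or before the label's timestamp. This is an illustration rather than course material. The function name, the tuple layouts, and the "at or before" cutoff are all assumptions; some teams require the feature to be strictly earlier than the label.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_history):
    """Attach to each label the latest feature value seen at or before its timestamp.

    labels: iterable of (entity_id, label_ts, label)
    feature_history: dict of entity_id -> list of (feature_ts, value),
        sorted by feature_ts in ascending order (assumed, not checked).
    """
    out = []
    for entity_id, label_ts, label in labels:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        # Index of the last feature row with feature_ts <= label_ts, or -1 if none.
        idx = bisect_right(timestamps, label_ts) - 1
        value = history[idx][1] if idx >= 0 else None
        out.append((entity_id, label_ts, value, label))
    return out


# Hypothetical data: feature values for user "u1" recorded at times 10, 20, 30.
feature_history = {"u1": [(10, 1.0), (20, 2.0), (30, 3.0)]}
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 15, 1)]
training_rows = point_in_time_join(labels, feature_history)
```

A naive join on `entity_id` alone would attach the value recorded at time 30 to the label from time 25, which lets information from the future leak into training. Here that label gets the value from time 20, and labels with no earlier feature get `None`.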
Team?
Sphere offers a range of subscription packages that provide discounts on all courses in our library. We help upskill employees at some of the world’s best companies. Learn more about pricing options here or book a time to talk to one of our staff below.
Book a free consultation
Learn live from a world-class
Instructor
Josh Wills has built and led data engineering and data science teams at Slack, Cloudera, and Google. As an individual contributor, he was the technical lead for Slack’s search indexing pipeline and Google’s ad auction and experimentation library. Josh has also consulted on data pipeline design and machine learning systems at companies like Spotify, Airtable, Apple, and Capital One. He is the co-author of Advanced Analytics with Apache Spark and has given numerous popular talks and lectures about the practice of data science and engineering over the past decade.
Guest Lectures by
Industry Experts

“In my two decades of experience building large-scale data systems in industry and academia, Josh stands out as a singular mentor, practitioner, and teacher. There are few more qualified to impart the benefits of foundational data engineering practices.”

“Josh has an incredible range of experience building data systems at companies of various scales. On top of that, he is a fantastic speaker. I've learned so much from him over the years -- he has an engaging way of explaining difficult concepts!”

“Josh is truly a rare breed with real world experience in applied mathematics, data science, ML, and data infrastructure. His knowledge really gives him a wildly unfair advantage when building data-driven products and systems. He's also one of the best presenters and mentors you'll find.”

Join a diverse and experienced
Community
This cohort gives you access to a rich community of like-minded professionals from some of the best businesses in the world. Even after the course ends, you will continue to learn and build with each other.

Exclusive Content
to advance your business
Get access to exclusive content through live sessions, meetups and our Student Portal (even after you finish the cohort). Ask questions and get personal feedback directly from your instructors and others taking the course.

Still have questions?
We’re here to help!
Do I have to attend all of the sessions live in real-time?
You don’t! We record every live session in the cohort and make each recording and the session slides available on our portal for you to access anytime.
Will I receive a certificate upon completion?
Each learner receives a certificate of completion, which is sent to you upon completion of the cohort (along with access to our Alumni portal!). Additionally, Sphere is listed as a school on LinkedIn so you can display your certificate in the Education section of your profile.
Is there homework?
Throughout the cohort, there may be take-home questions that pertain to subsequent sessions. These are optional, but allow you to engage more with the instructor and other cohort members!
Can I get the course fee reimbursed by my company?
While we cannot guarantee that your company will cover the cost of the cohort, we are accredited by the Continuing Professional Development (CPD) Standards Office, meaning many of our learners are able to expense the course via their company or team’s L&D budget. We even provide an email template you can use to request approval.
I have more questions. How can I get in touch?
Please reach out to us via our Contact Form with any questions. We’re here to help!