Watch Stefan’s recent live chat with Chip Huyen on

ML Ops: Buy vs Build

This course is designed for

Machine Learning Engineers & Managers

tasked with building, deploying & maintaining production ML models

Improve latency and throughput by selecting appropriate inference architectures
Reduce outages and MTTR by applying leading model observability approaches
Implement architectures for fast, reliable and scalable model deployment

Meet Your Instructor

Stefan Krawczyk

A hands-on leader and Silicon Valley veteran, Stefan has spent the last 15 years working on data and machine learning systems at companies like Stitch Fix, Nextdoor and LinkedIn.

Most recently, Stefan led the Model Lifecycle team at Stitch Fix. Its mission was to streamline the model productionization process for more than 100 data scientists and machine learning engineers. The infrastructure the team built created and tracked tens of thousands of models, and provided automated deployment that adhered to MLOps best practices.

A regular conference speaker, Stefan has guest lectured at Stanford’s Machine Learning Systems Design course and is an author of a popular open source framework called Hamilton. Stefan is currently working on a stealth project.

Want to get to know Professor Bernstein?

You’re invited to our live course information session

Professor Bernstein is hosting a live 30-minute session that is free for all to attend. He will give a brief introduction to his upcoming course and then answer audience questions.

4:00-4:30pm PST; March 15


About Stefan's

Live Cohort

The last decade has seen the rise of AI & Machine Learning, which now powers product experiences beyond those of the big technology companies. More than ever before, models must perform and function correctly to deliver business value. The cost of deploying a slow or bad model, or not detecting undesirable behavior quickly, could significantly impact customer experience and the business' bottom line.

Though it’s easy to get a machine learning model into production, deploying reliably and catching performance issues remain key challenges. The trade-offs of different design decisions can result in production outages or unhealthy models. In response, a proliferation of tooling (open source and proprietary) and the Machine Learning Operations (MLOps) practice emerged.

In this course we’ll explore where deployment & inference fit into MLOps. Then we’ll review common strategies and tactics that help stop bad models from reaching production and reduce the mean time to resolution of production model issues. We’ll cover common patterns for scaling inference and discuss strategies & tactics within the context of popular open source frameworks (where applicable) while applying them to common machine learning contexts. The course will not assume a particular model framework or architecture to ensure broad content applicability.

Learners will be able to improve production model reliability and performance using tooling-agnostic MLOps patterns. By the end of the course, learners will be able to answer these questions:

What components should my model deployment system have?

Where will my current approach to deployment & inference break down?

Is my current production architecture limiting my model performance?

How can I increase my model throughput?

How can I reduce outages/MTTR by applying model observability approaches?

Session 1 - What makes an ideal ML production system?

November 29th
8:30 am (PST)

Identify the key characteristics of common ML outages, and learn how to avoid them, by trying to break a few common scenarios.

Enumerate components required for robust deployment and observability by dissecting and analyzing accepted best practices.

Session 2 - Model training, representations, and inference

December 1st
8:30 am (PST)

Demonstrate why model representations are the cornerstone of reliability.

Illustrate common approaches to scaling model inference & throughput via mini case studies.

Connect inference SLAs to their impact on model architectures and representations.

Session 3 - Patterns for successful production model deployment

December 6th
8:30 am (PST)

Demonstrate common patterns to increase production robustness.

Make informed trade-offs between production architectures and internal operations by exploring how Stitch Fix made these decisions.

Session 4 - Approaches to model observability

December 9th
8:30 am (PST)

Apply common approaches to increase model observability via review of industry tooling.

Choose appropriate patterns for avoiding outages/reducing MTTR via mini case studies.

Manage outage instances successfully (without losing your cool).

Still have questions?

We’re here to help!

Do I have to attend all of the sessions live in real-time?

You don’t! We record every live session in the cohort and make each recording and the session slides available on our portal for you to access anytime.

Will I receive a certificate upon completion?

Each learner receives a certificate of completion, which is sent to you upon completion of the cohort (along with access to our Alumni portal!). Additionally, Sphere is listed as a school on LinkedIn, so you can display your certificate in the Education section of your profile.

Is there homework?

Throughout the cohort, there may be take-home questions that pertain to subsequent sessions. These are optional, but allow you to engage more with the instructor and other cohort members!

Can I get the course fee reimbursed by my company?

While we cannot guarantee that your company will cover the cost of the cohort, we are accredited by the Continuing Professional Development (CPD) Standards Office, meaning many of our learners are able to expense the course via their company or team’s L&D budget. We even provide an email template you can use to request approval.

I have more questions. How can I get in touch?

Please reach out to us via our Contact Form with any questions. We’re here to help!

Book a time to talk with the Sphere team