Machine learning and artificial intelligence are pivotal in helping search engines understand a user's intent. Consequently, search engines – when properly built – are now more capable than ever of showing relevant results to a user's text-based queries.
But the question remains: what's the current state of machine learning in search engines and, more pressingly, what's up next for search?
We recently got the chance to sit down with Doug Turnbull, Shopify's Senior Staff Engineer, who focuses his efforts on improving search relevance. Doug wrote "Relevant Search" back in 2016 and more recently co-wrote "AI-Powered Search" (2022). We were blown away by his level of expertise and what a master he's become in the field.
Before we dive into the discussion, though, let's get clear on the basics of how machine learning has impacted search engines.
Feeling confident about ML's role in search engines? Click here to jump straight to our interview with Doug.
Search engines use machine learning to deliver more consistently relevant results to search queries. They rely on state-of-the-art ML and AI models to better understand the "intent" of each user. In other words, machine learning and AI allow search engines to more accurately pinpoint what a user really wants to know, accomplish, or buy based on a small bit of text input. This process includes concepts like "natural language processing" (NLP), which helps computers translate the language in a text query into the results a user wants to see. As noted by Sunil Kumar on Towards Data Science:
"[NLP] provides a seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning."
The more queries people type, the more machine learning models can identify broad patterns in behavior and turn them into ranking signals. Over time, these ML models use all of this information to improve the quality of results. Ultimately, they help search engines deliver the products or information a user is searching for.
And the importance of displaying relevant results across your website can't be overstated, regardless of your site's niche.
Imagine, for example, that you run an e-commerce website. One of the challenges you encounter when users search for products is that people simply don't know what's available unless they see it. That means your search results need to show the types of products a specific user is looking for, even if that person doesn't explicitly type the product's name.
You can also think about the last time you ran into trouble with a SaaS product. Using the company's internal search function may have allowed you to quickly sift through knowledge bases and troubleshoot your issue on your own before contacting the support team. Not only does this improve the user experience, but it also reduces the resources a company needs to invest to serve its clients well.
We wanted to learn more about the state of machine learning in search engines and got to sit down with one of the world's leading experts and a senior search relevance engineer at Shopify: Doug Turnbull. Below, you'll find a transcript of our conversation that leads up to the final question, "What's next in search that nobody is talking about yet?" (a video response is included for that question, as well).
Check out the full interview and let us know your thoughts on ML-powered search on our LinkedIn page.
I got into this by happenstance. For a while, during the first chapter of my career, I did some C++ programming. Then, in 2012, I ran into somebody at a block party wearing a nerdy t-shirt and one thing led to another, and I got a job at a local search consulting company. A lot of people were just starting to get into open source search engines around that time, like Solr and Elasticsearch.
The company I joined would build these beautiful-looking apps but when users would enter a term into the search bar, the results didn't make any sense to them. So I got into optimization for performance and then a different kind of optimization problem: I've gotta understand this query and try to get the right results back. Then I got into writing a book and Elasticsearch plugins. Things just took off. Now I'm at Shopify, helping the small e-commerce sites out there that Shopify helps build.
In my book Relevant Search and throughout my career, there's been this idea of a relevance engineer. A lot of the new roles being recognized, like ML engineering and MLOps, really get at this too. It's not just about building some perfect model over here in a Jupyter Notebook or something, or an engineer trying to hack something together without data knowledge.
It's really about having a team with individuals who know both a little bit of data and a little bit of engineering. So, as I'm coding, I can make a decision with knowledge of both. For example, "I shouldn't go this way because that won't work in production. We need to go in this other way instead." That is really foundational to me — data and engineering knowledge in one brain as much as possible. And I think that's pretty unique for this space.
There are a lot of machine learning models out there that get publicized and get a lot of press. But the tricky thing is actually bringing those to a production system. There are so many trade-offs that you have to consider. And these conversations should happen between data people and engineering people.
It's about thinking of the search team as being data and engineering, as well as the other competencies like UX. I think what happens a lot in organizations is data science can sometimes be what's referred to as a service line where they consult on projects. They sprinkle their data or machine learning magic here, but then they have to leave and then they go to this other project and they sprinkle their magic over there.
Once you've sprinkled the magic, someone else has to maintain it. It has to have a life and evolve. And if no one is committed to that, it can die very easily. So, let's say I'm an engineer who was given a model that I got into production. Then there's a problem that I don't quite know how to fix because no one maintained the data. We really need to think of the search team as one cohesive unit of data and engineering working really closely together.
One of the really big ones that I think people don't think about enough is how difficult good training data is to acquire for an ML-powered search system. And by training data, I mean a judgment. It's a label of a document for a query as to whether or not that search result is relevant. So if I search for "shoe" and I get a coat back, that's not relevant. If I search for "shoe" and I get sneakers back, that is relevant.
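To picture what a judgment list might look like, here's a minimal sketch in Python; the queries, document IDs, and grades are all hypothetical.

```python
# A judgment labels a (query, document) pair with a relevance grade.
# Grades here are hypothetical: 0 = not relevant, 1 = relevant.
judgments = [
    {"query": "shoe", "doc_id": "sneaker-123", "grade": 1},
    {"query": "shoe", "doc_id": "coat-456",    "grade": 0},
    {"query": "shoe", "doc_id": "loafer-789",  "grade": 1},
]

# Training a ranking model usually means learning to score documents so that,
# for each query, higher-graded documents rank above lower-graded ones.
```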
But the devil really is in the details. In ML-powered search, we're usually thinking about gathering some kind of click data that we can use to determine whether or not the user considered this relevant as they were searching. Now, it seems like it should be straightforward. If the user clicks on this result, it's obviously relevant. And if they don't click on it, it's obviously not relevant. But it's actually a lot more complicated than that.
There are so many biases in how users interact with search results. For example, they can only click on what they're shown. If a type of shoe is never shown to you, you'll never click on it. So, your system will never get a chance to recognize whether or not that result is relevant. And then you will never train your models to recognize those as relevant for those queries.
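One common way teams try to compensate for this (a general technique, not something Doug prescribes here) is to weight clicks by how likely the result was to be examined at all. A rough sketch with made-up examination probabilities:

```python
# Hypothetical examination probabilities by rank position: users rarely look
# past the first few results, so clicks at lower ranks count as stronger evidence.
examination_prob = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.40, 5: 0.25}

def debiased_click_weight(rank: int, clicked: bool) -> float:
    """Weight a click by the inverse of how likely that position was to be seen."""
    if not clicked:
        return 0.0
    return 1.0 / examination_prob.get(rank, 0.1)

# A click at rank 5 (~25% chance of being examined) contributes roughly 4x the
# evidence of a click at rank 1, partially correcting for position bias.
print(debiased_click_weight(1, True), debiased_click_weight(5, True))
```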
Or consider how a product looks next to another product when it shows up in the search results. That can really have an impact. It's a problem related to something that we call diversity in search systems — how you give products a chance to be contrasted with each other. An ML-powered search system will learn from these interactions to decide which results to show in the future.
That approach doesn't make sense if you don't have a ton of traffic, because you need the clicks to learn from. In that case, you just sit down with an expert and go query by query and actually try to label things as relevant or not. The legal domain, like LexisNexis, is a classic example. Or medical search systems that target doctors are another. You're not gonna have a million users. There are some applications that might actually have dozens of users — like for patent examiners.
One of the things that we built at my last company was something called Quepid. It's a system where you sit side by side with someone and you rate results together. So, I search for "myocardial infarction," or some other technical medical term. We make sure we know what's relevant for that and what's not. And then we can tune that up a little bit together — not necessarily using machine learning. So, it's those kinds of cases that are almost a different game than these higher-scale consumer-facing situations.
So there are a lot of new systems like Pinecone and Vespa that have this new kind of search built-in. My book discusses performing relevance on traditional search systems. I suppose maybe one day I'll do a new edition on both!
With traditional search systems, it's about how you want to do search. You build an index for each word by tracking which documents it occurs in. For example, let's say 'cat' occurs in documents one, five, and seven. This data structure's really efficient at saying that if someone searches for 'cat,' I'm gonna go to that word and I'm gonna get all the documents that 'cat' occurs in. And then you do all of the relevance tuning and stuff to create this index to map to how people search with keywords. Maybe you try to massage that data a little bit with things like synonyms, etc. So cat also means 'kitty' and documents one, five, and seven are also relevant for kitty.
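That word-to-documents mapping is an inverted index. A toy sketch, with made-up documents:

```python
from collections import defaultdict

# Toy corpus keyed by document id; the contents are made up for illustration.
docs = {
    1: "the cat sat on the mat",
    5: "my cat chased a kitty",
    7: "cat food and dog food",
    9: "red shoes and a hat",
}

# Build the inverted index: each word maps to the set of documents it occurs in.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(sorted(index["cat"]))  # [1, 5, 7] -> the documents to score for "cat"

# Synonym handling can be as simple as mapping the query term before lookup.
synonyms = {"kitty": "cat"}
query = "kitty"
print(sorted(index[synonyms.get(query, query)]))  # also [1, 5, 7]
```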
What's interesting is that language follows a Zipfian probability distribution. There are words like 'the' that occur in every document. As you go a little bit down to the next most-common word, you likely get 'of,' which occurs almost everywhere. Then you might imagine that you get to things like 'hat,' which maybe occurs in 5% of your documents, and gradually you reach a very long tail of rare words that almost never occur. That's why traditional search systems are called sparse vector systems: out of these millions of documents, the word 'hat' is gonna occur in three of them.
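Here, "sparse" just means that if you picture each document as a vector with one slot per vocabulary word, almost every slot is zero. A tiny illustration with a hypothetical six-word vocabulary:

```python
# Hypothetical vocabulary of 6 words standing in for a real one of millions.
vocab = ["the", "of", "cat", "hat", "shoe", "myocardial"]

# "the cat in the hat" as a sparse vector: one count per vocabulary word, mostly zeros.
doc = "the cat in the hat"
counts = [doc.split().count(word) for word in vocab]
print(dict(zip(vocab, counts)))
# {'the': 2, 'of': 0, 'cat': 1, 'hat': 1, 'shoe': 0, 'myocardial': 0}
```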
The new cutting-edge vector databases that are coming out don't necessarily care about indexing words themselves. Instead, imagine that we took all of the 'cat' documents and we look at statistical patterns to see what context the word 'cat' appears in: things about pets or things about animals. With this information, we can define a cluster. In a vector, which is just a list of numbers, element one is gonna say that if it's high, it has to do with pets. If it's low, it has nothing to do with pets. Then you might say, okay, well, dimension two, we cluster some things together. And we notice that maybe this has to do with shoes. So, more in dimension two is shoes and less is not shoes. Then, maybe there's some weird thing talking about 'animal shoes' or 'animal footwear,' which is high in both of those dimensions (one and two).
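A toy version of that picture, with just two hand-labeled dimensions (real embeddings have hundreds, and none of them come with human-readable names):

```python
import math

# Hypothetical 2-dimensional "embeddings": dimension 0 ~ pet-ness, dimension 1 ~ shoe-ness.
vectors = {
    "cat collar":   [0.9, 0.1],
    "running shoe": [0.1, 0.9],
    "dog booties":  [0.8, 0.8],  # high in both dimensions: footwear for animals
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "dog booties" sits near both the pet cluster and the shoe cluster.
print(cosine(vectors["dog booties"], vectors["cat collar"]))
print(cosine(vectors["dog booties"], vectors["running shoe"]))
```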
More and more, transformer models like BERT try to cluster language together based on their meaning, rather than the direct text. You get what is called dense vectors. The vector only contains a couple hundred categories like 'pets' or 'footwear' instead of all possible words (as in sparse vectors). Of course, I describe them as categories, but they're not really categories as we would think of them. They're more like clusters that seem to go together or don't go together. To look up data in this dense vector world, you need a different kind of system to say, "I have this document that's high in the animal space and in the shoe space and in the other 200 or so spaces, so I need to find other things that are also similar to that."
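In practice, a library such as sentence-transformers (one option among many, not one Doug specifically recommends here) can produce those dense vectors, and lookup becomes a nearest-neighbor search. A rough sketch: the model name and documents are illustrative, and at real scale you'd use an approximate nearest-neighbor index rather than brute-force scoring.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["waterproof dog booties", "leather dress shoes", "catnip toy for kittens"]
doc_embeddings = model.encode(docs)

query_embedding = model.encode("cat shoes")
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# Rank documents by semantic similarity to the query.
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```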
Basically, vector search casts a wider net by getting these fuzzier semantic relationships closer together. So if I ask a question about cat shoes, I might get something back about doggy socks. But it's fuzzier, it's much fuzzier, and much more semantic. That new system's really gonna thrive in that use case, but where the traditional systems still thrive and where they still have value, is when the user needs a specific thing. What you're seeing more and more of is people figuring out the systems and techniques that blend these systems. There's one called reciprocal rank fusion that takes the search results of both, and zippers them together with a specific algorithm that tries to get the most of both worlds. Because it turns out that they're both really valuable and complement each other extremely well.
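Reciprocal rank fusion itself is a small algorithm. A minimal sketch, where the constant 60 is a commonly used default and the result lists are hypothetical:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output from a keyword engine and a vector engine for "cat shoes".
keyword_results = ["cat-shoes-101", "cat-toy-202", "shoe-rack-303"]
vector_results  = ["doggy-socks-404", "cat-shoes-101", "pet-booties-505"]

# Documents ranked well by both systems (like cat-shoes-101) float to the top.
print(reciprocal_rank_fusion([keyword_results, vector_results]))
```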
There are challenges around presentation bias — people only click on what's shown to them. I wrote a blog article about it. But maybe we could get better or different results that might be preferable to the user, so how do we explore that?
Take shoes. If we're always showing people sneakers, we're never showing them dress shoes. How are we gonna bleed in some things that are borderline relevant, but not completely out of left field? And I think this is an area and a challenge that every team faces. There are techniques in the field of general machine learning that I think are really gonna help us. One is what's called active learning.
We're going to talk about active learning in my course. So let's say you have a new feature like some attribute of the query or the product. Say it's dress shoes or dressiness or something. And we notice in our training data that this is a gap. Everything is low on dressiness cuz we're only showing people sneakers. Could we strategically keep everything the same, except explore just this one dimension and know that we're taking a very calculated risk?
When we show users results in this active learning space, what it means is that the models or the systems participate in their own learning. They basically notice the gap and select it as an exploration area so they can get a good distribution of knowledge in their training data. It's sort of like feeding the training data back into itself. Not only are the systems constantly learning what should be ranked highly based on what's clicked; they're also learning another kind of model, one that knows what to show users to maximize knowledge gain without trading away too much relevance.
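A very rough sketch of that idea: find a feature dimension the training data barely covers and deliberately surface a candidate along it. Everything here (the feature, the documents, the selection rule) is hypothetical:

```python
import random

# Hypothetical training examples: each has a "dressiness" feature and a click label.
training_data = [
    {"doc_id": "sneaker-1", "dressiness": 0.10, "clicked": True},
    {"doc_id": "sneaker-2", "dressiness": 0.20, "clicked": True},
    {"doc_id": "sneaker-3", "dressiness": 0.15, "clicked": False},
]

# Candidate results we could show for the query but have no interaction data on.
candidates = [
    {"doc_id": "loafer-9",  "dressiness": 0.7},
    {"doc_id": "oxford-4",  "dressiness": 0.9},
    {"doc_id": "sneaker-8", "dressiness": 0.2},
]

# The gap: nothing in the training data is high on dressiness.
max_seen = max(row["dressiness"] for row in training_data)
unexplored = [c for c in candidates if c["dressiness"] > max_seen]

# Take a calculated risk: slot one unexplored item into an otherwise normal page
# so the next round of clicks teaches the model about this region of the feature.
exploration_pick = random.choice(unexplored)
print(exploration_pick)
```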
What is well known and established feeds into the next level of maturity, which is reinforcement learning. So reinforcement learning is like training a pet: now the machine learning algorithm is training your system. If it does well, it gets a little treat (clicks). Then it's incentivized to go after that a bit more. If it doesn't do well, it doesn't get the treat and it's gonna try something else. And so you're constantly, in real time, feeding search systems back on themselves to explore and exploit.
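The explore/exploit loop Doug describes is close to a classic bandit setup. Here's a minimal epsilon-greedy sketch in which a click is the "treat"; the rankings and numbers are made up:

```python
import random

# Running click/show counts for two candidate rankings, updated from user feedback.
ctr_estimates = {"ranking_a": {"clicks": 0, "shows": 0},
                 "ranking_b": {"clicks": 0, "shows": 0}}

def choose_ranking(epsilon=0.1):
    """Mostly exploit the best-performing ranking; occasionally explore another."""
    if random.random() < epsilon:
        return random.choice(list(ctr_estimates))
    def ctr(name):
        stats = ctr_estimates[name]
        return stats["clicks"] / stats["shows"] if stats["shows"] else 0.0
    return max(ctr_estimates, key=ctr)

def record_feedback(name, clicked):
    """The 'treat': a click reinforces the ranking that earned it."""
    ctr_estimates[name]["shows"] += 1
    ctr_estimates[name]["clicks"] += int(clicked)

# Each search becomes a round of choose -> show -> observe click -> update.
choice = choose_ranking()
record_feedback(choice, clicked=True)
```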
We want to give a big thanks to Doug Turnbull for taking the time to chat with us. If you're interested in learning more, you should definitely consider enrolling in Doug's upcoming course on ML-Powered Search. The course is fully accredited, which means most learners have tuition costs covered by their organization's L&D budget.
Click here to secure your seat today!