The Science Behind Product Similarity

Authored by Alicia Perez and Javier Ordonez
Last week we introduced you to our newest feature, Product Similarity, and how it can be a merchandising game-changer. Today we’re talking more about the science behind it. Our data scientists, Javier Ordonez and Alicia Perez, are taking over today’s blog to show you a little of their magic from behind the curtain.
Product similarity is a problem that is very easy to describe and understand, but quite challenging to solve. Similarity can be defined as a measure of how much alike two products are. However, to find the products most similar to a given query product, we first have to define what it means for two products to be similar. The level of similarity between two objects is not an objective measure; it depends heavily on the domain and on the problem being solved. For instance, in a newspaper, similar articles may be those that share the same tags, or those that have been read by the same users. From an e-commerce perspective, recommendation engines traditionally rely on user profiling, defining two products as similar when both have been purchased by customers with a similar profile. In the fashion domain, the perceived similarity between clothing items can even vary from person to person, and while there is no universal rule that defines when two dresses look alike, features such as color, material, and design definitely play an important role.
From a data science perspective, the similarity between two products can be measured as the distance between them in a Euclidean space whose dimensions represent the products’ features. If that distance is small, the similarity between the two products is high, while a large distance indicates a low degree of similarity. The main problem here is how to transform the information we have about a product (color, material, etc.) into a vector representation suitable for computing accurate similarity distances.
Image Source: Learning Visual Clothing Style with Heterogeneous Dyadic Co-Occurrences
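To make the distance-based view of similarity concrete, here is a minimal Python sketch. It assumes each product has already been turned into a small numeric vector; the three feature dimensions and all of the values are made up for illustration and are not real product data.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical features, each scaled to [0, 1]:
# [redness, sleeve length, formality]
red_dress = [0.9, 0.2, 0.8]
burgundy_dress = [0.8, 0.3, 0.7]
blue_jeans = [0.1, 0.0, 0.2]

d_close = euclidean_distance(red_dress, burgundy_dress)
d_far = euclidean_distance(red_dress, blue_jeans)

# A smaller distance means a higher similarity.
print(f"red dress vs burgundy dress: {d_close:.3f}")
print(f"red dress vs blue jeans:     {d_far:.3f}")
```

With hand-picked features like these, the two dresses end up much closer to each other than the dress and the jeans. The hard part, as noted above, is producing good vectors in the first place.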
However, one underlying technology has become a game changer and sits at the core of current product similarity engines: deep learning. In technical terms, deep learning algorithms use many layers of information processing to automatically transform unstructured raw data into useful abstract features. In fact, deep learning has already delivered state-of-the-art results on problems involving unstructured data, and it is the de facto industry solution for working with text and images.
Deep learning algorithms can automatically learn the best vector representation of a product for computing its similarity distance. But optimizing these algorithms is not straightforward. To learn the desired transformation function, deep learning algorithms require high-quality data, since they learn by example. That means that in order to learn what the similarity level between two dresses is, we have to provide examples of similar dresses, and the more the merrier. This data is what we call the “training data” for similarity, and it is required both for creating the vector transformation function and for monitoring its quality.
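One common way to turn "examples of similar products" into a training signal is a triplet loss. The post does not say which objective is used, so treat the following as an illustrative assumption rather than a description of any particular system. Each training example is a triplet: an anchor product, a positive (labelled similar to the anchor), and a negative (labelled dissimilar). The loss is zero once the positive is closer to the anchor than the negative by at least a margin. The vectors and the margin value below are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalty that shrinks to zero when the positive is at least
    `margin` closer to the anchor than the negative is."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-dimensional embeddings produced by some model.
anchor = [0.0, 0.0]
similar = [0.0, 1.0]      # distance 1 from the anchor
dissimilar = [3.0, 4.0]   # distance 5 from the anchor

# Embeddings already respect the labels: no penalty.
good = triplet_loss(anchor, similar, dissimilar)

# Labels swapped: the "similar" item is far away, so the loss is large.
bad = triplet_loss(anchor, dissimilar, similar)

print(good, bad)
```

During training, a network's weights would be adjusted to reduce this penalty over many labelled triplets, which gradually pulls similar products together and pushes dissimilar ones apart in the vector space.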
When it comes to good similarity training data, most publicly available fashion datasets happen to be only good enough for clustering purposes, meaning that they can be used to create a system that is merely capable of grouping together, let’s say, dresses with dresses and shoes with shoes. But in order to create a high-quality fashion similarity system, we have to address a learning-to-rank problem. We need a system capable not only of detecting products from the same category, but also of ranking them based on their perceived similarity. At StyleSage, our fashion gurus have created proprietary similarity training data with the granularity required to solve this particular problem. We have collected, processed, and curated terabytes of fashion data over the years, and our experts have labelled millions of products to define what constitutes a sensible and realistic measure of similarity in the fashion domain.
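The difference between grouping and ranking can be sketched in a few lines of Python. Given a query product, the goal is an ordered list of the closest catalog items, not just a category match. The catalog names, the 2-dimensional vectors, and the `rank_by_similarity` helper are all hypothetical, and the brute-force scan is only meant for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query, catalog, k=None):
    """Return (name, distance) pairs sorted from most to least similar.

    `catalog` maps product names to embedding vectors. If `k` is given,
    only the k closest products are returned.
    """
    scored = [(name, euclidean_distance(query, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored if k is None else scored[:k]


# Hypothetical embeddings for a tiny catalog.
catalog = {
    "shoe_c": [5.0, 5.0],
    "dress_b": [0.0, 1.0],
    "dress_a": [1.0, 0.0],
}
query_dress = [0.9, 0.1]

ranking = rank_by_similarity(query_dress, catalog)
top_two = [name for name, _ in rank_by_similarity(query_dress, catalog, k=2)]

for name, dist in ranking:
    print(f"{name}: {dist:.3f}")
```

A clustering-only system would stop at "both dresses belong together"; the ranking also says which dress is the closer match. Scanning every product like this is fine for a toy catalog, while large catalogs typically rely on a nearest-neighbor index instead.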
So with fashion domain expertise and the power of deep learning, what do you get? Simply stated, best-in-class results when you query for similar products inside our platform. And not only that, but you also have the flexibility to narrow or broaden your search so that you see precisely what’s in your shoppers’ consideration sets. Learn more about Product Similarity here.