Use-cases/Project
26 min read

What drives your customer’s decisions? Find answers with Machine Learning Models! H&M’s Kaggle competition

Introduction

We recently took part in the Kaggle H&M Personalized Fashion Recommendations competition where we were challenged to build a recommendation engine that would predict which articles a customer would buy in a particular week starting on September 23, 2020. 

“Dataism points out that exactly the same mathematical laws apply to both biochemical and electronic algorithms. [...] You may not agree with the idea that organisms are algorithms but you should know that this is current scientific dogma.”

Homo deus by Yuval Noah Harari, chapter 11: The Data Religion

As stated in the quote above, I will present humans' decision making process as if it was an algorithm which is processing multidimensional input information and generates outputs in a form of decisions. I hope such an observation attracts your attention rather than scares you…

Also, I will take you on a tour to a data scientist’s mind and will show you how to represent decision-making process with numbers and how to match machine learning algorithms to real-life concepts.

Recommendations everywhere <3  

Recommendations have become a part of our daily lives and it’s hard to imagine how the world functioned before that era. Personally, I LOVE Spotify, creating personalized playlists based on what I like - I am an explorer and I cater for new experiences - new songs that match my multi-genre music taste. But discovery takes time - this is why I need a finite number of songs that I can listen to every week and add those to my favorites which I like the most. Similarly to Netflix - just based on a short history of what you watched and liked, you can see a list of tv series or movies you will most probably enjoy as well. How much time does that save you?

So apparently H&M also wants to save your time when you are looking for new clothes. When I go to a store I need to search through tens or hundreds of items of clothing I would never buy, just to find the ones that match my current style. I really like the idea of having personalized clothing recommendations and choosing one or two out of 10 that were recommended.

What makes you buy that jacket?

Before solving a problem with machine learning models, you really need to understand it. As a data scientist, I specialize in solving business problems involving people’s decisions, experiences, needs and feelings. At the beginning I always try to put myself in customer's shoes, imagining their decision-making process. Then, I translate it into a numeric representation using feature engineering and model that process with machine learning algorithms (try to imitate reality with numbers).

jacket woman

When I think about buying clothes in September, the first thing that comes to my mind is the weather change. Let’s travel back in time and place ourselves in New York in September 2020, which was the crucial time of the competition. To make it easier, let’s create a persona - Emily, who is in her 20s and goes to college.

weather

Summer has finished, autumn is about to make an appearance and the leaves are starting to change colour. It is mostly sunny with some cloudy days, still rather hot during the day (74°F ≈ 23°C), but cold at night (51°F ≈ 10°C).

“I am going out with friends for a drink on Friday and will be coming back late. It will be cold then, so I need a new jacket.”

Here we have captured the first moment when a NEED for a purchase has occurred. This is one thing that we will try to model - predicting what a customer needs at a particular moment.

What she can immediately do is open the H&M online store and look for jackets she likes. She scrolls through the website and realizes that she would prefer a coat, which she will need for the whole of autumn.

shop-screen

This is another instance when a need was realized in Emily’s brain. She now knows she wants a coat and she will start looking for the one she likes the most.

shop2-screen

Here I would like to distinguish another component that a final decision is influenced by - the customer’s style - what he or she likes, what would match their personality. 

Finally, she goes through all the articles and concludes she wants to feel more elegant this season. She really likes the navy-coloured one, but it costs almost $200 and she can't afford it. So her second choice is the classic, cream coat which costs only $69.99.

kaggle-h-m-competitions-machine-learning-models

This is the third dimension that we can look at - the money. 

When she gets familiarized with the H&M offer, she decides to go to a local store. She does not like buying clothes online without trying them on. She searches for nearest H&M stores using Google Maps and she finds out that there is a store close to Central Park, where she will be meeting her friends in the evening.

map-hm

On her way, it started raining. She had forgotten to check the weather forecast. So now she also needs an umbrella! 

When she is in the store, on her way to the coats section, she notices black pants that would match her look. She grabs them on her way and takes them to the fitting room, together with the coat she has finally found. 

She likes both things and decides to buy them. At the checkout, she grabs a classic black umbrella, which is currently at a reduced price. She does not pay much attention to the umbrella's style so she takes the cheapest one.

When I think about it, she was lucky, because the size of clothes she wanted was available at the store. It’s worth mentioning here that there are also external factors that influence our decisions and one of these is a particular store’s stock at a particular moment in time. If the coat was not available, she would need to:

  • choose a different coat
  • go to another H&M store to find the one she wanted
  • drop the idea of buying the coat at that moment
  • look for a different coat in a competitor's stores

The decision would depend on various factors:

  • how much she liked the particular coat (“it was nice” vs. “I love it and must have it”)
  • how much she needed a coat immediately 
  • what alternatives were available at that particular moment
  • how far it was to the other H&M stores
  • if the coat was available at all anywhere in New York

In hindsight, the crucial factor in selecting THE coat resulted from visiting the H&M online store. Imagine that she had not done that and she went straight to the store. There is a chance she would have missed it if she did not notice it when passing it by. Also, clothes can be presented in an appealing way on pictures, which also influences single decisions. To conclude, a customer’s decisions also depend on the ways they do their shopping:

  • if they look for clothes using an online store, create a shortlist and decide to try things on at a local store
  • if they just order things online, try them on at home and return them if they don’t like them
  • if they prefer the 'old-fashioned' way and only select clothes at their local store.

More factors that influence a customer’s decision that I can think of are:

  • advertising and marketing - passing by a poster where a young, beautiful, fashionable, powerful and undeniably happy model is advertising a leather jacket can make you want it
  • influencers -imagine a few of the most popular, young actresses in your country that started wearing and promoting pink hats - if there were also journalists who would comment on these looks saying they look really fashionable - it would definitely impact the customer’s need of having such a hat
  • occasions - Christmas, Thanksgiving, Easter or All Saints’ Day, birthdays, weddings and the such like, when you meet your families and there is a special atmosphere which creates the need of buying new, nice, elegant clothes.

To sum up the section, I created a mind map which organizes the ideas about all of these factors in a much clearer way.

decision-scheme

Data

Before we get into the conceptualization part, I will generally describe data sources that were provided in the competition. Familiarizing yourself with the data is a necessary step when building Machine Learning Models, because:

  • you become more aware of the business problem
  • possible solutions that can be implemented appear in the process of analyzing datasets
  • you start realizing the constraints and immediately drop some of the ideas due to missing data/insufficient quality

Customers

The customer table e.g. contains information about the customer's age, club membership status or newsletter frequency. There are over 1.3 million customers that we need to provide recommendations for**.**

table1

Unfortunately, postal code information is hashed so we will not be able to calculate any features based on location, proximity to stores etc.

Transactions

Transactional data is the representation of a customer's interests, needs and style. There are 31 788 324 transactions in the table. The schema is as follows:

table2

Article hierarchy

One of the datasets that I spent a lot of time analyzing was, how I called it, article hierarchy. All of the articles have been described using multiple dimensions - we can see which product group, department, section name or garment group the article has been assigned to. Also, there is information about perceived color name and graphical appearance - whether the T-shirt appears to be treated or has an all-over pattern.

table3
Example data on 4 randomly selected articles.

Article description

H&M also provides detailed descriptions of articles. Sometimes, just based on the text you can somehow imagine how the product looks like, what materials it is made of, what style it represents. 

Embeddings derived from the text can give a model more information about the product and features built on them can tell us more about what a customer's style is. Thus, text embeddings could contribute to an ML model performance because they can identify deeper, nuanced features of clothes. 

We also believe that the text description features should work also then, when a customer has not read them, because they describe features of clothing that the customer sees when screening the store during their shopping.

table4
Example description of previously selected 4 articles


As you can see, these tell us a lot about the product. However, please notice that there were also some examples with more generic descriptions.

It is worth adding that the description is present in the product’s site in the online store. Also, the description is the same for different color variants of the same product.

jeans

Article images

What was really cool about the competition was that the dataset consisted of data of different types, including images, which gave the contestants a much wider range of methods they could use to make better-quality recommendations. 

The images were vertical and rather standardized - the majority of them looked like the examples below. They were centered, with a light background and similar frame ratio.

clothes
Example images of 4 previously selected articles.

Conceptualization

Having the Emily’s story in mind, I will create a list of the decision’s components that will be included in the final model concept.

ML Model: General need for an intem of clothing part

Expectations

First and foremost, the model should include features representing a customer’s need for a purchase of a particular clothing type such as “a pair of jeans”, “a raincoat”, “a smartshirt”, etc. This component should be estimated for a particular time period, e.g. “during the next week”. So, I want the model to return a score, or ideally a probability estimate, e.g. 

in the period between 10th and 17th June 2020 there is a 10% chance that the customer will buy a cotton T-shirt and 0.5% chance they will buy a warm jacket

*the chances are not 0%, because some may travel to a place where it’s cold or they need it for a theater classes at school, for example

What is the probability that a customer will buy a t-shirt or a warm jacket in the week starting 10th June 2020?

screenshot 2022 06 21 at 14 26 11

Solution / Machine Learning Models

Propensity-to-buy models on type-of-clothing level

The first group of Machine Learning models that will be used to predict a customer’s need to buy a product in a particular time frame would be classic propensity-to-buy models. They should not try to predict article_id, but something more general e.g. product_type + index_group_name. Let’s see the top 25 products that we would be predicting.

top-products


It seems that such a grouping matches the general way of thinking about customer’s needs when it comes to clothing.

Now let’s choose Machine Learning models which will solve this subproblem.

table5

Time-Series based seasonality of type-of-clothing sales

This was the way to predict a customer's need for a product purchase on an individual level. However, there are more general trends when it comes to person's purchasing patterns throughout the year. This is where a time-series seasonality forecasting model would come in handy. Let’s see monthly shares of sales for a selection of 4 products: sunglasses, dress, cardigan and scarf. 

general-trends

It is quite obvious that not many people would be interested in buying a cardigan when it's warm outside. Its’ sales peak in September, when autumn starts. So, such a model for each product type and for each week of the year would definitely help with making better product recommendations.

table6

ML Model: Customer’s style

Expectations

Secondly, we need a model whose results will tell us that a customer prefers one t-shirt over another one. Ideally, it should sort all articles within a product category from the ones that a customer finds the most matching their style to the least.

clothes

Solution / Machine Learning Models

Image embeddings

Firstly, we plan to generate more features describing clothes. There are statistical techniques that can translate images into a matrix, where each vector will represent a latent (hidden, not understandable at first sight) feature of the object. 

Take the leopard-print for example. An algorithm could learn to detect such a pattern and create a separate feature for it. Values of the vector would range from 0 to 1, where 0 would mean no leopard-print and 1 would mean the leopard-print on the whole t-shirt. 

Then, having such embeddings generated for all clothes, we can use them in classical machine learning algorithms, e.g. we could see “which values of embeddings” the customer buys. Imagine a lady who has recently been buying only leopard print clothes. She would have really high scores in the vector representing the print in her past purchases. Thus, the model would learn that and would start recommending similar products.

similar-products


To achieve such a result, we will use a convolutional autoencoder. If you have never heard about this, there will be a separate blog post about it soon. Shortly, it is a neural network that is trained to reproduce its input image in the output layer. 

The output of the algorithm will be structured like in the example below.

table7

Text embeddings

Similarly to image embeddings, detailed description of articles can give the model another dimension to look at. Take this description for example:

image3

Calf-length dress in crisp cotton poplin. Narrow, detachable, adjustable shoulder straps, wide smocking over bust, and softly draped A-line skirt. Unlined. Made from organic cotton, this dress is part of our hand-painted wildflowers collection. The pattern was developed by our print designers Kavita, Abigail, Holly, and Florentin, who picked their favorite wildflowers and recreated them in watercolor.


The description certainly adds new information that is not available anywhere else. Just based on the image, you could not have figured out that the dress is hand-painted.

However, in order to use this in Machine Learning models, we need to transform it into embeddings. For that we will use a pre-trained transformer which we will apply to tokenized, stemmed and lemmatized article descriptions. 

The output of the algorithm will be structured as like in the example below.

table8

Recommendation engines

When you think about recommendations, content-based and collaborative filtering models come to your mind, because these are the two most popular ML algorithms in this field. 

Content-based filtering

According to the definition found here

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

So just like in the leopard-print  example above, if a customer likes the pattern, let’s recommend her other products with the same pattern. It’s quite a simple solution, but may work in some cases. It will help to determine which products share a similar style. Then, we will be able to recommend the ones with the highest similarity to what the user bought before.

Collaborative filtering

According to the definition found here

collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on the hand-engineering of features.

As you can see, this algorithm adds another level to the content-based filtering as it can recommend clothes that other customers purchased together, if you share similar products in your historical transactions.

ML Model: Available stock / Lifetime of an article

As there are 105 542 unique article_ids and we have to predict just 12 articles for each customer, we figured out that probably not all articles are offered at every point in time. In order to get a better understanding on what an article_id lifetime looks like, I visualized weekly sales of a sample of articles, together with the line representing a cumulative share of total sales of the article. 

weekly-sales

weekly-sales-2



You can see that a few patterns can be distinguished that can help us determine the probability that the article would be offered in a given week.

Expectations

Estimated volume of sales per article_id per week (in total or per store).

Here you can see that there can be some overlap with the Time-Series model proposed before, but please note that the model before was based on a more general product type level, not on an article_id level. The “stock” model can rather be used for short-time predictions and not for seasonality calculations. 

Solution / Machine Learning Models

table9

ML Model: The customer's final decision

When we already have a list of all the factors that influence a customer's decision, we are able to create a final-decision model which will be an ensemble of all the solutions.

Final concept

The final concept of the solution that we came up with is represented in the schema below.

schema

We start with data sources and generate features using multiple techniques, including embeddings generation with neural networks. All features are then stored in a Feature Store (see Feature Stores comparison blog post written by our colleague Jakub Jurczak). When all features are collected, multiple Machine Learning models can be estimated. 

The expected input to a final Ensemble Model would be a table with an index of: customer_id, week_id, article_id. Columns would represent results from various ML models and possibly some other descriptive statistics and features concerning the customer or article itself.

table

With such a table form with scores for each customer and each article, we have more than enough information to generate great recommendations. Adding a properly designed model validation should create a powerful tool.

Machine Learning Model

To generate final scores with the model ensemble, we planned to use AutoML. However, it was not trivial… but more about it will be described in the next blog posts.

The table below presents a general concept of the final model.

table10

Kaggle’s Machine Learning Models Summary

This is the end of the conceptualization, the data science project part I like the most. 

I hope you found the article interesting and after reading it you now have a different perspective on how you make decisions and how they can be quantified in a multidimensional space. I believe that it also gave you a sneak peek into a Data Scientist’s business problem solving process, where we need to match the worlds of business, psychology, emotions, marketing, math, statistics and machine learning. You definitely cannot say the job is not interesting :)

Stay tuned!

If you are interested more in technicals, we will be publishing more detailed blog posts soon!  In order not to miss the publication, subscribe to our newsletter. The articles will be describing selected Machine Learning solutions that we have used in solving the competition, e.g. How to create embeddings from images and text, What project management techniques are best for data science projects, How to generate thousands of features using feature tools?. And don’t worry if you think they can only be applied to apparel or e-commerce problems! They can be used to solve many more kinds of business problems than you think.

Interested in ML and MLOps solutions? How to improve ML processes and scale project deliverability? Watch our MLOps demo and sign up for a free consultation.

big data
analytics
machine learning
machine learning models
Machine Learning problems
Machine Learning Model
Recommendation Engine
21 June 2022

Want more? Check our articles

0LThQo4TotB93NHz6
Use-cases/Project

Streaming analytics better than classic batch — when and why?

While a lot of problems can be solved in batch, the stream processing approach can give you even more benefits. Today, we’ll discuss a real-world…

Read more
cloudwall getindata streaming
Success Stories

Streaming Analytics for the Digital Asset Risk Management System - Cloudwall Success Story

How to minimize data processing latency to hundreds of milliseconds when dealing with 100 mln messages per hour? How can data quality be secure and…

Read more
bloggcpobszar roboczy 1 4
Tutorial

Data isolation in tenant architecture on the Google Cloud Platform (GCP)

Multi-tenant architecture, also known as multi-tenancy, is a software architecture in which a single instance of software runs on a server and serves…

Read more
semi supervised learning real timeobszar roboczy 1 4
Tutorial

Semi-supervised learning on real-time data streams

Acquiring unlabeled data is inherent to many machine learning applications. There are cases when we do not know the result of the action provided by…

Read more
real time reporting cover getindata
Tutorial

Real-Time Customer-Facing Reporting - Why Showing Users Data Sooner Rather than Later is Better

In today's fast-paced business environment, companies are increasingly turning to real-time data to gain a competitive edge. One of the examples are…

Read more
propozycja2
Tutorial

Deploying efficient Kedro pipelines on GCP Composer / Airflow with node grouping & MLflow

Airflow is a commonly used orchestrator that helps you schedule, run and monitor all kinds of workflows. Thanks to Python, it offers lots of freedom…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.


What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy