Building An RFM Segmentation With Python & K Means Clustering

Zhijing Eu
Published in Dev Genius · Feb 15, 2024

This article provides an overview of Customer Segmentation and a Python code-along guide to calculating the Recency, Frequency & Monetary Value (RFM) metrics for a publicly available set of transactional purchase records and customer details from a Brazilian e-commerce site. The RFM data is then used to develop a rule-based customer segmentation scheme. Finally, the article explores a machine learning method called K Means Clustering to automate customer segmentation and see whether it provides any deeper insights compared to the rule-based approach.

The saying that ‘one size does not fit all’ is equally true for what customers look for in a product or a service.

Image Source : Dall-E3 Prompt “Cartoon Image Illustrating Why One Size Does Not Fit All”

Way back in 2021, I built a customer growth and churn analysis web app for a start-up consultancy as part of a freelance side project.

As a result, I did quite a lot of research into marketing and growth hacking analytics, which resulted in a planned series of articles on topics like lead generation, customer segmentation, customer lifetime value and churn analysis that I never got around to publishing. Over the recent holiday period (Chinese New Year 2024), I finally took the time to update and refresh the drafts, which I will be publishing as a mini series.

Table Of Contents

1. What Is Customer Segmentation?
2. What Is An RFM Analysis?
3. Introducing The Brazilian e-Commerce Dataset
4. Running The RFM Analysis In Python
5. Conclusion
6. Bonus Content — Generative AI For Automated Clustering

Follow along on the accompanying GitHub Repo OR the Kaggle Notebook

1. What Is Customer Segmentation?

Any organization looking to build strong customer relationships needs to understand its customers better in order to develop targeted engagement strategies, so that customers continue to buy higher-value items, in higher volumes, or at a higher frequency.

Customer Segmentation is the process of grouping an organization's customer base using characteristics such as demographic info (gender, age, income, etc.), insights gathered by directly engaging customers through surveys (or their product reviews), and also historical records of customer purchases, i.e. how they have actually behaved in the past.

2. What Is An RFM Analysis?

Image Source : What is RFM analysis? | RFMformarketing

RFM is a marketing analysis tool that can help segment customers based on their purchasing patterns or habits via three dimensions :

  • Recency — how long ago the customer last made a purchase
  • Frequency — how often customers make purchases since their first transaction
  • Monetary Value — how much money customers have spent on their purchases

As an aside, although we will focus on running the RFM analysis from a Sales & Marketing perspective, RFM can also be applied to different use cases such as Sports Audience Segmentation or Online Gaming Platform User Analysis, where instead of Monetary Value the measure is total time spent on the platform, Frequency is the number of logins, and Recency remains the same, i.e. last login time/date.

3. Introducing The Brazilian e-Commerce Dataset

We will be using a public e-commerce dataset provided by Olist, a Brazilian e-commerce company. It is a real-life (but anonymized) dataset covering over a hundred thousand customer orders from the period of Sep 2016 to Sep 2018, with information about customers, order quantities, product prices, payment modes and even product reviews by customers.

Normally, the first thing you should do as an analyst is to perform an Exploratory Data Analysis to get a feel for the data before diving in and trying out all the fancy machine learning techniques.

Luckily for us, this is a relatively well-analyzed dataset, so we can build on the work of others who have already gone through the trouble of doing that:

The two articles above are pretty comprehensive and go into some depth on the entire dataset, but here are a few observations that are relevant for understanding the customer base better (the short sketch after this list shows one way to reproduce these figures):

  • There were ~96k customers from 4,119 cities
  • Most customers are from Sao Paulo (~16% of total customers)
  • Out of the ~96k customers, only ~3k made repeat purchases
  • These ~3k repeat customers account for 6.5% of the total purchase value of R$20.3m (Brazilian reais)
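
If you want to sanity-check these figures yourself, a minimal sketch along the following lines should get you close, assuming the Olist CSV files (the same ones loaded in Section 4) are in your working directory:

import pandas as pd

customers = pd.read_csv("olist_customers_dataset.csv")
orders = pd.read_csv("olist_orders_dataset.csv")
orderpay = pd.read_csv("olist_order_payments_dataset.csv")

#Unique customers, cities and the share of customers in the top cities
print(customers['customer_unique_id'].nunique(), "unique customers")
print(customers['customer_city'].nunique(), "cities")
print(customers['customer_city'].value_counts(normalize=True).head(3))

#Repeat purchasers: unique customers with more than one order
cust_orders = orders.merge(customers[['customer_id', 'customer_unique_id']], on='customer_id')
order_counts = cust_orders.groupby('customer_unique_id')['order_id'].nunique()
repeat_ids = order_counts[order_counts > 1].index
print(len(repeat_ids), "repeat customers")

#Share of total payment value attributable to those repeat customers
pay = orderpay.merge(cust_orders[['order_id', 'customer_unique_id']], on='order_id')
total = pay['payment_value'].sum()
repeat_share = pay.loc[pay['customer_unique_id'].isin(repeat_ids), 'payment_value'].sum() / total
print(f"Repeat customers account for {repeat_share:.1%} of R${total:,.0f}")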

4. Running The RFM Analysis In Python

Image Source : Python (mrnet.tech)

4.1 Building The Basic RFM Table

We begin by importing the key libraries and creating the dataframes for the data required

import pandas as pd
import numpy as np

customers = pd.read_csv("olist_customers_dataset.csv")
orders = pd.read_csv("olist_orders_dataset.csv")
orderitems = pd.read_csv("olist_order_items_dataset.csv")
orderpay = pd.read_csv("olist_order_payments_dataset.csv")
#products is needed for the merge on product_id further below
products = pd.read_csv("olist_products_dataset.csv")

And then we’ll merge all these separate dataframes into a single flat table

data = customers.copy()
data = pd.merge(data, orders, on='customer_id')
data = pd.merge(data, orderitems, on='order_id')
data = pd.merge(data, orderpay, on='order_id')
data = pd.merge(data, products, on='product_id')

Looking at the raw unprocessed data yields the following.

Let’s do some very basic data prep to remove any rows where there are blanks or NaN values

data.dropna(inplace=True)
data.describe()

As we can see, this drops roughly 4k rows

#convert time to datetime format (let pandas infer it; the Olist timestamps are in YYYY-MM-DD HH:MM:SS form)
data['order_purchase_timestamp'] = pd.to_datetime(data['order_purchase_timestamp'])

min_date=min(data['order_purchase_timestamp'])
max_date = max(data['order_purchase_timestamp'])
display("Max Date: "+str(max_date))
display("Min Date: "+str(min_date))
display("Total Dur Days: " +str((max_date-min_date).days))
display("Total Dur Weeks: " +str(round((max_date-min_date).days/7,1)))
display("Total Dur Mths: " +str(round((max_date-min_date).days/30.417,1)))

Based on this, there were 113,391 transactions made by 92,082 unique customers over the period of Oct 2016 to Aug 2018.

Next we group the data to understand the Recency of the purchases by each unique customer (i.e. what was the most recent date they placed an order, based on the 'order_purchase_timestamp' date, measured against the final date in the dataset, i.e. 29 Aug 2018).

#group items by unique customers and find the latest (most recent) order date
recency_df = data.groupby(by = 'customer_unique_id', as_index = False)['order_purchase_timestamp'].max()

recency_df["recency"]=max_date-recency_df['order_purchase_timestamp']

#rounding so it is to the nearest number of days
recency_df["recency"]=recency_df["recency"].dt.round("d")

recency_df.rename(columns = {'order_purchase_timestamp':'Most Recent Order Date', "recency":"Recency In Days"}, inplace = True)

#Converting this into a float figure or else the Histogram will not work with time delta data types
recency_df['Recency In Days'] = recency_df['Recency In Days'] / pd.to_timedelta(1, unit='D')

recency_df['Recency In Days'].hist()
recency_df['Recency In Days'].describe()

Next we do the same for Frequency

frequency_df = data.groupby(["customer_unique_id"]).agg({"order_id": "nunique"}).reset_index()
frequency_df.sort_values('order_id', ascending=[False])
frequency_df.rename(columns = {'order_id':'Frequency'}, inplace = True)
frequency_df['Frequency'].hist()
frequency_df['Frequency'].describe()

From the graph above it is clear that the majority of customers only bought once. Taking a closer look by filtering out the frequency = 1 customers yields the below view

frequency_df['Frequency'].value_counts()

Finally, we calculate the total value of all purchases made by each unique customer

monetary_df = data.groupby('customer_unique_id', as_index = False)['payment_value'].sum()
monetary_df.rename(columns = {'payment_value':'Monetary Value'}, inplace = True)
monetary_df.describe()
monetary_df.hist()

Clearly this also has a very skewed distribution, so let's dig a bit deeper into the spread of the Monetary Value metric. First, there is a bizarre quirk where a single "super customer" spent R$109k worth of purchases. If we look at the 99th percentile instead, we can see that it only goes up to roughly R$1.7k, so let's restrict the histogram to the 1st–99th percentile range to get another look at the shape of the distribution.
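
A minimal sketch of that percentile check and the truncated histogram (using the monetary_df created above) might look like this:

#Inspect the tail, then plot only the 1st-99th percentile range
low, high = monetary_df['Monetary Value'].quantile([0.01, 0.99])
print(f"1st percentile: {low:.2f}  99th percentile: {high:.2f}")

trimmed = monetary_df[monetary_df['Monetary Value'].between(low, high)]
trimmed['Monetary Value'].hist(bins=50)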

Now that we have all 3 elements R, F and M, we can merge the dataframes together like so:

RFM = frequency_df.copy()
RFM = pd.merge(RFM, recency_df, on='customer_unique_id')
RFM = pd.merge(RFM, monetary_df, on='customer_unique_id')

#Rearranging the columns for readability
RFM=RFM[['customer_unique_id','Recency In Days','Most Recent Order Date','Frequency',"Monetary Value"]]

print(RFM.describe())
RFM.sort_values('Frequency',ascending=[False])
RFM

4.2 Adding A Few Additional Features To The RFM Data

Image Source : Dreamstime.com

Now, the challenge with the RFM data alone is that it is not particularly intuitive to reason through, so let's add a few other measures:

  • Customer "Age In Days" based on the date of their first order
  • Average Time Between Orders In Days based on the "Age In Days" and Frequency
  • Average Monetary Value Per Order based on the (total) Monetary Value and Frequency

First, the customer "Age In Days":

earliest_order_date=pd.DataFrame(data.groupby(['customer_unique_id'])['order_purchase_timestamp'].min())
earliest_order_date.rename(columns = {'order_purchase_timestamp':'Earliest Order Date'}, inplace = True)
earliest_order_date["Age"]=(max_date-earliest_order_date["Earliest Order Date"]).dt.round("d")
earliest_order_date['Age In Days'] = earliest_order_date['Age'] / pd.to_timedelta(1, unit='D')
earliest_order_date.drop(['Age'], axis=1, inplace=True)
print(earliest_order_date.describe())
earliest_order_date["Age In Days"].hist()

Then we merge this back into the RFM table and compute the two derived averages:
#Merge the "Age" data with the RFM data
RFM = pd.merge(RFM, earliest_order_date, on='customer_unique_id')

#Calculate the Average Time Between Orders and Average Monetary Value Per Order
RFM["Average Time Between Orders In Days"]=RFM['Age In Days']/RFM['Frequency']

#Calculate the Average Monetary Value Per Order
RFM["Average Monetary Value Per Order"]=RFM['Monetary Value']/RFM['Frequency']

RFM.sort_values('Frequency',ascending=[False])
RFM + Age + Ave Time Between Orders + Ave Monetary Value Per Order

4.3 Segmenting The Customer Base Thru A Rule Based Approach

Now that we have some RFM data, the trick is to develop a way to group and segment the various customers. Given that the spread of the RFM metrics (especially Frequency and Monetary Value) has a very long tail, i.e. they are not normally distributed, we'll use a simple 1–4 rating based on the percentile of the value, where a score of 4 is the "best" (i.e. lowest Recency, highest Frequency, and highest Monetary Value).

RFM['R Quartile Score'] = pd.qcut(RFM['Recency In Days'], 4, labels=False)

#Resetting the Quartiles so that customers with lower recency are Top Quartile i.e Score of 4
RFM['R Quartile Score'] = RFM['R Quartile Score'].replace([0,1,2,3], [4,3,2,1])

#Because there are so many Freq = 1 customers it causes a Bin Edges Unique ValueError. So use ranking first before qcut
RFM['F Quartile Score'] = pd.qcut(RFM['Frequency'].rank(method='first'), 4, labels=False)

#Resetting the Quartiles so that customers with higher frequency are in the Top Quartile i.e Score of 4
RFM['F Quartile Score'] = RFM['F Quartile Score'].replace([0,1,2,3], [1,2,3,4])

RFM['M Quartile Score'] = pd.qcut(RFM['Monetary Value'], 4, labels=False)

#Resetting the Quartiles so that customers with higher Monetary Value are in the Top Quartile i.e Score of 4
RFM['M Quartile Score'] = RFM['M Quartile Score'].replace([0,1,2,3], [1,2,3,4])

RFM['Age Quartile Score'] = pd.qcut(RFM['Age In Days'], 4, labels=False)

#Resetting the Quartiles so that customers with higher Age are in the Top Quartile i.e Score of 4
RFM['Age Quartile Score'] = RFM['Age Quartile Score'].replace([0,1,2,3], [1,2,3,4])
RFM + Age + Ave Time Between Orders + Ave Monetary Value Per Order + Quartile Scores

Now the next step is a bit of an “art” as it requires domain knowledge to frame a few personas to segment the customer base. For the purpose of illustration only let’s assume the following segments

  • Best Customers - customers who buy often at high value and have bought recently (R Score, F Score, M Score = 4)
  • Promising New Customers - customers who only just started buying but have spent large amounts so far (R Score = 4, M Score = 4, Age Score = 1)
  • Loyal Return Customers - customers who buy regularly even if the value isn't very large (R Score = 1–3, F Score = 4, M Score = 1–3)
  • At Risk Customers - customers who have not bought anything recently and are at risk of leaving the platform, whom we would like to retain as they were large spenders (R Score = 1, M Score = 3–4, Age Score = 3–4)
  • Hibernating/Lost Customers - customers who have not bought anything for a long time and have likely already left the platform (R Score = 1–2, F Score = 1–3, M Score = 1–4, Age Score = 3–4)
  • Etc.

Labelling the data frame with this information is relatively straightforward as per below

def create_new_column(row):
    #Note: the quartile scores follow the persona definitions above (4 = most recent / highest value)
    if row['R Quartile Score'] == 4 and row['F Quartile Score'] == 4 and row['M Quartile Score'] == 4:
        return "Best Customer"
    elif row['R Quartile Score'] < 4 and row['F Quartile Score'] == 4 and row['M Quartile Score'] < 4:
        return "Loyal Return Customers"
    elif row['R Quartile Score'] == 4 and row['M Quartile Score'] == 4 and row['Age Quartile Score'] == 1:
        return "Promising New Customers"
    elif row['R Quartile Score'] == 1 and row['M Quartile Score'] > 2 and (row['Age Quartile Score'] == 3 or row['Age Quartile Score'] == 4):
        return "At Risk Customers"
    elif row['R Quartile Score'] < 3 and row['M Quartile Score'] > 2 and (row['Age Quartile Score'] == 3 or row['Age Quartile Score'] == 4):
        return "Lost Customers"
    else:
        return "Other"

RFM['Manual Customer Segmentation'] = RFM.apply(lambda x: create_new_column(x), axis=1)
With Customer Segmentation

(Edit — Those of you with very sharp eyes will notice something odd happening where a Frequency of 1 can still be rated as 4 for the F Quartile Score. The reason is that the Frequency figures themselves are highly skewed, in that 97% of the 92k customers ONLY BOUGHT ONCE, i.e. Frequency = 1. There is no easy fix for this other than hard-coding the logic for the segmentation.)

And thereby creating a few customer segments as per below
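
If you are following along in your own notebook, a quick way to see the size of each resulting segment is:

#Number of customers in each manually defined segment
RFM['Manual Customer Segmentation'].value_counts()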

Getting this just right is tricky as we need to ensure the “right” granularity in segments to allow for targeted strategies such as providing targeted offers for loyal return customers so they buy higher value items or offering incentives for promising new customers to keep buying more frequently.

4.4 Segmenting The Customer Base Through K-Means Clustering

Image Source : Robot arm color detection and sorting — YouTube

Now that we've done the segmentation manually, let's see if machine learning can help us do this any better via a method called K Means Clustering.

K-Means Clustering is an Unsupervised Machine Learning algorithm which groups an unlabelled dataset into different clusters based on how similar or different the observations are to one another, using continuous numerical values (such as RFM).

As an intuition for how the algorithm works, imagine a simple scatter plot with a bunch of points. Each cluster of points would have a “mean position” called a centroid that is the combined average of all the points that form these clusters.

The K Means algorithm tries to find the "best" centroids for a user-specified number of clusters by making an initial set of random guesses on the location of these centroids and then classifying each point by assigning it to its closest centroid. The algorithm then iteratively adjusts the position of these centroids to minimize the distance between each point and its assigned centroid for each cluster.
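
To make that intuition concrete, here is a deliberately naive NumPy sketch of those two alternating steps (illustration only; everything here is made up for the example, and scikit-learn's KMeans is what we actually use below):

import numpy as np

def naive_kmeans(points, k, n_iter=20, seed=42):
    rng = np.random.default_rng(seed)
    #Step 0: pick k of the points as the initial centroid guesses
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        #Step 1: assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        #Step 2: move each centroid to the mean of the points assigned to it
        #(a cluster left with no points is not handled in this toy version)
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

#Tiny usage example on random 2-D points
pts = np.random.default_rng(0).normal(size=(200, 2))
labels, centroids = naive_kmeans(pts, k=3)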

4.4.1 Data Pre-Processing — Scaling The RFM Features For K Means

In any machine learning project, after data exploration and data cleaning, the next critical step is to perform some form of feature scaling.

Scaling helps to ensure fair comparisons and avoid the dominance of certain features as datasets often have features with different (absolute) sizes , min-max ranges and measurement units.

Normally the recommended scaling method is the StandardScaler in the scikit-learn library, which standardizes a feature by subtracting the mean and then scaling to unit variance.

Scaling is performed thru the following code

from sklearn.preprocessing import StandardScaler

# Initialise the Scaler
scaler = StandardScaler()
# To scale data

RFM_Scaled = scaler.fit_transform(RFM[["Recency In Days","Frequency","Monetary Value"]])

(Note — For this article, we will use the StandardScaler even though we are aware that the Monetary and Frequency variables are quite skewed. There are other options such as the Min-Max Scaler, which preserves the shape of the original distribution, but this is left for any of you readers to post in the comments if you manage to get better results via other scaling methods.)
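
As a starting point for that experiment, one common way to tame skewed, strictly positive features is a log transform before scaling; a minimal sketch (not what the rest of this article uses) could be:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

#log1p squashes the long right tails of Frequency and Monetary Value before scaling
rfm_log = np.log1p(RFM[["Recency In Days", "Frequency", "Monetary Value"]])
RFM_Scaled_alt = MinMaxScaler().fit_transform(rfm_log)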

4.4.2 Selecting The Optimal Number Of Clusters To Apply

The next question is: how many segments are optimal for differentiating the characteristics within the customer base? The simplest approach is something called the Elbow Method, which calculates the total within-cluster distance of points to their centroid (i.e. the centre point of a cluster) and presents it in a plot, where the "bend" point of the elbow indicates the "best" number of clusters.

This can be implemented in Python as follows:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.style as style

range_n_clusters = [1, 2, 3, 4, 5, 6, 7]
avg_distance=[]
for n in range_n_clusters:
    model = KMeans(n_clusters=n, n_init=10, random_state=42).fit(RFM_Scaled)
    #inertia_ is the sum of squared distances of the points to their closest centroid
    avg_distance.append(model.inertia_)

plt.plot(range_n_clusters, avg_distance)
plt.xlabel("No of Clusters")
plt.ylabel("Distance")
plt.show()

The difficulty with the elbow method is that there is some subjectivity in the interpretation, but visually, from the above graph, the sharp change in gradient happens at N=3. There are more advanced ways to find the optimal number of clusters, such as the Silhouette method, which is outside the scope of this article (ref link below), but for the sake of illustration, let's take 3 clusters as the "best" number of segments.
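
For readers who do want to try the Silhouette method mentioned above, scikit-learn exposes it directly; a rough sketch (not used further here) would be:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#silhouette_score needs at least 2 clusters; sample_size keeps the pairwise
#distance computation manageable on ~90k rows
for n in range(2, 8):
    labels = KMeans(n_clusters=n, n_init=10, random_state=42).fit_predict(RFM_Scaled)
    score = silhouette_score(RFM_Scaled, labels, sample_size=10000, random_state=42)
    print(n, round(score, 3))

Coming back to the 3-cluster choice suggested by the elbow plot: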

model = KMeans(n_clusters=3,n_init=10, random_state=42).fit(RFM_Scaled)
model.labels_
RFM["K Means Automated Customer Segmentation"]=model.labels_
RFM["K Means Automated Customer Segmentation"].value_counts()
Value Counts For Each Of The 3 Segments

(Edit — However, as a counterpoint, there is also this really interesting paper that challenges the use of methods like Elbow or Silhouette entirely: "Data scientists must be made wary of drawing conclusions from clustering results and must not rely solely on evaluation measures telling them what is best. Reviewers of scientific literature should probably even reject conclusions drawn from choosing the optimal k using such an unreliable method. In the long run, we must accept that there is no optimal solution in cluster analysis, but it is an explorative approach that may yield multiple interesting solutions, and interestingness necessarily is a subjective decision of the user.")

4.5 Comparing K-Means Customer Segments vs Manual Rule Based Customer Segments

Image Source : Humans Vs. Machines chllocalization.com

Reviewing some of the characteristics of these segments appears to show commonality with the tagging shown earlier….

This Segment 0 Appears To Be Similar (ish) To The Promising New Customers In The Manual Segmentation
This Segment 1 Appears To Be Similar (ish) To Lost Customers In The Manual Segmentation
This Segment 2 Appears To Be Similar (ish) To At Risk Of Leaving Customers In The Manual Segmentation
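
The tables behind those captions are essentially per-segment summary statistics; a quick way to produce a similar profile for each K Means cluster is:

#Cluster sizes plus the mean/median R, F and M per K Means cluster
print(RFM["K Means Automated Customer Segmentation"].value_counts())
print(RFM.groupby("K Means Automated Customer Segmentation")[
    ["Recency In Days", "Frequency", "Monetary Value"]
].agg(["mean", "median"]))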

It is also possible to visualise the data via simple plots, such as with the following code:

XVariable="Frequency"
YVariable="Monetary Value"
ZVariable="Recency In Days"
x = RFM[XVariable]
y = RFM[YVariable]
z = RFM[ZVariable]

import numpy as np
from matplotlib import pyplot as plt


fig, ax = plt.subplots(figsize=(5, 5))
scatter = ax.scatter(x, y, c=RFM["K Means Automated Customer Segmentation"], s=z)

# produce a legend with the unique colors from the scatter
legend1 = ax.legend(*scatter.legend_elements(),
loc="lower right", title="Segments")
ax.add_artist(legend1)

# produce a legend with a cross-section of sizes from the scatter
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = ax.legend(handles, labels, loc="upper right", title=ZVariable)

# Y Axis truncated for better readability...
ax.set_ylim([0, 20000])

plt.xlabel(XVariable)
plt.ylabel(YVariable)
RFM visualized With x=F , y=M, z=R

It is a bit difficult to see what exactly is going on here visually so let’s rearrange the variables.

RFM visualised With x=R, y=M, z=F

From the above plot it appears that there are two segments determined primarily by Recency with a third segment characterised by the higher Monetary Value / higher Frequency customers.

Clearly even with the automated K-Means segmentation, there is still a fair amount of human judgment and interpretation required to make sense of the resultant customer segments.

And there is also the point that even though the elbow method indicates that 3 segments is “optimal” , it may not be sufficiently granular to allow for a targeted marketing or engagement strategy.

A way around this problem could be to run all the steps above but on a subset of the data, effectively filtering for each of the 3 segments to derive further "sub-segments" (this may be more effective than trying to re-run the K Means with a larger number of clusters).
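
As a rough sketch of that idea (assuming the K Means labels from above are already in the RFM dataframe), sub-segmenting one cluster could look something like this:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

#Take one of the three clusters and re-cluster it on its own
subset = RFM[RFM["K Means Automated Customer Segmentation"] == 0].copy()

#Re-scale within the subset so the sub-clusters are not dominated by the parent cluster's spread
subset_scaled = StandardScaler().fit_transform(subset[["Recency In Days", "Frequency", "Monetary Value"]])

subset["Sub Segment"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(subset_scaled)
print(subset["Sub Segment"].value_counts())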

Other refinements could be to combine info about these K means assigned clusters with data on the customers such as identifying the best-selling products and if they differ across the clusters.

On that note, it is worth highlighting that even though K Means is easy to understand and apply, it has its limitations: it only works with continuous data and it assumes clusters that are roughly uniform and spherically shaped.

There are other clustering techniques, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or GMM (Gaussian Mixture Models), that do not have the spherically shaped cluster limitation. Unfortunately, applying these to categorical data is also not straightforward.
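
For completeness, here is a minimal sketch of how DBSCAN could be tried on the same scaled RFM features (the parameter values are illustrative guesses and would need tuning):

from sklearn.cluster import DBSCAN

#eps / min_samples are placeholder values; the label -1 marks points DBSCAN treats as noise
db_labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(RFM_Scaled)
print(pd.Series(db_labels).value_counts())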

However, one approach that is conceptually similar to K Means and that can be used on a mix of BOTH categorical and continuous variables is the K-Prototype Clustering algorithm.

We won’t get into the details here but the accompanying Jupyter Notebook includes a demonstration of this mixed method by extending the RFM data to include info on the customer’s state of residence (Spoiler alert — it runs but unfortunately doesn’t add much more to the analysis as the results are still dominated by the RFM data rather than the customer state data)
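
The most common Python implementation of K-Prototype is the kmodes package; a stripped-down sketch of the idea (the accompanying notebook may differ in detail, and RFM_state here is a hypothetical dataframe where the customer_state column has already been merged in) looks roughly like this:

from kmodes.kprototypes import KPrototypes

#Mixed feature matrix: three continuous RFM columns plus one categorical column (customer state);
#in practice the continuous columns should be scaled first, as in section 4.4.1
X = RFM_state[["Recency In Days", "Frequency", "Monetary Value", "customer_state"]].to_numpy()

kproto = KPrototypes(n_clusters=3, init="Cao", n_init=5, random_state=42)
#'categorical' lists the positional indices of the categorical columns (here, column 3)
clusters = kproto.fit_predict(X, categorical=[3])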

5. Conclusion

This article explained the concept of customer segmentation and focused on how to implement Recency, Frequency & Monetary Value analysis which is a particular type of marketing analysis.

This RFM data was derived from a publicly available set of transactional purchase records and customer details from a Brazilian e-commerce site and used to develop a (manual) rule based customer segmentation scheme which was then compared against a machine learning method called K Means Clustering.

While the analysis outlined in this article has been more for illustrative purposes rather than an actual comprehensive analysis, a few key points stand out:

  • Exploratory Data Analysis and Data Pre-Processing are CRITICAL: getting a "feel" for the data, eliminating poor quality rows and scaling the data before feeding it into algorithms (Garbage In, Garbage Out)
  • K Means does simplify the process of generating customer segments but still requires a fair amount of interpretation to be useful.
  • Extending the approach with K Prototype which blends continuous data from RFM with demographic data may help improve the results.
  • However, determining the “optimal” no of clusters based on the Elbow method for both these “K-family” of methods may still fall short of the required granularity for the results to be practically applied.
  • Manual “rule-based” segmentation may be slightly more tedious but has the advantage of being easier to understand, reason through and ultimately communicate back to marketers.
  • At the end though, running an RFM analysis alone does not tell the full customer story and needs to be combined with domain knowledge and market context to be effectively applied back to business decisions

Coming Soon….

Now that we’ve learnt how to segment our customers, a natural next step is understanding how much value each customer can potentially generate so that we can balance the effort against the costs it takes to acquire new customers or grow/retain existing customers.

Therefore, in the next article we will build on the RFM analysis developed here to analyze the Customer Lifetime Value (CLV) for the customer base.


6. Bonus Content — Generative AI For Automated Clustering

For those of you who are interested, yes there is a way that you can get Generative AI to do this sort of analysis in an automated manner.

This article and video are a pretty nifty example of how ChatGPT, combined with a plug-in called Noteable, can allow a user to directly run code within the ChatGPT app itself rather than switching between asking ChatGPT to create the code and then having to cut-paste it into a separate Python notebook.

https://www.toolify.ai/ai-news/master-customer-segmentation-and-targeting-with-chatgpt-plugin-920453

Important Edit — Just realized that as of Dec 2023, the ChatGPT plug-in Noteable SHUT DOWN FOR GOOD :( Noteable.io is shutting down | Hacker News (ycombinator.com) — There are other similar tools but it is still a bit of a shame.

However if you are a bit more advanced (which I assume you may be since you scrolled all the way to the end…) there are other options:-

You could run the analysis using other tools like LangChain and the OpenAI API; the linked article is not just about K Means but actually uses ChatGPT's natural language capability to assign titles (ontologies) to the segment labels and keywords associated with each label (which addresses one of the challenges with using K Means as-is, in that the "meaning" behind each cluster still requires a human to interpret).

Or, if you prefer, OpenAI themselves offer similar "run the code" functionality within the same ChatGPT window: there is an Advanced Analytics beta feature that came out in Q4 2023 for ChatGPT premium subscribers, and the tutorial below is a walkthrough of how to use this feature to run a simple exploratory data analysis on the World Bank's carbon emissions dataset.

But wait, you might say: so why would I need to bother learning to do all this Machine Learning stuff if Generative AI can just do it for me? Well, the MIT Sloan article puts it best: "It is important to continuously check the accuracy of the output you receive. While advanced, this technology does still make errors." Therefore, for the moment, AI still needs a human who approximately understands the principles behind what it is doing, and a good piece of advice from the same article is that "Ask[ing] for explanations or descriptions from the software while working is a great way to force it to check its own work".

Judgment Day Averted — Yes GenAI Is Great And All But Not Today, Robots
