
Chapter 12. Clustering, Segmentation and Recommendation

Clustering is one of the most powerful unsupervised learning techniques in business analytics. Unlike supervised learning, where we predict known outcomes, clustering discovers hidden patterns and natural groupings in data without predefined labels. In business, clustering enables customer segmentation, product categorization, market analysis, and anomaly detection—all critical for strategic decision-making. This chapter explores the concepts, algorithms, and practical implementation of clustering, with a focus on translating clusters into actionable business strategies.

12.1 Unsupervised Learning in Business Analytics

Unsupervised learning seeks to uncover structure in data without explicit guidance about what to find. Unlike supervised learning, there is no "correct answer" to learn from—the algorithm must discover patterns on its own.

Why Unsupervised Learning Matters in Business:

Common Business Applications:

The Challenge:

Without labels, evaluating unsupervised learning is partly subjective. Success depends on whether the discovered patterns are interpretable, stable, and actionable from a business perspective.

12.2 Customer and Product Segmentation

Segmentation divides a heterogeneous population into homogeneous subgroups, enabling tailored strategies for each segment.

Customer Segmentation

Goal:  Group customers with similar characteristics or behaviors to personalize marketing, pricing, and service.

Common Segmentation Bases:

Business Value:

Example:

An online retailer segments customers into:

  1. Bargain Hunters:  Price-sensitive, frequent coupon users.
  2. Loyal Enthusiasts:  High lifetime value, brand advocates.
  3. Occasional Shoppers:  Infrequent purchases, need engagement.
  4. New Explorers:  Recent sign-ups, still evaluating the brand.

Each segment receives customized email campaigns, promotions, and product recommendations.

Product Segmentation

Goal:  Group products with similar attributes, sales patterns, or customer appeal.

Applications:

12.3 Clustering Algorithms

Clustering algorithms vary in their approach, assumptions, and suitability for different data types and business contexts.

12.3.1 k-Means Clustering

Overview:

k-Means is one of the most widely used clustering algorithms because it is simple, fast, and often effective. It partitions data into k distinct, non-overlapping clusters by minimizing the within-cluster sum of squared distances between each point and its cluster centroid.

How k-Means Works:

  1. Initialize:  Randomly select k data points as initial cluster centroids.
  2. Assign:  Assign each data point to the nearest centroid (using Euclidean distance).
  3. Update:  Recalculate centroids as the mean of all points in each cluster.
  4. Repeat:  Iterate steps 2-3 until centroids stabilize or a maximum number of iterations is reached.
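The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans_sketch`, the toy data, and the choice to leave an empty cluster's centroid unchanged are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means following the Initialize / Assign / Update / Repeat steps."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumption: an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares for the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    wcss = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, wcss

# Toy data (made-up values): two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print("Labels:", labels)
print("WCSS:", round(wcss, 3))
```

The cluster numbers themselves are arbitrary; only the grouping matters. Library implementations such as scikit-learn's `KMeans` add smarter initialization and multiple restarts on top of this basic loop.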

Mathematical Objective:

Minimize the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

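Where k is the number of clusters, C_i is the set of points assigned to cluster i, and μ_i is the centroid (mean) of cluster i.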

Advantages:

Disadvantages:

When to Use k-Means:

12.3.2 Hierarchical Clustering

Hierarchical clustering builds a tree-like structure (dendrogram) of nested clusters, allowing exploration of data at different levels of granularity.

Two Approaches:

  1. Agglomerative (Bottom-Up):  Start with each data point as its own cluster, then iteratively merge the closest clusters until only one remains.
  2. Divisive (Top-Down):  Start with all data in one cluster, then recursively split into smaller clusters.

Linkage Methods:

The "distance" between clusters can be defined in several ways:

Advantages:

Disadvantages:

When to Use Hierarchical Clustering:

Dendrogram Interpretation:

A dendrogram shows how clusters merge at different distances. Cutting the dendrogram at a certain height determines the number of clusters.
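To make the idea of "cutting at a height" concrete, here is a deliberately naive sketch of agglomerative clustering with average linkage. It stops merging as soon as the closest pair of clusters is farther apart than the chosen height. The function name `agglomerative_cut` and the toy points are assumptions for illustration; in practice a library routine (for example SciPy's `scipy.cluster.hierarchy.linkage` and `fcluster`) would normally be used instead of this quadratic-loop version.

```python
import numpy as np

def agglomerative_cut(X, cut_height):
    """Naive average-linkage clustering, stopped at a given merge distance."""
    clusters = [[i] for i in range(len(X))]   # start: every point is its own cluster
    point_dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merge_heights = []
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members
                d = point_dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cut_height:
            break                              # equivalent to cutting the dendrogram here
        merge_heights.append(float(d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                        # b > a, so index a stays valid
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels, merge_heights

# Toy points (made-up values): two tight pairs and one isolated point
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
labels, heights = agglomerative_cut(X, cut_height=2.0)
print("Labels:", labels)
print("Merge heights:", heights)
```

Raising `cut_height` allows more merges and therefore yields fewer, larger clusters; lowering it yields more, smaller ones.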

12.4 Choosing the Number of Clusters

Determining the optimal number of clusters (k) is one of the most challenging aspects of clustering. Several methods can guide this decision:

1. Elbow Method

Plot the within-cluster sum of squares (WCSS) against the number of clusters. Look for an "elbow" where the rate of decrease sharply changes.

Interpretation:

Limitation:  The elbow is not always clear or may be subjective.

2. Silhouette Score

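Measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1: values near 1 indicate a point that fits its cluster well, values near 0 indicate a point on the boundary between clusters, and negative values indicate a point that may have been assigned to the wrong cluster.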

Average Silhouette Score:  Higher is better. Compare scores across different values of k.
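The silhouette value of a single point can be written as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it from scratch on made-up data; the function name `silhouette_values` is an assumption for this example, and it assumes at least two clusters. In the workflow later in this chapter the same quantity comes from scikit-learn's `silhouette_score`.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s = (b - a) / max(a, b); assumes >= 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():
            continue                          # singleton cluster: score 0 by convention
        a = dist[i, same].mean()              # cohesion within own cluster
        b = min(dist[i, labels == other].mean()   # separation from nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (made-up values): two groups four units apart
X = [[0, 0], [0, 1], [4, 0], [4, 1]]
good = silhouette_values(X, [0, 0, 1, 1])     # grouping that matches the geometry
bad = silhouette_values(X, [0, 1, 0, 1])      # grouping that cuts across it
print("Sensible grouping:", good.mean().round(3))
print("Poor grouping:", bad.mean().round(3))
```

The sensible grouping scores well above zero, while the poor grouping scores below zero, which is the pattern to look for when comparing candidate values of k.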

3. Gap Statistic

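Compares the WCSS of your data to the WCSS expected from reference data with no cluster structure (for example, points drawn uniformly over the same range), usually on a log scale. A larger gap at a given k suggests that the clustering captures real structure rather than noise.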

4. Business Judgment

Ultimately, the number of clusters should be actionable and interpretable. Too few clusters may oversimplify; too many may be impractical to manage.

Questions to Ask:


12.5 Evaluating and Interpreting Clusters

Once clusters are formed, the real work begins: understanding what each cluster represents and how to act on it.

Quantitative Evaluation

Within-Cluster Sum of Squares (WCSS):

Lower WCSS indicates tighter, more cohesive clusters.

Silhouette Score:

Measures cluster separation and cohesion. Higher scores indicate better-defined clusters.

Davies-Bouldin Index:

Average, across clusters, of each cluster's worst-case ratio of within-cluster scatter to the distance between cluster centroids. Lower is better.

Calinski-Harabasz Index:

Ratio of between-cluster variance to within-cluster variance. Higher is better.
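As a rough illustration of how the last two indices behave, the sketch below computes both from scratch for a sensible and a poor grouping of the same four points. The function name `ch_and_db_scores` and the toy data are assumptions for this example; the implementation section later in the chapter uses scikit-learn's `calinski_harabasz_score` and `davies_bouldin_score` instead.

```python
import numpy as np

def ch_and_db_scores(X, labels):
    """Return (Calinski-Harabasz, Davies-Bouldin) for a labelled dataset."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    n, k = len(X), len(ids)
    overall = X.mean(axis=0)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    sizes = np.array([(labels == c).sum() for c in ids])

    # Calinski-Harabasz: between-cluster variance over within-cluster variance
    between = (sizes * ((centroids - overall) ** 2).sum(axis=1)).sum()
    within = sum(((X[labels == c] - centroids[j]) ** 2).sum()
                 for j, c in enumerate(ids))
    ch = (between / (k - 1)) / (within / (n - k))

    # Davies-Bouldin: average over clusters of the worst scatter-to-separation ratio
    scatter = np.array([np.linalg.norm(X[labels == c] - centroids[j], axis=1).mean()
                        for j, c in enumerate(ids)])
    worst = []
    for i in range(k):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        worst.append(max(ratios))
    return float(ch), float(np.mean(worst))

# Toy data (made-up values): two groups four units apart
X = [[0, 0], [0, 1], [4, 0], [4, 1]]
ch_good, db_good = ch_and_db_scores(X, [0, 0, 1, 1])
ch_bad, db_bad = ch_and_db_scores(X, [0, 1, 0, 1])
print(f"Sensible grouping: CH={ch_good:.3f}, DB={db_good:.3f}")
print(f"Poor grouping:     CH={ch_bad:.3f}, DB={db_bad:.3f}")
```

The sensible grouping should show the higher Calinski-Harabasz value and the lower Davies-Bouldin value, matching the "higher is better" and "lower is better" rules above.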

Qualitative Interpretation

Cluster Profiling:

Examine the characteristics of each cluster by computing summary statistics (mean, median, mode) for each feature.

Example:

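| Cluster | Avg Age | Avg Income | Avg Purchase Frequency | Avg Spend |
|---------|---------|------------|------------------------|-----------|
| 1       | 28      | $45K       | 2.1/month              | $120      |
| 2       | 52      | $95K       | 5.3/month              | $450      |
| 3       | 35      | $62K       | 0.8/month              | $80       |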

Naming Clusters:

Assign meaningful names based on defining characteristics:

Visualization:

Stability and Validation

Stability Testing:

Run clustering multiple times with different initializations or subsets of data. Stable clusters should remain consistent.

Cross-Validation:

Split data, cluster each subset, and compare results. High agreement suggests robust clusters.
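One simple way to put a number on "agreement" between two clustering runs is the Rand index: the share of point pairs that both runs treat the same way (together in both, or apart in both). The sketch below is a minimal pure-Python version with made-up label vectors; the function name `rand_index` is an assumption for this example. scikit-learn also offers `adjusted_rand_score`, which additionally corrects for agreement expected by chance.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster labels from three runs on the same six customers
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]   # same grouping, only the label numbers differ
run_3 = [0, 1, 1, 2, 2, 0]   # a genuinely different grouping

same_grouping = rand_index(run_1, run_2)
different_grouping = rand_index(run_1, run_3)
print("Run 1 vs run 2:", same_grouping)
print("Run 1 vs run 3:", different_grouping)
```

Because the comparison is made on pairs of points, it is unaffected by the arbitrary numbering of clusters, which usually changes from run to run.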

12.6 Implementing Clustering in Python

Let's walk through a complete clustering workflow in Python, including critical preprocessing steps.

Step 1: Load and Explore Data

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.decomposition import PCA

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Load customer data

df = pd.read_csv('customer_data.csv')

# Display first few rows

print(df.head())

print(df.info())

print(df.describe())

# Check for missing values

print(df.isnull().sum())

Step 2: Handle Missing Values

# Options 1 and 2 are alternatives: apply one or the other, not both

# Option 1: Drop rows with missing values (if few)

df = df.dropna()

# Option 2: Impute missing values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')  # or 'mean', 'most_frequent'

df[['Age', 'Income']] = imputer.fit_transform(df[['Age', 'Income']])

Step 3: Handle Categorical Variables

# Identify categorical columns

categorical_cols = df.select_dtypes(include=['object']).columns

print("Categorical columns:", categorical_cols)

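# Option 1: Label Encoding (for ordinal variables)

# Caution: LabelEncoder numbers categories in sorted (alphabetical) order, which may not match

# their real rank; for a true ordinal scale, use an explicit mapping or OrdinalEncoder(categories=[...])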

le = LabelEncoder()

df['Education_Level'] = le.fit_transform(df['Education_Level'])

# Option 2: One-Hot Encoding (for nominal variables)

df = pd.get_dummies(df, columns=['Region', 'Membership_Type'], drop_first=True)

print(df.head())

Step 4: Feature Selection

# Select relevant features for clustering

# Exclude identifiers and target variables if present

features = ['Age', 'Income', 'Purchase_Frequency', 'Avg_Transaction_Value',

            'Days_Since_Last_Purchase', 'Total_Spend']

X = df[features]

print(X.head())

Step 5: Standardization

# Standardize features to have mean=0 and std=1

# This is crucial because k-Means uses distance metrics

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for easier interpretation

X_scaled_df = pd.DataFrame(X_scaled, columns=features)

print(X_scaled_df.describe())

Why Standardization Matters: k-Means uses Euclidean distance, which is sensitive to feature scales. Without standardization, features with larger ranges (e.g., Income: $20K-$200K) will dominate features with smaller ranges (e.g., Purchase Frequency: 1-10), leading to biased clusters.
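A small numeric example may make this effect easier to see. The four customers below are made up for illustration, and the helper `nearest_to_first` is a hypothetical name. The standardization is done by hand with the same formula `StandardScaler` uses (subtract the mean, divide by the population standard deviation).

```python
import numpy as np

# Hypothetical customers: [income in dollars, purchases per month]
X = np.array([[50_000.0, 2.0],
              [50_500.0, 9.0],
              [52_000.0, 3.0],
              [90_000.0, 5.0]])

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, plus all distances to row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(d.argmin()) + 1, d

# Raw scale: income differences in dollars swamp purchase-frequency differences
raw_nearest, raw_dists = nearest_to_first(X)

# Standardized scale: each feature has mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_nearest, std_dists = nearest_to_first(X_std)

print("Nearest to customer 0 (raw):", raw_nearest)
print("Nearest to customer 0 (standardized):", std_nearest)
```

On the raw scale, customer 1 looks closest to customer 0 because their incomes differ by only $500, even though their purchase frequencies are very different. After standardization, customer 2, who is similar on both features, becomes the nearest neighbor.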

Step 6: Determine Optimal Number of Clusters

# Elbow Method

wcss = []

silhouette_scores = []

K_range = range(2, 11)

for k in K_range:

    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)

    kmeans.fit(X_scaled)

    wcss.append(kmeans.inertia_)

    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot Elbow Curve

plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)

plt.plot(K_range, wcss, marker='o')

plt.xlabel('Number of Clusters (k)')

plt.ylabel('WCSS')

plt.title('Elbow Method')

plt.grid(True)

# Plot Silhouette Scores

plt.subplot(1, 2, 2)

plt.plot(K_range, silhouette_scores, marker='o', color='orange')

plt.xlabel('Number of Clusters (k)')

plt.ylabel('Silhouette Score')

plt.title('Silhouette Score by k')

plt.grid(True)

plt.tight_layout()

plt.show()

Step 7: Fit k-Means with Optimal k

# Based on elbow and silhouette analysis, choose k=4

optimal_k = 4

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10, max_iter=300)

df['Cluster'] = kmeans.fit_predict(X_scaled)

print(f"\nCluster assignments:\n{df['Cluster'].value_counts().sort_index()}")

Step 8: Evaluate Clustering Quality

# Silhouette Score

sil_score = silhouette_score(X_scaled, df['Cluster'])

print(f"Silhouette Score: {sil_score:.3f}")

# Davies-Bouldin Index (lower is better)

db_score = davies_bouldin_score(X_scaled, df['Cluster'])

print(f"Davies-Bouldin Index: {db_score:.3f}")

# Calinski-Harabasz Index (higher is better)

ch_score = calinski_harabasz_score(X_scaled, df['Cluster'])

print(f"Calinski-Harabasz Index: {ch_score:.3f}")

Step 9: Profile and Interpret Clusters

# Compute cluster profiles using original (unscaled) features

cluster_profiles = df.groupby('Cluster')[features].mean()

print("\nCluster Profiles (Mean Values):")

print(cluster_profiles)

# Add cluster sizes

cluster_sizes = df['Cluster'].value_counts().sort_index()

cluster_profiles['Cluster_Size'] = cluster_sizes.values

print("\nCluster Profiles with Sizes:")

print(cluster_profiles)

# Visualize cluster profiles with heatmap

plt.figure(figsize=(10, 6))

sns.heatmap(cluster_profiles[features].T, annot=True, fmt='.1f', cmap='YlGnBu')

plt.title('Cluster Profiles Heatmap')

plt.xlabel('Cluster')

plt.ylabel('Feature')

plt.show()

Step 10: Visualize Clusters

2D Visualization using PCA:

# Reduce to 2 dimensions for visualization

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

# Create scatter plot

plt.figure(figsize=(10, 7))

scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['Cluster'],

                      cmap='viridis', alpha=0.6, edgecolors='k', s=50)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')

plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')

plt.title('Customer Clusters (PCA Projection)')

plt.colorbar(scatter, label='Cluster')

plt.grid(True, alpha=0.3)

plt.show()

print(f"Total variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.2%}")

Step 11: Statistical Comparison Across Clusters

# Compare clusters statistically

for feature in features:

    print(f"\n{feature} by Cluster:")

    print(df.groupby('Cluster')[feature].describe())

   

# Visualize distributions with box plots

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

axes = axes.flatten()

for idx, feature in enumerate(features):

    df.boxplot(column=feature, by='Cluster', ax=axes[idx])

    axes[idx].set_title(feature)

    axes[idx].set_xlabel('Cluster')

   

plt.suptitle('Feature Distributions by Cluster', y=1.02)

plt.tight_layout()

plt.show()

Step 12: Save Results

# Save clustered data

df.to_csv('customer_data_clustered.csv', index=False)

# Save cluster profiles

cluster_profiles.to_csv('cluster_profiles.csv')

print("Clustering complete! Results saved.")

12.7 From Clusters to Actionable Strategies

Clustering is only valuable if it leads to action. Here's how to translate clusters into business strategies:

Step 1: Name and Characterize Each Cluster

Based on the cluster profiles, assign meaningful names:

Example:

Step 2: Develop Targeted Strategies

Cluster 0: Budget-Conscious Infrequents

Cluster 1: High-Value Loyalists

Cluster 2: Mid-Tier Regulars

Cluster 3: Lapsed High-Potentials

Step 3: Measure and Iterate

Track the performance of cluster-specific strategies:

Refine strategies based on results and re-cluster periodically as customer behavior evolves.

12.8 Introduction to Recommendation Systems and Collaborative Filtering

Recommendation systems have become ubiquitous in modern business, powering product suggestions on e-commerce platforms, content recommendations on streaming services, and personalized marketing campaigns. At their core, recommendation systems solve a fundamental business problem: matching users with items they're likely to value, thereby increasing engagement, sales, and customer satisfaction.

This section introduces the foundational concepts of recommendation systems, with a focus on Collaborative Filtering (CF), one of the most widely used and effective approaches.

12.8.1 Why Recommendation Systems Matter for Business

Recommendation systems deliver measurable business value across multiple dimensions:

| Business Impact        | Example                        | Typical Improvement                   |
|------------------------|--------------------------------|---------------------------------------|
| Revenue Growth         | Amazon product recommendations | 35% of revenue from recommendations   |
| Engagement             | Netflix content suggestions    | 80% of watched content is recommended |
| Customer Retention     | Spotify personalized playlists | 25-40% increase in session length     |
| Conversion Rate        | E-commerce "You may also like" | 2-5x higher click-through rates       |
| Inventory Optimization | Promote slow-moving items      | 15-20% reduction in excess inventory  |
| Customer Satisfaction  | Personalized experiences       | 10-15% improvement in NPS scores      |

Common Business Applications:

12.8.2 Types of Recommendation Systems

There are three main approaches to building recommendation systems:

1. Content-Based Filtering

Recommends items similar to those a user has liked in the past, based on item attributes.

How it works:

Example:  If you watched sci-fi movies, recommend more sci-fi movies.

Pros:

Cons:

2. Collaborative Filtering (CF)

Recommends items based on patterns in user behavior, leveraging the "wisdom of the crowd."

How it works:

Example:  "Users who liked items A and B also liked item C."

Pros:

Cons:

3. Hybrid Systems

Combine multiple approaches to leverage their complementary strengths.

Common Hybrid Strategies:

Example:  Netflix uses content features + collaborative patterns + contextual signals (time of day, device).


12.8.3 Collaborative Filtering: Core Concepts

Collaborative Filtering is based on a simple but powerful insight: users who agreed in the past tend to agree in the future.

The User-Item Matrix

At the heart of CF is the user-item interaction matrix:

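|        | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|--------|--------|--------|--------|--------|--------|
| User A | 5      | 3      | ?      | 1      | ?      |
| User B | 4      | ?      | ?      | 2      | 5      |
| User C | 1      | 1      | 5      | 5      | 4      |
| User D | ?      | 3      | 4      | ?      | ?      |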

The Goal: Predict the missing values to generate recommendations.

Two Flavors of Collaborative Filtering

1. User-Based Collaborative Filtering

"Find users similar to me, and recommend what they liked."

Process:

  1. Calculate similarity between users (e.g., User A and User B)
  2. Find the k most similar users (neighbors)
  3. Predict ratings based on neighbors' ratings
  4. Recommend highest-predicted items

Similarity Metrics:

2. Item-Based Collaborative Filtering

"Find items similar to what I liked, and recommend those."

Process:

  1. Calculate similarity between items (e.g., Item 1 and Item 2)
  2. For each item a user liked, find similar items
  3. Predict ratings based on similar items' ratings
  4. Recommend highest-predicted items

Why Item-Based Often Works Better:


12.8.4 Implementing Collaborative Filtering in Python

Let's build a simple recommendation system using the transactions dataset.

Step 1: Prepare the Data

import pandas as pd

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

from scipy.sparse import csr_matrix

import matplotlib.pyplot as plt

import seaborn as sns

# Load transaction data

df = pd.read_csv('transactions.csv')

df['transaction_date'] = pd.to_datetime(df['transaction_date'])

print("=== Transaction Data ===")

print(df.head())

print(f"\nShape: {df.shape}")

print(f"Unique customers: {df['customer_id'].nunique()}")

print(f"Unique transactions: {df['transaction_id'].nunique()}")

# For this example, we'll create a simplified scenario where we have product purchases

# Since our dataset has transactions, we'll simulate product IDs based on transaction patterns

np.random.seed(42)

# Create synthetic product IDs (in real scenario, you'd have actual product data)

# We'll assign products based on transaction amount ranges to create realistic patterns

def assign_product(amount):

    if amount < 5:

        return np.random.choice(['Product_A', 'Product_B', 'Product_C'], p=[0.5, 0.3, 0.2])

    elif amount < 15:

        return np.random.choice(['Product_D', 'Product_E', 'Product_F'], p=[0.4, 0.4, 0.2])

    else:

        return np.random.choice(['Product_G', 'Product_H', 'Product_I'], p=[0.3, 0.4, 0.3])

df['product_id'] = df['amount'].apply(assign_product)

# Create implicit ratings (purchase frequency as proxy for preference)

# In real scenarios, you might have explicit ratings (1-5 stars)

user_item_matrix = df.groupby(['customer_id', 'product_id']).size().reset_index(name='purchase_count')

print("\n=== User-Item Interactions ===")

print(user_item_matrix.head(10))

print(f"\nTotal interactions: {len(user_item_matrix)}")

Step 2: Create User-Item Matrix

# Pivot to create user-item matrix

interaction_matrix = user_item_matrix.pivot(

    index='customer_id',

    columns='product_id',

    values='purchase_count'

).fillna(0)

print("\n=== User-Item Matrix ===")

print(f"Shape: {interaction_matrix.shape}")

print(f"Sparsity: {(interaction_matrix == 0).sum().sum() / (interaction_matrix.shape[0] * interaction_matrix.shape[1]) * 100:.1f}%")

print("\nSample of matrix:")

print(interaction_matrix.head())

# Visualize the matrix

plt.figure(figsize=(12, 8))

sns.heatmap(interaction_matrix.iloc[:20, :], cmap='YlOrRd', cbar_kws={'label': 'Purchase Count'})

plt.title('User-Item Interaction Matrix (First 20 Users)', fontsize=14, fontweight='bold')

plt.xlabel('Product ID', fontsize=11)

plt.ylabel('Customer ID', fontsize=11)

plt.tight_layout()

plt.show()

Step 3: User-Based Collaborative Filtering

# Calculate user-user similarity using cosine similarity

user_similarity = cosine_similarity(interaction_matrix)

user_similarity_df = pd.DataFrame(

    user_similarity,

    index=interaction_matrix.index,

    columns=interaction_matrix.index

)

print("\n=== User Similarity Matrix ===")

print(user_similarity_df.iloc[:5, :5])

# Function to get recommendations for a user

def get_user_based_recommendations(user_id, user_item_matrix, user_similarity_df, n_recommendations=5):

    """

    Generate recommendations using user-based collaborative filtering

    """

    if user_id not in user_item_matrix.index:

        return f"User {user_id} not found in the dataset"

   

    # Get similarity scores for this user with all other users

    similar_users = user_similarity_df[user_id].sort_values(ascending=False)

   

    # Exclude the user themselves

    similar_users = similar_users.drop(user_id)

   

    # Get top 5 most similar users

    top_similar_users = similar_users.head(5)

   

    print(f"\n{'='*80}")

    print(f"RECOMMENDATIONS FOR USER {user_id}")

    print(f"{'='*80}")

    print(f"\n📊 Top 5 Most Similar Users:")

    for sim_user, similarity in top_similar_users.items():

        print(f"   • User {sim_user}: Similarity = {similarity:.3f}")

   

    # Get items the target user has already interacted with

    user_items = set(user_item_matrix.loc[user_id][user_item_matrix.loc[user_id] > 0].index)

   

    # Calculate weighted scores for items

    item_scores = {}

    for product in user_item_matrix.columns:

        if product not in user_items:  # Only recommend new items

            # Weighted sum of similar users' ratings

            score = 0

            similarity_sum = 0

            for sim_user, similarity in top_similar_users.items():

                if user_item_matrix.loc[sim_user, product] > 0:

                    score += similarity * user_item_matrix.loc[sim_user, product]

                    similarity_sum += similarity

           

            if similarity_sum > 0:

                item_scores[product] = score / similarity_sum

   

    # Sort and get top recommendations

    recommendations = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)[:n_recommendations]

   

    print(f"\n🎯 Current Purchases:")

    for item in user_items:

        print(f"   • {item}: {user_item_matrix.loc[user_id, item]:.0f} purchases")

   

    print(f"\n⭐ Top {n_recommendations} Recommendations:")

    for i, (product, score) in enumerate(recommendations, 1):

        print(f"   {i}. {product} (Score: {score:.3f})")

   

    print(f"{'='*80}\n")

   

    return recommendations

# Test with a specific user

test_user = interaction_matrix.index[5]

recommendations = get_user_based_recommendations(

    test_user,

    interaction_matrix,

    user_similarity_df,

    n_recommendations=3

)

Step 4: Item-Based Collaborative Filtering

# Calculate item-item similarity

item_similarity = cosine_similarity(interaction_matrix.T)

item_similarity_df = pd.DataFrame(

    item_similarity,

    index=interaction_matrix.columns,

    columns=interaction_matrix.columns

)

print("\n=== Item Similarity Matrix ===")

print(item_similarity_df)

# Visualize item similarities

plt.figure(figsize=(10, 8))

# Purchase counts are non-negative, so cosine similarity lies in [0, 1]

sns.heatmap(item_similarity_df, annot=True, fmt='.2f', cmap='YlOrRd',

            vmin=0, vmax=1, square=True,

            cbar_kws={'label': 'Cosine Similarity'})

plt.title('Item-Item Similarity Matrix', fontsize=14, fontweight='bold')

plt.xlabel('Product ID', fontsize=11)

plt.ylabel('Product ID', fontsize=11)

plt.tight_layout()

plt.show()

# Function to get item-based recommendations

def get_item_based_recommendations(user_id, user_item_matrix, item_similarity_df, n_recommendations=5):

    """

    Generate recommendations using item-based collaborative filtering

    """

    if user_id not in user_item_matrix.index:

        return f"User {user_id} not found in the dataset"

   

    # Get items the user has interacted with

    user_items = user_item_matrix.loc[user_id]

    user_purchased_items = user_items[user_items > 0]

   

    print(f"\n{'='*80}")

    print(f"ITEM-BASED RECOMMENDATIONS FOR USER {user_id}")

    print(f"{'='*80}")

   

    print(f"\n📦 User's Purchase History:")

    for item, count in user_purchased_items.items():

        print(f"   • {item}: {count:.0f} purchases")

   

    # Calculate scores for all items

    item_scores = {}

    for candidate_item in user_item_matrix.columns:

        if candidate_item not in user_purchased_items.index:  # Only new items

            score = 0

            similarity_sum = 0

           

            # For each item the user purchased, find similar items

            for purchased_item, purchase_count in user_purchased_items.items():

                similarity = item_similarity_df.loc[purchased_item, candidate_item]

                score += similarity * purchase_count

                similarity_sum += abs(similarity)

           

            if similarity_sum > 0:

                item_scores[candidate_item] = score / similarity_sum

   

    # Sort and get top recommendations

    recommendations = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)[:n_recommendations]

   

    print(f"\n⭐ Top {n_recommendations} Recommendations:")

    for i, (product, score) in enumerate(recommendations, 1):

        # Find which purchased items are most similar

        similar_to = []

        for purchased_item in user_purchased_items.index:

            sim = item_similarity_df.loc[purchased_item, product]

            if sim > 0.3:  # Threshold for "similar"

                similar_to.append(f"{purchased_item} ({sim:.2f})")

       

        similar_str = ", ".join(similar_to[:2]) if similar_to else "general pattern"

        print(f"   {i}. {product} (Score: {score:.3f})")

        print(f"      → Similar to: {similar_str}")

   

    print(f"{'='*80}\n")

   

    return recommendations

# Test item-based recommendations

test_user = interaction_matrix.index[5]

item_recommendations = get_item_based_recommendations(

    test_user,

    interaction_matrix,

    item_similarity_df,

    n_recommendations=3

)

Step 5: Matrix Factorization (Advanced CF)

Matrix factorization is a more sophisticated CF approach that decomposes the user-item matrix into lower-dimensional latent factors.

from sklearn.decomposition import NMF

# Apply Non-negative Matrix Factorization

n_factors = 3  # Number of latent factors

nmf_model = NMF(n_components=n_factors, init='random', random_state=42, max_iter=200)

user_factors = nmf_model.fit_transform(interaction_matrix)

item_factors = nmf_model.components_

print("\n=== Matrix Factorization ===")

print(f"User factors shape: {user_factors.shape}")

print(f"Item factors shape: {item_factors.shape}")

# Reconstruct the matrix (predictions)

predicted_matrix = np.dot(user_factors, item_factors)

predicted_df = pd.DataFrame(

    predicted_matrix,

    index=interaction_matrix.index,

    columns=interaction_matrix.columns

)

print("\n=== Predicted Ratings (Sample) ===")

print(predicted_df.head())

# Function to get recommendations using matrix factorization

def get_mf_recommendations(user_id, original_matrix, predicted_matrix, n_recommendations=5):

    """

    Generate recommendations using matrix factorization

    """

    if user_id not in original_matrix.index:

        return f"User {user_id} not found"

   

    # Get user's actual and predicted ratings

    actual = original_matrix.loc[user_id]

    predicted = predicted_matrix.loc[user_id]

   

    # Find items user hasn't purchased

    unpurchased = actual[actual == 0].index

   

    # Get predictions for unpurchased items

    recommendations = predicted[unpurchased].sort_values(ascending=False).head(n_recommendations)

   

    print(f"\n{'='*80}")

    print(f"MATRIX FACTORIZATION RECOMMENDATIONS FOR USER {user_id}")

    print(f"{'='*80}")

   

    print(f"\n📦 User's Purchase History:")

    purchased = actual[actual > 0]

    for item, count in purchased.items():

        print(f"   • {item}: {count:.0f} purchases")

   

    print(f"\n⭐ Top {n_recommendations} Recommendations:")

    for i, (product, score) in enumerate(recommendations.items(), 1):

        print(f"   {i}. {product} (Predicted Score: {score:.3f})")

   

    print(f"{'='*80}\n")

   

    return recommendations

# Test matrix factorization recommendations

test_user = interaction_matrix.index[5]

mf_recommendations = get_mf_recommendations(

    test_user,

    interaction_matrix,

    predicted_df,

    n_recommendations=3

)

12.8.5 Evaluating Recommendation Systems

Measuring the effectiveness of recommendations requires different metrics than traditional ML models.

Offline Evaluation Metrics

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Split data into train/test

train_data = []

test_data = []

for user in interaction_matrix.index:

    user_interactions = user_item_matrix[user_item_matrix['customer_id'] == user]

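    if len(user_interactions) >= 2:

        train, test = train_test_split(user_interactions, test_size=0.2, random_state=42)

        train_data.append(train)

        test_data.append(test)

    else:

        # Too few interactions to split: keep them all for training

        train_data.append(user_interactions)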

train_df = pd.concat(train_data)

test_df = pd.concat(test_data)

print("=== Train/Test Split ===")

print(f"Training interactions: {len(train_df)}")

print(f"Test interactions: {len(test_df)}")

# Rebuild matrix with training data only

train_matrix = train_df.pivot(

    index='customer_id',

    columns='product_id',

    values='purchase_count'

).fillna(0)

# Calculate predictions for test set

# (Using item-based CF as example)

train_item_similarity = cosine_similarity(train_matrix.T)

train_item_sim_df = pd.DataFrame(

    train_item_similarity,

    index=train_matrix.columns,

    columns=train_matrix.columns

)

# Predict ratings for test set

predictions = []

actuals = []

for _, row in test_df.iterrows():

    user = row['customer_id']

    item = row['product_id']

    actual = row['purchase_count']

   

    if user in train_matrix.index and item in train_matrix.columns:

        # Get user's training purchases

        user_purchases = train_matrix.loc[user]

        purchased_items = user_purchases[user_purchases > 0]

       

        # Predict based on similar items

        if len(purchased_items) > 0:

            score = 0

            sim_sum = 0

            for purch_item, purch_count in purchased_items.items():

                if purch_item in train_item_sim_df.index:

                    sim = train_item_sim_df.loc[purch_item, item]

                    score += sim * purch_count

                    sim_sum += abs(sim)

           

            predicted = score / sim_sum if sim_sum > 0 else 0

            predictions.append(predicted)

            actuals.append(actual)

# Calculate metrics

rmse = np.sqrt(mean_squared_error(actuals, predictions))

mae = mean_absolute_error(actuals, predictions)

print("\n=== Prediction Accuracy ===")

print(f"RMSE: {rmse:.3f}")

print(f"MAE: {mae:.3f}")

Key Evaluation Metrics

| Metric      | Description                                      | When to Use                           |
|-------------|--------------------------------------------------|---------------------------------------|
| RMSE/MAE    | Prediction error for ratings                     | Explicit ratings (1-5 stars)          |
| Precision@K | % of top-K recommendations that are relevant     | Implicit feedback (clicks, purchases) |
| Recall@K    | % of relevant items found in top-K               | Measuring coverage                    |
| NDCG        | Normalized Discounted Cumulative Gain            | Ranking quality                       |
| Hit Rate    | % of users with at least 1 relevant item in top-K | User satisfaction                    |
| Coverage    | % of items that can be recommended               | Diversity                             |
| Novelty     | How unexpected recommendations are               | Discovery                             |
| Serendipity | Relevant but unexpected recommendations          | User delight                          |
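Two of the ranking metrics in the table, NDCG and Hit Rate, are easy to compute by hand. The sketch below uses binary relevance (an item is either relevant or not) and made-up recommendation lists; the function names `ndcg_at_k` and `hit_rate_at_k` and the example users are assumptions for illustration.

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@K: relevant items count for more when ranked higher."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def hit_rate_at_k(recs_by_user, relevant_by_user, k):
    """Share of users with at least one relevant item in their top-K list."""
    hits = [
        any(item in relevant_by_user.get(user, set()) for item in recs[:k])
        for user, recs in recs_by_user.items()
    ]
    return sum(hits) / len(hits)

# Hypothetical recommendation lists and held-out purchases
recs_by_user = {"u1": ["A", "B", "C"], "u2": ["D", "E", "F"]}
relevant_by_user = {"u1": {"B"}, "u2": {"X"}}

ndcg_u1 = ndcg_at_k(recs_by_user["u1"], relevant_by_user["u1"], k=3)
hit_rate = hit_rate_at_k(recs_by_user, relevant_by_user, k=3)
print(f"NDCG@3 for u1: {ndcg_u1:.3f}")
print(f"Hit Rate@3: {hit_rate:.2f}")
```

For user u1 the only relevant item sits in second place, so NDCG is below 1; had it been ranked first, NDCG would be exactly 1.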

# Calculate Precision@K and Recall@K

def precision_recall_at_k(recommendations_dict, test_set, k=5):

    """

    Calculate Precision@K and Recall@K

   

    recommendations_dict: {user_id: [list of recommended items]}

    test_set: DataFrame with actual user-item interactions

    """

    precisions = []

    recalls = []

   

    for user, recommended_items in recommendations_dict.items():

        # Get actual items user interacted with in test set

        actual_items = set(test_set[test_set['customer_id'] == user]['product_id'])

       

        if len(actual_items) == 0:

            continue

       

        # Get top K recommendations

        top_k = recommended_items[:k]

       

        # Calculate metrics

        relevant_recommended = len(set(top_k) & actual_items)

        precision = relevant_recommended / k if k > 0 else 0

        recall = relevant_recommended / len(actual_items) if len(actual_items) > 0 else 0

       

        precisions.append(precision)

        recalls.append(recall)

   

    return np.mean(precisions), np.mean(recalls)

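print("\n=== Ranking Metrics ===")
# NOTE: the values printed below are random placeholders for illustration only.
# Replace them with the output of precision_recall_at_k() on real recommendations.
print(f"Precision@3: {np.random.uniform(0.15, 0.25):.3f}")  # Placeholder
print(f"Recall@3: {np.random.uniform(0.10, 0.20):.3f}")     # Placeholder
print(f"Coverage: {np.random.uniform(0.70, 0.85):.1%}")     # Placeholder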
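The metrics table also lists NDCG, Hit Rate, and Coverage. The following is a minimal sketch of how they could be computed, assuming binary relevance (an item is relevant if it appears in the user's held-out interactions) and no duplicate items within a recommendation list. The function names and the toy data are hypothetical and use plain Python structures rather than the chapter's DataFrames.

```python
import math

def ndcg_at_k(recommended, relevant, k=5):
    """Binary-relevance NDCG@K for a single user."""
    top_k = recommended[:k]
    # DCG: each relevant item contributes 1 / log2(position + 1), positions start at 1
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(top_k) if item in relevant)
    # Ideal DCG: all relevant items (up to k) placed at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def hit_rate_at_k(recs_by_user, relevant_by_user, k=5):
    """Share of evaluable users with at least one relevant item in their top K."""
    users = [u for u in recs_by_user if relevant_by_user.get(u)]
    if not users:
        return 0.0
    hits = sum(1 for u in users
               if set(recs_by_user[u][:k]) & relevant_by_user[u])
    return hits / len(users)

def catalog_coverage(recs_by_user, catalog, k=5):
    """Share of catalog items that appear in at least one top-K list."""
    if not catalog:
        return 0.0
    recommended = set()
    for items in recs_by_user.values():
        recommended.update(items[:k])
    return len(recommended & set(catalog)) / len(catalog)

# Toy data (hypothetical)
recs = {"u1": ["A", "B", "C"], "u2": ["B", "D", "A"], "u3": ["C", "A", "B"]}
held_out = {"u1": {"A", "E"}, "u2": {"E"}, "u3": {"B"}}
catalog = ["A", "B", "C", "D", "E"]

for user in recs:
    print(user, "NDCG@3:", round(ndcg_at_k(recs[user], held_out[user], k=3), 3))
print("Hit Rate@3:", round(hit_rate_at_k(recs, held_out, k=3), 3))
print("Coverage@3:", round(catalog_coverage(recs, catalog, k=3), 3))
```

In this toy example, user u3 has one relevant item at position 3, so NDCG@3 is 1/log2(4) = 0.5. Two of the three users receive at least one relevant item, and four of the five catalog items are recommended to someone.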

12.8.6 Challenges and Best Practices

Common Challenges

| Challenge | Description | Solutions |
|---|---|---|
| Cold Start | New users/items have no data | Use content features, demographics, popularity |
| Sparsity | Most user-item pairs are missing | Matrix factorization, hybrid approaches |
| Scalability | Millions of users × items | Approximate nearest neighbors, sampling |
| Filter Bubble | Only recommending similar items | Add diversity, exploration vs. exploitation |
| Popularity Bias | Over-recommending popular items | Normalize by popularity, boost long-tail |
| Temporal Dynamics | Preferences change over time | Time-weighted similarity, session-based |
| Implicit Feedback | No explicit ratings | Use purchase, click, view as proxy |
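One way to address temporal dynamics is to down-weight older interactions before building the interaction matrix. The sketch below assumes an exponential decay with a 90-day half-life. The half-life, the function names, and the event tuples are illustrative assumptions, not values taken from the chapter's dataset.

```python
from datetime import date

def time_decay_weight(days_ago, half_life_days=90.0):
    """Exponential decay: an interaction half_life_days old counts half as much."""
    return 0.5 ** (days_ago / half_life_days)

def weighted_interactions(events, today, half_life_days=90.0):
    """Sum decayed weights per (customer, product) pair.

    events: iterable of (customer_id, product_id, purchase_date) tuples.
    """
    scores = {}
    for customer, product, when in events:
        days_ago = max((today - when).days, 0)  # Treat future dates as "today"
        key = (customer, product)
        scores[key] = scores.get(key, 0.0) + time_decay_weight(days_ago, half_life_days)
    return scores

# Hypothetical purchase events
events = [
    ("C1", "P1", date(2024, 6, 30)),  # Same day as "today": weight 1.0
    ("C1", "P1", date(2024, 4, 1)),   # 90 days earlier: weight 0.5
    ("C1", "P2", date(2024, 1, 2)),   # 180 days earlier: weight 0.25
]
scores = weighted_interactions(events, today=date(2024, 6, 30))
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

The resulting scores can replace raw purchase counts when computing similarities, so that recent behavior has more influence than old behavior. A shorter half-life reacts faster to changing tastes but discards more history.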

Best Practices

1. Start Simple

2. Handle Cold Start

def hybrid_recommendation(user_id, has_history=True):
    """Hybrid approach for cold start"""
    if has_history:
        # Use collaborative filtering (same call signature as in the end-to-end example)
        return get_item_based_recommendations(
            user_id, interaction_matrix, item_similarity_df, n_recommendations=3
        )
    else:
        # Fall back to popular items or content-based
        return get_popular_items()
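The fallback branch relies on a popularity ranking. As a minimal sketch of what such a fallback might look like, the hypothetical helper below ranks products by the number of distinct buyers, so that one heavy buyer does not dominate. It takes a plain list of (customer, product) pairs, which is an assumption made for illustration and not the chapter's data structure.

```python
def most_popular_items(transactions, n=3, exclude=()):
    """Rank products by number of distinct buyers (hypothetical helper).

    transactions: iterable of (customer_id, product_id) pairs.
    exclude: products to leave out, e.g. items the user already owns.
    """
    buyers = {}
    for customer, product in transactions:
        buyers.setdefault(product, set()).add(customer)

    excluded = set(exclude)
    ranked = sorted(
        (p for p in buyers if p not in excluded),
        key=lambda p: (-len(buyers[p]), str(p)),  # Most buyers first, ties by name
    )
    return ranked[:n]

# Toy transactions (hypothetical)
transactions = [
    ("C1", "P1"), ("C2", "P1"), ("C3", "P1"),
    ("C1", "P2"), ("C2", "P2"),
    ("C1", "P3"), ("C1", "P3"),  # Repeat purchase by one customer counts once
]
print(most_popular_items(transactions, n=2))
print(most_popular_items(transactions, n=2, exclude=["P1"]))
```

Counting distinct buyers rather than total purchases is a design choice. Total purchase volume or recent sales would be equally reasonable definitions of popularity, depending on the business goal.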

3. Balance Accuracy and Diversity

def diversify_recommendations(recommendations, similarity_threshold=0.7):
    """Remove highly similar items from recommendations"""
    # Guard against an empty input list (indexing [0] would raise IndexError)
    if len(recommendations) == 0:
        return []

    diverse_recs = [recommendations[0]]  # Keep top recommendation

    for rec in recommendations[1:]:
        # Check if too similar to already selected items
        is_diverse = all(
            item_similarity_df.loc[rec, selected] < similarity_threshold
            for selected in diverse_recs
        )
        if is_diverse:
            diverse_recs.append(rec)

    return diverse_recs
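A threshold filter like the one above can return fewer items than requested. An alternative is to re-rank with a weighted trade-off between relevance and redundancy, a technique commonly called Maximal Marginal Relevance (MMR). The sketch below is a swapped-in illustration, not the chapter's method. The relevance scores, the similarity table, and the `lambda_` weight are hypothetical.

```python
def mmr_rerank(candidates, relevance, similarity, n=3, lambda_=0.7):
    """Greedy MMR re-ranking.

    relevance: {item: score}; similarity: function(a, b) returning a value in [0, 1].
    lambda_ = 1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n:
        def mmr_score(item):
            # Penalize similarity to the closest already-selected item
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lambda_ * relevance[item] - (1 - lambda_) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy inputs (hypothetical): A and B are near-duplicates, C is different
relevance = {"A": 0.9, "B": 0.85, "C": 0.6}
sim_table = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.10,
    frozenset(("B", "C")): 0.20,
}

def sim(a, b):
    return 1.0 if a == b else sim_table[frozenset((a, b))]

print(mmr_rerank(["A", "B", "C"], relevance, sim, n=2, lambda_=1.0))
print(mmr_rerank(["A", "B", "C"], relevance, sim, n=2, lambda_=0.5))
```

With `lambda_=1.0` the two most relevant items (A and B) are returned even though they are near-duplicates. With `lambda_=0.5` the second slot goes to C, because B's similarity to A outweighs its higher relevance.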

4. Monitor Business Metrics

5. A/B Test Everything
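When comparing a control group with a group that sees recommendations, one common check is a two-proportion z-test on conversion rates. The sketch below is a simplified illustration using only the standard library. The function name and the visitor and conversion counts are hypothetical, and a real test plan would also fix the sample size and duration in advance.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # No variation at all: nothing to test
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control (A) vs. recommendations shown (B)
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

In this made-up example the conversion rate rises from 5.0% to 5.6%, yet the p-value stays slightly above the conventional 0.05 threshold, which illustrates why small lifts need large samples before they can be trusted.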

12.8.7 AI Prompts for Recommendation Systems

PROMPT: "I have a user-item interaction matrix with 10,000 users and 1,000 products. The matrix is 98% sparse. What collaborative filtering approach should I use? Provide Python code to implement item-based CF with cosine similarity and handle the sparsity."

PROMPT: "My recommendation system suffers from cold start for new users. I have user demographics (age, location, gender) and product categories. How can I create a hybrid system that uses content-based filtering for new users and collaborative filtering for existing users? Provide implementation code."

PROMPT: "Implement matrix factorization using SVD for my recommendation system. Show me how to: 1) Choose the optimal number of latent factors, 2) Handle missing values, 3) Generate predictions, and 4) Evaluate using RMSE and Precision@K."

PROMPT: "My recommendations are too focused on popular items. How can I add diversity and promote long-tail products? Provide code to: 1) Calculate item popularity bias, 2) Implement a diversity penalty, and 3) Balance accuracy vs. diversity."

PROMPT: "Create a recommendation evaluation framework that calculates: Precision@K, Recall@K, NDCG, Coverage, and Novelty. Include train/test split logic and visualization of results across different K values."


12.8.8 Real-World Example: E-Commerce Product Recommendations

# Complete end-to-end recommendation pipeline
print("\n" + "="*100)
print("=== E-COMMERCE RECOMMENDATION SYSTEM: COMPLETE PIPELINE ===")
print("="*100)

# Step 1: Data Summary
print("\n📊 DATASET OVERVIEW:")
print(f"   • Total Customers: {interaction_matrix.shape[0]}")
print(f"   • Total Products: {interaction_matrix.shape[1]}")
print(f"   • Total Interactions: {(interaction_matrix > 0).sum().sum()}")
print(f"   • Matrix Sparsity: {(interaction_matrix == 0).sum().sum() / (interaction_matrix.shape[0] * interaction_matrix.shape[1]) * 100:.1f}%")
print(f"   • Avg Purchases per Customer: {interaction_matrix.sum(axis=1).mean():.1f}")
print(f"   • Avg Purchases per Product: {interaction_matrix.sum(axis=0).mean():.1f}")

# Step 2: Generate recommendations for multiple users
print("\n🎯 GENERATING RECOMMENDATIONS FOR SAMPLE USERS:")
print("="*100)

sample_users = interaction_matrix.index[:3]

for user in sample_users:
    print(f"\n{'─'*100}")
    print(f"USER {user} RECOMMENDATION REPORT")
    print(f"{'─'*100}")

    # User profile
    user_purchases = interaction_matrix.loc[user]
    purchased_items = user_purchases[user_purchases > 0]

    print(f"\n📦 Purchase History ({len(purchased_items)} products):")
    for item, count in purchased_items.items():
        print(f"   • {item}: {count:.0f} purchases")

    # Item-based recommendations
    item_recs = get_item_based_recommendations(user, interaction_matrix, item_similarity_df, n_recommendations=3)

    # Display the result (the exact format depends on what the helper returns)
    print("\n🎯 Item-Based Recommendations:")
    print(item_recs)

# Step 3: Business Impact Projection
print("\n💰 PROJECTED BUSINESS IMPACT:")
print("="*100)

# Simulate recommendation acceptance (both rates are assumptions, not measurements)
acceptance_rate = 0.15  # 15% of displayed recommendations are clicked
conversion_rate = 0.05  # 5% of clicks convert to purchases
avg_order_value = df['amount'].mean()

total_users = interaction_matrix.shape[0]
potential_clicks = total_users * 3 * acceptance_rate  # 3 recommendations per user
potential_conversions = potential_clicks * conversion_rate
potential_revenue = potential_conversions * avg_order_value

print(f"\n   Assumptions:")
print(f"   • Recommendation Acceptance Rate: {acceptance_rate:.1%}")
print(f"   • Click-to-Purchase Conversion: {conversion_rate:.1%}")
print(f"   • Average Order Value: ${avg_order_value:.2f}")

print(f"\n   Projected Results:")
print(f"   • Total Users: {total_users:,}")
print(f"   • Expected Clicks: {potential_clicks:.0f}")
print(f"   • Expected Conversions: {potential_conversions:.0f}")
print(f"   • Projected Additional Revenue: ${potential_revenue:,.2f}")
print(f"   • Revenue Lift per User: ${potential_revenue/total_users:.2f}")

print("\n" + "="*100)
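Because the projection depends entirely on the assumed click and conversion rates, it can help to show stakeholders a range rather than a single number. The sketch below repeats the same arithmetic over a small grid of assumptions. The user count, order value, and rate grid are hypothetical inputs chosen for illustration.

```python
def project_revenue(total_users, recs_per_user, click_rate,
                    conversion_rate, avg_order_value):
    """Same arithmetic as the projection above: impressions -> clicks -> orders -> revenue."""
    clicks = total_users * recs_per_user * click_rate
    conversions = clicks * conversion_rate
    return conversions * avg_order_value

# Hypothetical inputs: 1,000 users, 3 recommendations each, $50 average order
print(f"{'Click rate':>10} {'Conv. rate':>10} {'Revenue':>12}")
for click_rate in (0.05, 0.10, 0.15):
    for conversion_rate in (0.02, 0.05):
        revenue = project_revenue(1000, 3, click_rate, conversion_rate, 50.0)
        print(f"{click_rate:>10.0%} {conversion_rate:>10.0%} {revenue:>12,.2f}")
```

With 1,000 users, a 15% click rate, and a 5% conversion rate, the projection is 450 clicks, 22.5 orders, and about $1,125 in revenue. Halving either rate halves the result, so the estimate is only as reliable as the assumed rates.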

Key Takeaways:

  1. Collaborative filtering leverages collective intelligence to find patterns in user behavior without requiring item metadata.
  2. Two main approaches: user-based (find similar users) and item-based (find similar items), with item-based often performing better in practice.
  3. Matrix factorization (SVD, NMF) provides a more sophisticated approach by discovering latent factors that explain user preferences.
  4. The cold start problem is a major challenge; address it with hybrid systems that combine collaborative and content-based approaches.
  5. Evaluation requires multiple metrics: accuracy (RMSE), ranking quality (Precision@K, NDCG), and business metrics (CTR, revenue).
  6. Balance is critical: accuracy vs. diversity, exploitation vs. exploration, personalization vs. serendipity.

When to Use Collaborative Filtering:

When to Consider Alternatives:

Exercises

Exercise 1: Apply k-Means Clustering to a Customer Dataset and Visualize the Results

Dataset:  Use a customer dataset with features like Age, Income, Purchase Frequency, Average Transaction Value, and Days Since Last Purchase.

Tasks:

  1. Load the dataset and perform exploratory data analysis (EDA).
  2. Handle missing values and encode categorical variables if present.
  3. Standardize the features using StandardScaler.
  4. Apply k-Means clustering with k=3, 4, and 5.
  5. Visualize the clusters using PCA for dimensionality reduction.
  6. Create a heatmap of cluster profiles.

Deliverable:  Python code, visualizations, and a brief interpretation of each cluster.


Exercise 2: Experiment with Different Numbers of Clusters and Compare Cluster Quality

Tasks:

  1. Use the Elbow Method to plot WCSS for k ranging from 2 to 10.
  2. Calculate and plot Silhouette Scores for the same range of k.
  3. Compute Davies-Bouldin and Calinski-Harabasz indices for each k.
  4. Based on these metrics, determine the optimal number of clusters.
  5. Discuss any trade-offs between cluster quality metrics and business interpretability.

Deliverable:  Plots, a table summarizing metrics for each k, and a recommendation for the optimal k with justification.

Exercise 3: Profile Each Cluster and Propose Targeted Marketing or Service Strategies

Tasks:

  1. Using the optimal k from Exercise 2, profile each cluster by computing mean, median, and standard deviation for each feature.
  2. Assign meaningful names to each cluster based on their characteristics.
  3. For each cluster, propose:
  4. Estimate the potential business impact (e.g., revenue increase, retention improvement) of implementing these strategies.

Deliverable:  A cluster profile report with actionable strategies for each segment.

Exercise 4: Reflect on the Limitations and Risks of Over-Interpreting Clusters

Scenario:  Your clustering analysis identified 5 customer segments. Management is excited and wants to immediately implement highly differentiated strategies for each segment, including separate product lines, pricing tiers, and marketing teams.

Tasks:

  1. Stability Concerns:  What if the clusters are not stable over time or across different samples? How would you test for stability?
  2. Over-Segmentation:  What are the risks of creating too many segments? How might this impact operational complexity and costs?
  3. Spurious Patterns:  Clustering algorithms will always produce clusters, even from random data. How can you validate that your clusters represent real, meaningful patterns?
  4. Actionability:  What if some clusters are too small or too similar to justify separate strategies? How would you handle this?
  5. Ethical Considerations:  Could clustering lead to discriminatory practices (e.g., excluding certain segments from offers)? How would you ensure fairness?

Deliverable:  A written reflection (1-2 pages) addressing these questions, with recommendations for responsible use of clustering in business decision-making.

Exercise 5: Build and Evaluate a Product Recommendation System

Build a collaborative filtering recommendation system, evaluate its performance, and present actionable business insights to stakeholders.

Scenario: You are a data analyst at an online retail company. The marketing team wants to implement a "Customers who bought this also bought..." feature on product pages to increase cross-sell revenue. They've asked you to:

  1. Build a recommendation system using historical transaction data
  2. Evaluate its accuracy and business potential
  3. Provide specific recommendations for implementation

Part 1: Data Preparation and Exploration

  1. Load the data_ppp.csv dataset and create a user-item interaction matrix
  2. Calculate and report:
  3. Create a visualization showing:
  4. Identify and discuss any data quality issues (e.g., customers with only 1 purchase, very sparse products)

Deliverable: Code, summary statistics table, and 2 visualizations with interpretations

Part 2: Build Recommendation Models

Implement two  of the following three approaches:

Option A: Item-Based Collaborative Filtering

Option B: User-Based Collaborative Filtering

Option C: Matrix Factorization

Requirements for each model:

Deliverable: Python code with functions, sample recommendations for 3 users/products, and a brief explanation of your approach

Part 3: Model Evaluation (25 points)

  1. Split your data into training (80%) and test (20%) sets
  2. Calculate the following metrics:
  3. Compare your two models using a comparison table
  4. Analyze errors:

Deliverable: Evaluation code, metrics comparison table, and analysis of model strengths/weaknesses

Part 4: Business Impact Analysis (15 points)

Create a business case for implementing your recommendation system:

  1. Revenue Projection:
  2. Segment Analysis:
  3. Implementation Recommendations:

Deliverable: 1-page business impact summary with revenue projections and implementation roadmap

Part 5: Executive Presentation

Create 3 visualizations for an executive presentation:

  1. Model Performance Dashboard: Show key metrics (accuracy, coverage, diversity) in an easy-to-understand format
  2. Sample Recommendations: Visualize actual recommendations for 2-3 example products/users with explanations
  3. Business Impact Projection: Chart showing projected revenue lift over 6-12 months

Requirements:

Deliverable: 3 polished visualizations with brief captions

Bonus Challenges (Optional)

  1. Cold Start Solution: Implement a hybrid approach that handles new users or products with no interaction history
  2. Diversity Enhancement: Modify your recommendation algorithm to increase diversity (reduce similarity between recommended items)
  3. Temporal Analysis: Analyze how recommendations change over time. Do recent purchases matter more than old ones?
  4. A/B Test Design: Design a detailed A/B test plan to evaluate the recommendation system in production, including sample size calculation, success metrics, and duration

Summary

Clustering is a powerful tool for discovering hidden patterns and segmenting customers, products, or markets. However, successful clustering requires careful preprocessing (handling missing data, encoding categorical variables, and standardization), thoughtful selection of the number of clusters, and rigorous interpretation. Most importantly, clusters must translate into actionable strategies that create business value. By combining technical rigor with business judgment, analysts can leverage clustering to drive personalization, efficiency, and strategic insight, while remaining mindful of the limitations and risks of over-interpreting algorithmic outputs.



Chapter 13. Using LLMs in Business Analytics