
An Easy Guide to Keyword Clustering with AI & Python

Alla Vovnenko on April 11, 2026
On-site Search SEO
13 Min Read

Keyword clustering is one of the most effective ways to turn a messy list of search queries into a clear, actionable structure for SEO. Instead of working with thousands of individual keywords, you group them by meaning and intent, which makes it much easier to plan categories, content, and site architecture.

Table of Contents

  1. What You’ll Need (Setup)
  2. Dataset: Preparing Keywords
  3. Clean Keywords (Handle Intent Noise)
  4. Generate Embeddings
  5. Cluster Keywords (K-means)
  6. Analyze Clusters
  7. Improve Results
  8. Auto-Label Clusters with AI
  9. Build Final Dataset
  10. Limitations & Real-World Notes

In this guide, I will walk through a practical, end-to-end approach to clustering keywords using Python and AI. To keep things simple and reproducible, I will use a generated beauty dataset focused on makeup and perfumery as an example. The same method works with real data from tools like Google Search Console.

Downloads:

  • beauty_keywords_10k — the generated sample dataset
  • keywords_final — the final output

Although the focus here is SEO, this approach is just as useful for on-site search analysis. If you collect search queries from your website, you can cluster them in the same way to better understand user behavior, identify demand patterns, and improve navigation or product discovery.

1. What You’ll Need (Setup)

Before we start clustering, you only need a few basic things set up. Nothing complex, but it helps to have everything ready so you can follow along smoothly.

1. Python

You’ll need Python installed on your computer.
If you don’t have it yet, download it from the official website and install it.

To check whether it’s installed, open a terminal:

  • Windows:
    Press Win + R, type cmd, and press Enter.
    Or open the Start menu and search for “Command Prompt”.
  • Mac:
    Press Cmd + Space, type “Terminal”, and open it.

Once the terminal is open, run:

python --version

If you see a version number, Python is installed correctly.

2. Required Libraries

We’ll use a few Python libraries. Install them with:

pip install pandas numpy scikit-learn openai tqdm

3. OpenAI API Key

To generate embeddings, you’ll need an API key from OpenAI.

  1. Go to the API Keys page on the OpenAI platform
  2. Create an API key (button: “Create new secret key”)
  3. Copy it and keep it safe

4. Basic Command Line Usage

You don’t need advanced knowledge, just a few basics:

  • Navigate to a folder:
cd path_to_your_folder
  • Run a Python script:
python script_name.py

5. Project Folder

Create a simple folder for this project, for example, “keyword-clustering”. Inside it, we’ll keep datasets (CSV files), Python scripts, and final outputs.

2. Dataset: Preparing Keywords

Before we can cluster anything, we need a dataset of keywords. In a real project, this usually comes from tools like Google Search Console, Ahrefs, or internal search data. For this guide, we’ll use a generated beauty dataset so you can easily reproduce every step.

Our dataset is a simple CSV file with three columns:

| keyword | clicks | impressions |
| --- | --- | --- |
| best foundation for oily skin | 45 | 1200 |
| lipstick matte red | 30 | 900 |
| perfume for women floral | 60 | 1500 |
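If you want to sanity-check the structure before running anything, here is a minimal sketch that loads the same three columns from a toy CSV inlined as a string (so it runs without the download):

```python
import io
import pandas as pd

# toy rows mirroring the table above, inlined for illustration
csv_text = """keyword,clicks,impressions
best foundation for oily skin,45,1200
lipstick matte red,30,900
perfume for women floral,60,1500
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())      # ['keyword', 'clicks', 'impressions']
print(df["impressions"].sum())  # 3600
```

With the real file, you would simply replace the `io.StringIO(...)` argument with the CSV path.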

3. Clean Keywords (Handle Intent Noise)

At this stage, the goal is to make clustering focus on product meaning, not search intent.

Real keywords often include words like:

  • buy
  • best
  • price
  • review

If we use these as-is, clustering tends to group keywords by intent instead of product type.

We don’t remove these keywords from the dataset. Instead, we create a cleaned version of each keyword that ignores intent words during clustering. We’ll call this column “keyword_clean”.

| keyword | keyword_clean |
| --- | --- |
| buy foundation | foundation |
| best foundation for oily skin | foundation for oily skin |

Create clean_keywords.py

  1. Enable file extensions in Windows: in Windows Explorer → View → enable File name extensions
  2. In your project folder: Right-click → New → Text Document
  3. Name the file clean_keywords.py (make sure it does not end with .txt)
  4. Open the file in an editor. I use Notepad++
  5. Paste the script below and save. Make sure you replace the dataset name in the script with your file name.
import pandas as pd
import re

# load dataset
df = pd.read_csv("beauty_keywords_10k.csv")

def clean_keyword(kw):
    kw = kw.lower()

    # words to ignore during clustering
    kw = re.sub(r'\b(buy|best|price|review|cheap|sale|online|shop)\b', '', kw)

    # clean extra spaces
    kw = re.sub(r'\s+', ' ', kw).strip()

    return kw

# create new column
df["keyword_clean"] = df["keyword"].apply(clean_keyword)

# save result
df.to_csv("keywords_cleaned.csv", index=False)

print("Done! Created keywords_cleaned.csv")

Now your Python script is ready to run. First, open Command Prompt (or Terminal) and navigate to the folder where your script is saved. Then run the script:

python clean_keywords.py

This will create a new file keywords_cleaned.csv with cleaned keywords in your folder.

4. Generate Embeddings

Now we convert keywords into a format that allows us to group them by meaning.

Embeddings are numerical representations of text. Keywords with similar meaning will have similar vectors, which makes clustering possible.
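As a toy illustration of “similar meaning, similar vectors” (the 3-dimensional vectors below are made up; real text-embedding-3-small vectors have 1,536 dimensions), cosine similarity measures how close two vectors point:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors standing in for real embeddings
lipstick = [0.9, 0.1, 0.2]
matte_lipstick = [0.85, 0.15, 0.25]
perfume = [0.1, 0.9, 0.3]

print(cosine_similarity(lipstick, matte_lipstick))  # close to 1
print(cosine_similarity(lipstick, perfume))         # noticeably lower
```

Clustering algorithms exploit exactly this: keywords whose vectors point in similar directions end up in the same group.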

Create embeddings.py

Create a new file: embeddings.py

import pandas as pd
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

# load cleaned dataset
df = pd.read_csv("keywords_cleaned.csv")

keywords = df["keyword_clean"].tolist()

embeddings = []

for kw in keywords:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=kw
    )
    embeddings.append(response.data[0].embedding)

# add embeddings to dataframe
df["embedding"] = embeddings

# save result
df.to_csv("keywords_with_embeddings.csv", index=False)

print("Done! Created keywords_with_embeddings.csv")

Run the script

python embeddings.py

You’ll get a new file keywords_with_embeddings.csv. The script may take some time to run depending on the size of your dataset. In my case, it took about 15 minutes.

Each keyword now has an embedding vector. We will use these vectors in the next step to group keywords into clusters.
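If the runtime becomes a problem, one way to speed things up (a sketch, not part of the original script) is batching: the OpenAI Embeddings API accepts a list of strings per request. The helper below only handles the chunking; the commented loop shows where it would slot into embeddings.py.

```python
def batched(items, batch_size=100):
    """Split a list into consecutive chunks, preserving input order."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Inside embeddings.py, the per-keyword loop could become:
#
# embeddings = []
# for batch in batched(keywords, batch_size=100):
#     response = client.embeddings.create(
#         model="text-embedding-3-small",
#         input=batch,  # a list of strings is accepted
#     )
#     # response.data preserves the order of the inputs
#     embeddings.extend(item.embedding for item in response.data)

print(batched(["a", "b", "c", "d", "e"], batch_size=2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Batches of 100 mean 100× fewer HTTP round trips, which is where most of the waiting time goes.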

5. Cluster Keywords (K-means)

Now that each keyword has an embedding, we can group them into clusters based on meaning.

We’ll use K-means, a simple and effective algorithm that groups similar vectors together.

Create script clustering.py

Create a new file: clustering.py

import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.cluster import KMeans

# load data
df = pd.read_csv("keywords_with_embeddings.csv")

# parse embeddings stored as strings back into Python lists
# (literal_eval is a safer alternative to eval for this)
embeddings = df["embedding"].apply(literal_eval).tolist()
X = np.array(embeddings)

# number of clusters
k = 200

# run K-means
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)

# save result
df.to_csv("keywords_clustered.csv", index=False)

print("Done! Created keywords_clustered.csv")

Run the script

python clustering.py

You’ll get a new file keywords_clustered.csv, where each keyword is assigned to a cluster.

6. Analyze Clusters

At this point, each keyword has a cluster number, but the numbers themselves don’t tell us much. We need to look inside the clusters to understand what they represent.

Check cluster sizes

Create a new file: analyze_clusters.py

import pandas as pd

df = pd.read_csv("keywords_clustered.csv")

cluster_sizes = df.groupby("cluster").size().sort_values(ascending=False)

print(cluster_sizes.head(10))

Run it:

python analyze_clusters.py

This will show the largest clusters in your dataset. In my run, the biggest were clusters 5, 42, and 16.

Each number represents a cluster ID, and the value shows how many keywords belong to that cluster.

Inspect a cluster

Create another file: inspect_cluster.py

Add this script:

import pandas as pd

df = pd.read_csv("keywords_clustered.csv")

cluster_id = 5  # change this number

sample = df[df["cluster"] == cluster_id].sort_values("impressions", ascending=False)

print(sample[["keyword", "impressions"]].head(20))

Run it:

python inspect_cluster.py

What to look for

  • Good clusters have a clear topic
    Example: lipstick, matte lipstick, red lipstick
  • Weak clusters contain mixed or unrelated keywords

Cluster 5 in my run is a good example of a high-quality cluster: all keywords are centered around the same product, makeup brushes.

Even though the queries include different modifiers such as “best”, “cheap”, “professional”, and brand names, the core meaning remains consistent. This shows that clustering is working as expected and grouping keywords by product rather than intent.

You can repeat this process for other clusters by changing the cluster_id and rerunning the script to review additional groups.

7. Improve Results

At this stage, you will likely see that some clusters are very clean, while others are mixed or too broad. This is normal.

Clustering is an iterative process, and small adjustments can significantly improve the results.

Adjust the Number of Clusters

One of the most important parameters is the number of clusters (k).

  • Too few clusters → different products get mixed together
  • Too many clusters → very small or fragmented groups

Try increasing or decreasing k in your clustering.py script:

k = 200

Rerun the clustering and compare the results.
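If you want a number to guide the comparison rather than only eyeballing clusters, the silhouette score is a common choice. This is a sketch using synthetic blobs in place of the real embedding matrix (in practice you would build X from keywords_with_embeddings.csv exactly as in clustering.py):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the embedding matrix: 600 points around 4 centers
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    # silhouette ranges from -1 to 1; higher = tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

On real keyword embeddings the absolute scores are much lower and the curve is flatter, so treat it as a tiebreaker between candidate k values, not an oracle.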

Iterate and Compare

There is no single perfect setup. The goal is to reach clusters that are:

  • consistent in meaning
  • useful for SEO
  • easy to interpret

Make one change at a time, rerun the scripts, and review the output.

8. Auto-Label Clusters with AI

At this point, each cluster contains a group of related keywords, but we still need to understand what each cluster represents.

Manually naming clusters does not scale well, especially when you have dozens or hundreds of them. Instead, we can use AI to generate labels automatically.

Create script label_clusters.py

Create a new file: label_clusters.py

import pandas as pd
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

df = pd.read_csv("keywords_clustered.csv")

results = []

cluster_ids = df["cluster"].unique()

for cluster_id in cluster_ids:
    group = (
        df[df["cluster"] == cluster_id]
        .sort_values("impressions", ascending=False)
        .head(15)
    )

    keywords = group["keyword_clean"].dropna().tolist()

    if not keywords:
        continue

    prompt = f"""
    Here are keywords from one cluster:
    {keywords}

    1. Give a short cluster name (2-4 words)
    2. Classify as one of:
       - Category (product group)
       - SEO page (modifier or intent)
       - Misc (unclear / mixed)

    Answer EXACTLY in format:
    Name: ...
    Type: ...
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}]
        )

        text = response.choices[0].message.content

        name = ""
        type_ = ""

        for line in text.split("\n"):
            line = line.strip()

            if line.lower().startswith("name"):
                name = line.split(":", 1)[-1].strip()

            elif line.lower().startswith("type"):
                type_ = line.split(":", 1)[-1].strip()

        if not name:
            name = text[:60]

        if not type_:
            type_ = "Unknown"

        results.append({
            "cluster": cluster_id,
            "name": name,
            "type": type_
        })

    except Exception as e:
        print(f"Error on cluster {cluster_id}: {e}")
        continue

labels_df = pd.DataFrame(results)
labels_df.to_csv("cluster_labels.csv", index=False)

print("Done! Created cluster_labels.csv")

Run the script

python label_clusters.py

You’ll get a new file cluster_labels.csv with cluster names and types. Here is what the labeled clusters may look like:

| cluster | name | type |
| --- | --- | --- |
| 150 | Luxury Makeup Brushes | Category |
| 53 | Everyday Natural Primers | SEO page |
| 158 | Setting Spray Types | Category (product group) |
| 100 | Makeup Kits | Category |

Each cluster now has:

  • a name that describes the topic
  • a type that helps decide how to use it

9. Build Final Dataset

Now we combine everything into one easy-to-work-with file by merging the keywords, their cluster assignments, and the cluster names and types into a single dataset.

Create script merge_clusters.py

Create a new file: merge_clusters.py

import pandas as pd

# load clustered keywords
df_keywords = pd.read_csv("keywords_clustered.csv")

# load cluster labels
df_labels = pd.read_csv("cluster_labels.csv")

# merge on cluster column
df_final = df_keywords.merge(df_labels, on="cluster", how="left")

# keep useful columns
df_final = df_final[
    ["keyword", "clicks", "impressions", "cluster", "name", "type"]
]

# sort for convenience
df_final = df_final.sort_values(
    ["cluster", "impressions"], ascending=[True, False]
)

# save result
df_final.to_csv("keywords_final.csv", index=False)

print("Done! Created keywords_final.csv")

Run the script

python merge_clusters.py

You’ll get a new file keywords_final.csv.

Each keyword now has its cluster, cluster name, and cluster type.

| keyword | clicks | impressions | cluster | name | type |
| --- | --- | --- | --- | --- | --- |
| lipstick for sensitive skin | 25 | 353 | 0 | Lip Products for Sensitive Skin | Category |
| lip glosss for sensitive skin | 45 | 277 | 0 | Lip Products for Sensitive Skin | Category |
| professional lip glosss for sensitive skin | 10 | 255 | 0 | Lip Products for Sensitive Skin | Category |

This is your final dataset and the main output of the clustering process. At the end of this process, you move from a raw list of keywords to a structured dataset grouped by meaning.

AI makes this process much easier. Previously, keyword clustering relied on matching specific words, which made it difficult to group synonyms or variations. With embeddings, keywords are grouped by meaning automatically, so similar queries are clustered together even if they use different wording.

10. Limitations & Real-World Notes

  • Clustering is not perfect. Some clusters will be mixed and require manual review.
  • Results depend on input quality. Cleaner keywords lead to better clusters.
  • There is no single correct number of clusters. You need to experiment.
  • AI labeling is helpful but not always accurate and may need adjustments.
  • Real datasets are more complex and may include multiple languages, brands, and noise.
  • Clustering works best when combined with human judgment.
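To make that manual-review step concrete, here is a small sketch that flags clusters worth a human look: very small ones and anything the AI labeled "Misc". The flag_for_review helper is my own illustration, not part of the scripts above, and the toy DataFrame stands in for keywords_final.csv.

```python
import pandas as pd

def flag_for_review(df, min_size=5):
    """Mark keywords in tiny clusters or Misc-typed clusters for manual review."""
    sizes = df.groupby("cluster")["keyword"].transform("size")
    is_misc = df["type"].str.contains("Misc", case=False, na=False)
    return df.assign(needs_review=(sizes < min_size) | is_misc)

# toy stand-in for keywords_final.csv
df = pd.DataFrame({
    "keyword": ["red lipstick", "matte lipstick", "buy stuff", "random query"],
    "cluster": [0, 0, 7, 9],
    "type": ["Category", "Category", "Misc", "SEO page"],
})

flagged = flag_for_review(df, min_size=2)
print(flagged[["keyword", "needs_review"]])
```

On the real dataset you would load keywords_final.csv instead, then review only the flagged rows rather than all 10,000 keywords.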