eCommerce Lever

On-Site Search Quality in eCommerce

Alla Vovnenko on April 15, 2026
On-site Search
7 Min Read

Table of Contents

  1. Two Levels of Search Evaluation
  2. The Three Criteria of Good On-Site Search
    2.1 Completeness
    2.2 Ranking
    2.3 Relevance
  3. How to Evaluate Search Quality in Practice
  4. What “Good Enough” Looks Like

On-site search plays very different roles depending on the type of eCommerce business.

In smaller stores and in categories like apparel, search is often used sparingly. Many users prefer browsing through categories, filters, and collections. In some cases, search usage can stay in the low double digits or even below 10% of sessions.

But as catalogs grow, behavior changes. In large eCommerce sites with thousands or millions of products, search becomes a primary navigation method. It’s common to see 30–60% of users relying on search, especially when they know what they are looking for or when navigation becomes too complex.

This shift matters because expectations change with it. Search is not just another feature; it’s a direct expression of intent. And because of that, even small issues in search quality become immediately visible.

When users browse, they explore.
When users search, they expect precision.

So, how do you measure the quality of internal search results?

Two Levels of Search Evaluation

Evaluating on-site search works best when separated into two distinct stages.

First stage

The first stage focuses on the search system itself. The goal is to understand whether the results are correct and whether they fully reflect the product catalog. This goes beyond simple keyword matching. It includes how well search captures intent — for example, whether “running shoes” returns appropriate products even if naming varies across the catalog.

At this stage, evaluation is manual and relies on a strong understanding of the assortment. The question is straightforward: does search return what it should?

Second stage

Once this baseline is established, the second stage looks at how users interact with those results. This is where behavioral data from analytics becomes useful. Metrics help identify patterns, edge cases, and opportunities for refinement.

To make the distinction clearer:

| Aspect | Level 1: Search Quality | Level 2: Behavioral Performance |
| --- | --- | --- |
| Goal | Validate correctness of results | Improve performance based on user behavior |
| Focus | Completeness, ranking, relevance | CTR, conversions, zero-result rate |
| Input | Product catalog knowledge, real queries | Analytics data, user interactions |
| Nature | Manual, qualitative evaluation | Quantitative, data-driven analysis |
| Key question | Does search return the right products? | How do users interact with results? |
| When to use | First, as a foundation | After search quality is reliable |

This separation helps keep the evaluation grounded. Search first needs to work as a system before it can be optimized as a performance channel.

The Three Criteria of Good On-Site Search

These three criteria are not evaluated in parallel. They follow a specific order.

The foundation is completeness. First, it is necessary to ensure that all relevant products are present in the results.

Once that baseline is reached, the focus shifts to ranking, or how those products are ordered.

Only after both are reliable does it make sense to look at relevance more strictly, refining which products should or should not appear at all.

Each step builds on the previous one, so the sequence matters.

Completeness

Completeness is met when all products relevant to a search query are present in the results.

This is the foundation of search quality. If relevant items are missing, the system cannot be considered reliable.

Evaluating completeness requires a strong understanding of the product catalog. It is necessary to know what should appear for a given query, including variations in naming, attributes, and categorization.

In practice, this means testing multiple queries and comparing the results against the actual assortment. The goal is to confirm that search consistently surfaces the full set of relevant products, not just a portion of them.

Different types of queries should be used during testing, for example:

  • product names;
  • SKUs;
  • product groups;
  • categories and subcategories;
  • product types;
  • combinations like product group plus attribute;
  • models;
  • brands.

Testing across these variations helps ensure that search works reliably for different ways users express intent, not just for a narrow set of queries.
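This kind of completeness testing can be sketched as a small script. The catalog, the naive keyword-matching `search()` function, and the expected SKU sets below are all hypothetical stand-ins for a real engine and real catalog knowledge; the point is the check itself, which reports any expected product missing from the results.

```python
# Completeness check: every expected product must appear in the results.
# CATALOG, search(), and the expected SKUs are illustrative placeholders.

CATALOG = [
    {"sku": "RS-01", "name": "Trail running shoes", "category": "running shoes"},
    {"sku": "RS-02", "name": "Road running sneakers", "category": "running shoes"},
    {"sku": "HB-01", "name": "Hiking boots", "category": "hiking"},
]

def search(query):
    """Naive keyword match over name and category (placeholder engine)."""
    terms = query.lower().split()
    return [p for p in CATALOG
            if all(t in (p["name"] + " " + p["category"]).lower() for t in terms)]

def check_completeness(query, expected_skus):
    """Return the set of expected SKUs missing from the results."""
    found = {p["sku"] for p in search(query)}
    return set(expected_skus) - found

# "running shoes" should surface both running products, including the one
# named "sneakers" -- naming variations are exactly what this test catches.
missing = check_completeness("running shoes", {"RS-01", "RS-02"})
print(missing)  # set() -> nothing missing
```

In practice, `search()` would call the live engine, and the expected sets would come from manual review of the assortment.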

Ranking

Ranking defines the order in which products appear in search results.

Once completeness is achieved, the focus shifts to how those results are organized. All relevant products may be present, but if the most relevant ones are buried lower in the list, the overall quality of search is still poor.

The goal of ranking is to ensure that products that best match the query appear at the top. Less relevant items can still be present, but they should not compete with stronger matches.

For example, for the query “black running shoes”:

  • black running shoes should appear at the top
  • running shoes in other colors may appear lower

All of these products can be relevant to some extent, but their position should reflect how closely they match the query.

Evaluating ranking follows the same approach as completeness. It requires testing queries and comparing the order of results against expectations based on product knowledge. The key question is not whether the right products exist in the list, but whether they appear in the right positions.

Well-functioning ranking makes search feel accurate immediately, without requiring users to scan or refine results.
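One way to spot-check ranking is to assert that the best-matching products occupy the leading positions. The sketch below assumes results come back as an ordered list of SKUs; the helper and the sample data for "black running shoes" are hypothetical.

```python
def check_ranking(result_skus, expected_top):
    """True if the best-matching SKUs fill the top positions, in any order."""
    k = len(expected_top)
    return set(result_skus[:k]) == set(expected_top)

# Hypothetical ordered results for the query "black running shoes":
results = ["BRS-01", "BRS-02", "WRS-01", "RRS-01"]  # black first, other colors after

print(check_ranking(results, {"BRS-01", "BRS-02"}))  # True: black shoes lead
print(check_ranking({"then": 0} and ["BRS-01", "WRS-01", "BRS-02"], {"BRS-01", "BRS-02"}))  # False: one black shoe buried
```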

Relevance

Relevance defines which products should not appear in the results.

After completeness and ranking are in place, the focus shifts to refining the result set by removing items that do not match the user’s intent. Even if a product shares keywords with the query, it does not necessarily mean it belongs in the results.

For example, for the query “charcoal”:

  • products like charcoal bags or briquettes are relevant
  • charcoal grills may appear due to keyword match, but they do not match the intent and should be excluded

This distinction is important because keyword matching alone often introduces noise. Without filtering irrelevant results, the search may technically return matching items but still feel inaccurate.

Evaluating relevance requires understanding what the user is actually looking for behind the query, not just how the query maps to product data. The goal is to keep the result set focused, so every product shown is a valid answer to the search.
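A relevance check works in the opposite direction: instead of looking for missing products, it flags results that should not be there. The per-query exclusion list below is a hypothetical, manually curated artifact for the "charcoal" example; real engines typically express this through synonym rules, negative boosts, or curations.

```python
# Flag results that match keywords but not intent.
# The exclusion list is a hypothetical manual curation step.
EXCLUDED = {
    "charcoal": {"GRILL-01"},  # charcoal grills match the keyword, not the intent
}

def irrelevant_results(query, result_skus):
    """Return SKUs that appear in results but are excluded for this query."""
    return [sku for sku in result_skus if sku in EXCLUDED.get(query, set())]

raw_results = ["COAL-01", "COAL-02", "GRILL-01"]  # raw keyword matches
print(irrelevant_results("charcoal", raw_results))  # ['GRILL-01'] should be excluded
```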

How to Evaluate Search Quality in Practice

The evaluation process starts with a fixed set of queries.

This list should cover different query types and reflect how users search across the catalog. It becomes a baseline for testing and should be reused consistently. Every time search settings are adjusted, the same queries are used to review how results change.

The next step is iterative testing.

Search engines usually provide multiple ways to tune results. The most impactful controls are typically related to fields and their values, such as product name, attributes, categories, or other structured data. Adjusting how these fields are weighted or interpreted directly affects completeness, ranking, and relevance.
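In engines such as Elasticsearch, field weighting is typically expressed as boosts on a multi-field query. The query body below is only an illustration: the field names assume a hypothetical catalog schema, and the boost values are not tuned recommendations.

```python
# Illustrative Elasticsearch query body: weight the product name above
# category, and category above description. Field names and boost values
# are assumptions about a hypothetical schema, not recommendations.
query_body = {
    "query": {
        "multi_match": {
            "query": "black running shoes",
            "fields": ["name^3", "category^2", "description"],
        }
    }
}

print(query_body["query"]["multi_match"]["fields"])
```

Changing these boosts and re-running the fixed query set is one of the most direct levers for shifting completeness, ranking, and relevance at once.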

The process is straightforward:

  • define a set of queries;
  • review results;
  • adjust search settings;
  • review results again.

This cycle is repeated until results consistently meet the three criteria.

The goal is not to rely on a single change, but to gradually shape the search system through controlled iterations, using the same queries as a reference point.
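The review loop above can be automated as a small regression harness: run the fixed query set after each settings change and report which queries fail which criterion. The `search()` stub, expectations, and SKUs below are placeholders for real engine calls and real catalog knowledge.

```python
# Minimal regression harness for a fixed query set.
# search(), the expectations, and all SKUs are illustrative placeholders.

FIXED_QUERIES = {
    # query: (expected SKUs, expected top SKUs, excluded SKUs)
    "running shoes": ({"RS-01", "RS-02"}, {"RS-01"}, set()),
    "charcoal": ({"COAL-01"}, {"COAL-01"}, {"GRILL-01"}),
}

def search(query):
    """Placeholder: in practice this would call the live search engine."""
    fake = {"running shoes": ["RS-01", "RS-02"],
            "charcoal": ["COAL-01", "GRILL-01"]}
    return fake.get(query, [])

def run_suite():
    """Return (query, failed_criterion) pairs for the fixed query set."""
    failures = []
    for q, (expected, top, excluded) in FIXED_QUERIES.items():
        results = search(q)
        if not expected <= set(results):
            failures.append((q, "completeness"))
        if not top <= set(results[:len(top)]):
            failures.append((q, "ranking"))
        if excluded & set(results):
            failures.append((q, "relevance"))
    return failures

print(run_suite())  # [('charcoal', 'relevance')] -- the grill still leaks in
```

An empty failure list after a tuning change means the change did not regress any query in the baseline set.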

What “Good Enough” Looks Like

Search is not something that can be fully completed and left unchanged.

Product catalogs evolve, new items are added, attributes change, and the way users search also shifts over time. In large catalogs, there can be thousands or even millions of unique queries. Because of that, search tuning is an ongoing process with constant opportunities for improvement.

At the same time, a practical baseline is needed.

For large catalogs, a useful rule is to consider search “good” at the first level when all testing queries consistently meet the three criteria: completeness, ranking, and relevance. If the predefined set of queries produces correct results, the system can be considered stable enough to move forward.

There are also edge cases that highlight limitations of search as a tool.

For example, a broad query like “shoes” may correctly return all relevant products. If the catalog contains 300 shoes, search returning all 300 is technically correct, but not useful. Users are unlikely to explore such a large result set.

In situations like this, search alone is not the right solution. Broad queries are better handled as structured landing pages with filters and navigation options.

This is why working with on-site search goes beyond tuning results. It requires continuous adjustments and, in some cases, rethinking how certain queries should be handled altogether.


About Me

Alla Vovnenko

eCommerce Mechanic

  • Let’s connect on LinkedIn

This blog is based on my experience building and working on eCommerce websites, from scratch to revenue-generating stores. I work across everything eCommerce involves, but especially love analytics, SEO, and conversion optimization, and have a soft spot for on-site search.
