Enhancing Image Similarity Search with an AI Embedding & Image Retrieval System

January 09, 2026 | Doan Thai, MediaX

At MediaX, I focus on researching and developing AI systems that are scalable, robust, and highly applicable to real-world scenarios. In this study, I designed and evaluated an AI-powered image similarity search system based on visual embedding models, targeting use cases such as product lookup, image inspection, and intelligent operational support.

System Overview

The system is built on an image embedding + vector search architecture, where each product image is transformed into a high-dimensional feature vector representing its visual characteristics. These vectors are efficiently stored and queried using a vector database, enabling fast retrieval of visually similar images.

I employed the visual encoder of CLIP (Contrastive Language–Image Pre-training) to generate image embeddings, combined with ChromaDB for vector storage and similarity search. Product metadata is managed in parallel using MongoDB, ensuring scalability and seamless integration with the existing backend infrastructure.
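As a concrete illustration, the following minimal Python sketch shows how such a pipeline can be wired together: each image is encoded into a normalized CLIP feature vector, stored in a ChromaDB collection, and retrieved by cosine similarity. The model checkpoint, collection name, and file paths are illustrative assumptions, not the exact production configuration.

```python
import chromadb
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_image(path: str) -> list[float]:
    """Encode one image into an L2-normalized CLIP feature vector."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # Unit-normalize so cosine distance in the index behaves as expected.
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].tolist()

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    name="product_images", metadata={"hnsw:space": "cosine"}
)

# Index an original product image; the ID doubles as the key into MongoDB metadata.
collection.add(ids=["nike_0001"], embeddings=[embed_image("nike_0001.jpg")])

# Query with any image to retrieve the most visually similar originals.
hits = collection.query(query_embeddings=[embed_image("query.jpg")], n_results=5)
print(hits["ids"][0])
```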

Experimental Setup

The image dataset was organized into product groups (Nike, Adidas, Converse), which served as reference labels during evaluation. For each original image, multiple augmented versions were generated—including rotations, flips, color adjustments, and lighting variations—to simulate common real-world visual changes.
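For illustration, an augmentation pipeline of this kind can be built with torchvision; the specific transforms and parameter ranges below are assumptions rather than the exact values used in the experiments.

```python
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),    # flips
    transforms.RandomRotation(degrees=15),     # rotations
    transforms.ColorJitter(brightness=0.3,     # lighting variations
                           contrast=0.3,
                           saturation=0.2,     # color adjustments
                           hue=0.05),
])

original = Image.open("nike_0001.jpg").convert("RGB")
variants = [augment(original) for _ in range(5)]  # e.g. five variants per image
```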

These augmented images were used as query inputs, while the vector database contained only the original images. The evaluation objective was to measure the system’s ability to retrieve the exact corresponding original image at the top-1 position, thereby assessing the stability and robustness of the embedding representation under visual perturbations.

Evaluation Results

Retrieval Stability (Top-1 Augmented Self-Retrieval Accuracy)

The evaluation was conducted on a subset of 700 original product images, evenly sampled across three brands (Nike, Adidas, Converse). For each original image, five augmented variants were generated using transformations such as horizontal flipping, rotation, color jittering, and illumination changes, resulting in a total of 3,500 augmented query images.

With only the original images indexed in the vector database, a retrieval was counted as correct only if the top-1 result was the exact original from which the query variant was derived, giving a direct measure of the system's robustness under visual perturbations.
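Reusing the embed_image helper and ChromaDB collection from the earlier sketch, the evaluation loop reduces to a top-1 check per augmented query; the file-naming convention that links each variant to its original is a hypothetical assumption.

```python
from pathlib import Path

queries = sorted(Path("augmented").glob("*.jpg"))
correct = 0
for query_path in queries:
    # Recover the original ID from the (assumed) "<original_id>_augN.jpg" naming.
    original_id = query_path.stem.rsplit("_aug", 1)[0]
    hit = collection.query(
        query_embeddings=[embed_image(str(query_path))], n_results=1
    )
    if hit["ids"][0][0] == original_id:
        correct += 1

print(f"Top-1 self-retrieval accuracy: {correct / len(queries):.3f}")
```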

The system achieved consistently high retrieval accuracy across all product groups:

  • Nike: ~0.97
  • Adidas: ~0.97
  • Converse: ~0.96

These results demonstrate that the embedding model maintains strong visual identity preservation, even when input images undergo common real-world variations such as flipping, color shifts, and lighting changes.

Error Analysis

For every failed retrieval, the original image and its augmented query were logged side by side for inspection (a small sketch of this logging step follows the list below). Analysis of these failures revealed three primary causes:

  • overly aggressive augmentation that removed key visual features
  • extremely high visual similarity between different product samples
  • color noise or contrast distortions that reduced local feature discriminability
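The logging step mentioned above can be as simple as copying each failed query next to its ground-truth original; the directory layout and file naming here are illustrative assumptions.

```python
import shutil
from pathlib import Path

def log_failure(query_path: Path, original_id: str, retrieved_id: str) -> None:
    """Copy a failed query next to its true original for visual inspection."""
    out_dir = Path("failures") / f"{original_id}__retrieved_{retrieved_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(query_path, out_dir / query_path.name)
    shutil.copy(Path("originals") / f"{original_id}.jpg", out_dir / "original.jpg")
```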

Example images with incorrect search results:

[Figure: three incorrect retrieval examples: visually similar footwear, a product flat lay, and a reflective sneaker]

Discussion and Future Work

The experimental results demonstrate that MediaX’s AI image search system is robust, generalizable, and well-suited for real-world image retrieval tasks. Nevertheless, several directions for further improvement have been identified.

One promising enhancement involves integrating an object detection model such as YOLO to perform automatic object localization and cropping prior to embedding generation. By isolating the main product region and reducing background noise, this approach is expected to improve embedding consistency and retrieval accuracy, particularly in scenarios with cluttered backgrounds or varying shooting conditions.
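A minimal sketch of this detect-then-crop step, assuming the Ultralytics YOLO interface and a generic pretrained checkpoint (neither is confirmed as part of the current system):

```python
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # placeholder weights, not a confirmed choice

def crop_main_object(path: str) -> Image.Image:
    """Crop to the highest-confidence detection; fall back to the full image."""
    image = Image.open(path).convert("RGB")
    boxes = detector(path)[0].boxes
    if len(boxes) == 0:
        return image
    best = int(boxes.conf.argmax())
    x1, y1, x2, y2 = boxes.xyxy[best].tolist()
    return image.crop((x1, y1, x2, y2))

# The crop would then replace the full frame as input to embed_image.
```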

Additional future improvements include:

  • refining augmentation strategies to better balance robustness and preservation of critical visual features
  • evaluating higher-resolution or more expressive embedding models to capture fine-grained details
  • incorporating additional metadata signals (e.g., brand, category, attributes) for result re-ranking and disambiguation (a minimal sketch follows this list)
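As a rough illustration of that last bullet, a re-ranking pass could join ChromaDB candidates against metadata stored in MongoDB; the database name, collection, field names, and the weighting constant are all hypothetical.

```python
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
products = mongo["mediax"]["products"]  # hypothetical database/collection names

def rerank(ids: list[str], distances: list[float], query_brand: str) -> list[str]:
    """Nudge candidates whose stored brand matches the query's brand upward."""
    scored = []
    for pid, dist in zip(ids, distances):
        doc = products.find_one({"_id": pid}) or {}
        bonus = -0.1 if doc.get("brand") == query_brand else 0.0  # illustrative weight
        scored.append((dist + bonus, pid))
    return [pid for _, pid in sorted(scored)]
```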

In subsequent phases, this research will be extended toward robustness benchmarking under diverse visual conditions, fine-grained similarity analysis, and end-to-end performance optimization for production environments, with the goal of delivering a scalable and reliable visual search solution.

This research is part of MediaX’s broader strategy to develop practical, scalable computer vision AI solutions, aimed at building intelligent systems with long-term applicability and impact.

Find out more about our research.