What is pgvector?

pgvector is an open-source extension for PostgreSQL that provides efficient storage, retrieval, and similarity search of vector data. It is particularly useful for applications involving semantic search, recommendation systems, and natural language processing, which are critical for AI-driven applications, including Large Language Models (LLMs).

CockroachDB introduced vector search capabilities in the 24.2 release. This implementation uses the same interface as pgvector and aims to be compatible with pgvector’s API. This compatibility makes it easier to integrate with tools like LangChain and Hugging Face and to incorporate data into AI models.

The integration of vector search in CockroachDB combines the strengths of a vector database and an operational database into a single, horizontally scalable solution. This approach simplifies the architecture and eliminates the operational and financial costs associated with maintaining a dedicated vector database. Additionally, CockroachDB’s distributed SQL engine allows for complex SQL operations on vector data, enhancing the performance and efficiency of AI and machine learning workloads.

Concretely, pgvector adds vector data types and vector similarity search to PostgreSQL. This suits any domain where data is commonly represented as vectors (such as embeddings), including machine learning and natural language processing.

Key features of pgvector include:

  • Vector Data Type: It introduces a new data type for storing vectors (arrays of floating-point numbers) directly in PostgreSQL tables.

  • Similarity Search: It provides operators and functions for performing similarity searches on vector data, such as finding the nearest neighbors based on various distance metrics (e.g., Euclidean distance, inner product, cosine distance).

  • Indexing: It supports index types (HNSW and IVFFlat) that speed up similarity searches through approximate nearest-neighbor lookup, making it efficient to query large datasets.
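
As a rough illustration of these three features, the sketch below holds the corresponding pgvector SQL statements as Python strings and formats a Python list as a vector literal. The table name `items`, the column name `embedding`, the 3-dimensional size, and the `%s` placeholder style (as used by psycopg-style drivers) are assumptions made for the example; running the statements would require a PostgreSQL server with the extension installed, which is not shown here.

```python
# Illustrative only: `items`, `embedding`, and vector(3) are assumed names
# and sizes, not taken from the article.

def to_vector_literal(values):
    """Format a sequence of numbers as a pgvector text literal such as '[1.0,2.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# 1. Vector data type: a column that stores fixed-length float vectors.
CREATE_TABLE = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
"""

# 2. Similarity search: `<->` is pgvector's Euclidean (L2) distance operator.
#    The placeholder would be filled with a vector literal by the driver.
NEAREST_NEIGHBORS = (
    "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5;"
)

# 3. Indexing: an approximate nearest-neighbor HNSW index for L2 distance.
CREATE_INDEX = "CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);"

query_param = to_vector_literal([3, 1, 2])
print(query_param)
```

The string held in `query_param` is what a driver would pass in place of the `%s` placeholder in `NEAREST_NEIGHBORS`.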

What is vector search?

Vector search refers to the process of finding vectors in a dataset that are similar to a given query vector. Vector search is a powerful tool for handling high-dimensional data and is widely used in modern AI and data-driven applications to provide relevant and accurate results based on the inherent similarities in the data. 

Common applications of vector search

Vector search is essential in applications where data is represented as high-dimensional vectors, such as:

  • Machine Learning: Finding similar data points in embedding spaces, such as word embeddings, image embeddings, or user embeddings.

  • Natural Language Processing (NLP): Searching for semantically similar text documents, sentences, or words based on their vector representations.

  • Recommendation Systems: Identifying items (e.g., products, movies) that are similar to a user's preferences, represented as vectors.

  • Computer Vision: Finding images that are visually similar to a query image based on their feature vectors.

How does vector search work?

Vector search typically involves the following steps:

  • Vector Representation: Data is converted into vector representations using techniques like word embeddings (e.g., Word2Vec, GloVe), sentence embeddings (e.g., BERT), or image embeddings (e.g., convolutional neural networks).

  • Similarity Metric: A similarity metric (e.g., Euclidean distance, cosine similarity) is chosen to measure the closeness between vectors.

  • Indexing: Efficient indexing structures (e.g., KD-trees, ball trees, HNSW) are used to speed up the search process, especially for large datasets.

  • Querying: The query vector is compared against the indexed vectors to find the most similar ones based on the chosen similarity metric.
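
The four steps above can be sketched in plain Python. This is a minimal sketch under loud assumptions: the "embedding" is a toy word count over a small fixed vocabulary standing in for a learned model, the documents are made up, and the indexing step is replaced by an exact linear scan.

```python
import math

# Step 1 - vector representation: a toy bag-of-words "embedding" over a fixed
# vocabulary. Real systems would use a learned model; this is a stand-in.
VOCAB = ["database", "vector", "search", "movie", "music"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

# Step 2 - similarity metric: cosine distance = 1 - cosine similarity.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)

# Step 3 - indexing: skipped here. A linear scan is exact but O(n) per query;
# structures such as HNSW trade some accuracy for speed on large datasets.
documents = [
    "vector search in a database",
    "a movie about music",
    "database backup and restore",
]
corpus = [(doc, embed(doc)) for doc in documents]

# Step 4 - querying: rank every stored vector by distance to the query vector.
def search(query, k=2):
    q = embed(query)
    ranked = sorted(corpus, key=lambda item: cosine_distance(q, item[1]))
    return [doc for doc, _ in ranked[:k]]

print(search("vector database"))
```

For the query "vector database", the document that shares both vocabulary words ranks first, the one that shares only "database" ranks second, and the unrelated document is cut off by `k=2`.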

Why is pgvector commonly used?

pgvector has become one of the most popular extensions for working with vectors, in part because it integrates directly with PostgreSQL. Developers can add vector search to their existing databases without running a separate specialized system, while keeping the robustness, scalability, and familiarity of PostgreSQL. pgvector also benefits from PostgreSQL's large and active community: existing tools, extensions, and best practices continue to apply when working with vector data.

By extending PostgreSQL, pgvector lets users work with vector data using standard SQL queries, which reduces the learning curve and simplifies development. It supports efficient indexing and querying, so it can handle large-scale vector data and perform similarity searches quickly, which is crucial for real-time applications. It also offers several distance metrics and indexing methods, so users can choose the best approach for their specific use case.

pgvector-compatible vector search in CockroachDB

CockroachDB is deeply invested in providing robust database offerings so that organizations can run their business-critical applications at scale. To that end, the Cockroach Labs team continues to focus on expanding generative AI capabilities, leading to the release of vector search features in v24.2. 

Our vector search features make it easier for users to deploy AI-driven applications with CockroachDB. CockroachDB's vector search functionality is compatible with the pgvector extension for PostgreSQL, allowing users to store and manipulate vectors within the database. 

Although CockroachDB does not currently support vector indexing, it allows for the storage of billions of vectors and supports various vector comparison operators, such as Euclidean distance, inner product, and cosine distance, to facilitate similarity searches. This capability enables developers to build fault-tolerant AI applications that leverage vectors stored in CockroachDB, benefiting from its horizontal data distribution and multi-region abstractions.
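
To make the three comparison operators concrete, here is a small Python sketch of the quantities they compute and of an exact (unindexed) nearest-neighbor scan. The operator symbols in the comments follow pgvector, where `<#>` returns the negative inner product so that smaller values always mean "closer"; the sketch assumes the same convention applies here. The stored rows and the query vector are invented for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance (pgvector operator: <->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Negative inner product (pgvector operator: <#>)."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Cosine distance, 1 - cosine similarity (pgvector operator: <=>)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def nearest(query, rows, distance):
    """Exact scan: compare the query against every stored vector."""
    return min(rows, key=lambda key: distance(query, rows[key]))

# Hypothetical stored vectors keyed by row id.
rows = {"x": [2.0, 2.0], "y": [1.0, 0.9]}
query = [1.0, 1.0]

for name, fn in [("l2", l2_distance),
                 ("inner product", negative_inner_product),
                 ("cosine", cosine_distance)]:
    print(name, "->", nearest(query, rows, fn))
```

With these values, Euclidean distance picks row `y` (closest in position), while inner product and cosine distance pick row `x` (same direction, larger magnitude). Which operator fits depends on whether vector magnitude carries meaning in the embedding being used.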

pgvector and Vector Search FAQ

What is pgvector?

pgvector is an open-source PostgreSQL extension for efficient storage, retrieval, and similarity search of vector data. It's useful for semantic search, recommendation systems, and natural language processing, critical for AI applications like Large Language Models (LLMs).

What are the key features of pgvector?

The key features of pgvector include storing vectors in PostgreSQL tables, similarity search, and vector indexing.

What is vector search?

Vector search finds vectors in a dataset similar to a query vector. It's used in AI and data-driven applications to provide relevant results based on data similarities.

What are common applications of vector search?

The common applications of vector search include natural language processing to search for semantically similar text in documents, sentences, or words, and identifying similar items, such as products, movies, books, documents, or videos in recommendation systems.

How does vector search work?

Generally, vector search works by first converting the data into vectors using embeddings, then measuring the closeness between vectors with a similarity metric. Indexing can be used to speed up searches. A query then finds similar vectors by comparing a query vector against the indexed vectors.

Why is pgvector commonly used?

pgvector is popular because it's open-source, integrates with PostgreSQL, and supports efficient indexing and querying. It handles large-scale vector data and performs quick similarity searches, crucial for real-time applications.

What is vector search in CockroachDB?

Vector search entered preview in CockroachDB in v24.2. It finds data points similar to a query by comparing vectors. The feature is compatible with pgvector: it stores and manipulates vectors, which is useful for AI applications like LLMs, and it supports various vector comparison operators for similarity searches.
