VectorDB

Since version 3.00.1, DolphinDB introduces VectorDB, a vector retrieval database system built on the TSDB storage engine. It applies rapid approximate nearest neighbor search (ANNS) to array vector columns. Key features of DolphinDB VectorDB include:

  • Persistent storage of vector indexes, allowing pre-built indexes to be read from disk without rebuilding for each query.

  • Efficient vector similarity search for massive vector data using rapid approximate nearest neighbor searches.

  • Hybrid searches combining keyword-based search algorithms with vector search techniques for comprehensive and precise results.

Implementation

This section introduces the implementation of vector storage, indexing and search in VectorDB.

Vector Storage

In DolphinDB, vector data can be stored as array vectors, a data form natively supported by the TSDB storage engine. This allows for creating a DFS table with vector indexing using TSDB as its underlying architecture for efficient vector data management.

Vector Indexing

Vector indexing is constructed based on Facebook's open-source library, Faiss. When creating a DFS partitioned table, if an index is specified for an array vector column, a vector index is constructed based on the specified index type and stored in each level file. Index types supported in VectorDB and their trade-offs will be explained in the following section.

Vector Search

VectorDB uses Euclidean distance (L2 distance) to measure similarity between high-dimensional vector data. L2 distance is widely used and performs well in low-dimensional spaces, though its effectiveness can deteriorate in high-dimensional spaces due to the curse of dimensionality.

For a vector search query written in the following format, VectorDB selects an appropriate search method to retrieve the K nearest vectors to the query vector.

SELECT {col1,...,colx} 
FROM table [where] 
ORDER BY rowEuclidean(<vectorCol>, queryVector)
LIMIT <TOPK>

Basic Vector Search

If no where clause is specified, the basic search process is as follows:

  1. Iterate through all level files (N in total) on disk and read out vector indexes, obtaining N * K preliminary query results.
  2. Perform brute-force search on data (if any) in the cache engine, yielding an additional K results.
  3. Merge sort the (N + 1) * K query results based on their L2 distance from the query vector in ascending order, and return the K closest results.

Hybrid Search

If where conditions are specified, VectorDB uses hybrid searches combining keyword-based search algorithms with vector search techniques to improve the accuracy and relevance of search results. The process for hybrid search is as follows:

  1. Filter data based on the where conditions.
  2. Divide the selected data by level file.
  3. For each level file, search and obtain query results.
  4. Merge sort these query results and return the K closest results.

Supported Index Types

Vector index is a data structure for efficient vectors retrieval, featuring more time and space-efficient vector similarity query capabilities. Through vector indexes, users can quickly query several vectors most similar to a target vector. Currently, VectorDB supports the following index types:

  1. Flat: Exact nearest neighbor retrieval through brute-force search. Ideal for small datasets or when highest accuracy is required.
  2. PQ (Product Quantization): A lossy algorithm that splits vectors into sub-vectors and applies quantization for memory efficiency. Suitable for retrieving large datasets such as large databases and video libraries.
  3. IVF (Inverted File): Uses k-means clustering to divide vector data into clusters and builds an inverted index for each cluster. Common in image retrieval and text retrieval applications.
  4. IVFPQ (Inverted File Product Quantization): Combines IVF and PQ techniques. Balanced approach for large-scale datasets requiring speed, accuracy, and memory efficiency, such as large-scale recommendation systems or user matching in social networks.
  5. HNSW (Hierarchical Navigable Small World): Builds a multi-level navigable graph structure for efficient search. Ideal for very large datasets with high requirements for speed, accuracy, and dynamic updates, such as real-time recommendation systems, online search, and Retrieval-Augmented Generation (RAG) applications.
Index Index Size Retrieval Speed Accuracy
Flat 100 - 10,000 Slow High
PQ 100,000 - 10M+ Medium Low
IVF 10,000 - 1M+ Fast Medium
IVFPQ 1M - 10M+ Very Fast Low-Medium
HNSW 100M - 1B+ Fastest Medium

Considerations When Using VectorDB

To create a DFS partitioned table in VectorDB, you can use either the createPartitionedTable function or the SQL CREATE statement, and specify the vector column, index type, and vector dimension for the indexes parameter. For detailed function usage, see the function page. Note the following considerations when using VectorDB:

(1) Creating Databases

VectorDB databases can only be created with the TSDB engine, i.e., must set engine="TSDB" when using the database function to create a database.

(2) Creating Tables

  • Vector indexing is supported only for DFS partitioned tables with keepDuplicates = ALL (i.e., without record deduplication).
  • The table can contain only one vector column of FLOAT[] type. Each row in this column represents a vector, and all these vectors must have the same dimension. The dimension (i.e., the number of elements in each vector) is specified by dim in the indexes parameter.
  • The dimension of vectors inserted into the vector column must be consistent with dim.

(3) Searching Vectors

VectorDB will only use the vector indexing to accelerate the retrieval if the following conditions are met, otherwise it will only load data from all level files and perform an exhaustive search.

  • No table joins are used in queries.
  • The order by clause must sort in ascending order and can only use rowEuclidean to compute distances.
  • The first parameter passed to rowEuclidean must be the vector column, i.e., rowEuclidean(<vectorCol>, queryVector).
  • A limit clause must be specified.
  • If a where clause is specified, it must not include any sort columns.
  • The query cannot include clauses such as group by and having.

Note: Vectors in the cache engine do not have indexes and cannot be retrieved using vector indexing. To accelerate retrieving these vectors, you can use flushTSDBCache to write the data from the cache to disk before querying.

Usage Example

// Create a database
db = database(directory="dfs://indexesTest", partitionType=VALUE, partitionScheme=1..10, engine="TSDB")

// Create a table schema with col3 of FLOAT[] type
schematb = table(1:0, `col0`col1`col2`col3, [INT, INT, TIMESTAMP, FLOAT[]])

// Specify a vector index for 'col3'
pt = createPartitionedTable(dbHandle=db, table=schematb, tableName=`pt, partitionColumns=`col0, sortColumns=`col1`col2, indexes={"col3":"vectorindex(type=flat, dim=5)"})

tmp = cj(table(1..10 as col0), cj(table(1..10 as col1), table(now()+1..10 as col2))) join table(arrayVector(1..1000*5,1..5000) as col3)

// Insert data into the table and force flush the data to disk
pt.tableInsert(tmp)
flushTSDBCache()

select * from pt where col2<now() order by rowEuclidean(col3, [1339, 252, 105, 105, 829]) limit 10

Application

Vector retrieval plays a crucial role in RAG (Retrieval-Augmented Generation) systems. By leveraging VectorDB's efficient retrieval capabilities, RAG systems can quickly find relevant information from large knowledge bases, improving the quality and accuracy of generated content. Additionally, VectorDB can enhance recommendation systems, enable efficient image and video search, and improve AI generation models' accuracy and diversity.