Weaviate is one of the best-known names in vector search engine development. But what does vector search mean, and why is it relevant?
In recent years, the fields of search engines, Natural Language Processing (NLP) and Machine Learning (ML) have seen a series of discoveries and breakthroughs, including new tools and paradigms that deliver better results. When we think about search engines, perhaps the first thing that comes to mind is Internet search engines and indexers like Google or Bing. Nevertheless, “vector search marks a major shift from this traditional method of information retrieval to a future in which all of the complex data that makes up modern content assets can be put to work”, as explained in this article on the Search Engine Journal website.
Let’s break these concepts down to understand this area better.
What are vector embeddings?
In a few words, a vector search engine works with vector embeddings instead of keywords. Unlike “traditional search” algorithms, which scan database rows until they find one that matches the criteria, vector search uses ML to capture the context and meaning of unstructured data. This way, ML algorithms can read and analyze different formats (text, images, audio and other data) as numeric data.
The concept of “vector similarity” is also fundamental here. It compares objects through the vector representations captured by machine learning models, treating the closest vectors as the most similar. This enables semantic search, image search and better recommendations.
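As an illustration of vector similarity, here is a minimal sketch in plain Python (not Weaviate code): it compares toy embedding vectors by cosine similarity, a common likeness measure. The three-dimensional vectors are invented for the example; real models produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- made up for illustration.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1.0: semantically similar
print(cosine_similarity(cat, car))     # much lower: dissimilar
```

Objects whose vectors point in nearly the same direction are treated as semantically close, which is what makes semantic search and recommendations possible.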
What are ANN libraries?
This paper describes ANN, an Approximate Nearest Neighbor library, as “a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.” ANN libraries have long been a resource for search engines. However, this method has some limitations, including a lack of real-time capabilities and mutability. Additionally, it’s key to consider the amount of data companies need to search through to complete these tasks.
“You might have heard of Spotify’s Annoy, Facebook’s faiss, or Google’s ScaNN. What they all have in common is that they make a conscious trade-off between accuracy (precision, recall) and retrieval speed (…) However, there is yet another trade-off engineers have to make when using ANN models. Many of the common ANN libraries fall short of some of the features we are used to when working with traditional databases and search engines”, explains Etienne Dilocker, Co-Founder & CTO of SeMi Technologies.
This is where vector search engines come into play, enabling faster execution and more relevant results.
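To make the accuracy/speed trade-off concrete, here is a sketch of exact (brute-force) nearest-neighbor search in plain Python. Libraries like Annoy, faiss or ScaNN replace this O(n) linear scan with approximate index structures (trees, graphs, quantization), trading a little recall for much better retrieval speed at scale; the vectors below are invented for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k=2):
    # Brute force: compare the query against every stored vector -- O(n).
    # ANN libraries approximate exactly this step to avoid the full scan.
    scored = sorted(vectors.items(), key=lambda kv: euclidean(query, kv[1]))
    return [name for name, _ in scored[:k]]

vectors = {
    "doc_a": [0.1, 0.9],
    "doc_b": [0.2, 0.8],
    "doc_c": [0.9, 0.1],
}
print(exact_knn([0.12, 0.88], vectors, k=2))  # ['doc_a', 'doc_b']
```

With a handful of vectors the exact scan is trivial; with millions of high-dimensional embeddings it becomes the bottleneck that ANN models are designed to remove.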
Weaviate Vector Search Engine
According to its creators, “one of the most recurring challenges presented itself in naming and searching. How would we call certain objects and how could we find data that was structured in different ways?”. These first questions led to Weaviate’s development, going from a more traditional approach “where the semantic (NLP) element was a feature rather than the core architecture” to a more refined tool with semantic search and semantic classification as core features. In this context, “one of the goals in designing Weaviate was to combine the speed and large-scale capabilities of ANN models with all the functionality we enjoy about databases.”
Now, Weaviate is officially presented as “an open-source search engine with a built-in NLP model. (…) What makes Weaviate unique is that it stores data in a vector space rather than a traditional row-column or graph structure, allowing you to search through data based on its meaning rather than keywords alone.”
Examples of implementations include:
- Classification of invoices into categories.
- Searching through documents for specific concepts rather than keywords.
- Site search.
- Product knowledge graphs.
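As a sketch of how the invoice-classification use case might be modeled, here is a hypothetical class schema expressed as the kind of Python dictionary a Weaviate client schema call typically accepts. All names here (“Invoice”, “Category”, the property names) are invented for illustration and are not from the article.

```python
# Hypothetical schema for classifying invoices into categories.
# Class and property names are illustrative assumptions, not Weaviate defaults.
invoice_class = {
    "class": "Invoice",
    "description": "A scanned or imported invoice",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "amount", "dataType": ["number"]},
        # Cross-reference to a Category class, to hold classification results.
        {"name": "ofCategory", "dataType": ["Category"]},
    ],
}

category_class = {
    "class": "Category",
    "properties": [{"name": "name", "dataType": ["text"]}],
}

print(invoice_class["class"], [p["name"] for p in invoice_class["properties"]])
```

With a running Weaviate instance, such dictionaries would be registered through the client’s schema API before importing objects; Weaviate’s semantic classification could then assign each invoice to its closest category.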
How does Weaviate work?
Etienne Dilocker also points out that “any object imported into Weaviate can immediately be queried — whether through a lookup by its id, a keyword-search using the inverted index, or a vector search. This makes Weaviate a real-time vector search engine. And because Weaviate also uses an ANN model under the hood, the vector search is going to be just as fast as with a library”.
A diagram of Weaviate’s structure (Source).
This review from Analytics Vidhya listed some considerations and details about Weaviate:
- Weaviate is a persistent and fault-tolerant database.
- Internally, each class in Weaviate’s user-defined schema results in the creation of an index. An index is a wrapper type consisting of one or more shards, and shards are self-contained storage units within an index.
- Multiple shards can be used to automatically divide the load among multiple server nodes, acting as a load balancer. Each shard consists of three main components: an object store, an inverted index and a vector index store.
- The object store and inverted index have been built using an LSM-Tree architecture. This means that data can be ingested at memory speed and, once a threshold is reached, Weaviate writes the full (sorted) memtable to a disk segment.
- When receiving a read request, Weaviate first examines the memtable for the most recent update to the object. If the object is not present in the memtable, Weaviate checks all previously written segments, starting with the most recent. Bloom filters prevent examining segments that do not contain the requested object.
- The object and inverted stores employ a segmentation-based LSM technique. The vector index, on the other hand, is unaffected by segmentation and is independent of those stores.
- Hierarchical Navigable Small World (HNSW), a multilayered graph, is the first vector index type supported by Weaviate.
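The LSM write and read path described above can be sketched in plain Python. This is a heavily simplified stand-in for Weaviate’s actual implementation: the memtable is a dict, segments are immutable snapshots, and a plain set plays the role of the Bloom filter (a real Bloom filter is probabilistic and may report false positives).

```python
class TinyLSM:
    """Toy LSM store: in-memory memtable, flushed to immutable segments."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.segments = []  # newest segment last; each is (key_set, data_dict)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        # Once the threshold is reached, write the sorted memtable to a segment.
        if len(self.memtable) >= self.flush_threshold:
            data = dict(sorted(self.memtable.items()))
            self.segments.append((set(data), data))  # set() stands in for a Bloom filter
            self.memtable = {}

    def get(self, key):
        # 1. Check the memtable for the most recent update.
        if key in self.memtable:
            return self.memtable[key]
        # 2. Check segments newest-first, skipping any whose "Bloom filter"
        #    shows the key is absent.
        for keys, data in reversed(self.segments):
            if key in keys:
                return data[key]
        return None

store = TinyLSM()
store.put("doc1", "v1")
store.put("doc2", "v1")   # reaches the threshold, triggers a flush
store.put("doc1", "v2")   # newer value lives in the memtable
print(store.get("doc1"))  # 'v2' -- memtable wins over the older segment
print(store.get("doc2"))  # 'v1' -- found in a flushed segment
```

The newest-first read order is what guarantees that the most recent update to an object is always the one returned, even when older versions still sit in earlier segments.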
Advantages of using Weaviate
- Accessibility and scalability: Weaviate is cloud-native and open-source. Additionally, it offers a pricing model inspired by “pay-as-you-grow”.
- Modularity: Weaviate can cover a wide variety of bases, and it comes with optional modules for text, images, and other media types. These modules can do the vectorization. It’s possible to combine modules and establish a relation between them. For example: a text object and a corresponding image object.
- Agnostic: Weaviate is agnostic by default. “This means teams with experience in data science and machine learning can simply keep using their finely-tuned ML models and import their data objects alongside their existing vector positions”.
- Some database advantages:
- CRUD support
- Real-time or near-real-time results
- Consistency and resilience
Software engineers are using Weaviate as an ML-first database for their applications. This way, data engineers can use a vector database built from the ground up with ANN at its core, and data scientists can deploy their search applications with MLOps.
There are three ways to run it:
- Weaviate Cloud Service
- Docker
- Kubernetes
To learn more about Weaviate, check the documentation.