Data has always been one of an organization's most important assets, but the emergence of generative AI has put it front and center as the key element in the development of large language models (LLMs).
Vector databases have come into their own as a means to manage and secure such data. A vector database indexes, stores and provides access to structured or unstructured data (e.g., text or images) alongside its vector embeddings, which are numerical representations of that data. But what is their relationship with generative AI?
The Role of Vectors
Vectors are created when text and/or images are converted into sets of numerical values in a process known as embedding, explained Brad Porter, CTO of KnowledgeLake.
This transformation translates the data into a format that computers can efficiently process and analyze. Each word or item is represented as a point in a multi-dimensional space where the geometric arrangement encapsulates the relationships and similarities between these items.
Imagine if each type of fruit in a grocery store had a barcode, he said. The closer the barcodes are in a numerical sequence, the more similar the fruits are in certain characteristics like color, shape or taste. So, apples might have barcodes that are close in sequence to pears, but far from bananas. In a similar way, embeddings give a unique numerical barcode to words or images, making it easier for computer systems to understand, compare and work with large amounts of text or other data.
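To make the analogy concrete, here is a minimal sketch using hand-picked toy vectors in place of real embeddings; cosine similarity plays the role of comparing the "barcodes." The fruit names and numbers are invented for illustration only.

```python
import numpy as np

# Toy, hand-picked vectors standing in for real embeddings: each fruit's
# "numerical barcode" loosely encodes characteristics such as color, shape and sweetness.
embeddings = {
    "apple":  np.array([0.8, 0.7, 0.6]),
    "pear":   np.array([0.7, 0.6, 0.65]),
    "banana": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: values near 1.0 mean very similar, near 0.0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["apple"], embeddings["pear"]))    # high: similar fruits
print(cosine_similarity(embeddings["apple"], embeddings["banana"]))  # lower: less similar
```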
Storage in Vector Databases
Vector databases, then, are tailored to index, store and manage vector embeddings along with the original data, whether it is structured or unstructured, such as text or images, Porter continued.
This design facilitates efficient retrieval and similarity searches, especially in production environments dealing with large scales of data. The co-existence of original data and its vector representation in a database is instrumental for quick access and operational efficiency.
However, it is worth mentioning that for smaller use cases, storing vectors in-memory might suffice. This approach is simpler and offers rapid access to vector data, but comes with its own set of considerations.
The lifespan of the vectors, their quantity and the performance requirements are all factors that come into play, he said. In-memory storage is ephemeral, meaning the data will not persist if the system restarts, and it is constrained by the available memory. On the other hand, vector databases offer a more robust, persistent solution, capable of managing a growing quantity of vectors without compromising performance.
Information Retrieval in Vector Databases
The querying process commences with a prompt that is vectorized, converting the query into a numerical format comparable to the vectors in the database.
Following this, similarity metrics like cosine similarity are employed to find the vectors in the database that most closely match the vectorized prompt. Once the closest vectors are identified, the original data associated with them is retrieved and placed into a context window alongside the query.
This compiled data is then channeled to a large language model (LLM), which uses the context to formulate a pertinent response to the original query.
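A minimal sketch of that retrieval flow follows. The embed() helper is a toy stand-in for a real embedding model, the documents are invented, and the final LLM call is only indicated by printing the assembled prompt.

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding size; real models produce hundreds or thousands of dimensions

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hash each word into a fixed-size vector."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# A tiny corpus with its embeddings, playing the role of the vector database.
documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "The warehouse ships orders within two business days.",
    "Gift cards never expire and can be used on any product.",
]
doc_vectors = np.vstack([embed(d) for d in documents])

# 1. Vectorize the prompt so it is comparable to the stored vectors.
query = "How long do I have to return an item for a refund?"
query_vector = embed(query)

# 2. Cosine similarity against every stored vector; take the closest matches.
scores = doc_vectors @ query_vector          # vectors are normalized, so dot product = cosine
top_k = np.argsort(scores)[::-1][:2]

# 3. Place the retrieved originals into a context window alongside the query.
context = "\n".join(documents[i] for i in top_k)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# 4. In a real pipeline the prompt would now be sent to an LLM; here we just show it.
print(prompt)
```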
“The process is a meticulous orchestration aimed at efficiently navigating through the vast data, harnessing the vector representations to pinpoint relevant information, and leveraging the LLM to interpret and respond to the query,” he said.
“This structured flow underscores the synergy between vector databases, vector embeddings, and large language models in processing and delivering relevant information in response to user queries.”
Highly Scalable Databases
Databases are used for storing and retrieving data, said Mayank Jindal, a software development engineer with Amazon. A database's internal indexing functionality helps retrieve data quickly, he continued. Vector databases are a type of database that uses vector embeddings as indexes. These embeddings are a specialized way of representing complex data (like text and images) in numerical form.
Vector databases are built for high scalability, said Jindal, meaning that billions of vectors can be searched quickly and efficiently. And this is where their relationship with generative AI begins.
Vector databases are in high demand because of generative AI and LLMs. Generative AI and LLMs generate vector embeddings that capture patterns in data, making vector databases an ideal component of the overall ecosystem.
Vector databases rely on specialized algorithms for fast similarity search, most notably approximate nearest-neighbor (ANN) search.
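ANN search trades a small amount of accuracy for large gains in speed. As a minimal sketch, and assuming the open-source faiss library (one of several ANN implementations; the article does not prescribe any particular one), an HNSW index over random stand-in vectors might look like this:

```python
import numpy as np
import faiss  # assumed ANN library; not specified by the article

d = 128                                   # embedding dimensionality
rng = np.random.default_rng(0)
db_vectors = rng.random((10_000, d), dtype=np.float32)   # stand-in embeddings

# HNSW is one widely used approximate nearest-neighbor index structure.
index = faiss.IndexHNSWFlat(d, 32)        # 32 = neighbors per node in the graph
index.add(db_vectors)

query = rng.random((1, d), dtype=np.float32)
distances, ids = index.search(query, 5)   # approximate top-5 matches
print(ids[0], distances[0])
```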
“One of the real-world use cases I can think of is e-commerce. On the customer side, vector databases can be used to recommend products. If we represent product data like images, descriptions, and customer reviews as vectors, then a vector database can easily fetch similar products,” he said.
He added that it can also store a user's past behavior and preferences as vectors, which can be used to provide personalized recommendations and content. On the inventory side, product and inventory data can likewise be stored as vectors, helping a vector database surface trends and seasonality that inform inventory and supply chain optimization.
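As an illustration of the recommendation idea, the sketch below builds a user "preference vector" by averaging toy product embeddings and scores the rest of the catalog against it; the product names and numbers are invented for the example.

```python
import numpy as np

# Toy product embeddings: in practice these would come from images, descriptions
# and reviews encoded by an embedding model, then stored in a vector database.
products = {
    "red apple":    np.array([0.9, 0.8, 0.1]),
    "green pear":   np.array([0.8, 0.7, 0.2]),
    "banana bunch": np.array([0.1, 0.3, 0.9]),
    "fruit basket": np.array([0.6, 0.6, 0.5]),
}

def normalize(v):
    return v / np.linalg.norm(v)

# A user's past behavior expressed as a single preference vector
# (here, simply the average of products they have viewed or bought).
history = ["red apple", "green pear"]
profile = normalize(np.mean([products[p] for p in history], axis=0))

# Recommend the most similar products the user has not already interacted with.
scores = {
    name: float(normalize(vec) @ profile)
    for name, vec in products.items()
    if name not in history
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```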
Structured, Unstructured Data
Vector databases are used to support various applications and technologies, particularly in the context of generative AI and LLMs, said IJYI development practice lead Lewis Huxtable. He pointed to several characteristics that could help organizations that are developing LLMs. They include:

1. Indexing and Search
These databases create efficient indexes for the vector embeddings, allowing for fast and scalable similarity searches. Similarity search is a key feature of vector databases, enabling users to find data points similar to a given query vector.
2. Scalability
Vector databases are designed to manage large datasets and are optimized for scalable and high-performance operations, Huxtable said. This is especially important for applications that require real-time or near-real-time retrieval of similar data points, such as recommendation systems or content matching.
3. Applications
Vector databases have numerous applications in fields like information retrieval, recommendation systems, content matching, natural language processing, computer vision and more, he added. They are used in scenarios where finding similar data points or objects is critical, and traditional databases or search engines may not be efficient enough.
Real-Time Production Use
Vector databases have been around for some time in the context of geospatial data, and a lot of technology enabling fast retrieval via specialized indexes and notions of similarity has been developed, said Susan Davidson, a professor at the UPenn School of Engineering and Applied Science.
“New companies are springing up, and existing relational database systems are being extended to manage vectors — Postgres has been vector-enabled for some time, but Oracle just announced an integrated vector database,” she said.
So why is this important? The idea is that you can vectorize unstructured data such as text, images, video and audio; store the vectors in a table; and then query the data to quickly find a response.
This opens the door to retrieval-augmented generation (RAG), an AI technique that combines LLMs with confidential business data to answer natural language queries.
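As a rough sketch of that idea, and assuming a PostgreSQL instance with the pgvector extension and the psycopg2 driver (neither of which the article mandates), storing vectors in a table and querying them might look like this; the connection string, table name, column size and values are all hypothetical.

```python
import psycopg2  # assumes a PostgreSQL driver; connection details are placeholders

conn = psycopg2.connect("dbname=docs user=postgres")
cur = conn.cursor()

# One-time setup: enable pgvector and create a table holding text plus its embedding.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(3)  -- real embeddings typically have hundreds of dimensions
    );
""")

# Store a document alongside its (toy) embedding.
cur.execute(
    "INSERT INTO documents (body, embedding) VALUES (%s, %s);",
    ("Refund policy: returns accepted within 30 days.", "[0.12,0.98,0.05]"),
)

# Retrieve the rows whose embeddings are closest to a (toy) query vector.
# pgvector's <-> operator is Euclidean distance; <=> is cosine distance.
cur.execute(
    "SELECT body FROM documents ORDER BY embedding <-> %s LIMIT 5;",
    ("[0.10,0.95,0.07]",),
)
for (body,) in cur.fetchall():
    print(body)  # candidate context to hand to the LLM

conn.commit()
cur.close()
conn.close()
```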