Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need both similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.
Developers of such apps face a dilemma: they have to build on top of two different databases, a vector one and a graph one, then link them together and keep the data in sync. Even then, the two databases aren't designed to work together; there's no native way, for example, to perform joins or queries that span both systems. You have to handle that logic at the application level.
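To make the pain concrete, here is a toy sketch of the application-level glue a hybrid database avoids: two separate stores that the app must keep in sync and "join" by hand. The stores are in-memory Python dicts standing in for a real vector DB and a real graph DB; none of this is Helix's API.

```python
# Toy stand-ins: a "vector store" and a "graph store" the application must
# keep in sync itself, with the cross-system join done in app code.
import math

vector_store = {  # doc_id -> embedding (stand-in for a vector DB)
    "case_a": [1.0, 0.0],
    "case_b": [0.9, 0.1],
    "case_c": [0.0, 1.0],
}
graph_store = {  # doc_id -> cited doc_ids (stand-in for a graph DB)
    "case_a": ["case_c"],
    "case_b": ["case_a"],
    "case_c": [],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def hybrid_search(query_vec, k=2):
    # Step 1: similarity query against the vector store.
    ranked = sorted(vector_store,
                    key=lambda d: cosine(query_vec, vector_store[d]),
                    reverse=True)[:k]
    # Step 2: relationship query against the graph store. The "join"
    # happens here, in application code, one extra round trip per result.
    return {doc: graph_store[doc] for doc in ranked}

print(hybrid_search([1.0, 0.05]))  # -> {'case_a': ['case_c'], 'case_b': ['case_a']}
```

In a hybrid database both steps run inside one engine, so the sync logic and the manual join disappear.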
Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: https://arxiv.org/html/2408.04948v1. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.
After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.
Problems where a hybrid approach works particularly well include:
- Indexing codebases: you can vectorize code snippets within a function (connected by edges) based on context, then build an AST-like graph from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.
- Molecule discovery: Model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.
- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to the employees, teams, and projects in the graph.
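As a minimal sketch of the codebase-indexing pattern in the first bullet: a snippet is found by similarity, then a call-graph walk collects only the definitions the model actually needs. The call graph here is hand-written, not parsed from real code.

```python
# After vector search matches an entry function, a breadth-first walk over
# the call graph gathers its transitive callees, so the prompt contains
# their real signatures instead of the model guessing them.
from collections import deque

call_graph = {  # function -> functions it calls (tiny hand-made stand-in)
    "handle_request": ["parse_body", "save_user"],
    "parse_body": [],
    "save_user": ["validate"],
    "validate": [],
}

def relevant_context(entry_fn):
    seen, queue = [], deque([entry_fn])
    while queue:
        fn = queue.popleft()
        if fn in seen:
            continue
        seen.append(fn)
        queue.extend(call_graph[fn])
    return seen

# Suppose vector search matched "handle_request"; traverse its dependencies.
print(relevant_context("handle_request"))
```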
When I was first learning about databases, I naively assumed that queries would be compiled and executed like functions in traditional programming. Turns out that's not how it usually works: the whole written query is sent over the wire, parsed and compiled at run time, and only then executed, which adds unnecessary latency. With Helix, you write queries in our query language (HelixQL), which is transpiled into Rust code and built directly into the database server, and you call a generated API endpoint.
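The "compile once, call as a function" idea can be illustrated with a toy in Python (this is not Helix's actual transpiler, and the `GET name` syntax is made up for the example): the query text is processed a single time at build time, and every request afterwards is just a function call.

```python
# Toy illustration: parse the query text once, return a callable that does
# no parsing per request, mirroring a pre-compiled API endpoint.
def compile_query(query_text):
    field = query_text.split()[-1]        # e.g. "GET name" -> "name"
    def endpoint(record):                 # the "generated API endpoint"
        return record[field]
    return endpoint

get_name = compile_query("GET name")      # done once, at build time
print(get_name({"name": "helix", "stars": 1}))  # per-request: no parsing
```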
Many people have a thing against “yet another query language” (doubtless for good reason!) but we went ahead and did it anyway, because we think it makes working with our database so much easier that it’s worth a bit of a learning curve. HelixQL takes inspiration from other query languages such as Gremlin, Cypher and SQL, with some extra ideas added in. It is declarative, while the traversals themselves are functional. This allows complete control over the traversal flow while also having a cleaner syntax. HelixQL returns JSON to make things easy for clients. Also, it uses a schema, so the queries are type-checked.
We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now working on improving it by making traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and the stages of a read are processed in parallel.
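A rough sketch of those two ideas, decode-on-demand and parallel fan-out, under toy assumptions (in-memory JSON bytes standing in for on-disk records, threads standing in for the engine's pipeline):

```python
# A node's bytes are only deserialized when a traversal touches it, and a
# node's neighbours are decoded in parallel rather than one after another.
import json
from concurrent.futures import ThreadPoolExecutor

disk = {  # node id -> raw bytes "on disk"
    1: json.dumps({"name": "a", "out": [2, 3]}).encode(),
    2: json.dumps({"name": "b", "out": []}).encode(),
    3: json.dumps({"name": "c", "out": []}).encode(),
}

def decode(node_id):
    # Only invoked for nodes the traversal actually reaches.
    return json.loads(disk[node_id])

def neighbours(node_id):
    node = decode(node_id)
    # Fan out: decode each neighbour concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode, node["out"]))

print([n["name"] for n in neighbours(1)])  # -> ['b', 'c']
```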
If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...
Many thanks! Comments and feedback welcome!
I.e., you have to re-index all of the vectors when you make an update to them.
How does the graph component of your database perform compared to Kuzu? Do you have any benchmarks?
For RAG I've tried Qdrant, Meilisearch, and Kuzu. At the moment I wouldn't consider HelixDB because of HelixQL. Wondering why you didn't use OpenCypher?
At the moment you have a system aimed at supporting AI/LLM applications, but by creating HelixQL you don't have an AI-coding-friendly query language.
With OpenCypher even older cheap models can generate queries. Or maybe some GraphQL layer.
We're currently working on benchmarks, so nothing exact on Kuzu right now with regards to performance. We've had quite a few requests for benchmark comparisons against different databases, so they should take a good few days. Will return here when they're ready.
When we've used Cypher in the past, we didn't get on with the methodology of the language that well. A functional approach, like Gremlin's, suited our minds better. But Gremlin's syntax is awful (in our opinion), and the amount of boilerplate code it requires felt unnecessary.
We wanted to make something that is easier to read than Gremlin, like Cypher is, but that also has the functional aspect that makes traversals feel so much more intuitive.
Another note, we're more fond of type-safe languages, and it didn't make much sense to us that out of all the programming languages that exist, query languages were the non-type-safe ones.
We know it's a pain learning a new language, but we really believe that our approach will pave the way for a better development experience and a better paradigm.
Onto the AI stuff: you're right, it isn't ideal (right now). We did make a GPT wrapper that did a pretty good job of writing queries based on a condensed version of our docs, but that isn't ideal either. So the next thing on our roadmap is a graph-traversal MCP tool. Instead of the agent having to generate text-based queries, it can use the traversal tools and decide where it should hop at each step.
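The tool-based approach above can be sketched as a loop, with the caveat that everything here is hypothetical: the tool name, the graph, and the "agent" (a scripted stand-in for an LLM choosing the next hop):

```python
# Instead of emitting a full query string, the model is given small
# traversal tools and picks one hop at a time.
graph = {"alice": ["bob"], "bob": ["carol"], "carol": []}

def tool_out_neighbors(node):     # one traversal tool exposed to the agent
    return graph[node]

def scripted_agent(observation):  # an LLM would choose the next hop here
    return observation[0] if observation else None

node, path = "alice", ["alice"]
while True:
    nxt = scripted_agent(tool_out_neighbors(node))
    if nxt is None:
        break
    path.append(nxt)
    node = nxt

print(path)  # -> ['alice', 'bob', 'carol']
```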
We know we're being quite ambitious here, but we think there's a lot we can improve on over existing solutions.
Thanks again :)