
235 points by GeorgeCurtis | 1 comment

Hey HN, we want to share HelixDB (https://github.com/HelixDB/helix-db/), a project a college friend and I are working on. It’s a new database that natively intertwines graph and vector types, without sacrificing performance. It’s written in Rust and our initial focus is on supporting RAG. Here’s a video runthrough: https://screen.studio/share/szgQu3yq.

Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need both similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.
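A minimal sketch of the two query styles over toy in-memory data (the embeddings and citation edges below are invented for illustration): a vector-style similarity lookup, followed by a graph-style traversal from the best hit.

```python
from math import sqrt

# Toy document embeddings (invented 3-d vectors) and a citation graph.
doc_vecs = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.8, 0.2, 0.1],
    "case_c": [0.0, 0.9, 0.4],
}
cites = {"case_a": ["case_b"], "case_b": ["case_c"], "case_c": []}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def similar(query, k=1):
    """Vector-style query: top-k documents by cosine similarity."""
    return sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True)[:k]

def traverse(start, depth=2):
    """Graph-style query: follow citation edges outward from a node."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [n for f in frontier for n in cites[f] if n not in seen]
        seen.update(frontier)
    return seen

# Semantic search first, then walk relationships from the best hit.
hit = similar([1.0, 0.0, 0.0])[0]
related = traverse(hit)
```

The point of the hybrid is that both steps run against one store instead of two.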

Developers of such apps face the quandary of building on top of two different databases, a vector one and a graph one, and then linking them together and keeping the data in sync. Even then, the two databases aren't designed to work together: there's no native way to perform joins or queries that span both systems, so you have to handle that logic at the application level.

Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: https://arxiv.org/html/2408.04948v1. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.

After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.

Problems where a hybrid approach works particularly well include:

- Indexing codebases: you can vectorize code-snippets within a function (connected by edges) based on context and then create an AST (in a graph) from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.

- Molecule discovery: Model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.

- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to employees/teams/projects in the graph.
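As a concrete illustration of the last pattern, here is a toy hybrid lookup (all names, edges, and vectors invented): semantic search over documents, then a graph hop to find which team owns each hit.

```python
# Graph side: document -> team ownership edges (invented data).
owned_by = {"doc1": "platform-team", "doc2": "legal-team", "doc3": "platform-team"}

# Vector side: tiny invented 2-d embeddings for the same documents.
embeds = {"doc1": (1.0, 0.0), "doc2": (0.0, 1.0), "doc3": (0.9, 0.1)}

def top_k(query, k=2):
    """Rank documents by dot-product similarity to the query vector."""
    score = lambda d: sum(q * e for q, e in zip(query, embeds[d]))
    return sorted(embeds, key=score, reverse=True)[:k]

# Hybrid query: semantically similar docs, then the owning team via the graph.
hits = top_k((1.0, 0.0))
by_team = {d: owned_by[d] for d in hits}
```

In a hybrid database this join happens inside one query instead of in application code stitching two result sets together.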

When I was first learning about databases, I naively assumed that queries would be compiled and executed like functions in traditional programming. Turns out most databases instead interpret queries at run time, which creates unnecessary latency: the whole query text is sent over the wire, compiled on the fly, and then executed. With Helix, you write queries in our query language (HelixQL), which is transpiled into Rust code and built directly into the database server, where each query becomes a generated API endpoint you can call.
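The latency argument can be sketched as the difference between re-parsing a query string on every request and calling a function that was generated ahead of time (a toy Python stand-in; Helix actually transpiles HelixQL to Rust and compiles it into the server):

```python
import re

USERS = [{"name": "ada", "age": 36}, {"name": "bob", "age": 17}]

def run_interpreted(query: str):
    """Parse the query text on every call - what a run-time interpreter does."""
    m = re.match(r"SELECT name WHERE age > (\d+)", query)
    threshold = int(m.group(1))
    return [u["name"] for u in USERS if u["age"] > threshold]

def adults_endpoint():
    """'Compiled' version: parsing happened once, ahead of time; the server
    just exposes the resulting function as an endpoint."""
    return [u["name"] for u in USERS if u["age"] > 18]
```

Both return the same rows; the compiled path simply skips the per-call parse and plan work.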

Many people have a thing against "yet another query language" (doubtless for good reason!), but we went ahead and did it anyway because we think it makes working with our database so much easier that it's worth a bit of a learning curve. HelixQL borrows from other query languages such as Gremlin, Cypher, and SQL, with some extra ideas added in. It is declarative, while the traversals themselves are functional; this gives complete control over the traversal flow along with a cleaner syntax. HelixQL returns JSON to make things easy for clients, and it uses a schema, so queries are type-checked.
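HelixQL's actual syntax is documented in the repo; purely to illustrate the "declarative on the outside, functional traversals on the inside" idea, here is a toy chainable traversal builder in Python (all names invented, not HelixQL):

```python
class Traversal:
    """Toy chainable traversal over an adjacency-list graph."""
    def __init__(self, graph, nodes):
        self.graph, self.nodes = graph, list(nodes)

    def out(self, edge):
        """Follow outgoing edges with the given label."""
        nxt = [t for n in self.nodes for (e, t) in self.graph.get(n, []) if e == edge]
        return Traversal(self.graph, nxt)

    def where(self, pred):
        """Keep only nodes matching the predicate."""
        return Traversal(self.graph, [n for n in self.nodes if pred(n)])

    def collect(self):
        return self.nodes

# Invented graph: person -[follows]-> person.
g = {"alice": [("follows", "bob"), ("follows", "carol")], "bob": [("follows", "carol")]}
result = Traversal(g, ["alice"]).out("follows").where(lambda n: n != "bob").collect()
```

Each step returns a new traversal, so the chain reads left to right like the query it describes.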

We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now working on improving it by making traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and the stages of a read are processed in parallel.
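The "decode only when needed" part can be sketched with generators: rows stay as raw bytes until a traversal step actually consumes them (a toy example; the real engine does this in Rust with parallel pipelines):

```python
import json

# Pretend these are on-disk pages: raw encoded bytes per node.
pages = {n: json.dumps({"id": n, "score": n * 10}).encode() for n in range(5)}

decoded_count = 0

def scan(ids):
    """Lazily yield decoded nodes; decoding happens only on consumption."""
    global decoded_count
    for i in ids:
        decoded_count += 1
        yield json.loads(pages[i])

# Pipeline: stop at the first match - later pages are never decoded.
first_hit = next(n for n in scan(range(5)) if n["score"] >= 20)
```

Only three of the five pages are decoded here; a strict (non-lazy) scan would have decoded all five before filtering.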

If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...

Many thanks! Comments and feedback welcome!

srameshc:
I was thinking about intertwining vector and graph, because I have one specific use case that requires this combination. But I am not courageous or competent enough to build such a DB, so I am very excited to see this project and am certainly going to use it. One question: what kind of hardware do you think this would require? I ask because, from what I understand, graph database performance is directly proportional to the amount of RAM available, and vectors also need persistence and computational resources.
UltraSane:
Neo4j supports vector indexes
GeorgeCurtis:
First of all, Neo4j is very slow for vectors, so if performance matters for your user experience, it isn't a viable option. This is probably why Neo4j themselves have released guides on how to build the middleman software I mentioned, pairing Neo4j with Qdrant, for viable performance.

Furthermore, vectors are capped at 4k dimensions, which, although often enough, is a problem for some of the users we've spoken to. Also, they don't allow pre-filtering, which is a problem for a few people we've spoken to, including Zep AI. They are on the right track, but there are a lot of holes that we are hoping to fill :)

Edit: AND, it is super memory intensive. People have hit memory overflows even on extremely small datasets.

mauvo59:
Hey, want to correct some of your statements here. :-)

Neo4j's vector index uses Lucene's HNSW implementation. So, the performance of vector search is the same as that of Lucene. It's worth noting that performance suffers when configured without sufficient memory, like all HNSW vector indexes.

>> This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

No, this is about supporting our customers. Combining graphs and vectors in a single database is the best solution for many users - integration brings convenience, consistency, and performance. But we also recognise that customers might already have invested in a dedicated vector database, need additional vector search features we don't support, or benefit from separating graph and vector resources. Generally, integrating well with the broader data ecosystem helps people succeed.

>> Furthermore, the vectors is capped at 4k dimensions

We occasionally get asked about support for 8k vectors. But so far, whenever I've followed up with users, there doesn't seem to be a case for them. At ~32kb per embedding, they're often not practical in production. Happy to hear about use cases I've missed.
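The ~32kb figure checks out for an 8k-dimensional embedding stored as float32:

```python
dims = 8192          # "8k" embeddings, i.e. 8192 dimensions
bytes_per_float = 4  # float32
kb = dims * bytes_per_float / 1024  # size of one embedding in KiB
```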

>> Also, they don't allow pre filtering which is a problem for a few people we've spoken to including Zep AI.

We support pre- and post-filtering. We're currently implementing metadata filtering, which may be what you're referring to.
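The distinction being debated, sketched on an invented candidate pool: pre-filtering restricts the set before taking the top-k, while post-filtering takes the top-k first and then drops non-matching hits, which can return fewer than k results.

```python
# (id, similarity score, metadata) for an invented candidate pool.
candidates = [
    ("a", 0.95, {"year": 2019}),
    ("b", 0.90, {"year": 2024}),
    ("c", 0.80, {"year": 2024}),
    ("d", 0.70, {"year": 2024}),
]
keep = lambda meta: meta["year"] >= 2024

def pre_filter_search(k):
    pool = [c for c in candidates if keep(c[2])]                  # filter first...
    return [c[0] for c in sorted(pool, key=lambda c: -c[1])[:k]]  # ...then top-k

def post_filter_search(k):
    top = sorted(candidates, key=lambda c: -c[1])[:k]  # top-k first...
    return [c[0] for c in top if keep(c[2])]           # ...then filter (may shrink)

pre = pre_filter_search(2)    # ["b", "c"]
post = post_filter_search(2)  # ["b"] - "a" was dropped after the search
```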

>> AND, it is super memory intensive.

It's no more memory-intensive than other similar implementations. I get that different approaches have different hardware requirements. But in all cases, a misconfigured system will perform poorly.
