
235 points GeorgeCurtis | 3 comments

Hey HN, we want to share HelixDB (https://github.com/HelixDB/helix-db/), a project a college friend and I are working on. It’s a new database that natively intertwines graph and vector types, without sacrificing performance. It’s written in Rust and our initial focus is on supporting RAG. Here’s a video runthrough: https://screen.studio/share/szgQu3yq.

Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need both similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.

Developers of such apps face the quandary of having to build on top of two different databases—a vector one and a graph one—then link them together and keep the data in sync. Even then, the two databases aren't designed to work together; for example, there's no native way to perform joins or queries that span both systems, so you have to handle that logic at the application level.
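To make that pain concrete, here's a toy, in-memory sketch (pure Python, nothing to do with HelixDB's actual API) of the glue an application ends up writing when the vector store and the graph store are separate systems:

```python
import math

# Toy stand-ins for two separate systems the app must keep in sync.
vector_store = {                     # doc_id -> embedding
    "case_a": [0.9, 0.1],
    "case_b": [0.8, 0.2],
    "case_c": [0.1, 0.9],
}
graph_store = {                      # doc_id -> cases it cites
    "case_a": ["case_b"],
    "case_b": ["case_c"],
    "case_c": [],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def semantic_search(query_vec, k=1):
    """Similarity query: rank documents by cosine similarity."""
    ranked = sorted(vector_store,
                    key=lambda d: cosine(query_vec, vector_store[d]),
                    reverse=True)
    return ranked[:k]

def cited_by(doc_id):
    # The "join" lives in application code: ids returned by one store
    # are looked up by hand in the other.
    return graph_store.get(doc_id, [])

hits = semantic_search([1.0, 0.0], k=1)            # similarity query
related = [c for h in hits for c in cited_by(h)]   # relationship query
print(hits, related)
```

A hybrid database can push that join below the query layer instead of leaving it to every application.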

Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: https://arxiv.org/html/2408.04948v1. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.

After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.

Problems where a hybrid approach works particularly well include:

- Indexing codebases: you can vectorize code-snippets within a function (connected by edges) based on context and then create an AST (in a graph) from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.

- Molecule discovery: Model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.

- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to employees/teams/projects in the graph.
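As a rough illustration of the codebase-indexing idea, here's a minimal call-graph builder using Python's stdlib `ast` module; the function names and structure are invented for the example, and Helix's actual indexing is more involved than this:

```python
import ast
from collections import defaultdict

source = """
def parse(raw):
    return raw.strip()

def load(path):
    data = read_file(path)
    return parse(data)

def read_file(path):
    return open(path).read()
"""

# Build a simple call graph: function name -> names it calls directly.
tree = ast.parse(source)
call_graph = defaultdict(set)
for fn in (n for n in tree.body if isinstance(n, ast.FunctionDef)):
    for node in ast.walk(fn):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            call_graph[fn.name].add(node.func.id)

def dependencies(name, seen=None):
    """Traverse the call graph to find everything `name` transitively calls.
    An agent that retrieved `load` by similarity can use this to pull in
    only the relevant code instead of guessing names."""
    seen = seen if seen is not None else set()
    for callee in call_graph.get(name, ()):
        if callee not in seen:
            seen.add(callee)
            dependencies(callee, seen)
    return seen

print(sorted(dependencies("load")))
```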

When I was first learning about databases, I naively assumed that queries would be compiled and executed like functions in traditional programming. It turns out most databases don't work that way: the full query text is sent over the wire, parsed and compiled at runtime, and only then executed, which adds unnecessary latency. With Helix, you write queries in our query language (HelixQL), which is transpiled into Rust code and built directly into the database server, where each query is exposed as a generated API endpoint.
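The compile-once idea can be sketched in a few lines. This is a made-up toy query language, with a plain Python closure standing in for Helix's transpile-to-Rust step:

```python
import re

QUERY_RE = re.compile(r"FILTER (\w+) = (\w+)")

def run_interpreted(query, rows):
    # Parsed on every call, the way a classic database handles query text.
    field, value = QUERY_RE.match(query).groups()
    return [r for r in rows if r.get(field) == value]

def compile_query(query):
    # Parsed once, ahead of time; returns a plain function ("the plan").
    # Helix takes this much further by transpiling HelixQL to Rust and
    # compiling it into the server itself.
    field, value = QUERY_RE.match(query).groups()
    def plan(rows):
        return [r for r in rows if r.get(field) == value]
    return plan

rows = [{"city": "london"}, {"city": "paris"}]
q = "FILTER city = paris"
plan = compile_query(q)                        # compile once
assert plan(rows) == run_interpreted(q, rows)  # call many times, no re-parse
```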

Many people have a thing against "yet another query language" (doubtless for good reason!) but we went ahead and did it anyway, because we think it makes working with our database so much easier that it's worth a bit of a learning curve. HelixQL borrows from other query languages such as Gremlin, Cypher and SQL, with some extra ideas added in. It is declarative, while the traversals themselves are functional. This allows complete control over the traversal flow while keeping the syntax clean. HelixQL returns JSON to make things easy for clients. Also, it uses a schema, so queries are type-checked.
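Since HelixQL's syntax isn't shown here, this is a loose Python analogue of what a functional, chainable traversal looks like; the method names `out`, `filter`, and `collect` are invented for the sketch:

```python
class Traversal:
    """Minimal chainable traversal over an adjacency-list graph.
    A toy analogue of functional traversal style, not HelixQL itself."""

    def __init__(self, graph, frontier):
        self.graph, self.frontier = graph, list(frontier)

    def out(self):
        # Step from every node in the frontier to its outgoing neighbours.
        return Traversal(self.graph,
                         [m for n in self.frontier for m in self.graph.get(n, ())])

    def filter(self, pred):
        return Traversal(self.graph, [n for n in self.frontier if pred(n)])

    def collect(self):
        return self.frontier

graph = {"alice": ["team_x"], "team_x": ["proj_1", "proj_2"]}
result = (Traversal(graph, ["alice"])
          .out()                                # alice -> her team
          .out()                                # team -> its projects
          .filter(lambda n: n.endswith("1"))
          .collect())
print(result)  # ["proj_1"]
```

Each step returns a new traversal, so the flow reads top to bottom while staying fully composable.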

We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now reworking it to make traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and independent parts of a read are processed in parallel.
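A hand-wavy sketch of the idea, lazy decoding plus parallel sub-traversals, using Python generators and a thread pool (the structure here is my assumption for illustration, not Helix's actual engine):

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend on-disk pages: raw bytes that are expensive to decode.
raw_pages = {i: f"node-{i}".encode() for i in range(6)}

def decode(page_id):
    # In a pipelined engine, decoding happens only when the traversal
    # actually reaches a node, never eagerly for the whole graph.
    return raw_pages[page_id].decode()

def traverse(page_ids):
    # Generator: each page is decoded lazily, on demand.
    for pid in page_ids:
        yield decode(pid)

# Independent sub-traversals (branches of a read) run in parallel.
branches = [[0, 1], [2, 3], [4, 5]]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda b: list(traverse(b)), branches))
print(results)
```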

If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...

Many thanks! Comments and feedback welcome!

basonjourne No.43977294
why not surrealdb?
GeorgeCurtis No.43978140
General consensus is that it's really slow, though I like the concept of Surreal. Our first, extremely bare-bones version of the graph db was 1-2 orders of magnitude faster than Surreal (we haven't run benchmarks against Surreal recently, but I'll post them here when we're done).
1. datastorydesign No.43984050
Hey George, Alexander from SurrealDB here.

Congratulations on the launch! This is a very exciting space, and it's great to see your take on it.

Running fair benchmarks, not benchmarketing, is a significant effort and we recently put in this effort to make things as fair and transparent as possible across a range of databases.

You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

We'd be very interested in seeing the benchmarks you'd run and how we compare :)

You can sacrifice many things for faster performance, such as security, consistency levels or referential integrity.

I'm genuinely curious to learn what design decisions you will make as you continue building the database. There are so many options, each with its pros and cons.

If you would like to have a chat where we can exchange ideas, happy to do that :)

2. GeorgeCurtis No.43990931
I've only just seen this! Thanks so much for the response, would definitely love to chat with you guys.

You're definitely right btw, those weren't concrete benchmarks and I'm excited to see how we compare now :)

3. riku_iki No.44005240
> You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

The page says your benchmark runs on only 5M records. Isn't that an incredibly small dataset in the current world? It seems more like micro-benchmarking.

Also, a count(*) query having 5s latency on 5M records is very underwhelming, if I understand your tables correctly.