
276 points | Fendy | 3 comments
cpursley No.45171865
Postgres has pgvector. Postgres is where all of my data already lives. It’s all open source and runs anywhere. What am I missing with the specialty vector stores?
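For concreteness, a minimal pgvector sketch, assuming psycopg and an illustrative docs table (the embedding is passed in its text form so no extra type adapter is needed):

    # Minimal pgvector sketch (illustrative names; assumes the pgvector
    # extension is available and psycopg is installed).
    import psycopg

    query_embedding = [0.1, 0.2, 0.3]  # stand-in for a real embedding
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"

    with psycopg.connect("dbname=app") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS docs (
                id bigserial PRIMARY KEY,
                body text,
                embedding vector(3)
            )
        """)
        # Nearest neighbors by cosine distance, all inside Postgres.
        rows = conn.execute(
            "SELECT id, body FROM docs "
            "ORDER BY embedding <=> %s::vector LIMIT 10",
            (vec,),
        ).fetchall()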
replies(1): >>45171919 #
CuriouslyC No.45171919
Latency, actual retrieval performance, integrated pipelines that do more than just vector search to produce better results; the list goes on.

Postgres for vector search is fine for toy products or stuff that's outside the hot loop of your business, but for high-performance applications it's just inadequate.

replies(1): >>45171952 #
cpursley No.45171952
For the vast majority of applications, the trade-off favors keeping everything in Postgres over the operational overhead of some VC-hyped data store that won't be around in five years. Most people learned this lesson with Mongo (Postgres jsonb is now good enough for 90% of scenarios).
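The jsonb pattern, roughly (schema names are illustrative; assumes psycopg):

    # Rough sketch of the Mongo-replacement pattern: jsonb documents
    # plus a GIN index (illustrative names).
    import psycopg
    from psycopg.types.json import Jsonb

    with psycopg.connect("dbname=app") as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id bigserial PRIMARY KEY,
                payload jsonb
            )
        """)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS events_payload_idx "
            "ON events USING gin (payload)"
        )
        conn.execute(
            "INSERT INTO events (payload) VALUES (%s)",
            (Jsonb({"user": "alice", "action": "search"}),),
        )
        # Containment query, served by the GIN index.
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE payload @> %s",
            (Jsonb({"action": "search"}),),
        ).fetchall()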
replies(3): >>45171998 #>>45172223 #>>45172941 #
1. cpursley No.45171998
Also, there's no way an external store's end-to-end retrieval performance is going to match pgvector, because you still have to join the external vector results with your domain data in the main database at the application level, which is always going to be less performant.
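The single-round-trip shape I mean, sketched with illustrative table names:

    # One round trip: vector search and the domain join happen inside
    # Postgres (illustrative schema, assumes psycopg + pgvector).
    import psycopg

    vec = "[0.1,0.2,0.3]"  # text form of the query embedding
    with psycopg.connect("dbname=app") as conn:
        rows = conn.execute(
            """
            SELECT d.id, d.body, u.name
            FROM docs d
            JOIN users u ON u.id = d.author_id
            ORDER BY d.embedding <=> %s::vector
            LIMIT 10
            """,
            (vec,),
        ).fetchall()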
replies(2): >>45172190 #>>45174676 #
2. CuriouslyC No.45172190
For a large class of applications, the database join is the last step of a very involved pipeline that demands a lot more performance than pgvector can deliver. There is also a large class of applications that doesn't even interface with the database directly, except to emit logging/traceability artifacts.
3. jitl No.45174676
I'll take a 100ms turbopuffer vector search plus a 50ms Postgres select-where-id-in over a 500ms all-in-one pgvector + join query.

When you only need to hydrate like 30 search-result item IDs from Postgres or memcached, I don't see the join being "too expensive" to do in memory.
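Roughly this shape (vector_search here is a hypothetical stand-in for the external store's client, stubbed so the sketch is self-contained):

    # Two-step retrieval: external vector search for IDs, then hydrate
    # from Postgres and join in memory (assumes psycopg).
    import psycopg

    def vector_search(embedding, top_k=30):
        # Hypothetical: query the external store (turbopuffer, etc.)
        # and return ranked row IDs. Stubbed for illustration.
        return [42, 7, 13][:top_k]

    ids = vector_search([0.1, 0.2, 0.3], top_k=30)
    with psycopg.connect("dbname=app") as conn:
        rows = conn.execute(
            "SELECT id, body FROM docs WHERE id = ANY(%s)",
            (ids,),
        ).fetchall()

    # The "join" is a dict lookup: re-order the hydrated rows to match
    # the external store's ranking.
    by_id = {row[0]: row for row in rows}
    results = [by_id[i] for i in ids if i in by_id]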