
276 points Fendy | 1 comments
simonw ◴[] No.45170289[source]
This is a good article and seems well balanced despite being written by someone with a product that directly competes with Amazon S3. I particularly appreciated their attempt to reverse-engineer how S3 Vectors work, including this detail:

> Filtering looks to be applied after coarse retrieval. That keeps the index unified and simple, but it struggles with complex conditions. In our tests, when we deleted 50% of data, TopK queries requesting 20 results returned only 15—classic signs of a post-filter pipeline.
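
To make the symptom in that quote concrete, here is a minimal toy sketch of post-filtering versus pre-filtering. It is not a description of how S3 Vectors is actually implemented: the 1-D integer "vectors", the `deleted` flag, and the function names `post_filter_search` and `pre_filter_search` are all made up for illustration. It only shows why a pipeline that picks its top-k candidates first and filters afterwards can hand back fewer than k results. The toy numbers are not meant to reproduce the article's 15 out of 20.

```python
def top_k(query, items, k):
    """Return the k items nearest to query (absolute distance on a 1-D value)."""
    return sorted(items, key=lambda it: abs(it["value"] - query))[:k]


def post_filter_search(query, items, k, predicate):
    # Hypothetical post-filter pipeline: pick k candidates first,
    # then drop any candidate that fails the filter.
    candidates = top_k(query, items, k)
    return [it for it in candidates if predicate(it)]


def pre_filter_search(query, items, k, predicate):
    # Hypothetical pre-filter pipeline: apply the filter first,
    # then take the k nearest survivors.
    survivors = [it for it in items if predicate(it)]
    return top_k(query, survivors, k)


# Toy data set: 1000 items, every even id marked as deleted (50% of the data).
items = [{"id": i, "value": i, "deleted": i % 2 == 0} for i in range(1000)]


def is_live(item):
    return not item["deleted"]


post = post_filter_search(500, items, 20, is_live)
pre = pre_filter_search(500, items, 20, is_live)

print(len(post), len(pre))  # post-filter comes back short, pre-filter fills all 20
```

A real service could soften this by over-fetching candidates before filtering, which would be one possible reason to see a partial shortfall rather than exactly half, but that is speculation rather than anything documented.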

Things like this are why I'd much prefer if Amazon provided detailed documentation of how their stuff works, rather than leaving it to the development community to poke around and derive those details independently.

replies(5): >>45171116 #>>45171985 #>>45172432 #>>45177278 #>>45180236 #
speedysurfer ◴[] No.45171116[source]
And what if they change their internal implementation and your code depends on the old architecture? It's good practice to clearly think about what to expose to users of your service.
replies(2): >>45171187 #>>45172596 #
1. libraryofbabel ◴[] No.45172596[source]
If you can truly abstract away an internal detail, then great. But often there are design decisions that you cannot abstract away because they affect e.g. performance in a major way. For example, I don't care whether some AWS service is written in Java or Go or C++. I do care a bit about how its indexing and retrieval works, because I need to know that to plan my query workloads.

I actually think AWS did a reasonably good job of this with DynamoDB. Most of the performance tradeoffs, indexing etc., are pretty clear if you read enough docs, without exposing a ton of unnecessary internals.