Will Amazon S3 Vectors kill vector databases or save them?

1. resters ◴[08 Sep 25 16:21 UTC] No.45170203[source]▶

By hosting the vectors itself, AWS can meta-optimize its cloud based on content characteristics. It may not seem like a major optimization, but at AWS scale it is worth billions of dollars per year. It also makes it easier for AWS to comply with censorship requirements.

replies(3): >>45170388 #>>45170758 #>>45173752 #

2. barbazoo ◴[08 Sep 25 16:34 UTC] No.45170388[source]▶

>>45170203 (TP) #

> It also makes it easier for AWS to comply with censorship requirements.

Does it? How? Why would the vector store be what makes it easier for them to censor the content? Why not censor the documents in S3 directly, or the entries in the relational database? What is different about censoring those vs. a vector store?

replies(1): >>45170514 #

3. resters ◴[08 Sep 25 16:43 UTC] No.45170514[source]▶

>>45170388 #

Once a vector has been generated (and someone has paid for it) it can be searched for and relevant content can be identified without AWS incurring any additional cost to create its own separate censorship-oriented index, etc. AWS can also add additional bits to the vector that benefit its internal goals (scalability, censorship, etc.)

Not to mention there is lock-in once you've gone to the trouble of using a specific embedding model on a bunch of content. Ideally we'd converge on backwards-compatible, open source approaches, but cloud vendors want to offer "value" by offering "better" embedding models that are not open source.
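To make the first point concrete, here is a minimal sketch of how already-stored vectors can be searched without building a separate index. It is purely illustrative and says nothing about how S3 Vectors actually works: the names `stored_vectors`, `query_vector`, and `top_k` are made up, the three-dimensional vectors are toy data, and a brute-force cosine-similarity scan stands in for whatever a real service would use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=2):
    """Brute-force scan: score every stored vector against the query."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings that a customer has already generated and paid for.
stored_vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.3, 0.1],
}

# Hypothetical embedding of whatever content someone wants to locate.
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, stored_vectors, k=2)
for key, score in results:
    print(key, round(score, 3))
```

The only inputs are the vectors that already exist plus one query embedding from the same model, which is the whole argument: nobody has to re-read or re-index the original documents.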

replies(4): >>45170544 #>>45170605 #>>45170776 #>>45173764 #

4. barbazoo ◴[08 Sep 25 16:46 UTC] No.45170544{3}[source]▶

>>45170514 #

And that doesn't apply to any other database/search technology AWS offers?

replies(1): >>45170705 #

5. simonw ◴[08 Sep 25 16:50 UTC] No.45170605{3}[source]▶

>>45170514 #

Why would they do that? Doesn't sound like something that would attract further paying customers.

Are there laws on the books that would force them to apply the technology in this way?

replies(1): >>45170716 #

6. resters ◴[08 Sep 25 16:57 UTC] No.45170705{4}[source]▶

>>45170544 #

It applies to some of them but not to most, which is why Azure and GCP offer nearly the exact same core services.

7. resters ◴[08 Sep 25 16:58 UTC] No.45170716{4}[source]▶

>>45170605 #

Not official laws that we can read, but things like that are already in place per the Snowden revelations.

8. coredog64 ◴[08 Sep 25 17:01 UTC] No.45170758[source]▶

>>45170203 (TP) #

This comment appears to misunderstand AWS's control plane/data plane distinction. AWS does have limited access to your control plane, primarily for things like enabling your TAMs to analyze your costs or getting assistance from enterprise support teams. They absolutely do not have access to your data plane unless you specifically grant it. The primary use case for the latter is allowing writes into your storage for things like ALB access logs to S3. If you were deep in a debug session with enterprise support they might request one-off access to something large in S3, but I would be surprised if that were to happen.

replies(1): >>45170783 #

9. whakim ◴[08 Sep 25 17:03 UTC] No.45170776{3}[source]▶

>>45170514 #

Regardless of the merits of this argument, dedicated vector databases are all running on top of AWS/GCP/Azure infrastructure anyways.

10. resters ◴[08 Sep 25 17:03 UTC] No.45170783[source]▶

>>45170758 #

If that is the case, why create a separate GovCloud and HIPAA service?

replies(2): >>45172050 #>>45179460 #

11. thedougd ◴[08 Sep 25 18:36 UTC] No.45172050{3}[source]▶

>>45170783 #

HIPAA services are not separate. You only need to establish a Business Associate Addendum (BAA) with AWS and stick to HIPAA-eligible services: https://aws.amazon.com/compliance/hipaa-eligible-services-re...

GovCloud exists so that AWS can sell to the US government and their contractors without impacting other customers who have different or less stringent requirements.

12. j45 ◴[08 Sep 25 20:46 UTC] No.45173752[source]▶

>>45170203 (TP) #

Also, if it's not encrypted, I'm not sure whether AWS or others "synthesize" customer data by cursorily scrubbing so-called client-identifying information and then try to optimize and model for those scenarios at scale.

I feel more and more that some information in the training corpus of AI models was produced this way. A client's name and personally identifiable information might not be in the model, but some patterns of how to do things sure seem to come from such sources.

13. ◴[08 Sep 25 20:47 UTC] No.45173764{3}[source]▶

>>45170514 #

14. everfrustrated ◴[09 Sep 25 09:05 UTC] No.45179460{3}[source]▶

>>45170783 #

Product segmentation. Certain customers self-select to pay more for the same thing.