My goal is to create a system with smart search capabilities, and one of the most important requirements is that it must run entirely on my local hardware. Privacy is key, but the main driver is the challenge and joy of building it myself (an obviously learn).
The key features I'm aiming for are:
Automatic identification and tagging of family members (local face recognition).
Generation of descriptive captions for each photo.
Natural language search (e.g., "Show me photos of us at the beach in Luquillo from last summer").
I've already prompted AI tools for a high-level project plan, and they provided a solid blueprint (eg, Ollama with LLaVA, a vector DB like ChromaDB, you know it). Now, I'm highly interested in the real-world human experience. I'm looking for advice, learning stories, and the little details that only come from building something similar.
What tools, models, and best practices would you recommend for a project like this in 2025? Specifically, I'm curious about combining structured metadata (EXIF), face recognition data, and semantic vector search into a single, cohesive application.
Any and all advice would be deeply appreciated. Thanks!
edit: To explain further why it's almost always desirable:
You guarantee that you and your users' information is safe if the server is compromised, if an admin goes rogue, or if local bodies of power request their information from you.
The information can't be sent to third-parties by design.
Any operations / transformations that need to be applied to the information will have to either be done via homomorphic encryption or on the client-side (which is much more likely to be open source / easy-to-deobfuscate compared to blackbox server code).
E. g., “Any operations / transformations” includes facial recognition, CLIP embeddings, &c; you want to run this on the server, overnight, and to be able to re-run at a later date when new models become available. Under e2ee, that’s a round-trip through a client device at every model update. So that’s a significant downside, for no important upsides in the case when you and your family are the only users.
What happens if there’s a new, better model? You’d need to re-download, decrypt, and run inference on all your past media, which is in terabytes for many.
I understand the benefit of e2ee in a situation where there is no trust between user and admin. In personal self-hosting, that’s the same person (or family), and the upsides are not as relevant. The downsides (possibility of data loss for, e. g., kids who are not very good with passwords/keys; difficulties with updating models / thumbs; …) remain important, and outweigh the benefits, even assuming the e2ee is implemented well.
edit: also feel like I'm echoing the classic dropbox comment, but self-hosting in a sane and secure manner is harder than it's made out to be. It needs to be taken seriously.
[0] https://proton.me/blog/data-recovery-end-to-end-encryption