My goal is to create a system with smart search capabilities, and one of the most important requirements is that it must run entirely on my local hardware. Privacy is key, but the main driver is the challenge and joy of building it myself (and, obviously, to learn).
The key features I'm aiming for are:
Automatic identification and tagging of family members (local face recognition).
Generation of descriptive captions for each photo.
Natural language search (e.g., "Show me photos of us at the beach in Luquillo from last summer").
I've already prompted AI tools for a high-level project plan, and they provided a solid blueprint (e.g., Ollama with LLaVA, a vector DB like ChromaDB, and so on). Now, I'm highly interested in the real-world human experience. I'm looking for advice, learning stories, and the little details that only come from building something similar.
What tools, models, and best practices would you recommend for a project like this in 2025? Specifically, I'm curious about combining structured metadata (EXIF), face recognition data, and semantic vector search into a single, cohesive application.
Any and all advice would be deeply appreciated. Thanks!
Do you need the embeddings to be private? Or just the photos?
As of now, I use a SentenceTransformer model to chunk files, BLIP for captioning (“Family vacation in Banff, February 2025”), and MTCNN with InsightFace for face detection. My index stores captions, face embeddings, and EXIF metadata (date, GPS) for queries like “show photos of us in Banff last winter.” I’m working on integrating ChromaDB for faster searches.
Eventually, I aim to store indexes as:
{
  "filename": "/Vacation/Banff/Wife.jpg",
  "chunk_id": 0,
  "text": "Family at Banff, February 2025",
  "caption_embedding": [0.1, 0.2, ...],
  "face_embeddings": [{"name": "NT", "embedding": [0.3, 0.4, ...]}, ...],
  "exif": {
    "DateTimeOriginal": "2025:02:15",
    "GPSCoordinates": "18.387, -65.992"
  }
}

I also built a UI (like Spotlight Search) to search through these indexes.
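A minimal sketch of how the EXIF fields and caption embeddings in those records could compose at query time: filter on structured metadata first, then rank the survivors by embedding similarity. This is plain Python over toy records and a made-up query vector, not my actual implementation; ChromaDB's `where` filters do essentially the same thing.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(records, query_vec, year=None, month=None, top_k=5):
    """Filter on the EXIF date first, then rank by caption-embedding similarity."""
    hits = []
    for r in records:
        date = r["exif"]["DateTimeOriginal"]  # "YYYY:MM:DD"
        y, m = int(date[:4]), int(date[5:7])
        if (year is not None and y != year) or (month is not None and m != month):
            continue
        hits.append((cosine(query_vec, r["caption_embedding"]), r["filename"]))
    hits.sort(reverse=True)
    return [path for _, path in hits[:top_k]]

records = [
    {"filename": "/Vacation/Banff/Wife.jpg",
     "caption_embedding": [0.1, 0.2],
     "exif": {"DateTimeOriginal": "2025:02:15"}},
    {"filename": "/Home/Cat.jpg",
     "caption_embedding": [-0.3, 0.9],
     "exif": {"DateTimeOriginal": "2024:07:01"}},
]

# "Banff last winter" would parse to year=2025, month=2 plus an embedded query.
print(search(records, [0.1, 0.25], year=2025, month=2))  # ['/Vacation/Banff/Wife.jpg']
```

The nice property of filter-then-rank is that the date filter is exact (no embedding fuzziness for "last winter" once it's parsed), and the brute-force similarity pass only runs over the filtered subset.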
Code (in progress): https://github.com/neberej/smart-search
No hate here, I'm really grateful for what they've achieved so far, but I think there's a lot of room for improvement (e.g. a proper read/write query split, native S3 integration, faster endpoints, ...). I already mentioned it in their channel (they're a really welcoming community!) and I'm working on an alternative drop-in replacement backend (written in Go) [1] that will hopefully bring all the needed improvements.
TL;DR: It's definitely good, especially for an open-source project, and the team is very dedicated - but it's definitely not Postgres-good
1. The Immich app's performance is awful. It is a well known problem and their current focus. I have pretty high confidence that it will be fixed within a few months. Web app is totally fine though.
2. Some background operations, such as AI indexing, face detection, and video conversion, don't restart gracefully from scratch. They all basically delete the old data first, then start processing assets. So for many days (depending on your parallelism settings and server performance) you may be completely missing some assets from search, or some converted videos. But you only need to do this very rarely (e.g. when you change encoding settings and want to apply them to the back catalog, or switch the AI search model). I don't upload at a particularly high rate, but my server can very easily handle the steady state.
1 is pretty major but being worked on and you can work around it by just opening the website. 2 is less important but I don't think there is any work on it.
The addition of an AI tool is a great idea.
We are no longer auto-uploading to Google or Apple.
So far, I really like it. I haven't quite gone 100%, as we're still uploading with Synology's photo app, but Immich provides a much more refined, full-featured interface.
It gives a sort of high level system overview that might provide some useful insights or inspiration for you.
2. The software is provided without modification; I think it would be stranger to remove the encryption.
Near zero maintenance stack, incredibly easy to update, the client mobile apps even notify you (unobtrusively) when your server has an update available. The UI is just so polished & features so stable it's hard to believe it's open source.
The dev is really reluctant to accept external contributions, which has driven away a lot of curious folks willing to contribute.
Immich seems to be the other extreme: moving really fast with a lot of contributors, but stuff occasionally breaks and the setup is fiddly, while the AI features are 100x more powerful. I just don't like the UI as much as PhotoPrism's. I wish there were some kind of blend of the two, on a middle ground between their dev philosophies.
I've used Gemma to process pictures and get descriptions, and also to answer questions about the pictures (e.g. is there a bicycle in the picture?). I haven't tried it for face recognition, but if you have already identified someone in one photo, it can probably tell you whether the person in that photo is also in another photo.
Just one caveat: if you are processing thousands of pictures, it will take a while to process them all (depending on your hardware and picture size). You could also try creating a processing pipeline: first extract faces or face bounding boxes with something like OpenCV, and then pass those crops to Gemma 3.
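One way to structure that pipeline: run a cheap face detector first, crop the bounding boxes, and send only the crops to the VLM (so the expensive model sees far fewer pixels). The sketch below is hedged: `detect_faces` is a stub standing in for e.g. OpenCV's `CascadeClassifier.detectMultiScale` or MTCNN, and plain nested lists stand in for image arrays.

```python
def detect_faces(image):
    """Stub detector: returns bounding boxes as (x, y, w, h).
    In practice, replace with e.g. cv2.CascadeClassifier(...).detectMultiScale(image)
    or MTCNN, which return boxes in the same shape."""
    return [(1, 1, 2, 2)]

def crop(image, box):
    """Cut a (x, y, w, h) box out of a row-major image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def face_crops(image):
    """The cheap half of the pipeline: detect, then crop.
    Each returned crop would then be handed to the multimodal model."""
    return [crop(image, b) for b in detect_faces(image)]

# Toy 4x4 "image" as nested lists; real code would use numpy arrays.
img = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]

print(face_crops(img))  # [[[5, 6], [9, 10]]]
```

Batching the crops per photo also means the VLM answers "who is this?" per face rather than scanning whole frames, which tends to help both speed and accuracy.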
Please post repo link if you ever decide to open source
And for sure, if I get this to a point where it's open-source, I'll post the link here!
The stack is hacky, since it was mostly for myself...
Take my photo catalog stored in Google Photos, Apple Photos, OneDrive, and Amazon Photos; collate it into a single store and dedupe. Then build a proper timeline and geo/map view for all the photos.
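For the dedupe step across stores, hashing file contents catches exact duplicates even when names and timestamps differ after each service's export. A minimal sketch over in-memory blobs (real code would stream files from disk; the paths and bytes here are made up):

```python
import hashlib

def dedupe(blobs):
    """Keep the first copy of each distinct content.
    blobs is {path: bytes}; returns (paths to keep, [(duplicate, original)])."""
    seen = {}
    keep, dupes = [], []
    for path, data in sorted(blobs.items()):
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            dupes.append((path, seen[digest]))
        else:
            seen[digest] = path
            keep.append(path)
    return keep, dupes

blobs = {
    "google/IMG_001.jpg": b"\xff\xd8 photo-bytes",
    "onedrive/holiday.jpg": b"\xff\xd8 photo-bytes",  # same bytes, renamed
    "apple/IMG_002.jpg": b"\xff\xd8 other-photo",
}
keep, dupes = dedupe(blobs)
print(keep)   # ['apple/IMG_002.jpg', 'google/IMG_001.jpg']
print(dupes)  # [('onedrive/holiday.jpg', 'google/IMG_001.jpg')]
```

Note this only finds byte-identical copies; re-compressed derivatives (e.g. what messaging apps save back) need a perceptual hash instead.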
Example: https://rclone.org/googlephotos/#limitations
Glaring example:
> The current google API does not allow photos to be downloaded at original resolution. This is very important if you are, for example, relying on "Google Photos" as a backup of your photos. You will not be able to use rclone to redownload original images. You could use 'google takeout' to recover the original photos as a last resort
I'm using docker compose to include some supporting containers like go-vod (for hardware transcoding), another nextcloud instance to handle push notifications to the clients, and redis (for caching). I can share some more details, foibles and pitfalls if you'd like.
I initiated a rescan last week, which stacks background jobs in a queue that gets called by cron 2 or 3 times a day. Recognize has been cranking through 10k-20k photos per day, with good results.
I've installed a desktop client on my dad's laptop so he can dump all of the family hard drives we've accumulated over the years. The client does a good job of clearing up disk space after uploading, which is a huge advantage in my setup. My dad has used the OneDrive client before, so he was able to pick up this process very quickly.
Nextcloud also has a decent mobile client that can auto-upload photos and videos, which I recently used to help my mother-in-law upload media from her 7-year-old iPhone.
I pay them for service/storage as it’s e2ee and it doesn’t matter to me if they or I store the encrypted blobs.
They also have a CLI tool you can run from cron on your NAS or whatever to make sure you have a complete local copy of your data, too.
https://ente.io - if you use the referral code SNEAK we both get additional free storage.
Gonna check the apps that you mentioned. Feel free to share more details of your set up. Why are you running 2 instances? Edit: I see, probably for the memories app.
A lot of existing tooling supports the s3 protocol, so it would simplify the storage picture (no pun intended).
This is exactly how I self-host Ente and it has been great.
Machine learning for image detection has worked really well for me, especially facial recognition of family members (it's easy to find that photo to share).
I have the client on my Android mobile, Fire tablet (via F-Droid), and my Windows laptop.
My initial motivation was to replace "cloud" storage for getting photos copied off the phone as soon as possible.
Stock NC gets you a very solid general purpose document management system and with a few addons, you basically get self hosted SharePoint and OneDrive without the baggage. The images/pictures side of things has seen quite a lot of development and with some addons you get image classification with fairly minimal effort.
The system as a whole will quite happily handle many hundreds of thousands of files on pretty rubbish hardware, if you are happy to wait for batch jobs to run, or you can throw more hardware at it and speed up the job schedules.
NC has a stock phone app which works very well these days, including camera folder uploads. There are several more apps that integrate with the main one to add optional functionality. For example notes and voip.
It is a very large and mature setup with loads of documentation and hence extensible by a determined hacker if something is missing.
I focused more on fast rendering in [photofield] (quick [explainer] if you're interested), but even the hacked up basic semantic search with CLIP works better than it has any right to. Vector DBs are cool, but what is cooler is writing float arrays to sqlite :)
[deepface]: https://github.com/serengil/deepface
[photofield]: https://github.com/SmilyOrg/photofield
[explainer]: https://lnar.dev/blog/photofield-origins/
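For what the "float arrays in sqlite" approach can look like in practice (a hedged sketch, not photofield's actual schema): pack each embedding into a BLOB with `struct`, then brute-force cosine similarity in Python at query time. This stays plenty fast up to tens of thousands of photos and needs nothing beyond the standard library.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Serialize a list of floats into a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    """Deserialize a float32 BLOB (4 bytes per value) back to a list."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (path TEXT, embedding BLOB)")
db.executemany(
    "INSERT INTO photos VALUES (?, ?)",
    [("beach.jpg", pack([0.9, 0.1])), ("forest.jpg", pack([0.1, 0.9]))],
)

def search(query, top_k=1):
    """Brute-force nearest-neighbour scan over every stored embedding."""
    rows = db.execute("SELECT path, embedding FROM photos").fetchall()
    rows.sort(key=lambda r: -cosine(query, unpack(r[1])))
    return [path for path, _ in rows[:top_k]]

print(search([1.0, 0.0]))  # ['beach.jpg']
```

Real CLIP embeddings are 512+ dimensions rather than 2, but the scheme is identical; float32 rounding on the way through the BLOB doesn't meaningfully change similarity rankings.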
If you need a detailed mask for editing in another application, Florence-2 or SAM. Or rembg for decent all-purpose one-shot removals, as long as you have a touch-up process or don't mind rerunning the failures.
I’m running a DS1813+. It’s stopped getting new feature updates. This approach lets me keep the storage running while migrating away the server components.
Given how good the new multimodal models are, I've been thinking it would be much better to just have a multimodal model describe the image, and let the searching be done by the already-included Meilisearch.
That said, due to reasons I haven't had time to mess with it past couple of months, so perhaps something drastic has changed.
I have an OpenMediaVault VM with a 10tb volume in the network that runs the S3 plugin (Minio-based) which is connected through Nextcloud's external storage feature (I want to migrate to Garage soon). I believe notify_push helps desktop clients cut down on the chatter when querying the external storage folder. Limiting the users that can access this also helps.
I was having issues getting the notify_push app [1] to work in the container with my reverse-proxy. I found some similar setups that did this [2], so I added another nextcloud container to the docker-compose yaml like so:
  notify_push:
    image: nextcloud
    restart: unless-stopped
    ports:
      - 7867:7867
    depends_on:
      - app
    environment:
      - PORT=7867
      - NEXTCLOUD_URL=http://<local ip address of docker server>:8081
    entrypoint: /var/www/html/custom_apps/notify_push/bin/x86_64/notify_push /var/www/html/config/config.php
    volumes:
      - /path/to/nextcloud/customapps:/var/www/html/custom_apps
      - /path/to/nextcloud/config:/var/www/html/config

[1] - https://apps.nextcloud.com/apps/notify_push
[2] - https://help.nextcloud.com/t/docker-caddy-fpm-notify-push-ca...
For features: I don't know why there isn't a tag for screen captures. I made lots of them and I want to group them together.
Also, my house is less secure than commercial data centers, so e2ee gives me greater peace of mind about data safety.
I must be wasting so much storage on the 4 photos I took in a row of the same family pose, or on derivatives that got shared on WhatsApp and then saved back to my gallery, and so on, and I know I'm not the only one.
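Those burst shots and WhatsApp re-compressions are exactly what a perceptual hash catches that exact hashing misses: downscale, compare adjacent pixels, and treat small Hamming distances between the resulting bit strings as "same photo". A sketch of difference hashing (dHash) on toy grayscale grids; real code would first resize with Pillow to a fixed size like 9x8, and the 3x2 grids here are just for illustration:

```python
def dhash(pixels):
    """Difference hash: one bit per horizontal neighbour comparison.
    Robust to re-compression because it only keeps bright/dark structure."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Toy 3x2 grayscale grids; a re-compressed copy shifts pixel values
# slightly but keeps the same bright/dark structure.
original = [[200, 80, 90], [60, 220, 10]]
recompressed = [[198, 83, 88], [62, 215, 14]]
unrelated = [[10, 240, 5], [250, 20, 200]]

print(hamming(dhash(original), dhash(recompressed)))  # 0  -> near-duplicate
print(hamming(dhash(original), dhash(unrelated)))     # 4  -> different photo
```

In practice a threshold of a few bits on a 64-bit dHash flags near-duplicates; grouping them for review (rather than auto-deleting) avoids losing the one frame where everyone's eyes were open.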
Personally I'd love a separate thing that could crawl the photos in a folder I point it to and then let me search using semantics and natural language. But can it please just be an exe I can double click when I need it? If it involves maintaining a server or faffing about with Docker I'm probably not going to bother.
Is it really that stable and flawless in terms of updates?
Because I'm sat here with ZFS, snapshotting and replication configured and wondering why people scare others off of it when the tools to mitigate issues are all free and should be used anyway as part of a bog-standard self-hosted stack.
The ball-ache of SQLite not scaling outweighs any "maintenance" Postgres needs (it really is just set-and-forget and use a Docker container to schedule database backups—whole thing takes a couple minutes).
While I really like it — snappy and encrypted — I was surprised by how much the missing Ultra HDR implementation affects me. Photos are currently uploaded with brightness information but not displayed with it. Therefore, my photos look great in Google Photos but far less vivid in Ente.
For what it's worth, I found a discussion about Ultra HDR. It doesn't seem to be a priority right now, though: https://github.com/ente-io/ente/discussions/779
I think you overestimate security of data centers.
At rest, you use full-disk encryption anyway, so the extra layer just makes things harder.
I also trigger all my updates manually. The process itself is fully automated (a simple script that runs in seconds across my entire home server), but I don't have it on any schedule, so I'm not doing anything blind. That at least affords me the luxury of being present if/when anything breaks (though for Immich that hasn't occurred yet).
Also, considering the type of workload: I imagine photo albums are write-heavy during imports but read-heavy afterwards, which SQLite should excel at. I'll mostly be syncing pictures from our phones, and it'll be just me and my wife using it. Postgres is overkill for my needs.
What about having to do db migrations across major updates?
edit: To explain further why it's almost always desirable:
You guarantee that you and your users' information is safe if the server is compromised, if an admin goes rogue, or if local bodies of power request their information from you.
The information can't be sent to third-parties by design.
Any operations / transformations that need to be applied to the information will have to either be done via homomorphic encryption or on the client-side (which is much more likely to be open source / easy-to-deobfuscate compared to blackbox server code).
E. g., “Any operations / transformations” includes facial recognition, CLIP embeddings, &c; you want to run this on the server, overnight, and to be able to re-run at a later date when new models become available. Under e2ee, that’s a round-trip through a client device at every model update. So that’s a significant downside, for no important upsides in the case when you and your family are the only users.
So my photo storage on my home server is getting filled with a bunch of useless images that I only have on my phone temporarily and that I end up deleting shortly after.
What happens if there’s a new, better model? You’d need to re-download, decrypt, and run inference on all your past media, which is in terabytes for many.
I understand the benefit of e2ee in a situation where there is no trust between user and admin. In personal self-hosting, that’s the same person (or family), and the upsides are not as relevant. The downsides (possibility of data loss for, e. g., kids who are not very good with passwords/keys; difficulties with updating models / thumbs; …) remain important, and outweigh the benefits, even assuming the e2ee is implemented well.
Once you get everything ingested and the initial classifications and clustering done, the process runs pretty quickly as you upload new photos.
I'm using OPNsense as the main firewall/router, with the HAProxy plugin acting as reverse-proxy. Cloudflare DNS proxies my home IP address and keeps it hidden from the public, and the DDNS plugin in OPNsense updates the A record in CF when my ISP changes my public IP address every few months.
edit: also feel like I'm echoing the classic dropbox comment, but self-hosting in a sane and secure manner is harder than it's made out to be. It needs to be taken seriously.
[0] https://proton.me/blog/data-recovery-end-to-end-encryption