5 points Wayve | 1 comments | | HN request time: 0.723s | source

We needed a speaker diarization solution that could run every few seconds alongside transcription on iOS and macOS. But native Swift support was either limited or locked behind paid licenses. Since diarization is a common need in speech-to-text workflows, we decided to open source our work and give back to the community.

We initially tried sherpa-onnx, which works, but running both diarization and transcription models slowed down older devices. CPU-only inference just isn’t ideal for near real-time workloads, so we wanted the option to offload segmentation and speaker embedding to the GPU or ANE. Supporting M1 Macs in particular meant pushing more of the workload to the ANE.

Instead of shoehorning the ONNX model into CoreML with C++, we converted the original PyTorch models directly to CoreML. This approach required some monkey-patching in the PyTorch and pyannote code, but the initial benchmarks look promising.

We’d love feedback! We're currently working on adding VAD and integrating Parakeet for transcription, but still wrestling with CoreML model conversion.