Anyway, I'm very glad you put in all that effort to make the JavaScript version work well. Working under limitations is sometimes cool. I remember having to figure out how PyTorch evaluates neural networks and converting a PyTorch neural network into Java code that could evaluate the model without any external libraries (it was very inefficient) for a Java coding competition. There may have been a better way, but what I did was good enough.
You might also be interested in Project Gameface, open source Windows and Android software for face input: https://github.com/google/project-gameface
I'm most probably wrong, but I wonder if it has anything to do with all the text being written to stdout. On the off chance that it happens on the same thread, it might be blocking.
Also, rewriting a neural network from PyTorch to Java sounds like a big task. I wonder if people are doing ML in Java.
The precision was always tricky, and while it was fun, I eventually abandoned the project and switched to face tracking and blinking so I didn't have to hold up my hand.
For some reason, the idea of pointing my webcam down never dawned on me. I then discovered Project Gameface and just started using that.
Happy programming, and thank you for the excellent write-up and read!
I've been thinking on and off about how to improve the forward-facing mode. Having the hand straight ahead of the camera messes with the readings; I think MediaPipe is trained on seeing the hand from above or below (and maybe from the sides), but not straight ahead.
Ideally, the camera should be roughly above the hand (pointing downwards) to get the best results. But in the current version of the downward-facing mode, you move the cursor by moving the hand around (the x and y position of the hand translates to the x and y of the cursor). If the camera FOV is very big (capturing from far away), then you have to move your hand very far to move the cursor, which is probably not ideal.
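For illustration, a minimal TypeScript sketch of that mapping (my own simplification; the normalized 0..1 hand coordinates match what MediaPipe reports, while screenW/screenH are made-up parameters):

    // Assumed mapping: the normalized hand position is scaled straight to
    // screen coordinates, so the usable range depends entirely on the camera FOV.
    type Point2 = { x: number; y: number }; // normalized [0, 1] from the tracker

    function handToCursor(hand: Point2, screenW: number, screenH: number): Point2 {
      return { x: hand.x * screenW, y: hand.y * screenH };
    }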
I later found an idea for improving this while playing around with a smart TV where the remote controls a cursor. You do that by tilting the remote up and down or left and right; I think it uses a gyroscope or accelerometer (I don't know which is which). I wish I had a video to show it better, but I don't. I think the same concept can be applied to the hand tracking here: use the tilt of the hand to control the cursor. That way we don't have to rely on the hand position captured by the camera, and it would work even if the camera is far away, since it only detects the hand's tilt. Still thinking about this.
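To make the tilt idea concrete, here is a rough sketch of how it could work with two MediaPipe hand landmarks (0 = wrist, 9 = middle finger MCP); the landmark choice, the gain, and the whole approach are my own guesses, not a working implementation:

    type Vec3 = { x: number; y: number; z: number };

    // Estimate how far the hand is tilted away from "pointing straight at the
    // camera" and turn that into a cursor velocity, like a TV-remote pointer.
    function tiltToCursorVelocity(wrist: Vec3, middleMcp: Vec3, gain = 30): { vx: number; vy: number } {
      // Vector from the wrist toward the base of the middle finger, i.e.
      // roughly the direction the hand is pointing.
      const dx = middleMcp.x - wrist.x;
      const dy = middleMcp.y - wrist.y;
      const dz = middleMcp.z - wrist.z;
      const len = Math.hypot(dx, dy, dz) || 1;
      // The x/y components of the unit vector are ~0 when the hand points
      // straight at the camera and grow as it tilts, so they act as a tilt
      // proxy that doesn't depend on where the hand sits in the frame or how
      // far away the camera is.
      return { vx: (dx / len) * gain, vy: (dy / len) * gain };
    }

Each frame the cursor would move by (vx, vy), probably with a small dead zone so a relaxed hand doesn't drift.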
Anyway, I'm glad you find the article interesting!
*: a lot of it. Plus, the tracking might be task-centered. I would not bet on general hand-gesture tracking with cheap sensors and Bayesian modelling only.
[0] https://tympanus.net/codrops/2024/10/24/creating-a-3d-hand-c...
[1] https://tympanus.net/Tutorials/webcam-3D-handcontrols/
[2] https://hackaday.com/2024/10/25/diy-3d-hand-controller-using... (DIY 3D hand controller)
Just because of the use case, and because I've wanted to use it in an AR app but haven't yet, I'd like to point to doublepoint.com's totally different but very well-working approach, where they trained a NN to interpret a Samsung Watch's IMU data to detect taps. They also added a mouse mode.
I think Google's OS also allows client BT mode for the device, so it can be paired directly as a HID, IIRC.
Not affiliated, but impressed by the funding they received :)
I think they have a camera-based wristband version now.
Still doesn't have any room positioning info though, AFAIK.
out = last_out * x + input * (1-x)
Where x is between zero and one; the closer to one, the more filtering you'll do. You can cascade these, too, to make a higher-order filter, which will work even better.
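A minimal TypeScript sketch of that filter and the cascade (class names and defaults are mine):

    // First-order exponential filter: out = last_out * x + input * (1 - x).
    class ExpFilter {
      private last: number | null = null;
      constructor(private x: number) {} // 0 <= x < 1; closer to 1 = heavier filtering
      update(input: number): number {
        const next = this.last === null ? input : this.last * this.x + input * (1 - this.x);
        this.last = next;
        return next;
      }
    }

    // Cascading several first-order stages gives a higher-order response.
    class CascadedFilter {
      private stages: ExpFilter[];
      constructor(x: number, order = 2) {
        this.stages = Array.from({ length: order }, () => new ExpFilter(x));
      }
      update(input: number): number {
        return this.stages.reduce((value, stage) => stage.update(value), input);
      }
    }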
see https://gery.casiez.net/1euro/ with plenty of existing implementations to pick from
[1] https://blog.google/technology/ai/google-project-gameface/
One suggestion for fixing the cursor drift during finger taps: instead of using the hand position, use the index finger for the cursor, then tap the middle finger to the thumb for selection. That doesn't change the cursor position, yet it's still a comfortable and easy-to-parse action.
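A hypothetical sketch of that selection gesture using MediaPipe hand landmark indices (4 = thumb tip, 12 = middle fingertip); the distance threshold is a placeholder that would need tuning:

    type Landmark = { x: number; y: number; z: number };

    // "Click" when the middle fingertip touches the thumb tip. The index
    // finger keeps driving the cursor, so the pointer doesn't move on click.
    function isSelecting(landmarks: Landmark[], threshold = 0.05): boolean {
      const thumbTip = landmarks[4];
      const middleTip = landmarks[12];
      const dist = Math.hypot(
        thumbTip.x - middleTip.x,
        thumbTip.y - middleTip.y,
        thumbTip.z - middleTip.z,
      );
      return dist < threshold;
    }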
And you are absolutely right regarding its use for getting the scale right. For my implementation, I actually just hardcoded the calibration values based on where I want the boundaries for the Z axis to be. I got those values from the readings, so in a way it's like a manual calibration. :D Having calibration is definitely the right idea; I just didn't want to overcomplicate things at the time.
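Something like this, with placeholder boundary values (not the ones actually hardcoded):

    // Clamp the raw Z reading to a hand-picked range and normalize it to 0..1.
    const Z_NEAR = -0.15; // raw reading at the closest position I care about (assumed)
    const Z_FAR = 0.05;   // raw reading at the farthest position (assumed)

    function normalizeZ(rawZ: number): number {
      const t = (rawZ - Z_NEAR) / (Z_FAR - Z_NEAR);
      return Math.min(1, Math.max(0, t));
    }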
BTW, I am a happy user of Exponent, thanks for making it! I am doing some courses and also peer mocks for interview prep!
Unrelated, but shoutout to bearblog. My first blog was on bearblog, which got me started writing, although I later ended up self-hosting my own blog.
I've succeeded in fully replacing the keyboard (I use Talon Voice) but find replacing the mouse tougher. I tried eye tracking but could never get it accurate enough not to be frustrating.
Anyhow, looking forward to trying your approach with MediaPipe. Thanks for the write-up and demo; inspirational.