250 points lewq | 20 comments
1. mark_l_watson ◴[] No.42138010[source]
After just spending 15 minutes trying to get something useful accomplished, anything useful at all, with the latest Apple Intelligence beta on an M1 iPad Pro (16 GB RAM), this article appealed to me!

I have been running the 32B-parameter qwen2.5-coder model on my 32 GB M2 Mac, and it is a huge help with coding.

The llama3.2-vision model does a great job processing screenshots. Small models like smollm2:latest can process a lot of text locally, very fast.

Open source front ends like Open WebUI are improving rapidly.

All the tools are lining up for do it yourself local AI.
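
To make that concrete, here is a minimal sketch of talking to a locally pulled Ollama model from Python over its local HTTP API (this assumes Ollama is running and the model tag, e.g. qwen2.5-coder:32b, has already been pulled; tags will vary by machine):

    # Minimal sketch: query a locally pulled Ollama model over its local HTTP API
    # (http://localhost:11434 by default). Assumes the Ollama daemon is running and
    # the model has been pulled, e.g. `ollama pull qwen2.5-coder:32b`.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5-coder:32b",  # swap in llama3.2-vision, smollm2, etc.
            "prompt": "Write a Python function that merges two sorted lists.",
            "stream": False,               # return one JSON object instead of a stream
        },
        timeout=300,
    )
    print(resp.json()["response"])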

The only commercial vendor right now that I think is doing a fairly good job at an integrated AI workflow is Google. Last month I had all my email directed to my Gmail account, and the Gemini Advanced web app did a really good job integrating email, calendar, and Google Docs. Job well done. That said, I am back to using ProtonMail and trying to build local AIs for my workflows.

I am writing a book on the topic of local, personal, and private AIs.

replies(5): >>42138175 #>>42139063 #>>42140813 #>>42141201 #>>42142652 #
2. mark_l_watson ◴[] No.42138175[source]
Another thought: OpenAI has done a good enough job productizing ChatGPT with advanced voice mode and now also integrated web search. I don’t know if I would trust OpenAI with access to my Apple iCloud data, Google data, my private GitHub repositories, etc., but given their history of effective productization, they could be a multi-OS/platform contender.

Still, I would really prefer everything running under my own control.

replies(1): >>42138585 #
3. alexander2002 ◴[] No.42138585[source]
Who can trust a company whose name contradicts its practices?
replies(3): >>42139599 #>>42141093 #>>42142101 #
4. zerop ◴[] No.42139063[source]
Have you tried RAG on Open WebUI? How does it do at answering questions from source docs?
replies(1): >>42139613 #
5. mark_l_watson ◴[] No.42139599{3}[source]
I don’t disagree with you!
6. mark_l_watson ◴[] No.42139613[source]
Not yet. It has ‘Knowledge sources’ that you can set up, and I think that supplies data for built-in RAG, but I won't be sure until I try it.
7. tracerbulletx ◴[] No.42140813[source]
I wrote a script to queue and manage running llama vision on all my images and write the results to a SQLite db used by my Media Viewer, and now I can do text or vector search on it. It's cool not to have to rely on Apple or Google to index my images while obfuscating how they do it from me. Next I'm going to work on a pipeline for more complex things: handling multiple frames in a video, and doing multiple passes with llama vision or other models to separate out the OCR, the description, and object/people recognition. Eventually I want to feed all of this into https://lowkeyviewer.com/ and be able to manually curate the automated classifications and text.
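
The core loop is roughly something like this (a simplified sketch rather than the actual script; the model tag, prompt, and schema here are illustrative, with Ollama serving the vision model locally):

    # Simplified sketch of the indexing loop: run a local vision model over each
    # image via Ollama's HTTP API and store the description in SQLite for search.
    import base64, pathlib, sqlite3
    import requests

    db = sqlite3.connect("media_index.db")
    db.execute("CREATE TABLE IF NOT EXISTS images (path TEXT PRIMARY KEY, description TEXT)")

    for path in pathlib.Path("photos").rglob("*.jpg"):
        b64 = base64.b64encode(path.read_bytes()).decode()
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2-vision",  # any locally pulled vision-capable model
                "prompt": "Describe this image, including any visible text.",
                "images": [b64],
                "stream": False,
            },
            timeout=600,
        )
        db.execute("INSERT OR REPLACE INTO images VALUES (?, ?)",
                   (str(path), resp.json()["response"]))
        db.commit()
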
replies(2): >>42140868 #>>42142973 #
8. Eisenstein ◴[] No.42140868[source]
I'm curious why you find descriptions of images useful for searching. I developed a similar flow and ended up embedding keywords into the image metadata instead. It makes them easily searchable and not tied to any databases, and it is faster (dealing with tens of thousands of images personally).

* https://github.com/jabberjabberjabber/LLavaImageTagger
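
For the metadata route, a minimal sketch of the idea (not how the linked tool is actually implemented) is to shell out to exiftool and write model-generated keywords into the files themselves:

    # Sketch of the metadata approach: write generated keywords into the image's
    # own IPTC/XMP tags with exiftool, so any file browser can search them without
    # a separate database. Keyword values and path are illustrative.
    import subprocess

    def tag_image(path: str, keywords: list[str]) -> None:
        args = ["exiftool", "-overwrite_original"]
        for kw in keywords:
            args += [f"-Keywords+={kw}", f"-XMP:Subject+={kw}"]
        subprocess.run(args + [path], check=True)

    tag_image("photos/beach.jpg", ["beach", "sunset", "dog"])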

replies(2): >>42140957 #>>42141607 #
9. tracerbulletx ◴[] No.42140957{3}[source]
It's not as good as tags, but it does pretty OK for now, especially since searching for specific text in an image is something I want to do a lot. I'm trying to get llama to output according to a user-defined tagging vocabulary/taxonomy and, ideally, to learn from manual classifications. Kind of a work in progress there.

This is the prompt I've been using.

"Create a structured list of all of the people and things in the image and their main properties. Include a section transcribing any text. Include a section describing if the image is a photo, comic, art, or screenshot. Do not try to interpret, infer, or give subjective opinions. Only give direct, literal, objective descriptions of what you see."

replies(1): >>42141061 #
10. Eisenstein ◴[] No.42141061{4}[source]
> I'm trying to work on getting llama to output according to a user defined tagging vocabulary/taxonomy and ideally learn from manual classifications. Kind of a work in progress there.

Good luck with that. The only thing I found that works is using a GBNF grammar to force it, which slows inference down considerably.
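
A stripped-down sketch of the GBNF approach with llama-cpp-python, where the tag vocabulary, model path, and prompt are all made up for illustration:

    # Sketch: force output to a fixed tagging vocabulary with a GBNF grammar via
    # llama-cpp-python. The tag list and model path are illustrative only.
    from llama_cpp import Llama, LlamaGrammar

    TAG_GRAMMAR = r'''
    root ::= tag ("," " "? tag)*
    tag  ::= "person" | "animal" | "screenshot" | "document" | "landscape"
    '''

    llm = Llama(model_path="models/llama-3.1-8b-instruct.gguf")  # hypothetical path
    grammar = LlamaGrammar.from_string(TAG_GRAMMAR)
    out = llm("Tags for: a man walking a dog on a beach at sunset.",
              grammar=grammar, max_tokens=32)
    print(out["choices"][0]["text"])  # constrained to the vocabulary above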

11. arcanemachiner ◴[] No.42141093{3}[source]
Not me. I learned that lesson after I tried to take a bite out of my Apple Macintosh.
replies(1): >>42141952 #
12. bboygravity ◴[] No.42141201[source]
Can llama 3.2 vision do things like "there's a textbox/form field at location (1000, 800) with the label 'address'"?

I did a quick and dirty prototype with Claude for this, but it returned everything with an offset and/or scaled.

Would be a killer app to be able to auto-fill any form using OCR.

replies(1): >>42147091 #
13. vunderba ◴[] No.42141607{3}[source]
I can't speak to the OP's decision, but I have a similar script set up that combines YOLO, BakLLaVA, Tesseract, etc., and puts the output, along with a URI reference to the image file, into a database.

I actually store the data in the EXIF as well, but the nice thing about having a database is that it's significantly faster than attempting to search hundreds of thousands of images across a nested file structure, particularly since I store a great deal of media on a NAS.
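
As a sketch of that tradeoff (schema, path, and values are illustrative), an SQLite FTS5 index turns "find every image mentioning X" into a single query instead of a crawl over the filesystem:

    # Sketch: a SQLite FTS5 index over generated descriptions/keywords. Searching
    # it is one indexed query rather than touching every file on the NAS.
    import sqlite3

    db = sqlite3.connect("media_index.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS images_fts USING fts5(path, description)")
    db.execute("INSERT INTO images_fts VALUES (?, ?)",
               ("nas/photos/2023/img_0142.jpg", "receipt, printed text: total 42.10"))

    for (path,) in db.execute("SELECT path FROM images_fts WHERE images_fts MATCH ?",
                              ("receipt",)):
        print(path)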

replies(1): >>42155171 #
14. blharr ◴[] No.42141952{4}[source]
Well, clearly the apple already has a bite taken out of it, so that's user error
15. sumedh ◴[] No.42142101{3}[source]
You are getting access to some of the best AI tools for free; by that definition, isn't that open?
16. honestAbe22 ◴[] No.42142652[source]
Open source frontend??? Wtf. Show your code. This post is bs. AI coding has not advanced software in any meaningful way and will make coders dumber
17. mark_l_watson ◴[] No.42142973[source]
nice!
18. MaxLeiter ◴[] No.42147091[source]
Were you using Claude's computer use mode? It can do this.
replies(1): >>42156005 #
19. Eisenstein ◴[] No.42155171{4}[source]
You wouldn't happen to have this on github or have some other way to share it? I am interested in seeing how you implemented it.
20. bboygravity ◴[] No.42156005{3}[source]
No, I used the regular Claude, which can also (somewhat) do this and, as far as I know, uses the same image-processing backend as "computer use" (source: a recent Anthropic CEO interview with Lex Fridman).

Computer use is also not very good at it (often mis-clicking for example).

I'm guessing this will work flawlessly within 6 months to a year or so, but it doesn't seem ready yet.