ArchiveBox is evolving: the future of self-hosted internet archives

(docs.sweeting.me)

We've been pushing really hard over the last 6mo to develop this release. I'd love to hear feedback from people who've worked on big plugin systems in the past, or anyone who's tried our betas!


rcarmo ◴[16 Oct 24 19:17 UTC] No.41862756[source]▶

>>41860909 (OP) #

This is nice. I'm actually much more excited about the REST API (which will let me do searches and pull information out, I hope) than the plugin ecosystem, since the last thing I need is for another tool to have a half-baked LLM integration -- I prefer to do that myself and have full control.

Being able to do RAG on my ArchiveBox is something that I have very much wanted to do for over a year now, and it might finally be within reach without my going and hacking at the archived content tree...

Edit: Just looked at the API schema at https://demo.archivebox.io/api/v1/docs.

No dedicated search endpoint? This looks like a HUGE missed opportunity. I was hoping to be able to query an FTS index on the SQLite database... Have I missed something?

replies(1): >>41862871 #

1. nikisweeting ◴[16 Oct 24 19:28 UTC] No.41862871[source]▶

>>41862756 #

The /cli/list endpoint is the search endpoint you're looking for. It provides FTS but I can make it clearer in the docs, thanks for the tip.

As for the AI stuff, don't worry: none of it touches core. It's all in an optional community plugin, only for those who want it.

I'm not personally a huge AI person but I have clients who are already using it and getting massive value from it, so it's worth mentioning. (They're doing some automated QA on thousands of collected captures and feeding results into spreadsheets.)

replies(2): >>41863131 #>>41863934 #

2. rcarmo ◴[16 Oct 24 19:52 UTC] No.41863131[source]▶

>>41862871 (TP) #

Thanks, I'll have a look.

My use for this is very different--I want to be able to use a specific subset of my archived pages (which is mostly reference documentation) to "chat" with, providing different LLM prompts depending on subset and fetching plaintext chunks as reference info for the LLM to summarize (and point me back to the archived pages if I need more info).
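The workflow described above could be sketched roughly like this. This is a minimal illustration, not the commenter's actual setup: the subset names, the prompt texts, and the helpers `split_into_chunks` and `build_prompt` are all hypothetical, and the call to the LLM itself is left out.

```python
from dataclasses import dataclass

# Hypothetical subset names and prompts, for illustration only.
PROMPTS = {
    "python-docs": "Answer using only the Python reference excerpts below.",
    "default": "Summarize the excerpts below and cite the archived page for each point.",
}


@dataclass
class Chunk:
    url: str   # archived page the excerpt came from
    text: str  # plaintext excerpt


def split_into_chunks(url: str, plaintext: str, max_chars: int = 500) -> list[Chunk]:
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: list[Chunk] = []
    current = ""
    for para in (p.strip() for p in plaintext.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(Chunk(url, current))
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(Chunk(url, current))
    return chunks


def build_prompt(subset: str, question: str, chunks: list[Chunk]) -> str:
    """Pick the prompt for the subset and attach numbered excerpts with their URLs."""
    system = PROMPTS.get(subset, PROMPTS["default"])
    refs = "\n\n".join(f"[{i}] {c.url}\n{c.text}" for i, c in enumerate(chunks, 1))
    return f"{system}\n\n{refs}\n\nQuestion: {question}"
```

Keeping the URL next to every excerpt is what lets the model point back to the archived page when more detail is needed.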

replies(1): >>41863431 #

3. nikisweeting ◴[16 Oct 24 20:21 UTC] No.41863431[source]▶

>>41863131 #

OK, that makes sense. I think ArchiveBox works as the first step in a pipeline there, with some other tool doing the LLM analysis and query stuff.

replies(1): >>41867361 #

4. sunshine-o ◴[16 Oct 24 21:17 UTC] No.41863934[source]▶

>>41862871 (TP) #

I have been using ArchiveBox recently and love it.

About search, one thing I haven't yet figured out how to do easily is to plug it into my SearXNG instance, as they only seem to support Elasticsearch, Meilisearch or Solr [0]

So this new plugin architecture will allow for a meilisearch plugin I guess (with relevancy ranking).

- [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...

replies(1): >>41864125 #

5. nikisweeting ◴[16 Oct 24 21:41 UTC] No.41864125[source]▶

>>41863934 #

Definitely doable! Search plugins are one of the first that I implemented.

We already provide Sonic, ripgrep, and SQLiteFTS as plugins, so adding something like Solr should be straightforward.

Check out the existing plugins to see how it's done: https://github.com/ArchiveBox/ArchiveBox/pull/1534/files?fil...

archivebox/plugins_search/sonic/*

6. rcarmo ◴[17 Oct 24 07:51 UTC] No.41867361{3}[source]▶

>>41863431 #

Yep. That's what I've built for myself, I just can't really get at the data inside ArchiveBox until I upgrade.

replies(1): >>41878679 #

7. pbronez ◴[18 Oct 24 12:16 UTC] No.41878679{4}[source]▶

>>41867361 #

How did you build it?

I can imagine an architecture where I throw everything into ArchiveBox, then run VectorDB as a plugin with Gradio or some such as the client.

https://vectordb.com/

replies(1): >>41886444 #

8. rcarmo ◴[19 Oct 24 08:17 UTC] No.41886444{5}[source]▶

>>41878679 #

You're overcomplicating things. You don't need a vector database; FTS works just as well for non-homogeneous content.
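To make the FTS-only approach concrete, here is a small sketch using SQLite's FTS5 through Python's standard `sqlite3` module. It is an illustration only: it assumes the local SQLite build ships with FTS5 enabled, it is not ArchiveBox's SQLiteFTS plugin, and the table name `pages` and the helpers `build_index` and `search` are made up for the example.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory FTS5 index from (url, text) pairs."""
    db = sqlite3.connect(":memory:")
    # url is stored but not indexed; only body is searchable.
    db.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, body)")
    db.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", pages)
    return db


def search(db, query, limit=3):
    """Return (url, snippet) rows, best match first.

    `query` uses FTS5 query syntax; ordering by `rank` gives FTS5's default
    BM25 relevance ordering. Column 1 (body) is used for the snippet.
    """
    return db.execute(
        "SELECT url, snippet(pages, 1, '[', ']', '...', 12) "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

The returned snippets can be passed to an LLM as reference text along with their URLs, with no embedding step involved.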
