Show HN: Semlib – Semantic Data Processing

(github.com)

59 points anishathalye | 1 comments | 15 Sep 25 13:45 UTC | HN request time: 0s | source

Show context

Y_Y ◴[15 Sep 25 14:34 UTC] No.45250210[source]▶

  >>> await sort(presidents, by="right-leaning")
  ['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']

Is this supposed to be impressive? GIGO, if you want to vibe-classify your data then go right ahead, but I hope nobody serious relies on it.

replies(3): >>45250378 #>>45250441 #>>45251590 #

anishathalye ◴[15 Sep 25 14:55 UTC] No.45250441[source]▶

>>45250210 #

That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also some related academic work in this area that also talks about applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).

replies(1): >>45250664 #

Y_Y ◴[15 Sep 25 15:15 UTC] No.45250664[source]▶

>>45250441 #

As you compose fuzzy operations your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.

replies(1): >>45250927 #

anishathalye ◴[15 Sep 25 15:33 UTC] No.45250927[source]▶

>>45250664 #

I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README, that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work, things have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow-up with authors of papers via email etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). Appreciate you sharing your feelings; I stripped the text from the README; if you have additional suggestions, would appreciate your comments or a PR.

replies(2): >>45251839 #>>45252821 #

Y_Y ◴[15 Sep 25 16:40 UTC] No.45251839{4}[source]▶

>>45250927 #

Thank you for engaging with me so politely and constructively. I care probably more than I should about good science and honesty in academia, and so I feel compelled to push back in cases where I see things like: blatant overstating of capabilities, artificially farming citations.

This case seems to have been a false positive. Surely people will misuse your tool,but that's not your responsibility, as long as you haven't mislead them to begin with. Good luck with the project, I hope to someday need to cite the software myself.