E.g. sorting 2^23 random 64-bit integers: qsort: 850ms, custom radix sort: 250ms, ksort.h: 582ms, np.sort: 107ms (including PyArray_SimpleNewFromData and PyArray_Sort). NumPy uses Intel's x86-simd-sort there, I believe.
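For reference, a rough Python-level reproduction of the np.sort measurement (a sketch only; the C-API calls above are from the original benchmark, and timings obviously vary by machine):

    import time
    import numpy as np

    # 2^23 random 64-bit integers, as in the benchmark above
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2**63, size=1 << 23, dtype=np.int64)

    start = time.perf_counter()
    np.sort(a)  # reportedly dispatches to x86-simd-sort on supported CPUs
    print(f"np.sort: {(time.perf_counter() - start) * 1000:.0f} ms")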
E.g. inserting 8M entries into a hash table (random 64-bit keys and values): MSI-style hash table: ~100ns avg insert/lookup, cc_map: ~95ns avg insert/lookup, Python.h: 91ns insert, 60ns lookup
I'm curious if OP's tool might fit in similarly. I've found lmdb to be quite slow even in tmpfs with no sync, etc.
It does look like Python's comprehensions would be a better choice if you're writing them by hand anyway.
https://github.com/bytecodealliance/wasmtime/tree/main/examp...
I am always in favor of declarative approaches where applicable. But whenever they are embedded in this way, you get a static-analysis barrier and a possible mismatch between the imperative and the declarative code: change a return type or a field on the declarative side, and no error comes up in the surrounding code.
A positive example is VerbalExpressions in Java, which only allows expressing valid regular expressions; every invalid regular expression is inexpressible in valid Java code. jOOQ is another example: it makes incorrect (even incorrectly typed) SQL inexpressible in Java.
I know Python is a bit different, as there is no extensive static analysis in the compiler, but we do have a lot of static analysis tools for Python that could be valuable. A statically type-safe query is a wonderful thing for safety and maintainability, and we do have good type checkers for Python.
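To make that concrete, here is a minimal sketch of the valid-by-construction idea in Python. The Column/Condition names are hypothetical, not taken from VerbalExpressions or jOOQ, but a type checker like mypy would flag misuse the same way:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Condition:
        sql: str

        def and_(self, other: "Condition") -> "Condition":
            return Condition(f"({self.sql}) AND ({other.sql})")

    @dataclass(frozen=True)
    class Column:
        name: str

        def lt(self, value: int) -> Condition:
            return Condition(f"{self.name} < {value}")

        def contains(self, value: str) -> Condition:
            return Condition(f"{self.name} LIKE '%{value}%'")

    NAME = Column("name")
    AGE = Column("age")

    # A type checker rejects AGE.lt("20") or NAME.contains(5), and there is
    # no way to construct a syntactically broken condition at all.
    query = NAME.contains("k").and_(AGE.lt(20))
    print(query.sql)  # (name LIKE '%k%') AND (age < 20)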
{'name__contains':"k", "age__lt":20}
Kind of tangential to this package, but I've always loved this filter query syntax. Does it have a name? I first encountered it in Django ORM, and then in DRF, which has them as URL query params. I have recently built a parser for this in JavaScript to use it on the frontend. Does anyone know any JS libraries that make working with this easy? I'm thinking parsing and offering some kind of database-agnostic marshaling API. (If not, I might have to open-source my own code!)
Q(name=contains('k'))
It is not particularly more complex to write, and it is certainly more composable, extensible, and checkable. Alternatively, go full eval and do
Q("'k' in name")
I recently started using this pattern for pytest equality assertions, as pytest helpfully produces a detailed diff on mismatch. It's not perfect, as pytest doesn't always produce a correct diff with this pattern, but it's better than some alternatives.
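A minimal sketch of that pattern, assuming a matcher object whose __eq__ does the loose comparison (my reconstruction, not a pytest API):

    class Contains:
        # Compares equal to any string containing the given substring
        def __init__(self, substring):
            self.substring = substring

        def __eq__(self, other):
            return isinstance(other, str) and self.substring in other

        def __repr__(self):
            return f"Contains({self.substring!r})"

    def test_user():
        user = {"name": "Mike", "age": 19}
        # On failure, pytest prints a structured diff of the two dicts
        assert user == {"name": Contains("k"), "age": 19}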
I have never been a huge fan of Python (Lisp person) but I really appreciate how concise Python can be, and the dynamic nature of Python allows the nice query syntax.
A significant advantage is that you can just pass an inline lambda.
Of course, if you use the same syntax for Python lists of dicts, you don’t need any library at all.
If you can run static analysis on that you can run static analysis on string literals. Much like how C will give you warnings for mismatched printf arguments.
- Peter Norvig
It has been a while since I studied his Lisp code, but I watch for new Python studies he releases.
Here's the usage example from the README:
from leopards import Q
l = [{"name":"John","age":"16"}, {"name":"Mike","age":"19"},{"name":"Sarah","age":"21"}]
filtered= Q(l,{'name__contains':"k", "age__lt":20})
print(list(filtered))
Versus: [x for x in l if ('k' in x['name'] and int(x['age']) < 20)]
Outputs: [{'name': 'Mike', 'age': '19'}]
Also from the README: > Even though age was a str in the dict, as the value in the query dict was an int, Leopards converted the value in the dict automatically to match the query data type. This behaviour can be stopped by passing False to the convert_types parameter.
I don't like this default behavior.
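Based on the convert_types parameter the README mentions, opting out presumably looks something like this (I haven't verified the exact signature):

    from leopards import Q

    l = [{"name": "Mike", "age": "19"}]
    # With conversion disabled, "19" stays a str and the int bound is not coerced
    filtered = Q(l, {"age__lt": 20}, convert_types=False)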
There's a reason the term "stringly-typed" is used as a criticism of a language.
If you're interested in a simple solution to query a list with SQL including vector similarity, check this out: https://gist.github.com/davidmezzetti/f0a0b92f5281924597c9d1...
filtered = Q(l, {"name__contains": "k", "age__lt": 20})
Versus: filtered = [x for x in l if ('k' in x['name'] and int(x['age']) < 20)]
I don't have a strong background in static analysis.
I've seen some stuff based on tree-sitter that seems to be prompting a revival of the idea, but it still has fundamental issues. For example, if I'm embedding SQL in Python:
sql = "SELECT * FROM table "
if arbitrarilyComplicatedCondition:
    sql += "INNER JOIN a AS joined ON table.thing = a.id "
else:
    sql += "INNER JOIN b AS joined ON table.thing = b.id "
sql += "WHERE joined.
and if you imagine trying to write something to autocomplete at the point I leave off, you're fundamentally stuck: you can't know which table to autocomplete with. It doesn't matter what tech you swing at the problem, since analyzing "arbitrarilyComplicatedCondition" is basically Turing-complete (which I will prove by vigorous handwave here, because turning that into a really solid statement would be much larger than this entire post, but it can be done). And it's not just autocomplete; it's any analysis you may want to do on the embedded content. This is a quick, simple example, too; real cases get arbitrarily complicated, quickly. It's one of those things that seems easy when you picture the simple case, then immediately explodes with all the complexity your mind's eye was ignoring once you try to bring it into the real world.
https://arrow.apache.org/overview/
> Apache Arrow solves most discussed problems, such as improving speed, interoperability, and data types, especially for strings. For example, the new string[pyarrow] column type is around 3.5 times more efficient. [...] The significant achievement here is zero-copy data access, mapping complex tables to memory to make accessing one terabyte of data on disk as fast and easy as one megabyte.
https://airbyte.com/blog/pandas-2-0-ecosystem-arrow-polars-d...
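To see the string[pyarrow] dtype from the quote in action (requires pandas 2.x with pyarrow installed; the exact savings vary with the data):

    import pandas as pd

    words = ["John", "Mike", "Sarah"] * 100_000

    object_backed = pd.Series(words)                          # classic object dtype
    arrow_backed = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed strings

    print(object_backed.memory_usage(deep=True))
    print(arrow_backed.memory_usage(deep=True))  # noticeably smaller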
That said, it's trivial to apply multiple filter lambdas in one pass -- the most natural way is a comprehension.
Still, you might be surprised by how fast filter(cond_1, filter(cond_2, data)) actually is. The OP didn't present that performance comparison, nor do I see any reason given to avoid comprehensions.
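Concretely, the two shapes being compared; filter is lazy, so the chained version also makes a single pass over the data, just with extra per-item call overhead:

    def cond_1(x):
        return "k" in x["name"]

    def cond_2(x):
        return int(x["age"]) < 20

    data = [{"name": "John", "age": "16"}, {"name": "Mike", "age": "19"}]

    one_pass = [x for x in data if cond_1(x) and cond_2(x)]
    chained = list(filter(cond_1, filter(cond_2, data)))

    assert one_pass == chained  # both: [{'name': 'Mike', 'age': '19'}]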
{"name" (includes? "k"), "age" (< 20)}
into {"name" #(includes? % "k"), "age" #(< % 20)}
which is the same as {"name" (fn [name] (includes? name "k")), "age" (fn [age] (< age 20))}
Then have another macro that converts that into the pattern matching code, or maybe there's already something in the standard library. You could serialize the patterns using EDN as a substitute for JSON.
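The same predicate-map idea translates directly to Python without macros; a small sketch (the names are mine, not from any library):

    def matches(pattern, record):
        # pattern maps field names to one-argument predicates
        return all(pred(record[field]) for field, pred in pattern.items())

    pattern = {"name": lambda v: "k" in v, "age": lambda v: int(v) < 20}
    people = [{"name": "John", "age": "16"}, {"name": "Mike", "age": "19"}]
    print([p for p in people if matches(pattern, p)])  # [{'name': 'Mike', 'age': '19'}]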
Fun stuff.
I wrote something [similar][1] in JavaScript. With that, it would be:
const is_k_kid = tisch.compileFunction((etc) => ({
'name': name => name.includes('k'),
'age': age => age < 20,
...etc
}));
const result = input.filter(is_k_kid);
Yes, "...etc" is part of the DSL.