(dynomight.net)

488 points MinimalAction | 1 comments | 15 May 25 16:05 UTC

jamesblonde ◴[15 May 25 19:48 UTC] No.43998616[source]▶

In data for ML, everything has switched from NumPy (Pandas) to Arrow (Polars, DuckDB, Spark, Pandas 2.x, etc.). However, Scikit-Learn is still a holdout, so it's Arrow from your data sources all the way to the pre-processing pipelines in Scikit-Learn, where you have to go back to NumPy. In practice, it now makes more sense to separate feature pipelines in Arrow from training pipelines with Pandas/NumPy and Scikit-Learn.*

*This is ML, not Deep Learning or Transformers.

replies(1): >>44008302 #

1. kccqzy ◴[16 May 25 18:10 UTC] No.44008302[source]▶

>>43998616 #

Most Arrow arrays can be transformed into numpy arrays in a zero-copy manner. And having used both, I personally think Arrow is way more buggy than numpy: PyArrow segfaults for me about once a month when writing pure Python; numpy never segfaulted on me.

I don't like NumPy