I don't care so much about the memory and CPU stuff; I mostly leave the heavy lifting to an SQL engine.
Although the null handling seems very compelling, I guess it comes at the cost of incompatibility with existing libraries; otherwise Pandas would have implemented it as well?
If you mean whether I run it in a distributed fashion à la Spark, then no. If you mean whether I test it on various machines with different RAM sizes, then yes.
> I don't care so much about the memory and CPU stuff; I mostly leave the heavy lifting to an SQL engine.
Well, I care. Both pandas and polars are, in my view, single-machine dataframe libraries, so the memory and CPU constraints are rather stringent.
My comparison is based solely on my own experience: reading CSV files that are 20% to 50% of the size of RAM, pandas takes 2 to 10 minutes (or errors out), while polars finishes in 20 seconds. Queries in pandas are almost always slower than in polars.
But reading your comment, it seems you and I have different use cases for dataframe libraries, which is fine. I mostly use them for exploratory analysis, so the SQL API is not much of a plus to me, but the performance is.
When using Pandas appropriately, that is, with method chaining, lambda expressions (instead of intermediate assignments), and pyarrow dtypes, you also get much better speed and proper null handling.
This irritates me. What was the point of THEIR comment? To be cunty? It absolutely was NOT a productive comment. They were being an ass. And you’re defending them being an ass, asking me who I am like I’m somehow not allowed to point out someone being a cunt unless I have some type of status symbol of which you need to approve. At this point I can only fathom you’re a coworker or friend of theirs, hence the defense. Nothing else makes sense.
- Consistent Expression and SQL-like API.
- Lazy execution mode, where queries are compiled and optimized before running.
- Sane NaN and null handling.
- Much faster.
- Much more memory efficient.