This issue is both a feature proposal and a question.
Currently, the process from a SQL query (string) to a computed dataframe is as follows:

1. Use Calcite to parse the SQL and create the initial, non-optimized relational algebra.
2. Optimize with Calcite using rule-based optimizations, resulting in a plan of `Logical*` nodes. These relational plans are quite generic and know nothing about dask, distributed processing, or the data.
3. Transform the relational algebra nodes one by one into dask API calls (see the sketch after this list).
4. Let dask run another round of optimizations on the resulting task graph and execute it.
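To make step 3 concrete, here is a minimal sketch of that one-by-one translation. `RelNode`, the `attrs` layout, and `rel_to_dask` are all invented for illustration and are not the actual dask-sql classes; the point is only to show the shape of the translation:

```python
from dataclasses import dataclass, field

import dask.dataframe as dd
import pandas as pd


@dataclass
class RelNode:
    """Minimal stand-in for a Calcite relational algebra node (illustrative only)."""
    kind: str                                  # e.g. "LogicalTableScan", "LogicalFilter"
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)


def rel_to_dask(rel: RelNode, tables: dict) -> dd.DataFrame:
    """Translate one Logical* node at a time into dask API calls (step 3 above)."""
    if rel.kind == "LogicalTableScan":
        return tables[rel.attrs["table"]]
    if rel.kind == "LogicalFilter":
        df = rel_to_dask(rel.inputs[0], tables)
        # In reality the condition is a RexNode tree; a plain callable
        # stands in for the compiled expression here.
        return df[rel.attrs["condition"](df)]
    if rel.kind == "LogicalProject":
        df = rel_to_dask(rel.inputs[0], tables)
        return df[rel.attrs["columns"]]
    raise NotImplementedError(f"No converter for {rel.kind}")


# Usage, roughly "SELECT x FROM data WHERE x > 2":
data = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
plan = RelNode(
    "LogicalProject",
    inputs=[RelNode(
        "LogicalFilter",
        inputs=[RelNode("LogicalTableScan", attrs={"table": "data"})],
        attrs={"condition": lambda df: df["x"] > 2},
    )],
    attrs={"columns": ["x"]},
)
result = rel_to_dask(plan, {"data": data}).compute()
```

Note that each node is translated in isolation, which is exactly why no cross-node, distribution-aware decisions can happen at this stage.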
While this works quite well and is also used elsewhere (e.g. by blazingSQL), there are at least two problems with it:

1. Certain optimizations are only possible if you know that distributed processing is involved: you can always decide whether to do something in parallel or sequentially, and you need to decide when to shuffle (or whether shuffling is needed at all). An example SQL query would be `SELECT ROW_NUMBER() OVER (...), 1 + SUM(x) OVER (...) FROM data`: it is much more efficient to do the shuffling needed for the two windows only once and then add the `1` to one of the columns in place. Currently, however, there is no way to know this, so the two window functions are calculated separately (see the sketch after this list).
2. Calcite does not know anything about the data, which means there cannot be any cost-based optimizations (which typically lead to much better results).
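To illustrate the first problem with the window example: a planner that knows about distribution could emit a single shuffle shared by both window expressions and keep everything else partition-local. A hand-written dask sketch of that target plan, assuming both `OVER` clauses partition by the same (hypothetical) column `key`:

```python
import dask.dataframe as dd
import pandas as pd


def both_windows(part: pd.DataFrame) -> pd.DataFrame:
    """Evaluate both window functions on already-shuffled partitions."""
    def per_key(g: pd.DataFrame) -> pd.DataFrame:
        g = g.sort_values("x")
        g["row_number"] = range(1, len(g) + 1)   # ROW_NUMBER() OVER (...)
        g["one_plus_sum"] = 1 + g["x"].sum()     # 1 + SUM(x) OVER (...) is just
        return g                                 # a cheap in-place addition
    return part.groupby("key", group_keys=False).apply(per_key)


df = dd.from_pandas(
    pd.DataFrame({"key": [1, 1, 2, 2], "x": [3.0, 1.0, 4.0, 2.0]}),
    npartitions=2,
)
# The single shuffle both OVER (PARTITION BY key ...) clauses share --
# after it, each key lives entirely in one partition, so both windows
# are purely partition-local:
result = df.shuffle(on="key").map_partitions(both_windows).compute()
```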
One possibility to solve these problems, and thereby maybe boost the speed, would be to bring distributed operations like "Shuffle" into Calcite, let it decide when to shuffle, and simply copy its decision to dask. The resulting optimized plan would then look similar to e.g. Spark's physical plan output. I was hoping we could "copy" this procedure from another Apache Calcite project and then just execute its optimized physical plan, but so far I have not been able to find a good project to borrow from.
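Sketched in Python, "bring the distributed operations to Calcite and copy its decision to dask" could mean something like the following: the optimizer emits explicit physical nodes (scan, shuffle/exchange, partition-local compute), and the dask side merely replays them instead of making placement decisions itself. All class names are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, List

import dask.dataframe as dd
import pandas as pd


@dataclass
class PhysicalOp:
    """Base class for hypothetical distribution-aware physical plan nodes."""
    inputs: List["PhysicalOp"] = field(default_factory=list)


@dataclass
class DaskTableScan(PhysicalOp):
    table: dd.DataFrame = None

    def execute(self) -> dd.DataFrame:
        return self.table


@dataclass
class DaskShuffle(PhysicalOp):
    key: str = ""

    def execute(self) -> dd.DataFrame:
        # The explicit exchange node: the *planner* decided that exactly
        # one shuffle on `key` is needed; dask merely replays it.
        return self.inputs[0].execute().shuffle(on=self.key)


@dataclass
class DaskMapPartitions(PhysicalOp):
    func: Callable = None

    def execute(self) -> dd.DataFrame:
        # Everything downstream of the shuffle is partition-local.
        return self.inputs[0].execute().map_partitions(self.func)


# Replaying a plan shaped like Spark's "Exchange -> Window -> Project":
data = dd.from_pandas(
    pd.DataFrame({"key": [1, 2, 1], "x": [1.0, 2.0, 3.0]}), npartitions=2
)
plan = DaskMapPartitions(
    inputs=[DaskShuffle(inputs=[DaskTableScan(table=data)], key="key")],
    func=lambda part: part.assign(
        sum_x=part.groupby("key")["x"].transform("sum")
    ),
)
result = plan.execute().compute()
```

The difference to today's translation is that the shuffle appears as a first-class plan node, so the optimizer can reason about and deduplicate exchanges, much like Spark does with its `Exchange` operator.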