Optimization: Go from general logical plans to dask-specific plans? #183

nils-braun opened this issue May 20, 2021 · 1 comment

nils-braun (Collaborator) commented May 20, 2021

This issue is both a feature proposal and a question.

Currently, the process from a SQL string to a computed dataframe is as follows:

  1. Use Calcite for SQL parsing and for creating the (not yet optimized) relational algebra
  2. Optimize with Calcite using rule-based optimizations, resulting in a plan of Logical* nodes. Those relational plans are quite generic and know nothing about dask, distributed processing, or the data
  3. Transform each relational algebra node one-by-one into dask API calls
  4. Let dask do another round of optimizations on the dask task graph and execute it
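
For orientation, here is a minimal sketch of what these four steps look like from the user's side, assuming a small pandas-backed table named "data"; Context.sql() internally covers steps 1-3, and dask itself performs step 4 on compute():

```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), npartitions=2)
c.create_table("data", df)

# Steps 1-3: Calcite parses the SQL, applies its rule-based optimizations
# to the Logical* plan, and each node is translated into dask API calls.
result = c.sql("SELECT SUM(x) AS total FROM data")

# Step 4: dask runs its own graph optimizations and executes the result.
print(result.compute())
```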

While this works quite well and is also used e.g. at blazingSQL, there are at least three problems with it:

  • Certain optimizations are only possible if you know that distributed processing is involved: basically, you can always decide whether to do something in parallel or sequentially, and you need to define when to shuffle (or whether shuffling is needed at all). An example SQL query would be SELECT ROW_NUMBER() OVER (...), 1 + SUM(x) OVER (...) FROM data. It is much more efficient to do the shuffling needed for the two windows only once and then just add the 1 to one of the columns in place (see the sketch after this list). Currently, however, there is no way to detect this, and the two window functions are calculated separately.
  • Calcite does not know about the data, which means there cannot be any cost-based optimizations (which typically lead to much better results)
  • We cannot properly implement predicate pushdown on data sources that support it (e.g. Hive+parquet, see "Use the correct hive partition type information (hacky solution)" gallamine/dask-sql#1)
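
To make the first bullet concrete, here is a hand-written sketch (plain dask, not dask-sql internals) of how the window example could share a single shuffle, assuming both windows partition by the same hypothetical key "k":

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"k": [1, 1, 2, 2], "x": [10, 20, 30, 40]}), npartitions=2
)

# Shuffle once, so that each value of the partition key ends up in a
# single partition ...
shuffled = ddf.shuffle(on="k")

# ... then compute both window functions partition-locally on that one
# shuffled dataframe instead of shuffling twice.
def both_windows(part: pd.DataFrame) -> pd.DataFrame:
    part = part.copy()
    part["row_number"] = part.groupby("k").cumcount() + 1
    # "1 + SUM(x) OVER (...)": the addition is a cheap in-place column op
    part["one_plus_sum"] = 1 + part.groupby("k")["x"].transform("sum")
    return part

result = shuffled.map_partitions(both_windows)
```

For the predicate pushdown bullet, the natural target on the dask side would be something like the filters= argument of dask.dataframe.read_parquet, which can skip row groups or partitions entirely; the missing piece is a planner that knows this option exists.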

One possibility to solve these problems, and thereby maybe boost the speed, would be to bring distributed operations like "Shuffle" etc. into Calcite, let it decide when to shuffle, and then just copy its decision over to dask. The resulting optimized plan would then look similar to e.g. Spark's output. I was hoping that we could maybe "copy" the procedure from other Apache Calcite projects and just execute their optimized physical plan, but so far I have not been able to find a good project.
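
Purely as an illustration of what such a dask-aware plan could carry, here is a hypothetical sketch of physical plan nodes with explicit shuffle decisions; none of these classes exist in dask-sql or Calcite, and all the names are made up:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhysicalPlan:
    """Base class for (hypothetical) dask-aware physical plan nodes."""
    children: List["PhysicalPlan"] = field(default_factory=list)

@dataclass
class TableScan(PhysicalPlan):
    table: str = ""
    pushed_filters: List[str] = field(default_factory=list)  # predicate pushdown

@dataclass
class Shuffle(PhysicalPlan):
    keys: List[str] = field(default_factory=list)  # explicit shuffle decision

@dataclass
class WindowAgg(PhysicalPlan):
    functions: List[str] = field(default_factory=list)

# Both window functions hang off one Shuffle node: the planner has already
# decided to shuffle only once, and dask only needs to copy that decision.
plan = WindowAgg(
    children=[Shuffle(children=[TableScan(table="data")], keys=["k"])],
    functions=["ROW_NUMBER()", "1 + SUM(x)"],
)
```

The point of the explicit Shuffle node is that the decision is made during planning, where cost information and data source capabilities (such as pushed-down filters on the TableScan) could also live.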

nils-braun (Collaborator, Author) commented:

I am actively looking for people with Calcite (or any similar project) experience :-)
