Implement all (non-window) aggregation functions #9

Open
23 of 27 tasks
nils-braun opened this issue Aug 27, 2020 · 4 comments
nils-braun commented Aug 27, 2020

The LogicalAggregatePlugin currently cannot handle all aggregation functions.

From the Calcite documentation, here are all aggregation functions that Calcite understands:

  • ANY_VALUE
  • APPROX_COUNT_DISTINCT
  • AVG
  • BIT_AND
  • BIT_OR
  • BIT_XOR
  • COUNT
  • COVAR_POP (implemented via REGR_COUNT)
  • COVAR_SAMP (implemented via REGR_COUNT)
  • EVERY
  • MAX
  • MIN
  • REGR_COUNT
  • REGR_SXX
  • REGR_SYY
  • SINGLE_VALUE
  • STDDEV
  • STDDEV_POP
  • STDDEV_SAMP
  • SUM
  • VAR_POP
  • VAR_SAMP
  • VARIANCE

Operations on collection types:

  • COLLECT
  • FUSION
  • INTERSECTION
  • LISTAGG
@nils-braun nils-braun changed the title Implement more aggregation functions Implement all (non-window) aggregation functions Sep 24, 2020
@rajagurunath
Collaborator

Hi @nils-braun, I tried implementing some of the functions from the list above, for example:

  • REGR_COUNT(y, x) = count of rows where both x and y are not null
  • REGR_SXX(y, x) = REGR_COUNT(y, x) * VAR_POP(x)
  • REGR_SYY(y, x) = REGR_COUNT(y, x) * VAR_POP(y)
  • REGR_SXY(y, x) = REGR_COUNT(y, x) * COVAR_POP(y, x)
  • COVAR_POP(y, x) = (SUM(x*y) - SUM(x) * SUM(y) / COUNT(*)) / COUNT(*)
    (equivalently: SUM((y - avg_y) * (x - avg_x)) / REGR_COUNT)
  • COVAR_SAMP(y, x) = SUM((y - avg_y) * (x - avg_x)) / (REGR_COUNT - 1)
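The identities above can be checked in plain pandas (a minimal sketch with made-up data; the column names and values are not from the issue):

```python
import pandas as pd

# hypothetical data, used only to verify the identities
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, None], "y": [2.0, 4.0, 6.0, 8.0]})

# REGR_COUNT: rows where both x and y are non-null
both = df["x"].notnull() & df["y"].notnull()
regr_count = int(both.sum())

x, y = df.loc[both, "x"], df.loc[both, "y"]

# VAR_POP / COVAR_POP restricted to the jointly non-null rows
var_pop_x = ((x - x.mean()) ** 2).sum() / regr_count
covar_pop = ((x - x.mean()) * (y - y.mean())).sum() / regr_count

# the identities from the comment
regr_sxx = regr_count * var_pop_x
regr_sxy = regr_count * covar_pop
```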

One thing I observed while implementing REGR_COUNT: for most of the other functions, Calcite produced a plan in terms of already existing functions like SUM, COUNT, etc.

Some doubts:
I got a chance to go through the code (rel.logical.aggregate) and observed that if a function has more than one input, dask-sql raises NotImplementedError("Can not cope with more than one input"). I tried registering a function that needs two inputs as an aggregation function in Dask using dd.Aggregation, and I expected the chunk function to get a DataFrame (with two columns), but instead it receives only a pandas Series (one column). How can I get a two-column DataFrame in the Dask chunk function, or do we need to implement these types of aggregation functions differently? Did I misunderstand the problem, or am I missing something here? Any help would be really appreciated.

```python
import pandas as pd
import dask.dataframe as dd

def map_regr_count(df):
    # intended: count rows where all input columns are non-null
    return pd.Series(df.apply(lambda x: x.notnull().all()).sum())

def reduce_sum(series):
    return series.sum()

regr_count = dd.Aggregation("regr_count", chunk=map_regr_count, agg=reduce_sum)
```
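For what it's worth, the single-column behavior can be seen even in plain pandas (a minimal sketch with made-up data): a custom aggregation function passed to groupby().agg() is handed one Series per group, never a two-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2], "x": [1.0, None, 3.0]})

seen = []

def probe(s):
    # record what pandas actually hands to the aggregation function
    seen.append(type(s).__name__)
    return s.count()

out = df.groupby("g")["x"].agg(probe)
# `probe` only ever receives a pandas Series, one column at a time
```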

I referred to this doc for the formulas: https://docs.yugabyte.com/latest/api/ysql/exprs/aggregate_functions/function-syntax-semantics/

Thanks

@nils-braun
Collaborator Author

Hi @rajagurunath
Thanks for looking into it. I really enjoy your comments and I am glad you are reading through more parts of the code. I would be happy to get your thoughts (independent of this particular issue) if you spot something interesting.

Now to your issue: currently, I am using Dask's groupby().agg() functionality to calculate the aggregations. agg, however, can only take a single input column per aggregation, hence the issue you have seen.
We might use a different way to calculate the aggregations if we go forward with #183 (I am working on this, but have not made much progress so far), but there is still a long way to go. So we need a solution for this now :-)

A "hacky", but probably working solution for this particular use case immediately comes to mind: you can calculate a temporary result column is_null(df[col1]) | is_null(df[col2]) (where is_null is the IsNullOperation from call.py) and feed it into a standard "count" aggregation. I am actually wondering why Calcite is not reducing this automatically...
What do you think?
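The suggestion above can be sketched in plain pandas (column and group names here are made up; in dask-sql the temporary column would be produced by the IsNullOperation from call.py and fed through the normal aggregation machinery):

```python
import pandas as pd

df = pd.DataFrame(
    {"g": [1, 1, 2], "x": [1.0, None, 3.0], "y": [2.0, 4.0, None]}
)

# temporary result column: True where both inputs are set
df["both_set"] = ~(df["x"].isnull() | df["y"].isnull())

# REGR_COUNT per group is then just the sum of that boolean column,
# which a standard single-column aggregation can handle
regr_count = df.groupby("g")["both_set"].sum()
```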

@rajagurunath
Collaborator

Hi @nils-braun ,

Thanks a lot for the detailed explanation of the solution.

I followed similar steps and developed this functionality here: https://github.com/rajagurunath/dask-sql/tree/feature/regr_count
This solution works, and I cross-checked the results against PostgreSQL for all the functions.

And yes, I am also not sure about the Calcite plan of SQL nodes; it even produces slightly different plans for regr_count inside REGR_COUNT vs. REGR_SXX. But I think it is very powerful: we only need to implement the primitive SQL functions, and the rest of the higher-level functions are planned by Calcite in most cases. My guess is that Calcite considers regr_count a primitive SQL function, so it is not able to break the plan down further. 🤔

I will raise this as a PR once the existing PR #191 is closed.

I also enjoy developing these functionalities, exploring different frameworks, and learning more
from this code base. Thanks to you 👍

@rajagurunath
Collaborator

Hi @nils-braun,

Created a pull request for REGR_COUNT functionality here: #193

Please review and let me know if any changes are needed.

Thanks

charlesbluca added a commit to charlesbluca/dask-sql that referenced this issue Sep 27, 2021
* Disable deploy/docker GHA workflows

* Target branch-21.12 for testing

* Cancel in progress runs when a new run is triggered

* Make unique groups for style / python tests

* Skip CodeCov step

* Explicitly install SASL
charlesbluca added a commit that referenced this issue Dec 8, 2021
* Feature/alter schema (#9)

* ML model improvement : Adding "SHOW MODELS and DESCRIBE MODEL"

Author:    rajagurunath <[email protected]>
Date:      Mon May 24 02:37:40 2021 +0530

* fix typo

* ML model improvement : added EXPORT MODEL

* ML model improvement : refactoring for PR

* ML model improvement : Adding stmts in notebook

* ML model improvement : Adding stmts in notebook

* ML model improvement : also test the non-happy path

* ML model improvement : Added mlflow and <With> in sql for extra params

* ML model improvement : Added mlflow and <With> in sql for extra params

* Added Test cases for Export MODEL

* Added ML documentation about the following:
1. SHOW MODELS
2. DESCRIBE MODEL
3. EXPORT MODEL

* refactored based on PR

* Added support only for sklearn compatible models

* excluded mlflow part from code coverage

* install mlflow in test cluster

* Added test for non sklearn compatible model

* [WIP] Added Alter schema

* Added
1. Alter Schema
2. Alter Table

* revert unwanted cells

Co-authored-by: Gurunath LankupalliVenugopal <[email protected]>
Co-authored-by: Nils Braun <[email protected]>

* lint fix

* Added PR changes

* Minor capitalization nitpick

* Add info logger for ALTER TABLE

* Avoid making deepcopies of schema and tables

* Consolidate alter tests in context tests

* Make alter errors more specific

* Updates tests to use new error type

Co-authored-by: Gurunath LankupalliVenugopal <[email protected]>
Co-authored-by: Nils Braun <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
charlesbluca added a commit to charlesbluca/dask-sql that referenced this issue Dec 14, 2021