Implement all (non-window) aggregation functions #9

Open
23 of 27 tasks
nils-braun opened this issue Aug 27, 2020 · 4 comments
nils-braun commented Aug 27, 2020

The LogicalAggregatePlugin currently cannot handle all aggregation functions.

From the Calcite documentation, here are all aggregation functions that Calcite understands:

  • ANY_VALUE
  • APPROX_COUNT_DISTINCT
  • AVG
  • BIT_AND
  • BIT_OR
  • BIT_XOR
  • COUNT
  • COVAR_POP (implemented via REGR_COUNT)
  • COVAR_SAMP (implemented via REGR_COUNT)
  • EVERY
  • MAX
  • MIN
  • REGR_COUNT
  • REGR_SXX
  • REGR_SYY
  • SINGLE_VALUE
  • STDDEV
  • STDDEV_POP
  • STDDEV_SAMP
  • SUM
  • VAR_POP
  • VAR_SAMP
  • VARIANCE

Operations on collection types:

  • COLLECT
  • FUSION
  • INTERSECTION
  • LISTAGG
@nils-braun nils-braun changed the title Implement more aggregation functions Implement all (non-window) aggregation functions Sep 24, 2020
@rajagurunath
Collaborator

Hi @nils-braun, I tried implementing some of the functions from the list above, for example:

  • REGR_COUNT(y, x) = count of rows where both x and y are not null
  • REGR_SXX(y, x) = REGR_COUNT(y, x) * VAR_POP(x)
  • REGR_SYY(y, x) = REGR_COUNT(y, x) * VAR_POP(y)
  • REGR_SXY(y, x) = REGR_COUNT(y, x) * COVAR_POP(y, x)
  • COVAR_POP(y, x) = (SUM(x*y) - SUM(x) * SUM(y) / COUNT(*)) / COUNT(*)
    (equivalently: SUM((y - avg_y) * (x - avg_x)) / REGR_COUNT)
  • COVAR_SAMP(y, x) = SUM((y - avg_y) * (x - avg_x)) / (REGR_COUNT - 1)
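The identities above can be checked in plain pandas (a minimal sketch with made-up data; the column names and values are not from the issue):

```python
import pandas as pd

# hypothetical data, used only to verify the identities
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, None], "y": [2.0, 4.0, 6.0, 8.0]})

# REGR_COUNT: rows where both x and y are non-null
both = df["x"].notnull() & df["y"].notnull()
regr_count = int(both.sum())

x, y = df.loc[both, "x"], df.loc[both, "y"]

# VAR_POP / COVAR_POP restricted to the jointly non-null rows
var_pop_x = ((x - x.mean()) ** 2).sum() / regr_count
covar_pop = ((x - x.mean()) * (y - y.mean())).sum() / regr_count

# the identities from the comment
regr_sxx = regr_count * var_pop_x
regr_sxy = regr_count * covar_pop
```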

One thing I observed while implementing REGR_COUNT: for most of the other functions, Calcite produced a plan in terms of already existing functions like SUM, COUNT, etc.

Some doubts:
I got a chance to go through the code (rel.logical.aggregate) and observed that if a function has more than one input, dask-sql raises NotImplementedError("Can not cope with more than one input"). I tried registering a function that needs two inputs as an aggregation function in Dask using dd.Aggregation, and I expected the chunk function to get a DataFrame (with two columns), but instead it receives only a pandas Series (one column). How can I get a two-column DataFrame in the Dask chunk function, or do we need to implement these types of aggregation functions differently? Did I misunderstand the problem, or am I missing something here? Any help would be really appreciated.

```python
import pandas as pd
import dask.dataframe as dd

def map_regr_count(df):
    # intended: count rows where all input columns are non-null
    return pd.Series(df.apply(lambda x: x.notnull().all()).sum())

def reduce_sum(series):
    return series.sum()

regr_count = dd.Aggregation("regr_count", chunk=map_regr_count, agg=reduce_sum)
```
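For what it's worth, the single-column behavior can be seen even in plain pandas (a minimal sketch with made-up data): a custom aggregation function passed to groupby().agg() is handed one Series per group, never a two-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2], "x": [1.0, None, 3.0]})

seen = []

def probe(s):
    # record what pandas actually hands to the aggregation function
    seen.append(type(s).__name__)
    return s.count()

out = df.groupby("g")["x"].agg(probe)
# `probe` only ever receives a pandas Series, one column at a time
```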

I referred to this doc for the formulas: https://docs.yugabyte.com/latest/api/ysql/exprs/aggregate_functions/function-syntax-semantics/

Thanks

@nils-braun
Collaborator Author

Hi @rajagurunath
Thanks for looking into it. I really enjoy your comments and I am glad you are reading through more parts of the code. I would be happy to get your thoughts (independent of this particular issue) if you spot something interesting.

Now to your issue: currently, I am using Dask's groupby().agg() functionality to calculate the aggregations. agg, however, can only take a single input column per aggregation, hence the issue you have seen.
We might use a different way to calculate the aggregations if we go forward with #183 (I am working on this, but have not made much progress so far), but there is still a long way to go. So we need a solution for this now :-)

A "hacky", but probably working solution for this particular use case immediately comes to mind: you can calculate a temporary result column is_null(df[col1]) | is_null(df[col2]) (where is_null is the IsNullOperation from call.py) and feed it into a standard "count" aggregation. I am actually wondering why Calcite is not reducing this automatically...
What do you think?
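The suggestion above can be sketched in plain pandas (column and group names here are made up; in dask-sql the temporary column would be produced by the IsNullOperation from call.py and fed through the normal aggregation machinery):

```python
import pandas as pd

df = pd.DataFrame(
    {"g": [1, 1, 2], "x": [1.0, None, 3.0], "y": [2.0, 4.0, None]}
)

# temporary result column: True where both inputs are set
df["both_set"] = ~(df["x"].isnull() | df["y"].isnull())

# REGR_COUNT per group is then just the sum of that boolean column,
# which a standard single-column aggregation can handle
regr_count = df.groupby("g")["both_set"].sum()
```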

@rajagurunath
Collaborator

Hi @nils-braun ,

Thanks a lot for the detailed explanation of the solution.

I followed similar steps and developed this functionality here: https://github.com/rajagurunath/dask-sql/tree/feature/regr_count
This solution works, and I cross-checked the results against PostgreSQL for all the functions.

And yes, I am also not sure about the Calcite plan of SQL nodes; it even produces slightly different plans for regr_count inside REGR_COUNT vs. REGR_SXX. But I think it is very powerful: we only need to implement the primitive SQL functions, and the rest of the higher-level functions are planned by Calcite in most cases. My guess is that Calcite considers regr_count a primitive SQL function, so it is not able to break the plan down further. 🤔

I will raise this as a PR once the existing PR #191 is closed.

I also enjoy developing these functionalities, exploring different frameworks, and learning more
from this code base. Thanks to you 👍

@rajagurunath
Collaborator

Hi @nils-braun,

Created a pull request for REGR_COUNT functionality here: #193

Please review and let me know if any changes are needed.

Thanks

charlesbluca added a commit to charlesbluca/dask-sql that referenced this issue Sep 27, 2021
* Disable deploy/docker GHA workflows

* Target branch-21.12 for testing

* Cancel in progress runs when a new run is triggered

* Make unique groups for style / python tests

* Skip CodeCov step

* Explicitly install SASL
charlesbluca added a commit that referenced this issue Dec 8, 2021
* Feature/alter schema (#9)

* ML model improvement : Adding "SHOW MODELS and DESCRIBE MODEL"

Author:    rajagurunath <[email protected]>
Date:      Mon May 24 02:37:40 2021 +0530

* fix typo

* ML model improvement : added EXPORT MODEL

* ML model improvement : refactoring for PR

* ML model improvement : Adding stmts in notebook

* ML model improvement : Adding stmts in notebook

* ML model improvement : also test the non-happy path

* ML model improvement : Added mlflow and <With> in sql for extra params

* ML model improvement : Added mlflow and <With> in sql for extra params

* Added Test cases for Export MODEL

* Added ML documentation about the following:
1. SHOW MODELS
2. DESCRIBE MODEL
3. EXPORT MODEL

* refactored based on PR

* Added support only for sklearn compatible models

* excluded mlflow part from code coverage

* install mlflow in test cluster

* Added test for non sklearn compatible model

* [WIP] Added Alter schema

* Added
1. Alter Schema
2. Alter Table

* revert unwanted cells

Co-authored-by: Gurunath LankupalliVenugopal <[email protected]>
Co-authored-by: Nils Braun <[email protected]>

* lint fix

* Added PR changes

* Minor capitalization nitpick

* Add info logger for ALTER TABLE

* Avoid making deepcopies of schema and tables

* Consolidate alter tests in context tests

* Make alter errors more specific

* Updates tests to use new error type

Co-authored-by: Gurunath LankupalliVenugopal <[email protected]>
Co-authored-by: Nils Braun <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
charlesbluca added a commit to charlesbluca/dask-sql that referenced this issue Dec 14, 2021