DOC: User Guide Page on user-defined functions #61195

arthurlw · 2025-03-28T19:15:48Z

closes DOC: Write user guide page on apply/map/transform methods #61126
~~Tests added and passed if fixing a bug or adding a new feature~~
~~All code checks passed.~~
~~Added type annotations to new arguments/methods/functions.~~
~~Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.~~

arthurlw · 2025-03-28T19:34:49Z

Currently writing this, so I would appreciate any feedback on it!

rhshadrach

Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of apply, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use apply vs other methods.

rhshadrach · 2025-03-29T12:54:27Z

doc/source/user_guide/user_defined_functions.rst

+Why Use User-Defined Functions?
+-------------------------------


I think we should lead with Why _not_ User-Defined Functions. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well. Namely, pandas has no information on what a UDF is doing, and so has to infer (guess) how to handle the result.

In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
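A minimal sketch of the inference issue described above, assuming a small toy frame (the column names and values are illustrative, not taken from the PR). The same `DataFrame.apply` call yields a Series or a DataFrame depending only on what the UDF happens to return, and both results have a direct vectorized equivalent:

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# UDF returns a scalar per column, so pandas assembles a Series.
as_series = df.apply(lambda col: col.max() - col.min())

# UDF returns a Series per column, so pandas assembles a DataFrame.
as_frame = df.apply(lambda col: col - col.min())

# The same results without any UDF, using built-in vectorized operations.
vectorized_range = df.max() - df.min()
vectorized_shift = df - df.min()
```

pandas only learns the output shape after calling the function, which is why the built-in forms are usually preferable when they exist.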

arthurlw · 2025-03-29T17:10:53Z

Hi @rhshadrach thanks for the feedback! I agree with you and will push updates soon

rhshadrach

I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom in a section titled something like "Improving Performance with UDFs".

doc/source/user_guide/user_defined_functions.rst

rhshadrach · 2025-04-06T13:02:47Z

doc/source/user_guide/user_defined_functions.rst

+ways to apply UDFs across different pandas data structures.
+
+.. note::
+    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.


Can you also make a mention of resample, rolling, expanding, and ewm. Perhaps link to each section in the User Guide.

Can we add the other objects to this note, it seems to me they all belong together.

Suggested change

-    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
+    Some of these methods are can also be applied to groupby, resample, and various window objects. See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`, and :ref:`ewm()<window>` for details.
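As a rough illustration of the note above, the sketch below passes one UDF to a rolling window, an expanding window, and a groupby. The data and the helper name `value_range` are assumptions made for this example only:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0])

def value_range(window):
    # Hypothetical UDF: spread between the largest and smallest value.
    return window.max() - window.min()

# Window objects accept the UDF through .apply().
rolling_range = s.rolling(window=2).apply(value_range)
expanding_range = s.expanding().apply(value_range)

# A groupby accepts the same UDF through .agg().
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": s})
grouped_range = df.groupby("key")["value"].agg(value_range)
```

The first rolling result is missing because the window of size 2 is not yet full at the first row.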

doc/source/user_guide/user_defined_functions.rst

rhshadrach · 2025-04-12T12:22:54Z

doc/source/user_guide/user_defined_functions.rst

+pandas comes with a set of built-in functions for data manipulation, UDFs offer
+flexibility when built-in methods are not sufficient. These functions can be
+applied at different levels: element-wise, row-wise, column-wise, or group-wise,
+and change the data differently, depending on the method used.


nit: "change the data differently" sounds very close to mutating in a UDF, which we explicitly do not support. What do you think of "behave differently"?

“Behave differently” sounds clearer and avoids implying mutation. I'll update it!
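To make the mutation point concrete, here is a small sketch (toy data and the name `row_total` are assumptions) of a UDF that only reads its input and returns a new value, with the result attached afterwards rather than written from inside the function:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def row_total(row):
    # Read from the row and return a new value; do not assign into `row`.
    return row["a"] + row["b"]

totals = df.apply(row_total, axis=1)

# assign() returns a new DataFrame, so the original stays unchanged.
with_total = df.assign(total=totals)
```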

rhshadrach · 2025-04-12T12:28:08Z

doc/source/user_guide/user_defined_functions.rst

+* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series,
+  DataFrames, or groups of data.


I'm thinking we should remove groups of data here. DataFrame.apply that you're referencing doesn't operate on groups, and you mention groupby below.

rhshadrach · 2025-04-12T12:35:56Z

doc/source/user_guide/user_defined_functions.rst

+When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
+functions across groups.


Things like .agg(["sum", "mean"]) aren't UDFs, so I don't think they should be mentioned here, and it could make users think these types of usages are slow (they are not).

Suggested change

-When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
-functions across groups.
+When to use: Use :meth:`DataFrame.agg` for performing custom aggregations, where the operation returns a scalar value on each input.
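A short sketch of the distinction drawn in this comment, assuming toy data and a hypothetical `midrange` helper. The first call passes a UDF that returns one scalar per column; the second uses built-in aggregation names and involves no UDF at all:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 6], "b": [10, 20, 60]})

def midrange(col):
    # Hypothetical custom aggregation: one scalar per input column.
    return (col.min() + col.max()) / 2

# UDF-based aggregation: pandas calls midrange once per column.
custom = df.agg(midrange)

# Built-in aggregations by name: these are not UDFs.
builtin = df.agg(["sum", "mean"])
```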

rhshadrach · 2025-04-12T12:46:18Z

doc/source/user_guide/user_defined_functions.rst

+    })
+
+    # Using transform with mean
+    df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')


This isn't an example of a UDF. I really like your example of using linear regression - can we do that here? It's a bit unfortunate that groupby.transform does not allow operating on the entire group (only works column-by-column) here.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    }).set_index("x")

    # Function to fit a model to each group
    def fit_model(group):
        x = group.index.to_frame()
        y = group
        model = LinearRegression()
        model.fit(x, y)
        pred = model.predict(x)
        return pred

    result = df.groupby('group').transform(fit_model)
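For readers without scikit-learn installed, a dependency-light variant of the same idea is sketched below. It swaps `LinearRegression` for `numpy.polyfit` and selects the `y` column explicitly; both choices are assumptions of this sketch, not part of the reviewer's suggestion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2, 4, 6, 1, 2, 1.5],
}).set_index("x")

def fit_line(col):
    # Fit y = slope * x + intercept within one group, using the index as x.
    x = col.index.to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, col.to_numpy(dtype=float), deg=1)
    # Return one fitted value per input row, as transform requires.
    return slope * x + intercept

fitted = df.groupby("group")["y"].transform(fit_line)
```

Because the UDF returns one value per input row, the result has the same length as `df`, which is what distinguishes `transform` from `agg`.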

datapythonista

Excellent job here @arthurlw, thanks for taking care of this. I added a general comment about using examples to incrementally illustrate what is explained here, and about changing the order of the sections a bit.

Please let me know if it doesn't make sense or if you have any comments. I'll review in more depth after the proposed changes are implemented or discussed. But at first look, this is really nice.

datapythonista · 2025-04-12T21:11:16Z

doc/source/user_guide/index.rst

@@ -88,3 +88,4 @@ Guides
    sparse
    gotchas
    cookbook
+    user_defined_functions


I would move this before the groupby section. It feels more natural to me to explain first Series.apply and later explain groupby("col").apply.

datapythonista · 2025-04-12T21:14:41Z

doc/source/user_guide/user_defined_functions.rst

+and change the data differently, depending on the method used.
+
+Why Not To Use User-Defined Functions
+-----------------------------------------


Not sure if Sphinx is more flexible now, but this had to be the same exact length as the title before.

Title marker needs to be at least as long as the text, but can be longer.

datapythonista · 2025-04-12T21:17:27Z

doc/source/user_guide/user_defined_functions.rst

+{{ header }}
+
+**************************************
+Introduction to User-Defined Functions


Suggested change

-Introduction to User-Defined Functions
+User-Defined Functions (UDFs)

This is what will be shown in the index too, so it is better to be concise. Also, for consistency, it is better to remove the "Introduction to", which we could otherwise have in every other user guide too.

datapythonista · 2025-04-12T21:19:17Z

doc/source/user_guide/user_defined_functions.rst

+applied at different levels: element-wise, row-wise, column-wise, or group-wise,
+and change the data differently, depending on the method used.
+
+Why Not To Use User-Defined Functions


Maybe just personal opinion, but to me it makes more sense to explain what UDFs are in pandas before explaining when not to use them. This order seems reasonable assuming users already know what pandas udfs are in practice, but I'd personally prefer not to assume it in the user guide for UDFs.

In my opinion, after the previous introduction which is great, I'd show a very simple example so we make sure users reading this understand the very basics.

Something like:

    def add_one(x):
        return x + 1

    my_series = pd.Series([1, 2, 3])
    my_series.map(add_one)

Building on top of this, like then showing the same with a DataFrame, at some point showing UDFs that receive the whole column with .apply... should help make sure users are following and understanding all the information provided here.
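One possible shape for that incremental progression is sketched below. The frame `my_frame` and the `center` helper are assumptions added for illustration; only `add_one` and `my_series` come from the comment above:

```python
import pandas as pd

def add_one(x):
    return x + 1

my_series = pd.Series([1, 2, 3])
my_frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Step 1: the UDF receives one scalar at a time.
elementwise = my_series.map(add_one)

# Step 2: the same UDF on a DataFrame; apply passes one whole column per call.
columnwise = my_frame.apply(add_one)

# Step 3: a UDF that needs the whole column, so it cannot work element by element.
def center(col):
    return col - col.mean()

centered = my_frame.apply(center)
```

None of these three calls needs a UDF in practice (`my_series + 1` and `my_frame - my_frame.mean()` give the same results), which matches the earlier point that the examples are for illustration only.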

I am a bit negative here. This is duplicating a lot of other documentation that we already have. I think we should instead link to that documentation.

Do you mind pointing to a specific example @rhshadrach? I found documentation for the aggregate functions, but not much for map, apply, and so on for Series and DataFrame other than in the API docs. I agree with not having much duplication. Personally, if there is only a little here and there, like in the FAQs or the Performance page, I'd rather have the docs related to these methods in this page, as it feels like the natural place, and link to the sections here from the FAQs, performance hints, groupby user guide, and so on. Of course there can be cases where the opposite makes more sense, but maybe we can discuss the specific cases where there is duplication.

apply: https://pandas.pydata.org/docs/user_guide/basics.html#row-or-column-wise-function-application
map: https://pandas.pydata.org/docs/user_guide/basics.html#applying-elementwise-functions

> I'd rather have the docs related to these methods in this page, as it feels like the natural place

If we are going to move the docs on e.g. DataFrame.agg here, then this no longer is a page just about UDFs as DataFrame.agg does more than just use UDFs. In addition, that seems like a large reworking of the docs for little (in my opinion, actually negative) benefit.

datapythonista

Very nice, just a couple of small comments. And we need to decide about duplication, but in general it looks great. Thanks for the work here @arthurlw

datapythonista · 2025-04-19T12:19:26Z

doc/source/user_guide/user_defined_functions.rst

+* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series and
+  DataFrames.
+* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting custom
+  aggregation functions.
+* :meth:`~DataFrame.transform` - Applies a function to Series and Dataframes while preserving the shape of
+  the original data.
+* :meth:`~DataFrame.filter` - Filters Series and Dataframes based on a list of Boolean conditions.
+* :meth:`~DataFrame.map` - Applies an element-wise function to a Series or Dataframe, useful for
+  transforming individual values.
+* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process Series or
+  Dataframes in a clean, readable manner.


What do you think about having this as a table? Personally I think it would make it easier to understand the differences between the methods. As a general idea:

| method | function input | function output | description |
|---|---|---|---|
| map | scalar | scalar | map each element to the element returned by the function elementwise |
| apply(axis=0) | column | column | map each column to the column returned by the function |
| apply(axis=1) | row | row | map each row to the row returned by the function |
| pipe | series or dataframe | series or dataframe | map the series or dataframe to a new series or dataframe returned by the function |

Not sure if it makes sense to combine with the table below.
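The "function input" column of the proposed table can be checked with a small sketch. The probe function `kind` and the toy frame are assumptions of this example; it simply reports the type of whatever each method passes to the UDF:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def kind(obj):
    # Hypothetical probe: report what the method handed to the UDF.
    return type(obj).__name__

per_element = df["a"].map(kind)     # called once per scalar
per_column = df.apply(kind, axis=0)  # called once per column (a Series)
per_row = df.apply(kind, axis=1)     # called once per row (also a Series)
whole = df.pipe(kind)                # called once with the whole DataFrame
```

Rows and columns both arrive as Series objects, so the row/column distinction in the table is about which axis the Series was taken from, not about its type.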

datapythonista · 2025-04-19T12:23:33Z

doc/source/user_guide/user_defined_functions.rst

+.. note::
+    :meth:`DataFrame.filter` does not accept UDFs, but can accept
+    list comprehensions that have UDFs applied to them.


I'm unsure on having filter here for now. I think it's very good that you added it, as it doesn't support udfs, but it probably should. So, it opens a discussion we probably want to have about adding them. @rhshadrach thoughts?

I suspect the reason this was added is that DataFrameGroupBy.filter does accept UDFs. Perhaps that should be mentioned instead?

I actually think DataFrame.filter should accept Boolean masks, similar to PySpark and Polars. But agreed that discussion is not for here!
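The difference discussed in this thread can be sketched as follows, with toy data and a hypothetical `is_short` predicate. `DataFrame.filter` selects by label, so a UDF can only be used indirectly through a list comprehension, while `DataFrameGroupBy.filter` takes a UDF that returns one boolean per group:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "value": [1, 2, 10],
    "other_column": [0, 0, 0],
})

def is_short(label):
    # Hypothetical predicate on column labels.
    return len(label) <= 5

# DataFrame.filter: label-based; the UDF runs in the comprehension, not in pandas.
short_columns = df.filter(items=[c for c in df.columns if is_short(c)])

# DataFrameGroupBy.filter: the UDF receives each group and returns True or False.
big_groups = df.groupby("key").filter(lambda g: g["value"].sum() > 5)
```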

datapythonista · 2025-04-19T12:35:40Z

doc/source/user_guide/user_defined_functions.rst

+Documentation can be found at :meth:`~DataFrame.pipe`.
+
+
+Best Practices


Maybe just personal preference, but these last 3 sections seem to be about the same thing (performance), so I'd have a single section about performance.

I'd keep it short for now, and we can iterate over it later. The reason is that each time we review this before merging it we need to re-read the whole document. So, if we can finish the main part above first, and have this as a placeholder, then in a second PR we can focus more on performance without having to keep reviewing the first part again.

arthurlw added 8 commits March 25, 2025 19:39

udf user guide introduction

3f94137

added apply method

bf984ca

added agg, transform and filter

fe67ec8

added map, pipe and vectorized operations

4ec5697

bugfix

11392d7

updated map method

f322d9e

precommit

b6b7b02

trim trailing whitespace

d20bcc7

toctree

72f7b62

rhshadrach requested changes Mar 29, 2025

rhshadrach added the Apply (Apply, Aggregate, Transform, Map) and Docs labels Mar 29, 2025

arthurlw added 5 commits March 29, 2025 13:28

restructured udf user guide

90a2d24

updated documentation links

0d02d64

precommit

214f0ac

fix links

fffaad0

change links

561a1f5

rhshadrach requested changes Apr 6, 2025

updated user guide

c6891a0

rhshadrach requested changes Apr 12, 2025

datapythonista reviewed Apr 12, 2025

updated udf user guide based on reviews

f56ec28

datapythonista reviewed Apr 19, 2025
