
Add open_virtual_mfdataset #349


Merged

merged 27 commits into develop from open_virtual_mfdataset on Mar 29, 2025

Conversation

@TomNicholas (Member) commented on Dec 16, 2024:

Here I have copied the code from xr.open_mfdataset, changed it to use open_virtual_dataset, and added an option to parallelize with lithops as an alternative to using dask.delayed.

I haven't even tried to run this yet, but I think this is the right approach, @tomwhite? I realised we don't need cubed's blockwise, because xarray.open_mfdataset already has internal logic to turn the N-dimensional concat into a 1D map, so lithops.map should be fine?

Also, I think based on our conversation we should be able to use lithops.map instead of lithops.map_reduce (as @thodson-usgs did in #203), because the tiny size of the virtual datasets being returned to the client means we should be able to get away with a single reduction step on the client even at large scale? (See also #104 for justification that we only need to send back kB-sized objects.)
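
For orientation, here is a minimal sketch of the intended usage, mirroring xr.open_mfdataset. The import path and the combine/concat kwargs are assumptions rather than the final API (a later commit in this PR removes open_virtual_mfdataset from the public API for now):

from virtualizarr import open_virtual_mfdataset  # import path assumed

# hypothetical file listing
urls = [f"s3://my-bucket/output_{i}.nc" for i in range(100)]

# parallel may be False (serial), "dask" (via dask.delayed), or "lithops"
# (serverless), as discussed above
vds = open_virtual_mfdataset(
    urls, combine="nested", concat_dim="time", parallel="lithops"
)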

    datasets, closers = dask.compute(datasets, closers)
elif parallel == "lithops":

    def generate_refs(path):
@TomNicholas (Member Author):
This is the equivalent of @thodson-usgs's map_references function.
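
Filling in around this excerpt, the worker-side pattern is presumably something like the following sketch (the import path, paths list, and FunctionExecutor setup are hypothetical simplifications):

import lithops

from virtualizarr import open_virtual_dataset  # import path assumed

paths = ["s3://my-bucket/a.nc", "s3://my-bucket/b.nc"]  # hypothetical

def generate_refs(path):
    # runs on a serverless worker; returns only the virtual dataset
    # (chunk manifests plus metadata), never the underlying array bytes
    return open_virtual_dataset(path)

fn_exec = lithops.FunctionExecutor()
futures = fn_exec.map(generate_refs, paths)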


# wait for all the serverless workers to finish, and send their resulting virtual datasets back to the client
completed_futures, _ = fn_exec.wait(futures, download_results=True)
virtual_datasets = [future.get_result() for future in completed_futures]
@TomNicholas (Member Author):
IIUC this will cause every serverless worker to send its small virtual dataset back to the client process over the network.
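
Continuing that sketch: once the per-file virtual datasets are back on the client, the single reduction step is just an ordinary xarray combine over kB-sized objects (the concat dimension here is hypothetical):

import xarray as xr

# virtual_datasets as gathered in the excerpt above; each element is tiny,
# so combining on the client stays cheap even for thousands of files
combined_vds = xr.combine_nested(virtual_datasets, concat_dim="time")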

Comment on lines 18 to 20
from xarray.backends.api import _multi_file_closer
from xarray.backends.common import _find_absolute_paths
from xarray.core.combine import _infer_concat_order_from_positions, _nested_combine
@TomNicholas (Member Author):
I don't like importing these deep xarray internals like this (though _infer_concat_order_from_positions and _nested_combine haven't changed since I wrote them 6 years ago), but the only alternative would be to make a general virtualizarr backend engine for xarray (see #35).
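
For reference, the alternative route via #35 would look roughly like this sketch of an xarray backend entrypoint (the class and its wiring are hypothetical, not part of this PR):

from xarray.backends import BackendEntrypoint

from virtualizarr import open_virtual_dataset  # import path assumed

class VirtualiZarrBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # delegate to the virtual opener instead of reading array data
        return open_virtual_dataset(filename_or_obj, drop_variables=drop_variables)

# xr.open_mfdataset(paths, engine=VirtualiZarrBackendEntrypoint) would then
# reuse open_mfdataset's own concat/combine logic with no private imports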

Comment on lines 351 to 352
# lithops doesn't have a delayed primitive
open_ = open_virtual_dataset
@TomNicholas (Member Author) commented on Dec 16, 2024:
I think the code would be more straightforward if the parallel primitive we used for lithops was the same as the one we used for dask.
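
Concretely, the asymmetry is something like this sketch (paths and the import are hypothetical): dask wraps the opener in a lazy primitive and evaluates everything in one compute call, while lithops has no equivalent, forcing a second codepath:

import dask

from virtualizarr import open_virtual_dataset  # import path assumed

paths = ["a.nc", "b.nc"]  # hypothetical

# dask branch: defer each open, then evaluate all of them at once
open_ = dask.delayed(open_virtual_dataset)
virtual_datasets = dask.compute(*[open_(p) for p in paths])

# lithops branch: no delayed primitive, so open_ stays the plain function
# (as in the excerpt above) and fn_exec.map(...) provides the parallelism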


Comment on lines 344 to 345
elif parallel == "lithops":
    import lithops
@TomNicholas (Member Author):
I believe all of this could also be useful upstream in xr.open_mfdataset


@tomwhite (Collaborator) replied:

> I realised we don't need cubed's blockwise, because xarray.open_mfdataset already has internal logic to turn the N-dimensional concat into a 1D map, so lithops.map should be fine?

Yes, that should work fine. We may want to loosen/generalize blockwise slightly in Cubed so that it can return arbitrary objects, which would let this be done with Cubed as well - but that can be done later.

> Also, I think based on our conversation we should be able to use lithops.map instead of lithops.map_reduce (as @thodson-usgs did in #203), because the tiny size of the virtual datasets being returned to the client means we should be able to get away with a single reduction step on the client even at large scale? (See also #104 for justification that we only need to send back kB-sized objects.)

Agreed - it will be interesting to see this for large datasets. (It's also similar to the approach I've taken for storing data in Icechunk where the changesets are returned to the client - again, small kB-sized UUIDs.)

@TomNicholas TomNicholas changed the base branch from main to develop March 24, 2025 15:45
TomNicholas and others added 2 commits March 25, 2025 13:26
* need latest version of xarray to import internals correctly

* Fix metadata equality for nan fill value (#502)

* add check that works for fill_values too

* note about removing once merged upstream

* type hint

* regression test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove accidental changes to pyproject.toml

* Update pyproject.toml

* ignore mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Setup intersphinx mapping for docs (#503)

* Setup intersphinx mapping for docs

---------

Co-authored-by: Kyle Barron <[email protected]>

* Change default loadable_variables (and indexes) to match xarray's behaviour (#477)

* draft refactor

* sketch of simplified handling of loadable_variables

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get at least some tests working

* separate VirtualBackend api definition from common utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove indexes={} everywhere in tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* stop passing through loadable_variables to where it isn't used

* implement logic to load 1D dimension coords by default

* remove more instances of indexes={}

* remove more indexes={}

* refactor logic for choosing loadable_variables

* fix more tests

* xfail Aimee's test that I don't understand

* xfail test that explicitly specifies no indexes

* made a bunch more stuff pass

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix netcdf3 reader

* fix bad import in FITS reader

* fix import in tiff reader

* fix import in icechunk test

* release note

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update docstring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix fits reader

* xfail on empty dict for indexes

* linting

* actually test new expected behaviour

* fix logic for setting loadable_variables

* update docs page to reflect new behaviour

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix expected behaviour in another tests

* additional assert

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use encode_dataset_coordinates in kerchunk writer

* Encode zarr vars

* fix some mypy errors

* move drop_variables implementation to the end of every reader

* override loadable_variables and raise warning

* fix failing test by not creating loadable variables that would get inlined by default

* improve error message

* remove some more occurrences of indexes={}

* skip slow test

* slay mypy errors

* docs typos

* should fix dmrpp test

* Delete commented-out code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unnecessary test skip

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>

* Update pyproject.toml deps (#504)

* re-add icechunk to upstream tests

* add pytest-asyncio to test envs

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
@TomNicholas TomNicholas marked this pull request as ready for review March 25, 2025 20:28
TomNicholas and others added 2 commits March 25, 2025 14:39

* Add ManifestStore for loading data from ManifestArrays (#490)

* Draft ManifestStore implementation

---------

Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make it work for dask delayed

* correct docstring

---------

Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* added compliant executor for lithops

* add links to lithops issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Comment on lines 524 to 525
pytest.mark.xfail(
    reason="Lithops bug - see https://github.com/lithops-cloud/lithops/issues/1428"
@TomNicholas (Member Author):

I was able to dodge this by not using functools.partial, but I'm a bit worried that my newer approach of using a closure won't work properly in a remote execution context.
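
A sketch of the two variants (the kwargs are hypothetical): the functools.partial form that tripped lithops-cloud/lithops#1428, versus the closure now used, whose captured state still has to serialize cleanly when shipped to remote workers:

import functools

from virtualizarr import open_virtual_dataset  # import path assumed

kwargs = {"drop_variables": ["junk"]}  # hypothetical opener options

# variant 1: functools.partial, which triggered the lithops bug above
open_partial = functools.partial(open_virtual_dataset, **kwargs)

# variant 2: a closure over kwargs, which dodges the bug; the open question
# is whether this (and everything it captures) pickles correctly for
# remote execution
def open_(path):
    return open_virtual_dataset(path, **kwargs)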

Comment on lines +187 to +192
class LithopsEagerFunctionExecutor(Executor):
"""
Lithops-based function executor which follows the concurrent.futures.Executor API.

Only required because lithops doesn't follow the concurrent.futures.Executor API, see https://github.com/lithops-cloud/lithops/issues/1427.
"""


* specify dask and lithops executors with a string again

* fix easy typing stuff

* fix typing errors by aligning executor signatures

* remove open_virtual_mfdataset from public API for now

* release note

* refactor construction of expected result

* implement preprocess arg, and dodge lithops bug

* update comment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@TomNicholas merged commit 810d4d0 into develop on Mar 29, 2025
11 checks passed
@TomNicholas deleted the open_virtual_mfdataset branch on March 29, 2025 at 18:31
Labels: enhancement (New feature or request)