Commit dfac732

Update datasets-dask.md
1 parent 33e44d8 commit dfac732

1 file changed: +26 -0 lines changed

docs/hub/datasets-dask.md
@@ -123,3 +123,29 @@ This is useful when you want to manipulate a subset of the columns or for analyt
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most `dask` features are optimized to run on a cluster, or with a local `Client` that launches the parallel computations:

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":  # needed for creating new processes
    client = Client()
    df = dd.read_parquet(...)
    ...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can configure the number of worker processes and the threads per worker of the `LocalCluster` manually:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
```

Note that if you use the default threaded scheduler locally without a `Client`, a DataFrame can become slower after certain operations (see [dask-expr issue #1181](https://github.com/dask/dask-expr/issues/1181) for details).

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).