Parallel processing libraries in Python - Dask vs Ray
- At work, I'm working on enabling numerical computations and transformations of N-dimensional tensors that cannot fit in a single machine's memory. This is "big data" on the order of terabytes to petabytes. The data are numerical arrays (e.g. np.array), and the data processing / analysis code is written in Python.
- I looked at two libraries that enable parallel computing within the Python ecosystem: Dask and Ray. Below is a comparison between them.
Dask
Dask is a flexible library for parallel computing in Python.
https://docs.dask.org/en/latest/
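To make the comparison concrete, here is a minimal sketch of the Dask programming model (this example is mine, not from either project's docs; the array shape and chunk sizes are purely illustrative). A Dask array is split into NumPy chunks, operations build a lazy task graph, and `compute()` runs that graph in parallel:

```python
import dask.array as da

# A large array split into 5k x 5k NumPy chunks; nothing is materialized up front.
x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))

# Operations on Dask arrays build a lazy task graph instead of executing eagerly.
centered = x - x.mean(axis=0)
result = centered.std(axis=0)

# compute() executes the graph in parallel (threads locally, or across a cluster
# when a dask.distributed Client is connected).
print(result.compute())
```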
Ray
Ray is a fast and simple framework for building and running distributed applications.
https://ray.readthedocs.io/en/latest/
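And a correspondingly minimal Ray sketch (again illustrative, not taken from the docs): functions decorated with `@ray.remote` become tasks that Ray schedules across worker processes or nodes, returning object refs that are resolved with `ray.get()`:

```python
import numpy as np
import ray

ray.init()  # start a local Ray runtime; on a cluster you would pass the head node's address

@ray.remote
def normalize(block: np.ndarray) -> np.ndarray:
    # Each call runs as an independent task on some worker process/node.
    return (block - block.mean()) / block.std()

blocks = [np.random.random((10_000, 1_000)) for _ in range(8)]
refs = [normalize.remote(b) for b in blocks]  # submit 8 tasks; object refs return immediately
results = ray.get(refs)                       # block until all tasks complete and fetch results
```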
(Originally published 02/25/2020; updated 03/03/2021 to reflect new features in the libraries.)
| Dimension | Dask | Ray | Color Code Justification |
|---|---|---|---|
| Efficient Data Sharing Between Nodes | Source: https://distributed.dask.org/en/latest/serialization.html | Source: https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html | Source: https://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/ |
| Scheduler Fault Tolerance | Source: https://distributed.dask.org/en/latest/resilience.html | Source: https://github.com/ray-project/ray/issues/642 | Source: https://github.com/ray-project/ray/issues/642 |
| Support for Deployment on AWS | Sources: [1] https://docs.dask.org/en/latest/setup/cloud.html [3] https://yarn.dask.org/en/latest/ | Source: https://docs.ray.io/en/latest/cluster/launcher.html | |
| Support for Distributed Training | Source: https://ml.dask.org/ | Source: https://docs.ray.io/en/master/raysgd/raysgd.html | |
| ML feature - Hyperparameter optimization | Source: https://ml.dask.org/hyper-parameter-search.html | Source: https://ray.readthedocs.io/en/latest/tune.html | |
| ML feature - Generalized Linear Models | Source: https://ml.dask.org/glm.html | | |
| ML feature - Clustering | Source: https://ml.dask.org/clustering.html | | |
| ML feature - Reinforcement Learning Library | | Source: https://ray.readthedocs.io/en/latest/rllib.html | |
| Support for running on GPU [13] [14] | Source: https://docs.dask.org/en/latest/gpu.html | Source: https://ray.readthedocs.io/en/latest/using-ray-with-gpus.html | |
| Support for creating Task Graph [15] [16] | Source: https://docs.dask.org/en/latest/custom-graphs.html | Source: https://rise.cs.berkeley.edu/blog/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray/ | |
| Ease of Monitoring [17] [18] | Source: https://distributed.dask.org/en/latest/diagnosing-performance.html | Source: https://docs.ray.io/en/latest/ray-dashboard.html | |
| Support for Distributed Data Structures | Sources: [1] https://docs.dask.org/en/latest/array.html [2] https://docs.dask.org/en/latest/dataframe.html [3] https://docs.dask.org/en/latest/bag.html | N/A | Source: https://github.com/ray-project/ray/issues/642 |
| Process Synchronization with Shared Data | | Source: https://ray.readthedocs.io/en/latest/actors.html | Source: https://github.com/ray-project/ray/issues/642 |
| Maturity of Project | Sources: [1] https://github.com/dask/dask/releases?after=0.2.1 | Sources: [1] https://github.com/ray-project/ray/releases?after=ray-0.5.1 [2] https://rise.cs.berkeley.edu/projects/ray/ | |
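The "Process Synchronization with Shared Data" row above points at Ray's actor model. As a rough sketch of what that looks like (illustrative only, not taken from the comparison's sources), an actor is a remote class whose state lives in a single worker process, and method calls against it are executed one at a time, which is what makes it usable for coordinating shared data:

```python
import ray

ray.init()

@ray.remote
class Counter:
    """Actor holding shared mutable state; its method calls execute serially."""
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()                                # create the actor on a worker
refs = [counter.increment.remote() for _ in range(100)]  # 100 concurrent increment requests
print(ray.get(refs)[-1])                                  # updates are serialized, so this prints 100
```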