Parallel processing libraries in Python - Dask vs Ray
At work, I'm working with enabling numerical computations/transformations of N-dimensional tensors that cannot fit in a single machine's memory. They are "big data" in the order of TBs-PBs. The data types are numerical arrays; e.g. np.array. Data processing / analysis code are written in Python. I looked at two libraries to enable parallel computing within the Python ecosystem. The two are Dask and Ray. I made some comparison between them. Dask Dask is a flexible library for parallel computing in Python https://docs.dask.org/en/latest/ Ray Ray is a fast and simple framework for building and running distributed applications. https://ray.readthedocs.io/en/latest/ (Originally published 02/25/2020, updated 03/03/2021 as per new features in the libraries) Dimension Dask Ray Color Code Justification Efficient Data Sharing Between Nodes Source: https://distributed.dask.org/en/latest/serialization.html Data sharing happens via TCP messages between workers. Every ti...