When S3 Turns Your Dask Pipeline Sequential

We recently migrated a project that processes petabytes of seismic data from Google Cloud Platform to Amazon Web Services.
What should in theory have been a trivial change, essentially swapping a few characters in file paths so that gcsfs becomes s3fs, turned into a fairly large rewrite.
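The "few characters" are just the URL scheme: fsspec dispatches `gs://` paths to gcsfs and `s3://` paths to s3fs. A minimal sketch of what the migration looked like on paper (the bucket and file names are made up for illustration):

```python
def gcs_to_s3(path: str) -> str:
    """The whole 'trivial' migration: swap the URL scheme.

    fsspec routes gs:// URLs to gcsfs and s3:// URLs to s3fs,
    so in principle nothing else should need to change.
    """
    prefix = "gs://"
    if path.startswith(prefix):
        return "s3://" + path[len(prefix):]
    return path

# Hypothetical seismic file path, for illustration only:
print(gcs_to_s3("gs://my-bucket/surveys/line_0001.segy"))
```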
After the initial shock of discovering that the code ran about 100× slower against Amazon S3, we found the root cause quite quickly: boto3, which s3fs relies on internally, was effectively forcing the Dask workers into near-sequential execution. With gcsfs, the same workers had happily run in parallel without blocking on I/O.
I have to admit that at this point I made a somewhat reckless promise that I would fix everything quickly. I was convinced that all I needed to do was switch the default thread-based cluster to a process-based one.
That confidence mostly came from my lack of experience with Dask. An experienced engineer would immediately think about memory semantics and the problems that usually appear when switching from threads to processes.
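The hazard can be shown without Dask at all. A toy stdlib sketch of the semantics that change (the names here are mine, not from the actual pipeline): threads share one address space, so code can quietly come to depend on that.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared state the pipeline implicitly leaned on:
chunks = []

def load(i):
    chunks.append(i * 10)        # every thread appends to the SAME list

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load, range(8)))

print(len(chunks))               # all results landed in one shared list

# With worker *processes*, each child would mutate its own copy of
# `chunks`: the parent's list stays empty, and anything that must cross
# a process boundary has to be serialized and shipped between workers,
# which is exactly the kind of traffic that blows up into a huge shuffle.
```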
Sure enough, when I switched to a process-based cluster, the system stopped working entirely: all the workers got stuck in an endless shuffle that eventually crashed with an out-of-memory error.
A year earlier this would probably have condemned me to months of rewriting the data-processing logic, restructuring it so that more operations were localized inside workers. None of that had mattered before, because the code implicitly relied on shared memory.
But this happened in 2026.
So instead of rewriting everything manually, I simply ran a couple dozen iterations of plan → review plan → implement → review implementation, and ended up with a new version of the pipeline where the computation is much better localized inside the workers.
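The shape the rewrite converged on fits in one pattern. This is a deliberately tiny sketch with made-up numbers, not the actual seismic code: each worker runs the full load → filter → reduce chain on its own chunk and hands back only a small summary, so almost nothing has to cross worker boundaries.

```python
def process_chunk(chunk):
    # Runs entirely inside one worker: filter and reduce locally.
    kept = [x for x in chunk if x >= 0]       # e.g. drop bad samples
    return (sum(kept), len(kept))             # tiny summary, cheap to ship

def combine(summaries):
    # The only cross-worker step; it moves a few numbers, not raw arrays.
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count if count else 0.0

chunks = [[3, -1, 4], [1, 5, -9], [2, 6]]
print(combine(process_chunk(c) for c in chunks))   # 3.5
```

With the heavy work localized like this, it no longer matters much whether the workers are threads or processes: the inter-worker traffic is small either way.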