PySpark code runs more slowly than Pandas

My current project involves converting Pandas code to PySpark. The code uses a number of transformations, including joins, group-bys, and more. But I've run into a problem: my PySpark code is much slower than the equivalent Pandas code.
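For context, here is a minimal sketch of the shape of the pipeline (the DataFrames, column names, and values are made up for illustration; they are not my real data):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("pandas-to-pyspark").getOrCreate()

# Illustrative inputs standing in for my real DataFrames
orders = spark.createDataFrame(
    [(1, "A", 10.0), (2, "B", 20.0), (3, "A", 5.0)],
    ["order_id", "customer", "amount"],
)
customers = spark.createDataFrame(
    [("A", "US"), ("B", "DE")],
    ["customer", "country"],
)

# Same join + group-by shape as the Pandas version
result = (
    orders.join(customers, on="customer", how="inner")
          .groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
)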

I've spent the past three months learning PySpark, so I'm still fairly new to it, and I've been comparing my PySpark code against a benchmark Pandas implementation. At first my PySpark code seemed to run quickly, but only because I hadn't called any actions: Spark transformations are lazy, and at least one action is required to trigger the actual computation.
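Continuing the sketch above, this is the behaviour I mean: defining transformations returns instantly, and nothing executes until an action runs.

# Transformations only build a plan; this line returns immediately
ranked = result.filter(F.col("total_amount") > 0).orderBy(F.desc("total_amount"))

# An action (show, count, collect, write, ...) is what actually
# triggers the Spark job
ranked.show()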

To satisfy that requirement and check the results, I added a .count() call to my code, but I immediately noticed a large increase in execution time. I'm not sure whether this is the right approach or whether there are more efficient options.
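Concretely, this is roughly what I added, along with a caching variant I have read about but am not sure applies to my case:

# What I added: count() forces the whole plan to execute
# so I can check the output
n = result.count()

# Variant I've seen suggested: cache the DataFrame so repeated
# actions don't recompute the full lineage each time
result.cache()
result.count()   # first action computes the plan and fills the cache
result.show(5)   # later actions reuse the cached data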

I would welcome tips and guidance from the community on how to trigger PySpark computations efficiently. My goal is to make the code run faster and ensure that computations complete in a reasonable time.

I appreciate your help in advance.