PySpark code runs more slowly than Pandas


My current project involves converting Pandas code to PySpark. The code uses a number of transformations, including joins and group-bys. However, I've run into a problem: my PySpark code is much slower than the equivalent Pandas code.

I am still fairly new to PySpark, having spent the last three months learning it. I am currently benchmarking my PySpark code against the original Pandas code. At first my PySpark code appeared to run very quickly, but only because I hadn't called any actions; I later learned that Spark evaluates transformations lazily, so at least one action is needed to trigger the computation.
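To illustrate what I mean by the code seeming fast until an action is called, here is a minimal sketch rather than my real code. The data, the column names (key, total) and the timed helper are made up for illustration, and it assumes a local Spark installation.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def main():
    # Imported here so the timing helper above can be used without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

    # Hypothetical data: 1,000,000 rows with a made-up grouping key.
    df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

    # Transformation only: this builds a query plan, no rows are processed yet.
    grouped, plan_time = timed(
        lambda: df.groupBy("key").agg(F.sum("id").alias("total"))
    )

    # Action: this is the point where the whole plan actually executes.
    n_groups, run_time = timed(grouped.count)

    print(f"plan built in {plan_time:.4f}s, executed in {run_time:.4f}s, {n_groups} groups")
    spark.stop()


# Call main() to run the comparison (requires PySpark and a Java runtime).
```

In a sketch like this, the time reported for the action includes everything upstream of it (reading, grouping, aggregating), not just the count itself, which is why the first measurement looks misleadingly small.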

To trigger the computation and check the results, I added a .count() call to my code, but execution time then increased sharply. I'm not sure whether this is the best approach or whether there are more efficient options.
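Two alternatives I have come across, but have not yet verified on my own pipeline, are caching the result before running several checks on it, and writing to Spark's "noop" sink to force execution without collecting anything. The sketch below shows both; the function names run_checks and force_execution are my own, result_df stands for any PySpark DataFrame, and the "noop" format assumes Spark 3.0 or later.

```python
def run_checks(result_df):
    """Materialise result_df once, then reuse the cached rows for further checks."""
    result_df.cache()                      # only marks the DataFrame for caching; still lazy
    n_rows = result_df.count()             # first action: runs the pipeline and fills the cache
    sample = result_df.limit(5).collect()  # should read from the cache instead of recomputing
    result_df.unpersist()                  # release the cached data when done
    return n_rows, sample


def force_execution(result_df):
    """Execute the full plan for timing purposes and discard the output."""
    # Assumption: the "noop" data source is available (Spark 3.0+).
    result_df.write.format("noop").mode("overwrite").save()
```

Without the cache, each additional action (another count, a show, a collect) would re-run the joins and group-bys from scratch, so the idea is to pay for the computation once. Whether caching helps in practice presumably depends on how large the result is relative to available memory.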

I would welcome tips and guidance from the community on how to trigger PySpark computations efficiently. My goal is to make my code run faster.

I appreciate your help in advance.