07-30, 13:30–14:00 (US/Eastern), Filmhouse
In this talk, we'll explore effective strategies for scaling pandas workloads with PySpark. We'll delve into techniques such as the Pandas API on Spark, Python UDFs, Pandas UDFs, and Pandas Function APIs. In addition, this talk covers how to manage dependencies and environment setup when transitioning to a distributed PySpark cluster, providing insights into optimizing performance and integrating PySpark features smoothly into pandas workflows.
Many people use pandas to handle data, but pandas can struggle with big datasets. That's when many folks turn to PySpark to handle larger workloads and benefit from its support for distributed execution.
However, the challenge comes when combining a pandas workload with PySpark, which offers several approaches: Pandas API on Spark, Python UDFs, Pandas UDFs, and Pandas Function APIs. Each approach has its strengths and weaknesses, and it can be tricky to figure out when to use which. For example, does Pandas API on Spark work with other libraries such as scikit-learn out of the box? Can you always swap a regular pandas DataFrame for a pandas-on-Spark DataFrame to get distributed execution? And what's the difference between a Python UDF and a Pandas UDF? The sketch below gives a feel for how these approaches differ.
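As a minimal illustration (assuming PySpark 3.2+ with pandas and PyArrow installed; the data and names here are made up, not from the talk):

```python
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

# Pandas API on Spark: a pandas-like DataFrame backed by Spark,
# so familiar pandas syntax runs distributed.
psdf = ps.DataFrame({"v": [1.0, 2.0, 3.0]})
psdf["v"].mean()

# Python UDF: invoked once per row with plain Python objects.
@udf("double")
def plus_one(v):
    return v + 1.0

# Pandas UDF: invoked once per Arrow batch with a pandas Series,
# so pandas/NumPy vectorization applies across many rows at once.
@pandas_udf("double")
def plus_one_vectorized(v: pd.Series) -> pd.Series:
    return v + 1.0

df.select(plus_one("v"), plus_one_vectorized("v")).show()

# Pandas Function API: applyInPandas hands each group to a plain
# Python function as a full pandas DataFrame.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```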
Another tricky part is dependency management and environment setup when moving a local pandas workload to a distributed environment: it's difficult to ensure that every node in the cluster has the desired environment with the proper dependencies. A rough sketch of one approach follows.
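One common approach is to pack a Conda environment with conda-pack and ship it to the cluster via Spark's archive mechanism; the environment and archive names below are illustrative:

```python
# Built beforehand on the driver side, e.g.:
#   conda create -y -n pyspark_env -c conda-forge pyspark pandas pyarrow
#   conda activate pyspark_env && conda pack -f -o pyspark_env.tar.gz
import os
from pyspark.sql import SparkSession

# Point the Python workers at the archive, which Spark unpacks
# as ./environment on every node.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # 'spark.yarn.dist.archives' on YARN; --archives with spark-submit.
    .config("spark.archives", "pyspark_env.tar.gz#environment")
    .getOrCreate()
)
```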
This talk demystifies how to scale out pandas workloads with PySpark, sharing best practices and providing insights on when and how to use each of these PySpark features.
Hyukjin is a software engineer at Databricks and the tech lead of the OSS PySpark team, an ASF member, and an Apache Spark PMC member and committer, working across many areas of Apache Spark such as PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, Pandas API on Spark, and Python Spark Connect.