Speeding Up Spark: The Simple Trick That Saved Me 2 Hours
How Partition Tuning Slashed My DataFrame Processing Time
Hello there, fellow coders and web enthusiasts!
Today, I want to share a challenge I was recently assigned, one that had me scratching my head for a bit. Finding a solution was not as straightforward as I’d hoped, but I’m excited to walk you through the journey of figuring it out!
1. The Problem
The task: We had a job that needed to summarize text, using a Python script with a transformer pipeline.
How it was set up: The job was running on a Kubernetes pod, which seemed like a solid choice.
The problem: It would take around 3 hours to process, and honestly, that was already pushing it.
The challenge ahead: With plans to increase the input size, the run time was only going to get longer, and that just wasn’t feasible anymore.
Something had to change: We needed a way to make it faster without compromising the quality of the results.
Before we get into it, let’s get a basic overview of PySpark and UDFs.
PySpark Overview:
PySpark is the Python API for Spark, which is a distributed system for processing big data. It helps process large datasets in parallel using DataFrames and RDDs. You can do things like filtering, aggregating, and transforming data much faster by spreading tasks across a cluster.
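To make that concrete, here’s a minimal sketch (the column names and values are invented purely for illustration) of the kind of built-in operations Spark optimizes and runs in parallel for you:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Example").getOrCreate()
# A tiny, made-up DataFrame just to show the API
df = spark.createDataFrame(
    [("alice", 34), ("bob", 7), ("carol", 41)],
    ["name", "value"],
)
# Built-in operations like filter and agg go through Spark's optimizer
# and are spread across the cluster as parallel tasks
df.filter(F.col("value") > 10).agg(F.avg("value").alias("avg_value")).show()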
UDF (User-Defined Function):
A UDF in PySpark is a custom function you create to apply to DataFrame columns. It’s useful when you need to do something Spark’s built-in functions don’t support, like applying a machine learning model or custom logic. But remember, UDFs can be slower than built-in operations because they run outside of Spark’s optimized execution plan.
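And here’s what a UDF looks like in its simplest form, reusing the tiny DataFrame from the sketch above (the function is a throwaway example; in practice you’d reserve UDFs for logic Spark can’t express with its built-ins):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# A plain Python function, applied row by row outside Spark's optimized engine
def shout(text):
    return text.upper() + "!"
shout_udf = udf(shout, StringType())
# Add a new column computed by the UDF
df.withColumn("name_shouted", shout_udf(df["name"])).show()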
2. The Solution
The solution: I decided to try using PySpark. I wrote a UDF (user-defined function) to apply the transformer and summarize the text.
But, the same issue: Even with this new approach, it was still taking the same amount of time, and it was starting to feel like I was stuck in a loop of frustration.
A little detective work: I decided to dig into the Spark UI to see what was going on under the hood, and that’s when I noticed something surprising.
The issue: Imagine you’ve got a sports car with multiple gears, but for some reason you never shift out of first. The engine is purring, yet you’re barely tapping all that horsepower. That’s exactly what was happening here: Spark was only using 4 cores when it had 96 available. A massive waste of resources.
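If you’d rather confirm this from code than from the Spark UI, a quick check (assuming df is the DataFrame being processed) is to compare the partition count against the parallelism Spark has available:
# One task per partition: with only a few partitions, only a few cores ever work at once
print(df.rdd.getNumPartitions())
# Roughly the total number of cores Spark can schedule tasks on
print(spark.sparkContext.defaultParallelism)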
The fix: Spark runs one task per partition, so a DataFrame split into only a handful of partitions can never keep 96 cores busy. I decided to repartition the data into 512 partitions so the work could spread across the whole cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from transformers import pipeline

# Initialize the Spark session
spark = SparkSession.builder.appName("TextSummarization").getOrCreate()

# Sample DataFrame with text to summarize
data = [("This is a long text that needs to be summarized.",),
        ("Another piece of text that requires summarization.",)]
df = spark.createDataFrame(data, ["text"])

# Load the transformer pipeline for text summarization (a dummy pipeline for the example).
# It is created on the driver and shipped to the executors along with the UDF closure.
summarizer = pipeline("summarization")

# Define the function that summarizes a single text value
def summarize_text(text):
    summary = summarizer(text)
    return summary[0]['summary_text']

# Register it as a UDF that returns a string column
summarize_udf = udf(summarize_text, StringType())

# Repartition the data into 512 partitions so the work spreads across the cores
df_repartitioned = df.repartition(512)

# Apply the summarizer UDF to the DataFrame
df_with_summary = df_repartitioned.withColumn("summary", summarize_udf(df_repartitioned['text']))

# Show the result
df_with_summary.show(truncate=False)
The result: This little tweak cut the processing time down to just 38 minutes!
Success: Problem solved, and it was a total game-changer!
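A quick note on the number itself: with 96 cores, 512 partitions works out to roughly five tasks per core, which lines up with the usual rule of thumb of scheduling a few tasks per core so every core stays busy and slow tasks hurt less. If you’d rather not hard-code it, one option (a sketch, not what I actually shipped) is to derive it from the cluster’s parallelism:
# Assumption: a multiplier of about 4-5 tasks per core; tune it for your workload
target_partitions = spark.sparkContext.defaultParallelism * 5
df_repartitioned = df.repartition(target_partitions)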
3. The Conclusion
Switching to PySpark and a UDF gave me the framework to scale the summarization, but the real win came from making Spark actually use its resources: repartitioning the data so the work spread across all the available cores. In the end, what used to take 3 hours was cut down to just 38 minutes. Definitely a win!