
Making Spark Run Faster#

  • Cluster Optimization
  • Parameters Optimization
  • Code Optimization

Cluster Optimization#

Locality Level#

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

  • PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
  • NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
  • NO_PREF data is accessed equally quickly from anywhere and has no locality preference
  • RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
  • ANY data is elsewhere on the network and not in the same rack

Performance: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY

Locality settings#
  • spark.locality.wait.process
  • spark.locality.wait.node
  • spark.locality.wait.rack
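These waits control how long the scheduler holds a task hoping for a slot at a given locality level before falling back to the next, less local one. A minimal sketch of setting them at session creation (the values are illustrative, not recommendations; shorter waits trade locality for faster scheduling on busy clusters):

    from pyspark.sql import SparkSession

    # Illustrative values only; locality waits must be set before the session starts.
    spark = (
        SparkSession.builder
        .appName("locality-tuning")
        .config("spark.locality.wait", "3s")          # global fallback wait
        .config("spark.locality.wait.process", "1s")  # wait for a PROCESS_LOCAL slot
        .config("spark.locality.wait.node", "3s")     # wait for a NODE_LOCAL slot
        .config("spark.locality.wait.rack", "3s")     # wait for a RACK_LOCAL slot
        .getOrCreate()
    )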

Data Format#

Prefer columnar formats (ORC, Parquet) over plain text for analytical workloads: they compress better and support column pruning and predicate pushdown.

  • text
  • orc
  • parquet
  • avro
Format settings#
  • spark.sql.hive.convertCTAS : when true, CREATE TABLE AS SELECT writes a data-source table in the default format instead of a Hive text table
  • spark.sql.sources.default : default data source format (parquet)
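A hedged example (the paths and the DataFrame df are placeholders) of writing columnar copies of a dataset and making Parquet the default source:

    # Make Parquet the default data source for writes such as saveAsTable / CTAS.
    spark.conf.set("spark.sql.sources.default", "parquet")

    # Write and read back columnar copies of the data (paths are examples).
    df.write.mode("overwrite").parquet("/tmp/events_parquet")
    df.write.mode("overwrite").orc("/tmp/events_orc")
    parquet_df = spark.read.parquet("/tmp/events_parquet")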

parallelising#

  • spark.sql.shuffle.partitions : default is 200
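A small example of sizing shuffle partitions to the workload instead of keeping the default 200 (the value 400 and the column name are assumptions, not recommendations):

    # More partitions for large shuffles, fewer for small data sets.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    counts = df.groupBy("user_id").count()  # this aggregation now shuffles into 400 partitions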

computing#

  • --executor-memory : default is 1g. Too much memory per executor wastes or throttles cluster resources; too little leads to task failures from out-of-memory errors
  • --executor-cores : default is 1. Too many cores per executor can cause I/O contention; too few slow down computation
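The same sizing can be expressed in code when building the session; the values below are illustrative and map one-to-one onto the spark-submit flags above:

    from pyspark.sql import SparkSession

    # Example sizing only; equivalent to --executor-memory 4g --executor-cores 4.
    spark = (
        SparkSession.builder
        .appName("executor-sizing")
        .config("spark.executor.memory", "4g")  # heap per executor
        .config("spark.executor.cores", "4")    # concurrent tasks per executor
        .getOrCreate()
    )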

memory#

  • spark.executor.memoryOverhead : off-heap memory reserved per executor on top of the heap (named spark.yarn.executor.memoryOverhead before Spark 2.3)
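A sketch of reserving overhead alongside the heap (values are examples; setting it too low typically shows up as YARN killing containers for exceeding memory limits):

    # Overhead covers non-heap memory: Python worker processes, shuffle buffers, etc.
    spark = (
        SparkSession.builder
        .appName("memory-overhead")
        .config("spark.executor.memory", "4g")
        .config("spark.executor.memoryOverhead", "1g")
        .getOrCreate()
    )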

table join#

  • spark.sql.autoBroadcastJoinThreshold : default 10M
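A hedged example of a broadcast join; the threshold value, table names and join key are assumptions:

    from pyspark.sql.functions import broadcast

    # Raise the auto-broadcast threshold to ~100 MB, or force the broadcast hint
    # explicitly when the small side is known to fit in executor memory.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
    joined = orders_df.join(broadcast(customers_df), "customer_id")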

predicate push down in Spark SQL queries#

  • spark.sql.parquet.filterPushdown : default true
  • spark.sql.orc.filterPushdown : default false (set to true to enable)
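For example (the path and column are assumptions), with pushdown enabled a filter on a Parquet table is evaluated inside the reader, so row groups whose statistics cannot match are skipped rather than read:

    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    recent = (
        spark.read.parquet("/tmp/events_parquet")
        .filter("event_date >= '2024-01-01'")  # pushed down to the Parquet reader
    )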

reuse RDD#

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_ONLY)
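A possible usage pattern (df and the country column are placeholders): trigger one action to materialise the cache, reuse it across jobs, and release it when done.

    df.count()                                  # first action populates the cache
    by_country = df.groupBy("country").count()  # reuses the cached data
    df.unpersist()                              # free the memory once finished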

Spark operators#

  • Shuffle operators
  • Avoid shuffle operators (reduceByKey, join, distinct, repartition, etc.) where possible
  • Broadcast small datasets instead of joining them

  • High-performance operators (see the sketch after this list)

  • reduceByKey > groupByKey (reduceByKey aggregates on the map side before the shuffle)
  • mapPartitions > map (fewer function calls)
  • treeReduce > reduce (treeReduce works on the executors, not the driver)
    • treeReduce and reduce both return their result to the driver
    • treeReduce does more of the work on the executors, while reduce brings everything back to the driver
  • foreachPartition > foreach (fewer function calls)
  • coalesce after filter (reduces the number of partitions and downstream tasks)
  • repartitionAndSortWithinPartitions > repartition followed by sortByKey
  • broadcast small lookup data (up to roughly 100 MB)
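A short sketch of the operator preferences above, using a hypothetical pair RDD of (word, 1) records:

    sc = spark.sparkContext
    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # reduceByKey combines values on the map side before the shuffle ...
    counts = pairs.reduceByKey(lambda x, y: x + y)
    # ... while groupByKey ships every record across the network first.
    counts_slow = pairs.groupByKey().mapValues(sum)

    # mapPartitions amortises per-record overhead (e.g. opening a connection)
    # across a whole partition instead of paying it once per element.
    def add_one(iterator):
        return (x + 1 for x in iterator)
    nums = sc.parallelize(range(100)).mapPartitions(add_one)

    # coalesce after a selective filter shrinks the number of (now mostly empty)
    # partitions and therefore the number of downstream tasks.
    small = nums.filter(lambda x: x % 10 == 0).coalesce(2)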

shuffle#

  • spark.shuffle.sort.bypassMergeThreshold : skip map-side sorting when the number of reduce partitions is below this threshold and there is no map-side aggregation
  • spark.shuffle.io.retryWait : how long to wait between retries of a failed shuffle fetch
  • spark.shuffle.io.maxRetries : how many times to retry a failed shuffle fetch
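These are fixed at session creation; a sketch with example values (not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("shuffle-tuning")
        # Use the bypass-merge path when there are few reduce partitions
        # and no map-side aggregation.
        .config("spark.shuffle.sort.bypassMergeThreshold", "400")
        .config("spark.shuffle.io.maxRetries", "6")   # retry failed shuffle fetches more often
        .config("spark.shuffle.io.retryWait", "10s")  # wait longer between retries
        .getOrCreate()
    )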

TBC