Spark run faster and faster#
- Cluster Optimization
- Parameters Optimization
- Code Optimization
Cluster Optimization#
Locality Level#
Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:
- PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
- NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
- NO_PREF data is accessed equally quickly from anywhere and has no locality preference
- RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
- ANY data is elsewhere on the network and not in the same rack
Performance: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL
Locality settting#
- spark.locality.wait.process
- spark.locality.wait.node
- spark.locality.wait.rack
Data Format#
- text
- orc
- parquet
- avro
format setting#
- spark.sql.hive.convertCTAS
- spark.sql.sources.default
parallelising#
- spark.sql.shuffle.partitions : default is 200
computing#
- --executor-memory : default is 1G
- --executor-cores : default is 1 if large memory cause resource throtle in cluster, if small memory cause task termination if more cores cause IO issue, if less cores slow dow computing
memory#
- spark.executor.overhead.memory
table join#
- spark.sql.autoBroadcastJoinThreshold : default 10M
predicate push down in Spark SQL queries#
- spark.sql.parquet.filterPushdown : default True
- spark.sql.orc.filterPushdown=true : default False
reuse RDD#
Spark operators#
- shuffle operators
- avoid using reduceByKey, join, distinct, repartition etc
-
Broadcast small dataset
-
High performance operator
- reduceByKey > groupByKey (reduceByKey works at map side)
- mapPartitions > map (reduce function calls)
- treeReduce > reduce (treeReduce works at executor not driver)
- treeReduce & reduce return some result to driver
- treeReduce does more work on the executors while reduce bring everything back to the driver.
- foreachPartitions > foreach (reduce function calls)
- filter -> coalesce (reduce number of partitions and reduce tasks)
- repartitionAndSortWithinPartitions > repartition & sort
- broadcast (100M)
shuffle#
- spark.shuffle.sort.bypassMergeThreshold
- spark.shuffle.io.retryWait
- spark.shuffle.io.maxRetries
TBC