Apache Spark… This is the Base

Spark can be defined as an open-source computing engine that takes a functional approach to parallel data processing on computer clusters, and also as a set of libraries and executables.

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Spark: The Definitive Guide

AI Generated* Data Engineer using Apache Spark

Spark is designed to solve a wide range of data analysis tasks, from data loading and SQL queries to machine learning and streaming data processing. At the moment, Spark is considered one of the most actively developed and widely used open-source tools. To understand how to work with Spark, let’s look at its components.

Ecosystem

Spark Ecosystem. Not AI generated

Part I. Programming language support

Spark can be integrated with various programming languages for analytical tasks, including Java, Python, Scala, and R.

Part II. Components

Spark contains five main components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.

1. Spark Core includes functions for memory management, fault recovery, task scheduling across the cluster, and interaction with storage systems.

2. Spark SQL – an SQL query engine that supports various data sources and uses a data structure such as DataFrame.

3. Spark Streaming – processing of streaming data in real time.

4. MLlib – library for machine learning.

5. GraphX – a library for working with graphs.

Part III. Cluster Management

The cluster manager can be a Standalone cluster, Apache Mesos, or YARN.

Concerning the Catalyst Optimizer: Spark SQL is the most common and easy-to-use Spark component, supporting both SQL queries and the DataFrame API, and the Catalyst optimizer uses advanced programming language features to build more optimal queries. It allows new optimization rules and functions to be added and makes the optimizer extensible. Among other things, Catalyst is used to speed up task execution and to optimize resource usage.
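As an illustration (a minimal sketch, not from the original article; the session and query are purely illustrative), the plan produced by Catalyst can be inspected with explain():

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("catalyst_demo").getOrCreate()

# Catalyst rewrites the query before execution; explain() prints the logical and physical plans
df = spark.range(100).filter("id > 10").select("id")
df.explain(extended=True)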

Computations

There are several main parts involved in the computation process; a short sketch of their roles follows the list below.


Interaction architecture. Not AI generated

  • DRIVER: runs the main program of the application and coordinates its work

  • EXECUTOR: performs the computations

  • CLUSTER MANAGER: manages the real cluster machines and controls the allocation of resources for Spark applications
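A minimal sketch (names are illustrative, not from the article) of how these roles show up in a PySpark program:

# The driver is the process that runs this script and builds the execution plan
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[2]").appName("roles_demo").getOrCreate()

# The work is split into tasks that the cluster manager assigns to executors
rdd = spark.sparkContext.parallelize(range(10), 2)
squares = rdd.map(lambda x: x * x)  # computed on the executors

# collect() is an action: the executors send their partitions back to the driver
print(squares.collect())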

Data structures

Chronology of the appearance of data structures across Spark versions:

  • RDD – since Spark 0.x (Low level API)

    • Resilient Distributed Dataset – a set of objects divided into blocks (partitions). An RDD can hold both structured and unstructured data. Partitions can be stored on different cluster nodes. RDDs are fault tolerant and can be recovered in the event of a failure.

  • DataFrame – starting from Spark 1.3 (Structured API)

    • a set of typed records divided into blocks; in other words, a table consisting of rows and columns. Blocks can be processed on different nodes of the cluster. A DataFrame can represent only structured or semi-structured data, organized as a named set of columns, reminiscent of a table in a relational database.

  • DataSet – starting from Spark 1.6 (Structured API)

Let’s consider the behavior of these data structures in the context of Immutability and Interoperability**:

An RDD consists of a set of data that can be divided into blocks. A block, or partition, can be considered an integral, logically immutable piece of data, which can be created, among other ways, through transformations of existing blocks.

A DataFrame can be created from an RDD. After such a conversion it is no longer possible to go back; that is, the original RDD cannot be restored once it has been transformed into a DataFrame.

DataSet: Spark allows both an RDD and a DataFrame to be converted into a Dataset.
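A minimal PySpark sketch of these properties (assuming an existing SparkSession named spark; the data and names are illustrative): a transformation returns a new object, and an RDD can be turned into a DataFrame:

# An RDD is immutable: map() returns a new RDD, the source RDD stays unchanged
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled_rdd = rdd.map(lambda x: x * 2)

# Interoperability: an RDD of tuples can be converted into a DataFrame
df = spark.createDataFrame(rdd.map(lambda x: (x,)), ["col1"])
filtered_df = df.filter(df.col1 > 2)  # again a new DataFrame, df itself is not modified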

Sources for data structures

Data Sources API

  • RDD – the Data Sources API allows an RDD to be generated from virtually any source, including plain text files that are not necessarily structured.

  • DataFrame – the Data Sources API allows different file formats to be processed (AVRO, CSV, JSON), as well as data from HDFS storage systems, HIVE*** tables, and MySQL.

  • DataSet – the Dataset API also supports various data formats.
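For example, a minimal sketch (assuming an existing SparkSession named spark; the path is illustrative) of building an RDD from an unstructured text file:

# RDD from a plain text file (low-level API); each element is one line of the file
lines_rdd = spark.sparkContext.textFile("/directory/your_file_name.txt")
words_rdd = lines_rdd.flatMap(lambda line: line.split())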

Spark DataFrame can be created from various sources:

DF Sources. Not AI generated

Examples of generating a Spark DataFrame:

#from file:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
fields = [StructField("col1", IntegerType()), StructField("col2", StringType())]
schema1 = StructType(fields)
from_file_df = spark.read.csv("/directory/your_file_name", schema=schema1, sep=";", header=True)

#to view, call show():
from_file_df.show()

DataFrame from txt. Not AI Generated

For json files, it is advisable to define the schema in advance to avoid errors:

DataFrame from json. Not AI Generated
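Since the screenshot is not reproduced here, a minimal sketch of the same idea (assuming an existing SparkSession named spark; the path and column names are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

json_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# An explicit schema avoids relying on schema inference and catches type errors early
from_json_df = spark.read.json("/directory/your_file_name.json", schema=json_schema)
from_json_df.show()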

For csv files, it is recommended to take into account headers and delimiters when defining a DataFrame:

DataFrame from csv. Not AI Generated
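Likewise, a minimal sketch of reading a CSV file with an explicit header and delimiter (the path is illustrative):

# header and sep must match the file so columns are split and named correctly
from_csv_df = (
    spark.read
    .option("header", True)
    .option("sep", ";")
    .option("inferSchema", True)
    .csv("/directory/your_file_name.csv")
)
from_csv_df.show()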

#from HDFS table – 2 options:
#PySpark
pyspark_df = spark.table("schema_name.table_name")
#SQL query
sql_statement = """SELECT * FROM schema_name.table_name"""
sql_df = spark.sql(sql_statement)

#From Pandas DF (DF_Pandas is an existing pandas.DataFrame):
from_pandas_df = spark.createDataFrame(DF_Pandas)
#Directly from an existing RDD:
from_rdd_df = spark.createDataFrame(rdd)

Spark DataFrame from Pandas DataFrame. Not AI Generated

Spark DataFrame has functionality similar to Pandas, such as select (column selection), filter (filtering), sort (sorting), withColumn (new columns), join (joining tables), and others. Spark DataFrames are somewhat similar in concept to the Pandas library in Python.
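A minimal sketch of these operations, reusing from_file_df from the example above (the column names come from schema1; the second DataFrame in the join is illustrative):

from pyspark.sql import functions as F

result_df = (
    from_file_df
    .select("col1", "col2")                         # column selection
    .filter(F.col("col1") > 10)                     # filtering
    .sort(F.col("col1").desc())                     # sorting
    .withColumn("col1_doubled", F.col("col1") * 2)  # new column
)

# Joining two DataFrames is done with join() (the counterpart of pandas merge)
joined_df = result_df.join(from_file_df.select("col1"), on="col1", how="inner")
joined_df.show()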

Lazy evaluation

Lazy evaluation works the same way for RDD, DataFrame and DataSet: results are produced only when an action is executed (a computation that returns a result and implies data exchange between the executors and the driver). When working with an RDD, the result is not computed at the moment of definition; instead, a sequence of transformations applied to the initial RDD is recorded. The transformations themselves are executed only when the result needs to be displayed or passed on. In DataFrame and DataSet, computation happens in the same way: at the moment when some action is required on the object (show(), count(), collect(), saveAs(), etc.).

AI generated lazy evaluation
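A minimal sketch of lazy evaluation, again reusing from_file_df (column names per schema1 above): transformations only build the plan, actions trigger the computation:

# Transformations are only recorded in the plan, nothing is computed yet
lazy_df = from_file_df.filter(from_file_df.col1 > 10).select("col2")

# Actions trigger the actual computation and return a result to the driver
print(lazy_df.count())  # action
lazy_df.show()          # another action: the plan is executed again unless cached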

Spark session configuration parameters

To start a Spark session, just run:

from pyspark.sql import SparkSession

spark1 = SparkSession.builder.master("local").appName("<app_name>").getOrCreate()

Sessions can be configured at the launch stage using parameters. Let’s look at some of the main ones****:

  • spark.app.name (default: none) – Application name; displayed in the UI and in logs.

  • spark.driver.cores (default: 1) – Number of cores used by the driver (in cluster mode).

  • spark.driver.maxResultSize (default: 1g) – Limit on the total size of the serialized results of all partitions for each Spark action (e.g. collect), in bytes.

  • spark.driver.memory (default: 1g) – Amount of memory for the driver process.

  • spark.executor.memory (default: 1g) – Amount of memory for each executor process.

  • spark.executor.pyspark.memory (default: not set) – Amount of memory allocated to PySpark in each executor. If this parameter is specified, PySpark memory for the executor is limited to that value; if it is not specified, Spark does not limit the memory.

These parameters can be combined with each other for optimal memory allocation, faster execution, and a reduction of intermediate computations.
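A minimal sketch of setting several of these parameters when building a session (the values are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark2 = (
    SparkSession.builder
    .master("local")
    .appName("configured_app")                    # spark.app.name
    .config("spark.driver.memory", "2g")
    .config("spark.driver.maxResultSize", "2g")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.pyspark.memory", "2g")
    .getOrCreate()
)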

AI Generated Data Engineer. I wish I could understand where the face is…

* AI pictures Generator
** Apache Spark RDD vs DataFrame vs DataSet
*** Hive Tutorial for Beginners
**** Application Properties
