Apache Spark Interview Questions for Data Engineer

Spark can be defined as an open-source computing engine and a set of libraries for parallel data processing on computer clusters.

Alternatively, it can be described as a framework for distributed batch and stream processing of unstructured and semi-structured data, part of the Hadoop ecosystem of projects.

Apache Spark Architecture

Ecosystem

Components

Spark has 5 main components. These are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.

  1. Spark Core includes functions for memory management, failure recovery, cluster task scheduling, and interaction with storage systems.

  2. Spark SQL — an SQL query engine that supports multiple data sources and uses a data structure called DataFrame.

  3. Spark Streaming — processing of streaming data in real time.

  4. MLlib — a library for machine learning.

  5. GraphX — a library for working with graphs.

Cluster Manager

The Cluster Manager manages the actual cluster machines and controls resource allocation for Spark applications. The cluster manager can be Spark's standalone cluster manager, Apache Mesos, or YARN.

YARN appeared in Hadoop 2.0. There are two modes available when running applications on YARN:

  • yarn-client mode: the Spark driver runs in the client process, and the YARN application master is used only to request resources from YARN.

  • yarn-cluster mode: the Spark driver runs inside the application master process, which is managed by YARN on the cluster, and the client can be remote.

Catalyst Optimizer

Catalyst Optimizer: Spark SQL is the most widely used and easiest-to-use component of Spark, supporting both SQL queries and the DataFrame API. The Catalyst optimizer uses advanced programming-language features (for example, Scala pattern matching) to build more optimal query plans. It is extensible: new optimization rules and functions can be added. Catalyst is used, among other things, to speed up query execution and to optimize resource usage.
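
The optimized plan that Catalyst produces can be inspected with explain(). A minimal sketch, assuming an existing SparkSession named spark:

    # Catalyst rewrites the logical plan before execution; explain() shows the result.
    df = spark.range(1000).withColumnRenamed("id", "value")
    filtered = df.filter("value > 500").select("value")
    filtered.explain(True)  # prints the parsed, analyzed, optimized, and physical plans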


The process of calculations

There are several main parts involved in the computation process.

  • DRIVER: runs the user program

    • executes the program code (converts the user program into tasks)

    • schedules and launches the work of the executors

    Every Spark application contains a driver, which initiates various operations on the cluster. The driver process runs the user code that creates a SparkContext, creates RDDs, and performs transformations and actions. If the driver exits, the application is finished.

  • EXECUTOR: performs calculations

  • CLUSTER MANAGER: manages the actual machines in the cluster and controls resource allocation for Spark applications

In-memory processing. Spark stores and processes data in RAM. This approach provides much greater speed than loading and reading data from disk, as in MapReduce.

Parallel processing and combining operations. Spark distributes data and computations across multiple nodes in a cluster, performing different processing operations in parallel in real time. This distinguishes it from MapReduce, where each subsequent stage of working with a dataset requires the previous one to be completed. Thanks to these features, Spark is tens of times faster than MapReduce.

Workflow in Apache Spark:

  • A distributed system has a cluster manager that controls the distribution of resources among executors.

  • When a Spark application starts, the driver requests resources from the cluster manager to run executors.

  • If resources are available, the cluster manager starts the executors.

  • The driver sends its tasks to the available executors.

  • The executors perform the necessary calculations and send the results to the driver.

  • After all calculations are completed, the executors shut down and their resources are returned to the cluster. After that, the distributed system can work on other tasks.
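
The workflow above can be illustrated with a minimal PySpark application. This is only a sketch, with a placeholder application name:

    from pyspark.sql import SparkSession

    # Driver code: creating the SparkSession asks the cluster manager for executors.
    spark = SparkSession.builder.appName("workflow-demo").getOrCreate()

    # Transformations run on the executors; the action returns the result to the driver.
    total = spark.range(1_000_000).selectExpr("sum(id) AS total").collect()[0]["total"]
    print(total)

    # Stopping the application releases the executors back to the cluster.
    spark.stop()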


Data structures

Timeline of data structures appearance by Spark versions:

  • RDD — available since the first versions of Spark (low-level API)

    • Resilient Distributed Dataset — a set of objects divided into partitions. An RDD can hold both structured and unstructured data. Partitions can be stored on different cluster nodes. RDDs are fault-tolerant and can be recovered in the event of a failure.

  • DataFrame — starting with Spark 1.3 (Structured API)

    • a set of records divided into blocks (partitions); in other words, a table consisting of rows and columns. Blocks can be processed on different nodes of the cluster. A DataFrame can represent only structured or semi-structured data. The data is organized into a named set of columns, reminiscent of tables in relational databases.

  • DataSet — starting with Spark 1.6 (Structured API)

    • a typed version of DataFrame, available in the Scala and Java APIs

Let's consider the behavior of these data structures in the context of immutability and interoperability:

An RDD consists of data divided into blocks. A block, or partition, is a complete, logical, immutable piece of data that can be created, among other things, through transformations of existing blocks.

A DataFrame can be created from an RDD. After such a transformation, the original RDD cannot be restored: converting the DataFrame back (for example, via df.rdd) yields an RDD of Row objects rather than the original typed objects.

DataSet: Spark allows both an RDD and a DataFrame to be converted into a Dataset. A Dataset is a typed DataFrame.

In a DataFrame, columns are accessed by name, while in a Dataset each record is a typed object. A DataFrame is an untyped Dataset (in Scala, DataFrame is simply an alias for Dataset[Row]).
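
A short PySpark sketch of this interoperability (the typed Dataset API is available only in Scala and Java, so the example covers RDD and DataFrame; it assumes an existing SparkSession named spark):

    # Build an RDD of Python objects via the driver's SparkContext.
    rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

    # RDD -> DataFrame: attach a schema (named columns).
    df = spark.createDataFrame(rdd, ["name", "age"])

    # DataFrame -> RDD: df.rdd yields an RDD of Row objects,
    # not the original tuples, so the original typed RDD is not recovered.
    rows = df.rdd.collect()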

Sources for data structures

Data Sources API

  • RDD — the Data Sources API allows an RDD to be built from virtually any source, including text files, and the data does not even need to be structured.

  • DataFrame — the Data Sources API allows different file formats (Avro, CSV, JSON) and storage systems (HDFS, Hive tables, MySQL) to be processed.

  • DataSet — Dataset API also supports various data formats.

A Spark DataFrame can be created from various sources:

Examples of creating Spark DataFrames: https://habr.com/ru/articles/777294/
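
For illustration, a DataFrame can be created from in-memory data or loaded from files; the paths below are placeholders, and an existing SparkSession named spark is assumed:

    # From a local Python collection.
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # From files in various formats (hypothetical paths).
    df2 = spark.read.csv("/data/events.csv", header=True, inferSchema=True)
    df3 = spark.read.json("/data/events.json")
    df4 = spark.read.parquet("/data/events.parquet")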


Spark and Hadoop

The relationship between Spark and Hadoop only matters if you want to run Spark on a cluster where you have Hadoop installed. Otherwise, you don't need Hadoop to run Spark.

To run Spark on a cluster, you need a shared file system. A Hadoop cluster provides access to a distributed file system through HDFS and a cluster manager in the form of YARN. Spark can use YARN as a cluster manager for distributed work and use HDFS to access data. Additionally, some Spark applications can use the MapReduce programming model, but MapReduce is not the underlying model for computing in Spark.

Lazy evaluation

Lazy evaluation: Spark uses the concept of lazy execution of computations. This means that operations on data are executed only when their results are actually needed. Thanks to this, computing resources are not wasted on calculations whose results are never used.
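
A small sketch of lazy evaluation, assuming an existing SparkSession named spark: the transformations below only build an execution plan, and nothing is computed until the action at the end.

    df = spark.range(10_000_000)                  # a transformation: only a plan, nothing runs yet
    squares = df.selectExpr("id * id AS square")  # still lazy: just extends the plan
    filtered = squares.filter("square % 7 = 0")   # still lazy

    # count() is an action: only now is the plan optimized and executed.
    print(filtered.count())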

Transformations

Transformations are operations on an existing RDD that return a new RDD.

Actions

Actions compute a final result based on an RDD and either return that result to the driver program or store it in an external storage system such as HDFS. Transformations return RDDs, while actions return other data types.
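
For example, assuming an existing SparkContext named sc: transformations such as filter() and map() return a new RDD lazily, while actions such as count() and collect() trigger the computation and return ordinary Python values to the driver.

    rdd = sc.parallelize(range(10))           # source RDD
    evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: returns a new RDD, nothing runs yet
    doubled = evens.map(lambda x: x * 2)      # another transformation

    print(doubled.count())    # action: returns an int (5)
    print(doubled.collect())  # action: returns a Python list ([0, 4, 8, 12, 16])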


Spark Application

Spark applications include a driver program and executors and run various parallel operations on a cluster. There are two types of Spark applications: Spark notebook applications and Spark batch applications.

Driver ~ application

A Spark application is split into Jobs: 1 Job = 1 Action (an action, unlike a transformation, is not lazy).

A Job is divided into Stages (each shuffle creates a stage boundary).
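
For instance, in the sketch below (assuming an existing SparkSession named spark), the single collect() action launches one job, and the groupBy() introduces a shuffle, so that job is split into more than one stage (this can be observed in the Spark UI):

    df = spark.range(1_000_000).selectExpr("id % 10 AS bucket", "id")
    grouped = df.groupBy("bucket").count()  # groupBy introduces a shuffle -> a stage boundary
    grouped.collect()                       # one action = one job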

Within a Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they are submitted from separate threads. A “job” in this case is a Spark action (e.g. save, collect) and any tasks that need to be performed to evaluate that action. The Spark scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (for example, queries from multiple users).

By default, the Spark scheduler runs jobs in FIFO mode. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to run, then the second job gets priority, and so on. If the job at the front of the queue doesn't need to use the entire cluster, later jobs can start running right away, but if the jobs at the front of the queue are large, later jobs may be significantly delayed.

SparkConf

SparkConf stores configuration parameters for an application deployed to Spark. Every application must be configured before it can be deployed to a Spark cluster. Some configuration parameters specify properties of the application, and some are used by Spark to allocate resources on the cluster. Configuration parameters are passed to the Spark cluster via SparkContext.

SparkContext

SparkContext represents the connection to the Spark cluster. It is the entry point to Spark and configures the internal services needed to establish communication with the Spark runtime.
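
A minimal sketch of configuring and creating the entry point; the master URL and memory setting are placeholder values:

    from pyspark import SparkConf, SparkContext

    # SparkConf holds the application's configuration parameters.
    conf = (SparkConf()
            .setAppName("my-app")
            .setMaster("local[*]")                # placeholder: a real cluster URL would go here
            .set("spark.executor.memory", "2g"))  # example resource setting

    # SparkContext passes the configuration to the cluster and opens the connection to it.
    sc = SparkContext(conf=conf)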

SparkUI

The Spark UI is Spark's built-in web interface for monitoring and analyzing jobs, stages, tasks, storage, and executors. See also: Spark UI Beginner's Guide: How to Monitor and Analyze Spark Jobs.

Spark Jobs optimization

Dynamic resource allocation

Spark provides a mechanism to dynamically adjust the resources an application occupies based on the workload. This means that an application can return resources to the cluster when they are no longer in use and request them again later when needed. This feature is especially useful when multiple applications share resources in a Spark cluster.
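
Dynamic allocation is controlled through configuration. A sketch of typical settings (the values are placeholders, and an external shuffle service or shuffle tracking is normally required as well):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-allocation-demo")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "1")   # placeholder values
             .config("spark.dynamicAllocation.maxExecutors", "20")
             .config("spark.dynamicAllocation.initialExecutors", "2")
             .config("spark.shuffle.service.enabled", "true")       # external shuffle service
             .getOrCreate())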


Optimizing Spark Jobs

  • Choosing the Right Data Structures
    While RDD is Spark’s core data structure, it’s a lower-level API that requires more verbose syntax and lacks the optimizations provided by higher-level data structures. Spark moved toward a more convenient and optimized API with the introduction of DataFrames, higher-level abstractions built on top of RDDs. Data in a DataFrame is organized into named columns, making it more similar to data in a relational database. DataFrame operations also benefit from Catalyst, Spark’s optimized SQL execution engine, which can improve computational efficiency, potentially improving performance. Transformations and actions can be performed on DataFrames just as they can on RDDs.

  • Caching
    Each time an action is computed, the source data is read again (even though the DataFrame has already been defined over that data). To avoid this unnecessary re-reading, the data can be cached.
    Caching is an important technique that can lead to significant increases in computational efficiency. Frequently accessed data and intermediate calculations can be cached or stored in a memory location, allowing for faster retrieval. Spark provides a built-in caching feature that can be especially useful for machine learning algorithms, graph processing, and any other application where the same data needs to be accessed multiple times. Without caching, Spark would recalculate the RDD or DataFrame and all its dependencies every time an action is called.

    The following block of Python code uses PySpark, the Python Spark API, to cache a DataFrame named df:

    df.cache()

    It is important to remember that caching requires careful planning because it uses the memory resources of Spark worker nodes that perform tasks such as performing computations and storing data. If the dataset significantly exceeds the available memory, or you cache RDDs or DataFrames without reusing them in subsequent stages, potential overflows and other memory management issues can lead to poor performance.

  • Data Partitioning
    Spark's architecture is built on partitioning, dividing large amounts of data into smaller, more manageable units called partitions. Partitioning allows Spark to process large amounts of data in parallel by distributing computations across multiple nodes, each processing a subset of the total data.

    While Spark provides a default partitioning strategy, typically based on the number of available CPU cores, it also provides options for custom partitioning: users can specify a custom partitioning scheme, such as partitioning the data by a specific key.

    Number of partitions
    One of the most important factors affecting parallel processing efficiency is the number of partitions. If there are not enough partitions, the available memory and resources may be underutilized. On the other hand, too many partitions may increase overhead due to task scheduling and coordination. The optimal number of partitions is usually a small multiple of the total number of cores available in the cluster.

    Partitions can be specified using repartition() and coalesce(). This example repartitions a DataFrame into 200 partitions and then coalesces it down to 50:

    df = df.repartition(200) # repartition: performs a full shuffle
    df = df.coalesce(50) # coalesce: merges adjacent partitions, no full shuffle

    The repartition() method increases or decreases the number of partitions in an RDD or DataFrame and performs a full shuffle of the data across the entire cluster, which can be expensive in terms of processing and network latency. The coalesce() method decreases the number of partitions in an RDD or DataFrame and, unlike repartition(), does not perform a full shuffle but instead merges adjacent partitions to reduce the total number.

  • Broadcasting
    Join() is a common operation in which two datasets are combined based on one or more common keys. Rows from two different datasets are merged by matching the values in the specified columns. Because it requires shuffling data between multiple nodes, join() can be an expensive operation in terms of network latency. Broadcasting reduces this cost when one of the datasets is small: Spark sends a copy of the small dataset to every worker node, so the join can be performed locally without shuffling the large dataset.
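
    A sketch of a broadcast join, assuming an existing SparkSession named spark and a small lookup table that fits in memory on every node (paths and column names are placeholders):

    from pyspark.sql.functions import broadcast

    large_df = spark.read.parquet("/data/transactions.parquet")  # hypothetical path
    small_df = spark.read.parquet("/data/countries.parquet")     # small lookup table

    # broadcast() hints Spark to copy small_df to every node,
    # so the join avoids shuffling large_df across the network.
    joined = large_df.join(broadcast(small_df), on="country_code", how="left")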

  • Filtering unused data
    When working with high-dimensional data, minimizing computational costs is important. Any rows or columns that are not absolutely necessary should be removed. Two key techniques that reduce computational complexity and memory usage are early filtering and column pruning.
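
    A sketch of early filtering and column pruning (the path and column names are placeholders, assuming an existing SparkSession named spark):

    df = spark.read.parquet("/data/events.parquet")

    # Select only the needed columns and filter rows as early as possible,
    # so later stages process less data.
    slim = (df.select("user_id", "event_type", "ts")
              .filter("event_type = 'purchase'"))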

  • Minimizing the use of Python user-defined functions (UDFs)
    One of the most effective methods for optimizing PySpark is to use PySpark's built-in functions whenever possible. PySpark comes with a rich library of built-in functions, all of which are optimized by Catalyst and executed inside the JVM, whereas Python UDFs require serializing data between the JVM and Python worker processes and are invisible to the optimizer, which makes them significantly slower.
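
    A sketch contrasting a Python UDF with the equivalent built-in function, assuming a DataFrame df with a string column name:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # A Python UDF: data is serialized to Python workers and hidden from Catalyst.
    to_upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
    df1 = df.withColumn("name_upper", to_upper_udf("name"))

    # The built-in equivalent runs inside the JVM and is optimized by Catalyst.
    df2 = df.withColumn("name_upper", F.upper("name"))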
