Streaming Frameworks: Apache Flink

The demands on how quickly modern systems process information keep growing. Users no longer want to wait more than a few seconds for a post on a social network or a movie in an online cinema to load. Developers of high-load systems therefore face the task of processing big data in real time.

In this article you will learn:

  • what streaming frameworks are and what they are for;

  • the main features, architecture, and operating principle of Apache Flink;

  • the similarities and differences between Apache Flink and Apache Spark;

  • the main use cases of Apache Flink;

  • what to consider when designing Apache Flink-based applications.

The article will be useful for architects, system analysts, and system designers working on real-time big data processing problems.

What are streaming frameworks and what are they for?

Streaming data processing, or data streaming, is an approach that allows large amounts of data to be processed as a stream with minimal lag behind real time.

A streaming framework (stream processing framework) is a tool that enables stream data processing. A streaming framework can read one or more data streams from different sources and process them. At the output, it can pass the processed stream on to other applications or persist the data in a database or storage.

There are two types of data streams.

  1. Unbounded stream. Such a stream is generated in real time and has no defined end. An example of an unbounded stream is the stream of events about user likes and views coming from a message broker.

  2. Bounded stream. Such a stream has a clearly defined beginning and end. An example of a bounded stream is data received from a database: although records can still be added to the database, for the streaming framework the stream is limited to what was returned by a single query.

Most often, the primary source in stream processing is an unbounded stream. Data from bounded streams can be used to enrich the primary stream.

Fig. 1 – Bounded and unbounded data streams (source)


Main features of the framework

Apache Flink is an open-source streaming framework designed for processing both unbounded and bounded data streams.

The framework has the following main features.

  1. Data is processed record by record (event by event). This plays an important role in processing speed.

  2. Supported programming languages: Java, Scala, Python.

  3. You can use advanced window functions. Window functions let you divide a stream into segments, called windows, and perform operations on the data within each window (see the example after this list).

  4. The exactly-once approach can be implemented (strictly one-time delivery, where each incoming event affects the result exactly once). There is a limitation: the data source and sink must also support exactly-once semantics.

These features of the framework allow for the processing of large amounts of data in real time.
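As an illustration of window functions (item 3 above), here is a minimal sketch using the Java DataStream API. The in-memory source and user IDs are hypothetical stand-ins; a real job would read such (userId, 1) "like" events from a connector such as Kafka. The job counts likes per user in one-minute tumbling windows:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedLikeCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: one (userId, 1) pair per "like" event.
        DataStream<Tuple2<String, Integer>> likes = env.fromElements(
                Tuple2.of("user-1", 1), Tuple2.of("user-2", 1), Tuple2.of("user-1", 1));

        likes
            .keyBy(event -> event.f0)                                  // split the stream by user ID
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // one-minute tumbling windows
            .sum(1)                                                    // likes per user per window
            .print();                                                  // sink: stdout

        env.execute("windowed-like-count");
    }
}
```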

How Apache Flink Works

At a high level, data processing with Apache Flink can be depicted as follows:

Fig. 2 – Data processing with Apache Flink (source)

An application based on the framework reads data from various sources. Sources can be unbounded real-time streams from applications, devices, and message brokers, as well as bounded streams from databases and file storage.

Inside the application, processing is organized as a directed acyclic graph (DAG). Each node of the graph performs some operation on the data (sorting, filtering, calculations, etc.). The graph can be represented in two forms: logical and physical.

The logical graph shows which operations will be performed on the data; its nodes are called operators. The physical graph is the logical graph transformed for execution: each operator can be represented by several tasks, which can run either in parallel or sequentially.

Fig. 3 – Directed acyclic graph (DAG)

After processing, the application can send the results to sinks: other streams for further processing, databases, storage, etc.
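To make the source → operator → sink chain concrete, here is a minimal sketch of a job whose logical graph has two operators. The in-memory source, the event names, and the stdout sink are illustrative assumptions; a production job would typically read from a broker and write to a database or another stream:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleDagJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: a hypothetical stream of user-behavior events.
        DataStream<String> source = env.fromElements("click", "like", "view", "like");

        source
            .filter(event -> event.equals("like"))  // operator 1: keep only "like" events
            .map(String::toUpperCase)               // operator 2: transform each record
            .print();                               // sink: write results to stdout

        env.execute("simple-dag-job");
    }
}
```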

State in Apache Flink

You may have heard of the Stateful and Stateless architectural concepts. In brief:

Stateful architecture

  • Meaning: data is stored within the service between requests and sessions.

  • Pros: fast processing; state data can be saved between requests and used in future tasks.

  • Cons: additional load on system resources.

Stateless architecture

  • Meaning: data is not stored within the service.

  • Pros: easier to develop; better scaling; lower resource consumption.

  • Cons: reduced performance; more data to transfer between services; complex logic is harder to implement.

Apache Flink allows you to implement both stateless and stateful applications. To support stateful computing, Apache Flink has a component called State. State is a snapshot of data at a specific point in time. State is stored in RAM (in-memory), which speeds up computations as much as possible.

Let's look at how State works in Apache Flink using a simplified example. Suppose we need to count how many times each user has liked something.

Fig. 4 – State in Apache Flink

Suppose the event "user liked" is written to the Apache Kafka message broker. An Apache Flink-based application reads messages from the broker and partitions the stream by unique keys; in this example, the key is the user ID. For each user, the application keeps a running count of likes. When the user adds another like, it is added to the existing count. The previous value does not need to be read from a database: it is stored inside the application as State. Stateful computing provides fast processing in cases where previous values or states matter.

It is important that State is available only within its task. In our example, the total number of likes exists only within the like-counting task. This limitation must be taken into account when designing Apache Flink-based applications: if the state is needed in other tasks, you have to think about how to transfer it, for example as a new data stream or by saving it to a database.
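Here is a minimal sketch of such a like counter using Flink's keyed ValueState (Java DataStream API). The in-memory source and the user IDs are hypothetical stand-ins for the Kafka topic from the example:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class LikeCounter extends RichFlatMapFunction<String, Tuple2<String, Long>> {
    // Keyed state: one counter per user ID, kept inside the application.
    private transient ValueState<Long> likeCount;

    @Override
    public void open(Configuration parameters) {
        likeCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("like-count", Types.LONG));
    }

    @Override
    public void flatMap(String userId, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = likeCount.value();  // previous count; no database read needed
        long updated = (current == null) ? 1L : current + 1L;
        likeCount.update(updated);         // store the new count as State
        out.collect(Tuple2.of(userId, updated));
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hypothetical stream of "user liked" events keyed by user ID.
        env.fromElements("user-1", "user-2", "user-1")
           .keyBy(userId -> userId)        // state is scoped to each key
           .flatMap(new LikeCounter())
           .print();
        env.execute("like-counter");
    }
}
```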

Fig. 5 – State in Apache Flink (source)

In large, highly loaded systems, the state size can reach several terabytes. To support such systems, Apache Flink's architecture provides several storage options.

  1. In RAM. Suitable for applications whose state is limited to a few gigabytes.

  2. In the RocksDB database. Suitable for high-load applications. A dedicated component, the State Backend, is responsible for placing the state in the database. A configuration sketch is shown below.
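A minimal sketch of switching a job to the RocksDB state backend and enabling periodic checkpoints (assuming the flink-statebackend-rocksdb dependency is on the classpath; the interval value is illustrative):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep state on local disk via RocksDB instead of the JVM heap;
        // this is what makes multi-terabyte state feasible.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Checkpoint the state periodically so it can be restored after a failure.
        env.enableCheckpointing(60_000); // every 60 seconds

        // ... define sources, operators, and sinks here, then call env.execute() ...
    }
}
```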

Job Manager and Task Manager

Apache Flink is deployed using two components: Job Manager and Task Manager.

The Job Manager is the orchestrator: the central component of the Apache Flink architecture, responsible for coordinating the distributed execution of an application. The Job Manager performs many tasks:

  • plans the execution of tasks;

  • reacts to completed tasks and to errors during execution;

  • coordinates checkpoints, a mechanism for periodically backing up state in case of failure;

  • coordinates disaster recovery.

The Task Manager is the main worker process in the Apache Flink architecture; it is responsible for executing specific tasks.

Fig. 6 – Job Manager and Task Manager in Apache Flink

When deploying an Apache Flink application, keep in mind that the Job Manager is relatively lightweight and requires fewer resources than the Task Manager.

Comparison of Apache Spark and Apache Flink

The Apache Spark and Apache Flink frameworks are often compared and placed side by side. They are indeed similar:

  • can process large data streams;

  • allow you to use window functions and aggregate data;

  • allow you to implement complex logic: joining streams, filtering, grouping, etc.;

  • work with the Java, Scala, and Python programming languages;

  • support an at-least-once delivery guarantee;

  • can connect to message brokers, databases, and storage both as data sources and as sinks.

The frameworks also have significant differences:

  • The main difference is how information is processed. Flink processes data record by record, which allows data to be processed in real time with minimal delay. Spark processes data in micro-batches: groups of several records. This increases the performance of computations over the data within a batch.

  • Different architectures and deployment mechanisms. For example, Spark has no Task Manager and Job Manager; other components are used instead.

  • Spark has a higher entry threshold than Flink and is considered more difficult to learn.

The choice of framework depends on the need and task.

Spark is more often used at the junction of the backend and storage, such as a DWH. Typical tasks are applying machine learning models to a data stream or accumulating data in a storage for further analytics.

Flink is most often used at the backend level. Let's take a closer look at the main use cases of Apache Flink.

Main use cases of Apache Flink

There are three main cases in which Apache Flink can be successfully applied.

  1. Social networks and recommendation systems. Suppose an application generates a stream of user behavior data (clicks, likes, views, comments, etc.). Apache Flink can be used to count user actions in real time. The results can then be transferred to:

    1. databases, as a source for display on application pages;

    2. repositories such as a DWH, as sources for building recommendations of relevant products, content, or advertising.

  2. Fraud detection. Apache Flink lets you analyze user behavior and respond immediately to strange behavior that deviates from a given pattern.

  3. Anomaly detection. The framework can be used to monitor the state of the system (including IT infrastructure) and instantly notify about deviations of metrics from normal values.

Let's look at an example of how a fraud detection application based on Apache Flink can work.

Fig. 7 – Fraud detection with Apache Flink (source)

There are two data streams at the input of the application.

  1. A stream of user actions (Actions). These could be logging in and out of the system, adding items to the cart, payments, etc.

  2. A stream of predefined behavior patterns (Patterns). For example, a pattern may state that adding a product to the cart and immediately logging out can indicate fraud.

The application joins the two data streams (for example, using window functions) and waits for the user to perform certain actions. The State mechanism discussed earlier makes it possible to wait for a series of user actions.

If the behavior matches a pattern, the application sends a notification that the user is behaving strangely and needs attention.
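Here is a minimal sketch of such a detector for one hard-coded pattern, "add to cart followed immediately by logout". In the scheme above the patterns arrive as a second stream; that is simplified away here, and the "userId,action" event format is a hypothetical assumption:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: "userId,action" strings keyed by user ID; output: alert messages.
public class FraudDetector extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Boolean> addedToCart;

    @Override
    public void open(Configuration parameters) {
        addedToCart = getRuntimeContext().getState(
                new ValueStateDescriptor<>("added-to-cart", Types.BOOLEAN));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        String action = event.split(",")[1];
        if (action.equals("add_to_cart")) {
            addedToCart.update(true);   // remember the first step of the pattern
        } else if (action.equals("logout")) {
            if (Boolean.TRUE.equals(addedToCart.value())) {
                out.collect("Suspicious behavior for user " + ctx.getCurrentKey());
            }
            addedToCart.clear();        // reset the pattern state
        } else {
            addedToCart.clear();        // any other action breaks "immediately"
        }
    }
}
```

The function would be attached to a keyed stream, e.g. `actions.keyBy(e -> e.split(",")[0]).process(new FraudDetector())`.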

What should be considered when designing an Apache Flink-based application?

  1. Stream processing logic. What operations need to be performed on the data: filtering, grouping, business logic, etc.

  2. Are keys needed in the operators? If so, how many and which ones?

  3. Is State needed? If so, how large will it be? Can a Time To Live (TTL) be set for the State? (A TTL configuration sketch follows this list.)

  4. Should State be broadcast to other operators? If so, how?
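On point 3, here is a minimal sketch of configuring a TTL for State, reusing the hypothetical like-count descriptor from the earlier example and an illustrative one-day lifetime:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;

public class StateTtlExample {
    public static ValueStateDescriptor<Long> ttlDescriptor() {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.days(1))  // state expires one day after the last write
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("like-count", Types.LONG);
        descriptor.enableTimeToLive(ttlConfig);  // expired entries are cleaned up
        return descriptor;
    }
}
```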

Summary

  1. Streaming frameworks enable real-time streaming processing of big data.

  2. Apache Flink can process both bounded and unbounded data streams coming from message brokers, databases, and storage. Processing logic in Apache Flink is expressed as a directed acyclic graph. Apache Flink also allows you to implement stateful computing.

  3. When designing an Apache Flink-based application, the main focus should be on the logic of constructing a directed acyclic graph.

Additional sources

  1. Hueske F., Kalavri V. Stream Processing with Apache Flink. O'Reilly.

  2. Official documentation for Apache Flink.

The author of the article is Daria Kolesova, Head of the Data Quality Engineering Group.

Experience in roles: Big Data R&D developer, BI Engineer, Data Analyst, technical lead.

More than 5 years in Big Data development.

More than 3 years as a trainer.
