Spark application code management

There are many approaches to building application code to keep project complexity from growing over time. For example, the object-oriented approach and a lot of attached patterns allow, if not keeping the complexity of the project at the same level, then at least keeping it under control during development, and making the code available to the new programmer in the team.

How can you manage the complexity of an ETL transformation project on Spark?

It’s not that simple.

What does it look like in real life? The customer offers to create an application that collects a storefront. It seems to be necessary to execute the code through Spark SQL and save the result. During development, it turns out that building this mart requires 20 data sources, of which 15 are similar, the rest are not. These sources must be combined. Then it turns out that for half of them, you need to write your own assembly, cleaning, and normalization procedures.

And a simple showcase after a detailed description starts to look something like this:

As a result, a simple project that had only to run a SQL script that collects a showcase on Spark acquires its own configurator, a block for reading a large number of configuration files, its own mapping branch, translators of some special rules, etc.

By the middle of the project, it turns out that only the author can support the resulting code. And he spends most of his time in thought. Meanwhile, the customer asks to collect a couple more showcases, again based on hundreds of sources. At the same time, it should be remembered that Spark is generally not very suitable for creating your own frameworks.

For example, Spark is designed to make the code look something like this (pseudocode):

park.sql(“select table1.field1 from table1, table2 where =”).write(...pathToDestTable)

Instead, you have to do something like this:

var Source1 = readSourceProps(“source1”)
var sql = readSQL(“destTable”)
writeSparkData(source1, sql)

That is, to move blocks of code into separate procedures and try to write something of your own, universal, which can be customized by settings.

At the same time, the complexity of the project remains at the same level, of course, but only for the author of the project, and only for a short time. Any invited programmer will take a long time to master, and the main thing is that it will not work to attract people who know only SQL to the project.

This is unfortunate, because Spark itself is a great way to develop ETL applications for those who only know SQL.

And in the course of the development of the project, it turned out that a simple thing was turned into a complex one.
Now imagine a real project where there are dozens, or even hundreds, of such storefronts as in the picture, and they use different technologies, for example, some can be based on parsing XML data, and some on streaming data.

I’d like to somehow keep the complexity of the project at an acceptable level. How can this be done?
The solution may be to use a tool and a low-code approach, when a development environment decides for you, which takes all the complexity, offering some convenient approach, such as described in this article.

This article describes the approaches and benefits of using the tool to solve these kinds of problems. In particular, Neoflex offers its own solution Neoflex Datagram, which is successfully used by different customers.

But it is not always possible to use such an application.

What to do?

In this case, we use an approach that is conventionally called Orc – Object Spark, or Orka, as you like.

The initial data are as follows:

There is a customer who provides a workplace with a standard set of tools, namely: Hue for developing Python or Scala code, Hue editors for SQL debugging via Hive or Impala, and Oozie workflow Editor. This is not much, but quite enough for solving problems. It is impossible to add something to the environment, it is impossible to install any new tools, for various reasons.

So how do you develop ETL applications, which, as usual, will grow into a large project, in which hundreds of data source tables and dozens of target marts will participate, without drowning in complexity and not writing too much?

A number of provisions are used to solve the problem. They are not their own invention, but are entirely based on the architecture of Spark itself.

  1. All complex joins, calculations and transformations are done through Spark SQL. Spark SQL optimizer improves with every version and works very well. Therefore, we give all the work of calculating Spark SQL to the optimizer. That is, our code relies on the SQL chaining, where step 1 prepares data, step 2 joinite, step 3 calculates and so on.
  2. All intermediate calculations are stored as temporary tables in the Spark catalog, making them available in Spark SQL. Any intermediate data source (DataFrame) can be saved in the current session and further accessed through Spark SQL.
  3. Since the Spark application runs as a Directed Acicled Graph, that is, it runs from top to bottom without any loops, then any dataset saved as a temporary table, for example, in step 2, is available at any stage after step 2.
  4. All operations in Spark are lazy, that is, data is loaded only when it is needed, so registering datasets as temporary tables does not affect performance.

As a result, the entire application can be made very simple.

It is enough to make a configuration file in which to define a single-level list of data sources. This sequential list of data sources is the object that describes the logic of the entire application.

Each data source contains a link to SQL. In SQL, for the current source, you can use a source that is not in Hive, but described in the configuration file above the current one.

For example, source 2, when translated into Spark code, looks something like this (pseudocode):

var df = spark.sql(“select * from t1”);

And source 3 may already look like this:

var df = spark.sql(“select count(*) from source2”)

That is, source 3 sees everything that was calculated before it.

And for those sources that are target showcases, you must specify the parameters for saving this target showcase.

As a result, the application configuration file looks like this:

[{name: “source1”, sql: “select * from t1”},
{name: “source2”, sql: “select count(*) from source1”},
{name: “targetShowCase1”,  sql: “...”, target: True, format: “PARQET”, path: “...”}]

And the application code looks something like this:

List = readCfg(...)
For each source in List:
 df = spark.sql(source.sql).saveAsTempTable(
 If( == true) {
    df.write(“format”, source.format).save(source.path)

This is, in fact, the whole application. Nothing else is required except for one moment.

How to debug all this?

After all, the code itself in this case is very simple, what is there to debug, but the logic of what is being done would be nice to check. Debugging is very simple – you have to go through all applications to the source being checked. To do this, add a parameter to Oozie workflow that allows you to stop the application at the required data source by printing its schema and contents to the log.

We called this approach Object Spark in the sense that all application logic is decoupled from Spark code and stored in a single, fairly simple configuration file, which is the application description object.

The code remains simple, and once created, even complex storefronts can be developed using programmers who only know SQL.

The development process is very simple. At the beginning, an experienced Spark programmer is involved, who creates universal code, and then the application configuration file is edited by adding new sources there.

What this approach gives:

  1. You can involve SQL programmers in the development;
  2. Given the parameter in Oozie, debugging such an application becomes easy and simple. This is debugging any intermediate step. The application will work everything to the desired source, calculate it and stop;
  3. As the customer’s requirements are added, the code of the application itself changes little (well, of course … well, okay), but all its requirements, all the logic for assembling storefronts, remain not in the code, but in the configuration file that controls the application. This file is the object description of the storefront assembly, the same Object Spark;
  4. This approach has all the flexibility available in a programming language that is excluded by using a tool. If necessary, a simple and universal code can be added. For example, you can add support for streaming, calling models, support for parsing XML or JSON nested in the fields of source tables. Given that such an application can be quickly written from scratch, depending on the customer’s requirements;
  5. The main consequence of this approach. As the project becomes more complex, all storefront assembly logic is not smeared across the application code, from where it is rather difficult to extract, but is saved in a configuration file, separate from the code, where it is available for analysis.

Similar Posts

Leave a Reply