What you need to implement Apache Superset

Apache Superset is an open source product for building business analytics that is actively gaining popularity.

Much has already been written about its deployment and its technical features, so I would like to talk about the experience of implementing Superset from an organizational point of view, and I will also try to describe the technical environment and important differences from Power BI. I will specifically focus on the access rights system, because… In our company this issue is important.

First, I’ll tell you about the distribution of BI responsibilities that we use.

There are 3 main roles involved in the process:

  1. IT – the main IT function is to support an analytical data warehouse built on the basis of a distributed Postgres database. IT solves tasks (requests) received from business to load new data into the storage. The main focus is data quality and solution reliability, as well as creating an architecture that provides maximum performance. It is also the responsibility of IT to follow naming conventions and use precise terms when naming fields and tables.

  2. Analysts – employees from business units. Actually, those people who make dashboards. They have a good understanding of the challenges and needs of their end users and have the skills to present and communicate information based on numbers. When using Superset, it becomes necessary for analysts to have basic to intermediate level knowledge of SQL. If analysts need some new data in the warehouse, then they request the integration of this data into IT.

  3. End business users – their task is simply to log into the system and use dashboards in their processes. They contact analysts when new requests arise.

Next, I will describe what I had to do to implement the described interaction scheme using Superset:

  1. Make data storage fast. Analytical storage must be fast, otherwise it will be practically impossible to use Superset. I think caching dashboards provided in superset is not the best way due to the variability of queries with different filters.
    In our case, it follows that the data must be preprocessed on the “master nodes” in such a way as to minimize the load on the main storage. Standard logical replication in Postgres is used to transfer data to the main storage.

    Our data is not so large in volume as it is in great variety and is frequently updated. For this reason, we use a traditional database rather than a columnar one. (Note also that Postgres has a Citus extension that can be used for columnar or mixed storage to speed up aggregate queries on large tables.)

  2. Set up a system of access rights in the data storage. As we understand, data can be confidential, so we create access groups for analysts and define their rights in the database.
    We needed to configure ldap in Postgres so as not to have to think about passwords, deleting employees who quit, etc.
    For each group of analysts (department), we create a diagram in which we give full rights to create their own views.
    The latter must be done, because if you write a query in Superet itself (virtual), then it works with Postgres much slower than accessing a view (like physical). This is easy to verify, for example, by checking the queries in pg_stat_statements. In the “physical” mode, the Superset system builds quite “adequate” queries to the database.

  3. Train analysts. The problem is that not all analysts know SQL, although many of them learned SQL at the institute or in various additional courses. To improve our knowledge, we had to record a short course that included: a little DBMS theory, the structure of our storage, training on how to write SELECT and create VIEW, as well as how to build dashboards in a superset. (Of course, IT supports analysts when writing complex queries and optimizing slow queries.) In general, “upgrading” such knowledge is interesting for analysts, because they have a desire to be in demand specialists.

  4. Organize a system of access rights in Superset itself.

    • First, let's talk about user authentication. To “log in” users, we use a Keycloak server integrated with Superset. Keycloak forwards requests to Active Directory (AD) and reports user data to Superset, including access groups from AD. Next, Superset, through configured mapping, automatically assigns the user to his role. The user can be authorized as a dashboard editor (analyst) or as an end user with specific roles to view dashboards.

    • In order to limit the access of editors (analysts), you need to grant rights to connect to the database (database connection). Those. Essentially, this is a connection to a database with a specific system login, for which the necessary rights are set. (Alternatively, you can restrict access at the schema level.)

    • In order to differentiate the access of end users, you need to include the necessary datasets in their roles (this is done by IT). An unusual feature of Superset is that rights to dashboards are assigned through datasets. Those. the user sees a list of dashboards in accordance with the availability of the datasets used in them. (This approach takes some getting used to).

General architecture of the access rights system

General architecture of the access rights system

  1. Make various minor additional settings in superset

    • Currency codes, number formats, default time zone.

    • Set up a mailbox and system for sending out dashboards. Some users like to receive charts by email. This functionality in Superset is quite functional, but it requires some tinkering with the configuration.

A quick comparison with Power BI

We can talk about many differences between both systems, but I want to focus on what seems to me to be key when switching to Superset:

  1. The Superset system requires database performance. Power BI can make a data model “offline” and update a specific dashboard on a schedule. In Superset, we can say that there is no such possibility (caching partly helps, but the performance of the database is still required).

  2. Superset cannot use Excel as data source. Of course, Excel is obviously evil, but this idea is not easy to convey to users. When using Superset, you will first have to load Excel into the database or find the necessary data in a more reliable source.

  3. Dashboard design in Superset takes place in the browser (online). Power BI requires installation of a client application and often requires uploading data to the client. In my opinion, the Superset approach looks more modern and allows you to work with large volumes directly on the server.

  4. Superset does not have a folder system for organizing dashboards. It should be noted that at the time of writing this article, a tag system is already being developed in Superset, as some kind of analogue of folders. Now we are removing unnecessary dashboards for the user using access rights.

  5. Superset does not have a mobile app. As a consolation, you can only offer to send dashboards to your mailbox.

Additionally, I note that users like the ergonomics and design of Superset. I think this factor helps a lot in implementation.

I hope I was able to talk about what is around the Superset system and share useful experience.

The article turned out to be more of a review. If anyone is interested in technical nuances, I will try to answer in chat or in private messages. I would be glad if community members share a couple of interesting ideas or alternative opinions on this topic.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *