Behind beautiful and understandable Power BI dashboards is often weeks of preparation and data mining. Especially when it comes to building useful BI reporting in a large organization with a traffic volume of tens of millions of visitors every month.
In this article, I want to describe a number of negative aspects that I encountered when building BI reports based on data from web analytics systems in a number of companies (large representatives of Russian e-commerce, insurance companies, etc.). The article does not aim to make anti-advertising, or vice versa, the advertising of certain tools or decisions. It is prepared in order to help avoid possible negative moments to other users and indicate options for solutions.
I’m talking about large amounts of data and show examples of uploading and sampling from Google Analytics 360. On projects with a small amount of data, such difficulties may not exist. I met all the identified problems in practice and in the article I describe only my experience in solving – yours can be completely different.
Connector to Yandex.Metrica
Yandex.Metrica has milder sampling conditions and an intuitive interface compared to Google Analytics. Therefore, many marketers prefer Yandex.Metrica and build BI reporting on data uploading from there. For this use connector Maxim Uvarov. This method is unsafe and does not allow processing complex requests.
Large companies need a guarantee that confidential information will not fall into the hands of detractors, and financial indicators will be used only for stakeholders and exclusively within the company.
The first big minus is that in order to use the connector, in Power BI you need to enable the “Ignore privacy levels” setting, otherwise it just won’t work.
Thus, confidential data can fall into the hands of any unauthorized user. This is confirmed by an excerpt from Power BI Help.
Data connection is anonymous. At the same time, there is no authorization or check for permission to use this data for those users who are really allowed to.
For these connectors to work, you need to get an oauth access token. The description of the connector provides a link on how to give the application access to your account.
In fact, a third-party application will gain access to your account. Nobody bears a guarantee for the safety of your data.
The second minus is that Yandex.Metrica API cannot process large amounts of data on large projects, and as a result, the connector also refuses to work with complex queries on raw Yandex.Metrica data – it does not unload them.
For example, you need to upload data for the year, without any complex filters. The connector throws an error: “The request is too complex. Please reduce the date spacing or sampling. ”
Of course, this is a translation of the error of the Yandex.Metrica API itself. However, in this case, we have no choice to download data with a breakdown by month or day – for example, to cycle data for each month and combine them into a single dataset overcoming an API error.
If you significantly reduce the period, then the API allows you to pump data, but this is only a small part of what we need.
To build large reports, unloading for short periods is impractical and inconvenient. In practice, you often need unloading on complex open site funnels with a huge number of visits per day. Especially when using complex segments of visits, it is practically impossible to upload the necessary data.
Even if you were able to load the data, it’s not a fact that it will be possible to upload to the data model to build the reporting itself, because various loading errors pop up every now and then. For some reason, even with a normally loaded table, errors occur.
In order for all stakeholders within the company to be able to use the report built on the Yandex.Metrica API connector, you need to publish the report in PowerBI Service. In this case, privacy levels will also have to be ignored.
When the visualization is configured and the report is published in the Microsoft cloud, the connector will display a combined query error and will not be updated through the Power BI Service – you will not be able to configure the scheduled update of your report. An error occurs when one of the queries, which forms the final upload table, contains the logic for simultaneous work with external and internal queries.
In the error, the table “Function called” is just indicated:
Such a table is obtained due to the PQYM function, which works with an external data source – Yandex.Metrica API. The reporting update mechanism in Power BI does not work and asks to redo the combination of queries. This is due to the internal links of one request to connect to an external request, as well as due to the structure of the function itself.
We decided not to use such a connector when building reporting and switched to unloading data from the Yandex.Metrica API via Python scripts, more about this later in the article.
To build reporting on Google Analytics data (we are talking about direct downloading from the analytics system), there is also connector Maxim Uvarov.
Here you will also have to set “Ignoring privacy levels”. Access to the API of your analytics system, which contains data on online sales, takes place on an anonymous connection.
According to the instructions, you will again have to open access to a third-party application, only for a Google account, which contains a threat to the privacy of your data.
You can rewrite some data in the connector itself and start using it for your own application. To do this, register “OAuth 2.0 Client ID” at cloud.google.com, get access tokens and enter the necessary access keys into its mechanism of work. But even these actions will not help in the fight against sampling and updating your reporting.
Whenever you touch the aggregated data of the Google Analytics API, we will always encounter sampling.
The connector has the option of unloading one day, but even it does not always save. The sampling indicator built into the connector shows that it is there even in one day.
For example, Shopping Stage breakdown session data cannot be uploaded without sampling on such amounts of data. This will lead to large errors in data and measures when building reports, especially when it is necessary to accurately calculate the conversion of the transition from one step to another, or to upload data on funnels with a complex structure of segments.
Even if data loss is not critical for you, you still cannot configure the scheduled update of your reports through the Power BI Service.
The same error of the mechanism of work interferes, only the PQGA functions.
We also refused to work with this connector.
Google Analytics Native Connector
Sometimes Google and Microsoft make concessions to each other: Power BI out of the box has a standard Google Analytics connector. Unfortunately, there is no such option for Yandex.Metrica.
The mechanism for working with the Google Analytics API is more conveniently implemented here than in Maxim Uvarov’s connector, but while downloading a large number of indicators, the data is too sampled. The repeating numbers in the screenshot are the sampled data.
You can avoid sampling if you upload mini-tables broken down only by date and any one indicator. For example, date, number of sessions and device type. And then from these tables to collect the necessary reporting.
If you do not have millions of traffic, this may help. However, reporting will not be informative and with limitations on scalability: because of the many mini-tables in the data model, it is difficult to create relationships between them. This imposes a restriction on the creation of filters and slices.
You can rewrite the standard connector and force it to upload data for every day, while the data will be close to the real numbers in the GA interface, but you will not be able to avoid sampling.
According to my observations, in the breakdown of Shopping Stage sessions, not a single connector has yet been able to normally upload data without sampling on large amounts of data.
In the standard connector, you cannot customize the download dates. All that is is simply a selection of parameters and metrics from the Google Analytics APIs arranged in folders.
This connector is suitable for small and medium-sized Internet projects with little traffic. Here a similar situation arises with obtaining data from Yandex.Metrica. All the necessary information without loss of accuracy can only be downloaded directly through the API using scripts or special data streaming services – more on this later in the article.
Sticks in Wheels by Google
Recently, we noticed a strange lock – when trying to configure the update for the “native” connector to Google Analytics, an authorization error appeared through a Google account in the Power BI Service interface.
We tried to find out from Google what to do in such a situation and how to “whitewash” the account’s reputation, but our attempts failed. We tried to register new accounts and log in through different devices – Google blocked any authorization attempts.
About a month after the blocking, we still managed to log in through the same account, but such an incident can greatly interfere with the release of the necessary BI-reporting on time and put the customer in an awkward position. Immediately after the blockage, we began to look for a possible way out of the situation. To avoid unexpected locks, we decided to create our own controlled environment with the necessary data for BI reporting.
Get the necessary BI-reporting for medium and large organizations, and even without sampling is possible. But you need to try a little and choose one of several ways.
You can use ready-made services for streaming data. Their main purpose is to upload data from the analytics system, various advertising systems, CRM and other services to any storage. Convenient, fast and practical, but not free. On large volumes of data or data with multiple sources, the amounts are tangible.
The most famous services:
Each of these systems allows you to configure daily streaming of non-sampled data from web analytics systems (Yandex.Metrics and Google Analytics) to databases (for example, BigQuery, Azure SQL Database, Microsoft SQL Server or ClickHouse) for connecting via Power BI and creating reports.
If the above tools cost the company too much, you can try to implement your data upload system and use Power BI + Python + Any Cloud Server (or Yandex.Cloud) + PostgreSQL. This is the method we use when building BI reporting.
System Interaction Scheme:
This work scheme provides the necessary safety, autonomy and aggregation of data. Having established such a scheme once, you will not need to waste time collecting information – everything is in the repository, you just have to connect and start making a report on Power BI.
Python scripts “pull” all the API data from the necessary sources, write to the database, or form uploads (for example, in csv format). Scripts for uploading from the API and loading into the database deserve a separate article, and someday I will write it.
Of course, there are a lot of templates and ready-made solutions on the Internet, but for further scaling it is necessary to know the individual needs of the system under which the scripts for unloading are created.
Data is written to a pre-created and configured database – in our case, PostgreSQL. Scripts pump out data for each day and write them to tables according to a schedule. They are located on a cloud server under our control or under the control of the client’s IT-security service.
There are a large number of companies that provide cloud storage services. Based on personal preferences, it was chosen Yandex Cloud.
Advantages of Yandex.Cloud:
- Convenient configuration and server rental with preinstalled PostgreSQL.
- Extensive documentation in Russian, which is clearly stated and allows you to quickly learn how to use the service.
- Convenient server management.
- Fast support of specialists.
- Flexibility of various equipment settings and tariffs: no one imposes any ready-made configuration packages, you choose the configuration of your equipment yourself, plus you can immediately order the preset of any database.
The resulting uploads are written to the PostgreSQL database. This database was chosen because it is an object-relational DBMS (working with multidimensional arrays and JSON support from the “box” – for complex data structures must have). It is flexible, reliable, free, and it also has a huge community of support.
Power BI has a built-in direct connector in PostgreSQL. The first time you connect, you need to install additionally Npgsql. Power BI notifies you of this and provides a link.
To configure update reports when using cloud storage, you must configure the gateway Power BI Gateway. It is needed to configure updating BI reporting and ensuring the security of confidential data when transferring it from a database, which should be located in the internal IT circuit of your organization, or on a secure cloud server.
The gateway provides fast and secure data transfer between data warehouses located on the organization’s internal network and Microsoft cloud services. The data transfer between Power BI and the gateway is secure, and all credentials provided by the gateway administrators are encrypted to protect information in the cloud and are decrypted only on the gateway computer.
After download Power BI Gateway and its launch a dialog box appears.
To register and operate the gateway, the credentials of the Power BI Service cloud service are required.
After all the settings, a window about the working gateway will appear. Next you need to spend additional settings.
A simpler alternative to customizing PostgreSQL is Google BigQuery. Power BI has a built-in connector in Google BigQuery. In this case, it is necessary to follow the principle of “Seven times to calculate the cost, once to fulfill the request.”
BigQuery can be used for free for a year, simply by tying a bank card upon agreement to use the service. Money after the expiration of the period without your knowledge will not be debited. Further tariffing – according to the tariffs of the system itself, but with a few pleasant “pluses”.
If the downloaded data from your systems does not reach 10 GB per month (!), There will be no charge for downloading.
If data processing per month does not exceed 1 TB, no fee will be charged for this either. If more, then $ 5 for each treatment 1 TB.
If you need to store more than 10 GB per month, each subsequent gigabyte will cost $ 0.010.
The tariffs are described in more detail at BigQuery page.
If you need data from various analytics systems, and at the same time, safety, security, continuous aggregation, usability and lack of sampling are fundamental to you, you have two ways.
The first is streaming services. They are easy to use, they have ongoing technical support and ready-made integration with data warehouses.
- Tangible costs on large amounts of data.
- Limited type of integration provided.
- Uncontrolled processes on the side of the streaming service provider to work with your information.
The second is its own data upload and aggregation stream. Scalable, controlled, with data security. It may seem complicated and time consuming to implement. If a company needs an effective organization of BI reporting, its further scalability and data security, then this option is the most acceptable.
All those who are just thinking about full-fledged BI reporting need to add the motto “Accumulate, structure and protect these data from youth” into the corporate culture. There is nothing better than a full-fledged fault-tolerant database with regular backups and with differentiated rights to use it for different categories of stakeholders.
It is necessary to accumulate, structure and aggregate your data from analytics systems, and from a mobile application, and from other client services. All this in the future will help to analyze both your audience and products in any sections. Information and data now rule the world and sometimes cost more than money.
About how to implement the approach described in the article for transferring analytics data to a PostgreSQL database using Python scripts and implementation details, I will discuss in the following articles.