In the fall of 2021, I thought about free analytics and reporting tools available to ordinary users. You can use Power BI or Tableau in one form or another, but why not try something simpler?
A small disclaimer: the dataset, which will be discussed below, was uploaded in the fall of 2021. Now the dataset is different, perhaps cleaner. I considered it irrational to upload new data, since a series of posts will be about simple visualizations, and not about current research or complex diagrams. And no, this is not a detailed manual on the capabilities of GDS, this is only a general overview of the solution and analysis of one case.
We are only interested in the perspective of an ordinary analyst, as far as that is possible (and as far as I can picture it), so I will look for the simplest ways to solve the problem. I understand that some approaches, like using an intermediate database, may not seem easy to everyone, but you could just as well use Google Sheets. I simply had a database at hand, and a full-fledged ETL process would not work without one.
Working with the database
ETL process and data loading
To simplify, let’s consider our ETL process as loading data into storage, bringing it to the data type we need, and uploading it to our Google Data Studio analytics system.
To keep things close to home, I decided to take a dataset from data.mos.ru. It has a “Data” section with a “Roads and Transport” category, which contains the dataset “Issued permits … to taxi drivers” (the full name is very long and did not want to be copied). Why this option? I love cars, and it would be interesting to know which car brands are most popular among taxis. Incidentally, in the spring of 2022 this study may be quite relevant due to the rising cost of spare parts.
The first problem we face is that there is no way to load data into the Studio directly; some kind of intermediate storage is required. The restriction is logical, and there are two ways around it: use either some kind of database or Google Sheets. I already mentioned that the second option seems cumbersome to me, so I will upload the CSV to my own database. Nothing interesting there: a plain MySQL instance, none of its parameters matter here. True, filling it with a 50 MB CSV turned out to be not as easy as I expected, but then again, it has been a while since I did this kind of thing by hand.
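For reference, instead of pasting rows through phpMyAdmin, the whole CSV can be loaded in one statement. This is only a sketch: the file name is made up, the taxidata table must already exist, and the delimiter and encoding must match the actual export (data.mos.ru exports typically use semicolons):

```sql
-- Bulk-load the export into MySQL in one pass (hypothetical file name;
-- the column order must match the CSV header).
LOAD DATA LOCAL INFILE 'taxi_permits.csv'
INTO TABLE taxidata
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

Note that LOCAL loading must be enabled on both the client and the server (the local_infile setting), otherwise MySQL will reject the statement.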
The moment I tried to upload the first couple of hundred rows through the phpMyAdmin interface, I realized that nothing here would be simple. Who fills in these documents, and how, is a mystery, but a decent number of artifacts turned up even without digging deep into the data. That gave me an idea for a comparison: how much data do we lose if we cut off records quite roughly, without trying to bring them into a sane form, versus how the data changes if we reduce it to a common denominator. Let’s go!
The Status field. In the source data, in addition to the 5 statuses (Cancelled, Expired, Active, Terminated and Suspended), there are some dates. Everything is simple here: we keep only the rows whose status begins with a letter rather than a digit. Optionally, we could check just the first character for being a digit, but at this stage I see no need for that. UPPER() lets us collapse the different ways the statuses were spelled.
SELECT DISTINCT UPPER(status) FROM taxidata WHERE status REGEXP '^[[:alpha:]]';
There is also a lot of garbage in the VehicleBrand field. It falls into several categories: typos and erroneously entered data. Typos can only be caught with some kind of mapping, and I did not figure out how to automate that. So for now we will simply handle the cases where the model is written in the brand field as well: we cut the string at the first space and use only the part before it.
SELECT SUBSTRING_INDEX(VehicleBrand, ' ', 1), COUNT(VehicleBrand) FROM taxidata GROUP BY SUBSTRING_INDEX(VehicleBrand, ' ', 1);
This reduced the number of distinct brand values from 531 to 355.
It is immediately clear that there are records of the form “MarkName-”, i.e. with an extra hyphen at the end. We will strip it with another SUBSTRING_INDEX.
CREATE VIEW taxidataclean AS
SELECT LicenseNumber, global_id, VehicleNumber, INN,
       SUBSTRING_INDEX(VehicleBrand, ' ', 1) AS VehicleBrand,
       OGRN, BlankNumber, VehicleYear, FullName, ShortName,
       LicenseDate, LicenseEndDate, ValidityDate, EditDate, Info, Status, VehicleModel
FROM taxidata
WHERE status REGEXP '^[[:alpha:]]';
After all the transformations, the number of unique car brands decreased to 310
CREATE OR REPLACE VIEW taxidataclean AS
SELECT LicenseNumber, global_id, VehicleNumber, INN,
       SUBSTRING_INDEX(SUBSTRING_INDEX(VehicleBrand, ' ', 1), '-', 1) AS VehicleBrand,
       OGRN, BlankNumber, VehicleYear, FullName, ShortName,
       LicenseDate, LicenseEndDate, ValidityDate, EditDate, Info, Status, VehicleModel
FROM taxidata
WHERE status REGEXP '^[[:alpha:]]';
Now it’s time to calculate how much data we lost and what we gained in return. The VehicleBrand transformations did not reduce the number of rows in the table at all; we only brought the data to a common denominator. But instead of 531 unique values we got down to 310! That is already good progress, and later we will think a little about what else can be done here.
By adding the condition on the Status field, we got 176,885 rows instead of 177,425. That is, only 540 rows had an unreadable status, and we decided by default that we did not need them. This is about 0.3% of the total number of records.
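The same check can be done in one query, without creating the view first: in MySQL the REGEXP comparison yields 1 or 0, so it can be summed directly.

```sql
-- How many rows survive the status filter, and how many are dropped.
SELECT
    COUNT(*)                                     AS total_rows,
    SUM(status REGEXP '^[[:alpha:]]')            AS kept_rows,
    COUNT(*) - SUM(status REGEXP '^[[:alpha:]]') AS dropped_rows
FROM taxidata;
```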
A little more transformation
The next problem is that we have a huge number of records (some fleets submitted data in an extremely unreadable form) with an incorrectly written brand. This cannot be fixed automatically, only by hand. Therefore we will create a mappingVehicleBrand table that maps each clumsy name to its normal form.
Once the mapping is compiled, we load it into the database table of the same name and try to assemble a final data mart.
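The mapping table itself is trivial. A possible shape, with column names matching the join that follows (the sample rows are purely illustrative, not the real mapping):

```sql
-- Raw brand value as found in the source -> canonical brand name.
CREATE TABLE mappingVehicleBrand (
    VehicleBrand    VARCHAR(100) NOT NULL,
    VehicleBrandNew VARCHAR(100) NOT NULL
);

-- Illustrative rows only:
INSERT INTO mappingVehicleBrand (VehicleBrand, VehicleBrandNew) VALUES
    ('MERCEDES',   'Mercedes-Benz'),
    ('MERSEDES',   'Mercedes-Benz'),
    ('WOLKSWAGEN', 'Volkswagen');
```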
CREATE TABLE TaxiDataClean1 AS
SELECT taxidata.LicenseNumber, taxidata.global_id, taxidata.VehicleNumber, taxidata.INN,
       taxidata.OGRN, taxidata.BlankNumber, taxidata.VehicleYear, taxidata.FullName,
       taxidata.ShortName, taxidata.LicenseDate, taxidata.LicenseEndDate, taxidata.ValidityDate,
       taxidata.EditDate, taxidata.Info, taxidata.Status, taxidata.VehicleModel,
       TRIM(mappingVehicleBrand.VehicleBrandNew) AS VehicleBrand
FROM taxidata
LEFT JOIN mappingVehicleBrand
       ON TRIM(mappingVehicleBrand.VehicleBrand) = SUBSTRING_INDEX(SUBSTRING_INDEX(taxidata.VehicleBrand, ' ', 1), '-', 1)
WHERE taxidata.status REGEXP '^[[:alpha:]]';
Results of working with the database
As a result, we reduced the number of brands from 310 to 92! This is already quite workable, although some artifacts remain (almost all of them end up in the “empty” VehicleBrand).
Now it’s time to explain why all this trouble with the database was needed. Uploading Excel files into a BI system is a reality of our days, but definitely not the target picture. First, flat files are inconvenient because they have to be reloaded every time they change. With Google Docs you can at least achieve a single file location; with files loaded from a user’s workstation, everything can break quickly and easily simply because someone copied, renamed or moved the file.
Secondly, working with a database is much more convenient because, as we found out, we can build a data mart on top of it that can be loaded into a report as a plain SELECT * FROM query, without bothering with filtering on the BI side. Yes, in the end we made a table (due to some limitations of my database server), but we could also have made a view, whose data would update automatically whenever the source tables change. In short, from an architectural point of view such a solution looks more coherent and correct.
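For completeness, the view variant could look like this. A sketch over the same tables, with the column list shortened for brevity:

```sql
-- Same data mart as a view: stays in sync with taxidata automatically.
CREATE OR REPLACE VIEW TaxiDataMart AS
SELECT t.global_id,
       t.VehicleYear,
       t.Status,
       TRIM(m.VehicleBrandNew) AS VehicleBrand
FROM taxidata AS t
LEFT JOIN mappingVehicleBrand AS m
       ON TRIM(m.VehicleBrand) = SUBSTRING_INDEX(SUBSTRING_INDEX(t.VehicleBrand, ' ', 1), '-', 1)
WHERE t.Status REGEXP '^[[:alpha:]]';
```

The trade-off: a view recomputes the join on every query, while a materialized table is faster to read but has to be rebuilt when the source changes.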
Creating a report in Google Data Studio
Now that we have a database with more or less cleared data, we can start creating a report in our analytical system.
Create a new report, select a data source
I think there will be no problems with creating a new report and selecting a source: the interface here is very simple and clearly geared toward self-service development, i.e. building reports without any IT specialist involved. Of course, that only holds when the reports are built on Google’s own sources.
There is a great variety of connectors for different sources. Some were made by Google itself, some by third-party developers. Among the popular ones: all the widespread relational DBMSs, Instagram, all the services of the parent company, and a whole scattering of specialized SEO and CRM systems. But you should not count on anything truly heavyweight: SAP and big data systems are not on the list. Judging by the speed of operation, even a couple of million rows would be quite a problem for this thing to digest.
I will not describe the connection itself – everything is standard: host, port, database name, login and password. Need I remind you not to connect as root, even if you don’t see the difference between root and anything else?
A new sheet is like a new page
Any classic report is organized into sheets. That is, it is not a conditionally endless web page that you can scroll down without ever finding the end, but a canvas the size of one screen.
This canvas can be styled centrally through templates, whose menu is accessible via the “File” button. The standard ones are enough for most tasks, and each element can additionally be styled manually. Of course, the templates themselves can also be edited.
Any report consists of objects. We need some for data visualization (graphs, tables, indicators), others for UX or design.
Beauty is delivered by the standard objects familiar from Microsoft Office products: images, lines, shapes, text. I doubt anything truly beautiful can be done with these standard tools, but without extensions no other BI solution offers much more variety either.
UX is handled by all sorts of filters, date ranges, data controls and so on – the standard functionality of any system for working with data.
For each object you can specify the data source it will work with. For example, we want to create a filter – nothing could be easier! Select the source, specify the field the filter will work on – and voila!
Let’s pay attention to a strange thing: where does the number next to each option in the filter come from? In fact, it is an aggregate over the field we used for filtering. In the final version it looks like this:
Quite convenient: you can immediately see the number of records for each option, which removes the need to build a separate table for that field.
In some systems the concepts of a flat table and a pivot table are kept separate, sometimes so strictly that a calculated indicator cannot be added to the flat table at all. Google chose a different principle of division: in any table you can calculate whatever you want, however you want, but as alternative table types you get a variant with built-in bar charts and a heat map! Data journalism has long had a trend of using heat maps everywhere and anywhere, although the mechanism is still not that popular in business reporting.
Working with tables and any other charts is the same: choose the dimension that will serve as the X axis and the indicator (one or more) to plot along the Y axis. If you suddenly need a horizontal bar chart – that can be built too. Want to change a chart’s colors – please. There are a lot of nuances, and it would be worth writing a proper tutorial only if there were demand from those who struggle 🙂
Working with indicators
The calculated indicator is what we create reporting for in the first place. As in programming, the set of functions for working with indicators is usually limited and roughly divided into data transformation and aggregation. Data transformations include type casting, working with dates, extracting individual values from a long string, and so on. Aggregation most often comes down to computing a sum or a count.
Google Data Studio lets you work with the methods standard in programming and SQL: count the number of values – count, sum them – sum, take an arc tangent – atan. You can learn more about the various functions in the help and on the community forum, but it is always easier to start from the problem you want to solve.
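For instance, a simple calculated field that flags whether a permit is currently valid might look like this in the formula editor (the field name is taken from our data mart; this is a sketch, not the only way to write it):

```
CASE
  WHEN UPPER(Status) = "ACTIVE" THEN "valid"
  ELSE "not valid"
END
```

Fields like this can then be used as a dimension in any chart, just like a column that came from the source.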
There is such a thing as ad-hoc indicators, i.e. indicators displayed at the user’s request. For example, a table of aggregates shows the amount of sales by product: bread was bought yesterday for 1,560 rubles, milk for 4,500. But an analyst wants to know how many purchases of milk there were, not just the sales amount – they need the count indicator. We can add it to the area of additional indicators:
Then a new “additional indicators” icon appears on our visualization, which on click pops up a context menu with a choice:
On two sheets I built the same diagrams, from which you can, firstly, determine the number of cars of each kind in the taxi fleet, and secondly, estimate the number of erroneous records that would have cost us a fairly large layer of data. If you go through the more interesting records, you can find real artifacts. For example, how do you like this taxi:
This was a very real case: I found a record about a Lamborghini in the data, and with the help of a license-plate lookup service I was able to find a photo of the car that, according to the documents, is registered as a taxi.
It is hard to call Google Data Studio a tool ready for working with big data: even with such simple datasets and visualizations, it regularly ran into an error that could only be cured by refreshing the page. On the other hand, hypotheses on small slices can be tested quickly and easily, especially if the data is already in a database or can be uploaded to a Google doc.
Now I have an idea to compare the mechanisms of Google Data Studio with Yandex Datalens, which I see as somewhat more relevant in our difficult times. Let’s see what happens.