Using Raw Data in Google Analytics in practice

We have long believed that standard Google Analytics tools are the best way to get useful information. At times, I had to deal with some limitations and rather strange results, and it seemed like there was no end to this, until some analysts discovered the Google Analytics 360 and the mechanisms for exporting raw data to Google BigQuery.

After only a few hours of working with more advanced tools and SQL queries, we were able to extract information that we would never have received using only aggregate reports from Google Analytics. From that moment, we focused on studying the features of raw (“raw”) data and on what practical benefits web experts can get from them.

This article answers the following questions:

  • What is the difference between raw and summary data?
  • What are the benefits of using raw data?
  • How to access raw data?
  • How to use this new data (practical examples)?

What is the difference between raw and summary data in Google Analytics

Using the free version of Google Analytics, you can only get summary data. That is, full viewing information for a particular visit and event will not be available. Of course, the User Explorer report contains a lot of useful information that web analytics can use. However, certain restrictions are imposed on this file: it does not scale and is not available for download.

In most cases, having only summary data is enough to get answers to common questions:

  • Which campaign brings the most conversions?
  • How common is Feature X’s machine learning technology (is its user base growing)?
  • Where do visitors come from (main traffic sources)?
  • What devices do users go to from?


You do not need to analyze the raw data to get the answers to the questions above. Customized or default reporting tools in Google Analytics can easily do this. The main problem with summary data is that it is aggregated. You map many types of user actions that can be used to hide very useful information. We give an example with views of several pages in one session. We have two sources with six sessions with the following number of pages viewed in one session:

  • Source A: 1, 1, 2, 2, 2, 10;
  • Source B: 2, 2, 3, 3, 4, 4.

Using the statistical outlier method, it can be determined that source A characterizes a user with a lower degree of involvement. But if we take into account only the averaged data, we can conclude about the same involvement for both sources, since the number of pages is the same (the median can be any).

Why don’t standard analytics reports contain this data? The main reason is the computational cost. Providing only selective summary data, there is no need to process the millions of rows contained in the report. Therefore, the free version of Google Analytics does not contain tools for performing advanced calculations on a free basis.

What information can be extracted from raw data

Understanding the limitations of summary data, it’s time to find out how raw data can be used. Consider several options for their application.

Event Duration

One of the limitations that any web analyst using standard Google Analytics tools faces is that you cannot determine the time interval between placing an item in the cart and making a purchase, regardless of whether this action occurs in a single session or not. Of course, for this you can use cookies and make your own calculations. But this is pointless, especially since Google Analytics has already done this work for us.

After analyzing the raw data, you can easily determine the exact time the event occurred for a particular user, conduct a comparative analysis with another event for the same user. You can also aggregate data at your discretion, having obtained an average, median or percentile distribution, or use another advanced statistical model. Does the fact that 20% of users perform the target action in 2 minutes, and 10% – during the whole week, does not matter? With this knowledge, you will use various approaches to interact with these two categories of users.

Audience size analysis

Both Google Analytics and Google Analytics 360 reports contain visitor segmentation data over the past 90 days. Often, to obtain reliable data, analysis is required at longer distances (especially for large companies). Based on the analysis of the raw data, you can get answers to the following questions:

  1. Is it more likely that users who are involved in the holiday season will purchase the product in September than the likelihood of making a purchase by other categories of users in the same month?
  2. What is the effect of watching videos throughout the year and how does this affect the number of conversions?

Using raw data, you can store event logs for unlimited time and delete them only when this data is already hopelessly out of date.

Relationships Between Data

A correlation coefficient is introduced to determine the statistical relationship between two variable values. When analyzing large amounts of data, you can determine the relationship between two types of user behavior:

  • How do pageviews affect the performance of a targeted action?
  • Is there a connection between the type of content consumed and the product that the user ultimately acquires?
  • Are there related products? For example, if someone buys product A, which product category is associated with it?

Third Party Data

Last but not least, raw data allows you to get much more information if you connect other data sources. The following are some illustrative examples:

Ecommerce Data. The tool is most useful if you save the customer ID in Google Analytics, which performed any actions to add to the cart or place an order. This will allow you to calculate the exact value of the conversion rate, even if the Google Analytics tools did not work (because of the ad blockers used by clients, the lack of a redirect from the payment page, the long waiting time for the page to load, and other reasons). In addition, based on your own data, you can exclude canceled target actions and returns to recalculate real profit. It also allows you to calculate more complex and closed indicators, for example, margin instead of income.

CRM data. What could be worse than the fact that a huge number of leads contain a large number of irrelevant leads? This is a common problem for B2B online services. By exporting CRM data with unique lead identifiers (client identifier encrypted using SHA-256 e-mail, generated identifier, etc.), you can easily associate them with client identifiers in Google Analytics. This will allow you to calculate not only the percentage of generated leads, but also the conversion rate. Multichannel analysis will require more complex queries, but you can fully control the calculation process.

Offline Events. Online business is affected by many external factors: holidays, weather conditions, strikes, a deadly virus that sent half the world’s population to self-isolation. Google Analytics does not provide the ability to introduce new parameters to study for a certain period of time. As for annotations, they are not involved in calculations and are used solely as a user interface element. However, it would be useful to know how holidays affect the number of sales.

To conduct such an analysis, it is necessary to collect information and provide it in a readable format. By doing this, you will have relevant and relevant data.

Ads, search robots, logs – All this data should be stored in a single repository.
By learning to compare them with analytical data, you can make your wildest dreams come true:

  • Does longer content attract users? By choosing the right search robot (for example, Screaming Frog), you can establish the relationship between the length of text content and the number of page views.
  • Does the behavior of search engines affect SEO optimization? Using the BigQuery log data, you can determine how the frequency of visits to the search robot affects search results.

What is the real profitability of your business? Having made the settings in the attribution calculation algorithm, you can measure the return on investment in advertising campaigns for all platforms used.

Raw Data Extraction Tools

The information presented above favors raw data. But how to get them? Consider some of the most common methods.

Google Analytics 360

If you’re lucky enough to get access to this service or you have enough budget to pay the cost of using it, you have the best tool for extracting raw data from Google BigQuery. It allows you to export any information, including advanced e-commerce data. Each line corresponds to a specific session, and you can use a huge number of parameters and metrics.

Google Analytics App + Web and Firebase

Recently, web analytics have been able to export data to Google BigQuery without having to buy Google Analytics 360. Firebase, which is the core of the Google Analytics App + Web, supports the export functionality to Google BigQuery. You will be billed with Blaze, which uses a pay-as-you-go approach. If you have a large Internet portal, you will have to keep track of your budget. For small sites, costs range from just nothing to just a few dollars a month.

Each line corresponds to an event containing a screen or pageview. You will have to get used to this very specific way of presenting data, which is different from that used in Google Analytics. However, this service may be the best solution for those who want to use raw data.

Other free tools: Yandex.Metrica and Matomo

I could not try every tool, since many of them are paid. Each of them contains functionality for exporting raw data. However, there are two absolutely free online services that offer the same functionality and do not charge for their use.

Yandex.Metrica is an absolutely free tool that provides access to raw data through the API of its logs. Matomo is an open source analytics tool that needs to be installed directly on your server where the site files are located. It exports the raw data directly to your database.

Data pipeline

Another way to upload Google Analytics data directly to your data warehouse is through the data pipeline. OWOX BI organizes a powerful data flow between Google Analytics and BigQuery. To implement the functionality, you need to create a custom task in Google Analytics. It creates copies of the Google Analytics payload and transfers it to the final data warehouse.

Having sufficient experience, you will be able to independently create your own endpoint for uploading data using the functionality of the cloud service and based on the analysis of the log. The following are two useful resources to help you figure this out:
Simo Ahab – “How to Build a GTM Monitor.” After reading this article, you will learn how to send data to BigQuery using cloud functions. The amount of transmitted data is limited to 100,000 rows per second, which can be integrated into the BigQuery service. If the number of lines exceeds the maximum value indicated above, you will have to group the data from several logs.

Google Cloud Help Center – Serverless Pixel Tracking Architecture This source discusses the mechanism for creating your own tracking pixel, followed by integration into BigQuery.

Examples and special cases of using BigQuery

Now you know the benefits of using raw data and how to access it. Now let’s look at a few examples that clearly demonstrate the principles of working with this data.

Linking topics to targeted actions in Google Analytics 360
The analysis was conducted for a news site with online subscribers. The main objective of the analysis was to establish a relationship (correlation) between the topics of the news that users read and the targeted actions performed.

The result and conclusions:

corr_culture 0.397
corr_opinion 0.305
corr_lifestyle 0.0468
corr_sport 0.009

The most likely category of those who subscribe to the site are users who are interested in the “Culture” and “Opinions” sections. On the other hand, if the user is interested in the headings “Lifestyle” or “Sport”, the probability of subscribing to them is minimal.

Query for BigQuery

SELECT
CORR(culture,transac) AS corr_culture,
CORR(opinion,transac) AS corr_opinion,
CORR(lifestyle,transac) AS corr_lifestyle,
CORR(sport,transac) AS corr_sport
FROM(
SELECT
  SUM(IF(hit.page.pagePath LIKE'/culture%',
      1,
      0)) AS culture,
  SUM(IF(hit.page.pagePath LIKE'/opinion%',
      1,
      0)) AS opinion,
  SUM(IF(hit.page.pagePath LIKE'/lifestyle%',
      1,
      0)) AS lifestyle,-
  SUM(IF(hit.page.pagePath LIKE'/sport%',
      1,
      0)) AS sport,      
  COUNT(hit.transaction.transactionId) AS transac
FROM
  `mydatabase.view_id.ga_sessions_*`,
  UNNEST(hits) AS hit
WHERE
  _TABLE_SUFFIX BETWEEN '20191201' AND '20200301'
GROUP BY
  fullVisitorId
ORDER BY
  transac DESC
)

Cohort Analysis Using Firebase

An application with regularly updated content and a high seasonality rate was taken for analysis. The analysis is carried out to find answers to the following questions:
How do users who first installed this app behave?

What is the best time to attract customers who will use the app on a regular basis?

The result and conclusions:

Judging by the data received, the application is most often used by clients who downloaded it for the first time in September and December.

Query for BigQuery

# change my-app.analytics_123456789 to your ID
WITH cohorte_september
AS
(
WITH
user_september AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
)
ORDER BY month ASC
),

#october

cohorte_october AS
(
WITH
user_october AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october)
)
)
ORDER BY month ASC
),

#november

cohorte_november AS
(
WITH
user_november AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
)
ORDER BY month ASC
),

#decembre


cohorte_decembre AS
(
WITH
user_decembre AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre  )
)
)
ORDER BY month ASC
)

SELECT cohorte_september.sessions AS september_cohort,  
 cohorte_october.sessions AS october_cohort,  
 cohorte_november.sessions AS november_cohort,  
 cohorte_decembre.sessions AS december_cohort,  
 month

FROM cohorte_september
JOIN  cohorte_october USING (month)
JOIN  cohorte_november USING (month)
JOIN  cohorte_decembre USING (month)
ORDER BY month ASC

findings

Most likely, serious changes are coming in the near future, and Google App + Web will become something like an industry standard. This approach provides tighter integration between Google services such as Marketing Platform and Google Cloud Platform, and especially BigQuery. If you have lost the ability to create SQL queries (a language for working with DBMS), I highly recommend that you refresh your information in your memory and practice it.

Advanced digital data analytics is becoming an increasingly powerful tool due to easy access to raw data, fast and efficient computing processes, and good visualization of information. In the near future, even closer integration with other types of business data will be implemented.

For many years now, experts have argued that digital and business analytics must work together. Slowly, but confidently, we are moving towards translating this idea into reality.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *