An alternative way to stream Google Analytics data for analyzing user behavior
Why stream GA data to your own database
While working on typical client tasks at CreativePeople, we quickly ran into limitations of Google Analytics (the Universal Analytics version) that prevented us from solving those tasks efficiently.
For example, in the early stages of a project we often need conversion data for events for which no goals were previously set up. Or goals were set up, but we only need the conversion for a few event subtypes out of the larger group that the goal covers.
In theory this can be done with calculated metrics in Google Data Studio, but in practice the Universal Analytics connector for Data Studio starts to distort the data badly, understating sample sizes for complex, detailed queries.
Here is an example of the number of users by event on a specific page, April 8 to June 7:
And here is the same query, but for a longer period, March 9 to June 7:
The second common example is when the data needs to be combined with information from external sources such as a CRM. Here, having the data in your own database becomes critical for getting sensible results in BI, because joining sessions to CRM records by any reasonably complex logic is simply not possible on the BI side.
Another important point is the ability to work with a user's history within an arbitrary lookback window. Google Analytics has a 90-day limit, while in some cases we need to look back several years to calculate cumulative totals.
These are just a few basic cases; we have accumulated many more over the past few years.
To summarize: streaming removes the problem of incomplete data and lets you work with the information far more flexibly than the standard Google Analytics connectors to any BI system.
How streaming works
Streaming can be set up with ready-made solutions such as OWOX BI. On the one hand, such services remove the need to build the whole architecture and do the subsequent data modeling: splitting the hit string into parameters, enriching it with calculated metrics and dimensions, grouping hits into sessions, and so on. On the other hand, they are paid, and their cost can be noticeable for projects with a large audience.
For most projects, we use our own solution. It works as follows.
Before sending a hit to the Google Analytics collection endpoint, the GA script runs a series of tasks.
These tasks can be modified so that everything sent to the Google Analytics collection endpoint is also sent to an arbitrary URL.
We set up our own solution like this: a Google Cloud Functions endpoint accepts the GET request and, via the BigQuery streaming API, writes the request data together with its HTTP headers to Google BigQuery. This data then goes through modeling, which turns raw streaming rows like these
into a dataset ready to connect to BI:
Setting up data sending in GTM
All changes are made in customTask (you can read more about working with customTask in Simo Ahava's guide). To send the data, just add the following code:
var custom_tracking_url = "https://{Google Function endpoint address}",
// Add all GA data to the request.
// We assume that the variable `model` references the GA model object.
hitPayLoad = '?' + model.get('hitPayload'),
// Append the user agent
user_agent = "&us_ag=" + encodeURIComponent(navigator.userAgent),
// Append the referrer
referrer = "&ref=" + encodeURIComponent(document.referrer),
// Add the dimensions of the user's browser window
window_dimensions = "&wdm=" + encodeURIComponent({{User window width}} + 'x' + {{User window height}});
// Build the final URL and send it.
var final_url = custom_tracking_url + hitPayLoad + user_agent + referrer + window_dimensions;
var xhr = new XMLHttpRequest();
xhr.open('GET', final_url, true);
xhr.send();
Sometimes, for security reasons, we cannot add the request directly inside customTask. In that case, customTask only pushes all of the data to the dataLayer, and a separate GTM trigger and tag react to that push and send the data to our collection point.
Setting up the data collection point in Google Cloud Functions
The function is implemented in Python using the official Google Cloud client library.
If you leave out the CORS plumbing, all of the functionality fits into a single function:
def stream_bq(uri, headers_json_string):
    client = bigquery.Client()
    table = client.get_table(f"""{PROJECT}.{DATASET}.{TABLE}""")
    content = [{'URL': uri,
                'timestamp': datetime.datetime.now(),
                'headers': headers_json_string}]
    errors = client.insert_rows(table, content)
    print(f"added {content}")
    if errors:
        logging.error(errors)
The full function code:
from google.cloud import bigquery
import datetime
import logging
import urllib.parse
import json
import re

PROJECT = 'cpeople-analytics'
DATASET = 'import_ga'
TABLE = 'streaming_' + datetime.datetime.today().strftime("%Y%m%d")

def stream_bq(uri, headers_json_string):
    client = bigquery.Client()
    table = client.get_table(f"""{PROJECT}.{DATASET}.{TABLE}""")
    content = [{'URL': uri,
                'timestamp': datetime.datetime.now(),
                'headers': headers_json_string}]
    errors = client.insert_rows(table, content)
    print(f"added {content}")
    if errors:
        logging.error(errors)

def entry_point(request):
    request_host = request.headers.get("Origin", "")
    patterns = [r'.+\.cpeople\.ru',
                r'.+\.devw\.cpeople\.ru',
                r'.+\.devh\.cpeople\.ru']
    matches = [re.match(p, request_host) is not None for p in patterns]
    origin = 'https://cpeople.ru'
    if any(matches):
        origin = request_host
    # Set CORS headers for the preflight request
    if request.method == 'OPTIONS':
        # Allow GET requests from the allowed origin with the Content-Type
        # header and cache the preflight response for 3600s
        headers = {
            'Access-Control-Allow-Origin': origin,
            'Access-Control-Allow-Methods': 'GET',
            'Access-Control-Allow-Headers': 'Content-Type',
            'Access-Control-Max-Age': '3600'
        }
        return ('', 204, headers)
    # Set CORS headers for the main request
    headers = {
        'Access-Control-Allow-Origin': origin
    }
    request_headers_dictionary = {k: v for k, v in request.headers.items()}
    if request.url:
        stream_bq(request.url, json.dumps(request_headers_dictionary))
    return ("Ok", 200, headers)
It is important to note that, to avoid losing anything, we do not modify the incoming data in any way; we simply write it to the database. All further transformations happen on the modeling side.
Building the model in BigQuery
The model is assembled by a single procedure that runs once a day. We will not reproduce the entire procedure here, but we will describe its logic.
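How exactly the procedure is launched is not essential; as one possible sketch, a small scheduled Python job could call it through the BigQuery client (the procedure name here is an assumption, and its date_start / date_end arguments mirror the variables used in the queries below):

from google.cloud import bigquery
import datetime

def run_daily_model():
    # Hypothetical daily job: rebuild the model for yesterday's streaming data.
    client = bigquery.Client()
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    client.query(
        f"CALL `cpeople-analytics.import_ga.model_hit_data`(DATE '{yesterday}', DATE '{yesterday}')"
    ).result()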
First, the data is extracted from the raw hit string and assembled into a structure. This is done as follows:
1) A helper function is used to URL-decode values:
CREATE TEMP FUNCTION urldecode(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
2) A second function retrieves parameter values:
# Extracts a value from the string using a regular expression.
CREATE TEMP FUNCTION get_value(name STRING, url STRING)
AS (
urldecode(REGEXP_EXTRACT(url, r'[&?]'||name||'=([^&]+)'))
);
3) Using these functions, the data is extracted from the hit string:
# Parse the streaming URL into individual parameters.
SELECT
PARSE_DATE("%Y%m%d", _TABLE_SUFFIX) as date,
timestamp,
get_value('cid', URL) clientId,
get_value('gtm', URL) google_tag_manager_container_id,
get_value('ni', URL) non_interaction,
get_value('t', URL) type,
STRUCT(
REGEXP_REPLACE( JSON_QUERY(headers, '$.X-Appengine-Country'), '"', "" ) as country,
REGEXP_REPLACE( JSON_QUERY(headers, '$.X-Appengine-Citylatlong'), '"', "" ) as location,
REGEXP_REPLACE( JSON_QUERY(headers, '$.X-Appengine-User-Ip'), '"', "" ) as ip,
REGEXP_REPLACE( JSON_QUERY(headers, '$.X-Appengine-City'), '"', "" ) as city
) geo,
ARRAY(
SELECT AS STRUCT
REGEXP_EXTRACT(dimension, r'(cd\d+)') number,
urldecode(REGEXP_EXTRACT(dimension, r'cd\d+=(.+)$')) value
FROM UNNEST(REGEXP_EXTRACT_ALL(URL, r'[&?](cd\d+=[^&]+)')) dimension
) dimensions,
STRUCT (
get_value('dt', URL) as title,
get_value('dl', URL) as url,
get_value('ref', URL) as referrer,
get_value('dr', URL) as external_referrer
) document,
...
FROM `cpeople-analytics.import_ga.streaming_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", date_start) and FORMAT_DATE("%Y%m%d", date_end)
4) Device data is added in a subquery by parsing the userAgent with the woothee library (https://github.com/woothee/woothee-js):
CREATE TEMPORARY FUNCTION decode_user_agent(ua STRING)
RETURNS STRING
LANGUAGE js AS """
  return JSON.stringify(woothee.parse(ua));
"""
OPTIONS(library="gs://model_hit_data/woothee.js");
An example of what the function will return:
{
"name":"Chrome",
"vendor":"Google",
"version":"91.0.4469.0",
"category":"smartphone",
"os":"Android",
"os_version":"10"
}
Next, we model additional dimensions and metrics that are not present in the raw stream but can be derived from the history: whether the user is new, the first source within the configured lookback window, the last source, the last significant source, and so on.
Some of these dimensions and metrics have no analogue in Google Analytics at all, for example the user's qualification status in the CRM at the time of the session.
Some definitions also differ from the traditional GA ones: for example, if a user returns to the site later the same day, those visits are still flagged as new-user activity.
Here’s an example of determining a user’s novelty status:
SELECT
  DISTINCT date, clientId,
  # If the history or earlier dates of the streaming data already contain this user,
  # mark the user as returning: new_user = false.
  if(new_user_history and new_user_streaming, true, false) new_user
FROM
  (SELECT
    date, clientId,
    # Check whether the history contains data for this clientId:
    (
      SELECT if(count(m.date) > 0, false, true)
      FROM `cpeople-analytics.import_ga.modelled_hit_data` m
      WHERE m.clientId = r.clientId
      # take only the part of the model that falls within the lookback window used to determine user status.
      and (m.date between DATE_SUB(r.date, INTERVAL days_user_status_retrospective DAY) and DATE_SUB(r.date, INTERVAL 1 DAY))
      # also make sure that, when rebuilding a period that was modelled earlier, we do not take data from the model for that period
      and (m.date not between date_start and date_end)
    ) new_user_history,
    # Also check the streaming chunk being added/updated for the preceding days:
    # if the user already appeared there, today they are a returning user.
    if(FIRST_VALUE(r.date) OVER(PARTITION BY r.clientId ORDER BY date asc) != r.date, false, true) new_user_streaming
  FROM raw_streaming r)
Many of these dimensions and metrics would be incorrect if streaming was not enabled from the very start of the project. To avoid this, we download historical Google Analytics data via the Reporting API into separate tables, and then assemble the data from those reports into a table whose schema is similar to the one produced by streaming modeling.
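A minimal sketch of such a historical download, assuming the Reporting API v4 Python client; the view ID, the dimensions and metrics, and the destination table below are placeholders rather than the exact set we load:

from googleapiclient.discovery import build
from google.oauth2 import service_account
from google.cloud import bigquery

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
VIEW_ID = '123456789'  # placeholder GA view ID

def download_history(start_date, end_date):
    # Pull a historical daily report and append it to a separate BigQuery table.
    creds = service_account.Credentials.from_service_account_file('key.json', scopes=SCOPES)
    analytics = build('analyticsreporting', 'v4', credentials=creds)
    response = analytics.reports().batchGet(body={
        'reportRequests': [{
            'viewId': VIEW_ID,
            'dateRanges': [{'startDate': start_date, 'endDate': end_date}],
            'dimensions': [{'name': 'ga:date'}, {'name': 'ga:sourceMedium'}],
            'metrics': [{'expression': 'ga:users'}, {'expression': 'ga:sessions'}],
            'pageSize': 100000,
        }]
    }).execute()
    rows = [{
        'date': r['dimensions'][0],
        'source_medium': r['dimensions'][1],
        'users': int(r['metrics'][0]['values'][0]),
        'sessions': int(r['metrics'][0]['values'][1]),
    } for r in response['reports'][0]['data'].get('rows', [])]
    # Append into a separate historical table (the table name is a placeholder).
    bigquery.Client().insert_rows_json('cpeople-analytics.import_ga.history_ga_report', rows)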
Quantitative differences from data in Google Analytics
Besides receiving a richer set of information, we also often see discrepancies between Google Analytics and the streaming data in the number of events and in the number of users per event.
Here are some examples:

Differences from the Google Analytics 4 raw data export
The new version of GA provides free export of raw data to BigQuery out of the box. Without any additional configuration or extra infrastructure, you get all the advantages of having the data in your own database.
Nevertheless, on most projects we still keep streaming even after switching to Google Analytics 4.
This is due to the following nuances:
so as not to run into Google Analytics 4 limits
to have access to data that has not been processed by anyone
to also collect data that corporate security rules prohibit sending to Google Analytics
to use the streaming data for real-time monitoring
The third point is especially important here: the collection endpoint does not care whether the information arrived in the form of a Google Analytics hit or as a GET request that we send separately when certain events occur on the site.
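For illustration, anything capable of issuing a GET request with query parameters can feed the same table; here is a minimal sketch (the endpoint URL and parameter names are placeholders that only mimic the GA payload):

import requests  # any HTTP client will do; the endpoint simply records the GET request

ENDPOINT = 'https://REGION-PROJECT.cloudfunctions.net/collect'  # placeholder URL

def track_restricted_event(client_id, event_action, payload):
    # Hypothetical example: record an event that must not be sent to Google Analytics itself.
    params = {
        't': 'event',           # mimics the GA hit type parameter
        'cid': client_id,       # the same clientId as in the streaming hits, so sessions can be joined
        'ea': event_action,
        'restricted': payload,  # data that never reaches Google
    }
    requests.get(ENDPOINT, params=params, timeout=5)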
Moreover, due to the peculiarities of GA4 itself, its exported data will sometimes be more complete than the streaming data:
This happens because the sending hardcoded into customTask depends entirely on the Google Analytics scripts initializing and executing correctly, and in the Universal Analytics version they sometimes fail under certain circumstances: particular browser versions, devices, operating systems and browser security settings.
In such cases we still initialize Google Analytics, but instead of the clientId, which under these circumstances would be new on every pageview, we set the single value “everything_blocked”.
Therefore, the key change for this functionality is to move the sending out of the Universal Analytics customTask into a fully independent data submission that mirrors what is sent to GA4. At the same time, when the user cannot be identified by cookies and localStorage is unavailable, we assign the user an ID after the fact, during modeling, based on additional device data, IP and location, which makes it possible to group users without a clientId more correctly.
This will significantly reduce the loss of information, which is likely to be present in the new version of Google Analytics as well.