The Dark Side of AlwaysOn

SQL Server has a wonderful technology – AlwaysOn. It is used for:

  • DR – disaster recovery, asynchronous data replication

  • HA – high availability, often with automatic failover, which is possible with synchronous replication

  • And for what we will discuss in this article: a readonly replica for DWH/OLAP/Reporting workloads.

Nobody is perfect (though I admire how easy some things are to set up in MS SQL compared to PostgreSQL and Oracle – backups, at least; and for small databases AlwaysOn is launched literally in a couple of clicks). Among the pitfalls we can mention, for example:

  • The 'tenderness' of the cluster service, which can crash due to a lack of memory or CPU, after which the databases go into the RESOLVING state and become unavailable. SQL Server is thus protected from split brain, but the databases stay unavailable until the Cluster Service comes back up and the nodes agree on who is primary and who is secondary

  • In some cases the Cluster service is so unstable, for reasons unknown (I have seen this in our clouds), that we had to build a cluster-less AlwaysOn

  • In case of failover, reassigning SQL Agent jobs is left to the DBA. There are some solutions, but they are not automatic.
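As an aside: whether a node has ended up in RESOLVING, and the overall health of the replicas, can be checked with the standard HADR DMVs. A minimal sketch, run on any node you can still connect to (all views and columns below are the documented system ones):

```sql
-- Availability group and replica health overview.
SELECT ag.name                        AS ag_name,
       ar.replica_server_name,
       ars.role_desc,                 -- PRIMARY / SECONDARY / RESOLVING
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
       ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states ars
       ON ars.replica_id = ar.replica_id;
```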

But today we will look at the problems that arise when using AlwaysOn for DWH/OLAP/Reporting.

What is this for?

The idea itself is simply wonderful. We move the OLAP/Reporting load and other heavy queries to a separate server. In this way we unload the main server and, especially, its cache (heavy queries repopulate the cache with their own pages, evicting the OLTP working set).

All this works so well that many perceive this technology as a silver bullet: here is an online copy of PROD, do with it whatever you want. Meanwhile, the technology has problems that manifest themselves in some cases. Let's build a test setup: we will create a database test, add it to AlwaysOn, and create two tables in it:

-- primary
create table A (str varchar(128))
create table B (str varchar(128))
GO
insert into A select 'old'
insert into B select 'old'
insert into A select 'old'
insert into B select 'old'
insert into A select 'old'
insert into B select 'old'
GO

I will note everywhere whether a script is executed on the primary or the secondary. After this script runs, both tables with the data appear on the secondary almost instantly.

By running:

-- secondary
select * from B
select * from A

We will see all the data from the primary.

Next, I suggest recalling the old joke about the harsh Siberian lumberjacks who cut down trees with hand saws, and how a Japanese chainsaw fell into their hands.

'Ooo,' said the stern Siberian men. And they stuck a twig in it.

Let's open two connections on the secondary; I'll call them secondary 1 and secondary 2. In secondary 1, we'll do:

-- secondary 1
begin transaction
select * from B
select * from A

Please note that I did not close the transaction. Next, on primary we do:

-- primary
update A set str='new1'
update B set str='new1'

Next, we run a select from both tables in the secondary 1 and secondary 2 windows (on secondary 1, comment out begin transaction). What will we see?

On secondary 2 we will see the value 'new1'. But on secondary 1 we will still see the old value, 'old'. Why? Because under the hood AlwaysOn forces snapshot isolation on the readable secondary so that it does not read dirty data.

But the 'delta' for the snapshot is stored in tempdb. So by running begin tran; select * from TableUnderHeavyWorkload and going to lunch, or off for the weekend, we risk bringing down the DWH server: its tempdb will simply fill up. And this despite the fact that our load is readonly.
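How close tempdb is to that point can be watched via the version store counters. A sketch, run on the secondary (version_store_reserved_page_count is in 8 KB pages):

```sql
-- Space reserved by the row version store in tempdb, in MB.
SELECT SUM(version_store_reserved_page_count) * 8 / 1024 AS version_store_mb
FROM tempdb.sys.dm_db_file_space_usage;
```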

Honestly, I don't know what happens when tempdb runs out of space: will the snapshot be discarded (the good outcome), or will data transfer in AlwaysOn stop? If the latter, the primary will accumulate log it cannot truncate in its LDF, and after some time the primary will go down.

Conclusion: even a readonly replica needs hygiene, just like prod.
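Part of that hygiene can be automated: the snapshot transactions pinning the version store are visible in a DMV, so the lunch-goers are easy to find. A sketch (the 10-minute threshold is my arbitrary choice):

```sql
-- Snapshot transactions open for more than 10 minutes on the secondary.
SELECT session_id,
       transaction_id,
       elapsed_time_seconds
FROM sys.dm_tran_active_snapshot_database_transactions
WHERE elapsed_time_seconds > 600
ORDER BY elapsed_time_seconds DESC;
```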

…and the stern Siberian men laid the log

But the snapshot holds the data delta under the table's schema. What happens if the schema changes?

Let's close the transaction on secondary 1, then run the same batch again, starting with begin transaction. Then on the primary:

-- primary
alter table A add col varchar(10) null
update A set str='new2'
update B set str='new2'

On secondary 2 we see 'new2'. But on secondary 1:

We get a slap on the wrist. Here is the message in full, since it explains in detail what happened:

Snapshot isolation transaction failed in database 'test' because the object accessed by the statement has been modified by a DDL statement in another concurrent transaction since the start of this transaction. It is disallowed because the metadata is not versioned. A concurrent update to metadata can lead to inconsistency if mixed with snapshot isolation.

…and they stuck an iron crowbar in it

What if a select runs for a very long time on the secondary? I didn't want to load gigabytes of data into the test database, so I created a delay function in a neighboring database (not test, since the test database is readonly on the secondary):

-- secondary 1, DBA database
create function dbo.waiter(@str varchar(1000), @n int)
returns int
as
begin
  while @n > 0 set @n = @n - 1
  return 1
end

Now on secondary 1 let's run the query:

-- secondary 1, in the DBA database
select * from test.dbo.A where dbo.waiter(str,50000000)=1

The command above runs for about 100 seconds; you can vary the duration with the second parameter. Meanwhile, let's add another column on the primary:

-- primary
alter table A add secondcol varchar(10) null
update A set str='new3'
update B set str='new3'

secondary 1 is still running the query, so let's switch to secondary 2 and see that the str value hasn't changed and there is no secondcol column in A. We've stopped the AlwaysOn data flow!

We can see the blocked log 'roller' – the redo thread (cmd = 'DB STARTUP') – blocked by our select (119 is its spid) via a schema lock.
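The same picture can be pulled without a screenshot: the redo thread shows up in sys.dm_exec_requests as a request with command DB STARTUP. A sketch:

```sql
-- The redo thread and whoever is blocking it.
SELECT r.session_id          AS redo_spid,
       r.command,
       r.wait_type,
       r.blocking_session_id AS blocker_spid
FROM sys.dm_exec_requests r
WHERE r.command = N'DB STARTUP'
  AND r.blocking_session_id <> 0;
```

Killing the blocker (KILL its spid) immediately unblocks the redo.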

That's how it is, …, said the stern Siberian men

That is, a long query on the DWH server can block the redo for as long as the query is active. AlwaysOn will not try to kill the query; it will simply wait for it (this holds only within one database: redo for different databases is independent). All this time (possibly hours!) the secondary database will lag further and further behind.

With asynchronous AlwaysOn there are two queues: the send queue on the primary and the redo queue on the secondary. One would expect a blocked redo thread to simply let the redo queue build up on the secondary. But it seems that a blocked redo thread cannot send an acknowledgement of the hardened point to the primary, so the primary cannot truncate its log (LDF).
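Both queues can be monitored per database from the primary; the relevant counters live in sys.dm_hadr_database_replica_states. A sketch:

```sql
-- Send and redo queues per database and replica (sizes in KB).
SELECT DB_NAME(drs.database_id)  AS db,
       ar.replica_server_name,
       drs.log_send_queue_size,  -- waiting on the primary
       drs.redo_queue_size,      -- waiting on the secondary
       drs.redo_rate             -- redo throughput, KB/s
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
       ON ar.replica_id = drs.replica_id;
```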

The LDF file on the primary will start growing, and log backups will not help. If the query finishes, the situation resolves itself (though the replica will be left far behind). But if you are unlucky, the LDF will run out of space and you will hear squealing and screaming.
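That the log is being held hostage by AlwaysOn, and not by, say, a missing log backup, is easy to confirm:

```sql
-- Why can't the primary truncate its log?
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'test';
-- AVAILABILITY_REPLICA here means AlwaysOn is holding the log.
```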

We could say the situation is rare. Indeed, as a rule, the schema on PROD is stable, releases excepted. But alas, there are systems where this is not the case.
