Monday, July 15, 2019

DB2 Split Brain - it's still a thing...

Synopsis -
DB2 "Split Brain" is still very much a vulnerability in DB2 v11.5 (multiplatform).
The concept and meaning are discussed, along with the point that the vulnerability is actually multiplied by increased complexity in server frameworks and recent DB2 feature sets, coupled with DB2 Administrators deciding to enact HADR connectivity overrides in emergency outages on the strength of potentially misleading diagnostic cues.
Brief practical and theoretical research follows on multiple DB2 Split Brain scenarios across DB2 versions, which remain possible despite multiple built-in preventative mechanisms.
A summary is given on how best to procedurally avoid Split Brain.


Main Article -
I was recently asked a DB2 interview question: "How does Split Brain occur?"
DB2 High Availability being a specialty of mine (a few years ago), I racked my brain for what I thought the simplest summary was, and my response was:
Split Brain is when you have a [disconnected] HADR Standby force take-over as Primary, (but the original Primary is still active) and application connections add log records to the new Primary which put the two databases in an irreconcilable state where they cannot reconnect in HADR.

The interviewers had an even simpler summary - i.e. Split Brain occurs (or can occur) when HADR databases are no longer the same.

Reflecting on it, their simpler answer is indeed more valid than my own.  This is not only because two HADR Standard / Primary mode databases which have diverged from each other cannot be reconnected as a Primary + Standby. (Divergent transaction logs are a different matter from a Primary transaction log which is merely further along than the Standby has replayed - the latter situation can still potentially be brought to peer state, so long as Primary and Standby are HADR connected and log replay/catchup is allowed to progress before any [forced] takeover.)
It is also true because (in HADR_SYNCMODEs other than SYNC or NEARSYNC), a vulnerability opens up every time a HADR Standby falls behind the Primary in log apply, or is in a 'Catch-up' state.
If that catch up state is not successfully brought back to a peer state, transactions from the original Primary side will be lost, and a loss of connectivity to (or outage of) the Primary server before all HADR logs are received by the Standby, means that peer state can never be achieved if the Standby has to be forced into a takeover as Primary state.
If applications start connecting and writing logs to the new Primary while the old Primary is disconnected from the Standby, but not from applications/batch connections, the two databases become irreconcilable. Some transaction log records are then essentially unrecoverable except with a 3rd party log extraction tool, and most likely practically unusable after the fact in a high activity OLTP environment - it would be more expedient to resubmit selected application work after assessing referential integrity to find missing/incorrect row/column values.
Bringing both databases back to connected HADR Primary+Standby state after this forced takeover situation (where the two were not in peer state at time of takeover) will require the old Primary database to be restored/replaced from an online backup of the new Primary and started in rollforward pending as HADR Standby, in the same manner as the original Standby was established.   Once in peer state, the original Primary can perform an unforced takeover and the servers can resume their normal roles, which is usually required where the Standby server has been configured with lower capacity (CPU, memory) or is physically located remotely in a DR site.
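
The rebuild sequence just described can be sketched as a dry-run helper that only prints the commands for review and executes nothing. This is a minimal sketch under stated assumptions: the function name is hypothetical, the database name, backup directory and timestamp are placeholders, and it assumes an online backup to disk that is copied to the old Primary by some means outside the script.

```shell
#!/bin/sh
# Dry run only: prints the rebuild steps for review, executes nothing.
# Arguments are hypothetical placeholders: database, backup dir, backup timestamp.
print_standby_rebuild_plan() {
    db=$1 dir=$2 ts=$3
    echo "# on the new Primary:"
    echo "db2 backup db $db online to $dir"
    echo "# copy the backup image to the old Primary, then on the old Primary:"
    echo "db2 restore db $db from $dir taken at $ts replace existing"
    echo "db2 get db cfg for $db   # verify the HADR host/service settings"
    echo "db2 start hadr on db $db as standby"
    echo "# once db2pd reports PEER, optionally on the old Primary:"
    echo "db2 takeover hadr on db $db"
}

print_standby_rebuild_plan SAMPLE /db2backup 20190716090000
```

Printing rather than executing is deliberate: each step runs on a different server, and the point of the exercise is that a human confirms which server is which before anything is overwritten.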

So, with all that said, I was immediately intrigued by their follow-up statement that DB2 has protections against Split Brain such that it doesn't (cannot) occur in the more recent versions of DB2 (e.g. v10 and later).
I had to do some research to confirm this, because it seemed to me that regardless of the DB2 version, Split Brain will always be a potential situation which can only be avoided by never performing a HADR 'takeover by force', and thus never having two connectable databases at the same time. That still allows for Reads on Standby (ROS), which only allows Uncommitted Read (unlogged Selects) and is completely safe.
I hasten to add that by 'connectable', I definitely do not mean that DB2 will allow databases with divergent log streams to re-connect or re-start in HADR mode.  DB2 already has multiple safeguards, including detection when a Standby candidate has log records which the Primary candidate does not contain, preventing a HADR Start as Primary from connecting to a divergent started Standby (e.g. with SQL1768N rc7).   'Connectable' includes the obvious non-HADR 'Standard' mode database where manual intervention is performed by DB2 admins under pressure to get the database online again for users.
As with most things, the inevitability of human error and manual intervention means that no built-in mechanism to prevent data inconsistency is foolproof.

It may actually be that conceptually, our ideas of what 'Split Brain' means, are 'divergent' :)
It is not an uncommon phenomenon for IBM's official terminology to differ from usage in the global DB2 support community.
My idea incorporates the broad definition of data inconsistency regardless of the current state of HADR connectivity, whereby the log streams have diverged - i.e. databases have log records which the other lacks, e.g. due to a takeover by force outside of peer state (or more specifically, outside of peer window), along with subsequent committed transactions on the new HADR Primary.
One way a Split Brain can occur even from a Peer State, is where the original Primary server is still running with connected applications, but the Standby loses connectivity to Primary and an over-enthusiastic DB2 support person thinks that HADR error actually means the Primary server is down and a HADR takeover [by force] is required.  Subsequently, there are two HADR Primary DB2 databases running and connectable - if any application connects to the new Primary and writes a transaction log, that becomes a Split Brain scenario.
Normally, you might expect there to be other factors preventing Split Brain from occurring in an enterprise/multi-server environment, purely because applications typically connect to DB2 not locally but remotely via separate middleware such as WebSphere Application Server (WAS), and that normally restricts the application-to-DB2 network route to a single port number on a single IP address on a specific network adapter.
Additionally, reliance on the Automatic Client Reroute (ACR) mechanism should serve as a preventative measure, whereby an attempted connection on Server A will be automatically redirected to Server B if DB2 HADR is currently set to Standby on Server A (and vice versa: direct connections to Server B redirect to A when Server B is Standby).
Unfortunately, in security vulnerability parlance, this merely widens the 'attack surface' - increasing the number of moving parts (points of failure) which can go wrong.
It is especially true when there are multiple network adapters in play (for firewall security zone demarcation as well as load balancing), where some are designated for remote admin and some are internal application comms - if external zone admin routes fail but internal zone application comms remain, it can be difficult for the support teams or dashboard/monitoring tools to confirm whether applications are still connected to DB2, because they cannot connect in order to run basic diagnostic commands.
Unless there is an explicit process or mechanism preventing applications connecting to one database of a DB2 HADR pair without confirmation that it is the only connectable (Primary / Standard mode) database, the Split Brain vulnerability exists.
I assert this because that situation occurred for our support team years ago, (luckily only on a Disaster Recovery test takeover scenario), whereby the HADR takeover by force was issued on the DR test Standby after confirmation from the system support team that the old Primary was stopped, but some application activity occurred on the original Primary even after the HADR takeover by force on the DR test Standby, creating the Split Brain.
In this scenario, two related points of failure existed -
1) After the shutdown step on the four Primary DB2 servers was given, the Admin Network Adapters indicated that all four were not connectable, which was taken as the signal to proceed as though all four were stopped.  Unfortunately, it turned out that at least one Primary DB2 server was still running, just that the Admin network adapter was stopped, and some batch applications were still connected and processing in DB2 through the internal adapter - some through direct ip/port, not all through WebSphere Application Server (WAS) middleware.
2) The middleware/WebSphere team had to switch over their registered application server ip addresses, because in a real DR scenario, DB2's Automatic Client Reroute (ACR) cannot autoredirect when the Primary server is down.  There were multiple load balanced application servers to switch over and this took time, allowing transaction processing unbeknown to the system support teams but later discovered by the application team.
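
The first point of failure above comes down to trusting a single network route. A minimal sketch of probing every known route to the old Primary, instead of only the admin adapter, might look like the following. Everything named here is an assumption for illustration: the function names, the PROBE override, the addresses and the port are hypothetical, and the default probe assumes an `nc` that supports `-z` and `-w`. A silent port only proves that the route is dead from wherever the script runs, so it would need to be run from the application network zone as well as the admin zone.

```shell
#!/bin/sh
# Hypothetical default probe: succeeds if a TCP connection to host:port opens.
# Assumes an nc that supports -z and -w; swap in whatever your site uses.
probe_tcp() { nc -z -w 3 "$1" "$2" >/dev/null 2>&1; }

# Prints every "host:port" route that still answers.
# Exit status is 0 only if no route answered.
routes_all_down() {
    probe=${PROBE:-probe_tcp}
    alive=0
    for route in "$@"; do
        host=${route%:*} port=${route##*:}
        if "$probe" "$host" "$port"; then
            echo "STILL REACHABLE: $route"
            alive=1
        fi
    done
    [ "$alive" -eq 0 ]
}

# Example call (hypothetical admin and internal addresses of the old Primary):
#   routes_all_down 192.168.56.5:50000 10.0.0.5:50000 || echo "do NOT force takeover"
```

The probe is kept replaceable because a closed listener port is weaker evidence than an instance-level check; it is a first filter before a forced takeover is even discussed, not proof that DB2 is stopped.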

To practically test my assertion on current DB2 version 11.5, I went so far as to create a pair of virtualised DB2 11.5 on SLES12x64 servers, pairing the SAMPLE db in HADR via a host-only network adapter for internal DB2 connectivity in addition to the NAT adapter for external/remote (admin) connectivity.
(sles12x64a db2inst1 SAMPLE HADR Primary, sles12x64b db2inst1 SAMPLE HADR Standby)
The Split Brain scenario is still possible purely by virtue of the ability to perform db2 takeover by force on the Standby server while the old Primary is still running but network disconnected from the Standby.
It is also possible when HADR is stopped after any takeover and databases become connectable as Standard mode.
Thankfully, if connectivity between Primary and Standby exists at the time of a forced takeover, there is a DB2 mechanism which prevents subsequent connections on the old Primary:
i.e. start HADR, then attempt connect to old Primary after a forced takeover on still-connected Standby:
db2inst1@sles12x64b:~> db2 start HADR on db sample as Standby
DB20000I  The START HADR ON DATABASE command completed successfully.


db2inst1@sles12x64a:~> db2 start HADR on db sample as Primary
DB20000I  The START HADR ON DATABASE command completed successfully.


db2inst1@sles12x64b:~> db2 takeover db sample by force
SQL0104N  An unexpected token "db" was found following "TAKEOVER".  Expected
tokens may include:  "HADR".  SQLSTATE=42601
db2inst1@sles12x64b:~> db2 takeover HADR on db sample by force
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2 connect to sample
   Database Connection Information
  Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = SAMPLE

db2inst1@sles12x64a:~> db2 connect to sample
SQL1776N  The command cannot be issued on an HADR database. Reason code = "6".


Normally, those two databases cannot now be HADR reconciled because both were essentially in PRIMARY state, and to start a database as Standby, it needs to be in rollforward pending state.
However, DB2 has a utility called db2rfpen to force reset a database into rollforward pending state.
We will assume for convenience and for the purposes of this testing that the takeover by force occurred within Peer Window (otherwise we already have Split Brain).  We will also assume for the same reasons that the old Primary did not have any remaining connected transactions commit after the takeover by force.  The negation of any of these assumptions would indicate a Split Brain scenario has already occurred in terms of divergence of log streams & committed transactions in database, even if those databases are currently preventing new connections.  As stated above, DB2's internal safeguards will at least ensure a successful HADR Start/reconnect will not occur if the log streams are divergent, but that doesn't prevent connections and transactions if HADR is then stopped and databases are in standard mode.
Reconciling such divergent databases requires choosing one to be discarded and overwritten with a fresh full backup of the other database chosen as the best new Primary.
In order to reconcile and restart HADR in this otherwise cleanly forced takeover scenario, the logical database to choose to start as Standby would be the old Primary, since the Standby issued a takeover by force and potentially had transaction logs subsequently applied to it, while the old connected Primary was prevented from accepting new connections.
Attempting to restart SAMPLE on the old Primary right now gives us the following error:
db2inst1@sles12x64a:~> db2 deactivate db sample
DB20000I  The DEACTIVATE DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 stop HADR on db sample
DB20000I  The STOP HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
SQL1767N  Start HADR cannot complete. Reason code = "1".

SQL1767N   rc1 The database was not in roll forward-pending or roll forward-in-progress state when the START HADR AS Standby command was issued.

Not to worry, a quick n dirty db2rfpen + repeat start HADR as Standby has that sorted:
db2inst1@sles12x64a:~> db2rfpen on sample
 ______________________________________________________________________ 
                    ____    D B 2 R F P E N    ____                     
                 IBM - Reset ROLLFORWARD Pending State                  
  The db2rfpen tool is a utility to switch on the database rollforward  
  pending state.                                                        
  It will also reset the database role to STANDARD if the database is   
  identified using the database_alias option.                           
  In a non-HADR environment, this tool should only be used under the     
  advisement of DB2 service.                                            
  In an HADR environment, this tool can be used to reset the database   
  role to STANDARD.                                                     
  SYNTAX: db2rfpen on < database_alias | -path log_file_header_path >  
 ______________________________________________________________________ 
Primary Global LFH file    = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.1
Secondary Global LFH file  = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.2
Path to LFH files          = /home/db2inst1/db2inst1/NODE0000/SQL00001/MEMBER0000
Original rollforward pending state is Off.
Setting rollforward pending State to On.
Setting backup end time to: 1562854483
db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
DB20000I  The START HADR ON DATABASE command completed successfully.


Thanks to lack of log divergence, the start as Primary on the other side also works and reconnects the DB2 HADR pair.
db2inst1@sles12x64b:~> db2 start HADR on db sample as Primary
DB20000I  The START HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2pd -d sample -HADR
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:00:08 -- Date 2019-07-15-15.48.00.153018
                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = sles12x64b
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = sles12x64a
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 07/15/2019 15:47:52.622282 (1563169672)
          HEARTBEAT_INTERVAL(seconds) = 30
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 0
                HADR_TIMEOUT(seconds) = 120
        TIME_SINCE_LAST_RECV(seconds) = 5
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
                  LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 374400
            PRIMARY_LOG_FILE,PAGE,POS = S0000009.LOG, 0, 85596001
            STANDBY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
                     STANDBY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
              STANDBY_REPLAY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
         STANDBY_RECV_BUF_SIZE(pages) = 512
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 13000
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 120
                      PEER_WINDOW_END = 07/15/2019 15:49:55.000000 (1563169795)
             READS_ON_STANDBY_ENABLED = N
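
The fields in the db2pd output above lend themselves to a scripted pre-check before anyone reaches for a forced takeover. What follows is a minimal sketch, not an IBM-supplied tool: the function names hadr_field and within_peer_window are hypothetical, and it assumes the 'NAME = value' layout shown above, with the epoch seconds in parentheses after PEER_WINDOW_END. It also assumes that a Standby which has lost its Primary but is still inside the peer window reports DISCONNECTED_PEER. DB2's own TAKEOVER HADR ... BY FORCE PEER WINDOW ONLY option performs the authoritative check on the server; this sketch only helps when reading a saved snapshot.

```shell
#!/bin/sh
# Print the value of one field from `db2pd -d <db> -hadr` output read on stdin.
hadr_field() {
    awk -F' = ' -v key="$1" '
        { name = $1; sub(/^ +/, "", name) }   # strip the right-alignment padding
        name == key { print $2; exit }
    '
}

# Succeeds only if the snapshot shows a peer state and the supplied time
# (epoch seconds) has not passed PEER_WINDOW_END.
#   $1 = captured db2pd -hadr text, $2 = current time in epoch seconds
within_peer_window() {
    snapshot=$1 now=$2
    state=$(printf '%s\n' "$snapshot" | hadr_field HADR_STATE)
    end=$(printf '%s\n' "$snapshot" | hadr_field PEER_WINDOW_END |
          sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p')
    case $state in
        PEER|DISCONNECTED_PEER) ;;      # assumption: these are the only safe states
        *) return 1 ;;
    esac
    [ -n "$end" ] && [ "$now" -le "$end" ]
}

# Example call on the Standby (not executed here):
#   within_peer_window "$(db2pd -d sample -hadr)" "$(date +%s)" || echo "outside peer window"
```

Passing the snapshot and the time in as arguments keeps the check usable on output pasted into an incident ticket, which is often all a support person has to go on.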


Now, if anyone had stopped HADR on the old primary db and activated it as standard mode, we'd have transaction connectivity re-enabled and the Split Brain would effectively be irreconcilable.

There are a high number of permutations whereby Split Brain can be artificially induced, many of which can only be partially simulated on two virtual servers, so I have limited my practical experimentation to the above for reasons of expediency, and turned today to the theoretical using good-old internet and IBM knowledge base/manuals.

As a result of my brief online research :
Renowned DB2 expert Steve Pearson perhaps explains it best in this article, with a clear and specific definitional delineation of Split Brain prevention. The initial Q&A has a number of follow-ups which still make relevant points, even though they were written in 2006 when HADR was a new concept.
https://bytes.com/topic/db2/answers/448566-HADR-split-brain-question
to wit:
Question:
Server A (HADR Primary), Server B (HADR Standby) -
Server A Fails, Server B takes over as new Primary.
Server A restarts, but DB2 is still in HADR Primary state on Server A - what prevents applications connecting?
Answer:
DB2 will NOT ALLOW new connections to a restarted HADR Primary until it successfully reconnects to a HADR Standby.

That was true back in DB2 v8.2, v9.1 HADR, just as it remains true today in v11.5. However, it doesn't negate the scenario where Server A was never stopped & restarted, nor does it prevent panicked support teams simply issuing a db2 deactivate db <dbname> and db2 stop HADR on db <dbname> on a restarted HADR Primary, which puts the database back in standard mode.  This makes both databases connectable by all and sundry, without due regard for the new Primary still running along happily with application connections on the other server, or the transaction logs now completely divergent.

In the earliest versions of DB2 HADR (e.g. v8.2, v9.1), the only IBM definition of Split Brain is simply given in the context of issuing a 'start HADR on db <dbname> as Primary by force' command:
http://public.dhe.ibm.com/ps/products/db2/info/vr9/pdf/letter/en_US/db2hae90.pdf

Caution: Use the START HADR command with the AS PRIMARY BY FORCE option with caution. If the Standby database has been changed to a Primary and the original Primary database is restarted by issuing the START HADR command with the AS PRIMARY BY FORCE option, both copies of your database will be operating independently as primaries. (This is sometimes referred to as split brain or dual Primary.) In this case, each Primary database can accept connections and perform transactions, and neither receives and replays the updates made by the other. As a result, the two copies of the database will become inconsistent with each other.
As of DB2 v9.7, that particular vulnerability was removed when takeover by force is issued against a Standby on an active connected HADR pair.  The old HADR Primary is set to an obsolete state whereby it cannot be restarted as HADR Primary, even by force, giving an SQL1776N rc6: This database is an old primary database. It cannot be started because the standby has become the new primary through forced takeover.
However, as the example below shows, it does not prevent Split Brain arising from transactions connecting after HADR is stopped (database back in Standard mode):
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Standby -- Up 0 days 00:00:03 -- Date 2019-07-16-08.38.52.353931
                            HADR_ROLE = STANDBY
...
db2inst1@sles12x64b:~> db2 takeover hadr on db sample by force
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:03:29 -- Date 2019-07-16-08.42.18.559234
                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = DISCONNECTED
...
db2inst1@sles12x64b:~> db2 connect to sample
   Database Connection Information
 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = SAMPLE


db2inst1@sles12x64a:~> db2 deactivate db sample
DB20000I  The DEACTIVATE DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 takeover hadr on db sample by force
SQL1776N  The command cannot be issued on an HADR database. Reason code = "6".
db2inst1@sles12x64a:~> db2 stop hadr on db sample
DB20000I  The STOP HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 connect to sample
   Database Connection Information
 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = SAMPLE


Before DB2 9.5, there was no integration of DB2 HADR with Tivoli System Automation for MultiPlatforms (TSAMP). That integration later became DB2 HA, was better integrated in v9.7, and was followed in v9.8 by DB2 PureScale, which combines the best features of High Availability Clustering with Partitioning, while still allowing for HADR features to exist in hybrid scenarios including multiple Standby/replay targets.

Going to IBM's Knowledge Center for the latest DB2 LUW (v11.5 as of this article), the issue of Split Brain is now covered in at least 10 locations,
https://www.ibm.com/support/knowledgecenter/search/split%20brain?scope=SSEPGG_11.5.0
mostly dealing with methods for DB2 Administrators to use to avoid Split Brain scenarios, as well as describing the built-in features designed to prevent it (so long as they are not manually overridden) -
additional scenarios for Split Brain which never existed before concepts such as hybrid cluster+HADR and multiple standby HADR are now covered in v11.5:
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.ha.doc/doc/c0059998.html
This alone is sufficient indication that IBM still acknowledges Split Brain as a vulnerability even with all the safeguards in place from the automation in PureScale and built-in connection prereq checking for HADR commands and log shipping/replay.

So in summary, in an active connected HADR pair, Split Brain was always, and still is, prevented in the situation described by Steve Pearson, and since v9.7 it is also prevented where an active, connected HADR pair experiences a forced takeover on the Standby.  Even so, the potential for Split Brain (through disconnection and manual intervention) still exists even in DB2 v11.5, in the scenarios I describe above.   It was unquestionably mitigated by the PureScale / TSA scripted mechanisms, which only force a HADR takeover in the event of an actual cluster Primary server failure, or force a takeover/failover and node restart in the event of a cluster quorum failure, taking out a lot of the human error factor. However, the vulnerability remains wherever a hybrid or unmanaged HADR pair exists, and steps are not taken to ensure only one database is connectable by applications at a time.
If you take one point from this entire article on avoiding Split Brain, it should be this: never succumb to the pressure of forcing a HADR database 'back online' to transaction processing as fast as possible at any cost, without first absolutely ensuring that the other database server(s) have well and truly been shut down and disconnected from all application connections on all network routes. When those offline database servers are subsequently restarted, no attempt must be made to force the database to standard mode.  Instead, in re-establishing HADR pairing, those databases must only ever be overwritten from fresh backups of the new Primary (if the takeover was forced/in non-peer state), or restarted as standby and allowed to run log catchup if the takeover was unforced and the interim log activity with time elapsed is not going to take longer than a backup/restore.

My appreciation to anyone who made it all the way to the end of this article, it appears I have not reduced my verbosity as a result of temporal progression.
-Paul (Morte) Descovich.