Synopsis -
DB2 "Split Brain" is still very much a vulnerability in DB2 v11.5 (multiplatform).
The concept and meaning are discussed, along with the point that the vulnerability is actually multiplied by increased complexity in server frameworks and recent DB2 feature sets, coupled with DB2 Administrators deciding to enact HADR connectivity overrides during emergency outages on the strength of potentially misleading diagnostic cues.
Brief practical and theoretical research is presented on multiple DB2 Split Brain scenarios across DB2 versions, even given multiple built-in preventative mechanisms.
A summary is given on how best to procedurally avoid Split Brain.
Main Article -
I was recently asked a DB2 interview question: "How does Split Brain occur?"
DB2 High Availability being a specialty of mine (a few years ago), I racked my brain for what I thought the simplest summary was. My response:
Split Brain is when you have a [disconnected] HADR Standby force a takeover as Primary (while the original Primary is still active), and application connections add log records to the new Primary which put the two databases in an irreconcilable state where they cannot reconnect in HADR.
The interviewers had an even simpler summary - i.e. Split Brain
occurs (or can occur) when HADR databases are no longer the same.
Reflecting on it, their simpler answer is indeed more valid than my own. This is not only because two HADR Standard/Primary mode databases which become divergent from each other cannot be reconnected as a Primary + Standby (i.e. divergent transaction logs, as opposed to a Primary transaction log which is merely further ahead than the Standby has replayed - the latter situation can still potentially be brought to peer state, so long as Primary and Standby remain HADR connected and log replay/catch-up is allowed to progress before any [forced] takeover).
It is also true because (in HADR_SYNCMODEs other than SYNC or NEARSYNC) a vulnerability opens up every time a HADR Standby falls behind the Primary in log apply, or is in a 'catch-up' state.
If that catch-up state is not successfully brought back to peer state, transactions from the original Primary side will be lost: a loss of connectivity to (or outage of) the Primary server before all HADR logs have been received by the Standby means that peer state can never be achieved if the Standby has to be forced into a takeover as Primary.
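For reference, the role, state and sync mode of a HADR pair can be checked at any time with db2pd, the same monitoring command used in the tests later in this article. A minimal check, assuming the SAMPLE database used there:

db2pd -d sample -hadr | grep -E "HADR_ROLE|HADR_STATE|HADR_SYNCMODE|HADR_CONNECT_STATUS"

Anything other than HADR_STATE = PEER on an established pair means the Standby is behind or still catching up, and a forced takeover at that moment risks losing transactions.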
If applications start connecting and writing logs to the new Primary while the old Primary is disconnected from the Standby, but not from applications/batch connections, the two databases become irreconcilable, with some transaction log records essentially unrecoverable except with a 3rd-party log extraction tool (and most likely practically unusable after the fact in a high-activity OLTP environment - it would be more expedient to selectively resubmit application work after assessing referential integrity to find missing/incorrect row/column values).
Bringing both databases back to a connected HADR Primary + Standby state after this forced takeover situation (where the two were not in peer state at the time of takeover) will require the old Primary database to be restored/replaced from an online backup of the new Primary and started in rollforward pending state as HADR Standby, in the same manner as the original Standby was established. Once in peer state, the original Primary can perform an unforced takeover and the servers can resume their normal roles, which is usually required where the Standby server has been configured with lower capacity (CPU, memory) or is physically located remotely in a DR site.
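A minimal sketch of that re-establishment sequence follows, with the database name, backup path and timestamp as placeholders:

(on the new Primary server)
db2 backup db sample online to /backup include logs
(copy the backup image to the old Primary server, then on the old Primary server)
db2 restore db sample from /backup taken at <timestamp> replace existing
db2 start hadr on db sample as standby

No rollforward is issued after the restore, deliberately leaving the database in rollforward pending state so that it can be started as a HADR Standby.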
So, with all that said, I was immediately intrigued by their follow-up statement that DB2 has protections against Split Brain such that it doesn't (cannot) occur in the more recent versions of DB2 (e.g. v10 and later).
I had to do some research to confirm this, because it seemed to me
that regardless of the DB2 version, Split Brain will always be a
potential situation which can only be avoided by never performing a
HADR 'takeover by force', and thus never having two connectable
databases at the same time. That still allows for Reads on Standby (ROS), which only permits Uncommitted Read (unlogged selects) and is completely safe.
I hasten to add that by 'connectable', I definitely do not mean that
DB2 will allow databases with divergent log streams to re-connect
or re-start in HADR mode. DB2 already has multiple
safeguards, including detecting when a Standby candidate has log records which the Primary candidate does not contain, and preventing a HADR start as Primary from connecting to a divergently started Standby (e.g. with SQL1768N rc7). 'Connectable' includes
the obvious non-HADR 'Standard' mode database where manual
intervention is performed by DB2 admins under pressure to get the
database online again for users.
As with most things, the inevitability of human error and manual intervention means that no built-in mechanism to prevent data inconsistency is foolproof.
It may actually be that, conceptually, our ideas of what 'Split Brain' means are 'divergent' :)
It is not an uncommon phenomenon for IBM's official terminology to
differ from usage in the global DB2 support community.
My idea incorporates the broad definition of data inconsistency regardless of the current state of HADR connectivity, whereby the log streams have diverged - i.e. each database has log records which the other lacks, e.g. due to a takeover by force outside of peer state (or, more specifically, outside of the peer window), along with subsequent committed transactions on the new HADR Primary.
One way a Split Brain can occur even from peer state is where the original Primary server is still running with connected applications, but the Standby loses connectivity to the Primary and an over-enthusiastic DB2 support person takes the HADR error to mean the Primary server is down and a HADR takeover [by force] is required. Subsequently, there are two HADR Primary DB2 databases running and connectable - if any application connects to the new Primary and writes transaction log records, that becomes a Split Brain scenario.
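One command-level mitigation for exactly this judgement call is the PEER WINDOW ONLY option of the forced takeover (applicable when HADR_PEER_WINDOW has been configured, as it is on the test pair later in this article):

db2 takeover hadr on db sample by force peer window only

This refuses the takeover unless the Standby was in peer state within the configured peer window, rather than blindly promoting a Standby that may already be missing log records.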
Normally, you might expect there to be other factors preventing Split Brain from occurring in an enterprise/multi-server environment, purely because applications typically connect to DB2 not locally but remotely, via separate middleware such as WebSphere Application Server (WAS), and that normally restricts the application-to-DB2 network route to a single port number on a single IP address on a specific network adapter.
Additionally, reliance on the Automatic Client Reroute mechanism should serve as a preventative measure, whereby an attempted connection to Server A will be automatically redirected to Server B if DB2 HADR is currently set to Standby on Server A (and vice versa: direct connections to Server B redirect to A when Server B is Standby).
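For reference, the alternate server that drives Automatic Client Reroute is registered on each side of the pair with the UPDATE ALTERNATE SERVER command. A minimal sketch, assuming the hostnames used in the tests later in this article and the default DB2 port 50000:

(on the Primary instance)
db2 update alternate server for database sample using hostname sles12x64b port 50000
(on the Standby instance)
db2 update alternate server for database sample using hostname sles12x64a port 50000

The alternate server details are cached by the client on its first successful connection, which is one reason ACR alone cannot be relied upon in every DR scenario.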
Unfortunately, in security vulnerability parlance, this merely
widens the 'attack surface' - increasing the number of moving parts
(points of failure) which can go wrong.
It is especially true when there are multiple network adapters in play (for firewall security zone demarcation as well as load balancing), where some are designated for remote admin and others for internal application comms - if the external zone admin routes fail but internal zone application comms remain, it can be difficult for the support teams or dashboard/monitoring tools to confirm whether applications are still connected to DB2, because they cannot connect in order to run basic diagnostic commands.
Unless there is an explicit process or mechanism preventing applications from connecting to a database of a DB2 HADR pair without confirmation that it is the only connectable (Primary / Standard mode) database, the Split Brain vulnerability exists.
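A minimal manual version of such a check, before allowing (or restoring) application connectivity to either side, might look like the following, assuming local command-line access to each instance:

db2pd -d sample -hadr | grep -E "HADR_ROLE|HADR_CONNECT_STATUS"
db2 list applications for database sample

Only when exactly one database of the pair is confirmed as the active Primary (or Standard mode) database, and the other is confirmed as Standby or down, should applications be pointed at it.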
I assert this because that situation occurred for our support team years ago (luckily only during a Disaster Recovery test takeover scenario), whereby the HADR takeover by force was issued on the DR test Standby after confirmation from the system support team that the old Primary was stopped, but some application activity occurred on the original Primary even after the HADR takeover by force on the DR test Standby, creating the Split Brain.
In this scenario, two related points of failure existed -
1) After the shutdown step on the four Primary DB2 servers was given, the Admin network adapters indicated that all four were not connectable, which was taken as the signal to proceed as though all four were stopped. Unfortunately, it turned out that at least one Primary DB2 server was still running - only its Admin network adapter was stopped - and some batch applications were still connected and processing in DB2 through the internal adapter, some via direct IP/port rather than through the WebSphere Application Server (WAS) middleware.
2) The middleware/WebSphere team had to switch over their registered application server IP addresses, because in a real DR scenario DB2's Automatic Client Reroute (ACR) cannot redirect clients when the Primary server is down. There were multiple load-balanced application servers to switch over and this took time, allowing transaction processing to continue unbeknownst to the system support teams; it was later discovered by the application team.
To practically test my assertion on current DB2 version 11.5, I went
so far as to create a pair of virtualised DB2 11.5 on SLES12x64
servers, pairing the SAMPLE db in HADR via a host-only network
adapter for internal DB2 connectivity in addition to the NAT adapter
for external/remote (admin) connectivity.
(sles12x64a db2inst1 SAMPLE HADR Primary, sles12x64b db2inst1 SAMPLE
HADR Standby)
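For anyone wanting to reproduce a similar pairing, the HADR-relevant database configuration would be along the following lines (a sketch only; the HADR service ports 51012/51013 are placeholder values, while the sync mode and peer window match the db2pd output shown later):

(on sles12x64a)
db2 update db cfg for sample using HADR_LOCAL_HOST sles12x64a HADR_REMOTE_HOST sles12x64b HADR_LOCAL_SVC 51012 HADR_REMOTE_SVC 51013 HADR_REMOTE_INST db2inst1 HADR_SYNCMODE NEARSYNC HADR_PEER_WINDOW 120

with the mirror-image HADR_LOCAL_*/HADR_REMOTE_* values set on sles12x64b, and the Standby seeded from an online backup of the Primary (restored without rollforward) before START HADR was issued on each side.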
The Split Brain scenario is still possible purely by virtue of the
ability to perform db2 takeover by force on the Standby server while
the old Primary is still running but network disconnected from
the Standby.
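(In a virtualised setup like this, the 'network disconnected' condition can be simulated by simply downing the host-only adapter carrying the HADR traffic on one guest while the NAT/admin adapter stays up - e.g. 'ip link set <hadr-interface> down' as root, where the interface name is a placeholder that depends on the guest's adapter ordering.)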
It is also possible when HADR is stopped after any takeover and
databases become connectable as Standard mode.
Thankfully, if connectivity between Primary and Standby exists at the time of a forced takeover, there is a DB2 mechanism which prevents subsequent connections on the old Primary. To demonstrate: start HADR on both sides, then attempt to connect to the old Primary after a forced takeover on the still-connected Standby:
db2inst1@sles12x64b:~> db2 start HADR on db sample as Standby
DB20000I  The START HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 start HADR on db sample as Primary
DB20000I  The START HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2 takeover db sample by force
SQL0104N  An unexpected token "db" was found following "TAKEOVER".  Expected tokens may include: "HADR".  SQLSTATE=42601
db2inst1@sles12x64b:~> db2 takeover HADR on db sample by force
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2 connect to sample
Database Connection Information
Database server = DB2/LINUXX8664 11.5.0.0
SQL authorization ID = DB2INST1
Local database alias = SAMPLE
db2inst1@sles12x64a:~> db2 connect to sample
SQL1776N  The command cannot be issued on an HADR database.  Reason code = "6".
Normally, those two databases cannot now be HADR reconciled because
both were essentially in PRIMARY state, and to start a database as
Standby, it needs to be in rollforward pending state.
However, DB2 has a utility called db2rfpen to force reset a database
into rollforward pending state.
We will assume, for convenience and the purposes of this testing, that the takeover by force occurred within the peer window (otherwise we already have Split Brain). We will also assume, for the same reasons, that the old Primary did not have any remaining connected transactions commit after the takeover by force. The negation of any of these assumptions would indicate a Split Brain scenario has already occurred in terms of divergence of log streams and committed transactions in the database, even if those databases are currently preventing new connections. As stated above, DB2's internal safeguards will at least ensure that a successful HADR start/reconnect will not occur if the log streams are divergent, but that doesn't prevent connections and transactions if HADR is then stopped and the databases are in standard mode.
Reconciling such divergent databases requires choosing one to be
discarded and overwritten with a fresh full backup of the other
database chosen as the best new Primary.
In order to reconcile and restart HADR in this otherwise cleanly forced takeover scenario - since the Standby issued the takeover by force and potentially had transaction logs subsequently applied to it, while the old connected Primary was prevented from accepting new connections - the logical database to start as Standby is the old Primary.
Attempting to restart SAMPLE on the old Primary right now gives us
the following error:
db2inst1@sles12x64a:~> db2 deactivate db sample
DB20000I  The DEACTIVATE DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 stop HADR on db sample
DB20000I  The STOP HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
SQL1767N  Start HADR cannot complete.  Reason code = "1".
(SQL1767N rc1: The database was not in rollforward-pending or rollforward-in-progress state when the START HADR AS STANDBY command was issued.)
Not to worry, a quick n dirty db2rfpen + repeat start HADR as
Standby has that sorted:
db2inst1@sles12x64a:~> db2rfpen on sample
______________________________________________________________________
____          D B 2 R F P E N          ____
IBM - Reset ROLLFORWARD Pending State

The db2rfpen tool is a utility to switch on the database rollforward
pending state. It will also reset the database role to STANDARD if the
database is identified using the database_alias option.
In a non-HADR environment, this tool should only be used under the
advisement of DB2 service. In an HADR environment, this tool can be
used to reset the database role to STANDARD.

SYNTAX: db2rfpen on < database_alias | -path log_file_header_path >
______________________________________________________________________
Primary Global LFH file   = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.1
Secondary Global LFH file = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.2
Path to LFH files         = /home/db2inst1/db2inst1/NODE0000/SQL00001/MEMBER0000

Original rollforward pending state is Off.
Setting rollforward pending State to On.
Setting backup end time to: 1562854483
db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
DB20000I  The START HADR ON DATABASE command completed successfully.
Thanks to the lack of log divergence, the start as Primary on the other side also works and reconnects the DB2 HADR pair.
db2inst1@sles12x64b:~> db2 start HADR on db sample as Primary
DB20000I  The START HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2pd -d sample -HADR
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:00:08 -- Date 2019-07-15-15.48.00.153018
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS = TCP_PROTOCOL
PRIMARY_MEMBER_HOST = sles12x64b
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = sles12x64a
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 07/15/2019 15:47:52.622282 (1563169672)
HEARTBEAT_INTERVAL(seconds) = 30
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 0
HADR_TIMEOUT(seconds) = 120
TIME_SINCE_LAST_RECV(seconds) = 5
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 374400
PRIMARY_LOG_FILE,PAGE,POS = S0000009.LOG, 0, 85596001
STANDBY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
STANDBY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
STANDBY_REPLAY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
STANDBY_RECV_BUF_SIZE(pages) = 512
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 13000
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 120
PEER_WINDOW_END = 07/15/2019 15:49:55.000000 (1563169795)
READS_ON_STANDBY_ENABLED = N
Now, if anyone had stopped HADR on the old Primary db and activated it in standard mode (instead of resetting it to Standby as above), transaction connectivity would have been re-enabled on both sides and the Split Brain would effectively be irreconcilable.
There are a large number of permutations whereby Split Brain can be artificially induced, many of which can only be partially simulated on two virtual servers, so I have limited my practical experimentation to the above for reasons of expediency, and turned instead to the theoretical, using the good old internet and IBM knowledge base/manuals.
As a result of my brief online research:
Renowned DB2 expert Steve Pearson perhaps explains it best in this
article, with a clear and specific definitional delineation of Split
Brain prevention. The initial Q&A has a number of follow-ups
which still make relevant points, even though they were written in
2006 when HADR was a new concept.
https://bytes.com/topic/db2/answers/448566-HADR-split-brain-question
to wit:
Question:
Server A (HADR Primary), Server B (HADR Standby) -
Server A Fails, Server B takes over as new Primary.
Server A restarts, but DB2 is still in HADR Primary state on Server
A - what prevents applications connecting?
Answer:
DB2 will NOT ALLOW new connections to a restarted HADR
Primary until it successfully reconnects to a HADR Standby.
That was true back in DB2 v8.2 and v9.1 HADR, just as it remains true today in v11.5. However, it doesn't negate the scenario where Server A was never stopped and restarted, nor does it prevent panicked support teams from simply issuing a db2 deactivate db <dbname> and db2 stop hadr on db <dbname> on a restarted HADR Primary, which puts the database back in standard mode. This makes both databases connectable by all and sundry, without due regard for the new Primary still running along happily with application connections on the other server, or for the transaction logs now being completely divergent.
In the earliest versions of DB2 HADR (e.g. v8.2, v9.1), the only IBM definition of Split Brain was given simply in the context of issuing a 'start hadr on db <dbname> as primary by force' command:
http://public.dhe.ibm.com/ps/products/db2/info/vr9/pdf/letter/en_US/db2hae90.pdf
Caution: Use the START HADR command with the
AS PRIMARY BY FORCE option with caution. If the Standby database
has been changed to a Primary and the original Primary database is
restarted by issuing the START HADR command with the AS PRIMARY BY
FORCE option, both copies of your database will be operating
independently as primaries. (This is sometimes referred to as
split brain or dual Primary.) In this case, each Primary database
can accept connections and perform transactions, and neither
receives and replays the updates made by the other. As a result,
the two copies of the database will become inconsistent with each other.
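For clarity, the command form that caution refers to - issued on the old Primary after the Standby has already become the new Primary - is, using the database name from this article's tests:

db2 start hadr on db sample as primary by force

It was this restart path, rather than the forced takeover on the Standby itself, that the early documentation flagged as the split brain / dual primary hazard.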
As of DB2 v9.7, that particular vulnerability was removed: when a takeover by force is issued against a Standby in an active, connected HADR pair, the old HADR Primary is set to an obsolete state whereby it cannot be restarted as HADR Primary, even by force, giving SQL1776N rc6: This database is an old primary database. It cannot be started because the standby has become the new primary through forced takeover.
However, as the example below shows, it does not prevent Split Brain arising from connections and transactions after HADR is stopped (putting the database back into Standard mode):
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Standby -- Up 0 days 00:00:03 -- Date 2019-07-16-08.38.52.353931
HADR_ROLE = STANDBY
...
db2inst1@sles12x64b:~> db2 takeover hadr on db sample by force
DB20000I The TAKEOVER HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:03:29 -- Date 2019-07-16-08.42.18.559234
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = DISCONNECTED
...
db2inst1@sles12x64b:~> db2 connect to sample
Database Connection Information
Database server = DB2/LINUXX8664 11.5.0.0
SQL authorization ID = DB2INST1
Local database alias = SAMPLE
db2inst1@sles12x64a:~> db2 deactivate db sample
DB20000I The DEACTIVATE DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 takeover hadr on db sample by force
SQL1776N The command cannot be issued on an HADR database. Reason code = "6".
db2inst1@sles12x64a:~> db2 stop hadr on db sample
DB20000I The STOP HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 connect to sample
Database Connection Information
Database server = DB2/LINUXX8664 11.5.0.0
SQL authorization ID = DB2INST1
Local database alias = SAMPLE
Before DB2 v9.5, there was no integration of DB2 HADR with Tivoli System Automation for Multiplatforms (TSAMP); that integration later became the DB2 HA feature, was better integrated in v9.7, and was subsequently replaced by DB2 pureScale in v9.8 to combine the best features of high-availability clustering with partitioning, while still allowing HADR features to exist in hybrid scenarios including multiple Standby/replay targets.
Going to IBM's Knowledge Centre for the latest DB2 LUW (v11.5 as of this article), the issue of Split Brain is now covered in at least 10 locations,
https://www.ibm.com/support/knowledgecenter/search/split%20brain?scope=SSEPGG_11.5.0
mostly dealing with methods DB2 Administrators can use to avoid Split Brain scenarios, as well as describing the built-in features designed to prevent it (so long as they are not manually overridden).
Additional Split Brain scenarios which never existed before concepts such as hybrid cluster+HADR and multiple-standby HADR are now also covered in v11.5:
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.ha.doc/doc/c0059998.html
This alone is sufficient indication that IBM still acknowledges Split Brain as a vulnerability, even with all the safeguards in place from the automation in pureScale and the built-in connection prerequisite checking for HADR commands and log shipping/replay.
So, in summary: Split Brain was always, and still is, prevented in the situation described by Steve Pearson, and since v9.7 it is also prevented where an active, connected HADR pair experiences a forced takeover on the Standby. Even so, the potential for Split Brain (through disconnection and manual intervention) still exists even in DB2 v11.5, as described above. It is unquestionably mitigated by the pureScale / TSA scripted mechanisms, which only force a HADR takeover in the event of an actual cluster Primary server failure, or force a takeover/failover and node restart in the event of a cluster quorum failure, taking out a lot of the human-error factor. However, the vulnerability remains wherever a hybrid or unmanaged HADR pair exists and steps are not taken to ensure only one database is connectable by applications at a time.
If you take one point from this entire article on avoiding Split Brain, let it be this: never succumb to the pressure to force a HADR database 'back online' for transaction processing as fast as possible at any cost. First, absolutely ensure that the other database server(s) have well and truly been shut down and disconnected from all application connections on all network routes, and ensure that when those offline database servers are later restarted, no attempt is made to force the database into standard mode. Instead, in re-establishing the HADR pairing, those databases must only ever be overwritten from fresh backups of the new Primary (if the takeover was forced / in a non-peer state), or restarted as Standby and allowed to run log catch-up (if the takeover was unforced and the interim log activity over the elapsed time is not going to take longer to replay than a backup/restore).
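As a final practical aid, a minimal pre-takeover verification along those lines might look like the following (a sketch only, using the database and hostnames from this article's tests; the exact checks will depend on your network zones and tooling):

(on the Standby, before any forced takeover)
db2pd -d sample -hadr | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS|PEER_WINDOW_END"
(on the old Primary host, via every network route that applications could use)
db2 list applications for database sample
db2 deactivate db sample
(and only then, on the Standby)
db2 takeover hadr on db sample by force peer window only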
My appreciation to anyone who made it all the way to the end of this article; it appears I have not reduced my verbosity as a result of temporal progression.
-Paul (Morte) Descovich.