Synopsis - 
    DB2 "Split Brain" is still very much a vulnerability in DB2 v11.5 (multiplatform).
    This article discusses what the term means, and argues that the vulnerability is actually
    multiplied by the increased complexity of server frameworks and recent DB2 feature sets,
    coupled with DB2 Administrators deciding to enact HADR connectivity overrides during
    emergency outages while relying on potentially misleading diagnostic cues.
    It presents brief practical and theoretical research on multiple DB2 Split Brain
    scenarios across DB2 versions, which remain possible despite multiple built-in
    preventative mechanisms.
    It closes with a summary of how best to avoid Split Brain procedurally.
    
    
    Main Article -
    I was recently asked a DB2 interview question: "How does Split
    Brain occur?"
    DB2 High Availability having been a specialty of mine a few years ago, I
    racked my brain for what I thought the simplest summary was, and my
    response was:
    Split Brain is when a [disconnected] HADR Standby is forced to
    take over as Primary while the original Primary is still active, and
    application connections add log records to the new Primary, putting
    the two databases in an irreconcilable state where they cannot
    reconnect in HADR.
    
    The interviewers had an even simpler summary - i.e. Split Brain
    occurs (or can occur) when HADR databases are no longer the same.
    
    Reflecting on it, their simpler answer is indeed more valid than my
    own.  One reason is that two HADR databases in Standard or Primary
    mode which have diverged from each other cannot be reconnected as a
    Primary + Standby pair.  Divergent transaction logs are a different
    matter from a Primary transaction log which is merely further along
    than the Standby has replayed: the latter situation can still
    potentially be brought to peer state, so long as Primary and Standby
    are HADR connected and log replay/catchup is allowed to progress
    before any [forced] takeover.
    It is also true because (in HADR_SYNCMODEs other than SYNC or
    NEARSYNC) a vulnerability opens up every time a HADR Standby falls
    behind the Primary in log apply, or is in a 'Catch-up' state.
    If that catch-up state is not successfully brought back to peer
    state, transactions from the original Primary side will be lost.
    A loss of connectivity to (or outage of) the Primary server before
    all HADR logs are received by the Standby means that peer state can
    never be achieved if the Standby then has to be forced into a
    takeover as Primary.
    If applications start connecting and writing logs to the new Primary
    while the old Primary is disconnected from the Standby, but not from
    applications/batch connections, the two databases become
    irreconcilable.  Some transaction log records are then essentially
    unrecoverable except with a 3rd party log extraction tool, and even
    that is most likely impractical after the fact in a high activity
    OLTP environment.  It would be more expedient to resubmit selected
    application work after assessing referential integrity to find
    missing/incorrect row/column values.
    Bringing both databases back to a connected HADR Primary+Standby state
    after this forced takeover situation (where the two were not in peer
    state at time of takeover) requires the old Primary database to
    be restored/replaced from an online backup of the new Primary and
    started in rollforward pending state as HADR Standby, in the same manner
    as the original Standby was established.  Once in peer
    state, the original Primary can perform an unforced takeover and the
    servers can resume their normal roles.  This is usually required
    where the Standby server has been configured with lower capacity
    (CPU, memory) or is physically located remotely in a DR site.
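
    The difference between a Standby that is merely behind and a pair that
    has truly diverged can be illustrated with a toy model.  The sketch
    below treats each log stream as a plain list of record identifiers; the
    function name compare_log_streams and the record names are hypothetical,
    and this is only an illustration of the concept, not how DB2 compares
    log positions internally.

```python
def compare_log_streams(a, b):
    """Classify two log streams, each a list of record IDs in write order."""
    shared = min(len(a), len(b))
    # If the common-length prefixes differ, each side holds records the
    # other lacks: the streams have diverged and cannot be reconciled.
    if a[:shared] != b[:shared]:
        return "diverged"
    if len(a) == len(b):
        return "identical"
    # One stream is a strict prefix of the other: it is only behind and
    # could still catch up by replaying the missing records.
    return "a_behind" if len(a) < len(b) else "b_behind"


old_primary = ["t1", "t2", "t3", "t4"]
lagging_standby = ["t1", "t2"]           # has not yet replayed t3, t4
forced_new_primary = ["t1", "t2", "t5"]  # took over by force, then wrote t5

print(compare_log_streams(lagging_standby, old_primary))     # a_behind
print(compare_log_streams(forced_new_primary, old_primary))  # diverged
```

    In the second case the old Primary holds t3 and t4 while the new Primary
    holds t5, which is the situation that forces a restore from backup.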
    
    So, with all that said, I was immediately intrigued by their
    follow-up statement that DB2 has protections against Split Brain
    such that it doesn't (cannot) occur in the more recent versions of
    DB2 (e.g. v10 and later).
    I had to do some research to confirm this, because it seemed to me
    that regardless of the DB2 version, Split Brain will always be a
    potential situation which can only be avoided by never performing a
    HADR 'takeover by force', and thus never having two connectable
    databases at the same time. That still allows for Read_On_Standby
    (ROS), which only allows Uncommitted Read (unlogged Selects) and is
    completely safe.
    I hasten to add that by 'connectable', I definitely do not mean that
    DB2 will allow databases with divergent log streams to re-connect
    or re-start in HADR mode.  DB2 already has multiple
    safeguards, including detecting when a Standby candidate has log
    records which the Primary candidate does not contain, and preventing a
    database started as HADR Primary from connecting to a divergent Standby
    (e.g. with SQL1768N rc7).  'Connectable' includes
    the obvious non-HADR 'Standard' mode database, where manual
    intervention is performed by DB2 admins under pressure to get the
    database online again for users.
    As with most things, the inevitability of human error and manual
    intervention means that no built-in mechanism for preventing data
    inconsistency is foolproof.
    
    It may actually be that conceptually, our ideas of what 'Split
    Brain' means are 'divergent' :)
    It is not an uncommon phenomenon for IBM's official terminology to
    differ from usage in the global DB2 support community.
    My idea incorporates the broad definition of data inconsistency
    regardless of the current state of HADR connectivity, whereby the
    log streams have diverged - i.e. each database has log records which
    the other lacks, e.g. due to a takeover by force outside of peer
    state (or more specifically, outside of peer window), along with
    subsequent committed transactions on the new HADR Primary.
    One way a Split Brain can occur even from Peer State is where the
    original Primary server is still running with connected
    applications, but the Standby loses connectivity to the Primary and an
    over-enthusiastic DB2 support person thinks that the HADR error actually
    means the Primary server is down and a HADR takeover [by force] is
    required.  Subsequently, there are two HADR Primary DB2
    databases running and connectable - if any application connects to
    the new Primary and writes a transaction log record, that becomes a Split
    Brain scenario.
    Normally, you might expect other factors to prevent
    Split Brain from occurring in an enterprise/multi-server environment,
    purely because applications typically connect to DB2 not locally
    but remotely, via separate middleware such as WebSphere Application
    Server (WAS), which normally restricts the application-to-DB2
    network route to a single port number on a single IP address on a
    specific network adapter.
    Additionally, reliance on the Automatic Client Reroute mechanism
    should serve as a preventative measure, whereby an attempted
    connection on Server A will be auto-redirected to Server B if DB2
    HADR is currently set to Standby on Server A (and vice versa:
    direct connections to Server B redirect to A when Server B is
    Standby).
    Unfortunately, in security vulnerability parlance, this merely
    widens the 'attack surface' by increasing the number of moving parts
    (points of failure) which can go wrong.
    This is especially true when there are multiple network adapters in
    play (for firewall security zone demarcation as well as load
    balancing), where some are designated for remote admin and some for
    internal application comms. If external zone admin routes fail but
    internal zone application comms remain, it can be difficult for the
    support teams or dashboard/monitoring tools to confirm whether
    applications are still connected to DB2, because they cannot connect
    in order to run basic diagnostic commands.
    Unless there is an explicit process or mechanism preventing
    applications from connecting to one database of a DB2 HADR pair without
    confirmation that it is the only connectable (Primary / Standard
    mode) database, the Split Brain vulnerability exists.
    I assert this because that situation occurred for our support team
    years ago (luckily only in a Disaster Recovery test takeover
    scenario). The HADR takeover by force was issued on the DR
    test Standby after confirmation from the system support team that
    the old Primary was stopped, but some application activity occurred
    on the original Primary even after the HADR takeover by force on the
    DR test Standby, creating the Split Brain.
    In this scenario, two related points of failure existed - 
    1) After the shutdown step on the four Primary DB2 servers was
    given, the Admin network adapters indicated that all four were not
    connectable, which was taken as the signal to proceed as though all
    four were stopped.  Unfortunately, it turned out that at least
    one Primary DB2 server was still running; only its Admin
    network adapter was stopped, and some batch applications were still
    connected and processing in DB2 through the internal adapter - some
    through direct ip/port, not all through WebSphere Application Server (WAS)
    middleware.
    2) The middleware/WebSphere team had to switch over their registered
    application server ip addresses, because in a real DR scenario,
    DB2's Automatic Client Reroute (ACR) cannot auto-redirect when the
    Primary server is down.  There were multiple load balanced
    application servers to switch over and this took time, allowing
    transaction processing unbeknown to the system support teams but
    later discovered by the application team.
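
    The first point of failure above suggests probing every network route
    to the old Primary, not only the admin one, before treating it as down.
    The following is a minimal Python sketch under stated assumptions: the
    function names are hypothetical, and the list of (host, port) routes
    would have to be supplied by whoever knows the adapters and DB2 service
    ports in use.  A successful TCP connect only shows that something is
    listening on that route; a failure on every route is still not proof
    that DB2 is stopped.

```python
import socket


def reachable_routes(routes, timeout=2.0):
    """Return the (host, port) routes that accept a TCP connection."""
    alive = []
    for host, port in routes:
        try:
            # create_connection raises OSError (refused, timed out,
            # unresolvable host) when the route cannot be reached.
            with socket.create_connection((host, port), timeout=timeout):
                alive.append((host, port))
        except OSError:
            pass
    return alive


def old_primary_may_be_serving(routes, timeout=2.0):
    """True if ANY route still answers, i.e. do not force a takeover yet."""
    return bool(reachable_routes(routes, timeout))
```

    It would be called with both the admin and the internal addresses of
    each Primary server, so that a stopped admin adapter alone cannot be
    mistaken for a stopped database server.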
    
    To practically test my assertion on the current DB2 version 11.5, I went
    so far as to create a pair of virtualised DB2 11.5 servers on SLES12 x64,
    pairing the SAMPLE db in HADR via a host-only network
    adapter for internal DB2 connectivity, in addition to the NAT adapter
    for external/remote (admin) connectivity
    (sles12x64a db2inst1 SAMPLE HADR Primary, sles12x64b db2inst1 SAMPLE
    HADR Standby).
    The Split Brain scenario is still possible purely by virtue of the
    ability to perform a db2 takeover by force on the Standby server while
    the old Primary is still running but network disconnected from
    the Standby.
    It is also possible when HADR is stopped after any takeover and
    the databases become connectable in Standard mode.
    Thankfully, if connectivity between Primary and Standby exists
    at the time of a forced takeover, there is a DB2 mechanism which
    prevents subsequent connections on the old Primary.
    To demonstrate: start HADR, then attempt to connect to the old Primary
    after a forced takeover on the still-connected Standby:
    db2inst1@sles12x64b:~> db2 start HADR on db sample as Standby
    DB20000I  The START HADR ON DATABASE command completed successfully.

    db2inst1@sles12x64a:~> db2 start HADR on db sample as Primary
    DB20000I  The START HADR ON DATABASE command completed successfully.

    db2inst1@sles12x64b:~> db2 takeover db sample by force
    SQL0104N  An unexpected token "db" was found following "TAKEOVER".  Expected
    tokens may include:  "HADR".  SQLSTATE=42601
    db2inst1@sles12x64b:~> db2 takeover HADR on db sample by force
    DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
    db2inst1@sles12x64b:~> db2 connect to sample

       Database Connection Information

     Database server        = DB2/LINUXX8664 11.5.0.0
     SQL authorization ID   = DB2INST1
     Local database alias   = SAMPLE

    db2inst1@sles12x64a:~> db2 connect to sample
    SQL1776N  The command cannot be issued on an HADR database. Reason code = "6".
    
    Normally, those two databases cannot now be HADR reconciled because
    both were essentially in PRIMARY state, and to start a database as
    Standby, it needs to be in rollforward pending state.
    However, DB2 has a utility called db2rfpen to force reset a database
    into rollforward pending state.
    We will assume, for convenience and for the purpose of this testing, that
    the takeover by force occurred within Peer Window (otherwise we
    already have Split Brain).  We will also assume for the same
    reasons that the old Primary did not have any remaining connected
    transactions commit after the takeover by force.  The negation
    of either of these assumptions would indicate a Split Brain scenario
    has already occurred in terms of divergence of log streams and
    committed transactions in the database, even if those databases are
    currently preventing new connections.  As stated above, DB2's
    internal safeguards will at least ensure that a HADR
    Start/reconnect will not succeed if the log streams are divergent, but
    that doesn't prevent connections and transactions if HADR is then
    stopped and the databases are in Standard mode.
    Reconciling such divergent databases requires choosing one to be
    discarded and overwritten with a fresh full backup of the other
    database, chosen as the best new Primary.
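
    The Peer Window assumption above is something that can be checked
    arithmetically after the fact.  db2pd -hadr reports PEER_WINDOW_END as
    a timestamp followed by epoch seconds in parentheses, e.g.
    "07/15/2019 15:49:55.000000 (1563169795)".  The Python sketch below
    only illustrates that comparison: the function names are hypothetical,
    it assumes the value was captured on the Standby before the takeover
    and that server clocks are synchronised, and it treats the window end
    itself as already outside the window, to stay on the conservative side.
    DB2 can enforce the check itself through the PEER WINDOW ONLY option of
    TAKEOVER HADR ... BY FORCE.

```python
import re


def peer_window_end_epoch(value):
    """Extract epoch seconds from e.g. '07/15/2019 15:49:55.000000 (1563169795)'."""
    match = re.search(r"\((\d+)\)\s*$", value)
    return int(match.group(1)) if match else None


def forced_takeover_inside_peer_window(peer_window_end_value, takeover_epoch):
    """True only if the takeover time falls strictly before the window end."""
    end = peer_window_end_epoch(peer_window_end_value)
    if end is None:
        # Unknown or NULL window end: assume the worst and report 'outside'.
        return False
    return takeover_epoch < end


window_end = "07/15/2019 15:49:55.000000 (1563169795)"
print(forced_takeover_inside_peer_window(window_end, 1563169700))  # True
print(forced_takeover_inside_peer_window(window_end, 1563169800))  # False
```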
    To reconcile and restart HADR in this otherwise cleanly
    forced takeover scenario, the logical database to start as Standby
    is the old Primary: the Standby issued a takeover by force and
    potentially had transaction logs subsequently applied to it, while
    the old connected Primary was prevented from accepting new connections.
    Attempting to restart SAMPLE as Standby on the old Primary right now gives us
    the following error:
    db2inst1@sles12x64a:~> db2 deactivate db sample
    DB20000I  The DEACTIVATE DATABASE command completed successfully.
    db2inst1@sles12x64a:~> db2 stop HADR on db sample
    DB20000I  The STOP HADR ON DATABASE command completed successfully.
    db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
    SQL1767N  Start HADR cannot complete. Reason code = "1".

    SQL1767N rc1: The database was not in roll forward-pending or roll
    forward-in-progress state when the START HADR AS Standby command was issued.
    
    Not to worry, a quick n dirty db2rfpen + repeat start HADR as
    Standby has that sorted:
    db2inst1@sles12x64a:~> db2rfpen on sample
     ______________________________________________________________________

                          ____    D B 2 R F P E N    ____

                        IBM - Reset ROLLFORWARD Pending State

      The db2rfpen tool is a utility to switch on the database rollforward
      pending state.

      It will also reset the database role to STANDARD if the database is
      identified using the database_alias option.

      In a non-HADR environment, this tool should only be used under the
      advisement of DB2 service.

      In an HADR environment, this tool can be used to reset the database
      role to STANDARD.

      SYNTAX: db2rfpen on < database_alias | -path log_file_header_path >
     ______________________________________________________________________

    Primary Global LFH file    = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.1
    Secondary Global LFH file  = /home/db2inst1/db2inst1/NODE0000/SQL00001/SQLOGCTL.GLFH.2
    Path to LFH files          = /home/db2inst1/db2inst1/NODE0000/SQL00001/MEMBER0000
    Original rollforward pending state is Off.
    Setting rollforward pending State to On.
    Setting backup end time to: 1562854483
    db2inst1@sles12x64a:~> db2 start HADR on db sample as Standby
    DB20000I  The START HADR ON DATABASE command completed successfully.
    
    Thanks to lack of log divergence, the start as Primary on the
    other side also works and reconnects the DB2 HADR pair.
    db2inst1@sles12x64b:~> db2 start HADR on db sample as Primary
    DB20000I  The START HADR ON DATABASE command completed successfully.
    db2inst1@sles12x64b:~> db2pd -d sample -HADR

Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:00:08 -- Date 2019-07-15-15.48.00.153018

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = sles12x64b
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = sles12x64a
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 07/15/2019 15:47:52.622282 (1563169672)
          HEARTBEAT_INTERVAL(seconds) = 30
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 0
                HADR_TIMEOUT(seconds) = 120
        TIME_SINCE_LAST_RECV(seconds) = 5
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
                  LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 374400
            PRIMARY_LOG_FILE,PAGE,POS = S0000009.LOG, 0, 85596001
            STANDBY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000008.LOG, 0, 81520001
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
                     STANDBY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
              STANDBY_REPLAY_LOG_TIME = 07/12/2019 00:14:43.000000 (1562854483)
         STANDBY_RECV_BUF_SIZE(pages) = 512
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 13000
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 120
                      PEER_WINDOW_END = 07/15/2019 15:49:55.000000 (1563169795)
             READS_ON_STANDBY_ENABLED = N
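
    Output like the above is easy to check by script rather than by eye
    under pressure.  The Python sketch below is one possible way to do
    that, assuming the text of db2pd -d <db> -hadr has already been
    captured into a string and that each field sits on its own
    "KEY = VALUE" line; the function names are hypothetical, and with
    multiple standbys the repeated field blocks would overwrite each other
    in this simple version.

```python
def parse_db2pd_hadr(text):
    """Collect 'KEY = VALUE' lines from captured db2pd -hadr output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        # Header and blank lines contain no ' = ' separator and are skipped.
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def pair_is_peer_and_connected(fields):
    """True only when the pair reports PEER state and a live HADR connection."""
    return (fields.get("HADR_STATE") == "PEER"
            and fields.get("HADR_CONNECT_STATUS") == "CONNECTED")


sample = """\
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:00:08
                            HADR_ROLE = PRIMARY
                           HADR_STATE = PEER
                  HADR_CONNECT_STATUS = CONNECTED
                  HADR_LOG_GAP(bytes) = 0
"""
fields = parse_db2pd_hadr(sample)
print(fields["HADR_ROLE"], pair_is_peer_and_connected(fields))  # PRIMARY True
```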
      
    Now, if anyone had stopped HADR on the old primary db and activated
    it as standard mode, we'd have transaction connectivity re-enabled
    and the Split Brain would effectively be irreconcilable.
    
    There is a high number of permutations whereby Split Brain can be
    artificially induced, many of which can only be partially simulated
    on two virtual servers, so I have limited my practical
    experimentation to the above for reasons of expediency, and turned
    today to the theoretical, using the good old internet and IBM knowledge
    base/manuals.
    
    As a result of my brief online research:
    Renowned DB2 expert Steve Pearson perhaps explains it best in this
    article, with a clear and specific definition of Split
    Brain prevention. The initial Q&A has a number of follow-ups
    which still make relevant points, even though they were written in
    2006 when HADR was a new concept.
    https://bytes.com/topic/db2/answers/448566-HADR-split-brain-question
    To wit:
    Question:
    Server A (HADR Primary), Server B (HADR Standby) - 
    Server A Fails, Server B takes over as new Primary.
    Server A restarts, but DB2 is still in HADR Primary state on Server
    A - what prevents applications connecting?
    Answer:
    DB2 will NOT ALLOW new connections to a restarted HADR
    Primary until it successfully reconnects to a HADR Standby.
    
    That was true back in DB2 v8.2 and v9.1 HADR, just as it remains true
    today in v11.5. However, it doesn't cover the scenario where Server
    A was never stopped and restarted, nor does it prevent panicked
    support teams simply issuing a db2 deactivate db <dbname> and
    db2 stop HADR on db <dbname> on a restarted HADR Primary, which
    puts the database back in Standard mode.  This makes both databases connectable by
    all and sundry, without due regard for the new Primary still running
    along happily with application connections on the other server, or
    for the transaction logs now being completely divergent.
    
    In the earliest versions of DB2 HADR (e.g. v8.2, v9.1), the only IBM
    definition of Split Brain is simply given in the context of issuing
    a 'start HADR on db <dbname> as Primary by force' command:
    http://public.dhe.ibm.com/ps/products/db2/info/vr9/pdf/letter/en_US/db2hae90.pdf
    
    Caution: Use the START HADR command with the
      AS PRIMARY BY FORCE option with caution. If the Standby database
      has been changed to a Primary and the original Primary database is
      restarted by issuing the START HADR command with the AS PRIMARY BY
      FORCE option, both copies of your database will be operating
      independently as primaries. (This is sometimes referred to as
      split brain or dual Primary.) In this case, each Primary database
      can accept connections and perform transactions, and neither
      receives and replays the updates made by the other. As a result,
      the two copies of the database will become inconsistent with each other.
As of DB2 v9.7, that particular vulnerability was removed for the case where takeover by force is issued against a Standby in an active, connected HADR pair.  The old HADR Primary is set to an obsolete state whereby it cannot be restarted as HADR Primary, even by force, giving an SQL1776N rc6: "This database is an old primary
database. It cannot be started because the standby has become the new
primary through forced takeover."
However, as the example below shows, it does not prevent Split Brain
caused by transactions connecting after HADR is stopped (db back to Standard
mode):
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Standby -- Up 0 days 00:00:03 -- Date 2019-07-16-08.38.52.353931
                            HADR_ROLE = STANDBY
...
db2inst1@sles12x64b:~> db2 takeover hadr on db sample by force
DB20000I  The TAKEOVER HADR ON DATABASE command completed successfully.
db2inst1@sles12x64b:~> db2pd -d sample -hadr
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:03:29 -- Date 2019-07-16-08.42.18.559234
                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = DISCONNECTED
...
db2inst1@sles12x64b:~> db2 connect to sample
   Database Connection Information
 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = SAMPLE
  
db2inst1@sles12x64a:~> db2 deactivate db sample
DB20000I  The DEACTIVATE DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 takeover hadr on db sample by force
SQL1776N  The command cannot be issued on an HADR database. Reason code = "6".
db2inst1@sles12x64a:~> db2 stop hadr on db sample
DB20000I  The STOP HADR ON DATABASE command completed successfully.
db2inst1@sles12x64a:~> db2 connect to sample
   Database Connection Information
 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = SAMPLE
Before DB2 9.5, there was no integration of DB2 HADR with
    Tivoli System Automation for Multiplatforms (TSAMP). That integration later
    became the DB2 HA feature and was improved in v9.7. DB2 pureScale followed in v9.8, combining
    the best features of High Availability Clustering with Partitioning,
    while still allowing for HADR features to exist in hybrid scenarios
    including multiple Standby/replay targets.
    
    Going to IBM's Knowledge Centre for the latest DB2 LUW (v11.5 as of
    this article), the issue of Split Brain is now covered in at least
    10 locations:
    https://www.ibm.com/support/knowledgecenter/search/split%20brain?scope=SSEPGG_11.5.0
    These mostly deal with methods for DB2 Administrators to use to avoid
    Split Brain scenarios, as well as describing the built-in features
    designed to prevent it (so long as they are not manually overridden).
    Additional scenarios for Split Brain, which never existed before
    concepts such as hybrid cluster+HADR and multiple standby HADR, are
    now covered in v11.5:
    https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.ha.doc/doc/c0059998.html
    This alone is sufficient indication that IBM still acknowledges
    Split Brain as a vulnerability, even with all the safeguards in place
    from the automation in pureScale and the built-in connection prerequisite
    checking for HADR commands and log shipping/replay.
    
    So in summary, in an active connected HADR pair, Split Brain always was,
    and still is, prevented in the situation described by
    Steve Pearson, and since v9.7 it is also prevented where an active, connected HADR pair
    experiences a forced takeover on the Standby.  Even so, the
    potential for Split Brain (through disconnection and manual
    intervention) still exists even in DB2 v11.5, as my descriptions
    above show.  It was unquestionably mitigated by the pureScale
    / TSA scripted mechanisms, which only force a HADR takeover in the event
    of an actual cluster Primary server failure, or force a
    takeover/failover and node restart in the event of a cluster quorum
    failure, taking out a lot of the human error factor. However, the
    vulnerability remains wherever a hybrid or unmanaged HADR pair
    exists and steps are not taken to ensure only one database is
    connectable by applications at a time.
    If you take one point from this entire article on avoiding Split
    Brain, it should be this: never succumb to the pressure to force a
    HADR database 'back online' for transaction processing as fast as
    possible at any cost, without first absolutely ensuring that the
    other database server(s) have well and truly been shut down and
    disconnected from all application connections on all network routes.
    When those offline database servers are subsequently restarted, no
    attempt must be made to force their databases to Standard
    mode.  Instead, in re-establishing HADR pairing, those
    databases must only ever be overwritten from fresh backups of the
    new Primary (if the takeover was forced/in non-peer state), or
    restarted as Standby and allowed to run log catchup if the takeover
    was unforced and replaying the interim log activity is not
    going to take longer than a backup/restore.
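
    The re-pairing rule in the paragraph above can be written down as a
    small decision function.  This is a Python sketch of the rule exactly
    as summarised here, not an IBM procedure: the function name, parameter
    names, and the time estimates in minutes are all hypothetical inputs
    that an administrator would have to supply.  It deliberately takes the
    conservative branch for any forced takeover, even though the test
    earlier showed that a forced takeover with no log divergence could be
    re-paired with db2rfpen.

```python
def rejoin_method(takeover_forced, was_in_peer_state,
                  catchup_minutes, restore_minutes):
    """Choose how the old Primary should rejoin the HADR pair as Standby."""
    # A forced or non-peer takeover may have left divergent logs, so the
    # old Primary is overwritten from a fresh backup of the new Primary.
    if takeover_forced or not was_in_peer_state:
        return "restore_from_new_primary_backup"
    # Unforced takeover: log catchup is only worthwhile if replaying the
    # interim log activity is no slower than a backup/restore.
    if catchup_minutes <= restore_minutes:
        return "start_as_standby_and_catch_up"
    return "restore_from_new_primary_backup"


print(rejoin_method(True, True, 5, 60))     # restore_from_new_primary_backup
print(rejoin_method(False, True, 5, 60))    # start_as_standby_and_catch_up
print(rejoin_method(False, True, 240, 60))  # restore_from_new_primary_backup
```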
    
My appreciation to anyone who made it all the way to the end of this article. It appears
    I have not reduced my verbosity as a result of temporal progression.
    -Paul (Morte) Descovich.
  
 

