【MySQL】【高可用】基于MHA架构的MySQL高可用故障自动切换架构

基于MHA架构的MySQL高可用切换架构

环境：

CentOS7+MySQL 5.7 + GTID 业务系统：mainBusiness

盐边网站制作公司哪家好，找成都创新互联公司！从网页设计、网站建设、微信开发、APP开发、响应式网站开发等网站项目制作，到程序开发，运营维护。成都创新互联公司成立于2013年到现在10年的时间，我们拥有了丰富的建站经验和运维经验，来保证我们的工作的顺利进行。专注于网站建设就选成都创新互联公司。

node1 : 192.168.1.109 port:3109

node2 : 192.168.1.110 port:3110

VIP :192.168.1.88

manager:192.168.1.8

1.背景：

除了galera cluster(Mariadb Cluster,GroupReplication,PXC)和KeepAlived之外，业界广泛使用的MySQL高可用就是MHA架构了。

MHA作者在离开DeNA加入facebook后就极少更新了这个工具了。

2.安装：

rpm包安装的方式最简单，但是作者在27天前增加了对从库上启用了super-read-only参数的优化，简而言之就是：当开启这个参数后，有可能会发生配置文件中的用户无法对差异事务进行应用的问题。于是增加了判断super-read-only参数是否开启的逻辑判断，若开启，则先关闭此参数，然后进行应用差异事务然后重新开启。

所以这里我们采用编译Github上最新的代码的办法进行安装。地址为：

https://github.com/yoshinorim/mha4mysql-manager

https://github.com/yoshinorim/mha4mysql-node

在node1和node2上：

#创建工作文件夹
mkdir -p /data/mha/
#下载源码包
wget  https://codeload.github.com/yoshinorim/mha4mysql-node/zip/master -O /usr/local/src/mha-node.zip
wget https://codeload.github.com/yoshinorim/mha4mysql-manager/zip/master -O /usr/local/src/mha-manager.zip
#解压
unzip /usr/local/src/mha-manager.zip -d /usr/local/src
unzip /usr/local/src/mha-node.zip -d /usr/local/src
#安装perl及其相关依赖
yum -y install perl perl-ExtUtils-MakeMaker perl-ExtUtils-CBuilder perl-Parallel-ForkManager  perl-Config-Tiny 
yum -y install 'perl(inc::Module::Install)'
yum -y install 'perl(Test::Without::Module)'
yum -y install 'perl(Log::Dispatch)'
#编译管理端
cd /usr/local/src/mha4mysql-manager-master/
perl Makefile.PL #提示时输入y
make &&make install
#编译节点端
cd /usr/local/src/mha4mysql-node-master/
perl Makefile.PL 
make &&make install
#更改配置文件
cp /usr/
#在数据库中创建用于MHA系统工作的管理员权限账号
#node1
mysql --login-path=3109 -e 'create user mha@'192.168.1.8' identified by 'sa123456''
mysql --login-path=3109 -e 'grant all privileges on *.* to mha@'192.168.1.8''
mysql --login-path=3109 -e 'flush privilges'
#node2
mysql --login-path=3110 -e 'create user mha@'192.168.1.8' identified by 'sa123456''
mysql --login-path=3110 -e 'grant all privileges on *.* to mha@'192.168.1.8''
mysql --login-path=3110 -e 'flush privilges'

在manager上

#创建工作目录
mkdir -p /data/mha/mainBusiness
#下载源码包
wget  https://codeload.github.com/yoshinorim/mha4mysql-node/zip/master -O /usr/local/src/mha-node.zip
wget https://codeload.github.com/yoshinorim/mha4mysql-manager/zip/master -O /usr/local/src/mha-manager.zip
#解压
unzip /usr/local/src/mha-manager.zip -d /usr/local/src
unzip /usr/local/src/mha-node.zip -d /usr/local/src
#安装perl及其相关依赖
yum -y install perl perl-ExtUtils-MakeMaker perl-ExtUtils-CBuilder perl-Parallel-ForkManager  perl-Config-Tiny 
yum -y install 'perl(inc::Module::Install)'
yum -y install 'perl(Test::Without::Module)'
yum -y install 'perl(Log::Dispatch)'
#编译管理端
cd /usr/local/src/mha4mysql-manager-master/
perl Makefile.PL #提示时输入y
make &&make install
#编译节点端
cd /usr/local/src/mha4mysql-node-master/
perl Makefile.PL 
make &&make install
#创建配置文件
cp /usr/local/src/mha4mysql-manager-master/samples/conf/* /etc/mha
mv /etc/mha/app1.cnf /etc/mha/mainBusiness.cnf

配置文件内容：

manager_workdir=/data/mha/mainBusiness                 #设置MHA的工作目录
manager_log=/data/mha/mainBusiness/manager.log         #MHA manager的日志输出
remote_workdir=/data/mha/                              #预设MHA node端的工作目录

master_binlog_dir= /data/mysql/3109/log/,/data/mysql/3110/log/  #预设MHA node端的binlog目录                                                                                       
#secondary_check_script= masterha_secondary_check -s 192.168.1.109 -s 192.168.1.110                                                                    
master_ip_failover_script= /etc/mha/master_ip_failover              #故障时VIP更改
master_ip_online_change_script= /etc/mha/master_ip_online_change    #在线切换时VIP更改
report_script= /etc/mha/send_report    #报警脚本
ping_interval=1                                       #设置MHA manager的检测间隔（1秒）

[server1]
hostname=192.168.1.109
port=3109
ssh_user=root
ssh_port=22
candidate_master=1                                    #设置该节点是否可以提升为主，1为是,0否
check_repl_delay=0                                    #发生故障后是否检查本实例主从落后程度,0否,1是

[server2]
hostname=192.168.1.110
port=3110
ssh_user=root
ssh_port=22
candidate_master=1                                    #设置该节点是否可以提升为主，1为是,0否
check_repl_delay=0                                    #发生故障后是否检查本实例主从落后程度,0否,1是

报警脚本修改：

vim /etc/mha/send_report
#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;

#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp='smtp.mxhichina.com';
my $mail_from='zabbix@xxx.com';
my $mail_user='zabbix@xxx.com';
my $mail_pass='123456';
my $mail_to=['psyduck007@outlook.com'];
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);

# Do whatever you want here
mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);

sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
       or die "Can't open the debug      file:$!\n";
    my $sender = new Mail::Sender {
        ctype       => 'text/plain; charset=utf-8',
        encoding    => 'utf-8',
        smtp        => $smtp,
        from        => $mail_from,
        auth        => 'LOGIN',
        TLS_allowed => '0',
        authid      => $user,
        authpwd     => $passwd,
        to          => $mail_to,
        subject     => $subject,
        debug       => $DEBUG
    };

    $sender->MailMsg(
        {   msg   => $msg,
            debug => $DEBUG
        }
    ) or print $Mail::Sender::Error;
    return 1;
}

增加定时清理relay log的脚本：

cat >/etc/auto_clean_relay_log.sh<> $log_dir/purge_relay_logs.log 2>&1

crontab -e

0 0 */3 * * sh /etc/auto_clean_relay_log.sh

修改故障切换VIP脚本：

vim /etc/mha/master_ip_failover
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
#自定义该组机器的vip
my $vip = "192.168.1.100";
my $if = "eth0";
my (
  $command,        $ssh_user,         $orig_master_host,
  $orig_master_ip, $orig_master_port, $new_master_host,
  $new_master_ip,  $new_master_port,  $new_master_user,
  $new_master_password
);

GetOptions(
  'command=s'             => \$command,
  'ssh_user=s'            => \$ssh_user,
  'orig_master_host=s'    => \$orig_master_host,
  'orig_master_ip=s'      => \$orig_master_ip,
  'orig_master_port=i'    => \$orig_master_port,
  'new_master_host=s'     => \$new_master_host,
  'new_master_ip=s'       => \$new_master_ip,
  'new_master_port=i'     => \$new_master_port,
  'new_master_user=s'     => \$new_master_user,
  'new_master_password=s' => \$new_master_password,
);

sub add_vip {
    my $output1 = `ssh -o ConnectTimeout=15  -o ConnectionAttempts=3 $orig_master_host /sbin/ip addr del $vip/32 dev $if`;
    my $output2 = `ssh -o ConnectTimeout=15  -o ConnectionAttempts=3 $new_master_host /sbin/ip addr add $vip/32 dev $if`;

}
exit &main();

sub main {
  if ( $command eq "stop" || $command eq "stopssh" ) {

    # $orig_master_host, $orig_master_ip, $orig_master_port are passed.
    # If you manage master ip address at global catalog database,
    # invalidate orig_master_ip here.
    my $exit_code = 1;
    eval {

      # updating global catalog, etc
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {

    # all arguments are passed.
    # If you manage master ip address at global catalog database,
    # activate new_master_ip here.
    # You can also grant write access (create user, set read_only=0, etc) here.
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      $new_master_handler->disable_log_bin_local();
      print "Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print "Creating app user on the new master..\n";
      #FIXME_xxx_create_user( $new_master_handler->{dbh} );
      $new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
      &add_vip();
      $exit_code = 0;
    };
    if ($@) {
      warn $@;

      # If you want to continue failover, exit 10.
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

sub usage {
  print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

vim /etc/mha/master_ip_online_change

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
#添加vip定义
my $vip = "192.168.1.100";
my $if = "eth0";

my (
  $command,              $orig_master_is_new_slave, $orig_master_host,
  $orig_master_ip,       $orig_master_port,         $orig_master_user,
  $orig_master_password, $orig_master_ssh_user,     $new_master_host,
  $new_master_ip,        $new_master_port,          $new_master_user,
  $new_master_password,  $new_master_ssh_user,
);
GetOptions(
  'command=s'                => \$command,
  'orig_master_is_new_slave' => \$orig_master_is_new_slave,
  'orig_master_host=s'       => \$orig_master_host,
  'orig_master_ip=s'         => \$orig_master_ip,
  'orig_master_port=i'       => \$orig_master_port,
  'orig_master_user=s'       => \$orig_master_user,
  'orig_master_password=s'   => \$orig_master_password,
  'orig_master_ssh_user=s'   => \$orig_master_ssh_user,
  'new_master_host=s'        => \$new_master_host,
  'new_master_ip=s'          => \$new_master_ip,
  'new_master_port=i'        => \$new_master_port,
  'new_master_user=s'        => \$new_master_user,
  'new_master_password=s'    => \$new_master_password,
  'new_master_ssh_user=s'    => \$new_master_ssh_user,
);

exit &main();
sub drop_vip {
        my $output = `ssh -o ConnectTimeout=15  -o ConnectionAttempts=3 $orig_master_host /sbin/ip addr del $vip/32 dev $if`;
    #mysql里的连接全部干掉
    #FIXME
}
sub add_vip {
        my $output = `ssh -o ConnectTimeout=15  -o ConnectionAttempts=3 $new_master_host /sbin/ip addr add $vip/32 dev $if`;

}

sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      $orig_master_handler->disable_log_bin_local();
     # print current_time_us() . " Drpping app user on the orig master..\n";
      print current_time_us() . " drop vip $vip..\n";
      #drop_app_user($orig_master_handler);
     &drop_vip();

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      $orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      $new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      print current_time_us() . "Add vip $vip on $if..\n";
     # create_app_user($new_master_handler);
      &add_vip();
      $new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}

至此：安装与配置已经完全结束，开始进入运行环节

运行：

masterha_manager --conf=/etc/mha/mainBusiness.cnf &

检查masterha_manager运行情况：

masterha_check_status --conf=/etc/mha/mainBusiness.cnf

故障转移：

我们将主库关机，进行观测整个故障转移的过程

[g:\~]$ Connecting to 192.168.1.110:22...
Could not connect to '192.168.1.110' (port 22): Connection failed.

Type `help' to learn how to use Xshell prompt.

manager端输出如下：

Sun Feb  4 12:26:46 2018 - [info] Starting ping health check on 192.168.1.110(192.168.1.110:3110)..
Sun Feb  4 12:26:46 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Sun Feb  4 12:30:41 2018 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sun Feb  4 12:30:41 2018 - [info] Executing SSH check script: exit 0
Sun Feb  4 12:30:41 2018 - [warning] HealthCheck: SSH to 192.168.1.110 is NOT reachable.
Sun Feb  4 12:30:43 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.110' (4))
Sun Feb  4 12:30:43 2018 - [warning] Connection failed 2 time(s)..
Sun Feb  4 12:30:44 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.110' (4))
Sun Feb  4 12:30:44 2018 - [warning] Connection failed 3 time(s)..
Sun Feb  4 12:30:45 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.110' (4))
Sun Feb  4 12:30:45 2018 - [warning] Connection failed 4 time(s)..
Sun Feb  4 12:30:45 2018 - [warning] Master is not reachable from health checker!
Sun Feb  4 12:30:45 2018 - [warning] Master 192.168.1.110(192.168.1.110:3110) is not reachable!
Sun Feb  4 12:30:45 2018 - [warning] SSH is NOT reachable.
Sun Feb  4 12:30:45 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha/mainBusiness.cnf again, and trying to connect to all servers to check server status..
Sun Feb  4 12:30:45 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Feb  4 12:30:45 2018 - [info] Reading application default configuration from /etc/mha/mainBusiness.cnf..
Sun Feb  4 12:30:45 2018 - [info] Reading server configuration from /etc/mha/mainBusiness.cnf..
Sun Feb  4 12:30:46 2018 - [info] GTID failover mode = 1
Sun Feb  4 12:30:46 2018 - [info] Dead Servers:
Sun Feb  4 12:30:46 2018 - [info]   192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:46 2018 - [info] Alive Servers:
Sun Feb  4 12:30:46 2018 - [info]   192.168.1.109(192.168.1.109:3109)
Sun Feb  4 12:30:46 2018 - [info] Alive Slaves:
Sun Feb  4 12:30:46 2018 - [info]   192.168.1.109(192.168.1.109:3109)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Sun Feb  4 12:30:46 2018 - [info]     GTID ON
Sun Feb  4 12:30:46 2018 - [info]     Replicating from 192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:46 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Feb  4 12:30:46 2018 - [info] Checking slave configurations..
Sun Feb  4 12:30:46 2018 - [info] Checking replication filtering settings..
Sun Feb  4 12:30:46 2018 - [info]  Replication filtering check ok.
Sun Feb  4 12:30:46 2018 - [info] Master is down!
Sun Feb  4 12:30:46 2018 - [info] Terminating monitoring script.
Sun Feb  4 12:30:46 2018 - [info] Got exit code 20 (Master dead).
Sun Feb  4 12:30:46 2018 - [info] MHA::MasterFailover version 0.57.
Sun Feb  4 12:30:46 2018 - [info] Starting master failover.
Sun Feb  4 12:30:46 2018 - [info] 
Sun Feb  4 12:30:46 2018 - [info] * Phase 1: Configuration Check Phase..
Sun Feb  4 12:30:46 2018 - [info] 
Sun Feb  4 12:30:47 2018 - [info] GTID failover mode = 1
Sun Feb  4 12:30:47 2018 - [info] Dead Servers:
Sun Feb  4 12:30:47 2018 - [info]   192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:47 2018 - [info] Checking master reachability via MySQL(double check)...
Sun Feb  4 12:30:48 2018 - [info]  ok.
Sun Feb  4 12:30:48 2018 - [info] Alive Servers:
Sun Feb  4 12:30:48 2018 - [info]   192.168.1.109(192.168.1.109:3109)
Sun Feb  4 12:30:48 2018 - [info] Alive Slaves:
Sun Feb  4 12:30:48 2018 - [info]   192.168.1.109(192.168.1.109:3109)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Sun Feb  4 12:30:48 2018 - [info]     GTID ON
Sun Feb  4 12:30:48 2018 - [info]     Replicating from 192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:48 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Feb  4 12:30:48 2018 - [info] Starting GTID based failover.
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Sun Feb  4 12:30:48 2018 - [info] Executing master IP deactivation script:
Sun Feb  4 12:30:48 2018 - [info]   /etc/mha/master_ip_failover --orig_master_host=192.168.1.110 --orig_master_ip=192.168.1.110 --orig_master_port=3110 --command=stop 
Sun Feb  4 12:30:48 2018 - [info]  done.
Sun Feb  4 12:30:48 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sun Feb  4 12:30:48 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] * Phase 3: Master Recovery Phase..
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] The latest binary log file/position on all slaves is 3110binlog.000073:234
Sun Feb  4 12:30:48 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sun Feb  4 12:30:48 2018 - [info]   192.168.1.109(192.168.1.109:3109)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Sun Feb  4 12:30:48 2018 - [info]     GTID ON
Sun Feb  4 12:30:48 2018 - [info]     Replicating from 192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:48 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Feb  4 12:30:48 2018 - [info] The oldest binary log file/position on all slaves is 3110binlog.000073:234
Sun Feb  4 12:30:48 2018 - [info] Oldest slaves:
Sun Feb  4 12:30:48 2018 - [info]   192.168.1.109(192.168.1.109:3109)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Sun Feb  4 12:30:48 2018 - [info]     GTID ON
Sun Feb  4 12:30:48 2018 - [info]     Replicating from 192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:48 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] * Phase 3.3: Determining New Master Phase..
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] Searching new master from slaves..
Sun Feb  4 12:30:48 2018 - [info]  Candidate masters from the configuration file:
Sun Feb  4 12:30:48 2018 - [info]   192.168.1.109(192.168.1.109:3109)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Sun Feb  4 12:30:48 2018 - [info]     GTID ON
Sun Feb  4 12:30:48 2018 - [info]     Replicating from 192.168.1.110(192.168.1.110:3110)
Sun Feb  4 12:30:48 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Feb  4 12:30:48 2018 - [info]  Non-candidate masters:
Sun Feb  4 12:30:48 2018 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Sun Feb  4 12:30:48 2018 - [info] New master is 192.168.1.109(192.168.1.109:3109)
Sun Feb  4 12:30:48 2018 - [info] Starting master failover..
Sun Feb  4 12:30:48 2018 - [info] 
From:
192.168.1.110(192.168.1.110:3110) (current master)
 +--192.168.1.109(192.168.1.109:3109)

To:
192.168.1.109(192.168.1.109:3109) (new master)
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info] * Phase 3.3: New Master Recovery Phase..
Sun Feb  4 12:30:48 2018 - [info] 
Sun Feb  4 12:30:48 2018 - [info]  Waiting all logs to be applied.. 
Sun Feb  4 12:30:48 2018 - [info]   done.
Sun Feb  4 12:30:48 2018 - [info] Getting new master's binlog name and position..
Sun Feb  4 12:30:48 2018 - [info]  3109binlog.000089:234
Sun Feb  4 12:30:48 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.109', MASTER_PORT=3109, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sun Feb  4 12:30:48 2018 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: 3109binlog.000089, 234, 2285de8a-9bc2-11e7-8a15-000c298e9a6f:1-14,
28ea40ab-9bbd-11e7-8cd1-000c29c31069:1-15308048
Sun Feb  4 12:30:48 2018 - [info] Executing master IP activate script:
Sun Feb  4 12:30:48 2018 - [info]   /etc/mha/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.1.110 --orig_master_ip=192.168.1.110 --orig_master_port=3110 --new_master_host=192.168.1.109 --new_master_ip=192.168.1.109 --new_master_port=3109 --new_master_user='mha'   --new_master_password=xxx
Set read_only=0 on the new master.
ssh: connect to host 192.168.1.110 port 22: Connection timed out
Sun Feb  4 12:31:35 2018 - [info]  OK.
Sun Feb  4 12:31:35 2018 - [info] ** Finished master recovery successfully.
Sun Feb  4 12:31:35 2018 - [info] * Phase 3: Master Recovery Phase completed.
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] * Phase 4: Slaves Recovery Phase..
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] * Phase 4.1: Starting Slaves in parallel..
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] All new slave servers recovered successfully.
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] * Phase 5: New master cleanup phase..
Sun Feb  4 12:31:35 2018 - [info] 
Sun Feb  4 12:31:35 2018 - [info] Resetting slave info on the new master..
Sun Feb  4 12:31:35 2018 - [info]  192.168.1.109: Resetting slave info succeeded.
Sun Feb  4 12:31:35 2018 - [info] Master failover to 192.168.1.109(192.168.1.109:3109) completed successfully.
Sun Feb  4 12:31:35 2018 - [info] 

----- Failover Report -----

mainBusiness: MySQL Master failover 192.168.1.110(192.168.1.110:3110) to 192.168.1.109(192.168.1.109:3109) succeeded

Master 192.168.1.110(192.168.1.110:3110) is down!

Check MHA Manager logs at Client-Cent7-IP008:/data/mha/mainBusiness/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.1.110(192.168.1.110:3110)
Selected 192.168.1.109(192.168.1.109:3109) as a new master.
192.168.1.109(192.168.1.109:3109): OK: Applying all logs succeeded.
192.168.1.109(192.168.1.109:3109): OK: Activated master IP address.
192.168.1.109(192.168.1.109:3109): Resetting slave info succeeded.
Master failover to 192.168.1.109(192.168.1.109:3109) completed successfully.
Sun Feb  4 12:31:35 2018 - [info] Sending mail..
Option new_slave_hosts requires an argument     #双节点，没有多余的从节点，所以会报错，下面也是。
Unknown option: conf

故障切换解析：

1.每秒检查MySQL，连续4次无法连上MySQL服务后，进入SSH检查阶段，SSH也不通后，确认实例故障。由于故障实例为主库，触发切换主库的操作。

2.再次读取配置文件信息，获取所有注册的实例，及其切换偏好。关闭manager节点，启用切换脚本进行切换操作。切换操作的逻辑与之前的【MySQL】【高可用】purge_relay_logs工具的使用文章中分析的相近。

3.切换主库成功后，输出切换报告，同时在/data/mha中生成 mainBusiness.failover.complete文件。接着在新的主库上进行虚拟IP的挂载，发送故障报告邮件。

附：

两篇前言文章：
【MySQL】【高可用】purge_relay_logs工具的使用
【MySQL】【高可用】从masterha_master_switch工具简单分析MHA的切换逻辑

1.千万要注意定时清理relay log。我在搭建完之后，直接进行了压测数据的写入，大量relay log用尽了所有的硬盘空间。这时从库MySQL服务假死，mha所有的脚本也都会因为得不到从库的回应也同样卡住。

2.另一种简单的send_report的脚本：

vim /etc/mha/mha_error.sh
#!/bin/sh
SYSTEM=$1
echo $1 'mha error,please check it immediately!'|mail -s 'mha error' psyduck007@outlook.com

邮件系统安装脚本：

printf "安装mail相关软件和服务"
yum -y install sendmail* > /dev/null 2>&1
yum -y install mailx > /dev/null 2>&1
printf "\033[32;1m%20s\033[0m\n" "[ OK ]"

printf "创建邮件发送配置"
cat > /etc/mail.rc< /dev/null 2>&1
printf "\033[32;1m%20s\033[0m\n" "[ OK ]"

printf "启动邮件服务"
systemctl restart sendmail > /dev/null 2>&1
printf "\033[32;1m%20s\033[0m\n" "[ OK ]"

printf "发送测试邮件"
hostname |mail -s "test" psyduck007@outlook.com
printf "\033[32;1m%20s\033[0m\n" "[ OK ]"

调用方式：

/etc/mha/mah_error.sh mainBusiness

网页名称：【MySQL】【高可用】基于MHA架构的MySQL高可用故障自动切换架构
当前URL：http://bzwzjz.com/article/jcidgg.html

用户体验为先导为品牌带来生命力

【MySQL】【高可用】基于MHA架构的MySQL高可用故障自动切换架构

基于MHA架构的MySQL高可用切换架构

环境：

1.背景：

2.安装：

在node1和node2上：

在manager上

配置文件内容：

报警脚本修改：

增加定时清理relay log的脚本：

修改故障切换VIP脚本：

运行：

故障转移：

故障切换解析：

附：

其他资讯

用户体验为先导 为品牌带来生命力

【MySQL】【高可用】基于MHA架构的MySQL高可用故障自动切换架构

基于MHA架构的MySQL高可用切换架构

环境：

1.背景：

2.安装：

在node1和node2上：

在manager上

配置文件内容：

报警脚本修改：

增加定时清理relay log的脚本：

修改故障切换VIP脚本：

运行：

故障转移：

故障切换解析：

附：

其他资讯

用户体验为先导为品牌带来生命力