
MySQL MGR Failover and Node Management

陶老师运维笔记 · 掘金 · 2019-12-15 15:57


MySQL Group Replication (MGR) gives MySQL built-in automatic primary switchover and failure recovery. This article tests, in single-primary mode, how the system recovers after the master crashes.

1. Test Environment

1.1 Environment layout

Role IP port server-id
DB-1 192.110.103.41 3106 103413106
DB-2 192.110.103.42 3106 103423106
DB-3 192.110.103.43 3106 103433106

Note: port 3306 was already taken on these machines, so 3106 is used instead.

1.2 Installing MGR

See the MGR installation steps in the previous post, “MySQL5.7 MGR安装及介绍”.

2. MGR Overview

MGR manages membership on the basis of group views (Group View, or simply “view”). A view is the set of member states during a period of time: as long as no member joins or leaves during that period, the whole continuous period is one view; whenever a member joins or leaves, the view changes. MGR uses a View ID to track view changes and to order views in time.
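The current view ID can be observed directly; a minimal sketch, assuming the MySQL 5.7 performance_schema tables used throughout this article:

# Run on any member: the VIEW_ID changes every time membership changes.
SELECT MEMBER_ID, VIEW_ID FROM performance_schema.replication_group_member_stats;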

2.1 Single-primary / multi-primary mode

MySQL Group Replication can be configured in two working modes, single-primary and multi-primary. In brief:

  • Single-primary mode: one master is elected automatically from the MySQL nodes in the group; only the master can write, and the other nodes are automatically set to read only. When the master fails, a new master is elected; once the election succeeds the new master becomes writable and the other nodes (slaves) point to it.
  • Multi-primary mode: every node in the group can write, so there is no notion of master and slave; as long as not too many nodes fail at once, the group stays available.

MySQL Group Replication uses the Paxos distributed algorithm to coordinate its members. Because of this, a majority of the group's members must be online to reach a quorum and agree on a decision.

A majority means N/2+1 nodes (N being the current number of members in the group); for example, a group of 5 needs 3 nodes to form a majority. The number of node failures that can be tolerated is shown in the table below (a live check is sketched after it):

Group size  Majority  Failures tolerated
1 1 0
2 2 0
3 2 1
4 3 1
5 3 2
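On a running group these numbers can be derived live from performance_schema; the query below is just a restatement of the N/2+1 rule above, not an official MGR metric:

SELECT COUNT(*) AS group_size,
       FLOOR(COUNT(*)/2) + 1 AS majority_needed,
       COUNT(*) - (FLOOR(COUNT(*)/2) + 1) AS failures_tolerated
FROM performance_schema.replication_group_members;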

2.2 Configuration notes

[mysqld]
datadir=/data
socket=/data/mysql.sock

server-id=100                      # required
gtid_mode=on                       # required
enforce_gtid_consistency=on        # required
log-bin=/data/master-bin           # required
binlog_format=row                  # required
binlog_checksum=none               # required
master_info_repository=TABLE       # required
relay_log_info_repository=TABLE    # required
relay_log=/data/relay-log          # required; if omitted, the default is used
log_slave_updates=ON               # required
sync-binlog=1                      # recommended
log-error=/data/error.log
pid-file=/data/mysqld.pid

transaction_write_set_extraction=XXHASH64         # required
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"  # required
loose-group_replication_start_on_boot=off         # recommended: OFF
loose-group_replication_member_weight=40          # optional; supported since MySQL 5.7.20
loose-group_replication_local_address="192.110.103.41:31061"   # required, as is the next line
loose-group_replication_group_seeds="192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group=OFF
loose-group_replication_single_primary_mode=FALSE            # FALSE = multi-primary
loose-group_replication_enforce_update_everywhere_checks=ON  # ON = multi-primary

A walk-through of the options above (note that the last two lines as written configure multi-primary mode; for the single-primary tests in this article they were effectively single_primary_mode=ON and enforce_update_everywhere_checks=OFF, which matches the status shown in section 4.1):

  • 1) Group Replication is based on GTIDs, so gtid_mode and enforce_gtid_consistency must be enabled.
  • 2) The binary log must be enabled and must use row format so that the plugin can collect the information it needs from the log and guarantee data consistency; hence log_bin and binlog_format.
  • 3) Because of a design limitation in how MySQL checksums replication events, Group Replication cannot verify them, so binlog_checksum=none is required.
  • 4) Group Replication writes the master and relay log metadata into the mysql.slave_master_info and mysql.slave_relay_log_info tables, hence the TABLE repositories.
  • 5) Every node keeps a full copy of the data (a share-nothing model), so log_slave_updates must be ON on every node; a newly joining node can then pick any member as its donor for asynchronous recovery.
  • 6) sync_binlog makes every transaction commit flush the binlog to disk, so no log is lost on a crash.
  • 7) The remaining lines configure the Group Replication plugin. The loose- prefix means MySQL keeps running normally even if the plugin is not loaded; the prefix is optional.
  • 8) transaction_write_set_extraction=XXHASH64 hashes write sets with XXHASH64. A write set uniquely identifies the rows modified by a transaction and is later used to detect conflicts between concurrent transactions touching the same row. It is generated from the primary key, so every table used with Group Replication must have a primary key (a query to spot offending tables is sketched after this list).
  • 9) group_replication_group_name names the replication group. It must be a valid UUID; you can simply write all a's as above, or generate one with the uuidgen tool on Linux.
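As mentioned in point 8, every replicated table needs a primary key. A hedged helper query, against standard information_schema tables, to spot InnoDB tables that would violate this requirement:

SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.engine = 'InnoDB'
  AND t.table_schema NOT IN ('mysql','sys','performance_schema','information_schema')
  AND c.constraint_type IS NULL;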

3. Managing Group Replication

Statements for operating Group Replication:

SELECT * FROM performance_schema.replication_group_members;
# find the current primary
SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb  WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';
SHOW STATUS LIKE 'group_replication_primary_member';

start group_replication;
stop group_replication;

To stop Group Replication on one member, run the STOP GROUP_REPLICATION statement on that node. Be careful: before executing it, make sure the node is no longer serving MySQL traffic, otherwise new data may still be written to it (for example when stopping the primary) or stale data may be read from it. The safest way to restart the whole group is therefore: first stop the MySQL instances of all non-primary nodes (not just their Group Replication), then stop the primary's instance; restart the primary first, bootstrap the group on it and start its Group Replication, and finally have the other nodes join the group again.
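A minimal sketch of that restart order for the nodes in this article (the bootstrap flag must be enabled on exactly one node, and only while the group is being re-created):

# On the node chosen to bootstrap (normally the last primary), once mysqld is back up:
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;

# On each remaining node:
START GROUP_REPLICATION;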

There are two ways a member can leave the group: voluntarily and involuntarily.

Voluntary leave:

The member executes the STOP GROUP_REPLICATION; statement.

  • 1) Executing this statement means the node leaves the group voluntarily. It triggers an automatic view reconfiguration, and the view change is replicated to all members; the node only leaves once a majority of members have agreed on the new view.
  • 2) A voluntary leave does not cost the group any quorum votes, so no matter how many members leave voluntarily, the group will never block for failing to reach a majority.
  • 3) For example, in a 5-node group, after node A leaves voluntarily the group size becomes 4, and the group behaves as if node A had never been there.

Involuntary leave:

Every other way of leaving is involuntary, for example a node crashing or losing network connectivity.

1) When a member leaves involuntarily, the failure-detection mechanism notices it and reports it to the group; this triggers an automatic view reconfiguration, which again needs a majority of members to agree on the new view.

2) With an involuntary leave the group size does not change: a 5-node group is still counted as 5 members after a node drops out; the departed members are simply no longer marked ONLINE.

3) An involuntary leave does cost quorum votes. If too many members leave involuntarily, the remaining nodes can no longer form a majority and the group blocks.

4) For example, in a 5-node group, after node A drops out involuntarily the group size is still 5, but node A is marked unreachable (or another non-ONLINE state) in the new view. If two more nodes then drop out, only 2 ONLINE members remain, a majority can no longer be reached, and the group blocks.
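If a group is blocked this way and the missing members really cannot come back, the surviving members can be forced into a new membership. A cautious sketch using this article's group-communication addresses; run it on one surviving member only, and never include unreachable nodes:

SET GLOBAL group_replication_force_members='192.110.103.41:31061,192.110.103.43:31061';
# Once the group is unblocked, clear the setting so it cannot be applied again by accident.
SET GLOBAL group_replication_force_members='';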

4. Failover Tests

4.1 Current state

The group is currently running in single-primary mode:

select * from performance_schema.global_variables  where VARIABLE_NAME 
 in ('group_replication_single_primary_mode','group_replication_enforce_update_everywhere_checks');
+----------------------------------------------------+----------------+
| VARIABLE_NAME                                      | VARIABLE_VALUE |
+----------------------------------------------------+----------------+
| group_replication_enforce_update_everywhere_checks | OFF            |
| group_replication_single_primary_mode              | ON             |
+----------------------------------------------------+----------------+
2 rows in set (0.00 sec)
 select * from performance_schema.global_variables  where VARIABLE_NAME like '%read_only';
+-----------------------+----------------+
| VARIABLE_NAME         | VARIABLE_VALUE |
+-----------------------+----------------+
| innodb_read_only      | OFF            |
| read_only             | OFF            |
| super_read_only       | OFF            |
+-----------------------+----------------+
 > SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+

Current master: only single-primary groups need to look up the primary node; a multi-primary group has no master/slave distinction, so there is nothing to look up.

SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';
+----------------------------------+--------------------------------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 |
+----------------------------------+--------------------------------------+
# find the primary
 SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb  WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
1 row in set (0.00 sec)

4.2 Primary failure

1) Graceful shutdown of the master

Stop the master with the kill command (a normal TERM signal, i.e. a clean mysqld shutdown).

# On the master, 192.110.103.41, kill the mysql processes.
$ps aux|grep 3106
mysql    122456  0.0  0.0 106252  1444 ?        S    Oct23   0:00 /bin/sh /usr/local/mysql-5.7.23/bin/mysqld_safe --defaults-file=/data1/mysql_3106/etc/my.cnf
mysql    123471  0.1  0.8 11575792 1069584 ?    Sl   Oct23   1:45 /usr/local/mysql-5.7.23/bin/mysqld --defaults-file=/data1/mysql_3106/etc/my.cnf --basedir=/usr/local/mysql-5.7.23 --datadir=/data1/mysql_3106/data --plugin-dir=/usr/local/mysql-5.7.23/lib/plugin --log-error=/data1/mysql_3106/logs/mysqld.err --open-files-limit=8192 --pid-file=/data1/mysql_3106/tmp/mysql.pid --socket=/data1/mysql_3106/tmp/mysql.sock --port=3106
$kill 122456 123471 ; tail -f /data1/mysql_3106/logs/mysqld.err 

Logs:

Log on the old master:

$tail -f /data1/mysql_3106/logs/mysqld.err 
2019-10-24T03:10:32.746843Z 0 [Warning] /usr/local/mysql-5.7.23/bin/mysqld: Forcing close of thread 31  user: 'root'

2019-10-24T03:10:32.746873Z 0 [Note] Plugin group_replication reported: 'Plugin 'group_replication' is stopping.'
2019-10-24T03:10:32.746901Z 0 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2019-10-24T03:10:35.797258Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2019-10-24T03:10:40.799923Z 0 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2019-10-24T03:10:40.799954Z 0 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2019-10-24T03:10:40.800110Z 7 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2019-10-24T03:10:40.800431Z 7 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 65
2019-10-24T03:10:40.800652Z 4 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2019-10-24T03:10:40.800787Z 0 [Note] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'
2019-10-24T03:10:40.800799Z 0 [Note] Event Scheduler: Purging the queue. 0 events
2019-10-24T03:10:40.801278Z 0 [Note] Binlog end
2019-10-24T03:10:40.802272Z 0 [Note] Shutting down plugin 'group_replication'
2019-10-24T03:10:40.802322Z 0 [Note] Plugin group_replication reported: 'All Group Replication server observers have been successfully unregistered'
...
2019-10-24T03:10:42.804477Z 0 [Note] Shutting down plugin 'binlog'
2019-10-24T03:10:42.805238Z 0 [Note] /usr/local/mysql-5.7.23/bin/mysqld: Shutdown complete

2019-10-24T03:10:42.814933Z mysqld_safe mysqld from pid file /data1/mysql_3106/tmp/mysql.pid ended

Log on the new master:

$tail -n 30 /data1/mysql_3106/logs/mysqld.err
2019-10-23T11:11:00.671705Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'
2019-10-23T11:11:00.671736Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 31061'
2019-10-23T11:11:05.400823Z 2 [Note] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.110.103.41:3106.'
2019-10-23T11:11:05.401138Z 20 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'
2019-10-23T11:11:05.401143Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106 on view 15718289704352993:2.'
2019-10-23T11:11:05.402757Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-23T11:11:05.404717Z 20 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 509810ee-f3d7-11e9-a7d5-a0369fac2de4 at 192.110.103.41 port: 3106.'
2019-10-23T11:11:05.404998Z 22 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-10-23T11:11:05.406423Z 22 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'repl@192.110.103.41:3106',replication started in log 'FIRST' at position 4
2019-10-23T11:11:05.442349Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './mysql-relay-bin-group_replication_recovery.000001' position: 4
2019-10-23T11:11:05.461483Z 20 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-10-23T11:11:05.461910Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'mysql-bin.000002' at position 934
2019-10-23T11:11:05.462119Z 22 [Note] Slave I/O thread killed while reading event for channel 'group_replication_recovery'
2019-10-23T11:11:05.462143Z 22 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'mysql-bin.000002', position 934
2019-10-23T11:11:05.523357Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-23T11:11:05.526137Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'
2019-10-23T11:15:33.426684Z 0 [Note] Plugin group_replication reported: 'Members joined the group: 192.110.103.43:3106'
2019-10-23T11:15:33.426832Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:3.'
2019-10-23T11:15:34.094942Z 0 [Note] Plugin group_replication reported: 'The member with address 192.110.103.43:3106 was declared online within the replication group'
2019-10-24T03:10:32.839967Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.41:3106'
2019-10-24T03:10:32.839985Z 0 [Note] Plugin group_replication reported: 'Primary server with address 192.110.103.41:3106 left the group. Electing new Primary.'
2019-10-24T03:10:32.840052Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.42:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'
2019-10-24T03:10:32.840086Z 41 [Note] Plugin group_replication reported: 'This server is working as primary member.'
2019-10-24T03:10:32.840107Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:4.'
2019-10-24T03:12:01.677869Z 4 [Note] Plugin group_replication reported: 'Primary had applied all relay logs, disabled conflict detection'

Automatic failover:

The group completed the switchover automatically; 192.110.103.42 is now the primary.

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
2 rows in set (0.00 sec)

root@192.110.103.42 : (none) > SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb  WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_primary_member | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
1 row in set (0.00 sec)

The old master recovers and rejoins the cluster

Once the old master is back up, run START GROUP_REPLICATION; on it (a note on automatic rejoin follows the output below):

> select @@group_replication_bootstrap_group;
+-------------------------------------+
| @@group_replication_bootstrap_group |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
> START GROUP_REPLICATION;
> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
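The manual START GROUP_REPLICATION above is needed because this setup uses loose-group_replication_start_on_boot=off. If automatic rejoin after a restart is preferred, the option can be flipped (a sketch; it must also be persisted in my.cnf, since MySQL 5.7 has no SET PERSIST):

SET GLOBAL group_replication_start_on_boot=ON;
# and in my.cnf:
# loose-group_replication_start_on_boot = on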

2) Crashing the master (kill -9)

Kill -9 the current master, 192.110.103.42. The log shows that a new primary is elected automatically and 192.110.103.41 becomes the master.

# kill -9 the master node; run on 192.110.103.42.
$kill -9 mysql_pid

# log on 192.110.103.41
tail -f /data1/mysql_3106/logs/mysqld.err
2019-10-24T06:17:31.473849Z 0 [Warning] Plugin group_replication reported: 'Member with address 192.110.103.42:3106 has become unreachable.'
2019-10-24T06:17:32.479299Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.42:3106'
2019-10-24T06:17:32.479323Z 0 [Note] Plugin group_replication reported: 'Primary server with address 192.110.103.42:3106 left the group. Electing new Primary.'
2019-10-24T06:17:32.479395Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.41:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'
2019-10-24T06:17:32.479439Z 37 [Note] Plugin group_replication reported: 'This server is working as primary member.'
2019-10-24T06:17:32.479465Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.43:3106 on view 15718289704352993:6.'

root@192.110.103.41 : (none) >  select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
2 rows in set (0.00 sec)

root@192.110.103.41 : (none) > SELECT ta.* ,tb.MEMBER_HOST,tb.MEMBER_PORT,tb.MEMBER_STATE FROM performance_schema.global_status ta,performance_schema.replication_group_members tb  WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+---------------+-------------+--------------+
1 row in set (0.00 sec)


5. MGR Node Management (adding/removing nodes)

5.1 Removing a node from the group

Removing a node from the group is simple: just execute STOP GROUP_REPLICATION on it (a follow-up check is sketched after the block below).

stop group_replication;
select * from performance_schema.replication_group_members;
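After STOP GROUP_REPLICATION the node stays in super_read_only mode (which is why the rebuild procedure in 5.2 explicitly disables it), so it is worth confirming the state before repurposing or rejoining the node:

SHOW VARIABLES LIKE '%read_only';
SELECT * FROM performance_schema.replication_group_members;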

5.2 Adding a node

1) Rejoining after an existing MGR node restarts. If a node that was previously part of the group was shut down and its service has since recovered, it can usually rejoin with just the start group_replication; command, as follows:

set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
START GROUP_REPLICATION;
select * from performance_schema.replication_group_members;

2) Joining a brand-new node. If the node is new, or a recovering node has been down for so long that the master's binlogs have already been purged, it cannot recover with START GROUP_REPLICATION; alone. You have to take a backup from the MGR cluster yourself: MGR has no SST or IST mechanism; “synchronising and catching up” is done entirely through GTIDs and the binlog.

Example: 192.110.103.42:3106 was down for so long that the binlogs it needs have been purged from the group, so it stays in the RECOVERING or ERROR state and cannot join MGR. Explanation: a node that was down asks the surviving members whether they can supply the missing binlog. If they can, the data is transferred normally and the node catches up; if the logs the node needs no longer exist, it cannot rejoin the cluster.
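A quick way to confirm this situation before digging through the error log below is to compare what the surviving members have already purged with what the joiner has executed:

# On a healthy (donor) member:
SELECT @@GLOBAL.gtid_purged;
# On the node that fails to join:
SELECT @@GLOBAL.gtid_executed;
# If every donor's gtid_purged contains transactions that are missing from the
# joiner's gtid_executed, distributed recovery cannot work and the node has to
# be rebuilt from a backup, as shown below.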

root@192.110.103.42 : (none) > show variables like 'group_replication_group_seeds';
+-------------------------------+-------------------------------------------------------------+
| Variable_name                 | Value                                                       |
+-------------------------------+-------------------------------------------------------------+
| group_replication_group_seeds | 192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061 |
+-------------------------------+-------------------------------------------------------------+
1 row in set (0.00 sec)

root@192.110.103.42 : (none) > start group_replication;
Query OK, 0 rows affected (3.35 sec)

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | RECOVERING   |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
3 rows in set (0.00 sec)

root@192.110.103.42 : (none) > select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ERROR        |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
1 row in set (0.00 sec)

Error log:

2019-10-24T08:15:11.015032Z 198 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2019-10-24T08:15:11.015060Z 198 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2019-10-24T08:15:11.015076Z 198 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2019-10-24T08:15:11.015131Z 196 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-10-24T08:15:11.015164Z 199 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2019-10-24T08:15:11.015530Z 199 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'FIRST' at position 0
2019-10-24T08:15:11.017239Z 196 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-24T08:15:11.019165Z 196 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'

In this situation the replica has to be rebuilt, as follows:

1) Take a backup

/usr/local/mysql/bin/mysqldump -h 127.0.0.1 -P3106 --all-databases --default-character-set=utf8 -R -q --triggers --master-data=2 --single-transaction > mysql3106_online.sql ;

2) Stop replication, disable read-only, and clear the local GTID state.

show variables like 'group_replication_group_seeds';
STOP GROUP_REPLICATION;
set global super_read_only=0;
show master logs;
reset master;
show master logs;

3) Import the backup data (a GTID sanity check follows the command):

mysql -h 127.0.0.1 -P3106  --default-character-set=utf8 < mysql3106_online.sql
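Because the dump was taken with GTID mode enabled, it should carry a SET @@GLOBAL.GTID_PURGED statement, which is why RESET MASTER in step 2 was required. A small sanity check after the import (not part of the original procedure):

SELECT @@GLOBAL.gtid_executed;
# This set only needs to be a subset of the donors' gtid_executed; the rest is
# pulled in by distributed recovery when the node joins.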

4) Re-enable the single-primary settings and start MGR:

set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
START GROUP_REPLICATION;

5) Result:

Because the replica was rebuilt from a fresh backup, it can catch up with the data normally, so START GROUP_REPLICATION; now joins it to MGR successfully.

>START GROUP_REPLICATION;
> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+
