mysql主从同步异常修复-摩杜云开发者社区

说明

mysql 集群部署在 k8s 上，架构是“一主两从”，机房突然断电，导致 mysql 集群启动之后，发生主从同步异常。

主库上查看binlog信息：

mysql> show master status \G;
*************************** 1. row ***************************
             File: mysql-0-bin.000365
         Position: 499726305
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 
1 row in set (0.00 sec)

ERROR: 
No query specified

从库上重新执行主从同步：

mysql> stop slave;
mysql> reset slave;
mysql> change master to master_host='mysql-0.mysql.middle.svc.cluster.local',master_user='root',master_password='f11K40ziyu8Y',master_log_file='mysql-0-bin.000365',master_log_pos=499726305,master_port=3306;
mysql> start slave;
mysql> show slave status \G;

查看主从同步状态，发现同步异常，遇到的报错如下文。

主从同步异常

MySQL 连接出错 mysqladmin flush-hosts

详细信息如下：

Host is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.

同一个ip在短时间内产生太多(超过mysql数据库max_connection_errors的最大值)中断的数据库连接而导致的阻塞，抛出mysqladmin flush-host。

这里提供三种方法，我使用的是第二种方法。

方法一：提高允许的max_connect_errors数量

临时处理方法，治标不治本。

临时修改，可使用命令行形式：

## 登入数据库
mysql -h mysql_ip -P mysql_port -uroot -pmysql_passwd
## 查看max_connection_errors
mysql> show variables like '%max_connect_errors%';
## 修改max_connection_errors的数量为1000
mysql> set global max_connect_errors=1000;
## 修改 max_connections 的数量为1000 
mysql> set global max_connections = 1000;

命令行修改，只是临时修改，重启MySQL后会失效；如果需要永久生效，需要在my.cnf配置文件中修改。

## 登入数据库
mysql -h mysql_ip -P mysql_port -uroot -pmysql_passwd
## 查看max_connect_errors
mysql> show variables like '%max_connect_errors%';
mysql-connection-error-mysqladmin-flush-hosts-solutions-00

max_connect_errors 默认是10 或 100，修改方法如下：

vim /etc/my.cnf
## 添加一行
max_connect_errors=1000
## 重启MySQL，修改才会生效！
systemctl restart mysqld
mysql-connection-error-mysqladmin-flush-hosts-solutions-01

方法二：命令清理hosts文件

## 直接命令修改flush-hosts
mysqladmin flush-hosts -h mysql_ip -P mysql_port -uroot -pmysql_passwd

说明：配置有master/slave主从数据库的最好把主库和从库都修改一遍~

上面是直接命令行修改，我们也可以登入进 mysql 中修改：

mysql -h mysql_ip -P mysql_port -uroot -pmysql_passwd
mysql> flush hosts;

方法三：重启 mysql

如果是生产环境，不建议重启 mysql 操作。

systemctl restart mysqld

上面提到的方法都是治标不治本的，造成上述故障原因有很多，具体还是要根据自己的实际业务场景进行排查，我这里简单说两点：

*（1）由于程序中创建了过多的mysql连接，通常情况下，程序开始的运行的时候建立与数据库的连接，运行期间进行数据库的一些增删改查操作，程序关闭的时候，断开与数据库的连接。

*（2）部署主从同步这种架构，当集群因某种原因重启后，主从节点同步失败，从节点也会短时间内产生过多的连接，造成阻塞。

从库上记录删除失败,Error_code: 1032

在备库执行同步命令后，使用 show slave status \G;显示下面错误：

Last_Error: Could not execute Update_rows event on table master.connection_alert_contents; Can't find record in 'connection_alert_contents', Error_code: 1032;handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000067, end_log_pos 1275813

1032错误----现在生产库中好多数据，在从库误删或从库无这条数据，生产库更新后找不到了，现在主从不同步了，再跳过错误也没用，因为没这条，再更新还会报错。

这里提供四种解决方法（我使用的是方法三），不过有的方法需要进一步验证。特别提醒：数据操作有风险，操作前记得先备份数据。

方法一：忽略1032 ERROR报错

貌似作用不大

mysql> set global sql_slave_skip_counter=1;  
mysql> stop slave; 
mysql> start slave;

也可以在 my.cnf中配置 slave-skip-errors=1032从而跳过日志中1032 ERROR报错，需要重启mysql服务。

方法二：从节点补丢失的数据

如果是主从同步或读写分离架构，从节点可能没有写权限，需要先设置从节点可写。报错信息如下：

1209 - The MySQL server is running with the–read-only option so it cannot execute this statement

查询从库上read_only参数的值，1为只读，0为读写。

mysql> select @@read_only;
mysql> set global read_only=0;

具体方法可参考mysql主从复制中的1032错误。

方法三：备份主库导入从库

这种方法会造成从库数据有丢失

详见mysqldump导入导出，然后再重新开启主从同步。

方法四：重新建从节点

重新部署新的从节点，再开启主从同步。

参考文档

mysql主从同步异常修复.md