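I. About Me
I currently work as an IT operations engineer, mainly on cloud service operations. I recently studied on 51CTO and earned the CKA certification, and I am now learning Go development. I hope to use this platform to explore a broad range of topics, sharpen my professional skills, and strengthen my abilities.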
II. Technical Notes
1. Check the status of the cluster and of each node
1.1 Check cluster health
[root@ops-k8smaster01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/node-k8s-master-01.pem --key=/etc/ssl/etcd/ssl/node-k8s-master-01-key.pem --endpoints="https://10.100.11.47:2379,https://10.100.11.48:2379,https://10.100.11.49:2379" endpoint health -w table
+---------------------------+--------+------------+-------+
|         ENDPOINT          | HEALTH |    TOOK    | ERROR |
+---------------------------+--------+------------+-------+
| https://10.100.11.47:2379 |   true | 8.380239ms |       |
| https://10.100.11.49:2379 |   true | 8.693631ms |       |
| https://10.100.11.48:2379 |   true | 8.568265ms |       |
+---------------------------+--------+------------+-------+
All three endpoints report healthy.
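For scripted checks, the health table can be reduced to a list of failing endpoints. The following is a minimal sketch that assumes the `-w table` layout shown above; `unhealthy_endpoints` is a hypothetical helper name, not part of etcdctl. It reads the table on stdin and prints every endpoint whose HEALTH column is not `true`.

```shell
#!/usr/bin/env bash
# Print every endpoint whose HEALTH column is not "true".
# Input: output of `etcdctl ... endpoint health -w table` on stdin.
unhealthy_endpoints() {
  awk -F'|' '
    NF >= 4 {
      ep = $2; health = $3
      gsub(/^[ \t]+|[ \t]+$/, "", ep)        # trim the cell padding
      gsub(/^[ \t]+|[ \t]+$/, "", health)
      if (ep == "ENDPOINT" || ep == "") next  # skip the header row
      if (health != "true") print ep
    }'
}

# Usage (same TLS flags and endpoints as in step 1.1):
#   etcdctl ... endpoint health -w table | unhealthy_endpoints
```

An empty result means every listed endpoint reported healthy.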
1.2 Check the status of each node
[root@ops-k8smaster01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/node-k8s-master-01.pem --key=/etc/ssl/etcd/ssl/node-k8s-master-01-key.pem --endpoints="https://10.100.11.47:2379,https://10.100.11.48:2379,https://10.100.11.49:2379" endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.100.11.47:2379 | 610822a36ea59feb |  3.4.13 |  2.0 GB |      true |      false |       108 |  191742060 |          191742060 |        |
| https://10.100.11.48:2379 | 69b52ce831bc87db |  3.4.13 |  2.0 GB |     false |      false |       108 |  191742060 |          191742060 |        |
| https://10.100.11.49:2379 | ceff1f8b9b557645 |  3.4.13 |   42 MB |     false |      false |       108 |  191742060 |          191742060 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
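The output shows that the DB SIZE of 10.100.11.49 is abnormal: 42 MB, against 2.0 GB on the other two members. Its data is inconsistent with the rest of the cluster.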
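The same comparison can be automated. The sketch below parses the `endpoint status -w table` layout shown in step 1.2, converts DB SIZE to bytes, and prints members whose database is less than half the size of the largest one. The helper names, the SI unit interpretation (1 MB = 1000^2 bytes), and the 50% threshold are all assumptions made for illustration. A size gap is only a hint, since fragmentation can also make sizes differ.

```shell
#!/usr/bin/env bash
# Print "endpoint bytes" for each row of `endpoint status -w table` (stdin).
# Units are treated as SI (1 MB = 1000^2 bytes); this is an assumption.
db_sizes() {
  awk -F'|' '
    BEGIN { mult["B"] = 1; mult["kB"] = 1e3; mult["MB"] = 1e6; mult["GB"] = 1e9 }
    NF >= 5 {
      ep = $2; size = $5
      gsub(/^[ \t]+|[ \t]+$/, "", ep)
      gsub(/^[ \t]+|[ \t]+$/, "", size)
      if (ep == "ENDPOINT" || ep == "") next   # skip the header row
      split(size, parts, " ")                  # e.g. "2.0 GB" -> "2.0", "GB"
      if (!(parts[2] in mult)) next            # unknown unit: ignore the row
      printf "%s %.0f\n", ep, parts[1] * mult[parts[2]]
    }'
}

# Print endpoints whose DB size is less than half of the largest DB size.
small_members() {
  db_sizes | awk '
    { ep[NR] = $1; sz[NR] = $2 + 0; if (sz[NR] > max) max = sz[NR] }
    END { for (i = 1; i <= NR; i++) if (sz[i] * 2 < max) print ep[i] }'
}

# Usage:
#   etcdctl ... endpoint status -w table | small_members
```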
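2. Approach
1. Back up the etcd data and the data directory on a healthy node
2. Stop etcd on the node with abnormal data
3. From a healthy etcd node, remove the abnormal member
4. Clear the data under the member/ and wal/ directories on the abnormal node
5. Add the abnormal node back to the cluster
6. Start the etcd service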
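The steps above can be laid out as a dry run before touching the cluster. The sketch below only prints commands; it executes nothing. The function name, the member name `etcd3`, and the peer URL used in the usage line are hypothetical, the TLS flags from step 1 are omitted for brevity, and the data is moved aside rather than deleted, which is a cautious substitution for the "clear" step.

```shell
#!/usr/bin/env bash
# Print (do not run) the commands for steps 2-6 so they can be reviewed first.
# Arguments: <member-id> <member-name> <peer-url> <data-dir>
print_repair_plan() {
  local id="$1" name="$2" peer="$3" dir="$4"
  cat <<EOF
# step 2, on the abnormal node
systemctl stop etcd
# step 3, on a healthy node
etcdctl member remove ${id}
# step 4, on the abnormal node (moved aside instead of deleted)
mv ${dir}/member ${dir}.member.bak
# step 5, on a healthy node
etcdctl member add ${name} --peer-urls=${peer}
# step 6, on the abnormal node, with initial-cluster-state set to "existing"
systemctl start etcd
EOF
}

# Usage (member name and peer URL are placeholders):
#   print_repair_plan ceff1f8b9b557645 etcd3 https://10.100.11.49:2380 /var/lib/etcd
```

Because the member is re-added to a running cluster, the node's etcd configuration needs its initial cluster state set to `existing` before the service is started again.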
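3. Restore the data of the etcd service on the problem node
3.1 Identify the leader
Step 1.2 shows that 10.100.11.47 is the leader. If leadership needs to be moved, run the command below. The argument to move-leader is the ID of the member that should become the new leader (610822a36ea59feb below is the ID of 10.100.11.47 from step 1.2; substitute the ID of the target member), and the request has to be sent to the current leader's endpoint.
# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/node-k8s-master-01.pem --key=/etc/ssl/etcd/ssl/node-k8s-master-01-key.pem --endpoints="https://10.100.11.47:2379" move-leader 610822a36ea59feb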
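To pick the IDs without reading the table by eye, the status output can be filtered by the IS LEADER column. This is a small sketch that assumes the column order shown in step 1.2 (ID in column 2, IS LEADER in column 5 of the table); the helper names are made up for this example.

```shell
#!/usr/bin/env bash
# Helpers that read `etcdctl ... endpoint status -w table` output on stdin.
# With "|" as the separator, the ID cell is field 3 and IS LEADER is field 6.
ids_where_leader_is() {
  awk -F'|' -v want="$1" '
    NF >= 7 {
      id = $3; lead = $6
      gsub(/[ \t]/, "", id)
      gsub(/[ \t]/, "", lead)
      if (id == "ID" || id == "") next   # skip the header row
      if (lead == want) print id
    }'
}

# ID of the current leader.
leader_id()    { ids_where_leader_is true; }
# IDs of the members that could take over leadership.
follower_ids() { ids_where_leader_is false; }

# Usage:
#   etcdctl ... endpoint status -w table | follower_ids
```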
3.2 Repair the inconsistent node 10.100.11.49
Back up the etcd data from node 10.100.11.47
# mkdir -p /data/etcd_backup_dir
# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/node-k8s-master-01.pem --key=/etc/ssl/etcd/ssl/node-k8s-master-01-key.pem --endpoints="https://10.100.11.47:2379" snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
# ll /data/etcd_backup_dir/
total 1953088
-rw------- 1 root root 1999958048 Aug  9 16:45 etcd-snapshot-20230809.db
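Before going further it is worth confirming that the snapshot file was actually written. The helper below is a minimal sketch (the name `check_snapshot` is hypothetical) that only checks that the file exists and is non-empty; it does not validate the contents. For a deeper look, `etcdctl snapshot status <file> -w table` reports the hash, revision, key count, and size of a snapshot.

```shell
#!/usr/bin/env bash
# Return 0 if the snapshot file exists and is non-empty, 1 otherwise.
# This does not verify the snapshot's integrity, only that it was written.
check_snapshot() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "snapshot missing or empty: $file" >&2
    return 1
  fi
  echo "snapshot ok: $file ($(wc -c < "$file" | tr -d ' ') bytes)"
}

# Usage:
#   check_snapshot /data/etcd_backup_dir/etcd-snapshot-20230809.db
```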
It is also best to back up the /var/lib/etcd directory (run the following from /var/lib):
# cp -R etcd etcd-20230809bak
Or make a compressed backup:
# tar -czvf etcd-20230809bak.tar.gz etcd
Stop the etcd service on node 10.100.11.49
# systemctl status etcd
# systemctl stop etcd
Remove the node from the etcd cluster
# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/node-k8s-master-01.pem --key=/etc/ssl/etcd/ssl/node-k8s-master-01-key.pem --endpoints="https://10.100.11.47:2379" member remove ceff1f8b9b557645
Member ceff1f8b9b557645 removed from cluster 386324aaf4745646
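The removal can be double-checked against `etcdctl member list`. The helper below is a small sketch with a hypothetical name; it simply looks for the member ID as a whole word in whatever `member list` prints, so it does not depend on a particular output format.

```shell
#!/usr/bin/env bash
# Succeed if the given member ID appears as a whole word on stdin,
# which is expected to be the output of `etcdctl ... member list`.
member_listed() {
  grep -qw -- "$1"
}

# Usage:
#   if etcdctl ... member list | member_listed ceff1f8b9b557645; then
#     echo "member is still part of the cluster" >&2
#   fi
```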
Delete the data of the abnormal etcd node
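A cautious way to do this is to move the data aside instead of deleting it outright, so it can still be inspected or restored. The sketch below assumes the data directory is /var/lib/etcd (the directory backed up earlier) and that the member data sits in its `member` subdirectory; the function name and the backup naming scheme are made up for this example. Run it only after etcd has been stopped on the node.

```shell
#!/usr/bin/env bash
# Move <data-dir>/member to a timestamped sibling of <data-dir> instead of
# deleting it. Prints the backup path on success.
set_aside_member_dir() {
  local data_dir="${1%/}"
  local src="$data_dir/member"
  local dst="$data_dir.member.bak-$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$src" ]; then
    echo "no member directory under $data_dir" >&2
    return 1
  fi
  mv -- "$src" "$dst" && echo "$dst"
}

# Usage (on 10.100.11.49, with etcd stopped):
#   set_aside_member_dir /var/lib/etcd
```

Once the node has rejoined and caught up, the backup directory can be removed by hand.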
III. Setting a Goal
Pass the advanced-level Software Qualification exam (软考高级) in 2024. Let's go!