[My openGauss Story] Handling openGauss GAUSS-51400/53600 errors with the other nodes stuck in Unknown state
1. Check the status
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Down Manually stopped
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Unknown Unknown
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Unknown Unknown
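The status table above can also be scanned mechanically. The following is a minimal sketch, assuming the datanode rows keep the whitespace-separated layout shown in this transcript (node id, host name, IP, port, instance id, data directory, role flag, state, detail); `unknown_nodes` is a hypothetical helper name, not part of the openGauss tooling.

```shell
# Read `gs_om -t status --detail` output on stdin and print the host name
# of every datanode whose state column (8th field) is "Unknown".
# Assumes the row layout shown in the transcript; rows are recognised
# by a numeric node id in the first field.
unknown_nodes() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "Unknown" { print $2 }'
}

# Example:
#   gs_om -t status --detail | unknown_nodes
```

On the output above this would list Euler2 and Euler3, the two nodes to look at first.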
2. GAUSS-51400
[omm@Euler1 ~]$ gs_om -t start
Starting cluster.
=========================================
omm@euler2's password:
[GAUSS-51400] : Failed to execute the command: scp Euler3:/gauss/app_5b3e5810/bin/cluster_dynamic_config /gauss/app_5b3e5810/bin/cluster_dynamic_config_Euler3. Error:
ssh: connect to host euler3 port 22: No route to host
The Euler3 host has a problem: on inspection it had not booted properly, so I rebooted it.
3. GAUSS-53600/51400
Starting the cluster again now fails with GAUSS-53600/51400.
[omm@Euler1 ~]$ gs_om -t start
Starting cluster.
=========================================
omm@euler2's password:
omm@euler3's password:
[SUCCESS] Euler1
2023-07-11 16:33:56.783 64ad13f4.1 [unknown] 140702557879360 [unknown] 0 dn_6001_6002_6003 01000 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
2023-07-11 16:33:56.785 64ad13f4.1 [unknown] 140702557879360 [unknown] 0 dn_6001_6002_6003 01000 0 [BACKEND] WARNING: Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (16 Mbytes) or shared memory (1000 Mbytes) is larger.
=========================================
[GAUSS-53600]: Can not start the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off, Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off. Error:
[FAILURE] Euler2:
.[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off. Error:
[FAILURE] Euler3:
The script fails to execute and Python looks like the unreliable part, so I stop the node and investigate.
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Degraded
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Primary Normal
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Unknown Unknown
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Unknown Unknown
[omm@Euler1 ~]$ gs_ctl stop -D /gauss/data/db1
[2023-07-11 16:35:58.075][39021][][gs_ctl]: gs_ctl stopped ,datadir is /gauss/data/db1
waiting for server to shut down......... done
[omm@Euler1 ~]$ python
Python 3.7.4 (default, Mar 3 2022, 14:19:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
[omm@Euler1 ~]$
[omm@Euler1 ~]$
[omm@Euler1 ~]$ which python
/usr/bin/python
[omm@Euler1 ~]$ cd /usr/bin/
[root@Euler1 bin]# ls -lsa python
0 lrwxrwxrwx 1 root root 7 Jul 4 16:33 python -> python3
[root@Euler1 bin]# rm python
rm: remove symbolic link 'python'? y
[root@Euler1 bin]# ln -s python2.7 python
Remove the symbolic link and point python at python2 instead.
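Before swapping a system-wide link like this, it can help to record what it currently points to, so the change is easy to revert. Below is a small sketch; `link_target` is a hypothetical helper name. Note also that the failing commands in the log call python3 explicitly, so the transcript does not establish that the `python` link affects them.

```shell
# Print the immediate target of a symbolic link.
# Returns non-zero (with a message on stderr) if the path is not a symlink.
link_target() {
    if [ -L "$1" ]; then
        readlink "$1"
    else
        echo "not a symlink: $1" >&2
        return 1
    fi
}

# Example (path as in the transcript):
#   link_target /usr/bin/python
```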
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Degraded
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Primary Normal
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Unknown Unknown
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Unknown Unknown
[omm@Euler1 ~]$ gs_om -t stop
Stopping cluster.
=========================================
[GAUSS-53606]: Can not stop the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast, Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast. Error:
[FAILURE] Euler1:
[FAILURE] Euler2:
[FAILURE] Euler3:
..
[omm@Euler1 ~]$ ls -lsa /gauss/om/script/local/StopInstance.py
8 -rwx------ 1 omm dbgrp 4719 Nov 12 2022 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ chmod 777 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ gs_om -t stop
Stopping cluster.
=========================================
[GAUSS-53606]: Can not stop the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast, Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast. Error:
[FAILURE] Euler1:
[FAILURE] Euler2:
[FAILURE] Euler3:
Stopping again still fails, and changing the script permissions does not help either.
[omm@Euler1 ~]$ ls -lsa /gauss/om/script/local/StopInstance.py
8 -rwxrwxrwx 1 omm dbgrp 4719 Nov 12 2022 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ python3 /gauss/om/script/local/StopInstance.py -U omm -R /gauss/app -t 300 -m fast
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Down Manually stopped
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Unknown Unknown
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Unknown Unknown
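Running the script by hand works, which I take to mean that python3 had to be switched to python2 above (although the command here still calls python3 explicitly). The cluster can be started, but the omm password for the other nodes has to be typed in manually, so SSH mutual trust has evidently broken; the other nodes also all show state Unknown.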
4. Fix mutual trust
Re-establish SSH mutual trust between the nodes; here I used the sshUserSetup.sh script that ships with Oracle.
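Whether mutual trust actually works can be checked without being prompted for a password. The sketch below relies on the OpenSSH options `BatchMode=yes` (fail instead of asking for a password) and `ConnectTimeout`; the function name `check_trust` and the 5-second timeout are assumptions for illustration.

```shell
# For each host given as an argument, try a passwordless SSH login that
# runs `true`. Prints "<host>: ok" or "<host>: FAILED" and returns
# non-zero if any host could not be reached without a password.
check_trust() {
    local host
    local rc=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            rc=1
        fi
    done
    return "$rc"
}

# Example, run as omm on each node in turn:
#   check_trust Euler1 Euler2 Euler3
```

Running it from every node matters, because trust has to hold in both directions for the cluster tools to work.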
5. Check the status
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Down Manually stopped
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Standby Need repair(Connecting)
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Standby Need repair(Connecting)
Run the recovery again and give it some time.
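Instead of re-running the status command by hand, the wait can be scripted. This is a rough sketch that assumes the `cluster_state : <value>` line appears exactly as in the transcript; `wait_for_normal`, the default of 30 attempts, and the 10-second delay are hypothetical choices.

```shell
# Poll `gs_om -t status --detail` until cluster_state is "Normal".
# $1 = maximum number of attempts (default 30)
# $2 = seconds to sleep between attempts (default 10)
# Returns 0 once the cluster reports Normal, 1 if attempts run out.
wait_for_normal() {
    local tries=${1:-30}
    local delay=${2:-10}
    local state
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # Take the value after the colon and strip surrounding blanks.
        state=$(gs_om -t status --detail |
            awk -F: '/cluster_state/ { gsub(/[ \t]/, "", $2); print $2 }')
        if [ "$state" = "Normal" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example:
#   wait_for_normal 30 10 && echo "cluster is Normal"
```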
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Primary Normal
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Standby Normal
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Standby Normal
Euler3 shows "C Standby Normal"; it should be running as Cascade, so its role is wrong.
6. Fix the node role
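A mismatch like this between the configured role flag and the running role is easy to miss by eye. The sketch below compares the two columns, assuming the row layout of this transcript and the mapping P = Primary, S = Standby, C = Cascade as it appears in the status output; `role_mismatches` is a hypothetical helper name.

```shell
# Read `gs_om -t status --detail` output on stdin and report datanodes
# whose running state (8th field) does not match the configured role
# flag (7th field). A stopped or Unknown node is reported as well,
# since its state differs from the expected role.
role_mismatches() {
    awk '
        $1 ~ /^[0-9]+$/ {
            expected = ""
            if ($7 == "P") expected = "Primary"
            else if ($7 == "S") expected = "Standby"
            else if ($7 == "C") expected = "Cascade"
            if (expected != "" && $8 != expected)
                print $2 ": expected " expected ", running as " $8
        }
    '
}

# Example:
#   gs_om -t status --detail | role_mismatches
```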
[omm@Euler3 ~]$ gs_ctl stop -D /gauss/data/db1
[2023-07-11 17:13:56.050][73024][][gs_ctl]: gs_ctl stopped ,datadir is /gauss/data/db1
waiting for server to shut down.... done
server stopped
Log in to node 3 and stop the database on that node.
[omm@Euler3 ~]$ gs_ctl start -D /gauss/data/db1 -M cascade_standby
[2023-07-11 17:14:27.234][73207][][gs_ctl]: gs_ctl started,datadir is /gauss/data/db1
[2023-07-11 17:14:27.257][73207][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: Euler3
0 LOG: [Alarm Module]Host IP: 172.16.220.153
0 LOG: [Alarm Module]Cluster Name: gscluster
0 LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57
0 WARNING: failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING: failed to parse feature control file: gaussdb.version.
0 WARNING: Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
2023-07-11 17:14:27.310 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 1, max = 4, actual = 1
2023-07-11 17:14:27.310 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]Host Name: Euler3
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]Host IP: 172.16.220.153
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]Cluster Name: gscluster
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57
2023-07-11 17:14:27.316 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: loaded library "security_plugin"
2023-07-11 17:14:27.317 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 01000 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
2023-07-11 17:14:27.318 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2023-07-11 17:14:27.318 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 01000 0 [BACKEND] WARNING: Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (16 Mbytes) or shared memory (1000 Mbytes) is larger.
2023-07-11 17:14:27.329 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [CACHE] LOG: set data cache size(12582912)
2023-07-11 17:14:27.330 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [CACHE] LOG: set metadata cache size(4194304)
2023-07-11 17:14:27.415 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [SEGMENT_PAGE] LOG: Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: gaussdb: fsync file "/gauss/data/db1/gaussdb.state.temp" success
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Cascade Standby), connection index(1)
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000 0 [BACKEND] LOG: max_safe_fds = 974, usable_fds = 1000, already_open = 16
The core dump path is an invalid directory
.
[2023-07-11 17:14:29.268][73207][][gs_ctl]: done
[2023-07-11 17:14:29.268][73207][][gs_ctl]: server started (/gauss/data/db1)
The instance is restarted in cascade_standby mode to re-establish its role.
[omm@Euler3 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Primary Normal
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Standby Normal
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Cascade Normal
The state is now correct.
[omm@Euler3 ~]$ gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.
Save the state by regenerating the dynamic configuration file.
[omm@Euler1 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------
1 Euler1 172.16.220.151 26000 6001 /gauss/data/db1 P Primary Normal
2 Euler2 172.16.220.152 26000 6002 /gauss/data/db1 S Standby Normal
3 Euler3 172.16.220.153 26000 6003 /gauss/data/db1 C Cascade Normal
[omm@Euler1 ~]$
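Log in to node 1 and verify the cluster status; everything is normal.
7. Summary
SSH mutual trust for the omm user is essential in an openGauss cluster; when it breaks, cluster operations fail with errors like the ones shown here. openGauss operations also depend heavily on Python, and because Python versions differ considerably and are poorly backward compatible, the Python environment should be configured on every host at installation time.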