[My Story with openGauss] Handling openGauss GAUSS-51400/53600 Errors When Other Nodes Show State Unknown
lYE0sTgD5uUi, 2023-11-02


1. Check the cluster status
[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Unavailable
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Down    Manually stopped
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Unknown Unknown
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Unknown Unknown
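
When this output has to be checked repeatedly, it can help to parse it instead of reading it by eye. The following is a minimal sketch, assuming the row layout is exactly the one shown above (index, node, IP, port, instance id, data directory, role letter, state, detail). The names `parse_status` and `unhealthy` are hypothetical helpers, not part of openGauss.

```python
import re

# One datanode row of `gs_om -t status --detail`, in the layout shown above.
ROW = re.compile(
    r"^\s*(?P<idx>\d+)\s+(?P<node>\S+)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<port>\d+)\s+"
    r"(?P<instance>\d+)\s+(?P<datadir>\S+)\s+(?P<role>[PSC])\s+(?P<state>\S+)\s+(?P<detail>.+?)\s*$"
)


def parse_status(text):
    """Return (cluster_state, list of row dicts) parsed from the status text."""
    cluster_state = None
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*cluster_state\s*:\s*(\S+)", line)
        if m:
            cluster_state = m.group(1)
            continue
        m = ROW.match(line)
        if m:
            rows.append(m.groupdict())
    return cluster_state, rows


def unhealthy(rows):
    """Names of nodes whose detail column is anything other than 'Normal'."""
    return [r["node"] for r in rows if r["detail"] != "Normal"]


# Sample copied from the output above (header lines omitted).
SAMPLE = """\
cluster_state   : Unavailable
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Down    Manually stopped
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Unknown Unknown
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Unknown Unknown
"""

state, rows = parse_status(SAMPLE)
print(state, unhealthy(rows))
```

The text passed to `parse_status` would come from running `gs_om -t status --detail` as the omm user; how it is captured is left to the caller.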
2. GAUSS-51400
[omm@Euler1 ~]$ gs_om -t start 
Starting cluster.
=========================================
omm@euler2's password: 
[GAUSS-51400] : Failed to execute the command: scp Euler3:/gauss/app_5b3e5810/bin/cluster_dynamic_config /gauss/app_5b3e5810/bin/cluster_dynamic_config_Euler3. Error:
ssh: connect to host euler3 port 22: No route to host

The Euler3 host is the problem: on inspection it had not booted properly, so I rebooted it.
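
The "No route to host" message above means the SSH port on Euler3 could not be reached at all. A quick way to confirm this before starting the cluster is a plain TCP connection test. The sketch below uses only the Python standard library; `ssh_port_open` is a hypothetical helper name, and port 22 is taken from the error message.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", connection refused, timeouts and
        # name-resolution failures alike.
        return False


# Example use on the cluster in this article:
#   for host in ("Euler2", "Euler3"):
#       print(host, ssh_port_open(host))
```

A `False` result only says the port is unreachable; whether the host is powered off, still booting, or firewalled has to be checked on the host itself.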

3. GAUSS-53600/51400

Starting the cluster again now fails with GAUSS-53600/51400.

[omm@Euler1 ~]$ gs_om -t start 
Starting cluster.
=========================================
omm@euler2's password: 
omm@euler3's password: 
[SUCCESS] Euler1
2023-07-11 16:33:56.783 64ad13f4.1 [unknown] 140702557879360 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2023-07-11 16:33:56.785 64ad13f4.1 [unknown] 140702557879360 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (16 Mbytes) or shared memory (1000 Mbytes) is larger.
=========================================
[GAUSS-53600]: Can not start the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off,  Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off. Error:
[FAILURE] Euler2:
.[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StartInstance.py' -U omm -R /gauss/app -t 300 --security-mode=off. Error:
[FAILURE] Euler3:

The script execution failed. Python looked like the unreliable part, so I stopped the node to investigate.
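
The start output above mixes database log lines with the lines that actually matter: the GAUSS error codes and the per-node result. A small sketch like the following, assuming the `[GAUSS-nnnnn]`, `[SUCCESS]` and `[FAILURE]` markers appear as they do above, pulls out just those. `summarize_om_output` is a hypothetical name, and the sample abbreviates the long command lines with dots.

```python
import re


def summarize_om_output(text):
    """Collect GAUSS error codes and per-node results from gs_om output."""
    codes = sorted(set(re.findall(r"\[(GAUSS-\d+)\]", text)))
    ok = re.findall(r"\[SUCCESS\]\s+(\w+)", text)
    failed = re.findall(r"\[FAILURE\]\s+(\w+)", text)
    return {"codes": codes, "ok": ok, "failed": failed}


# Abbreviated from the output above; the full command text is replaced by dots.
SAMPLE = """\
[SUCCESS] Euler1
[GAUSS-53600]: Can not start the database, the cmd is ...,  Error:
[GAUSS-51400] : Failed to execute the command: .... Error:
[FAILURE] Euler2:
.[GAUSS-51400] : Failed to execute the command: .... Error:
[FAILURE] Euler3:
"""

print(summarize_om_output(SAMPLE))
```

Here the summary makes the pattern visible: the local node starts, and only the two remote nodes fail.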

[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Unknown Unknown
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Unknown Unknown
[omm@Euler1 ~]$ gs_ctl stop -D /gauss/data/db1
[2023-07-11 16:35:58.075][39021][][gs_ctl]: gs_ctl stopped ,datadir is /gauss/data/db1 
waiting for server to shut down......... done
[omm@Euler1 ~]$ python
Python 3.7.4 (default, Mar  3 2022, 14:19:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
[omm@Euler1 ~]$ 
[omm@Euler1 ~]$ 
[omm@Euler1 ~]$ which python
/usr/bin/python
[omm@Euler1 ~]$ cd /usr/bin/
[root@Euler1 bin]# ls -lsa python
0 lrwxrwxrwx 1 root root 7 Jul  4 16:33 python -> python3
[root@Euler1 bin]# rm python
rm: remove symbolic link 'python'? y

[root@Euler1 bin]# ln -s python2.7 python

Removed the symlink and pointed python at python2 instead.
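
Before changing a symlink like this, it may be worth recording what each interpreter name actually resolves to, since the failing command in the error message calls `python3` explicitly rather than `python`. The sketch below uses only the standard library; `resolve_command` is a hypothetical helper name.

```python
import os
import shutil


def resolve_command(name, path=None):
    """Return (path found on PATH, fully resolved real path), or None if absent."""
    found = shutil.which(name, path=path)
    if found is None:
        return None
    return found, os.path.realpath(found)


# Example use:
#   for name in ("python", "python2", "python3"):
#       print(name, resolve_command(name))
```

Running it as the omm user and as root shows whether both see the same interpreters.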

[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Unknown Unknown
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Unknown Unknown
[omm@Euler1 ~]$ gs_om -t stop
Stopping cluster.
=========================================
[GAUSS-53606]: Can not stop the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast,  Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast. Error:
[FAILURE] Euler1:
[FAILURE] Euler2:
[FAILURE] Euler3:
..
[omm@Euler1 ~]$ ls -lsa /gauss/om/script/local/StopInstance.py
8 -rwx------ 1 omm dbgrp 4719 Nov 12  2022 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ chmod 777 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ gs_om -t stop
Stopping cluster.
=========================================
[GAUSS-53606]: Can not stop the database, the cmd is source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast,  Error:
[GAUSS-51400] : Failed to execute the command: source /home/omm/.bashrc; python3 '/gauss/om/script/local/StopInstance.py' -U omm -R /gauss/app -t 300 -m fast. Error:
[FAILURE] Euler1:
[FAILURE] Euler2:
[FAILURE] Euler3:

Stopping the cluster again still fails; after changing the script permissions it fails in the same way.

[omm@Euler1 ~]$ ls -lsa /gauss/om/script/local/StopInstance.py
8 -rwxrwxrwx 1 omm dbgrp 4719 Nov 12  2022 /gauss/om/script/local/StopInstance.py
[omm@Euler1 ~]$ python3 /gauss/om/script/local/StopInstance.py -U omm -R /gauss/app -t 300 -m fast

[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Unavailable
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Down    Manually stopped
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Unknown Unknown
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Unknown Unknown

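Running the script by hand works, which suggested to me that the python3 in the command above needed to be switched to python2. The cluster can be started, but only by typing the omm password for the other nodes by hand, so SSH mutual trust appears to be broken. The other nodes also all show state Unknown.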

4. Fix SSH mutual trust

Re-establish mutual trust. Here I used the sshUserSetup.sh script that ships with Oracle to add it.
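
Whichever tool sets up the keys, the result can be verified by asking ssh to fail rather than prompt. The sketch below builds such a command with the standard OpenSSH options `BatchMode` and `ConnectTimeout`. The helper names `trust_check_cmd` and `has_trust` are hypothetical, and the `run` parameter exists only so the check can be exercised without a real host.

```python
import subprocess


def trust_check_cmd(host, user="omm", timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt; fail if key auth does not work
        "-o", f"ConnectTimeout={timeout}",  # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                             # trivial remote command
    ]


def has_trust(host, user="omm", timeout=5, run=subprocess.run):
    """Return True if passwordless ssh to user@host succeeds."""
    result = run(
        trust_check_cmd(host, user, timeout),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example use, run as omm on each node in turn:
#   for host in ("Euler1", "Euler2", "Euler3"):
#       print(host, has_trust(host))
```

Because `gs_om` connects from one node to all the others, the check should pass in every direction, not just from Euler1.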

5. Check the status
[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Unavailable
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Down    Manually stopped
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Standby Need repair(Connecting)
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Standby Need repair(Connecting)

Let recovery run and give it some time.
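
Instead of rerunning the status command by hand while recovery proceeds, the wait can be scripted as a polling loop. This is a minimal sketch: `extract_cluster_state` and `wait_until_normal` are hypothetical names, `get_state` is a caller-supplied function (for example one that runs `gs_om -t status` and passes the output through `extract_cluster_state`), and the timeout and interval values are arbitrary.

```python
import re
import time


def extract_cluster_state(status_text):
    """Return the value of the cluster_state line, or None if it is missing."""
    m = re.search(r"^\s*cluster_state\s*:\s*(\S+)", status_text, re.MULTILINE)
    return m.group(1) if m else None


def wait_until_normal(get_state, timeout=300.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'Normal' or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if get_state() == "Normal":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters default to the real clock and are only there so the loop can be tested quickly.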

[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Standby Normal
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Standby Normal

Euler3 shows state C Standby Normal, but it should be Cascade; its role is wrong.
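
This kind of mismatch is easy to miss because the cluster state still reads Normal. The sketch below flags it automatically, under the assumption made in this article that the single letter (P, S, C) is the configured role and the word after it is the role the instance is actually running in. `role_mismatches` is a hypothetical helper name.

```python
import re

# Assumed mapping from the role letter to the running role it should show.
EXPECTED = {"P": "Primary", "S": "Standby", "C": "Cascade"}

# index, node, ip, port, instance id, data dir, role letter, running role, detail
ROW = re.compile(
    r"^\s*\d+\s+(?P<node>\S+)\s+\S+\s+\d+\s+\d+\s+\S+\s+"
    r"(?P<letter>[PSC])\s+(?P<running>\S+)\s+(?P<detail>.+?)\s*$"
)


def role_mismatches(text):
    """Return (node, expected role, running role) for rows where they differ."""
    out = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m or m["running"] in ("Unknown", "Down"):
            continue  # skip non-row lines and instances with no running role
        expected = EXPECTED[m["letter"]]
        if m["running"] != expected:
            out.append((m["node"], expected, m["running"]))
    return out


# Rows copied from the output above.
SAMPLE = """\
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Standby Normal
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Standby Normal
"""

print(role_mismatches(SAMPLE))
```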

6. Fix the role
[omm@Euler3 ~]$ gs_ctl stop -D /gauss/data/db1
[2023-07-11 17:13:56.050][73024][][gs_ctl]: gs_ctl stopped ,datadir is /gauss/data/db1 
waiting for server to shut down.... done
server stopped

Log in to node 3 and stop the database on that node.

[omm@Euler3 ~]$ gs_ctl start -D /gauss/data/db1 -M cascade_standby
[2023-07-11 17:14:27.234][73207][][gs_ctl]: gs_ctl started,datadir is /gauss/data/db1 
[2023-07-11 17:14:27.257][73207][][gs_ctl]: waiting for server to start...
.0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

0 LOG:  [Alarm Module]Host Name: Euler3 

0 LOG:  [Alarm Module]Host IP: 172.16.220.153 

0 LOG:  [Alarm Module]Cluster Name: gscluster 

0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57

0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
2023-07-11 17:14:27.310 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  Recovery parallelism, cpu count = 1, max = 4, actual = 1
2023-07-11 17:14:27.310 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host Name: Euler3 

2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host IP: 172.16.220.153 

2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Cluster Name: gscluster 

2023-07-11 17:14:27.314 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57

2023-07-11 17:14:27.316 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  loaded library "security_plugin"
2023-07-11 17:14:27.317 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2023-07-11 17:14:27.318 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2023-07-11 17:14:27.318 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (16 Mbytes) or shared memory (1000 Mbytes) is larger.
2023-07-11 17:14:27.329 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [CACHE] LOG:  set data cache  size(12582912)
2023-07-11 17:14:27.330 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [CACHE] LOG:  set metadata cache  size(4194304)
2023-07-11 17:14:27.415 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [SEGMENT_PAGE] LOG:  Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  gaussdb: fsync file "/gauss/data/db1/gaussdb.state.temp" success
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  create gaussdb state file success: db state(STARTING_STATE), server mode(Cascade Standby), connection index(1)
2023-07-11 17:14:27.447 64ad1d73.1 [unknown] 139936913943616 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  max_safe_fds = 974, usable_fds = 1000, already_open = 16
The core dump path is an invalid directory
.
[2023-07-11 17:14:29.268][73207][][gs_ctl]:  done
[2023-07-11 17:14:29.268][73207][][gs_ctl]: server started (/gauss/data/db1)

Restart the node in cascade standby mode to re-establish its role.

[omm@Euler3 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Standby Normal
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Cascade Normal

The state is now correct.

[omm@Euler3 ~]$ gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.

Save the state by regenerating the dynamic configuration file.

[omm@Euler1 ~]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node  node_ip         port      instance                state
---------------------------------------------------------------------------------
1  Euler1 172.16.220.151  26000      6001 /gauss/data/db1   P Primary Normal
2  Euler2 172.16.220.152  26000      6002 /gauss/data/db1   S Standby Normal
3  Euler3 172.16.220.153  26000      6003 /gauss/data/db1   C Cascade Normal
[omm@Euler1 ~]$

Log in to node 1 and verify the cluster status: everything is normal.

7. Summary

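Mutual SSH trust for the omm user is essential in an openGauss cluster; when it breaks, cluster operations fail with errors like the ones above. openGauss tooling also depends heavily on Python, and because Python versions differ considerably and are poorly backward compatible, the Python environment on each host should be configured properly at installation time.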

Last edited on 2023-11-08
