[Linux-ha-jp] STONITH errors during split-brain


renay****@ybb***** renay****@ybb*****
Mon Mar 16 21:48:53 JST 2015



There appears to be a fencing_topology sample from last year's OSC Tokyo at the following URL.

 * http://linux-ha.sourceforge.jp/wp/wp-content/uploads/osc2014_crm.txt


fencing_topology \
    server01: prmStonith1 \
    server02: prmStonith2

Each line gives the target node, then the stonith agent(s) to run against it ... [more than one may be listed]
* http://clusterlabs.org/wiki/Fencing_topology
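
As a rough sketch of how that sample might map onto the cluster in the quoted crm_mon output below: this assumes that Stonith1-* are the devices that fence lbv1.beta.com and Stonith2-* are the ones that fence lbv2.beta.com (the output itself does not confirm this), and that the devices are referenced individually rather than through grpStonith1/grpStonith2. In crm shell syntax, space-separated devices after a node name are tried as successive levels, in the order listed. This has not been tested against this configuration.

```
# Assumed mapping: Stonith1-* target lbv1, Stonith2-* target lbv2
fencing_topology \
    lbv1.beta.com: Stonith1-1 Stonith1-2 Stonith1-3 \
    lbv2.beta.com: Stonith2-1 Stonith2-2 Stonith2-3
```

With this ordering, stonith-helper would be tried first, then xen0, with meatware as the last resort.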

----- Original Message -----
>From: Masamichi Fukuda - elf-systems <masamichi_fukud****@elf-s*****>
>To: "linux****@lists*****" <linux****@lists*****> 
>Date: 2015/3/16, Mon 19:24
>Subject: Re: [Linux-ha-jp] STONITH errors during split-brain
>Here is the output of crm_mon -rfA.
>Last updated: Mon Mar 16 18:26:37 2015
>Last change: Mon Mar 16 18:04:31 2015
>Stack: heartbeat
>Current DC: lbv2.beta.com (82ffc36f-1ad8-8686-7db0-35686465c624) - partition with quorum
>Version: 1.1.12-561c4cf
>2 Nodes configured
>10 Resources configured
>Online: [ lbv1.beta.com lbv2.beta.com ]
>Full list of resources:
> Resource Group: HAvarnish
>     vip_208    (ocf::heartbeat:IPaddr2):       Stopped
>     varnishd   (lsb:varnish):  Stopped
> Resource Group: grpStonith1
>     Stonith1-1 (stonith:external/stonith-helper):      Stopped
>     Stonith1-2 (stonith:external/xen0):        Stopped
>     Stonith1-3 (stonith:meatware):     Stopped
> Resource Group: grpStonith2
>     Stonith2-1 (stonith:external/stonith-helper):      Stopped
>     Stonith2-2 (stonith:external/xen0):        Stopped
>     Stonith2-3 (stonith:meatware):     Stopped
> Clone Set: clone_ping [ping]
>     Stopped: [ lbv1.beta.com lbv2.beta.com ]
>Node Attributes:
>* Node lbv1.beta.com:
>* Node lbv2.beta.com:
>Migration summary:
>* Node lbv2.beta.com: 
>   Stonith1-1: migration-threshold=1 fail-count=1000000 last-failure='Mon Mar 16 18:23:47 2015'
>   ping: migration-threshold=1 fail-count=1000000 last-failure='Mon Mar 16 18:23:47 2015'
>* Node lbv1.beta.com: 
>   Stonith2-1: migration-threshold=1 fail-count=1000000 last-failure='Mon Mar 16 18:23:48 2015'
>   ping: migration-threshold=1 fail-count=1000000 last-failure='Mon Mar 16 18:23:55 2015'
>Failed actions:
>    Stonith1-1_start_0 on lbv2.beta.com 'unknown error' (1): call=39, status=Error, last-rc-change='Mon Mar 16 18:23:44 2015', queued=0ms, exec=2014ms
>    ping_start_0 on lbv2.beta.com 'unknown error' (1): call=40, status=complete, last-rc-change='Mon Mar 16 18:23:45 2015', queued=0ms, exec=995ms
>    Stonith2-1_start_0 on lbv1.beta.com 'unknown error' (1): call=39, status=Error, last-rc-change='Mon Mar 16 18:23:45 2015', queued=0ms, exec=2009ms
>    ping_start_0 on lbv1.beta.com 'unknown error' (1): call=41, status=complete, last-rc-change='Mon Mar 16 18:23:54 2015', queued=0ms, exec=182ms
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: info: Pacemaker support: yes
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: WARN: File /etc/ha.d//haresources exists.
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: WARN: This file is not used because pacemaker is enabled
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/heartbeat/ccm
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/cib
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/stonithd
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/lrmd
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/attrd
>Mar 16 18:22:47 lbv1.beta.com heartbeat: [1914]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/crmd
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: WARN: Core dumps could be lost if multiple dumps occur.
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: info: **************************
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1914]: info: Configuration validated. Starting heartbeat 3.0.6
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: heartbeat: version 3.0.6
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: Heartbeat generation: 1423534103
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: seed is -1702799346
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: glib: ucast: bound send socket to device: eth1
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: glib: ucast: set SO_REUSEADDR
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: glib: ucast: bound receive socket to device: eth1
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: glib: ucast: started on port 694 interface eth1 to
>Mar 16 18:22:48 lbv1.beta.com heartbeat: [1957]: info: Local status now set to: 'up'
>Mar 16 18:22:53 lbv1.beta.com heartbeat: [1957]: info: Link lbv2.beta.com:eth1 up.
>Mar 16 18:22:53 lbv1.beta.com heartbeat: [1957]: info: Status update for node lbv2.beta.com: status up
>Mar 16 18:22:53 lbv1.beta.com heartbeat: [1957]: debug: get_delnodelist: delnodelist= 
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Comm_now_up(): updating status to active
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Local status now set to: 'active'
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/heartbeat/ccm" (109,113)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/cib" (109,113)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/stonithd" (0,0)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/lrmd" (0,0)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/attrd" (109,113)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/crmd" (109,113)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: Status update for node lbv2.beta.com: status active
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2868]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/stonithd" as uid 0  gid 0 (pid 2868)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2866]: info: Starting "/usr/local/heartbeat/libexec/heartbeat/ccm" as uid 109  gid 113 (pid 2866)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2871]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/crmd" as uid 109  gid 113 (pid 2871)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2869]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/lrmd" as uid 0  gid 0 (pid 2869)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2867]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/cib" as uid 109  gid 113 (pid 2867)
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [2870]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/attrd" as uid 109  gid 113 (pid 2870)
>Mar 16 18:22:54 lbv1.beta.com ccm: [2866]: info: Hostname: lbv1.beta.com
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: the send queue length from heartbeat to client ccm is set to 1024
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: the send queue length from heartbeat to client attrd is set to 1024
>Mar 16 18:22:54 lbv1.beta.com heartbeat: [1957]: info: the send queue length from heartbeat to client stonithd is set to 1024
>Mar 16 18:22:55 lbv1.beta.com heartbeat: [1957]: info: the send queue length from heartbeat to client cib is set to 1024
>Mar 16 18:22:58 lbv1.beta.com heartbeat: [1957]: WARN: 1 lost packet(s) for [lbv2.beta.com] [33:35]
>Mar 16 18:22:58 lbv1.beta.com heartbeat: [1957]: info: No pkts missing from lbv2.beta.com!
>Mar 16 18:22:59 lbv1.beta.com heartbeat: [1957]: info: the send queue length from heartbeat to client crmd is set to 1024
>Mar 16 18:22:59 lbv1.beta.com heartbeat: [1957]: WARN: 1 lost packet(s) for [lbv2.beta.com] [40:42]
>Mar 16 18:22:59 lbv1.beta.com heartbeat: [1957]: info: No pkts missing from lbv2.beta.com!
>ping(ping)[3164]:    2015/03/16_18:23:54 WARNING: Could not update default_ping_set = 100: rc=127
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: info: Pacemaker support: yes
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: File /etc/ha.d//haresources exists.
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: This file is not used because pacemaker is enabled
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/heartbeat/ccm
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/cib
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/stonithd
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/lrmd
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/attrd
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: debug: Checking access of: /usr/local/heartbeat/libexec/pacemaker/crmd
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: Core dumps could be lost if multiple dumps occur.
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: info: **************************
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1925]: info: Configuration validated. Starting heartbeat 3.0.6
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: heartbeat: version 3.0.6
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: Heartbeat generation: 1423534179
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: seed is 2086609325
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: glib: ucast: bound send socket to device: eth1
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: glib: ucast: set SO_REUSEADDR
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: glib: ucast: bound receive socket to device: eth1
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: glib: ucast: started on port 694 interface eth1 to
>Mar 16 18:22:47 lbv2.beta.com heartbeat: [1977]: info: Local status now set to: 'up'
>Mar 16 18:22:48 lbv2.beta.com heartbeat: [1977]: info: Link lbv1.beta.com:eth1 up.
>Mar 16 18:22:48 lbv2.beta.com heartbeat: [1977]: info: Status update for node lbv1.beta.com: status up
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: debug: get_delnodelist: delnodelist= 
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Comm_now_up(): updating status to active
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Local status now set to: 'active'
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/heartbeat/ccm" (109,113)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/cib" (109,113)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/stonithd" (0,0)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/lrmd" (0,0)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/attrd" (109,113)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [1977]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/crmd" (109,113)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3026]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/attrd" as uid 109  gid 113 (pid 3026)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3023]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/cib" as uid 109  gid 113 (pid 3023)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3025]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/lrmd" as uid 0  gid 0 (pid 3025)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3024]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/stonithd" as uid 0  gid 0 (pid 3024)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3022]: info: Starting "/usr/local/heartbeat/libexec/heartbeat/ccm" as uid 109  gid 113 (pid 3022)
>Mar 16 18:22:53 lbv2.beta.com heartbeat: [3027]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/crmd" as uid 109  gid 113 (pid 3027)
>Mar 16 18:22:54 lbv2.beta.com ccm: [3022]: info: Hostname: lbv2.beta.com
>Mar 16 18:22:54 lbv2.beta.com heartbeat: [1977]: info: the send queue length from heartbeat to client ccm is set to 1024
>Mar 16 18:22:54 lbv2.beta.com heartbeat: [1977]: info: the send queue length from heartbeat to client attrd is set to 1024
>Mar 16 18:22:54 lbv2.beta.com heartbeat: [1977]: info: Status update for node lbv1.beta.com: status active
>Mar 16 18:22:54 lbv2.beta.com heartbeat: [1977]: info: the send queue length from heartbeat to client stonithd is set to 1024
>Mar 16 18:22:54 lbv2.beta.com heartbeat: [1977]: info: the send queue length from heartbeat to client cib is set to 1024
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: quorum plugin: majority
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: total_node_count=2, total_quorum_votes=200
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: quorum plugin: twonodes
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: debug: total_node_count=2, total_quorum_votes=200
>Mar 16 18:22:58 lbv2.beta.com ccm: [3022]: info: Break tie for 2 nodes cluster
>Mar 16 18:22:58 lbv2.beta.com heartbeat: [1977]: WARN: 1 lost packet(s) for [lbv1.beta.com] [30:32]
>Mar 16 18:22:58 lbv2.beta.com heartbeat: [1977]: info: No pkts missing from lbv1.beta.com!
>Mar 16 18:22:58 lbv2.beta.com heartbeat: [1977]: info: the send queue length from heartbeat to client crmd is set to 1024
>Mar 16 18:22:59 lbv2.beta.com heartbeat: [1977]: WARN: 1 lost packet(s) for [lbv1.beta.com] [35:37]
>Mar 16 18:22:59 lbv2.beta.com heartbeat: [1977]: info: No pkts missing from lbv1.beta.com!
>Mar 16 18:22:59 lbv2.beta.com ccm: [3022]: debug: quorum plugin: majority
>Mar 16 18:22:59 lbv2.beta.com ccm: [3022]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
>Mar 16 18:22:59 lbv2.beta.com ccm: [3022]: debug: total_node_count=2, total_quorum_votes=200
>ping(ping)[3144]:    2015/03/16_18:23:46 WARNING: Could not update default_ping_set = 100: rc=127
>On March 16, 2015 at 18:53, Takehiro Matsushima <takeh****@gmail*****> wrote:
>>I am also concerned that the ping RA's start is ending in unknown error, so
>>for ping and Stonith Helper, including whatever each RA wrote to standard output and standard error
>>Takehiro Matsushima
>>Linux-ha-japan mailing list
>ELF Systems
>Masamichi Fukuda
>mail to: masamichi_fukud****@elf-s*****
