[Linux-ha-jp] STONITH errors during split-brain


Masamichi Fukuda - elf-systems masamichi_fukud****@elf-s*****
Tue Mar 17 10:31:09 JST 2015


Yamauchi-san,
cc: Matsushima-san

Good morning, this is Fukuda.
Thank you for the crm example.

I have adapted it to our environment right away.

$ cat test.crm
### Cluster Option ###
property \
    no-quorum-policy="ignore" \
    stonith-enabled="true" \
    startup-fencing="false" \
    stonith-timeout="710s" \
    crmd-transition-delay="2s"

### Resource Default ###
rsc_defaults \
    resource-stickiness="INFINITY" \
    migration-threshold="1"

### Group Configuration ###
group HAvarnish \
    vip_208 \
    varnishd

group grpStonith1 \
    Stonith1-1 \
    Stonith1-2

group grpStonith2 \
    Stonith2-1 \
    Stonith2-2

### Clone Configuration ###
clone clone_ping \
    ping

### Fencing Topology ###
fencing_topology \
    lbv1.beta.com: Stonith1-1 Stonith1-2 \
    lbv2.beta.com: Stonith2-1 Stonith2-2

### Primitive Configuration ###
primitive vip_208 ocf:heartbeat:IPaddr2 \
    params \
        ip="192.168.17.208" \
        nic="eth0" \
        cidr_netmask="24" \
    op start interval="0s" timeout="90s" on-fail="restart" \
    op monitor interval="5s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="100s" on-fail="fence"

primitive varnishd lsb:varnish \
    op start interval="0s" timeout="90s" on-fail="restart" \
    op monitor interval="10s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="100s" on-fail="fence"

primitive ping ocf:pacemaker:ping \
    params \
        name="default_ping_set" \
        host_list="192.168.17.254" \
        multiplier="100" \
        dampen="1" \
    op start interval="0s" timeout="90s" on-fail="restart" \
    op monitor interval="10s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="100s" on-fail="fence"

primitive Stonith1-1 stonith:external/stonith-helper \
    params \
        pcmk_reboot_retries="1" \
        pcmk_reboot_timeout="40s" \
        hostlist="lbv1.beta.com" \
        dead_check_target="192.168.17.132 10.0.17.132" \
        standby_check_command="/usr/local/sbin/crm_resource -r varnishd -W | grep -q `hostname`" \
        run_online_check="yes" \
    op start interval="0s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="60s" on-fail="ignore"

primitive Stonith1-2 stonith:external/xen0 \
    params \
        pcmk_reboot_timeout="60s" \
        hostlist="lbv1.beta.com:/etc/xen/lbv1.cfg" \
        dom0="xen0.beta.com" \
    op start interval="0s" timeout="60s" on-fail="restart" \
    op monitor interval="3600s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="60s" on-fail="ignore"

primitive Stonith2-1 stonith:external/stonith-helper \
    params \
        pcmk_reboot_retries="1" \
        pcmk_reboot_timeout="40s" \
        hostlist="lbv2.beta.com" \
        dead_check_target="192.168.17.133 10.0.17.133" \
        standby_check_command="/usr/local/sbin/crm_resource -r varnishd -W | grep -q `hostname`" \
        run_online_check="yes" \
    op start interval="0s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="60s" on-fail="ignore"

primitive Stonith2-2 stonith:external/xen0 \
    params \
        pcmk_reboot_timeout="60s" \
        hostlist="lbv2.beta.com:/etc/xen/lbv2.cfg" \
        dom0="xen0.beta.com" \
    op start interval="0s" timeout="60s" on-fail="restart" \
    op monitor interval="3600s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="60s" on-fail="ignore"

### Resource Location ###
location HA_location-1 HAvarnish \
    rule 200: #uname eq lbv1.beta.com \
    rule 100: #uname eq lbv2.beta.com

location HA_location-2 HAvarnish \
    rule -INFINITY: not_defined default_ping_set or default_ping_set lt 100

location HA_location-3 grpStonith1 \
    rule -INFINITY: #uname eq lbv1.beta.com

location HA_location-4 grpStonith2 \
    rule -INFINITY: #uname eq lbv2.beta.com
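
As a quick sanity check on the file itself: if a quoted parameter value is split across two lines (for example by an editor or mail client wrapping a long standby_check_command), each half ends up with an odd number of double quotes. The following is only a sketch and not part of the configuration. The function name find_split_quotes is made up for illustration, and it assumes that values contain no escaped double quotes.

```shell
#!/bin/sh
# find_split_quotes FILE
# Print "LINE_NUMBER: text" for every line that holds an odd number of
# double quotes, which usually means a quoted value was wrapped.
# Hypothetical helper; assumes no escaped quotes (\") inside values.
find_split_quotes() {
    # With '"' as the field separator, a line with n quotes has n+1 fields,
    # so an even, non-zero field count means n is odd.
    awk -F'"' 'NF > 0 && NF % 2 == 0 { print NR ": " $0 }' "$1"
}
```

Running `find_split_quotes test.crm` prints nothing when every quoted value is closed on the line where it was opened.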


After loading this configuration, the messages differ from yesterday's.
The ping-related messages are gone.

# crm_mon -rfA
Last updated: Tue Mar 17 10:21:28 2015
Last change: Tue Mar 17 10:21:09 2015
Stack: heartbeat
Current DC: lbv2.beta.com (82ffc36f-1ad8-8686-7db0-35686465c624) - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured
8 Resources configured


Online: [ lbv1.beta.com lbv2.beta.com ]

Full list of resources:

 Resource Group: HAvarnish
     vip_208    (ocf::heartbeat:IPaddr2):       Started lbv1.beta.com
     varnishd   (lsb:varnish):  Started lbv1.beta.com
 Resource Group: grpStonith1
     Stonith1-1 (stonith:external/stonith-helper):      Stopped
     Stonith1-2 (stonith:external/xen0):        Stopped
 Resource Group: grpStonith2
     Stonith2-1 (stonith:external/stonith-helper):      Stopped
     Stonith2-2 (stonith:external/xen0):        Stopped
 Clone Set: clone_ping [ping]
     Started: [ lbv1.beta.com lbv2.beta.com ]

Node Attributes:
* Node lbv1.beta.com:
    + default_ping_set                  : 100
* Node lbv2.beta.com:
    + default_ping_set                  : 100

Migration summary:
* Node lbv2.beta.com:
   Stonith1-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 10:21:17 2015'
* Node lbv1.beta.com:
   Stonith2-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 10:21:17 2015'

Failed actions:
    Stonith1-1_start_0 on lbv2.beta.com 'unknown error' (1): call=31, status=Error, last-rc-change='Tue Mar 17 10:21:15 2015', queued=0ms, exec=1082ms
    Stonith2-1_start_0 on lbv1.beta.com 'unknown error' (1): call=31, status=Error, last-rc-change='Tue Mar 17 10:21:16 2015', queued=0ms, exec=1079ms


This is the log from /var/log/ha-debug.

IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: Adding inet address 192.168.17.208/24 with broadcast address 192.168.17.255 to device eth0
IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: Bringing device eth0 up
IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.17.208 eth0 192.168.17.208 auto not_used not_used

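Nothing was written to standard output or standard error.

Could something be wrong with stonith-helper?
Since stonith-helper is a shell script, I did not pay much attention to how it was installed.
stonith-helper is located here: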
/usr/local/heartbeat/lib/stonith/plugins/external/stonith-helper
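
One thing that can be checked quickly is whether the script is actually present and executable at that path. A minimal sketch follows. The helper name check_plugin is made up for illustration, and the directory is simply the one quoted above.

```shell
#!/bin/sh
# check_plugin DIR NAME
# Report whether DIR/NAME exists and is executable.
# Hypothetical helper, for illustration only.
check_plugin() {
    path="$1/$2"
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        return 1
    fi
    if [ ! -x "$path" ]; then
        echo "not executable: $path"
        return 1
    fi
    echo "ok: $path"
}
```

For example: `check_plugin /usr/local/heartbeat/lib/stonith/plugins/external stonith-helper`. If the cluster-glue `stonith` command is installed, `stonith -L` should also show external/stonith-helper among the device types it can find. Whether it searches this /usr/local prefix depends on how it was built, so that is an assumption to verify.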

Thank you in advance.

Regards,

2015-03-17 9:45 GMT+09:00 <renay****@ybb*****>:

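> Fukuda-san,
>
> Good morning. This is Yamauchi.
>
> Just in case, here is an excerpt of an example I have on hand that uses multiple stonith devices.
> (In the actual file, please watch out for line breaks.)
>
> The example below is a configuration for the Pacemaker 1.1 series:
> for nodea, stonith is executed in the order prmStonith1-1, then prmStonith1-2;
> for nodeb, stonith is executed in the order prmStonith2-1, then prmStonith2-2.
>
> The stonith plugins themselves are helper and ssh.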
>
>
> (snip)
> ### Group Configuration ###
> group grpStonith1 \
> prmStonith1-1 \
> prmStonith1-2
>
> group grpStonith2 \
> prmStonith2-1 \
> prmStonith2-2
>
> ### Fencing Topology ###
> fencing_topology \
> nodea: prmStonith1-1 prmStonith1-2 \
> nodeb: prmStonith2-1 prmStonith2-2
> (snip)
> primitive prmStonith1-1 stonith:external/stonith-helper \
> params \
> pcmk_reboot_retries="1" \
> pcmk_reboot_timeout="40s" \
> hostlist="nodea" \
> dead_check_target="192.168.28.60 192.168.28.70" \
> standby_check_command="/usr/sbin/crm_resource -r prmRES -W | grep -qi `hostname`" \
> run_online_check="yes" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive prmStonith1-2 stonith:external/ssh \
> params \
> pcmk_reboot_timeout="60s" \
> hostlist="nodea" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="3600s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive prmStonith2-1 stonith:external/stonith-helper \
> params \
> pcmk_reboot_retries="1" \
> pcmk_reboot_timeout="40s" \
> hostlist="nodeb" \
> dead_check_target="192.168.28.61 192.168.28.71" \
> standby_check_command="/usr/sbin/crm_resource -r prmRES -W | grep -qi `hostname`" \
> run_online_check="yes" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive prmStonith2-2 stonith:external/ssh \
> params \
> pcmk_reboot_timeout="60s" \
> hostlist="nodeb" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="3600s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
> (snip)
> location rsc_location-grpStonith1-2 grpStonith1 \
> rule -INFINITY: #uname eq nodea
> location rsc_location-grpStonith2-3 grpStonith2 \
> rule -INFINITY: #uname eq nodeb
>
>
> That is all.
>
>
>
>

-- 
ELF Systems
Masamichi Fukuda
mail to: *masamichi_fukud****@elf-s***** <elfsy****@gmail*****>*