cs_clone with drbd failed to promote/demote node #463

Open

rotulet opened this issue Mar 27, 2019 · 1 comment
Labels
bug Something isn't working

Comments

rotulet commented Mar 27, 2019

Affected Puppet, Ruby, OS and module versions/distributions

  • Puppet: 6.3.0
  • Ruby: ruby 2.5.3p105 (2018-10-18 revision 65156)
  • Distribution: debian sid
  • Module version:
    ├─┬ puppet-corosync (v6.0.1)
    │ └── puppetlabs-stdlib (v4.25.1)
    ├─┬ puppet-drbd (v0.5.2)
    │ └─┬ puppetlabs-concat (v5.3.0)
    │ └── puppetlabs-translate (v1.2.0)
  • pcs: 0.10.1.2
  • pacemaker: 2.0.1-1
  • corosync: 3.0.1-2

How to reproduce (e.g. Puppet code you use)

  file { '/root/corosync_authkey':
    ensure => file,
    mode   => '0600',
    owner  => 'root',
    source => 'puppet:///modules/fscluster/authkey.pem',
  }
  class { 'corosync':
    package_corosync         => true,
    version_corosync         => '3.0*',
    enable_corosync_service  => true,

    package_pacemaker        => true,
    version_pacemaker        => '2.0*',
    enable_pacemaker_service => true,

    package_pcs              => true,
    version_pcs              => '0.10*',
    package_crmsh            => false,

    bind_address             => $::ipaddress,
    cluster_name             => 'nfs_cluster',
    check_standby            => true,
    test_corosync_config     => true,

    enable_secauth           => true,
    authkey_source           => 'file',
    authkey                  => '/root/corosync_authkey',

    set_votequorum           => true,
    quorum_members           => [ $fscluster::share_server_ips[0], $fscluster::share_server_ips[1] ],
    quorum_members_names     => [ $fscluster::share_server_hosts[0], $fscluster::share_server_hosts[1] ],
    require                  => File['/root/corosync_authkey']
  }

  corosync::service { 'pacemaker':
    version => '2.0*',
  }

  cs_property { 'stonith-enabled' :
    value   => false,
  }

  cs_property { 'no-quorum-policy' :
    value   => 'ignore',
  }

  cs_rsc_defaults { 'resource-stickiness' :
    value => 'INFINITY',
  }

  cs_primitive { 'DrbdVolume':
    primitive_class => 'ocf',
    provided_by     => 'linbit',
    primitive_type  => 'drbd',
    parameters      => { 'drbd_resource' => 'all' },
    metadata        => {
      'master-max'      => '1',
      'master-node-max' => '1',
      'clone-max'       => '2',
      'clone-node-max'  => '1',
      'promotable'      => true,
      'notify'          => true
    },
    operations      => [
      {'monitor' => { 'interval' => '10s', 'role' => 'Slave' }},
      {'monitor' => { 'interval' => '09s', 'role' => 'Master' }},
      {'demote'  => { 'interval' => '0s', 'timeout' => '90s' }},
      {'notify'  => { 'interval' => '0s', 'timeout' => '90s' }},
      {'promote' => { 'interval' => '0s', 'timeout' => '90s' }},
      {'reload'  => { 'interval' => '0s', 'timeout' => '30s' }},
      {'start'   => { 'interval' => '0s', 'timeout' => '240s' }},
      {'stop'    => { 'interval' => '0s', 'timeout' => '100s' }},
    ],
  }

  cs_clone { 'DrbdVolume-clone':
    ensure         => present,
    primitive      => 'DrbdVolume',
    clone_max      => 2,
    clone_node_max => 1,
    notify_clones  => true,
    require        => Cs_primitive['DrbdVolume'],
  }
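Note that in Pacemaker 2.0, promotable (master/slave) behaviour is configured as a meta attribute on the clone itself, not on the primitive, while the manifest above places 'promotable' in the primitive's metadata and cs_clone (v6.0.1) offers no parameter for it. A possible out-of-band workaround is sketched below; it is untested, and the resource name, pcs path, and grep-based idempotency check are assumptions:

```puppet
# Untested workaround sketch: Pacemaker 2.0 expects promotable=true as a
# meta attribute on the <clone> element. cs_clone v6.0.1 does not expose
# it, so set it directly with pcs after the clone has been created.
# The binary paths and the 'unless' check are assumptions.
exec { 'drbd-clone-promotable':
  command => '/usr/sbin/pcs resource meta DrbdVolume-clone promotable=true',
  unless  => '/usr/sbin/pcs resource show DrbdVolume-clone | /bin/grep -q promotable=true',
  require => Cs_clone['DrbdVolume-clone'],
}
```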

What are you seeing

pcs status shows my two nodes started (but neither master nor slave):

Full list of resources:

 Clone Set: DrbdVolume-clone [DrbdVolume]
     Started: [ carto-blade-0 carto-blade-1 ]

drbd seems happy:

r0_nfs role:Primary
  disk:UpToDate
  peer role:Secondary
    replication:Established peer-disk:UpToDate

But I get these errors in the log:

pacemaker-controld[1020]:  notice: State transition S_IDLE -> S_POLICY_ENGINE 
pacemaker-schedulerd[1019]:  notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]:  error: Couldn't expand DrbdVolume-clone_promote_0 to DrbdVolume-clone_confirmed-post_notify_promoted_0 in DrbdVolume-clone
pacemaker-schedulerd[1019]:  error: Couldn't expand DrbdVolume-clone_promote_0 to DrbdVolume-clone_confirmed-post_notify_promoted_0 in DrbdVolume-clone
...
...
pacemaker-schedulerd[1019]:  notice: Calculated transition 82, saving inputs in /var/lib/pacemaker/pengine/pe-input-120.bz2
pacemaker-controld[1020]:  warning: Transition 82 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-120.bz2): Terminated
pacemaker-controld[1020]:  warning: Transition failed: terminated

What behaviour did you expect instead

If I do the same in bash (without Puppet):

sudo pcs cluster --force setup nfs_cluster carto-blade-0 carto-blade-1
sudo pcs cluster start --all
sudo pcs property set stonith-enabled=false
sudo pcs property set no-quorum-policy=ignore
sudo pcs resource defaults resource-stickiness=100
sudo pcs resource create DrbdVolume ocf:linbit:drbd \
     drbd_resource=all op monitor interval=10s role="Slave" op monitor interval=09s role="Master"
sudo pcs resource promotable DrbdVolume master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

It works as intended:

Full list of resources:

 Clone Set: DrbdVolume-clone [DrbdVolume] (promotable)
     Masters: [ carto-blade-1 ]
     Slaves: [ carto-blade-0 ]

Log output:

pacemaker-controld[1020]:  notice: State transition S_IDLE -> S_POLICY_ENGINE 
pacemaker-schedulerd[1019]:  notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]:  notice:  * Start      DrbdVolume:0   (                   carto-blade-0 )  
pacemaker-schedulerd[1019]:  notice:  * Start      DrbdVolume:1   (                   carto-blade-1 )  
pacemaker-schedulerd[1019]:  notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-140.bz2
pacemaker-controld[1020]:  notice: Initiating monitor operation DrbdVolume:0_monitor_0 locally on carto-blade-0 
pacemaker-controld[1020]:  notice: Initiating monitor operation DrbdVolume:1_monitor_0 on carto-blade-1 
pacemaker-controld[1020]:  notice: Result of probe operation for DrbdVolume on carto-blade-0: 7 (not running) 
pacemaker-controld[1020]:  notice: Initiating start operation DrbdVolume:0_start_0 locally on carto-blade-0 
pacemaker-controld[1020]:  notice: Initiating start operation DrbdVolume:1_start_0 on carto-blade-1 
kernel: [98215.624649] drbd r0_nfs: Starting worker thread (from drbdsetup-84 [40273])
kernel: [98215.648877] block drbd0: disk( Diskless -> Attaching ) 
kernel: [98215.649123] drbd r0_nfs: Method to ensure write ordering: flush
kernel: [98215.649130] block drbd0: max BIO size = 1048576
kernel: [98215.649140] block drbd0: drbd_bm_resize called with capacity == 16776632
kernel: [98215.649217] block drbd0: resync bitmap: bits=2097079 words=32767 pages=64
kernel: [98215.649223] block drbd0: size = 8192 MB (8388316 KB)
kernel: [98215.650349] block drbd0: recounting of set bits took additional 0 jiffies
kernel: [98215.650355] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [98215.650365] block drbd0: disk( Attaching -> UpToDate ) 
kernel: [98215.650375] block drbd0: attached to UUIDs E34C19A8901F523B:BA3A295CAB28B4BB:AEB4E0E13837A321:AEB3E0E13837A321
kernel: [98215.666086] drbd r0_nfs: conn( StandAlone -> Unconnected ) 
kernel: [98215.666116] drbd r0_nfs: Starting receiver thread (from drbd_w_r0_n [40276])
kernel: [98215.666157] drbd r0_nfs: receiver (re)started
kernel: [98215.666167] drbd r0_nfs: conn( Unconnected -> WFConnection ) 
pacemaker-controld[1020]:  notice: Transition 112 aborted by status-1-master-DrbdVolume doing create master-DrbdVolume=1000: Transient attribute change 
pacemaker-controld[1020]:  notice: Result of start operation for DrbdVolume on carto-blade-0: 0 (ok) 
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume:0_post_notify_start_0 locally on carto-blade-0 
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume:1_post_notify_start_0 on carto-blade-1 
pacemaker-controld[1020]:  notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok) 
pacemaker-controld[1020]:  notice: Transition 112 (Complete=12, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-140.bz2): Stopped
pacemaker-schedulerd[1019]:  notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]:  notice:  * Promote    DrbdVolume:0   (   Slave -> Master carto-blade-1 )  
pacemaker-schedulerd[1019]:  notice: Calculated transition 113, saving inputs in /var/lib/pacemaker/pengine/pe-input-141.bz2
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume_pre_notify_promote_0 on carto-blade-1 
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume_pre_notify_promote_0 locally on carto-blade-0 
pacemaker-controld[1020]:  notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok) 
pacemaker-controld[1020]:  notice: Initiating promote operation DrbdVolume_promote_0 on carto-blade-1 
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume_post_notify_promote_0 on carto-blade-1 
pacemaker-controld[1020]:  notice: Initiating notify operation DrbdVolume_post_notify_promote_0 locally on carto-blade-0 
pacemaker-controld[1020]:  notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok) 
rotulet commented Mar 27, 2019

Nearly the same thing happens with these versions:

  • pcs: 0.9.155
  • pacemaker: 1.1.16-1

pcs status gives:

Full list of resources:

 Clone Set: DrbdVolume-clone [DrbdVolume]
     Masters: [ carto-blade-0 ]
     Started: [ carto-blade-1 ]

The slave node is not detected as a Slave in the 'Clone Set', and if I shut down the master, the slave is not promoted.

Here is the log if I stop the master:

crmd[30804]:   notice: State transition S_IDLE -> S_POLICY_ENGINE 
pengine[30803]:   notice: On loss of CCM Quorum: Ignore
pengine[30803]:    error: Resource start-up disabled since no STONITH resources have been defined
pengine[30803]:    error: Either configure some or disable STONITH with the stonith-enabled option
pengine[30803]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity
pengine[30803]:   notice: Scheduling Node carto-blade-0 for shutdown
pengine[30803]:    error: Couldn't expand DrbdVolume-clone_promote_0
pengine[30803]:   notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-95.bz2
pengine[30803]:   notice: Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
crmd[30804]:   notice: Transition 4 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-95.bz2): Complete
crmd[30804]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE 
crmd[30804]:   notice: do_shutdown of peer carto-blade-0 is complete 
attrd[30802]:   notice: Node carto-blade-0 state is now lost 
attrd[30802]:   notice: Removing all carto-blade-0 attributes for peer loss
attrd[30802]:   notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
stonith-ng[30800]:   notice: Node carto-blade-0 state is now lost 
stonith-ng[30800]:   notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
cib[30799]:   notice: Node carto-blade-0 state is now lost 
cib[30799]:   notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
192.168.5.59: Stopping Cluster (pacemaker)...
corosync[30783]:   [TOTEM ] A new membership (2:72) was formed. Members left: 1
corosync[30783]:   [CPG   ] downlist left_list: 1 received
corosync[30783]:   [QUORUM] Members[1]: 2
corosync[30783]:   [MAIN  ] Completed service synchronization, ready to provide service.
crmd[30804]:   notice: Node carto-blade-0 state is now lost 
pacemakerd[30792]:   notice: Node carto-blade-0 state is now lost 
crmd[30804]:   notice: do_shutdown of peer carto-blade-0 is complete 
192.168.5.59: Stopping Cluster (corosync)...
corosync[30783]:   [KNET  ] link: host: 1 link: 0 is down
corosync[30783]:   [KNET  ] host: host: 1 has no active links

@juniorsysadmin juniorsysadmin added the bug Something isn't working label Mar 31, 2019