
VCS does not FAULT LVM2 resources when LUN is removed

Created: 20 Dec 2012 | 1 comment

 

hello:
 
we have a similar problem with VCS.
 
we created a LUN on a Clariion and added it to the Storage Groups of two Hosts. each Host has its own Storage Group, so it can see only the LUNs in that group. each Storage Group contains the Host's Boot LUN plus the shared data LUN, so each Host sees its own Boot LUN and the common data LUN.
 
each Host is connected to the SAN by two FC links, one per FC port. RHEL Multipathing is used to manage the two paths to the same LUN on each Host.
 
------
      |---------------
 Host |
      |--------------- \                       --------
------                  ......... ------------|\
                               FC switch      |  - LUN    Clariion
------                  ......... ------------|/
      |--------------- /                       --------
 Host |
      |---------------
------               
 
 
once the LUN was visible on the Host, an LVM2 Volume Group (VG) was created on it, and then the VG was used to create an LVM2 Logical Volume (LV). an ext4 file system was created on the LV and this file system was mounted on the Host.
        mount -> file system on -> LV created in -> VG created on -> LUN
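
a minimal sketch of how that stack could be built, bottom-up. the device path, VG/LV names, size, and mount point are made-up placeholders, not the values from our setup, and the explicit pvcreate step is an assumption (vgcreate can initialize the physical volume itself). the function only assembles the commands; it does not run them.

```python
import shlex


def build_stack_commands(lun_device, vg_name, lv_name, lv_size, mount_point):
    """Return the commands for LUN -> VG -> LV -> ext4 -> mount, in order."""
    lv_path = f"/dev/{vg_name}/{lv_name}"
    return [
        ["pvcreate", lun_device],                             # mark the multipath device as an LVM2 PV
        ["vgcreate", vg_name, lun_device],                    # VG created on the LUN
        ["lvcreate", "-L", lv_size, "-n", lv_name, vg_name],  # LV created in the VG
        ["mkfs.ext4", lv_path],                               # ext4 file system on the LV
        ["mount", "-t", "ext4", lv_path, mount_point],        # mount the file system
    ]


# Hypothetical names, for illustration only.
for cmd in build_stack_commands("/dev/mapper/mpatha", "datavg", "datalv", "10G", "/data"):
    print(shlex.join(cmd))
```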
 
a VCS Service Group (SG) was created with three resources: a Mount resource, an LV resource, and a VG resource.
  • the Mount resource mounts the file system on the LVM2 Logical Volume (LV).
  • the LV resource monitors the LVM2 LV.
  • the VG resource monitors the LVM2 VG.
  • the dependencies in the VCS SG are: Mount -> LV -> VG.
  • the SG is a failover SG, so that it is online only on 1 Host at a time.
  • everything works fine, as designed.
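
the dependency chain above can be modelled as a small graph to show the order in which the resources are expected to come online and go offline. this is only an illustration of the Mount -> LV -> VG ordering, not VCS code; the resource names are just labels.

```python
from graphlib import TopologicalSorter

# parent resource -> the child resources it requires (Mount -> LV -> VG)
REQUIRES = {"Mount": {"LV"}, "LV": {"VG"}, "VG": set()}


def online_order(requires):
    """Children first: a resource comes online only after what it requires."""
    return list(TopologicalSorter(requires).static_order())


def offline_order(requires):
    """Parents first: the reverse of the online order."""
    return list(reversed(online_order(requires)))


print(online_order(REQUIRES))   # ['VG', 'LV', 'Mount']
print(offline_order(REQUIRES))  # ['Mount', 'LV', 'VG']
```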
 
when the SG is online on 1 Host, the ext4 file system is mounted, the LV is online/available, and the VG is imported on that Host.
 
when the SG is offline on a Host, the ext4 fs is not mounted, the LV is offline/unavailable, and the VG is exported.
 
everything works fine until the LUN is deleted on the Clariion and the FC interfaces are not rescanned on the Host.
 
alternatively, you can remove the LUN from the Hosts' Storage Groups so that it is no longer accessible to the Hosts.
 
when the LUN is quietly removed, on the online Host:
  • the Mount is still available.
  • small files can still be read.
  • attempts to write to the FS on the LV cause errors.
  • errors even show up in the VCS Engine Log.
  • however, the VCS Service Group does not fail over to the other Host, nor is the SG marked as Faulted.
we can reproduce the problem every time.
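
since reads of small files still succeed while writes fail, a check that actually pushes data to the device is one way to see the problem from outside VCS. the sketch below is a hypothetical standalone probe, not how the VCS agents monitor; it assumes the failed LUN surfaces as an OSError on open, write, or fsync, which may depend on the ext4 error-handling mount options.

```python
import errno
import os
import uuid


def write_probe(mount_point):
    """Write and fsync a small file under mount_point; return (ok, detail)."""
    path = os.path.join(mount_point, f".probe-{uuid.uuid4().hex}")
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except OSError as exc:
        return False, f"open failed: {errno.errorcode.get(exc.errno, str(exc.errno))}"
    try:
        os.write(fd, b"probe\n")
        os.fsync(fd)  # force the data out to the device instead of the page cache
    except OSError as exc:
        return False, f"write failed: {errno.errorcode.get(exc.errno, str(exc.errno))}"
    finally:
        os.close(fd)
    try:
        os.unlink(path)  # clean up the probe file; ignore failures here
    except OSError:
        pass
    return True, "ok"
```

calling write_probe("/data") (with /data standing in for the real mount point) on the online Host would be expected to return ok while the LUN is present and an error tuple once it is gone.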
 
what is going on here? shouldn't VCS detect that there is a Fault with the resources and mark the SG as Faulted?
 
thank you in advance.
 
have a wonderful Xmas holiday,
 
 
 
Aaron
 


am-aaron:

we have checked this several times but have not found a solution yet.

  • the LUN is no longer accessible from the Server.
  • no FC rescan was done and no LIP Reset was issued to get current LUN information.
  • vgdisplay gives errors.
  • lvdisplay gives errors.
  • VCS resources for LVM VG and LV do not give any errors and are still online.
  • at one point, the VG resource went to ONLINE|MONITOR TIMEOUT.
  • however, the VCS SG is still ONLINE and does not get FAULTED.
  • none of the information changed after an FC rescan and LIP Reset.
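
the mismatch in the list above (LVM tools report errors, VCS still reports ONLINE) can be captured with a small comparison script. this is a rough sketch under assumptions: it treats a non-zero exit status or any stderr output from vgdisplay as "LVM sees a problem", and it assumes hares -state <resource> -sys <system> prints just the state string. the VG, resource, and system names in the usage note are placeholders.

```python
import subprocess


def compare_states(vg_name, vcs_resource, system, run=subprocess.run):
    """Compare what LVM reports for a VG with what VCS reports for its resource."""
    lvm = run(["vgdisplay", vg_name], capture_output=True, text=True)
    # Assumption: errors show up as a non-zero exit status or as stderr output.
    lvm_ok = lvm.returncode == 0 and not lvm.stderr.strip()

    vcs = run(["hares", "-state", vcs_resource, "-sys", system],
              capture_output=True, text=True)
    vcs_state = vcs.stdout.strip()

    return {
        "lvm_ok": lvm_ok,
        "vcs_state": vcs_state,
        # The situation described above: LVM is unhappy, VCS still says ONLINE.
        "mismatch": (not lvm_ok) and vcs_state.startswith("ONLINE"),
    }
```

for example, compare_states("datavg", "vg_res", "host1") run on the online Host; the run parameter exists only so the function can be exercised with a fake runner on a machine without LVM or VCS.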

strange.

 

 

NoLunVgLvErrorsVcsSgResNoErrs.jpg