
Sfcache (SmartIO) volume Kstate DISABLED after SG switchover to alternate node

Created: 09 Jan 2014 • Updated: 15 Jan 2014 | 12 comments
This issue has been solved. See solution.

Hello

We are currently testing sfcache (SmartIO) on a two-node test cluster without CFS. After doing an SG switchover from one node to the other, the cache is no longer used and the volume appears with State: ENABLED and Kstate: DISABLED.

There is no way to reactivate it short of destroying the (cached) data DG.
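
For reference, this is roughly how we are checking the state (a minimal sketch; "datadg" is a placeholder for our actual data DG):

    # list the SmartIO cache areas configured on this node and their state
    sfcache list

    # show the KSTATE/STATE of the volumes in the data DG ("datadg" is a placeholder)
    vxprint -g datadg -ht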

Any hints?

Setup: RHEL 6.4, SFHA Enterprise 6.1

 

Brem  


Comments (12)

TonyGriffiths:

Hi,

Could you post extracts of the VCS main.cf file that show the resources you are failing over (DG, etc.)?

Also, could you summarise which devices you are using for the SmartIO cache and which nodes they are located on?

 

thanks

tony

Brem Belguebli:

Hello 

The service group is made up of a DG with a volume, on top of which resides a VxFS file system, plus an IP address.

The whole service group is switched over to the alternate node.
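
Roughly, the relevant part of the main.cf looks like this (a sketch only; resource names, device paths, mount point and address are placeholders, not our real values):

    group test_sg (
        SystemList = { node1 = 0, node2 = 1 }
        )

        DiskGroup data_dg (
            DiskGroup = datadg
            )

        Volume data_vol (
            DiskGroup = datadg
            Volume = datavol
            )

        Mount data_mnt (
            MountPoint = "/data"
            BlockDevice = "/dev/vx/dsk/datadg/datavol"
            FSType = vxfs
            FsckOpt = "%-y"
            )

        IP app_ip (
            Device = eth0
            Address = "192.168.10.10"
            NetMask = "255.255.255.0"
            )

        data_vol requires data_dg
        data_mnt requires data_vol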

For now we are testing the cache on regular disk storage, each node having a dedicated local cache area.

This is noted as working in the SmartIO documentation.

Brem  

 

 

TonyGriffiths:

Hi Brem,

So if I understand correctly, you are currently testing the SmartIO feature using regular SAN disk storage (non-shared).

The SmartIO cache is intended to be used on fast host-based flash devices that provide a high-speed cache. Using disk storage will not provide any real gain and may even be slower.
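
As an aside, you can get a feel for how much benefit a cache area is actually giving with the SmartIO statistics command (a minimal sketch; run it on the node that owns the cache area):

    # show cache statistics (hits, misses, usage) for the configured cache areas
    sfcache stat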

Is the testing to better understand how SmartIO works, or do you intend to move this to a live/production state?

thanks

tony

Brem Belguebli:

Hi Tony,

You guessed right. We actually plan to deploy new clusters with local PCIe flash for caching purposes. Those machines are not yet deployed, and we wanted to get a head start on the integration on a test cluster (deployment of SFHA 6.1, how to configure the cache, observing the behaviour, etc.).

However, it would also have made sense if we were using different tiers of storage (high-end SAN for the cache, and cheaper/slower, higher-capacity storage for the data).

So yes it is for testing purposes currently.

Brem  

 

TonyGriffiths:

Thanks Brem, understood.

As for the failover aspect, is it the failover of the data disk group that you are having problems with?

cheers

tony

Brem Belguebli:

No, failover of the SG (including DG, volume and FS) as well as failback work fine.

The only thing is that the cache area is no longer used (after failover and failback).

 Rgrds

Brem 

Clifford Barcliff:

Hi Tony.

Are you working with read caching or write-back caching? You mention that you are NOT using CFS, but I wanted to make sure that you are testing read caching per Chapter 2 of the SmartIO for Solid State Solutions Guide for Linux.

Clifford Barcliff

 

My customers spend the weekend with family, not in the datacenter.

Brem Belguebli:

Hello Clifford,

 

Actually it's Brem, not Tony, who is working with you.

We plan to use it for read caching only, as we need to maintain write consistency across sites.

 

@Tony, I think I have figured out what my problem is.

My data disks are replicated (HTC agent), and when the SG fails over to the remote site, the udid_mismatch and clone_disk flags are not cleared (we actually have a preonline trigger script for this, but it's not working on this new 6.1 cluster).

Manually clearing the flags and then disabling/enabling the cache for the volumes reactivates the cache.
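
For the record, the manual workaround is roughly the following (a sketch only; "disk_0" and "datadg/datavol" are placeholders, and the exact sequence for refreshing the UDID depends on the DG state, see the VxVM admin guide):

    # check which disks carry the udid_mismatch / clone_disk flags after failover
    vxdisk -o alldgs list

    # clear the clone_disk flag and refresh the stored UDID on each affected disk
    vxdisk set disk_0 clone=off
    vxdisk updateudid disk_0

    # then toggle SmartIO caching on the data volume to reactivate the cache
    sfcache disable datadg/datavol
    sfcache enable datadg/datavol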

Brem   

 

 

TonyGriffiths:

Hi Brem,

Looks like you have a handle on the issue. As you mentioned, the data disk group can fail over in VCS like a traditional disk group.

The SmartIO cache device is local to a node and cannot fail over/migrate to another node. If you fail over the data disk group, the SmartIO cache device will be left on the original node.

cheers

tony

Brem Belguebli:

Hi Tony,

It works now, thanks to Symantec support (we opened a case in parallel).

We added the new (6.1) attribute ClearClone=1 to the DG definition in the main.cf, and the udid_mismatch and clone_disk flags are now cleared automatically, which makes the cache active when failover occurs.
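
For reference, the DG resource definition now looks something like this (a sketch; resource and disk group names are placeholders):

    DiskGroup data_dg (
        DiskGroup = datadg
        ClearClone = 1
        )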

We do not expect to switch over the cache devices, as in the target setup they will be local to each node (Fusion-io PCIe cards).

Regards

Brem

 

 

SOLUTION
RyanJancaitis:

Brem,

Glad to hear this got resolved.

For your local Fusion-io devices, would it be useful to be able to migrate the cache over during a failover?

That way, when your app moves from node A to node B, the cache is pre-warmed.

TonyGriffiths:

Hi Brem,

Good to hear that

cheers

tony