Using TIBCO Enterprise Message Service with Storage Foundation Cluster File System: Increasing Availability and Performance
1. Overview
TIBCO Enterprise Message Service (EMS) is a leader in enterprise messaging platforms. EMS allows for efficient system-to-system communications across different technologies. The communication may occur synchronously or asynchronously. Although use cases vary widely, EMS is heavily used in billing, financial analysis and alerting, and ERP-type workloads. Common applications include trading software, shipping management, and internal supply-chain systems.
The TIBCO customer base has a strong overlap with Symantec. Many EMS customers also use Veritas Storage Foundation products. EMS has tremendous penetration into the financial sector, airlines, telecommunications, technology, and shipping. Many of the published EMS customers are currently customers of Veritas availability products.
While NFS can be used in conjunction with TIBCO’s native fault-tolerance mechanisms, this configuration is discouraged. Customers have discovered that this solution is sub-optimal for meeting their needs for application availability, throughput, and administration. As a result, we have seen strong engagement from the field and corporate teams at TIBCO to utilize Veritas Storage Foundation Cluster File System (CFS) as the embedded solution for TIBCO.
Symantec offers a proven end-to-end, integrated solution for ensuring high availability for a wide variety of environments. By leveraging the industry-leading Veritas Storage Foundation Cluster File System, Veritas Cluster Server, and the fault-tolerant support in TIBCO Enterprise Message Service, the products can be combined to provide a high-performance, highly available messaging solution.
Service level requirements for a typical TIBCO EMS environment are variable, but where performance and reliability are critical it is not uncommon to see requirements for 50,000 messages per second and 99.999% uptime. As customers have come to rely more and more on messaging services, the business drivers have come to expect these listed SLAs or better. Customers seeking to obtain the highest level of availability and throughput can leverage the benefits of Symantec Storage Foundation and the Veritas Cluster File System. The combined solution can eliminate the potential for data loss and guarantee that data is retained after messages are transmitted.
To provide the highest level of reliability and performance, shared storage must meet the fault-tolerance criteria required by TIBCO EMS (write order fidelity, synchronous write persistence, distributed locking, and so on).
When NAS hardware uses NFS as its file system, it is particularly difficult to determine whether the solution meets the criteria required for fault tolerant shared storage. Our research indicates the following conclusions:
- NFS v2 definitely does not satisfy the criteria.
- NFS v3 with UDP definitely does not satisfy the criteria.
- NFS v3 with TCP might satisfy the criteria. Consult with the NAS vendor to verify that the NFS server (in the NAS) satisfies the criteria. Consult with the operating system vendor to verify that the NFS client (in the OS on the server host computer) satisfies the criteria. When both vendors certify that their components cooperate to guarantee the criteria, then the shared storage solution supports EMS.
In contrast, Symantec’s Storage Foundation Cluster File System and Veritas Cluster Server provide a shared storage solution with the following benefits:
- Full compliance with the TIBCO shared storage criteria for fault tolerance
- Persistent shared storage for:
- Message data (for queues and topics)
- Client connections to the primary server
- Metadata about message delivery
- Reduced storage costs by avoiding over-provisioning
- Simplified administration by providing a single namespace across all nodes
- Proper lock management and data integrity if a node in the cluster goes down
- Ability to restore redundancy in the event of node failure due to a hardware or software fault
- Monitoring of node health and proactive failure detection
2. Fault Tolerant TIBCO EMS
This section reviews the default TIBCO EMS model for fault tolerance. It relies on the TIBCO EMS User Guide, version 4.4, and the TIBCO EMS Installation Guide, version 4.4, from November 2006; many passages from these guides are cited directly below.
TIBCO Enterprise Message Service servers may be configured for fault-tolerant operation by configuring a pair of servers—one primary and one backup. The primary server accepts client connections, and interacts with clients to deliver messages. If the primary server fails, the backup server resumes operation in its place.
Note: TIBCO can load balance two servers in a fault-tolerant configuration for additional reliability. Multiple primary/secondary pairs can be utilized to increase the overall resilience. With Storage Foundation Cluster File System and Veritas Cluster Server, it is possible to deploy additional failover servers into a cluster to further improve availability.
Shared State
A pair of fault-tolerant servers must have access to shared state, which consists of information about client connections and persistent messages. This information enables the backup server to properly assume responsibility for those connections and messages.
Locking
To prevent the backup server from assuming the role of the primary server, the primary server locks the shared state during normal operation. If the primary server fails, the lock is released, and the backup server can obtain the lock.
Note: With Veritas Storage Foundation Cluster File System from Symantec, locking is handled by the distributed lock manager, which avoids bottlenecks and allows concurrency and more rapid failover without requiring a hand-off.
Configuration Files
When a primary server fails, its backup server assumes the status of the primary server and resumes operation. Before becoming the new primary server, the backup server re-reads all of its configuration files. If the two servers share configuration files, then administrative changes to the old primary carry over to the new primary.
Failover
This section presents details of the failover sequence.
Detection
A backup server detects a failure of the primary in either of two ways:
- Heartbeat Failure: The primary server sends heartbeat messages to the backup server to indicate that it is still operating. When a network failure stops the servers from communicating with each other, the backup server detects the interruption in the steady stream of heartbeats. For details, see Heartbeat Parameters.
- Connection Failure: The backup server can detect the failure of its TCP connection with the primary server. When the primary process terminates unexpectedly, the backup server detects the broken connection.
Response
When a backup server (B) detects the failure of the primary server (A), then B attempts to assume the role of primary server. First, B obtains the lock on the current shared state. When B can access this information, it becomes the new primary server.
Figure 2 Failed primary server
Role Reversal
When B becomes the new primary server, A can restart as a backup server, so that the two servers exchange roles.
Figure 3 Recovered server becomes backup
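The lock, response, and role-reversal steps above can be pictured with a small in-memory simulation. This is only an illustrative sketch: the `SharedStateLock` and `Server` classes are hypothetical names, the lock is a plain Python object rather than a real file lock, and none of it reflects how tibemsd is implemented internally.

```python
from dataclasses import dataclass
from typing import Optional


class SharedStateLock:
    """Stand-in for the lock on the shared state; holds at most one owner."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_acquire(self, server: str) -> bool:
        # Succeeds only when no other server currently holds the lock.
        if self.owner is None:
            self.owner = server
            return True
        return self.owner == server

    def release(self, server: str) -> None:
        # Models the lock being dropped when its holder fails or stops.
        if self.owner == server:
            self.owner = None


@dataclass
class Server:
    name: str
    role: str = "backup"

    def try_become_primary(self, lock: SharedStateLock) -> bool:
        # A server is promoted only after it obtains the shared-state lock.
        if lock.try_acquire(self.name):
            self.role = "primary"
            return True
        return False


def simulate_failover() -> list:
    lock = SharedStateLock()
    a, b = Server("A"), Server("B")
    events = []

    a.try_become_primary(lock)             # A starts first and takes the lock
    events.append((a.role, b.role))

    promoted = b.try_become_primary(lock)  # B cannot take over while A holds it
    events.append(("B promoted early", promoted))

    lock.release("A")                      # A fails; its lock is released
    a.role = "failed"
    b.try_become_primary(lock)             # B obtains the lock and takes over
    events.append((a.role, b.role))

    a.role = "backup"                      # A restarts as the backup
    events.append((a.role, b.role))
    return events
```

Calling `simulate_failover()` walks through the same sequence as Figures 2 and 3: B stays a backup while A holds the lock, becomes primary once the lock is free, and A rejoins as the backup.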
Client Transfer
Clients of A that are configured to fail over to backup server B automatically transfer to B when it becomes the new primary server. B reads the client’s current state from the shared storage to deliver any persistent messages to the client.
Shared State
The primary server and backup server must share the same state. Server state includes three categories of information:
- persistent message data (for queues and topics)
- client connections of the primary server
- metadata about message delivery
During a failover, the backup server re-reads all shared state information.
Implementing Shared State
We recommend that you implement shared state using shared storage devices. The shared state must be accessible to both the primary and backup servers.
TIBCO’s Support Criteria
Several options are available for implementing shared storage using a combination of hardware and software. EMS requires that your storage solution guarantee all four criteria: Write Order, Synchronous Write Persistence, Distributed File Locking, and Unique Write Ownership.
TIBCO’s Hardware Options
Consider these examples of commonly-sold hardware options for shared storage:
- Dual-Port SCSI device
- Storage Area Network (SAN)
- Network Attached Storage (NAS)
SCSI and SAN
Dual-port SCSI and SAN solutions generally satisfy the Write Order and Synchronous Write Persistence criteria. (The clustering software must satisfy the remaining two criteria.) As always, you must confirm all four requirements with your vendors.
NAS
NAS solutions require a CS (rather than a CFS) to satisfy the Distributed File Locking criterion (see below).
Some NAS solutions satisfy the criteria, and some do not; you must confirm all four requirements with your vendors.
NAS with NFS
When NAS hardware uses NFS as its file system, it is particularly difficult to determine whether the solution meets the criteria. Our research indicates the following conclusions:
- NFS v2 definitely does not satisfy the criteria.
- NFS v3 with UDP definitely does not satisfy the criteria.
- NFS v3 with TCP might satisfy the criteria. Consult with the NAS vendor to verify that the NFS server (in the NAS) satisfies the criteria. Consult with the operating system vendor to verify that the NFS client (in the OS on the server host computer) satisfies the criteria. When both vendors certify that their components cooperate to guarantee the criteria, then the shared storage solution supports EMS.
For more information on how the EMS locks shared store files, see How EMS Manages Access to Shared Store Files.
Software Options
Consider these examples of commonly-sold software options:
- Cluster Server (CS): A cluster server monitors the EMS server processes and their host computers, and ensures that exactly one server process is running at all times. If the primary server fails, the CS restarts it; if it fails to restart the primary, it starts the backup server instead.
- Clustered File System (CFS): A clustered file system lets the two EMS server processes run simultaneously. It even lets both servers mount the shared file system simultaneously. However, the CFS assigns the lock to only one server process at a time. The CFS also manages operating system caching of file data, so the backup server has an up-to-date view of the file system (instead of a stale cache).
With dual-port SCSI or SAN hardware, either a CS or a CFS might satisfy the Distributed File Locking criterion. With NAS hardware, only a CS can satisfy this criterion (CFS software generally does not). Of course, you must confirm all four requirements with your vendors. Note: This criterion is often the deciding factor between a CFS-based solution and a NAS solution such as Celerra or NetApp.
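The hardware, software, and NFS guidance in this section can be condensed into two small lookup functions. The sketch below only restates the conclusions given in the text; the function names and return strings are hypothetical, and a "might satisfy" result still requires vendor confirmation of all criteria.

```python
def distributed_locking_candidates(hardware: str) -> set:
    """Software options that might satisfy Distributed File Locking."""
    hardware = hardware.lower()
    if hardware in ("scsi", "san"):
        # With dual-port SCSI or SAN, either option might satisfy it.
        return {"CS", "CFS"}
    if hardware == "nas":
        # With NAS, only a cluster server can satisfy it.
        return {"CS"}
    raise ValueError(f"unknown hardware option: {hardware}")


def nfs_verdict(version: int, transport: str) -> str:
    """Restate the NFS conclusions for a given version and transport."""
    transport = transport.lower()
    if version == 2:
        return "does not satisfy"
    if version == 3 and transport == "udp":
        return "does not satisfy"
    if version == 3 and transport == "tcp":
        return "might satisfy; confirm with NAS and OS vendors"
    return "not covered by this guidance"
```

For example, `distributed_locking_candidates("san")` returns both options, while `nfs_verdict(3, "udp")` reports that the combination does not satisfy the criteria.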
Messages Stored in Shared State
Messages with PERSISTENT delivery mode are stored, and are available in the event of primary server failure. Messages with NON_PERSISTENT delivery mode are not available if the primary server fails. For more information about recovery of messages during failover, see Message Redelivery.
Storage Files
The tibemsd server creates three files to store shared state.
3. Using CFS and VCS to create a fault tolerant TIBCO setup
The key to minimal queue downtime in a TIBCO EMS environment is to have the data store available on the standby node as soon as possible. This allows the standby node to recover the messages more quickly. This can be done through either network or shared storage, but network attached storage options such as NFS are not always viable since they do not always conform to the data integrity demands of TIBCO EMS.
- NFSv2 and NFSv3 over UDP are not options.
- NFSv3 over TCP might be an option, but this is not guaranteed.
- CFS fully complies with the data integrity demands of TIBCO EMS.
Using a clustered file system to share the data store between the EMS servers allows the standby server to begin a takeover/recovery operation as soon as it detects that the primary server is offline.
The recovery process will start as soon as the standby server detects that the primary is not responding to heartbeats (and it can acquire the file locks). The heartbeat interval is configurable, but by default TIBCO sends a heartbeat every three seconds, and the standby will initiate the recovery process when two heartbeats have gone unacknowledged.
With default parameters for Storage Foundation Cluster File System and TIBCO EMS, the recovery operation is initiated in less than 10 seconds. Compared with traditional active/passive environments, where failover time is measured in minutes, the value of Storage Foundation Cluster File System is clear.
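The heartbeat arithmetic can be made concrete with a short sketch. The `HeartbeatMonitor` class below is a hypothetical illustration, not TIBCO's implementation; it takes the 3-second interval and the two-missed-heartbeats threshold from the text and shows that the nominal detection time, 2 × 3 s = 6 s, fits within the stated 10-second window.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    interval_s: float = 3.0   # heartbeat period described in the text
    missed_allowed: int = 2   # missed heartbeats before recovery starts
    last_seen_s: float = 0.0  # time of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        self.last_seen_s = now_s

    def missed(self, now_s: float) -> int:
        # Number of whole heartbeat intervals elapsed since the last one.
        return int((now_s - self.last_seen_s) // self.interval_s)

    def primary_suspected(self, now_s: float) -> bool:
        return self.missed(now_s) >= self.missed_allowed

    def nominal_detection_s(self) -> float:
        # Time from the last heartbeat until the threshold is reached.
        return self.interval_s * self.missed_allowed
```

With the defaults, a monitor that last saw a heartbeat at t = 100 s begins to suspect the primary at t = 106 s. Acquiring the file locks and replaying the store come on top of this figure.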
If the primary server failure is recoverable (i.e., a software error or a transient error), VCS will automatically restart the EMS server to restore a fault-tolerant state as soon as possible.
4. Configuring TIBCO EMS with Veritas Storage Foundation CFS for fault tolerance
For simplicity, a two-node configuration is shown with one queue, but this can be expanded to multiple servers and queues by adding resources.
Figure 4 CFS fault tolerant configuration
Creating the Cluster File Systems
To support the TIBCO EMS installation two cluster file systems will be used:
- /opt/tibco will be used to host the TIBCO EMS binaries
- /var/tibco will be used to host all data stores and configuration files
Since the availability of the configuration depends on these file systems, it is important that care is taken to ensure that they are mirrored, preferably between different storage arrays, to ensure maximum availability.
Installing TIBCO EMS
The TIBCO EMS binaries can be stored on local disk on every node or on a shared cluster file system. Each method has its advantages. Using local storage allows the administrator to easily upgrade one node at a time, but increases the administrative burden by demanding that multiple binary trees are kept in sync.
Using a clustered file system to store the binaries allows the administrator to have a single copy of the binaries and simplifies day-to-day administration at the expense of a slightly more complex process for upgrading TIBCO.
Installing on local storage
To install TIBCO EMS on local storage, no additional steps need to be taken; the administrator should follow the installation and user’s guides from TIBCO.
Installing TIBCO EMS on a shared file system
To install a single, shared TIBCO EMS binary tree the administrator must first create a cluster file system and ensure that it is mounted in the correct place (default: /opt/tibco).
The administrator should also manually specify where the TIBCO installation procedure stores the TIBCO installation properties and installation history files. We suggest placing these on the shared file system together with the binaries.
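Before running the installer, it can help to confirm on each node that the cluster file systems are actually mounted where expected. The following is a minimal sketch: the `missing_mounts` helper is a hypothetical name, and the paths are the defaults named in this document. It relies only on Python's standard `os.path.ismount`.

```python
import os
from typing import Callable, Iterable, List

# Mount points used in this document's example layout.
REQUIRED_MOUNTS = ("/opt/tibco", "/var/tibco")


def missing_mounts(
    paths: Iterable[str] = REQUIRED_MOUNTS,
    is_mount: Callable[[str], bool] = os.path.ismount,
) -> List[str]:
    """Return the paths that are not currently mount points on this node."""
    return [p for p in paths if not is_mount(p)]
```

An empty result from `missing_mounts()` on every node suggests that the shared file systems are in place. The `is_mount` parameter exists only so the check can be exercised without real mounts.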
Configuring TIBCO EMS for fault tolerance
As described in Chapter 15 of the TIBCO EMS User’s Guide, the various “ft_*” parameters in tibemsd.conf dictate how fault tolerance is set up.
To enable fault tolerance, at a minimum the server, store, and ft_active parameters need to be set.
The server parameter is set to an arbitrary name, which becomes the name of this TIBCO EMS server, and should be identical in the configuration files for both the primary server and the standby server. The shared data store is designated by the store parameter and needs to point to a shared/clustered file system.
The ft_active parameter should be set to point to the “other” server in a primary/standby pair: on the primary it points to the standby, and on the standby it points to the primary. The servers will then communicate with each other on startup and agree on which is the current primary (the first node that is started) and which is the current standby server.
In our example configuration the relevant sections look like this:
For an in-depth explanation of the remaining TIBCO EMS parameters for fault tolerance, the administrator should read the TIBCO EMS User’s Guide, Chapter 15, Configuring Fault-Tolerant Servers.
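As a rough illustration of how the three minimum parameters pair up across the two servers, the sketch below generates a tibemsd.conf fragment for each host. The host names and port come from this document's example; the server name `EMS01` matches the instance name used later, the store path is a hypothetical placeholder, and the peer parameter is written as `ft_active`, its name in the TIBCO EMS documentation. The helper functions themselves are illustrative and are not part of any TIBCO tooling.

```python
from typing import Dict


def ft_fragment(server_name: str, store_dir: str, peer_url: str) -> str:
    """Render the minimum fault-tolerance settings for one server."""
    lines = [
        f"server = {server_name}",   # identical on both servers
        f"store = {store_dir}",      # must live on the shared file system
        f"ft_active = {peer_url}",   # URL of the other server in the pair
    ]
    return "\n".join(lines) + "\n"


def ft_pair(server_name: str, store_dir: str, urls: Dict[str, str]) -> Dict[str, str]:
    """Build one fragment per host; each host points ft_active at the other."""
    if len(urls) != 2:
        raise ValueError("a fault-tolerant pair needs exactly two servers")
    (host_a, url_a), (host_b, url_b) = urls.items()
    return {
        host_a: ft_fragment(server_name, store_dir, url_b),
        host_b: ft_fragment(server_name, store_dir, url_a),
    }


if __name__ == "__main__":
    pair = ft_pair(
        "EMS01",
        "/var/tibco/ems01/datastore",  # hypothetical path under /var/tibco
        {
            "tibems01": "tcp://tibems01:55200",
            "tibems02": "tcp://tibems02:55200",
        },
    )
    for host, fragment in pair.items():
        print(f"# {host}\n{fragment}")
```

The point of the pairing is that the `server` and `store` lines are the same on both hosts, while the `ft_active` line on each host names its peer.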
Configuring VCS resources and service groups
In order to cluster TIBCO EMS, we suggest using two service groups for the infrastructure resources and one additional service group per TIBCO EMS server instance (tibemsd daemon).
The example configuration in this document uses the following setup:
- cvm – Standard service group that controls the Veritas Cluster Volume Manager and Cluster File System shared resources. This group is automatically created during the configuration phase of Storage Foundation Cluster File System installation.
- tibco_cfs – This service group controls the access to the clustered TIBCO file systems. It depends on the cvm group.
- tibco_ems01 – This service group contains the resource that controls the tibemsd daemon for the TIBCO EMS instance “EMS01”. It depends on the availability of the tibco_cfs group and the file systems within.
The three service groups are all configured as parallel service groups (i.e., they run on more than one node at a time).
The cvm and tibco_cfs groups run on all nodes within the cluster and the tibco_ems01 group runs on the primary/standby server dedicated to the EMS instance it controls.
Figure 5 Showing the three service groups online
Figure 6 Service group dependencies
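The dependency chain between the three groups determines the order in which they can be brought online. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) is shown below. The group names are those of the example setup, while the `online_order` and `offline_order` helpers are hypothetical and do not call VCS in any way.

```python
from graphlib import TopologicalSorter

# Each group maps to the set of groups it depends on.
DEPENDS_ON = {
    "cvm": set(),
    "tibco_cfs": {"cvm"},
    "tibco_ems01": {"tibco_cfs"},
}


def online_order(depends_on: dict = DEPENDS_ON) -> list:
    """Groups in an order where every dependency comes up first."""
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on: dict = DEPENDS_ON) -> list:
    """Reverse of the online order, so dependents stop first."""
    return list(reversed(online_order(depends_on)))
```

Adding a second EMS instance would mean one more entry, for example a hypothetical `tibco_ems02` depending on `tibco_cfs`, without changing the infrastructure groups.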
Configuring VCS to start/stop/monitor an EMS server
VCS can be used to automatically start/stop and monitor the EMS servers within the cluster. To do this, a service group per EMS server instance is created, in our example: EMS01, and a single application agent is used to monitor the EMS daemon (tibemsd). The service group is configured as a parallel group on the primary and the standby server for the instance.
It is also recommended to change the RestartLimit attribute of the Application Agent to 1 (from default 0) to allow VCS to help TIBCO EMS restart if the server crashes.
Figure 9 TIBCO application group for EMS01
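The effect of raising RestartLimit from 0 to 1 can be pictured as a simple decision made when the monitored daemon is found offline unexpectedly. The function below is a simplified, hypothetical model of that behavior for illustration only. It is not VCS agent code and ignores other attributes that influence fault handling.

```python
def on_unexpected_offline(restarts_so_far: int, restart_limit: int = 1) -> str:
    """Decide what happens when the monitored daemon is found offline."""
    if restarts_so_far < restart_limit:
        # A restart attempt is still allowed on this node.
        return "restart"
    # No attempts left: the resource is reported as faulted.
    return "fault"
```

With the default limit of 0 the first crash is reported as a fault immediately. With a limit of 1, the first crash triggers a local restart and only a second crash is reported as a fault.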
Configuring TIBCO EMS producers/consumers for fault tolerance
To allow TIBCO clients (either commercial or home-grown applications) to benefit from the fault-tolerant configuration, they need to be configured to know about the standby server as well as the primary server. This is done by specifying multiple server URLs when starting the client/application.
If the original server definition was “tcp://tibems01:55200” the corresponding fault tolerant server definition would be “tcp://tibems01:55200, tcp://tibems02:55200”. This will allow the client to automatically recover and reconnect once it detects a server failure. For more information on how to configure and control the client behavior, read the TIBCO EMS User’s Guide.
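To make the idea of a multi-URL server definition concrete, the sketch below splits such a definition and picks the first server that responds. In a real deployment the EMS client library performs this selection and reconnection itself. The functions here are hypothetical, and the `is_reachable` probe is a caller-supplied stand-in rather than an EMS API.

```python
from typing import Callable, List, Optional


def split_server_urls(url_list: str) -> List[str]:
    """Split a comma-separated server definition into individual URLs."""
    return [u.strip() for u in url_list.split(",") if u.strip()]


def pick_server(url_list: str, is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL that the supplied probe reports as reachable."""
    for url in split_server_urls(url_list):
        if is_reachable(url):
            return url
    return None
```

Given the definition from the text, a probe that fails for tibems01 and succeeds for tibems02 makes `pick_server` return the standby's URL, which mirrors what a fault-tolerant client does after the primary fails.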
Summary
TIBCO Enterprise Message Service is a leader in enterprise messaging. The combination of TIBCO EMS with Veritas Storage Foundation Cluster File System and Veritas Cluster Server from Symantec delivers significant benefits for organizations.
- The cluster file system provides parallel access to files, improving performance
- Failover times are reduced because volumes and file systems do not need to be brought online after a recovery as they are already available to all nodes in the cluster
- Availability of TIBCO EMS is improved in three areas:
- CFS file lock management is significantly enhanced compared to NFS, which makes failover more reliable
- VCS can detect faults outside of TIBCO EMS’ awareness
- The redundancy of the TIBCO solution is restored when a failover occurs
Appendix A: VCS configuration script
Below is the complete Veritas Cluster Server configuration file for a two node TIBCO EMS configuration.
Appendix B: Start script
This is the script used to start the EMS server through VCS.
Appendix C: TIBCO EMS configuration
Below is the full TIBCO EMS configuration file used on tibems01; the corresponding file for tibems02 has different values where the hostname is used.
Copyright © 2008 Symantec Corporation. All rights reserved. Symantec and the Symantec logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners.