DOCUMENTATION: How network interfaces are selected in NetBackup 6.x/7.x when there are multiple NICs on a NetBackup host.

Article:TECH54733  |  Created: 2007-01-16  |  Updated: 2014-11-04  |  Article URL http://www.symantec.com/docs/TECH54733
Article Type
Technical Solution




Solution



Introduction
When troubleshooting network problems, it is important to understand how the new components in NetBackup 6.x/7.x (such as nbemm, nbjm, nbpem and nbrb) communicate.

NetBackup 6.0 GA through to 6.0 MP4
When one of the new processes starts, it gets the list of interfaces from PBX and then advertises that list as the addresses on which it can be contacted:
[Application] VxICS 50936 File ID:103 [No context] [Info] V-103-23 Sending address[0]: 192.168.0.4:1556
[Application] VxICS 50936 File ID:103 [No context] [Info] V-103-23 Sending address[1]: 192.168.8.4:1556
[Application] VxICS 50936 File ID:103 [No context] [Info] V-103-23 Sending address[2]: 192.168.16.4:1556
[Application] VxICS 50936 File ID:103 [No context] [Info] V-103-23 Sending address[3]: 192.168.32.230:1556
[Application] VxICS 50936 File ID:103 [No context] [Info] V-103-23 Sending address[4]: 192.168.32.221:1556

The connecting process could try to connect to any of these addresses, regardless of whether a valid route existed and regardless of the SERVER settings inside NetBackup. As a result, the connection might be made across a route that the administrator did not intend NetBackup to use.  Current behavior in this area is very different from the behavior prior to 6.0.
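When reviewing such a log, it can help to pull the advertised addresses out into a simple list. The following is a minimal Python sketch, not a NetBackup tool; the function name advertised_addresses and the regular expression are assumptions based only on the "Sending address[n]" lines shown above.

```python
import re

# Matches the "Sending address[n]: ip:port" portion of the VxICS lines above.
# The pattern is an assumption derived from those sample lines only.
ADDRESS_RE = re.compile(
    r"Sending address\[(\d+)\]: (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def advertised_addresses(log_lines):
    """Return (index, ip, port) tuples for every advertised address found."""
    found = []
    for line in log_lines:
        match = ADDRESS_RE.search(line)
        if match:
            found.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return found


sample = [
    "[Application] VxICS 50936 File ID:103 [No context] [Info] "
    "V-103-23 Sending address[0]: 192.168.0.4:1556",
    "[Application] VxICS 50936 File ID:103 [No context] [Info] "
    "V-103-23 Sending address[1]: 192.168.8.4:1556",
]
for index, ip, port in advertised_addresses(sample):
    print(index, ip, port)
```

Each address in the resulting list is one that a connecting process may be offered, so it is a useful starting point for checking which subnets are in play.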

NetBackup 6.0 MP5 onwards (including NetBackup 6.5 and NetBackup 7.0)
From 6.0 MP5 onwards, NetBackup has been modified so that it uses only the paths intended by the user. The process still gets a list of interfaces from PBX and advertises them. The difference is that when a command tries to connect, it will connect to an interface only if it finds the name of that interface in a SERVER entry in its local bp.conf. Therefore, the bp.conf file should contain SERVER entries only for names through which the server can actually be reached.  This avoids communication via unintended interfaces.
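The selection rule described above can be modeled roughly as follows. This is a simplified Python sketch of the idea, not NetBackup's actual implementation: the function names, the reverse_lookup parameter, the interface name "master-bk" and the IP-to-name mapping are all hypothetical, and the "KEY = value" line format is assumed from a UNIX bp.conf. MEDIA_SERVER lines are included because this TechNote states that they have the same controlling effect.

```python
def server_names(bp_conf_text):
    """Collect host names from SERVER and MEDIA_SERVER lines of bp.conf text."""
    names = set()
    for raw in bp_conf_text.splitlines():
        key, sep, value = raw.partition("=")
        if sep and key.strip() in ("SERVER", "MEDIA_SERVER"):
            names.add(value.strip().lower())
    return names


def allowed_addresses(advertised, bp_conf_text, reverse_lookup):
    """Keep only the advertised IPs whose name appears in a SERVER entry.

    reverse_lookup is any callable mapping an IP to a host name (or None).
    It is injected so the sketch can be tried without touching real DNS.
    """
    allowed = server_names(bp_conf_text)
    return [
        ip for ip in advertised
        if (reverse_lookup(ip) or "").lower() in allowed
    ]


# Hypothetical data: the names and the IP-to-name mapping are illustrative.
bp_conf = "SERVER = master-bk\nSERVER = media\nCLIENT_NAME = media\n"
names_by_ip = {"192.168.0.4": "master-bk", "192.168.8.4": "master-pub"}

print(allowed_addresses(["192.168.0.4", "192.168.8.4"], bp_conf, names_by_ip.get))
```

In this model, the address whose name has no SERVER entry is skipped even though it is still advertised, which is the effect the bp.conf entries are meant to have.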

Consider the following configuration:
 

Note there is no route between the "master-pub" interface on host "master" and the "media-prod" interface on host "media".  Therefore, host "master" should not have a SERVER entry for the "media-prod" interface, nor should host "media" have a SERVER entry for interface "master-pub".  This is why they are shown in strikethrough text.

At the same time, be sure to check name resolution. All NetBackup servers advertise on all interfaces. Even after a SERVER entry is removed, the host must still be able to resolve the name correctly so that it can compare it with the bp.conf SERVER entries and determine that the route should not be used.

For the example configuration above, after removing the entry "SERVER=media-prod" from the master server, be sure to check that "media-prod" can still be resolved by using the bpclntcmd command. For example:

 
"bpclntcmd -hn media-prod"
 
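When many names need checking, a small script can give a first indication of which ones fail to resolve. The following Python sketch is an approximation only: it uses the operating system resolver through socket.gethostbyname, which may not give the same answer as bpclntcmd, and the function name unresolvable and the sample names are assumptions made for illustration.

```python
import socket


def unresolvable(names, resolve=socket.gethostbyname):
    """Return the names that the given resolver cannot resolve.

    resolve defaults to the OS resolver; a different callable can be
    passed in to try the function without performing real lookups.
    """
    failed = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


# Illustrative stand-in resolver so the example does not depend on DNS.
def fake_resolver(name):
    known = {"master": "192.168.0.4", "media": "192.168.0.5"}
    if name not in known:
        raise socket.gaierror("name not known: " + name)
    return known[name]


print(unresolvable(["master", "media", "media-prod"], fake_resolver))
```

Any name reported by such a check should then be confirmed with bpclntcmd on the NetBackup host itself.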


MEDIA_SERVER entries
Note that the controlling effect of SERVER entries applies equally to MEDIA_SERVER entries. Adding incorrect MEDIA_SERVER entries will result in the same undesired behavior as adding incorrect SERVER entries. This TechNote uses only SERVER entries in its examples, but all of the information applies to MEDIA_SERVER entries as well.


Problem Manifestation
Depending on where an incorrect SERVER entry is made, and the available routing, different problems will occur.
 
-  Incorrect Master Server entry. Processes that need to connect to the media server will be affected. A common example is nbjm when it sends resources to bptm.
 
-  Incorrect Media Server entry. Processes that need to connect to the master server will be affected. Common examples here are ltid and nbemmcmd.
 
-  Route Available. If the alternate, incorrect route is available, no problem will be visible and jobs will run, but NetBackup may be using a subnet that was not intended for NetBackup use.
 
-  Route Not Available. Usually the connection attempt will fail immediately, and either the command will fail or NetBackup will try another route. However, hangs and delays may be experienced when a firewall blocks the connection by silently discarding SYN packets.
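The difference between an immediate failure and a silent drop can be observed with a simple TCP probe. The following Python sketch is a generic diagnostic aid, not part of NetBackup; the function names classify and probe and the 5-second default timeout are assumptions chosen for illustration.

```python
import socket


def classify(error):
    """Map the outcome of a connection attempt to a short description."""
    if error is None:
        return "connected"
    # A silently dropped SYN shows up as a timeout rather than a refusal.
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "timed out"
    if isinstance(error, ConnectionRefusedError):
        return "refused"
    return "failed: " + error.__class__.__name__


def probe(host, port, timeout=5.0):
    """Attempt a TCP connection and describe how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify(None)
    except OSError as error:
        return classify(error)
```

For example, probe("192.168.0.4", 1556) run from the connecting host would be expected to return quickly when the route is open or actively rejected, and to take the full timeout when a firewall discards the packets silently.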
 

 
This diagram illustrates a scenario where the TCP stack believes a route exists, but a firewall on the route is blocking most traffic (in this example, only HTTP traffic on port 80 is allowed). In this scenario, delays can begin to appear due to network timeouts if incorrect SERVER entries are made.
 

Consider the incorrect master server entry "SERVER=media-prod". The media server will advertise itself as available on both of its interfaces. When nbjm wants to send resources to bptm, it sees from its bp.conf file that media-prod is a valid address to connect to, so it initiates a connection that will try to pass through the firewall.

The firewall will block the connection. If the firewall blocks the connection silently, nbjm does not know there is a problem and will keep trying until the attempt times out. Each timeout introduces a delay into nbjm, and repeated attempts will eventually have a significant impact on the ability of the master server to process work efficiently.


Troubleshooting
To verify the IP addresses being used, set the DebugLevel for the TAO component (156) and the NetBackup Libraries (137) as follows:
 
/usr/openv/netbackup/bin/vxlogcfg -a -p NB -o 156 -s DebugLevel=1
 
/usr/openv/netbackup/bin/vxlogcfg -a -p NB -o 137 -s DebugLevel=5
 

For 6.0 only: If there are problems connecting to nbemm, raise the debug level to 6.  Note that the following command is run on the media server from which the connection to nbemm is made.
 
/usr/openv/netbackup/bin/vxlogcfg -a -p NB -o nbemm -s DebugLevel=6
 


Here is an example of a successful connection:
13:26:15.677 [26940] <2> taolog: TAO (26940|1) PBXIOP connection to peer <192.168.32.200:1556> on 256
...
13:26:15.724 [26940] <2> taolog: TAO (26940|1) - PBXIOP_Connector::make_connection, going to wait for connection completion on local handle [257]
13:26:15.725 [26940] <2> taolog: TAO (26940|1) PBXIOP connection to peer <192.168.32.200:1556> on 257

The following is an example of an attempted connection to an IP address whose SERVER name was incorrectly added to the bp.conf file:
12:27:52.126 [451] <2> taolog: TAO (451|1) - PBXIOP_Connector::make_connection, to <192.168.0.4:1556:EMM>
...
12:27:52.127 [451] <2> taolog: TAO (451|1) - PBXIOP_Connector::make_connection, going to wait for connection completion on local handle [261]; closed=0
12:27:52.127 [451] <2> taolog: TAO (451|1) - Leader_Follower[2408808]::wait_for_event, (leader) enter reactor event loop
...
12:31:36.810 [451] <2> taolog: TAO (451|1) - Leader_Follower[2408808]::wait_for_event, (leader) exit reactor event loop
12:31:36.810 [451] <2> taolog: TAO (451|1) - PBXIOP_Connector::make_connection, active_connect_strategy_->wait failed on this thread for local handle [261]; closed=1
12:31:36.811 [451] <2> taolog: TAO (451|1) - TAO_Transport_cleanup_queue_i[2408808], cleaning up complete queue
12:31:36.811 [451] <2> taolog: TAO (451|1) - PBXIOP_Connector::make_connection, connection to <sun04bk:1556:EMM> failed (errno: Transport endpoint is not connected)

("TAO" tags in the log are not printed by default)

Note that the connection takes approximately 225 seconds to fail (from 12:27:52 to 12:31:36 in the log above).  This is a common timeout value; another common timeout is 75 seconds.  Timeouts depend upon the operating system.
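The time a connection attempt took to fail can be worked out from the timestamps at the start of the relevant log lines. The following is a minimal Python sketch; the function name elapsed_seconds is an assumption, and it assumes both lines begin with an HH:MM:SS.mmm timestamp from the same day, as in the excerpt above.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def elapsed_seconds(start_line, end_line):
    """Return the seconds between the leading timestamps of two log lines.

    Assumes both timestamps fall on the same day; an attempt that spans
    midnight would need the date to be taken into account as well.
    """
    start = datetime.strptime(start_line.split()[0], TIME_FORMAT)
    end = datetime.strptime(end_line.split()[0], TIME_FORMAT)
    return (end - start).total_seconds()


waiting = ("12:27:52.127 [451] <2> taolog: TAO (451|1) - "
           "PBXIOP_Connector::make_connection, going to wait for "
           "connection completion on local handle [261]; closed=0")
failed = ("12:31:36.810 [451] <2> taolog: TAO (451|1) - "
          "PBXIOP_Connector::make_connection, active_connect_strategy_->wait "
          "failed on this thread for local handle [261]; closed=1")

print(round(elapsed_seconds(waiting, failed)))
```

A result close to a well-known operating system timeout suggests that the connection was dropped silently rather than rejected.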


The following is an example of the ltid process not being able to connect to nbemm because it is using the wrong interface. Note the CORBA exception/timeout:

19:45:40.979 [32313] <4> InitLtid: LTID detected an abnormal shutdown
19:45:42.994 [32313] <4> InitLtid: emmserver_name = svr04
..
20:06:00.960 [32424] <16> GetSupportedInterface: (-) Exception! CORBA::Exception
20:06:00.960 [32424] <16> GetSupportedInterface: (-) system exception, ID 'IDL:omg.org/CORBA/TIMEOUT:1.0'
20:06:00.960 [32424] <16> emmlib_GetHost: (0) Interface not supported, emmError = 3000004, nbError = 0
20:06:00.961 [32424] <16> AddAndVerifyHost: Failed to validate the existence of host sun04 with EMM. EMM error code = 3000004

Note:  Although timeouts are a frequent symptom, not all connection issues will show the same (timeout) symptom.
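The signatures shown in the log examples above can be searched for in bulk. The following Python sketch is illustrative only: the function name connection_problems and the two patterns are assumptions taken from the sample lines in this TechNote, and other failures may produce different messages.

```python
import re

# Pattern taken from the failed make_connection line shown earlier.
FAILED_RE = re.compile(r"connection to <([^>]+)> failed \(errno: ([^)]+)\)")
# Exception ID taken from the ltid example shown earlier.
CORBA_TIMEOUT = "IDL:omg.org/CORBA/TIMEOUT:1.0"


def connection_problems(log_lines):
    """Return (kind, target, reason) tuples for lines matching a signature."""
    problems = []
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            problems.append(("connect failed", match.group(1), match.group(2)))
        elif CORBA_TIMEOUT in line:
            problems.append(("corba timeout", None, None))
    return problems
```

Lines reported this way indicate where to look more closely; the target in a "connect failed" entry can then be compared with the SERVER entries in bp.conf.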

Summary
This document outlines the importance of verifying the bp.conf settings when troubleshooting these errors. From NetBackup 6.0 MP5 onwards, the use and effect of SERVER entries differ both from NetBackup 5.1 and earlier and from NetBackup 6.0 GA through 6.0 MP4.  Incorrect entries may result in delays, failures, timeouts and undesired network path usage. Care must be taken to ensure that only valid SERVER and MEDIA_SERVER entries are made in the NetBackup configuration.
 



Legacy ID



293038

