Server Fault Tolerance for Software Delivery Solution
While looking for information on package server redundancy, I was unable to find any references. Hopefully, this posting will fill that gap.
If a package server is defined for a site and that package server goes down, how do the Altiris Agent clients respond? My original thought was that if the clients were unable to pull from the package server, they would fail over to the Notification Server (NS) to get their distribution.
To test this, the package server assigned to the site was shut down. The client tried to download the package from that package server, and because the server was down, the initial download failed. Rather than failing over to the NS, the client retried after 3 minutes, then after 6 minutes, and the retry interval kept doubling until the client was checking every 2 hours. From that point on, the client continued to check every 2 hours until the package server came back up.
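The retry pattern observed above is a capped exponential backoff. The sketch below reproduces the observed intervals; it is a model of the behavior, not the Altiris Agent's actual code. The function name and the assumption that the interval is clamped at exactly 120 minutes once doubling would exceed it (96 minutes doubles to 192) are illustrative guesses, since only the 3, 6, and "every 2 hours" values were observed.

```python
def retry_intervals(initial: int = 3, cap: int = 120, attempts: int = 8) -> list[int]:
    """Minutes to wait before each retry (hypothetical model of the observed behavior).

    The wait starts at `initial` minutes and doubles after each failed attempt,
    but is clamped at `cap` minutes (assumed: 2 hours = 120 minutes).
    """
    intervals = []
    delay = initial
    for _ in range(attempts):
        intervals.append(delay)
        # Double the wait, but never exceed the cap.
        delay = min(delay * 2, cap)
    return intervals


print(retry_intervals())  # → [3, 6, 12, 24, 48, 96, 120, 120]
```

Under this model, the client reaches the 2-hour check interval after about six failed retries and stays there until the package server responds again.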
Fault tolerance was achieved by defining two package servers for the site. In this scenario, the client attempted to contact the first package server, and because that server was down, the client switched to the second package server.
This discovery led to another question: Would the client point back at the original package server once it was brought up?
No. The clients that failed over to the second package server continued to pull packages and updates from it. Only after the second package server was shut down, with the original package server back up, did the clients fail back to the original package server.
The package.xml file for each package snapshot contains a static list of package servers (codebase references), in the order in which the package servers were added. The client does not simply start at the top of that list every time: it first tries the last known good package server and, if that server is not responding, continues down the package.xml codebase references to the next package server. This explains why the clients stayed on the second package server even after the original one came back up.
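The selection behavior described above can be sketched as follows. This is a hypothetical model, not Altiris code: the server names `PS1` and `PS2`, the function names, and the assumption that the search wraps around to the top of the list after the last entry are all illustrative. Wrap-around (or restarting from the top) is consistent with the clients returning to the original server once the second one was shut down.

```python
from typing import Callable, Optional


def candidate_order(servers: list[str], last_good: Optional[str]) -> list[str]:
    """Order in which a client tries package servers (assumed model).

    `servers` is the static list from package.xml, in the order added.
    The last known good server is tried first; the search then continues
    down the list, wrapping around to the top (assumption).
    """
    if last_good not in servers:
        return list(servers)
    start = servers.index(last_good)
    return servers[start:] + servers[:start]


def download(servers: list[str], last_good: Optional[str],
             is_up: Callable[[str], bool]) -> tuple[Optional[str], Optional[str]]:
    """Return (server used, new last known good), or (None, last_good) if all are down."""
    for server in candidate_order(servers, last_good):
        if is_up(server):
            return server, server
    return None, last_good


# Replay the test from this article with two hypothetical package servers.
servers = ["PS1", "PS2"]
up = {"PS1": False, "PS2": True}   # original server shut down
used, last_good = download(servers, None, up.get)
print(used)                        # → PS2
up["PS1"] = True                   # original server brought back up
used, last_good = download(servers, last_good, up.get)
print(used)                        # → PS2 (client stays on last known good)
up["PS2"] = False                  # second server shut down
used, last_good = download(servers, last_good, up.get)
print(used)                        # → PS1
```

Keeping the last known good server sticky avoids bouncing between servers, at the cost that a restored server is not used again until the current one fails.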