
Stability issues with version 9.0.0.1193

Created: 21 Oct 2013 • Updated: 29 Nov 2013 | 17 comments
This issue has been solved. See solution.

Hi,

We are still using an old version of eVault, 9.0.0.1193, in combination with four Exchange Server 2010 Service Pack 2 servers in a DAG configuration. It is only used for mail archiving. We had file archiving in the past, but stopped this when we moved to NetApp. We plan to upgrade eVault sometime in 2014.

The solution is not stable at the moment. It basically never has been, but right now it seems worse. The MQ queues regularly freeze: no messages go in or out of the queue. It happens randomly, but roughly every 2 to 3 days. Our setup depends on the backup, because no items can be removed from the mailbox until a successful backup has run, and we only back up at night, so a freeze usually leads to user impact for at least a day.

I did find a workaround to clear the queues and reset the storage & MQ service. It is based on this article: http://www.symantec.com/business/support/index?page=content&id=TECH48896. This temporarily resolves the issue. We want a more stable solution until we get a green light from the business to upgrade.

The event on the server that seems to be linked to it is ID 2262: "Error processing archiving request for system MAILSERVER.". At the same moment, just after this event, ID 3156: "Start to process the Exchange system MAILSERVER." is logged, but the MQ queue seems to freeze.

Restarting the MQ service fails; it times out after a VERY long time. Only removing the queues or emptying them seems to get it up and running again. Users randomly complain that they cannot restore items, or that archiving has stopped. An engineer, who has since left the company, called Symantec about it, and they suggested that he reinstall MQ each time the problem occurs. A bizarre suggestion, to be honest.

I have checked numerous things, partly based on suggestions in this forum, but so far the final solution remains a mystery. I hope, with your help, to get some leads to solve this issue. I can provide extra information if needed.

Many thanks in advance for your help!

 


GertjanA's picture

Hello WiVM,

Some pointers:

Make sure the MSMQ location is EXCLUDED from Antivirus scanning, both scheduled and manual scans.

Make sure the size (set on the properties of MSMQ) is set to 10 GB. Assuming you're on W2008R2 already: expand Features, right-click Message Queuing, and select Properties. Verify that both fields under Storage Limits read 10485760, not 1048576.

Verify MSMQ is not AD-integrated (when selecting security tab, you should get a message stating it is not supported in Workgroup mode)

Make sure there is no MSMQ Journaling set to any queue.

Verify (preferably) you do not have outgoing queues.

Verify MSMQ location is on a reasonably fast disk, not shared with other functions. (pagefile/temp/indexing etc)
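As a quick sanity check on the storage-limit figures above, here is a minimal sketch that converts a limit to GB, assuming the Storage Limits fields are expressed in KB. The helper name kb_to_gb is only illustrative. It shows why a missing digit matters: 1048576 KB is only 1 GB, while 10485760 KB is the intended 10 GB.

```python
KB_PER_GB = 1024 * 1024  # 1 GB = 1,048,576 KB (binary units)


def kb_to_gb(limit_kb: int) -> float:
    """Convert a storage limit given in KB to GB."""
    return limit_kb / KB_PER_GB


# The two values from the pointer above: the intended limit and the common typo.
for limit_kb in (10485760, 1048576):
    print(f"{limit_kb} KB = {kb_to_gb(limit_kb):g} GB")
```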

How many EV servers do you have, and how many mailboxes do you have archived? Are you performing the regular SQL maintenance on the databases? Which MSMQ queues are 'freezing'? If it is the storage queue, is there something going on at the storage location (backup, Windows indexing, defrag, antivirus scanning, etc.)? If it is the A5 queue, is there perhaps something going on in Exchange? How much data are you archiving (perhaps lower the threads on the tasks)? Could latency be an issue?

If you run the tasks in report mode, does the storage queue empty? Is there enough space on both the index and storage locations? What storage are you using? Is it possible for you to add an extra EV server to balance the load?

Have you tried involving support? They are good at determining the bottleneck, but that does require some dtracing.

Good luck. If I can think of more, I'll post here.

Thank you, Gertjan, MCSE, MCITP,MCTS, SCS, STS
Company: www.t2.nl

www.quadrotech-it.com

www.symantec.com/vision

WiVM's picture

- Antivirus is disabled on the servers.

- It is a Windows 2003 SP 2 Standard Edition. 4 GB RAM. The storage limit is 7340032 KB

- It is in workgroup mode indeed

- There is no journaling enabled on any queue. I checked them one by one. But I noticed that only queue A6 has a “Limit message storage” of 102400 KB applied.

- I don’t have outgoing queues.

- It is located on a dedicated SAN disk connected on a Clariion storage system.

- We have 2 eVault servers, but only one is used. The other server was for file archiving; it is basically running in standby, but not used. We have switched between the two, and both systems have the same issues.

- Full backup with incrementals every two hours and shrink every day.

- It seems to be random queues. I will have a look at this when it happens again.

- We are archiving about 2350 mailboxes, with an average size of 60 MB each.

- Latency should not be an issue.

- I have never tried report mode to see whether the queue empties. What is the impact on the users when doing this?

- We use Clariion storage

- Adding a server to balance the load: We already have a second one, but not active. May be a good suggestion, but I will have to bring up good reasons to do so.

- About support, as mentioned: an engineer who worked here in the past did contact them, but support could only suggest reinstalling MSMQ whenever it happened.

Thanks already!

WiVM's picture

Maybe a useful addition: I sometimes see the event "EventID: 0x800408A8 (2216) - Message dispenser will suspend processing for 5 minutes due to a recoverable error".
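For anyone wondering how the hexadecimal and decimal forms of that event ID relate: a Windows event identifier is a 32-bit value with the severity in the top two bits, a 12-bit facility, and the event code in the low 16 bits. The sketch below (the function name decode_event_id is just illustrative) splits 0x800408A8 into those fields and shows that the low 16 bits are the 2216 shown in brackets.

```python
SEVERITIES = {0: "Success", 1: "Informational", 2: "Warning", 3: "Error"}


def decode_event_id(raw: int) -> dict:
    """Split a 32-bit Windows event identifier into its bit fields."""
    return {
        "severity": SEVERITIES[(raw >> 30) & 0x3],  # top two bits
        "facility": (raw >> 16) & 0xFFF,            # 12-bit facility code
        "code": raw & 0xFFFF,                       # low 16 bits: the decimal event ID
    }


print(decode_event_id(0x800408A8))
```

So the event is logged with warning severity, which fits the "recoverable error" wording.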

MichelZ's picture

What stops you from installing the latest Service pack?
Usually, there are tons of stability fixes included...

WiVM's picture

That is another discussion. The business wants it. I am only the Exchange administrator. I can advise, but not force. ;-)

MichelZ's picture

Seems like the business does not need the problem "solved" then... :)

GertjanA's picture

Can you add RAM to the server?

Can you change the limit on that A6 group?

And, as Michel says, upgrade to latest SP. On 2 servers it should be relatively quick.

 

 

Thank you, Gertjan, MCSE, MCITP,MCTS, SCS, STS

WiVM's picture

Hi,

I do understand that the latest fixes would probably change a lot, but as mentioned, I don't have this option at the moment.

Which trace would you suggest to use for this type of issue?

Thank you

plaudone's picture

There are a couple of things I would suggest. 

1.  Optimize the EV and SQL servers using the following.  This could help to avoid issues caused by a lack of resources.

http://www.symantec.com/docs/TECH55653

http://www.symantec.com/docs/TECH56172

2.  Upgrade ASAP to the latest release of 9 - http://www.symantec.com/docs/TECH204715

There have been numerous enhancements and fixes to issues to help EV 9 run more efficiently and address many known issues that can only be fixed by a code update. 

 

Outside of these you may be looking at increased time to try and resolve the issues that are occurring on the server one at a time.  

plaudone's picture

Also, ensure that any MS patches related to MSMQ performance are installed on the server.

A Dtrace of the ArchiveTask and StorageArchive processes should be able to give more information.   These typically produce a large amount of data, so I would suggest running Dtrace from a command line using the following:

Dtrace 2000000

This will increase the buffer size and hopefully avoid buffer overflows in the logs.  

WiVM's picture

I ran a trace on ArchiveTask & StorageArchive while running the tasks in report mode. This is basically the error that we see frequently:

581221 11:15:19.276  [3752] (ArchiveTask) <7300> EV:M :CArchivingAgent::QueueChunkOfMailboxes() |Committing the transaction |
581222 11:15:19.276  [3752] (ArchiveTask) <7300> EV:H :CArchivingAgent::MainProcessSystem() |Return the MAPI session to the session pool |
581223 11:15:19.276  [3752] (ArchiveTask) <7300> EV:H :CArchivingAgent::MainProcessSystem() |Exiting routine |
581224 11:15:19.276  [3752] (ArchiveTask) <7300> EV:L CArchivingAgent::MainProcessSystem (Exit) |Unspecified error  [0x80004005] |
581225 11:15:19.276  [3752] (ArchiveTask) <7300> EV~E Event ID: 2262 Error processing archiving request for system MBXPRD02 |
581226 11:15:19.276  [3752] (ArchiveTask) <7300> EV:H :CArchivingAgent::ProcessSystem() |Exiting routine |
581227 11:15:19.276  [3752] (ArchiveTask) <7300> EV:M :AgentMessageDispenser::ActivateObject() |An error that we do not specifically recognize has occurred, so we'll just increment this messages retry count |

581228 11:15:19.276  [3752] (ArchiveTask) <7300> EV:H :AgentMessageDispenser::ActivateObject() |Exiting routine at point A |
581229 11:15:19.276  [3752] (ArchiveTask) <7300> EV:M :AgentMessageDispenser::ProcessNextMessage() |ActivateObject has returned failure |
581230 11:15:19.276  [3752] (ArchiveTask) <7300> EV:M :AgentMessageDispenser::ProcessNextMessage() |Activate object returned with status AGENTS_E_DISPRETRY. The retry count will be incremented, and the message reposted to the end of the queue |
581231 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M :AgentMessageDispenser::ProcessNextMessage() |Retrieved a message successfully from the queue |
581232 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M :AgentMessageDispenser::ProcessNextMessage() |Read new message or a message that is within its retry limit (). |About to process the message body |
581233 11:15:19.276  [3752] (ArchiveTask) <3140> EV:H :AgentMessageDispenser::ActivateObject() |Entering routine |
581234 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M :AgentMessageDispenser::ActivateObject() |Message type indicator = MsgID_ArchiveSystemEx. |Will fall through to be handled by the case for MsgID_ArchiveSystemExImp |
581235 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M :AgentMessageDispenser::ActivateObject() |Message type indicator = MsgID_ArchiveSystemExImp |
581236 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L :AgentMessageDispenser::ActivateObject() |Called server side object, with arguments: |m_pIBackgroundArchivingAgent->ProcessSystemEx(Priority = "true",|  ReportingMode = "true",|  Run Now Mode = "3",|  ContinuousMode = "true",|  NumMsgsToArchivePerPass = "0",|  ExchangeSystem = "MBXPRD02",|  NULL);| |
581237 11:15:19.276  [3752] (ArchiveTask) <3140> EV~I Event ID: 3157 Start to process the Exchange system MBXPRD02 (in report mode). |
581238 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M EPC::GDP - Returning Default Policy : [Exchange Mailbox Policy][153A7645E9C3DFA4B801F3930B421BAF51012700EVSITE1]
581239 11:15:19.276  [3752] (ArchiveTask) <3140> EV:H :CArchivingAgent::ProcessSystem() |Entering routine |
581240 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L CArchivingAgent::MainProcessSystem (Entry) |
581241 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M CArchivingAgent::MainProcessSystem - Processing queued message type [0x0000004D]
581242 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M EPC::GDP - Returning Default Policy : [Exchange Mailbox Policy][153A7645E9C3DFA4B801F3930B421BAF51012700EVSITE1]
581243 11:15:19.276  [3752] (ArchiveTask) <3140> EV:H :CArchivingAgent::MainProcessSystem() |Entering routine |
581244 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L CArchivingAgent::Initialise (Entry) |
581245 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M MigratedDominoItems::Reset (Entry)
581246 11:15:19.276  [3752] (ArchiveTask) <3140> EV:M MigratedDominoItems::Reset (Exit)
581247 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L CPrioritizedItemTable::InitialiseTable(Age) - Setting up the table.  Size: [1000]
581248 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L CPrioritizedItemTable::InitialiseTable(Quota) - Setting up the table.  Size: [1000]
581249 11:15:19.276  [3752] (ArchiveTask) <3140> EV:L CArchivingAgent::Initialise (Exit) |Success  [0] | 
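To pick the logged events out of a large trace like the one above, a small script can help. The sketch below is only an illustration, not a Symantec tool: the line layout (sequence number, time, a number in square brackets taken to be the process ID, process name, a number in angle brackets taken to be the thread ID, level, message) is assumed from the excerpt, and the names LINE_RE, EVENT_RE and find_events are made up for this example.

```python
import re

# Layout assumed from the excerpt:
# sequence, time, [process id], (process), <thread id>, level, message
LINE_RE = re.compile(
    r"^(?P<seq>\d+)\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"\[(?P<pid>\d+)\]\s+\((?P<process>\w+)\)\s+<(?P<thread>\d+)>\s+"
    r"(?P<level>EV[:~][A-Z])\s+(?P<message>.*)$"
)
EVENT_RE = re.compile(r"Event ID: (\d+)")


def find_events(lines):
    """Return (sequence, thread id, event id) for every line that logs an event."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank or unrecognised line
        event = EVENT_RE.search(match["message"])
        if event:
            events.append((int(match["seq"]), int(match["thread"]), int(event.group(1))))
    return events


# Three lines copied from the trace above.
sample = [
    "581224 11:15:19.276  [3752] (ArchiveTask) <7300> EV:L CArchivingAgent::MainProcessSystem (Exit) |Unspecified error  [0x80004005] |",
    "581225 11:15:19.276  [3752] (ArchiveTask) <7300> EV~E Event ID: 2262 Error processing archiving request for system MBXPRD02 |",
    "581237 11:15:19.276  [3752] (ArchiveTask) <3140> EV~I Event ID: 3157 Start to process the Exchange system MBXPRD02 (in report mode). |",
]
print(find_events(sample))
```

On these sample lines it reports event 2262 on thread 7300 and event 3157 on thread 3140, which makes it easier to see that the error and the new "start to process" message come from different threads.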

 

Please note that the RCAMaxConcurrency is set to $Null with the SetEVThrottlingPolicy.ps1 as recommended.

WiVM's picture

I have worked from this article: http://www.symantec.com/business/support/index?page=content&id=TECH35774

I noticed that the Exchange Throttling Policy was not active on the mailboxes that are used on the tasks. It was only enabled on the mailbox that was linked to the service account of eVault.

In addition I found in an old post, that Microsoft Exchange System Manager is not supposed to be installed on the eVault server. I found this in this post: http://www.symantec.com/connect/forums/enterprise-vault-901-error-add-outlook-2007#comment-6710581

Could that be the reason for the MAPI conflicts? And is it safe to uninstall it from the server without breaking the system? When I start Outlook 2007 SP3 on the server, I get the message that the eVault client could not be loaded. I disabled this in the Trust Center. Still, I get errors saying that Outlook is not the default client, while it is selected as the default client in IE & the Outlook options.

Please advise.

SOLUTION
plaudone's picture

You could try to run fixmapi on the server, but it may be required to remove ESM and then re-install Outlook on the server to resolve.  A reboot should be done after removing ESM.  

WiVM's picture

After the removal of the ESM, the 2262 errors went away, and archiving has been running much more stably than before for more than one week. I have also enabled the EV throttling policy for the mailboxes that are used in the tasks for the different mail servers. This throttling was previously only enabled on the service account for EV. Since those two changes we have never had to reset MQ, but I notice that queue A5 in particular holds some messages from time to time. It seems to be linked to the backup; at that moment they are emptied. This may be because we have set that items can only be removed from Outlook after a successful backup.

Now, from time to time we get MAPI errors 0x8007000E (Out of Memory). I have changed HKLM\Software\KVS\Enterprise Vault\Agents\ProfileExpire from 3 to 1 and scheduled a daily restart of the TaskController service. We only have 4 GB in the server (32 bit). This may be a bit too low.

We also started to get Error ID 3419: http://www.symantec.com/business/support/index?page=content&id=TECH164858. I will plan a recreation of the tasks to see whether that improves things. I doubt that this will fix it, as we recreated them only recently.

I am also still trying to get approval to install the latest fix 9.0.5, as it seems that the above article is only applicable to version 9.0 and 9.0.1.

GertjanA's picture

Hello WiVM

It is ok to see messages in the A5 queue. This queue holds information of which mailboxes to process. In the Admin guide, there is an extensive description of what each queue does.

A5 = Mailboxes to process. Used during scheduled archive runs. This queue is not processed outside the scheduled archiving times, so you cannot use Run Now to clear a backlog on this queue.

What happens is that every mailbox that needs to be scanned is put in here. If the archiving schedule permits, a second pass of mailboxes is done. When the schedule then stops, you have some mailboxes left to scan; they then remain in the A5 queue and will be processed on the next scheduled archiving run. If the number drops to 0 every now and then, there is no issue. If it remains high, you need to investigate why it is not dropping (too small an archiving window, too many mailboxes, too large mailboxes, etc.).
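The rule of thumb above can be written down as a simple check. This is only a sketch under assumptions: the queue depths are made-up numbers, how they are collected (for example, by periodically noting the message count of the A5 queue) is not shown, and the function name queue_drains is hypothetical.

```python
def queue_drains(depth_samples):
    """True if the queue reached zero at least once during the sampled period."""
    return any(depth == 0 for depth in depth_samples)


healthy = [120, 45, 0, 80, 10, 0]  # made-up depths: drops to 0 now and then
stuck = [120, 150, 180, 175, 210]  # made-up depths: never drains

print(queue_drains(healthy), queue_drains(stuck))
```

A queue that never reaches zero across several archiving windows is the case worth investigating.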

On the Mapi-errors, you might also want to use the following regkeys.

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\KVS\Enterprise Vault\Agents]
"RestartAllMAPITaskIntervalMins"=dword:00000168
"RestartOnMAPIMutexError"=dword:00000001

The first one restarts the tasks at a set interval. The value is in minutes; dword:00000168 is hexadecimal for 360, so the tasks restart every six hours.

The second one restarts the tasks if there is a MAPI mutex error.
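Since .reg files store DWORD values in hexadecimal, it is easy to misread the interval. A minimal sketch of the conversion follows; the helper name reg_dword_to_int is just for illustration. Note also that the Wow6432Node path only exists on 64-bit Windows; on a 32-bit server the values would presumably go under the HKLM\Software\KVS\Enterprise Vault\Agents key mentioned earlier in the thread.

```python
def reg_dword_to_int(value: str) -> int:
    """Convert a .reg style 'dword:XXXXXXXX' string (hexadecimal) to an integer."""
    prefix = "dword:"
    if not value.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {value!r}")
    return int(value[len(prefix):], 16)


minutes = reg_dword_to_int("dword:00000168")
print(minutes, "minutes =", minutes / 60, "hours")
```

To pick a different interval, convert the desired number of minutes to hexadecimal before writing it into the .reg file, for example format(240, "08x") for four hours.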

You can also create some scheduled task to restart the taskcontroller service at certain times. That should also clear up the mapi-profiles.

good luck.

 

 

Thank you, Gertjan, MCSE, MCITP,MCTS, SCS, STS

plaudone's picture

You should make sure that Outlook is 2007 with the latest SP as there are fixes for some MAPI issues.  If all the tuning has been put in place that should help as well.   

WiVM's picture

This issue is under control since the changes that I made. And the good news is that I got approval to upgrade to 10.0.0.4.