
Posts Tagged ‘Agents’

How to stop false heartbeat alerts for DMZ servers

12/04/2011

One of the strange things in OpsMgr is the relationship between the Health Service Watcher object and the Root Management Server (RMS). A common mistake is to think that when we point a server to a gateway server (GW) or a Management Server (MS), the GW or MS is responsible for alerting us about the availability of the monitored server.

This is not the case in the current version of OpsMgr (hopefully the next version will help us deal with it better). All Health Service Watcher objects are placed on the RMS, and if the GW server is down we will get a lot of “Computer not reachable” and “Health Service Heartbeat Failure” alerts for servers that are up and running!

HB_RMS

Let's start with a common scenario where we have a GW server that is connected to an MS through a firewall (FW).

GW

In this scenario, when the MS, GW or FW is down we will get a lot of false alarms in the console, alerting us that all agents behind the FW are down.

We have 2 options to work around this:

Option 1: 

GW1

Add another GW server (GW2) and set all agents to fail over to it if GW1 fails. (How to failover an agent\GW). Keep in mind that if the FW or the network devices that connect the GW to the MS fail, you will still get all the unwanted alerts.
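To make the limits of Option 1 concrete, here is a minimal Python sketch of the reasoning. It is only a toy model, not OpsMgr code: the function name, the agent names and the data layout are all made up for illustration. It assumes a heartbeat reaches the RMS only when at least one of the agent's gateways is up and the shared FW and MS are up as well.

```python
def flagged_agents(agents, gateways_up, firewall_up, ms_up):
    """Return agents whose heartbeats cannot reach the RMS.

    agents: agent name -> list of gateways it may use (primary first).
    gateways_up: gateway name -> True if that gateway is running.
    firewall_up / ms_up: state of the shared firewall and management server.
    All names here are hypothetical; this is a toy model only.
    """
    flagged = []
    for name, gateways in agents.items():
        # A heartbeat gets through only if some assigned gateway is up
        # AND the shared firewall and management server are up as well.
        path_ok = firewall_up and ms_up and any(
            gateways_up.get(gw, False) for gw in gateways
        )
        if not path_ok:
            flagged.append(name)
    return sorted(flagged)


agents = {"dmz-web01": ["GW1", "GW2"], "dmz-web02": ["GW1", "GW2"]}

# GW1 fails and GW2 takes over: nothing is flagged.
print(flagged_agents(agents, {"GW1": False, "GW2": True}, True, True))
# The firewall fails: both agents are flagged although they are running.
print(flagged_agents(agents, {"GW1": True, "GW2": True}, False, True))
```

The first call prints an empty list and the second prints both agents, which is exactly the case that a second gateway cannot cover.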

Option 2: We need to create an override for the 2 monitors “Computer not reachable” and “Health Service Heartbeat Failure”, and create a rule on the GW server that will catch an event when a monitored server is down.

1. Create a group that contains all Health Service Watcher (agent) objects in the DMZ. In my case it was easy: I just needed to exclude all agents from my internal domains.

hb_2

2. Go to the Authoring pane, search for the 2 monitors “Computer not reachable” and “Health Service Heartbeat Failure”, and set an override for the group created in step 1.

hb_3

hb_4

3. Create an event rule that catches the following event and assign the rule only to the GW server.

hb_5

In the end you will have 2 overridden monitors and a new event rule.

Overrides
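The group in step 1 is built by exclusion: everything that is not in an internal domain is treated as a DMZ agent. As a rough illustration of that membership rule (outside OpsMgr, with made-up host and domain names), the same filter can be sketched in Python like this:

```python
def dmz_watchers(agent_fqdns, internal_domains):
    """Keep only agents whose FQDN is outside every internal domain.

    Both arguments are plain lists of strings; the names used below are
    hypothetical examples, not taken from a real environment.
    """
    # Normalise each domain to a ".domain" suffix so that "corp.example"
    # does not accidentally match "notcorp.example".
    suffixes = tuple("." + d.lower().lstrip(".") for d in internal_domains)
    return [fqdn for fqdn in agent_fqdns if not fqdn.lower().endswith(suffixes)]


agents = ["web01.dmz.example", "sql01.corp.example", "APP02.CORP.EXAMPLE"]
print(dmz_watchers(agents, ["corp.example"]))
```

Only web01.dmz.example is left in the result. The comparison is case-insensitive, because agent names often appear in mixed case.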

Hope this will help you to lower the number of false notification alerts.

Categories: OpsMgr

Is my Operations Manager environment healthy?

07/02/2011

Update – 17/05/2011 – I changed the query to show only agents that aren’t in maintenance mode.

After several OpsMgr deployments at different customers I discovered that most of us are concerned about OpsMgr Management Server health and/or database health, but what about agent health? How can we tell if all our agents report to their management servers?

In one of my previous posts I wrote about WMI and agent health. The hotfixes listed there are not for Operations Manager itself; they fix bugs that were discovered in operating system components that the monitoring solution uses, like WMI. So for agent reliability it’s important to keep the OS updated.

Sometimes agents seem to be healthy, and the Operations Manager MP reports them as such, but when we start to investigate we can see that some agents have not reported any new alerts for a long time. The Operations Manager event log on the client side seems healthy and shows no errors. But if we look closer we can see that several events are recurring rapidly (6022, 21025).

Since OpsMgr gives us the ability to monitor agent availability, performance and configuration, we need to find a way to discover whether agents are really writing data to the OperationsManager DB. I decided to search for the right approach.
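If you export the agent's Operations Manager event log, spotting this pattern can be automated. The following is a minimal Python sketch under stated assumptions: the events are already available as (timestamp, event ID) pairs, and the one-hour window and threshold of 10 are arbitrary illustrative values, not OpsMgr defaults.

```python
from collections import Counter
from datetime import datetime, timedelta

WATCHED_IDS = {6022, 21025}  # the event IDs mentioned above


def recurring_events(events, now, window=timedelta(hours=1), threshold=10):
    """Return watched event IDs seen at least `threshold` times in `window`.

    events: iterable of (timestamp, event_id) pairs, however they were
    exported. `window` and `threshold` are illustrative assumptions.
    """
    counts = Counter(
        event_id
        for ts, event_id in events
        if event_id in WATCHED_IDS and now - window <= ts <= now
    )
    return {event_id: n for event_id, n in counts.items() if n >= threshold}
```

An empty result means neither event crossed the threshold in the window. A non-empty result is a hint to look closer at that agent, not proof that it is broken.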

Daniele Grandini wrote two SQL queries that check the OperationsManager database for agents that did not collect event or performance data in the last 4 hours. I decided to monitor only the performance data collection, since we do use the HealthService and MonitoringHost CPU usage rules. If you have disabled these rules, the query can be changed to target any monitoring class id (like operating system).

Thanks to a colleague of mine, Evyatar Nezer, we rewrote the query to create a scheduled report using Operations Manager Reporting Services. The report shows us all the agents that are not in good health.
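The logic behind the check is simple, and a small Python sketch may make it easier to follow. This is not Daniele's SQL and it does not touch the database: it assumes you already have, for each agent, the time of its newest performance sample, and the agent names are invented. It also skips agents in maintenance mode, as the updated query does.

```python
from datetime import datetime, timedelta


def stale_agents(last_sample, now, max_age=timedelta(hours=4), in_maintenance=()):
    """Return agents with no performance sample newer than `max_age`.

    last_sample: agent name -> datetime of its newest sample, or None if
    no sample was ever stored. in_maintenance: agent names to skip.
    The data layout is a hypothetical stand-in for the query result.
    """
    skipped = set(in_maintenance)
    stale = []
    for agent, newest in last_sample.items():
        if agent in skipped:
            continue  # maintenance mode: silence is expected
        if newest is None or now - newest > max_age:
            stale.append(agent)
    return sorted(stale)


now = datetime(2011, 5, 17, 12, 0)
samples = {
    "srv01": now - timedelta(minutes=10),  # healthy
    "srv02": now - timedelta(hours=6),     # stale
    "srv03": None,                         # never reported
    "srv04": now - timedelta(hours=9),     # stale, but in maintenance
}
print(stale_agents(samples, now, in_maintenance=["srv04"]))
```

With this sample data the function reports srv02 and srv03, and leaves out srv04 because it is in maintenance mode.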

clip_image040

To fix these issues I found that we need to take the following actions (client side only):

1. Restart the HealthService.

2. Rebuild the perf registry strings and info using LODCTR.EXE /R (the switch is case sensitive!) – this step should be done with extra care!
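If you need to repeat these two steps on many servers, they can be wrapped in a small script. The sketch below is in Python and is only one possible way to do it: it assumes the script runs locally on the agent in an elevated session, and that the service is stopped and started with `net stop` / `net start`. It defaults to a dry run that only prints the commands, because rebuilding the counters should be done with care.

```python
import subprocess

# The two steps above, in the same order. "HealthService" is the agent's
# service name and the /R switch is written in upper case on purpose.
REPAIR_COMMANDS = [
    ["net", "stop", "HealthService"],
    ["net", "start", "HealthService"],
    ["lodctr", "/R"],
]


def repair_agent(dry_run=True):
    """Run (or, by default, only list) the repair commands.

    Returns the list of commands that were handled.
    """
    handled = []
    for command in REPAIR_COMMANDS:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError on the first failure,
            # so later steps do not run after a failed one.
            subprocess.run(command, check=True)
        handled.append(command)
    return handled


repair_agent()  # dry run: prints the three commands without executing them
```

Call `repair_agent(dry_run=False)` only after you have tested the steps by hand on a non-critical server.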

Here are the steps taken to create the scheduled report.

First we need to note which SQL server hosts the OperationsManager DB. This can be done using REGEDIT.EXE against the RMS, as shown below.

image

The next step uses Microsoft Visual Studio report builder. Open the application and under new project select the following:

clip_image004

Enter the OpsMgr database server name and choose the OperationsManager database.

clip_image006

clip_image008

clip_image010

Paste the query (click here to download) into the next screen and continue as described in the following screenshots.

clip_image012

clip_image014

clip_image016

clip_image018

clip_image020

clip_image022

And finally we have the report we are looking for. Now it's time to add the report to the Operations Manager Reporting pane.

clip_image024

Click on File (there is no need to save the file first), then Open -> File, as shown.

clip_image026

Right-click the RDL file in the Open File dialog and click Copy.

clip_image028

Now go to the Reporting Services web site, create a new folder and name it “MYCOMP Custom Reports”.

clip_image030

Enter the folder that you created in the previous step and click Upload File

clip_image032

Give the new report a name and click the browse button

clip_image034

Choose the RDL file that you copied earlier and click open

clip_image036

And that’s it. We have a report that shows us all the agents that failed to collect performance data in the last 4 hours.

clip_image038

Categories: OpsMgr

Operations Manager 2007 agents and WMI

08/09/2010

The Operations Manager 2007 R2 agent is a service that very frequently uses Windows Management Instrumentation (WMI) to query information about the computer it runs on. The agent uses WMI to discover the server roles and to run scripts that verify the health of the server.
To improve agent reliability it’s important to keep the WMI repository healthy. To help with that, Microsoft released several updates.

Windows Server 2003:

KB933061 – This hotfix is available for Windows Management Instrumentation (WMI) in Microsoft Windows Server 2003. This hotfix improves the stability of the WMI repository. The hotfix includes many of the improvements that are available in WMI in Windows Vista.
Generally, corruption issues can be defined as cases in which the following conditions are true:

  • Data that is expected in the repository is missing.
  • The missing data cannot be retrieved or added back to the repository

x86 – http://www.microsoft.com/downloads/details.aspx?FamilyId=94CE776E-A4DA-4937-B2FA-3EC16495222E&displaylang=en

x64 – http://www.microsoft.com/downloads/details.aspx?FamilyId=7D75CC01-7673-4884-ADF8-6AFB7E598D42&displaylang=en

KB950681 – WMI and a %PATH% longer than 1024 characters on Windows Server 2003

KB955360 – CSCRIPT 5.7, so that scripts that use WMI run with fewer errors and failures.
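To find out whether a server is even affected by the %PATH% issue in KB950681, you can simply measure the variable. Here is a minimal Python sketch: the 1024 limit comes from the KB title above, and the assumption is that the expanded PATH seen by the script is the one that matters. That may differ from what the WMI service itself sees.

```python
import os

PATH_LIMIT = 1024  # length named in the KB950681 entry above


def path_too_long(path_value=None, limit=PATH_LIMIT):
    """Return True if the PATH string is longer than `limit` characters.

    With no argument, the PATH of the current process is checked.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return len(path_value) > limit


print("PATH length:", len(os.environ.get("PATH", "")))
print("over the limit:", path_too_long())
```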

Windows Server 2008 R2

There are 2 WMI hotfixes for the 2008 R2 OS: one fixes a memory leak, and the other addresses problems the agent has when querying a failover cluster.

An application or service that queries information about a failover cluster by using the WMI provider may experience low performance or a time-out exception.
http://support.microsoft.com/kb/974930

The "Win32_Service" WMI class leaks memory in Windows Server 2008 R2.
http://support.microsoft.com/kb/981314

I recommend applying these hotfixes to all agents, BUT you must first test them in your environment!

I also recommend the following hotfixes:

KB981263 – Management servers or assigned agents unexpectedly appear as unavailable in the Operations Manager console in Windows Server 2003 or Windows Server 2008. This issue occurs because the database that is used by the health state is corrupted. This database corruption is caused by an issue in the storage engine of the jet database that is hosted on Windows.

Categories: OpsMgr