Performance Monitoring: The Diagnostic Lifecycle

Many IT departments and administrators struggle to answer questions such as: Is your current infrastructure still able to handle today's load and deliver the same performance as years ago? What is the maximum capacity your servers can handle? Where are the bottlenecks or pain points in your systems' or applications' performance today? What is the long-term growth and usage of your application servers? If you cannot answer half of these questions with data from a monitoring system, it would be a good idea to set up fine-tuned diagnostics tooling for your system administrators. This will improve your capacity planning and help you identify bottlenecks such as disk latencies, slow IIS web calls, and SQL deadlocks.


Diagnostic enterprise tools (DETs) such as Idera, New Relic, or Redgate provide excellent reporting information to your (SQL) system administrators. The software, however, comes at a price. If you consider using a DET, think about how you would scale it to cover other, non-specific application server parameters as well. For example, your SQL analytics tool won't be able to measure host performance for BizTalk servers or analyze IIS performance. Think about how to tackle that as well!

Collect. Measure. Analyze.

Define the set of applications you’re using and research the possible parameters that can be collected. Which parameters are of interest to you? Your application team? Your network team? Before you know it, you’ll end up with an extensive list of interesting values!

To illustrate this, I’ve included a short overview of interesting parameters in different domains as well as a couple of examples.

  • System: disk write queue length, available system memory, CPU usage
  • SQL: deadlocks, latch wait times, recompilations
  • BizTalk: host throttling, message delivery delay, host queue length
  • IIS: average response times, application pool recycles
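
Once you have such a list, it can be turned into a PerfMon data collector set with the built-in logman utility. Below is a minimal sketch; the collector name, sample interval, and output path are placeholders of my own, and the exact counter paths (especially for BizTalk and IIS) depend on the roles installed on your servers.

```powershell
# Create a PerfMon data collector set named "WeeklyBaseline" (hypothetical name),
# sampling every 15 seconds and writing a binary .blg log.
logman create counter WeeklyBaseline -si 00:00:15 -f bin -o "C:\PerfLogs\WeeklyBaseline" -c `
    "\PhysicalDisk(_Total)\Avg. Disk Write Queue Length" `
    "\Memory\Available MBytes" `
    "\Processor(_Total)\% Processor Time" `
    "\SQLServer:Locks(_Total)\Number of Deadlocks/sec" `
    "\SQLServer:Latches\Average Latch Wait Time (ms)" `
    "\SQLServer:SQL Statistics\SQL Re-Compilations/sec" `
    "\Web Service(_Total)\Current Connections"

# Start and stop the collection manually, or via a schedule (more on this below).
logman start WeeklyBaseline
logman stop WeeklyBaseline
```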

How do you build a diagnostics lifecycle?

Since we are mostly interested in trends, there is no need to measure all of these values day in and day out. Identify a time slot or day of the week in which your enterprise processes the most data. If this peak happens on a Friday, stick to that day and collect all data every Friday: collecting data on the same recurring day gives insight into the trends that emerge. I recommend collecting data at least once a week, which allows you to detect long-term, short-term, and recent variations.
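
If Friday is your peak day, a scheduled task can start and stop the collector for you. A rough sketch with schtasks, reusing the hypothetical WeeklyBaseline collector from above; the task names and times are placeholders around your own peak window.

```powershell
# Start the collector every Friday at 07:00 and stop it again at 19:00.
schtasks /Create /TN "StartWeeklyBaseline" /SC WEEKLY /D FRI /ST 07:00 `
    /TR "logman start WeeklyBaseline"
schtasks /Create /TN "StopWeeklyBaseline" /SC WEEKLY /D FRI /ST 19:00 `
    /TR "logman stop WeeklyBaseline"
```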

Figure: The diagnostics lifecycle helps you identify trends.

The raw output provides too much detail to spot any remarkable or insightful trends. To interpret the results properly, it is highly recommended to parse the raw output into a human-readable file, preferably with graphics and charts.

The resulting spikes and/or other irregularities can be investigated by the different teams in your enterprise.

I recommend storing the reports as long as possible, since they can be used to investigate results over the long term. Always try to refer back to one week, three months, or one year earlier. This gives you a relevant and useful picture of the previous state.
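
To make those look-backs painless, a date-stamped archive goes a long way. A minimal sketch; the share name and folder layout are assumptions, not a prescribed convention.

```powershell
# Archive this week's logs under a date-stamped folder so that
# "one week / three months / one year ago" is easy to find later.
$archiveRoot = "\\REPORTSRV\PerfReports"          # hypothetical share
$folder      = Join-Path $archiveRoot (Get-Date -Format "yyyy-MM-dd")
New-Item -ItemType Directory -Path $folder -Force | Out-Null
Copy-Item "C:\PerfLogs\WeeklyBaseline*.blg" $folder
```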

Windows’ PerfMon and PAL

PerfMon

Fortunately, Microsoft offers an excellent tool that can assist you: PerfMon. It is the standard for collecting performance monitoring data of any kind in a Microsoft Windows environment. On top of that, it also supports custom application counters, although some customization might be required. The tool captures data points every few seconds and saves the results into a .blg file. Interesting fact: many paid software vendors focusing on capturing system and application values use PerfMon under the hood as well.
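
For a quick ad-hoc capture, the typeperf command that ships with Windows writes the same .blg format. A small sketch; the counter, interval, sample count, and output path are arbitrary choices for illustration.

```powershell
# Sample available memory every 15 seconds, 240 times (one hour),
# and write the result to a binary .blg log.
typeperf "\Memory\Available MBytes" -si 15 -sc 240 -f BIN -o "C:\PerfLogs\adhoc.blg"
```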

Figure: Remote PerfMon connection.

Depending on the configuration, capturing data points consumes some system resources. I therefore recommend configuring PerfMon on a remote reporting or management server. This gives you all the results on one machine, where you can also analyze them (more on this later). PerfMon compiles a .blg file and as such provides you with extremely detailed information on each configured counter.
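
Counter paths can point at remote machines, so a collector on your management server can pull values from all monitored hosts into one place. A sketch, assuming hypothetical host names APPSRV01 and SQLSRV01, and assuming your firewall and remote registry settings permit remote performance monitoring.

```powershell
# Collect counters from remote servers into one local .blg file
# (hypothetical collector and host names).
logman create counter RemoteBaseline -si 00:00:15 -f bin -o "C:\PerfLogs\RemoteBaseline" -c `
    "\\APPSRV01\Processor(_Total)\% Processor Time" `
    "\\SQLSRV01\SQLServer:Locks(_Total)\Number of Deadlocks/sec"
logman start RemoteBaseline
```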

Performance Analysis of Logs (PAL)

Additionally, you can parse the log with the relog command into a .csv file and create your own graphics and maximum thresholds. This is, however, a very time-consuming job. That's why the community created an open-source tool to make life easier: the Performance Analysis of Logs (PAL) tool analyzes the performance counters in a .blg file against known thresholds and produces an HTML report. Note that this tool requires heavy CPU and memory processing, so you definitely don't want to run it on your production environment.
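
The relog conversion itself is a one-liner. The PAL invocation below is only a sketch of its PowerShell entry point; the threshold file name is one of the examples that ships with PAL, and parameter names can differ between PAL versions, so verify them against the PAL.ps1 you download.

```powershell
# Convert the binary log to CSV for your own charts...
relog "C:\PerfLogs\WeeklyBaseline.blg" -f CSV -o "C:\PerfLogs\WeeklyBaseline.csv"

# ...or let PAL analyze the .blg against a threshold file and emit an HTML report.
# Paths are placeholders; run this on a non-production box.
.\PAL.ps1 -Log "C:\PerfLogs\WeeklyBaseline.blg" `
    -ThresholdFile ".\QuickSystemOverview.xml" `
    -Interval 'AUTO' `
    -IsOutputHtml $true -HtmlOutputFileName "WeeklyBaseline.htm" `
    -OutputDir "C:\PerfLogs\Reports"
```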

The tool can also be called from the command line, which I highly recommend if you process many files, longer periods, or multiple scheduled runs. A good approach is to launch the open-source PAL tool immediately after PerfMon has completed its collection.
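
Chained together, a single scheduled script can stop the collector, hand the fresh log to PAL, and drop the report in your archive. A rough end-to-end sketch, reusing the hypothetical names from the previous snippets (and the same PAL parameter caveats).

```powershell
# Hypothetical end-of-collection script: stop the collector, then analyze.
logman stop WeeklyBaseline

# Pick up the newest .blg the collector produced.
$log = Get-ChildItem "C:\PerfLogs\WeeklyBaseline*.blg" |
    Sort-Object LastWriteTime | Select-Object -Last 1

# Run PAL against it and write the HTML report to the report folder.
.\PAL.ps1 -Log $log.FullName -ThresholdFile ".\QuickSystemOverview.xml" `
    -Interval 'AUTO' -IsOutputHtml $true -OutputDir "C:\PerfLogs\Reports"
```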

Figure: Diagnostics collection and analysis.

Don’t fall behind!

This blog post has hopefully given you some insight into how operators and system administrators can improve their system analysis with a more structured approach. Having a process in place to check different parameters and compare them with past results is not only essential, it also adds value to your enterprise.

Even smaller teams and companies can now find various tools on the market, including the excellent tooling that ships with the Microsoft Windows platform. In summary, having a lifecycle in place benefits everyone!