Factors to Consider When Building or Evaluating Monitoring Tools
In this post, I'll walk you through some essential factors to consider when building or evaluating successful monitoring tools. While the focus is on monitoring products, the general principles apply to most B2B products. In the absence of a universally agreed-upon definition of success, I will define it here as achieving good adoption, a high NPS, and generating substantial revenue.
NOTE: I have avoided using the term observability in this post. More on this here.
I am not going to be pedantic about the definition of a monitoring tool. In general, monitoring tools keep an eye on various aspects of your infrastructure, applications, and services. They help ensure everything runs smoothly and efficiently by providing critical insights and alerts. Traditionally, the four pillars of monitoring are considered to be metrics, events, logs, and traces. Examples of such products include Datadog, Dynatrace, New Relic, and Grafana.
Let’s dive right in:
Time to Value
One of the most crucial factors for retaining users of a monitoring solution is how quickly they can derive value from it. Here are a few key elements that contribute to this:
Easy Installation and Setup: Users should not have to jump through hoops (copying multiple API keys, navigating complex configurations, or entering credit card information) just to start testing the tool. Open-source (OSS) tools often have the edge here, but many proprietary tools offer free trial periods to ease the initial setup.
Personalized Onboarding: A well-designed onboarding process should automatically configure the right settings, alerts, and dashboards.
For instance, Datadog excels in this area by asking the right questions during setup so that users can quickly get to the meat of the product.
Scalability and Performance
Scalability and performance are particularly critical for large enterprises running thousands or even millions of applications. Many tools perform well with sample applications but falter under the strain of high cardinality and high-dimensional metrics. Your monitoring tool should:
Use significantly fewer resources than the applications it monitors.
Provide timely signals to detect downtime, even under high load.
Last9 is a monitoring tool that specifically leans into the high-cardinality niche.
When evaluating tools, stress-test them with your own business use cases, as most tools do not publicly share performance metrics. I found one blog post comparing a few logging tools.
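To see why a tool that handles a sample application can still struggle in production, it helps to estimate how many distinct time series a single metric can produce. The sketch below assumes the usual upper bound: the product of the number of distinct values per label. The label names and counts are hypothetical, chosen only for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Each unique combination of label values is its own series, so the
    bound is the product of the distinct-value counts of every label.
    Real counts are usually lower because not every combination occurs.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets: a small demo app vs. a large production fleet.
sample_app = {"service": 5, "endpoint": 10, "status_code": 5}
production = {"service": 200, "endpoint": 50, "status_code": 5, "pod": 1000}

print(series_count(sample_app))   # 250
print(series_count(production))   # 50000000
```

Adding a single high-cardinality label (here, pod) multiplies the series count, which is why a stress test should use label sets that resemble your real workload rather than a demo setup.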
Extensibility
While seamless setup and installation are important, companies have diverse needs and use cases. A monitoring tool should:
Integrate easily with other systems, such as incident management and communication tools (PagerDuty, Slack).
Be customizable, allowing users to create custom dashboards and configurations programmatically.
Grafana Labs is a standout example here. Its visualization tools work with almost any data source, and users can create custom plugins for their own data sources.
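As a rough illustration of what "creating dashboards programmatically" can look like, the sketch below generates a dashboard definition from a list of services and serializes it to JSON. The field names (title, panels, metric, filter) and the metric name are hypothetical and do not follow any particular vendor's schema; a real tool would define its own format and an API or file location to load it from.

```python
import json


def build_dashboard(title: str, services: list[str]) -> dict:
    """Build a dashboard definition with one error-rate panel per service.

    The structure is a made-up, simplified schema for illustration only.
    """
    panels = [
        {
            "title": f"{service} error rate",
            "metric": "http_errors_total",      # hypothetical metric name
            "filter": {"service": service},
        }
        for service in services
    ]
    return {"title": title, "panels": panels}


dashboard = build_dashboard("Checkout overview", ["cart", "payments"])
print(json.dumps(dashboard, indent=2))
```

The appeal of this approach is that dashboards can live in version control and be regenerated when services are added, instead of being rebuilt by hand in a UI.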
Pricing
Pricing is one of the toughest aspects to get right. Here are some tips:
Align Pricing with Customer Value: Collecting more metrics does not necessarily mean more value. Charging based on the volume of metrics or logs is straightforward, but it is often problematic for customers because the bill grows with data volume rather than with the value they receive.
Avoid Per-Node/Pod Pricing: In a cloud-native world, charging by the number of nodes or pods can create conflicting incentives.
Keep Pricing Simple and Predictable: Even if it means leaving some money on the table, simplicity and predictability can win over more customers.
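The predictability point can be made concrete with a little arithmetic. The sketch below compares a purely volume-based bill with a flat fee that includes a generous allowance. All prices, allowances, and ingest volumes are invented for illustration and do not reflect any vendor's actual pricing.

```python
def volume_bill(gb_ingested: float, price_per_gb: float) -> float:
    """Pure usage-based bill: cost scales directly with data volume."""
    return gb_ingested * price_per_gb


def flat_bill(gb_ingested: float, monthly_fee: float,
              included_gb: float, overage_per_gb: float) -> float:
    """Flat fee with an included allowance; overage applies only past it."""
    overage_gb = max(0.0, gb_ingested - included_gb)
    return monthly_fee + overage_gb * overage_per_gb


# Hypothetical monthly log ingest in GB; the third month includes an
# incident during which log volume spikes.
monthly_ingest_gb = [1000, 1100, 4000, 1050]

volume_bills = [volume_bill(gb, price_per_gb=0.5) for gb in monthly_ingest_gb]
flat_bills = [
    flat_bill(gb, monthly_fee=700, included_gb=5000, overage_per_gb=0.5)
    for gb in monthly_ingest_gb
]

print("volume-based:", volume_bills)
print("flat tier:   ", flat_bills)
print("swing (max - min):",
      max(volume_bills) - min(volume_bills),
      "vs",
      max(flat_bills) - min(flat_bills))
```

With these made-up numbers, the volume-based bill quadruples in the incident month, exactly when the customer needs the tool most, while the flat tier stays constant. The totals depend entirely on the assumed prices; the point is the variance, not which model is cheaper.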
UI/UX
Most monitoring tools offer similar functionality, so the user experience can be a key differentiator (every tool claims to be a ‘single pane of glass’). Tools that let users quickly create dashboards that are easy to interpret and use can make a significant impact. Grafana has set the gold standard for visualization, allowing users to visualize and correlate different signals in a single view. Check out some of their sample dashboards here.
Additional Considerations
There are a few other attributes that are important but, in my opinion, not deal breakers:
Vendor Neutrality: While many products claim to be neutral, switching tools can incur unforeseen costs. Starting with a vendor-neutral open-source standard like OpenTelemetry is advisable.
ROI: Measuring metrics like MTTR (Mean Time to Repair) and MTTM (Mean Time to Mitigate) can be challenging. The true value of monitoring extends beyond ROI, improving developer productivity, user experience, and ultimately, customer satisfaction.
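For reference, the arithmetic behind MTTR and MTTM is simple once incident timestamps exist; the challenging part is recording those timestamps consistently. The sketch below assumes each incident has a detection, mitigation, and repair time, and measures both metrics from the moment of detection. The Incident type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected: datetime
    mitigated: datetime   # user impact stopped
    repaired: datetime    # underlying cause fixed


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20), datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 10), datetime(2024, 2, 9, 14, 30)),
    Incident(datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 2, 9, 30), datetime(2024, 3, 2, 10, 30)),
]

mttm = mean_duration([i.mitigated - i.detected for i in incidents])
mttr = mean_duration([i.repaired - i.detected for i in incidents])

print("MTTM:", mttm)   # 0:20:00
print("MTTR:", mttr)   # 1:00:00
```

Teams differ on where the clock starts (when the fault began, when it was detected, or when someone was paged), which is one reason these numbers are hard to compare across organizations.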
In my upcoming posts, I will review individual monitoring tools and share what I find works (and what doesn’t). I would love to hear your thoughts and experiences!