This GigaOm Research Reprint Expires: Apr 9, 2023

GigaOm Radar for Cloud Observability Solutions v2.0

1. Summary

Enterprises of every size are moving applications and infrastructure to the cloud. Maintaining operational awareness is difficult enough within a single cloud environment; when multiple cloud vendors are involved, a robust cloud observability solution becomes essential.

Cloud observability encompasses monitoring, performance measurement, reporting, and predictive analytics. As Figure 1 shows, it not only promotes operational awareness, it also contributes to both business awareness and IT awareness. Cloud observability solutions are tuned to the volatile nature of cloud environments, able to respond to environment changes based on application needs, finances, and performance fluctuations. Without a cloud awareness platform, balancing multifaceted requirements in the cloud can be difficult.

Taking control of cloud awareness means evaluating which cloud (or clouds) are in use, what the business expectations are, and the abilities of the operations and DevOps teams.

Figure 1. Operational Awareness

Many buyers are organizations transitioning from on-site development and infrastructure to the cloud. In large organizations, private clouds often exist to satisfy security or governance requirements. Solutions must include public and private cloud observability and insights into any on-site infrastructure. These organizations typically look to the more platform-oriented solution providers.

Another type of buyer is the small-to-medium-sized business (SMB). Here all software often resides in a cloud or multicloud environment. These buyers are less interested in on-site infrastructure and applications development and more focused on cloud operations and DevOps in cloud environments.

Yet another type of buyer includes startups and organizations with 100% of their operations in the cloud. These organizations need observability tools that focus on the specific clouds in use. An observability tool directly from a single cloud provider may be optimal.

This GigaOm Radar report assesses 21 cloud observability solutions, applying evaluations of specific key criteria and evaluation metrics defined in the GigaOm report “Key Criteria for Evaluating Cloud Observability Solutions.”

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:
  • Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics (such as scalability, performance, and TCO) that drive purchase decisions.
  • GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.
  • Solution Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

2. Market Categories and Deployment Types

To better understand the market and vendor positioning, we assess how well solutions for cloud observability are positioned to serve specific market segments.

  • Startups: These companies often use a single cloud provider and may have limited funding for more extensive solutions. Optimally, a long-term trial or vendors with a program targeted at startups may be the best solution. If rapid expansion is possible, the ability to scale is a consideration.
  • SMB: In this category, we assess solutions on their ability to meet the needs of organizations ranging from small businesses to medium-sized companies. Also assessed are departmental use cases in large enterprises, where ease of use and deployment are more important than extensive management functionality, data mobility, and feature set.
  • Large enterprise: Here offerings are assessed on their ability to support large and business-critical projects. Optimal solutions in this category will have a strong focus on flexibility, performance, data services, and features that improve security and data protection. Scalability is another big differentiator, as is the ability to deploy the same service in different environments.

In addition, we recognize three deployment models for solutions in this report: public cloud/software as a service (SaaS), private cloud/on-premises, and private and public clouds/hybrid.

  • Public cloud (SaaS): Refers to solutions available only in the cloud. Often designed, deployed, and managed by the service provider, they are available only from that specific provider. The big advantage of this type of solution is the integration with other services offered by the cloud service provider (functions, for example) and its simplicity.
  • Private cloud (on-premises): These solutions involve on-premises installation for private cloud applications.
  • Public and private clouds (hybrid): These solutions are meant to be installed both in the cloud and on-premises, allowing users to build hybrid or multicloud storage infrastructures. Integration with a single cloud provider may be more limited than with the other options, and these solutions can be more complex to deploy and manage. In exchange, they are more flexible, with more control over the entire stack, including resource allocation and tuning. They can be deployed as virtual appliances, like a traditional network-attached storage (NAS) filer but in the cloud, or as a software component on a Linux VM (that is, a file system).

Table 1. Vendor Positioning

Market Segment: Startup, SMB, Enterprise
Deployment Model: Public Cloud (SaaS), Private Cloud (On-Premises), Public & Private Clouds (Hybrid)
Amazon
Broadcom
Cisco
Datadog
Dynatrace
Elastic
Google
Grafana
IBM
LogicMonitor
Logz.io
Micro Focus
Microsoft
NetApp
New Relic
Oracle
Solarwinds
Splunk
Stackstate
Sumo Logic
VMware
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

3. Key Criteria Comparison

Building on the findings from the GigaOm report, “Key Criteria for Evaluating Cloud Observability Solutions,” Table 2 summarizes how each vendor included in this research performs in the areas that we consider differentiating and critical in this sector. Table 3 follows this summary with insight into each product’s evaluation metrics—the top-line characteristics that define the impact each will have on the organization. The objective is to give the reader a snapshot of the technical capabilities of available solutions, define the perimeter of the market landscape, and gauge the potential impact on the business.

Table 2. Key Criteria Comparison

Key Criteria

Vendor | Reporting & Dashboards | User Interaction Performance | Multicloud Resource View | Predictive Analysis | Intelligent Data Push
Amazon | 2 | 1 | 1 | 1 | 0
Broadcom | 2 | 2 | 2 | 2 | 2
Cisco | 3 | 3 | 3 | 2 | 3
Datadog | 3 | 3 | 2 | 3 | 2
Dynatrace | 3 | 3 | 3 | 3 | 3
Elastic | 2 | 2 | 2 | 2 | 2
Google | 2 | 2 | 1 | 2 | 1
Grafana | 3 | 2 | 2 | 0 | 1
IBM | 2 | 2 | 3 | 2 | 1
LogicMonitor | 2 | 2 | 2 | 2 | 3
Logz.io | 2 | 1 | 2 | 2 | 3
Micro Focus | 3 | 3 | 2 | 3 | 3
Microsoft | 2 | 2 | 0 | 2 | 2
NetApp | 3 | 1 | 3 | 2 | 2
New Relic | 3 | 3 | 3 | 2 | 2
Oracle | 2 | 2 | 2 | 2 | 2
Solarwinds | 2 | 2 | 2 | 2 | 2
Splunk | 3 | 3 | 3 | 3 | 2
Stackstate | 2 | 1 | 2 | 1 | 2
Sumo Logic | 2 | 2 | 2 | 2 | 2
VMware | 2 | 2 | 3 | 1 | 1

3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

Table 3. Evaluation Metrics Comparison

Evaluation Metrics

Vendor | Deployment Ease | Ease of Use | Microservices Detection | Number of Clouds Supported | Security
Amazon | 2 | 2 | 2 | 0 | 2
Broadcom | 1 | 1 | 2 | 3 | 3
Cisco | 2 | 2 | 3 | 3 | 3
Datadog | 3 | 2 | 3 | 3 | 3
Dynatrace | 3 | 2 | 3 | 3 | 3
Elastic | 2 | 2 | 2 | 3 | 2
Google | 2 | 1 | 2 | 1 | 2
Grafana | 1 | 1 | 2 | 3 | 2
IBM | 2 | 2 | 3 | 3 | 2
LogicMonitor | 2 | 2 | 2 | 3 | 2
Logz.io | 2 | 2 | 2 | 2 | 3
Micro Focus | 2 | 2 | 3 | 2 | 3
Microsoft | 2 | 2 | 2 | 1 | 2
NetApp | 3 | 2 | 2 | 3 | 2
New Relic | 2 | 3 | 3 | 3 | 2
Oracle | 2 | 2 | 2 | 2 | 2
Solarwinds | 2 | 2 | 2 | 3 | 3
Splunk | 2 | 2 | 3 | 3 | 3
Stackstate | 2 | 2 | 3 | 3 | 2
Sumo Logic | 2 | 2 | 2 | 3 | 2
VMware | 2 | 2 | 3 | 3 | 2

3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

By combining the information provided in the tables above, the reader can develop a clear understanding of the technical solutions available in the market.
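As a rough illustration of combining the two tables, the sketch below sums each vendor's key criteria and evaluation metric scores into a single total. The equal weighting is an assumption made for illustration and is not GigaOm's scoring methodology; the function names are hypothetical, and only three vendor rows are copied from the tables to keep the example short.

```python
# Scores copied from Table 2 (key criteria) and Table 3 (evaluation metrics).
# Only three vendors are included to keep the sketch short.
KEY_CRITERIA = {
    "Amazon":    [2, 1, 1, 1, 0],
    "Dynatrace": [3, 3, 3, 3, 3],
    "Grafana":   [3, 2, 2, 0, 1],
}
EVALUATION_METRICS = {
    "Amazon":    [2, 2, 2, 0, 2],
    "Dynatrace": [3, 2, 3, 3, 3],
    "Grafana":   [1, 1, 2, 3, 2],
}


def combined_scores(criteria, metrics):
    """Sum each vendor's scores across both tables (equal weights assumed)."""
    return {vendor: sum(criteria[vendor]) + sum(metrics[vendor]) for vendor in criteria}


def ranked(scores):
    """Return (vendor, total) pairs ordered from highest to lowest total."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for vendor, total in ranked(combined_scores(KEY_CRITERIA, EVALUATION_METRICS)):
        print(f"{vendor}: {total}")
```

A buyer would normally weight the columns by their own priorities (for example, counting multicloud resource view more heavily in a multicloud estate) rather than treating every column equally.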

4. GigaOm Radar

This report synthesizes the analysis of key criteria and their impact on evaluation metrics to inform the GigaOm Radar graphic in Figure 2. The resulting chart is a forward-looking perspective on all the vendors in this report, based on their products’ technical capabilities and feature sets.

The GigaOm Radar plots vendor solutions across a series of concentric rings, with those set closer to the center judged to be of higher overall value. The chart characterizes each vendor on two axes—balancing Maturity versus Innovation, and Feature Play versus Platform Play—while providing an arrow that projects each solution’s evolution over the coming 12 to 18 months.

Figure 2. GigaOm Radar for Cloud Observability

As you can see in the Radar chart in Figure 2, cloud providers and vendors providing solutions primarily based on open-source software are separated from the more platform-oriented vendors on the right.

Notice the crowding in the leadership circle. The Radar chart displays the complexity of existing cloud observability solutions. The number of vendors has increased significantly since our 2021 Radar report on Cloud Observability, with strong players remaining in their dominant positions and new vendors joining the leadership circle.

Differentiation among the platform-based leaders is often difficult to discern and depends on buying needs and existing infrastructure. Within this area, there are solutions that are SaaS-based only and those that can be installed on-site. Others are hybrid solutions with on-site components and a SaaS AI.

One vendor stands out as an innovator within a space with many mature players.

5. Vendor Insights

Amazon CloudWatch

AWS helps customers optimize application experiences through its full-stack observability solution, Amazon CloudWatch. By centralizing and correlating application performance analytics across the entire stack of applications—compute, storage, network, databases, etc.—together with end-user behavior (real or synthetic), customers can isolate issues and quickly recover, allowing them to undertake data-driven optimizations for improving application performance. In addition to being a robust monitoring solution, AWS’ full-stack observability solution captures telemetry data at the same rate as that generated by modern architectures. By storing, reliably scaling, and helping customers analyze the data, the solution helps deliver insights needed to understand the performance of applications as well as the underlying resources that support them.

Native integration with more than 70 AWS services such as Amazon EC2, Amazon DynamoDB, Amazon S3, Amazon ECS, Amazon EKS, and AWS Lambda provides wide coverage of AWS resources and applications. Detailed one-minute metrics and custom metrics are published automatically, with up to one-second granularity to allow drill-down into logs for additional context. The service can also be used across hybrid cloud and cloud environments by leveraging the CloudWatch Agent or API to monitor on-premises resources.

Automatic dashboards display up to 15 months’ worth of metrics data, which is the maximum that can be stored and retained. Reusable graphs can be created in a unified view with the ability to display metrics and log data side by side in a single dashboard. CloudWatch Logs Insights is charged by the number of queries run. Users are able to publish log-based metrics, create alarms, and correlate logs and metrics together in CloudWatch Dashboards.

Alarms can be set based on metric value thresholds or anomalous metric behavior based on ML algorithms. Monitoring can extend to the container ecosystem, including Amazon ECS, AWS Fargate, Amazon EKS, and Kubernetes.
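The difference between the two alarm styles can be sketched in a few lines. This is a generic illustration, not CloudWatch's API or its ML algorithm: a rolling mean and standard deviation band stands in for the anomaly model, and the function names, window size, and band width are assumptions.

```python
from statistics import mean, stdev


def threshold_alarm(values, threshold):
    """Return the indexes of datapoints above a fixed threshold."""
    return [i for i, value in enumerate(values) if value > threshold]


def anomaly_alarm(values, window=5, band=3.0):
    """Return the indexes of datapoints that fall outside
    mean +/- band * stdev of the preceding `window` datapoints."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Guard against a zero-width band when the history is perfectly flat.
        if abs(values[i] - mu) > band * max(sigma, 1e-9):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    cpu_percent = [40, 42, 41, 43, 42, 41, 75, 42, 43]
    print(threshold_alarm(cpu_percent, threshold=80))  # static limit of 80%
    print(anomaly_alarm(cpu_percent))                  # deviation from recent behavior
```

With this sample series, the spike to 75 stays under the static 80% limit but lies far outside the band learned from the preceding datapoints, which is the kind of case where anomaly-based alarms add value.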

Users can monitor end-user digital experience by collecting data from every layer of the performance stack, from front-end applications to the infrastructure. ServiceLens identifies performance bottlenecks in applications and isolates them using the correlated metrics, logs, and traces.

The Amazon CloudWatch agent is available as open source. Amazon CloudWatch also supports OpenTelemetry to enable full-stack observability within customer environments. AWS offers a production-ready, AWS-supported distribution of the OpenTelemetry project, AWS Distro for OpenTelemetry. AWS also supports full-stack observability with its Amazon Managed Service for Prometheus and Amazon Managed Grafana services.

Amazon CloudWatch deploys on AWS public cloud as a SaaS solution.

Strengths: Amazon CloudWatch provides excellent coverage of the Amazon cloud space and is a top-10 contributor to OpenTelemetry. Amazon supports startups via its AWS Activate program.

Challenges: Amazon CloudWatch currently does not address the key criteria of user interaction performance and predictive analysis.

Broadcom

Broadcom Inc. is best known as a global infrastructure technology company with 50 years of experience. Its roots are in AT&T/Bell Labs, Lucent, and Hewlett-Packard/Agilent. Broadcom focuses on technologies that connect the world. In 2021, it launched the Broadcom Software Group. In 2019, it introduced Automation.ai, an AI-driven platform.

Broadcom’s observability solution builds on its AIOps solution, AIOps from Broadcom. It provides full-stack observability of the digital experience, including mobile and web applications. Automation.ai monitors cloud-native architectures, hybrid infrastructures, and network services.

Automation.ai uses machine learning, analytics, and automation to provide visibility and data-driven insights. Broadcom provides a domain-centric and domain-agnostic AIOps solution, bringing together its observability data with data acquired from third-party sources.

AIOps and observability products comprise DX Application Performance Management (APM), DX Unified Infrastructure Management, DX NetOps, AppNeta, and DX Operational Intelligence.

DX Application Performance Management seamlessly integrates with the AIOps solution to provide AI/ML capabilities. DX APM automatically discovers traces and maps application dependencies. It can detect, discover, and monitor microservices and containers.

DX APM integrates with Runscope, BlazeMeter, and Jenkins to enable collaboration.

DX Unified Infrastructure Management monitors traditional data center, public cloud, and hybrid infrastructure environments. The open architecture provides full-stack observability and zero-touch configuration and has an HTML5 operations console. It provides actionable insights for cloud environments, such as AWS and Azure, and cloud services, including Nutanix, Hadoop, Mongo, and Apache. Included are out-of-the-box dashboards that can be customized or created from scratch, with the Dashboard Designer providing a reporting capability. The bundled CA Business Intelligence (CABI) solution provides additional out-of-the-box and custom reports.

Broadcom supports SaaS-based public cloud, private and public cloud, and private cloud deployments.

Strengths: The Broadcom solution is an excellent fit for enterprises already running applications from the Broadcom suite. Broadcom supports all of the deployment scenarios.

Challenges: Broadcom can be difficult to deploy in an environment without existing Broadcom applications. Professional services would be necessary for an enterprise converting from another management platform to Broadcom.

Cisco AppDynamics

Cisco moved into the observability market with its acquisitions of AppDynamics and ThousandEyes and the subsequent integration of its Intersight cloud operations platform. The AppDynamics solution is geared toward mid-sized to large enterprises and appeals to financial, retail, and IT service customers. The company has transitioned AppDynamics from an APM solution to an observability platform by adding cloud, network, and infrastructure monitoring capabilities. The solution can visualize revenue paths and correlate customer and application experience to find and fix application issues. It can also monitor errors using its Cognition Engine, isolate problematic domains, and identify root causes from snapshot data by scanning all instances of collected telemetry in the dependency tree using the Automated Transaction Diagnostic feature.

The Cognition Engine comprises a collection of machine learning algorithms that analyze transaction-based performance data across application topologies to provide visibility of application performance deviations and contextual insights.

The APM capability affords visibility down to the code level and into important transactions across multicloud environments. The infrastructure monitoring capability provides users a view of connections between applications and infrastructure, whether the application is hybrid cloud, multicloud, or on-premises.

Cisco AppDynamics can ingest data from its agents, and via open standards such as Prometheus and OpenTelemetry. It also supports public clouds such as AWS, consumes up to 450 billion metrics a day, and can handle structured and unstructured data. Its systems do not use sampling.

AppDynamics Business iQ provides business performance monitoring and observability throughout the technology stack. Business Journey Mapping shows contextual insights into how performance impacts applications’ key business transactions. Errors, crashes, network requests, page load details, and other metrics are captured automatically. Users can create visualizations of key business metrics across customer interactions with User Journey Dashboards.

Cloud-native visualization supports multicloud monitoring in both hybrid and native cloud environments with end-user monitoring. Real user monitoring (RUM), IoT, browser, and synthetic monitoring are included. Contextual alerts and AI-powered root cause analysis are available. AppDynamics monitors AWS, AWS Lambda, Azure, Docker, IBM, Kubernetes, OpenShift, Pivotal Cloud Foundry, and SAP and S/4 HANA environments.

Cisco AppDynamics has recently added support for pushing security events, via Cisco Secure Application, to Splunk SIEM to enable security investigation workflows.

Cisco AppDynamics is primarily deployed as a public cloud (SaaS) solution; however, it also supports on-site deployment.

Strengths: Cisco has good support for reporting and dashboards, with capabilities that include scheduled reports, out-of-the-box and custom reports, and custom dashboards. It includes RUM, IoT, browser, and synthetic monitoring. A multicloud resource view is available to monitor multiple clouds. There is support for OpenTelemetry, and Cisco actively contributes to the project.

Challenges: There is some weakness in the areas of predictive analysis, where strengthening the AI/ML would provide better operational awareness. The ability to push data to other data sinks, such as FinOps, could be stronger, but is possible with effort.

Datadog

Datadog was formed in 2010 to remove friction between developers and system administrators. Its growth is driven by a focus on automation and real-time observability. Launched as an infrastructure monitoring company, Datadog has expanded its portfolio via both acquisition and organic growth to offer solutions throughout the full observability space. Headquartered in New York City, it has regional headquarters in Boston, Dublin, Paris, Singapore, Sydney, and Tokyo, as well as offices across the U.S., Europe, and Asia Pacific.

Datadog’s SaaS-based observability and security solution is a single platform for metrics, traces, logs, events, and security signals from across the stack, automatically enriched with contextual metadata. It includes application performance monitoring, infrastructure monitoring, log management, digital experience monitoring, network monitoring, and security. These products are tightly integrated and serve several cross-platform features such as dashboards, alerts, SLOs, incident management, notebooks, and proactive and contextual machine learning capabilities.

Datadog’s reporting and dashboards provide real-time visualizations across sources. Customizations can be made interactively or by coding, and a library of visualization tools and drag-and-drop widgets is available. Support includes rates, ratios, averages, and integrals. Dashboards can be auto-generated or created from templates. Geomap graphs, heatmaps, stacked graphs, and top-lists are included. Built-in collaboration allows dashboards to be shared.

More than 500 vendor-backed APIs integrate cloud providers, including AWS, Azure, GCP, Alibaba, Oracle, and technologies such as Kubernetes and serverless platforms. Included are container auto-discovery and a single view of infrastructure components and performance. Application performance monitoring includes distributed tracing, browser RUM with session replay, mobile RUM, front-end and back-end error tracking, synthetic monitoring and testing, continuous profiler, database monitoring, and serverless monitoring.

The portfolio includes Datadog Cloud SIEM (part of the Datadog Cloud Security Platform), which provides threat detection capabilities. Threshold and anomaly detection rules are provided out-of-the-box, and custom rules can be created.

Datadog is a SaaS-based cloud solution with public, private, and hybrid options.

Strengths: With its built-in security monitoring capabilities, Datadog is able to send observational data to its Cloud SIEM product. Datadog is a major contributor to OpenTelemetry.

Challenges: Datadog has a strong background supporting SMBs and is returning to large enterprises at the behest of its customers.

Dynatrace

Dynatrace has built a solid reputation as a high-quality application performance management (APM) solution. It is now building on that reputation with its full observability platform based on the Davis AI, the company’s proprietary AI engine.

The Dynatrace platform includes APM, AIOps, infrastructure monitoring, digital business analytics, digital experience management (DEM), application security, and cloud automation for enterprise IT departments and digital businesses. Using automation in concert with the Davis AI engine, the Dynatrace platform provides root-cause details of application performance, generates insights into the underlying infrastructure, and presents an overview of the user experience. The system is designed to scale and operate in hybrid clouds, public clouds, or edge environments, as well as on-premises.

The platform deploys a single agent, OneAgent, that drops a single binary onto a host to automatically instrument not only containers running within the environment, but also processes and code running within the container, all without requiring any manual instrumentation or image modification.

The entire application topology is visualized through an interactive map called Smartscape, which also collects context metadata and captures the relationships and dependencies for all system components down to containers, infrastructure, and cloud, to build visualizations automatically. Process-to-process dependencies are visualized by capturing network communication data. The self-learning capabilities automatically identify performance anomalies and the AI engine, Davis, automatically performs root-cause analysis to determine the reasons for performance issues.
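To illustrate why a dependency map helps root-cause analysis, the sketch below takes a topology and a set of components showing anomalies, and keeps only those anomalous components that do not depend (directly or transitively) on another anomalous component. This is a simplified heuristic under assumed inputs, not the Davis engine's actual algorithm; the component names and function names are hypothetical.

```python
def reachable(depends_on, start):
    """Return every component that `start` depends on, directly or transitively."""
    seen, stack = set(), list(depends_on.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    return seen


def probable_root_causes(depends_on, anomalous):
    """Return anomalous components whose own dependencies are all healthy.

    Anomalies upstream of such a component are treated as symptoms of it.
    """
    anomalous = set(anomalous)
    return sorted(
        component
        for component in anomalous
        if not (reachable(depends_on, component) & anomalous)
    )


if __name__ == "__main__":
    # Hypothetical topology: each key calls the services in its list.
    topology = {
        "frontend": ["checkout", "search"],
        "checkout": ["payments-db"],
        "search": [],
        "payments-db": [],
    }
    print(probable_root_causes(topology, {"frontend", "checkout", "payments-db"}))
```

In this example three components look unhealthy, but the topology narrows attention to the database that the other two ultimately depend on.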

Synthetic monitoring also is supported with single-URL browser monitors, browser click paths, and HTTP monitors. Licensing is based on the consumption of synthetic actions and requests.

Dynatrace natively supports OpenTelemetry and actively contributes to the open-source project. OpenTelemetry data is integrated into the common data model, enabling automation, full-stack topology mapping, and causation-based AI analysis through Davis. OpenTelemetry data can be ingested via API or automatically captured via Dynatrace OneAgent code modules.

Dynatrace supports a SaaS-based cloud solution for public, private, and hybrid options.

Strengths: Dynatrace has outstanding capabilities in the key criteria of reporting and dashboard capabilities, user interaction performance, predictive analysis, and intelligent data push. Dynatrace is a major contributor to OpenTelemetry. Its roadmap for OpenTelemetry also puts it ahead of many of its competitors.

Challenges: Dynatrace has no capabilities in the area of federated, hierarchical, or edge AI/ML. Network monitoring has improved but is still based on application data, although SNMP support is available for on-site network devices.

Elastic

Elastic has a solid observability platform using the free and open ELK stack (Elasticsearch, Logstash, Kibana). The company has successfully layered usability and visibility on top of the stack, and its technology is used widely across enterprises as diverse as eBay, Wikipedia, Uber, and Netflix.

Elastic offers both on-site and cloud (AWS, Azure, and GCP) versions. This helps users create independent, hybrid-cloud, or multicloud variations of the solution as needed. This flexibility is particularly useful when an enterprise needs to start at one location (either on-premises or in the cloud) and quickly expand to other locations without creating siloed implementations or fragmenting the toolset.

Elastic provides visibility across cloud-native infrastructure and applications, including services, hosts, containers, Kubernetes pods, and serverless tiers. More than 200 out-of-the-box integrations are provided for common services and platforms, and an intuitive UI provides visibility into all infrastructure and applications.

Kibana provides ad hoc visualizations and analysis with support for dimensions, tags, cardinality, and fields. Attributes, hostname, IP address, and tags can all be used as search criteria.

Built-in machine learning helps find root causes by automatically correlating anomalies to downstream data and dependencies. Both supervised and unsupervised machine learning are supported: supervised learning enables classification and regression, while unsupervised learning enables anomaly and outlier detection.

Elastic uses agentless data ingestion leveraging native integrations within the cloud console, allowing users to import a variety of data such as logs, metrics, traces, and content from their ecosystem, including applications, endpoints, infrastructure, cloud, network, and workplace tools.

Elastic supports SaaS-based public cloud, private and public cloud, and private cloud deployments.

Elastic offers strong OpenTelemetry-based observability that helps enterprises observe the full stack, including APM, infrastructure, services, and the network, of both enterprise and cloud-based solutions. It even provides visibility into combined multicloud and hybrid cloud options within a single stack.

Strengths: Elastic has good capabilities across the key criteria of reporting and dashboards, user interaction performance, multicloud resource view, predictive analysis, and intelligent data push.

Challenges: Elastic is based on open-source code and Elastic Cloud can be deployed as a solution. Shops with strong in-house technical expertise will fare better than those without.

Google

Google provides a complete set of tools for observability, including Cloud Logging, Cloud Monitoring, and support for microservices using Google Kubernetes Engine (GKE).

Cloud Logging is a fully managed service that performs at scale and can ingest application and platform log data, as well as custom log data from GKE, VMs, and other services inside and outside of Google Cloud. Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud applications. It collects metrics, events, and metadata from Google Cloud services, hosted uptime probes, application instrumentation, and a variety of common application components. Managed Service for Prometheus is a fully managed Prometheus-compatible monitoring solution built on top of Cloud Monitoring.

Google Cloud’s operations management suite provides the same core platform that powers all internal and Google Cloud observability. With the addition of BindPlane from Blue Medora, metrics and logs from other clouds can be pushed into the Google Cloud open APIs. BindPlane comes at no additional cost for Google Cloud users.

Reporting and dashboards provide a range of metrics, with Cloud Monitoring populating dashboards based on the services and resources used. Custom dashboards can be created to chart data, display indicators, or display text including system and application metrics gathered by the Cloud Monitoring agent. Additional information is provided about system resources and applications running on Compute Engine instances and on Amazon Elastic Compute Cloud (Amazon EC2) instances, and the agent can be configured to collect information from third-party plug-ins such as Apache or Nginx web servers, and MongoDB or PostgreSQL databases.

Strengths: Google has good capabilities across the key criteria of reporting and dashboards (with extensive reports), user interaction performance, and predictive analysis, where Vertex AI provides the ability to perform predictive analytics using AI/ML.

Google provides excellent coverage of the Google cloud space and is a top-10 contributor to OpenTelemetry. Google does not provide a startup program; however, its pricing and deployment model allows startups to adopt its solution with ease.

Challenges: Google has poor capabilities across the key criteria of ease of use and multicloud resource view. There is no support for on-site private clouds or infrastructure; however, on-site support is available from Google partner observIQ.

Grafana Labs

Grafana Labs takes a unique approach to helping customers on their observability journey by offering plugins for all major observability solutions (Datadog, New Relic, Dynatrace, etc.), so customers can view their observability data together in a single Grafana dashboard. Grafana Labs offers both self-hosted and cloud-based options across its observability solutions.

Grafana Cloud is a composable observability platform, integrating metrics, traces, and logs with Grafana visualization. It takes advantage of open-source observability software, including Prometheus, Loki, and Tempo, with no requirement to install, maintain, and scale the observability stack. Getting up and running, according to Grafana, is extremely quick, requiring selection of the services to be monitored and installing the Prometheus-inspired agent in order to receive preconfigured alerts and dashboards. Grafana Cloud provides a fully managed service, and it includes a scalable, managed back end for metrics, logs, and traces.

Full-stack monitoring is supported with out-of-the-box dashboards and alerts available. Enterprises that are already running Prometheus, Loki, or Graphite can achieve a single view across several instances. Grafana retains 13 months’ worth of metrics for trend analysis and capacity planning and 30 days of log and trace data.

Dashboards support queries and alerts, and a wide range of metrics can be visualized. Once created, dashboards can be shared with other team members. Ad hoc queries, dynamic drill-down, and split view dashboards are available, along with the ability to compare different time ranges, queries, and data sources side by side. Out-of-the-box dashboards and alerts are included for infrastructure components including MySQL, Postgres, Redis, and Memcache.

Grafana ships with built-in support for Grafana Tempo, an open-source tracing solution, while Loki powers Grafana Logs. Using Promtail, Grafana’s preferred agent, logs can be pulled in from a wide range of sources, including local log files, the system journal, GCP, AWS CloudWatch, AWS EC2 and EKS, Windows event logs, the Docker logging driver, Kubernetes, and Kafka.

Grafana Cloud Metrics provides a view of all metrics by running queries using data from multiple applications, data centers, or regions. Query results from metric clusters running in different data centers or geographies can be merged to provide a single view of data.

Strengths: Grafana has outstanding capabilities in the key criterion of reporting and dashboards. It has good capabilities in multicloud resource view and in user interaction performance, as it provides synthetic monitoring, error and latency alerting, and user load times, all of which are part of Grafana’s out-of-the-box dashboards.

Challenges: Grafana provides little or no predictive analysis. Ease of deployment is a feature, but deployment in complex environments is difficult. Due to the number of open-source components, some degree of technical expertise is required to implement the different Grafana solutions.

IBM Instana

IBM has adopted a cloud strategy that it believes will make it a leader in the hybrid cloud space. By acquiring Red Hat, IBM brought OpenShift into its stable and, in turn, made its multicloud management platform competitive. The observability platform, combined with Watson for AIOps, provided a solid first step. IBM acquired Instana at the end of 2020, which has provided the vendor with an enterprise observability and application performance monitoring platform. The addition of Instana enhances IBM’s Watson AIOps offering by providing a continuous stream of information, which improves the quality of the recommendations from its AI models.

Instana’s full-stack APM discovery and monitoring include automatic discovery, monitoring, root cause analysis, and feedback. In-depth root cause analysis of every incident is provided, with all events correlated using the Dynamic Graph. This results in the generation of a single alert, which contains a cause-and-effect report that includes hyperlinks to the details. Stream processing is used to collect Instana’s Dynamic Graph records, and relationships among all entities are modeled in real time, providing insights into interdependencies and the ability to identify what is not running at any time.
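
As a rough illustration of how correlating events over a dependency graph can collapse many symptoms into a single cause-and-effect alert, the sketch below treats an unhealthy entity as a probable cause when none of the entities it depends on is unhealthy. The topology, entity names, and the `find_root_causes` and `build_alert` functions are hypothetical; this is not Instana's Dynamic Graph or its API.

```python
from typing import Dict, List, Set


def find_root_causes(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> Set[str]:
    """Return unhealthy entities whose own dependencies are all healthy."""
    roots = set()
    for entity in unhealthy:
        deps = depends_on.get(entity, [])
        if not any(dep in unhealthy for dep in deps):
            roots.add(entity)
    return roots


def build_alert(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> dict:
    """Collapse correlated events into one alert: probable causes plus affected entities."""
    causes = find_root_causes(depends_on, unhealthy)
    return {"causes": sorted(causes), "effects": sorted(unhealthy - causes)}


# Hypothetical topology: each entity maps to the entities it depends on.
topology = {
    "checkout-ui": ["checkout-api"],
    "checkout-api": ["orders-db", "payment-svc"],
    "payment-svc": [],
    "orders-db": [],
}

# Three entities report problems, but only one alert is produced.
alert = build_alert(topology, {"checkout-ui", "checkout-api", "orders-db"})
print(alert)
```

Here the database is reported as the cause and the two services that sit above it are listed as effects, instead of three separate alerts being raised.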

Data analytics, powered by Unbounded Analytics, allows users to analyze trace and profiler data to identify potential bottlenecks and resource problems without the need for knowledge of a specific query language.

Instana Dependency Map shows all service and application dependencies, providing an understanding of the relationships among application components and allowing visibility of the impacts from various issues.

The product provides observability capabilities across any infrastructure environment whether it be hybrid cloud, virtual hosts, PaaS, IaaS, or serverless. More than 300 monitoring sensors are available that work in bare metal, virtual, private cloud, public PaaS, IaaS, and serverless infrastructure components. Each sensor has its own health rules and alerts built in, and does not require any setup or configuration.

Instana supports public cloud (SaaS), private cloud, and combined private and public cloud (hybrid) deployment models.

Strengths: Instana has outstanding capabilities in multicloud resource view and good capabilities in reporting and dashboards, user interaction performance, and predictive analysis. Integration with the IBM platform of tools continues to improve.

Challenges: Instana does not push information to other data sinks easily, and the lack of integration with other IBM tools makes it difficult to leverage the extensive library of solutions IBM can offer. IBM is a supporter of OpenTelemetry but is not a major contributor.

LogicMonitor Cloud (LM Cloud)

LogicMonitor provides an automated, cloud-based infrastructure monitoring and observability platform for hybrid/multicloud infrastructure, logs, and applications targeted at enterprise IT and managed service providers (MSPs). Its single unified platform provides IT insights, seamless data collaboration at scale, and visibility into networks, cloud, containers, applications, servers, and log data. It includes AIOps for metrics, logs, and applications with features including root cause analysis, anomaly detection, and forecasting, as well as alerting and dynamic topology mapping, which allows dependencies and relationships to be visualized.

Agentless collectors automatically discover, map, and set baselines for complex and distributed infrastructure. Website monitoring and synthetics are enabled, and multicloud environments are supported with coverage for AWS, GCP, and Azure, as well as Kubernetes deployments, SaaS applications, and traditional environments. Advanced forecasting allows future trends to be predicted.

The platform, which includes end-to-end tracing with code-level visibility across the entire stack, APM, metrics, and logs, is built on OpenTelemetry and OpenMetrics. Predefined, customizable dashboards and reports are available, allowing users to automatically map and visualize the relationships between microservices and application components to aid troubleshooting. Dashboards are available for all aspects of the system, from high-level KPIs to granular technical metrics. AI/ML is used to auto-detect anomalies, to adjust thresholds dynamically with continuous unsupervised learning, and to automate root-cause analysis and alert suppression to avoid alert storms.

Included with LM Cloud are rapid API-based monitoring of cloud platforms and SaaS applications; intuitive, guided setup and configuration; troubleshooting with instant visibility into cloud resources, logs, and applications; enhanced visibility for on-premises, cloud, and microservice topology; and automated real-time log analysis.

APM capabilities include auto-instrumentation client libraries for Java, Node.js, .NET, Go, and Python, and the ability to push or pull data from any source.

Three retention options are available for log data: unlimited, one year, and 30 days. In addition, there are more than 2,000 pre-built templates and modules in the unified collector for log and metric ingestion.

Strengths: LogicMonitor has outstanding capabilities in intelligent data push and good capabilities in reporting and dashboards, user interaction performance, multicloud resource view, and predictive analysis.

Challenges: LogicMonitor does not have a program for startups. LM Cloud does not currently have any capabilities in the emerging technology area of federated, hierarchical, or edge AI/ML.

Logz.io

Logz.io is an Israel-based company with a large presence in the U.S. It primarily uses open-source technologies and open standards (such as OpenTelemetry) to monitor, log, collect, search, and analyze observability data. A vast majority of its revenue comes from its observability platform. Logz.io works well with agile, cloud-native customers, most of which are running Kubernetes in production. The company has more than 1,200 customers, including Siemens, Unity, and ZipRecruiter.

The scalable SaaS-based Logz.io platform has four elements: ELK-based (Elasticsearch, Logstash, Kibana) log management, infrastructure monitoring based on Prometheus and Grafana, Jaeger-based distributed tracing, and an ELK-based cloud SIEM. These are fully managed, integrated cloud services for effectively monitoring, troubleshooting, and securing distributed cloud workloads. While the logging solution has been around since 2014, the tracing and infrastructure components were added in 2020. The vendor also released a synthetic monitoring system using FaaS (function as a service).

Logz.io provides solutions for a number of use cases, including AWS and Azure observability and container monitoring with the ability to monitor Docker and Kubernetes using a unified machine data analytics platform built on top of the ELK Stack and Prometheus.

Cognitive insights are provided using human-coached machine learning and crowdsourcing to automatically locate critical issues in log data with actionable information available to help troubleshoot them.

Prebuilt and customizable monitoring dashboards are provided for full visibility of the environment using PromQL. Filters and a drag-and-drop interface are available, and existing Grafana dashboards can be migrated into Logz.io.

Logz.io includes Cloud SIEM, which is based on the ELK stack and can be easily turned on. It allows incoming logs to be cross-referenced with hundreds of out-of-the-box rules and a variety of threat intelligence feeds. Integration and interoperability with other tools are provided. Security events can be classified, prioritized, and grouped to enable investigation and response workflows. Cloud SIEM includes built-in integrations with any data source, including AWS, Azure, and popular security tools like HashiCorp Vault and Okta.

Logz.io supports a SaaS-based public cloud model.

Strengths: Logz.io has outstanding capabilities in the key criterion of intelligent data push, as its Cloud SIEM is a key differentiator. It also has good capabilities across the criteria of reporting and dashboards, multicloud resource view, and predictive analysis. Security is a strong point for Logz.io.

Challenges: There is little support for user interaction performance. Logz.io does not currently have any capabilities in the emerging technology area of federated, hierarchical, or edge AI/ML. There is little support for OpenTelemetry.

Micro Focus Operations Bridge – SaaS

Micro Focus is one of the longest-running players in the ITOM monitoring space. Founded in 1976, the company has a long history of building out its technology stack and providing solutions for the DevOps, hybrid IT, security and risk management, and predictive analytics markets. Through a number of acquisitions (HP Software and Vertica among them), Micro Focus is trying to expand quickly into the IT observability space.

The Operations Bridge product automatically monitors and analyzes the health and performance of multicloud resources across devices, operating systems, databases, applications, and services on all data types. The platform offers an event consolidation and correlation engine, and big data analytics-based noise reduction. It integrates end-to-end service awareness with rule and machine learning-based event correlation capabilities delivered on top of an OPTIC data lake.

A SaaS-based AIOps platform consolidates data across toolsets to pinpoint service slowdowns and solutions, providing automated event and metric analyses. Integrated machine learning on events and data automatically provides problem identification with real-time automated event correlation and dynamic thresholds, along with interactive visual analytics.

Predictive analytics using machine learning creates dynamic baselines automatically, incorporating prior history and seasonality. Created events can alert operators to help identify issues when thresholds are breached, before overall systems are impacted.
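
One simple way to picture a dynamic baseline that accounts for history and seasonality is to group past samples by their position in a repeating cycle and compare each new value with the statistics for its own position. The sketch below does this with a mean and standard deviation per slot. It is a generic illustration under assumed parameters (`period`, `k`, `min_band`, and the sample data are all hypothetical) and is not the algorithm Micro Focus uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history, period):
    """Group past samples by position in the seasonal cycle.

    history: list of (time_index, value); period: cycle length in samples.
    Returns {slot: (mean, population standard deviation)}.
    """
    buckets = defaultdict(list)
    for t, value in history:
        buckets[t % period].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, t, value, period, k=3.0, min_band=1.0):
    """Flag a value outside mean +/- max(k * stddev, min_band) for its slot."""
    mu, sigma = baseline[t % period]
    band = max(k * sigma, min_band)
    return abs(value - mu) > band


# Hypothetical metric with a four-sample cycle (low, medium, high, medium)
# and a slow upward drift across three past cycles.
PERIOD = 4
pattern = [10, 50, 90, 50]
history = [(t, pattern[t % PERIOD] + t // PERIOD) for t in range(12)]
baseline = build_baseline(history, PERIOD)

# The same value can be normal at one point in the cycle and anomalous at another.
normal_at_peak = is_anomalous(baseline, 14, 90, PERIOD)    # slot 2, usually about 91
spike_at_trough = is_anomalous(baseline, 12, 90, PERIOD)   # slot 0, usually about 11
print(normal_at_peak, spike_at_trough)
```

A static threshold would have to treat the value 90 the same way at every hour; the per-slot baseline lets it pass at the usual peak and raise an event at the usual trough.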

IaaS, PaaS, and traditional IT environments can be monitored with out-of-the-box discovery for more than 120,000 combinations of applications, operating systems, and virtualization platforms including AWS, Azure, Google Cloud, private clouds, and containers such as Kubernetes, Docker, and OpenShift. Discovery can be agentless, agent-based, or a combination of the two, to cover the whole IT ecosystem.

Dashboard and reporting capabilities are provided using a single data store with real-time business value dashboards, or alternatively, companies can use their own BI tool of choice. Business Value Dashboards (BVD) display traditional status and KPI data from Operations Bridge and other IT sources. BVD also shares metrics with the Collect Once Store Once (COSO) common data store.

Micro Focus supports public cloud (SaaS), private cloud (on-premises), and public and private cloud (hybrid) deployment scenarios.

Strengths: Micro Focus has improved significantly in cloud and application observability. The UI in Operations Bridge has better refinement and is much more modern. Overall, Micro Focus has enhanced its solution in almost every way. It has outstanding capabilities across the key criteria of reporting and dashboards, user interaction performance, predictive analysis, and intelligent data push.

Challenges: There is some support for OpenTelemetry, but Micro Focus is not a contributor. Support for startups is difficult for any major platform player, and Micro Focus struggles in this area.

Microsoft

Azure Monitor collects, analyzes, and acts on telemetry data from Azure and on-premises environments, including virtual machines, Azure Kubernetes Service (AKS), Azure Storage, and databases, as well as Linux and Windows virtual machines in a single map. Integration with other Microsoft applications such as Power BI, which enhances visualization, extends the capabilities.

Application Insights provides the ability to detect and diagnose issues across applications and dependencies. VM Insights and Container Insights monitor the performance of container workloads, collecting metrics from controllers, nodes, and containers that are available in Kubernetes. Log Analytics provides troubleshooting and deep diagnostics by allowing drill down into monitoring data. Smart alerts and automated actions are both supported.

Visualizations such as charts and tables can be created to display monitoring data, and other Azure services are used for publishing it to different audiences. Azure dashboards allow different types of data to be displayed in a single pane in Azure Portal. Dashboards can be customized and user-created. Azure Monitor Metrics allows data to be collected from monitored resources.

Workbooks can be used for data analysis and the creation of reports in Azure Portal. Workbooks are provided with Insights, but they can also be user-created from pre-defined templates. Power BI can be configured to import log data automatically from Azure Monitor for enhanced visualizations.

Azure Monitor collects data from a variety of sources, including applications, operating systems, services, and platforms from a number of tiers, which are: application monitoring data; guest OS monitoring data; Azure resource monitoring data; Azure subscription monitoring data; and Azure tenant monitoring data. Log data can be collected from any REST client using the Data Collector API, providing the ability to create custom monitoring scenarios.

Azure Monitor does not provide multicloud monitoring as it only monitors Azure and on-premises environments, which is a limitation in a multicloud world, but it integrates with other monitoring products such as Datadog, Grafana, and Splunk, which do monitor multicloud environments.

Strengths: Microsoft provides excellent coverage of the Azure cloud space and is a top-five contributor to OpenTelemetry. Microsoft provides a startup program.

Challenges: Microsoft has poor capabilities across the key criteria of ease-of-use, multicloud resource view, and predictive analysis. There is no multicloud support.

NetApp Cloud Insights

With three decades behind it, having been founded in 1992, NetApp is a hybrid cloud data services and data management company that has acquired a wide variety of customers, including DreamWorks Animation, AstraZeneca, and Dow Jones. It provides a number of solutions and cloud services to help companies to manage their IT infrastructures.

NetApp Cloud Insights is a SaaS monitoring tool that allows complex infrastructures to be mapped and provides real-time data visualization of the topology, availability, performance, and usage of the entire IT infrastructure, including cloud and on-premises environments. It provides an understanding of demand, latency, errors, and saturation points of all services. Automated discovery allows end-to-end service paths to be created. Root cause data is provided to identify performance level violations. The NetApp approach is to handle the hardware up to the point of the application; data can then be consumed and displayed from other sources. Predictive analytics, based on machine learning technology, provides alerts of potential issues before they escalate to become major problems.

Of note is that NetApp can protect data by auto-restricting user access when it detects a ransomware event. Access can be totally blocked or restricted to read-only. Additionally, NetApp can detect abnormal data deletion and access, and totally block or restrict the user to a read-only mode. NetApp provides tools to help investigate resources and their impact on the environment. Cloud Insights has automated response policies based on the detection of abnormal user behaviors.

NetApp Cloud Insights provides support for AWS, Azure, and GCP, as well as third-party devices from a variety of companies, including Dell, Fujitsu, IBM, and Hitachi. It also supports the navigation of Kubernetes clusters to provide end-to-end visibility of containerized applications.

The product ships with a number of dashboards for each cloud environment that is supported and includes reports on latency, SLOs, and VMs. Users can also create new dashboards using visualization types and an intuitive dashboard creation interface. Annotations allow users to add custom metadata, which is used to slice the monitoring data. The product ships with a set of default annotations, but user-created annotations are enabled as well.

NetApp supports a public cloud (SaaS) deployment model only.

Strengths: NetApp has good reporting and dashboards, predictive analysis, and multicloud resource view. Infrastructure support for on-site hardware is a major strength. Support for consuming OpenTelemetry is provided. The NetApp approach to observability allows it to move into the area of federated/hierarchical/edge AI/ML more easily than other vendors can. NetApp provides unique protection from ransomware and data-intensive intrusion detection.

Challenges: NetApp products are not focused on user interaction performance and monitoring. NetApp does not contribute to OpenTelemetry but can consume OpenTelemetry data streams.

New Relic

New Relic is a San Francisco-based company founded in 2008. Its observability platform, New Relic One, is targeted at all enterprise verticals including technology, retail, finance, healthcare, media, industrials, and public sector, focusing on forward-thinking organizations looking for innovative solutions to their problems. The company’s revenues are increasing rapidly, having reached $753 million by 2021, up from $600 million in 2020.

New Relic One is a cloud-based observability platform that provides application performance management (APM) as well as infrastructure, browser, real user, synthetics, mobile, AIOps, and native client monitoring.

New Relic provides full-stack visibility from the client side (mobile, browser) to back-end services, to databases, infrastructure, and networks, with the ability to view traces and logs in context. Automap capabilities allow dependencies to be visualized. A real-time Java profiler enables troubleshooting cluster behavior to diagnose and improve performance bottlenecks.

New Relic’s AIOps is integrated into all capabilities and is available for free for all full users. It includes anomaly detection, root cause analysis assistance, and incident mitigation.

Dashboards provide visibility of data with a library of built-in charts and templates available as well as New Relic’s programmable platform that builds custom visualizations. Data can be gathered from agents (APM, browser, mobile, infrastructure, synthetics) and third-party instrumentation such as Prometheus, DropWizard, Zipkin, OpenTelemetry, and Fluentd.

Synthetic Monitoring is available out of the box with full-stack observability. It allows user traffic to be simulated to proactively detect and resolve outages and poor performance of URLs, APIs, critical services, and end-user experience.

New Relic includes OpenTelemetry APM, which provides exporters to ingest telemetry data into a single, fully managed telemetry data platform. Users can visualize performance data in context. A single set of APIs and libraries is provided to standardize telemetry data collection, eliminating the need to create code.

New Relic holds three leadership positions in the OpenTelemetry project. New Relic is a top-10 contributor to OpenTelemetry. Pixie, an open-source observability tool for Kubernetes applications, is used by New Relic to capture telemetry data without the need for manual instrumentation.

New Relic supports on-site, hybrid, edge, and multicloud environments. It supports public/private clouds with Pivotal and OpenShift for all major public cloud providers.

Strengths: New Relic has outstanding capabilities in the key criteria of reporting and dashboard capabilities, user interaction performance, and multicloud resource view. New Relic’s OpenTelemetry capabilities and contributions place it ahead of many of its competitors.

Challenges: The costs of using the New Relic One platform in a very large enterprise can add up. Edge AI/ML is not currently part of the New Relic offering; however, the Pixie roadmap includes edge computing to allow AI/ML on unsampled data.

Oracle

Oracle Corporation is a U.S. multinational technology company headquartered in Austin, Texas. In 2020, it was the second-largest software company in the world by revenue and market capitalization. The company sells hardware and software across a wide range of market sectors and has invested heavily in the cloud in recent years, providing a large number of data centers.

Oracle Cloud Observability and Management Platform is a relatively new product, released in October 2020. It provides visibility across all layers of the stack across Oracle, third-party clouds, and data center resources deploying machine learning-driven actionable insights. The platform comprises six services: logging, logging analytics, application performance monitoring (APM), operations insights, database management, and a service connector hub. It uses open standards such as OpenTracing and OpenTelemetry.

The platform provides full-stack analytics, forecasting, and visibility into microservices, Kubernetes, Java, and .NET running on AWS, Azure, and Oracle Cloud Infrastructure.

The logging component provides the centralized management of all log types including audit, infrastructure, database, and applications. They are displayed via a single view and can be searched for relevant data and event types. It is built on open standards and leverages Fluentd for log ingestion.

Logging analytics allows log data to be visualized, queried, and analyzed in real time using machine learning algorithms to find anomalies, patterns, and data relationships. More than 250 out-of-the-box parsers are included for Oracle and third-party technology stacks. Log data can be archived through user-defined rules.

APM provides end-user and server monitoring, synthetic monitoring, distributed tracing, and the capture and analysis of traces. Real and synthetic end-user performance can be measured for browser and application usage, providing support for session diagnostics.

Operations Insights provides capacity planning, analysis, and forecasting for database resource usage and the proactive identification and resolution of SQL issues such as slow performance. It combines historical data and machine learning-based forecasting algorithms.

Database management provides monitoring and management of databases across on-premises and cloud environments, including real-time SQL monitoring.

The service connector hub supports integrations as well as visibility and security, allowing a central console to be used to manage data movement between Oracle observability services and third-party tools.

Oracle offers flexible deployment models, including public cloud (SaaS), private cloud, and combined private and public clouds (hybrid).

Strengths: Oracle has good capabilities across all the key criteria of reporting and dashboard capabilities, user interaction performance, multicloud resource view, predictive analysis, and intelligent data push. Oracle contributes to and supports OpenTelemetry.

Challenges: Oracle is a relatively new vendor in this space, and as such, its capabilities are not as advanced as those of its more established competitors.

SolarWinds

SolarWinds is a U.S. company, founded in 1999 and headquartered in Austin, Texas. It develops software to help manage networks, systems, and infrastructure. It provides solutions for observability, IT service management, application performance, and database management. The observability solution provides a full-stack solution, which monitors on-premises and multicloud environments, with native support for AWS and Azure clouds, increasing visibility, intelligence, and productivity.

SolarWinds’ observability suite is offered through one unified platform allowing businesses to optimize their application and system performance, ensure availability and reduce remediation time across on-premises and multicloud environments.

SolarWinds currently offers multiple integrated products across its product portfolio, but these will be unified into a new comprehensive observability platform with simplified licensing and pricing during 2022. The new SolarWinds Hybrid Cloud Observability solutions are designed for hybrid IT and can be deployed on-premises or self-hosted in AWS, Azure, or GCP. Additionally, 2022 will see the release of a SaaS version, which is intended to complement the SolarWinds Hybrid Cloud Observability solution.

SolarWinds Network Automation Manager (NAM) delivers network performance monitoring, hardware health, packet analysis, flow monitoring, bandwidth analysis, configuration and change management, switch port and end-user monitoring and tracking, WAN performance monitoring, and IP address management.

SolarWinds Server & Application Monitor provides visibility across applications and their supporting infrastructure. Users can identify the root cause of performance issues while monitoring everything from applications or virtual hosts down to server hardware health.

SolarWinds Virtualization Manager provides a virtual machine monitoring and management solution designed to troubleshoot and solve performance issues. Virtualization Manager supports visibility of the entire IT environment from a single interface—be it on-premises, hybrid, or in the cloud.

SolarWinds Server Configuration Monitor delivers the ability to detect and compare configuration changes to servers, databases, and applications.

SolarWinds Log Analyzer is a log management and analysis solution that includes real-time log collection and analysis. Log Analyzer provides out-of-the-box visibility into the performance and availability of IT infrastructure and applications.

SolarWinds supports flexible deployment options including public cloud (SaaS), private (self-hosted), hybrid deployment, or fully on-site.

Strengths: SolarWinds has good capabilities across all the key criteria: reporting and dashboard capabilities, user interaction performance, multicloud resource view, predictive analysis, and intelligent data push. SolarWinds provides licensing approaches that allow startups to thrive before they grow to become SMB or enterprise customers.

Challenges: SolarWinds uses OpenTelemetry, but it is not a contributor. Implementations can be tricky with complex hybrid environments, perhaps requiring professional services support.

Splunk

Splunk has been in the IT monitoring business for more than 15 years. In 2019, Splunk acquired SignalFx (founded in 2013) and Omnition (founded in 2018), which enhanced the usability of the Splunk platform and transformed it into a full observability platform.

The Splunk solution combines monitoring, troubleshooting, and incident response solutions that boost application modernization initiatives. Splunk is a full-stack multicloud integrated enterprise solution that comprises Splunk Observability Cloud and Splunk Enterprise and brings together infrastructure monitoring, application performance monitoring, digital experience monitoring, real user monitoring, synthetics, log investigation, AIOps, and incident response into a single platform for any hybrid cloud application environment.

Splunk Observability Cloud is SaaS-based and comprises IT infrastructure monitoring, application performance monitoring, real user monitoring, synthetic web and application monitoring, log analysis, incident response, on-call management, and mobile alerting and dashboarding.

Splunk Observability provides full-stack visibility across infrastructure, applications, and business services, providing insights to ITOps, DevOps, CloudOps, SREs, application developers, and service owners. It provides application, infrastructure, and digital performance monitoring, log management, AIOps, and incident response as an integrated solution.

The product ingests, analyzes, and stores all transactions from the front end (browser or mobile app) to the back-end service without ever sampling. Full coverage of on-premises, hybrid, and multicloud environments is supported.

Built-in ML-driven correlation from multiple sources is provided in real time with dynamic thresholding, anomaly detection, and prescriptive troubleshooting, allowing users to identify the root causes of problems.

The solution is OpenTelemetry native with no proprietary agents for customers to deploy or manage, and allows cloud-native shops to build quickly with an option to convert to enterprise-scale. Splunk has contributed a SignalFx Smart Agent, application libraries, and an eBPF collector to the OpenTelemetry 1.0 spec for tracing, and remains committed to the project.

Splunk provides flexible deployment options supporting private cloud, public cloud, and private and public clouds (hybrid). For Observability deployments, Splunk provides host-based pricing.

Strengths: Splunk has outstanding capabilities in the key criteria of reporting and dashboard capabilities, user interaction performance, multicloud resource view, and predictive analysis. The integration of Splunk Observability Cloud and Splunk Enterprise provides end-to-end full-stack coverage across hybrid cloud environments. Splunk co-founded OpenTelemetry and is the number one contributor to the project. Splunk provides support for startups.

Challenges: For companies leveraging the Splunk platform at scale, costs can add up. Due to the complexity of environments, deployment of Splunk in large organizations can be difficult, resulting in the likely use of professional services for a complete solution.

Stackstate

Founded in 2015 and headquartered in Hilversum, The Netherlands, with an office in Boston, MA, this nimble startup has built its observability data analytics solution from the ground up. The company generally provides an update every quarter. It offers some unique features that have found niche applicability with banking and finance, telecom, and MSPs, mostly in Europe.

Stackstate is a topology-powered and relationship-based observability solution. It maps business services to their applications, infrastructure dependencies, configurations, and changes. The topology relationships are generally pulled from a CMDB storage, such as BMC Remedy, ServiceNow, and other IT management tools. It collects data by integrating with other third-party monitoring tools, such as Splunk, and can be extended with the platform’s own agents. The SaaS offering ingests data directly from Kubernetes, AWS, and many other cloud integrations to offer cloud-native teams the ability to build an understanding of their cloud-native application. There is support for AWS Serverless Monitoring, where OpenTelemetry collects traces and creates the topology.

This solution monitors microservice and cloud platforms and provides service discovery and infrastructure maps across hybrid on-premises, cloud, and container environments.

StackState adds information to the problem resolution process to help identify root causes, including relationships, configuration changes, anomalies, and the time at which events occurred. The time-traveling graph database enables a user to go back to any moment in time to see what the landscape looked like at that point.
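The time-travel idea can be illustrated with a minimal Python sketch. This is not the vendor's implementation; it assumes a hypothetical append-only log of relationship changes (the `TopologyEvent` type, the `topology_at` function, and the service names are all invented for illustration) and rebuilds the topology as of a chosen moment by replaying that log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyEvent:
    """One recorded change to a relationship (hypothetical schema)."""
    timestamp: int  # seconds since an arbitrary epoch
    action: str     # "add" or "remove"
    source: str     # component that depends on `target`
    target: str


def topology_at(events, moment):
    """Replay events up to and including `moment`; return the set of edges."""
    edges = set()
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > moment:
            break
        edge = (event.source, event.target)
        if event.action == "add":
            edges.add(edge)
        elif event.action == "remove":
            edges.discard(edge)
    return edges


# Hypothetical change log for one service.
EVENTS = [
    TopologyEvent(100, "add", "checkout-service", "payments-db"),
    TopologyEvent(200, "add", "checkout-service", "cache"),
    TopologyEvent(300, "remove", "checkout-service", "cache"),
]
```

Calling `topology_at(EVENTS, 250)` returns both relationships, while `topology_at(EVENTS, 300)` returns only the database dependency. Comparing two such snapshots shows what changed around the time of an incident.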

StackState provides end-to-end insights into the entire IT landscape through its tracing capability. Tracing provides support for all languages, allows distributed traces, and integrates cloud tracing technologies such as AWS X-Ray and Azure Monitor.

Full support is provided for cloud and container infrastructures across all environments. StackState maps the full infrastructure, from low-level components such as security groups to higher-level services such as elastic load balancers and Lambda functions.

It automatically captures the Four Golden Signals (latency, calls per second, error rate, and contention) for each service or application, providing an accurate picture of the health of each service without the need to deploy a separate APM solution. Through AIOps, StackState is capable of detecting anomalies in these golden signals and raising alerts before these anomalies become customer issues.
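As a rough illustration of how such signals can be derived and checked, the Python sketch below computes three of the four signals for one time window and flags a value that deviates sharply from recent history. It is a sketch under stated assumptions, not the vendor's method: the function names, the record layout, and the z-score threshold are hypothetical, and contention is omitted because it requires resource-level data rather than request records.

```python
from statistics import mean, stdev


def golden_signals(requests, window_seconds):
    """Summarize one window of requests.

    `requests` is a list of (latency_ms, is_error) tuples (hypothetical layout).
    """
    count = len(requests)
    if count == 0:
        return {"latency_ms": 0.0, "calls_per_second": 0.0, "error_rate": 0.0}
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, is_error in requests if is_error)
    return {
        "latency_ms": mean(latencies),
        "calls_per_second": count / window_seconds,
        "error_rate": errors / count,
    }


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score stand-in for an AIOps model)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return current != history[0]
    return abs(current - mean(history)) / spread > threshold
```

A monitoring loop would call `golden_signals` once per window, keep a short history of each signal, and raise an alert when `is_anomalous` returns true for the newest value.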

Automatic dashboards are provided for all aspects of the system including events, metrics, traces, and topology.

StackState can be deployed in a private cloud, public cloud, or in private and public clouds (hybrid).

Strengths: StackState has good capabilities in the key criteria of reporting and dashboards, multicloud resource view (multiple clouds are visible from a single dashboard), and intelligent data push (it can send observability data to third-party monitoring tools).

Challenges: StackState lags on the criterion of user interaction performance. Application performance management is not a strength. It does not contribute to or use OpenTelemetry but does support the use of Telemetry.io standards.

Sumo Logic

Sumo Logic is a SaaS-based, cloud-native, multi-tenant observability platform. It was built originally as a log-management, big data analytics, and SIEM solution, but now Sumo Logic has added tracing and metrics to revamp the product into a full observability platform.

Sumo Logic Continuous Intelligence Platform ingests and analyzes data from applications, infrastructure, security, and IoT sources. It then develops unified, real-time analytics. The platform employs AI/ML to create a smooth user experience when exploring logs, metrics, and traces.

Sumo Logic provides full support for AWS, Azure, and GCP environments, applications, and services, and integrates with cloud monitoring tools. Built-in pattern detection using ML, anomaly detection, outlier detection, and predictive analytics provide insights and a way to help locate root causes. There are native integrations for Kubernetes, Docker, AWS EKS, AWS Lambda, Azure AKS, and Azure Functions.

Users can create customizable dashboards to gain visualizations of logs, metrics, traces, and performance data. The visualization of these elements, assisted by AI processes, allows quick and easy navigation for engineers as they diagnose the causes of errors and failures.

Log management capabilities include log dashboards and data visualizations to provide real-time log visibility, with built-in predictive analytics to identify trends and help solve issues. Sumo Logic offers multi-tenant SaaS security analytics with integrated threat intelligence. One-click integrations are provided with AWS, Azure, and GCP services to provide full-stack visibility of the cloud architectures through logging and monitoring.

Sumo Logic Cloud SIEM provides visibility across the enterprise to allow users to gain an understanding of the impact and context of an attack. Workflows are provided to prioritize security alerts. Cloud SIEM parses, maps, and creates normalized records upon ingestion from structured and unstructured data, then correlates detected threats across on-premises, cloud, multicloud, and hybrid cloud environments.
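The parse, map, and correlate flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Sumo Logic's implementation: the JSON field names, the plain-text line format, and the functions `normalize` and `correlate` are assumptions made for the example.

```python
import json
import re
from collections import defaultdict

# Hypothetical unstructured format: "<environment> <source IP> <event text>"
LINE_PATTERN = re.compile(
    r"^(?P<env>\S+) (?P<src_ip>\d+\.\d+\.\d+\.\d+) (?P<event>.+)$"
)


def normalize(raw):
    """Map a JSON record or a plain-text line onto one common schema.

    Returns None when the input matches neither assumed format.
    """
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return {
                "env": data["environment"],
                "src_ip": data["source"],
                "event": data["message"],
            }
    except (json.JSONDecodeError, KeyError):
        pass  # not structured input; fall through to the text pattern
    match = LINE_PATTERN.match(raw)
    return match.groupdict() if match else None


def correlate(records):
    """Return source IPs seen in more than one environment."""
    envs_by_ip = defaultdict(set)
    for record in records:
        envs_by_ip[record["src_ip"]].add(record["env"])
    return {ip: sorted(envs) for ip, envs in envs_by_ip.items() if len(envs) > 1}
```

Once every input shares the same fields, a cross-environment question such as "which source addresses appear in both cloud and on-premises logs" becomes a simple grouping step.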

Cloud SIEM ingests and analyzes security telemetry and event logs, and reassembles network traffic flows into protocol-level network sessions, extracted files, and security information.

Sumo Logic can be deployed in a public cloud (SaaS), in private and public clouds, or in a private cloud.

Strengths: Sumo Logic has good capabilities across all key criteria: reporting and dashboards, user interaction performance, multicloud resource view, predictive analysis, and intelligent data push. The Cloud SIEM product is a strong addition to the observability platform. Sumo Logic is a top-30 contributor to OpenTelemetry.

Challenges: Sumo Logic currently has no capabilities in the emerging area of federated, hierarchical, or edge AI/ML, so adding them would strengthen the platform. How well Sumo Logic manages the transition from a log management, big data analytics, and SIEM solution to one that integrates APM and tracing will determine its long-term viability in the observability space.

VMware

The VMware Tanzu solution suite, designed to support cloud, hybrid cloud, and containerized applications, now includes its observability platform: Tanzu Observability by Wavefront. VMware is expanding its support for cloud and Kubernetes, and this platform, rebranded from the Wavefront product in March 2020, is designed to help produce, maintain, and scale cloud-native applications.

Tanzu Observability is designed specifically to help enterprises with monitoring, observability, and analytics of cloud-native applications and environments including AWS, Azure, and GCP. It uses metrics, traces, histograms, span logs, and events. These are aggregated across distributed applications, application services, container services, and public, private, and hybrid cloud infrastructures to build a real-time picture of an entire ecosystem.

Tanzu Observability delivers instant chart rendering and real-time updating, which enable rapid, iterative incident triage. Users can create and customize interactive dashboards from a simple widget-enabled tool bench, and dashboards can be self-service-enabled and scaled to thousands of users across an organization. Charts can be created from metrics, histograms, integrations, or chart types using drag and drop. Out-of-the-box dashboards are provided, and data can be ingested into them via more than 200 existing integrations using auto-configuration plug-ins.

Microservices-based applications can be monitored using built-in support for key health metrics, histograms, and distributed tracing and span logs for common languages and frameworks. All of these elements are unified into a single platform. Support is provided for OpenTelemetry-compliant solutions. Application maps display dynamic distributed application services in real time, with a drill-down capability allowing access to root causes. AI Genie, using machine learning-based anomaly detection and forecast prediction, provides visualization of incidents and future requirements across applications and infrastructure.

VMware Tanzu also provides observability into Kubernetes environments, auto-discovers Kubernetes workloads, and recognizes Kubernetes services. It populates out-of-the-box dashboards with metrics from all Kubernetes layers including clusters, nodes, pods, containers, and system metrics.

The solution can be deployed in a public cloud (SaaS), in private and public clouds, or in a private cloud.

Strengths: VMware has outstanding capabilities in the key criterion of multicloud resource view, and good capabilities across reporting and dashboards, user interaction performance, and intelligent data push. VMware contributes to and uses OpenTelemetry.

Challenges: VMware is weak in the area of predictive analysis. The ability to push data to other data sinks (such as SIEM or FinOps) is an area that’s ripe for improvement. Application performance management and user experience need strengthening.

6. Analyst’s Take

Digital transformation has impacted operational awareness in many ways, including the need for cloud observability solutions. From large enterprises to small startups, cloud computing is changing the way businesses look at IT. Vendors have responded in a number of ways. Some have created products specifically for cloud operations. Others have added features to application performance management tools. Hardware and virtualization vendors have expanded their feature sets to encompass cloud observability. Additionally, open-source solutions have been bundled together to provide solutions for organizations with the requisite technical prowess. Of course, all cloud providers have observability offerings of their own.

This report supports the need for multicloud observability and assumes that, for any vendor other than the cloud providers themselves, a lack of such support will limit the desirability of its offering. Even smaller companies use multiple cloud providers. The need to consolidate operations views places multicloud observability at the forefront of cloud observability requirements.

In large enterprises, two types of solutions stand out: those based on a platform of solution offerings and single-vendor solutions. Micro Focus, Broadcom, and IBM have compelling solutions for buyers already invested in these platforms. The single-vendor solutions have become more similar to the platform offerings. Splunk now provides a number of products and services that cover more of the enterprise than ever before. Furthermore, hardware and virtualization providers NetApp and VMware have strengthened their cloud observability products.

The strongest single-vendor solutions continue to come from companies with application performance monitoring backgrounds: New Relic, Cisco, Dynatrace, and Datadog. All give buyers strong multicloud observability and provide deep insights into the customer experience. Each provides a SaaS-only offering that fits with companies moving in that direction.

SMBs should consider their future direction before choosing a cloud solution because IT and business maturity will affect good decision-making. There are no bad solutions on the platform side of the radar; however, a feature play can be a good short-term choice based on technical resources and purchasing power.

Startups often choose the observability solution of their cloud provider, which can be adequate for single-cloud applications. Vendors with startup programs, such as Oracle and Datadog, may determine the future direction of a startup’s IT operations. In some cases, a purely open-source solution may present a safer choice due to lower near-term costs. Support from the open-source community may be sufficient initially, but the support of an enterprise solution is usually a better option in the long run.

One vendor, StackState, uses the term “business aware observability,” bringing it closer to the concept of operational awareness. It has strong root cause analysis across multiple teams delivering data from multiple silos and DevOps teams. As the product matures, more established vendors and organizations where StackState has footing should pay attention.

Current AI/ML tools are largely monolithic. The number of processors, the amount of memory, the storage, and the sheer size of AI/ML tools cannot continue to expand unchecked. Platform vendors build a single AI/ML engine and use it in multiple applications. Though this does result in lower costs in the short term, resource constraints will become a problem as the use of AI/ML increases.

The use of federated, hierarchical, or edge AI/ML is a direction that may relax the requirements for the main AI/ML tool. These distributed AI/ML technologies will allow load reduction for the main AI/ML system by placing some of the workloads in edge computing resources nearer the source of the data. Distributed AI/ML cannot handle all of the load but will become appropriate for AI models of edge applications.

Vendors with hardware or virtualization backgrounds (NetApp and VMware) are already exploring distributed logic due to the nature of their products. The advent of Internet of Things (IoT) devices pushes applications farther away from the central AI. AI/ML models of these environments nearer the source will provide better detection and prediction of anomalies while still sending appropriate data to the main AI.
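The division of labor between edge and central analysis can be sketched in Python. This is a simplified illustration, not any vendor's design: a fixed range check stands in for a local model, and the function name `edge_summarize` and its parameters are hypothetical. The edge node inspects every reading but forwards only a compact summary and the out-of-range values, which reduces the load on the central system.

```python
from statistics import mean


def edge_summarize(readings, low, high):
    """Run near the data source: check each reading locally and return
    only what the central system needs (a summary plus anomalies)."""
    anomalies = [value for value in readings if value < low or value > high]
    return {
        "count": len(readings),
        "mean": mean(readings) if readings else None,
        "anomalies": anomalies,
    }


# Four raw readings stay at the edge; one summary record is sent upstream.
summary = edge_summarize([20.0, 21.0, 19.0, 80.0], low=0.0, high=50.0)
```

In practice, the local check would be a trained model suited to the edge application, and the central AI would use the forwarded summaries for broader correlation.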

Cost is not discussed overtly in this report. The variety of offerings, from free to multi-year contracts, makes direct comparisons difficult or impossible. Buyers should weigh the position of their IT organization on the journey to complete operational awareness and pick solutions that fit both their budget and maturity level.

7. About Ron Williams

Ron Williams

Ron Williams is an astute technology leader with more than 30 years’ experience providing innovative solutions for high-growth organizations. He is a highly analytical and accomplished professional who has directed the design and implementation of solutions across diverse sectors. Ron has a proven history of excellence propelling organizational success by establishing and executing strategic initiatives that optimize performance. He has demonstrated expertise in planning and implementing solutions for enterprises and business applications, developing key architectural components, performing risk analysis, and leading all phases of projects from initialization to completion. He has been recognized for promoting effective governance and positive change that improved operational efficiency, revenues, and cost savings. As an elite communicator and design architect, Ron has transformed strategic ideas into reality through close coordination with engineering teams, stakeholders, and C-level executives.

Ron has worked for the US Department of Defense (Star Wars initiative), NASA, Mary Kay Cosmetics, Texas Instruments, Sprint, TopGolf, and American Airlines, and participated in international consulting in Qatar, Brazil, and the U.K. He has led remote software and infrastructure teams in India, China, and Ghana.

Ron is a pioneer in enterprise architecture who improved response and resolution of enterprise-wide problems by deploying “smart” tools and platforms. In his current role as an analyst, Ron provides innovative technology and strategy solutions in both enterprise and SMB settings. He is currently using his expertise to analyze the IT processes of the future with particular interest in how machine learning and artificial intelligence can improve IT operations.

8. About Sue Clarke

Sue Clarke

Sue Clarke has worked as an industry analyst for almost 25 years, supplying research, analysis, and advisory services in the content management space to both organizations and vendors. She has built up a wealth of knowledge and experience having spent more than 20 years focusing on enterprise content management (ECM), in areas including document management and collaboration, records management, enterprise file sync and share, search, content analytics, case management/business process management, capture and scanning, e-discovery, web content management, digital asset management, web analytics, and customer communications management.

9. About GigaOm

GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.

GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.

GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.

10. Copyright

© Knowingly, Inc. 2022. "GigaOm Radar for Cloud Observability Solutions" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.