When developing modern cloud applications, making your components observable in production is a critical effort. A point of confusion I have encountered while working with various engineering teams is when to use metric telemetry and when to use logging telemetry. In some cases, teams question whether to use dedicated metric events & tooling at all, since the data can be placed in logs and extracted via a processing pipeline. In others, the question is subtler: how to use both data types to their best advantage.
Both logs and metrics have advantages, and an application should play each to its strengths. At a high level, the following statements summarize the guidance:
Use Metrics to answer the question, "Is my service running correctly?". Use Logs to answer the question, "What went wrong with my service?"
Is my service running correctly?
This particular question is very open-ended; the answer depends on which stakeholder of your project you ask. The engineers may be concerned about resource consumption, crashes, and error rates, while business-oriented team members may be interested in data related to user sign-up rates, session time, and resource costs.
Metric data formats are particularly suitable for recording and reporting this sort of data, as it aligns with the strengths that time-series metrics aim for:
- A numerical, point-in-time measure
- Potentially numerous data sources to aggregate
- Metadata labels for filtering/drill-downs
- Low-latency processing
- Storage over long periods for review/reporting
- Automated alerting
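As a concrete illustration, a single time-series sample can be modeled as a metric name, a set of metadata labels, a timestamp, and a numeric value. This is a generic sketch of the data shape, not any specific vendor's wire format:

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class Sample:
    """One point-in-time measurement of a named metric."""
    name: str            # e.g. "http_requests_total"
    labels: tuple        # metadata label pairs for filtering/drill-downs
    value: float         # the numeric measurement
    timestamp: float = field(default_factory=time.time)

# A request counter sample from host "web-1", tagged for later filtering.
s = Sample(name="http_requests_total",
           labels=(("hostname", "web-1"), ("status", "200")),
           value=1.0)
```

Because the shape is small and fixed, samples like this compress and transmit efficiently, which is what enables the seconds-level collection-to-query latency described above.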
Most metric formats in use today are very minimal and pre-formatted for consumption by a particular software platform or service. This known format has the advantage of being very efficient for transmission and processing, allowing many metric solutions to have a latency between collection and query availability of only seconds.
The rapid processing of time-series metrics also plays to another strength: enabling automated alerting. Being informed in near real time that a metric or set of metrics is no longer at an acceptable value lightens the workload on human operators.
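A minimal sketch of such threshold-based alerting over a stream of metric values; the window size and threshold here are illustrative assumptions, not a recommended policy:

```python
from collections import deque

def should_alert(values, threshold=0.05, window=5):
    """Fire when the average of the last `window` error-rate samples
    exceeds `threshold` (hypothetical alerting policy)."""
    recent = list(values)[-window:]
    if len(recent) < window:
        return False  # not enough data yet to decide
    return sum(recent) / len(recent) > threshold

# Keep a rolling hour of per-minute error rates.
error_rates = deque(maxlen=60)
for rate in [0.01, 0.02, 0.09, 0.12, 0.11, 0.10]:
    error_rates.append(rate)

print(should_alert(error_rates))  # True: recent average is above 0.05
```

Averaging over a window rather than alerting on a single sample is a common way to avoid paging operators on momentary spikes.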
For business-related stakeholders, the compression & aggregation features of metric solutions are a significant strength. These stakeholders tend to look at data over large time ranges, so storing data for long periods in a cost-effective manner is essential. The metadata attached to a particular time series does not change; only the timestamp and value do, leading to excellent storage efficiency. Aggregating multiple time series together for reporting further improves on this compression and allows the source data to be trimmed out of storage if unneeded.
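The aggregation that makes long-term storage cheap can be sketched as downsampling: collapsing many raw samples into one coarser point per time bucket, after which the raw points can be trimmed. This is a generic sketch; real metric stores vary in how they do this:

```python
def downsample(samples, bucket_seconds=3600):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        key = int(ts // bucket_seconds) * bucket_seconds
        buckets.setdefault(key, []).append(value)
    # One averaged point per bucket, in time order.
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

raw = [(0, 10.0), (600, 20.0), (1200, 30.0), (3600, 40.0)]
print(downsample(raw))  # [(0, 20.0), (3600, 40.0)]
```

Three raw points in the first hour collapse to a single averaged point, so a year of per-minute data can be kept as hourly data at a fraction of the cost.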
What went wrong with my service?
When something exceptional happens in a service, time-series metrics can notify an operator that it is occurring and help them begin an investigation. But in most cases, time-series metric analysis leads to a set of log data to review, which takes the investigation further. This transition from metric data to log data during an investigation plays very well to the strengths of the log data type:
- Time-ordered set of events
- Detailed textual data for human consumption
- Flexible data processing options that enable indexing for searching/analysis
Because log data is time-ordered for a given component, an operator investigating exceptional telemetry can view all the events emitted by that component. Unlike metric data, in most cases log events are not aggregated, so they are a complete record of what was occurring at a given time. This rich, potentially high-detail data (depending on log verbosity) can allow operators to quickly home in on potential root causes of exceptional issues in production. Furthermore, many log analysis solutions offer features for complex processing of log data, which can assist in highlighting issues or investigating past data further.
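The investigation pattern above can be sketched with a few structured events: filter the time-ordered stream down to the one component the metrics pointed at, keeping the ordering intact. The event fields and hostnames here are invented for illustration:

```python
events = [  # time-ordered structured log events
    {"ts": 1, "hostname": "web-1", "level": "info",  "msg": "request started"},
    {"ts": 2, "hostname": "web-2", "level": "error", "msg": "db timeout"},
    {"ts": 3, "hostname": "web-2", "level": "info",  "msg": "retrying"},
]

def events_for(component, log):
    """All events for one component, preserving time order."""
    return [e for e in log if e["hostname"] == component]

for e in events_for("web-2", events):
    print(e["ts"], e["level"], e["msg"])
```

Because nothing was aggregated away, the operator sees not just the error but the surrounding events ("retrying") that explain what the component did next.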
The power & flexibility of processing log data has some potential drawbacks:
- It tends to be computationally expensive, compared to metric processing, so ingestion latency tends to be much higher. How much higher varies considerably based on the log data & solution features.
- Due to its text-based nature, log data tends to be orders of magnitude larger than metric data when stored.
To get the most value from both metrics & logs to answer the two critical questions, here are some best practices:
- Use log data to record errors & warnings.
- Log events should be rich in actionable information.
- Have a metric counter for every operation that the service is performing and whether it was a success or failure.
- Use a structured log format rather than free-form text.
- Use metrics to create automated alerts.
- Use dimension labels to decorate logs & metric data with useful metadata about the event/time-series.
- Use the same label & values for the same metadata between logs & metrics. If the logs emit a label called "hostname=foo," the metric telemetry for the same component should have this same label & value.
- Use histogram metrics to bucket measurements that occur over time (e.g., HTTP request processing, dependency calls).
- Have a short retention period for log data. How short depends on specific usage needs, but it would typically be in the range of a week or two.
- Be aware of what metrics should be stored long term & which should not. While the storage cost of metrics is lower than log data, storing data longer than is needed is still a wasted cost.
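Several of these practices can be sketched together: a structured (JSON) log event and the metric telemetry for the same operation sharing the same label names and values, a counter per operation outcome, and a histogram bucketing the operation's duration. The metric names, label values, and bucket bounds below are illustrative assumptions:

```python
import json
from collections import Counter

LABELS = {"hostname": "web-1", "service": "checkout"}  # shared by logs & metrics

def log_event(level, msg, **fields):
    """Emit one structured (JSON) log line carrying the shared labels."""
    print(json.dumps({"level": level, "msg": msg, **LABELS, **fields}))

op_counts = Counter()           # one counter per (operation, outcome)
BUCKETS = [0.1, 0.5, 1.0, 5.0]  # histogram bucket upper bounds, in seconds
hist = Counter()

def bucket_for(duration):
    """Smallest bucket bound that the duration fits under."""
    for bound in BUCKETS:
        if duration <= bound:
            return bound
    return float("inf")

def record_request(duration, ok):
    """Count every operation as success/failure and bucket its duration;
    log only the exceptional (failed) case, with actionable detail."""
    op_counts[("http_request", "success" if ok else "failure")] += 1
    hist[bucket_for(duration)] += 1
    if not ok:
        log_event("error", "request failed", duration=duration)

record_request(0.3, ok=True)
record_request(2.0, ok=False)
```

Because the failed request's log line carries the same `hostname` and `service` labels as its failure counter, an operator alerted by the metric can pivot straight to the matching log events.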