This blog post gives my perspective on what the following terms mean. Your mileage may vary.
Types of Tools
Monitoring
Monitoring is the continuous or on-demand inspection of data, by a person or machine, often with the intent of identifying anomalies. Data includes, but is not limited to, alerts, metrics, events, logs, traces, and other contextual information.
Monitored data is often visualized, sometimes as a time series. Increasingly, data values and patterns are used to trigger operations processes such as trouble ticket creation and automation.
Observability
Classically, observability has referred to a deeper form of monitoring, for example the internal state of an object, and/or a more powerful diagnostic capability, for example the ability to ask “any question” across multiple sources of data.
A platform does not have to use AI/ML to be considered an observability solution. Some solutions that use AI/ML position themselves as observability platforms, especially those that consolidate and analyze multiple sources of data.
AIOps
Artificial Intelligence for IT Operations (AIOps) uses artificial intelligence to automate operations activities such as incident management, mitigation/remediation, and capacity planning.
AIOps solutions provide monitoring visualizations and observability. However, the significant new area they are pushing into is operations automation. Where applicable, AIOps solutions use machine learning algorithms and models. AIOps solutions can also use other AI techniques, for example natural language processing (NLP).
Technology
Artificial Intelligence (AI)
Artificial intelligence is the simulation of human intelligence using techniques such as machine learning, deep learning, natural language processing, computer vision, and expert systems.
No network operations tool today implements all areas of artificial intelligence. Multiple tools implement machine learning and correlation. A smaller subset has implemented natural language processing.
AI Correlation
Correlation is the testing and identification of the strength and direction of the relationship between variables. AI correlation is applied to the features/variables of a model and to other contextual information.
A frequent use of correlation in AIOps solutions is identifying which alerts are related and eliminating redundant ones. Correlation can also help determine the root cause of an incident.
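To make the idea of strength and direction concrete, here is a minimal sketch that computes the Pearson correlation coefficient between alert streams and groups the ones that rise and fall together. The alert names, the per-minute counts, and the 0.8 cutoff are all made up for illustration; real AIOps tools may use very different correlation techniques (topology, time windows, text similarity, and so on).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 moves together, -1 moves opposite, 0 unrelated."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant series: correlation is undefined, treat as unrelated
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute alert counts from three alert sources.
alert_counts = {
    "link_down":   [0, 0, 5, 7, 6, 0, 0, 1],
    "bgp_flap":    [0, 1, 4, 8, 5, 0, 0, 0],
    "fan_warning": [1, 0, 1, 0, 1, 0, 1, 0],
}


def related_pairs(series, min_r=0.8):
    """Return (name_a, name_b, r) for every pair correlated at or above min_r."""
    names = sorted(series)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(series[a], series[b])
            if r >= min_r:
                pairs.append((a, b, round(r, 2)))
    return pairs


pairs = related_pairs(alert_counts)
print(pairs)
```

In this toy data the link and BGP alerts are strongly correlated and could be collapsed into one incident, while the fan warning stays separate.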
AI Algorithm (often machine learning)
Algorithms detect patterns in data through “training” on applicable data for a period of time. After that time, a model is created (see ‘model’ below).
Instead of having explicitly encoded rules like a program, machine learning recognizes patterns in data, and hence “learns” from the data. Different algorithms excel at identifying specific patterns. Some determine if a numeric value is an anomaly, while others classify.
AI Model (machine learning)
A model identifies data that does or does not fit a target pattern.
An anomaly model looks for new patterns that deviate from the pattern it was trained on, for example in optical signal levels or latency variations. A classification model looks for data that fits a known pattern for a category of data (email spam, an application traffic class, a cat, etc.).
AI Natural Language Processing (NLP)
Natural language processing converts between human-friendly languages and computer-friendly languages, for example between English and SQL.
In network operations tools, NLP has been used to provide a human-friendly query interface, detect new / rare messages through semantic analysis, and transform machine-generated alerts into human-friendly alerts.
AI Prediction
AI prediction is the conclusion, drawn from AI/ML techniques, that a network condition will exist at some time in the future.
This capability is nascent and often manual in today’s network operations tools. Today’s tools can detect anomalies, and even allow network operations teams to monitor increasing degradation over time. However, the ability to predict the condition of the network at a specific future time, with high confidence, still requires more development in most network operations tools.
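As a rough illustration of what a prediction looks like, the sketch below fits a straight line to a few utilization samples and extrapolates when a limit would be crossed. It is a deliberately naive least-squares trend, not how any particular tool works, and the sample values and the 90% limit are invented. It also says nothing about confidence, which is exactly the hard part described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(level, days, values):
    """Days after the last sample until the trend line reaches `level`.

    Returns None when the trend is flat or falling (no crossing predicted).
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (level - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily peak utilization (%) of one link.
days = [0, 1, 2, 3, 4]
utilization = [60, 62, 64, 66, 68]

remaining = days_until(90, days, utilization)
print(remaining)
```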
Anomaly Detection
Thresholds
Thresholds generate anomaly alerts based on a specific value.
Using thresholds is applicable when an exact value is well-known and applicable across most instances of an object type. For example, thresholds detect packet loss anomalies across many links and paths.
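A threshold check is simple enough to sketch in a few lines. The link names, the loss samples, and the 1% limit below are assumptions chosen for illustration; the point is that one fixed value is applied to every link.

```python
# Assumed limit for illustration: alert when packet loss exceeds 1%.
PACKET_LOSS_THRESHOLD_PCT = 1.0


def threshold_alerts(loss_by_link, threshold=PACKET_LOSS_THRESHOLD_PCT):
    """Return the links whose packet loss (%) is above the fixed threshold."""
    return sorted(link for link, loss in loss_by_link.items() if loss > threshold)


# Hypothetical packet loss samples (%) per link.
samples = {"nyc-bos": 0.0, "nyc-lon": 2.4, "sfo-tok": 0.3, "lab-lan": 1.7}

alerts = threshold_alerts(samples)
print(alerts)
```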
Machine Learning
Machine learning models generate anomaly alerts based on patterns.
Using algorithms is applicable when an exact value is not known or differs across most instances of an object type. For example, algorithms/models can detect latency anomalies across many links with widely different distances.
Using algorithms is also applicable for detecting "gray" failures, that is, detecting anomalies before an outage or incident.
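The latency example can be sketched with a per-link baseline learned from history. This uses a mean and standard deviation per link, which is a simple statistical stand-in for a trained model rather than a real machine learning algorithm; the link names, latency values, and the 3-sigma cutoff are assumptions for illustration.

```python
from statistics import mean, stdev


def train_baseline(history):
    """Learn a (mean, standard deviation) baseline for each link from its history."""
    return {link: (mean(values), stdev(values)) for link, values in history.items()}


def is_anomaly(baseline, link, value, max_sigma=3.0):
    """Flag a value that is more than max_sigma deviations from the link's own mean."""
    mu, sigma = baseline[link]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical latency history (ms) for a short link and a long link.
history = {
    "metro":         [2.0, 2.1, 1.9, 2.2, 2.0, 1.8],
    "transatlantic": [71.0, 70.5, 72.0, 71.5, 70.8, 71.2],
}
baseline = train_baseline(history)

print(is_anomaly(baseline, "metro", 5.0))           # 5 ms on a link that normally sits near 2 ms
print(is_anomaly(baseline, "transatlantic", 71.9))  # 71.9 ms on a link that normally sits near 71 ms
```

No single latency threshold works for both links: a 50 ms limit would alert constantly on the long link and never catch the short link more than doubling its latency. A per-link baseline handles both.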
Conclusion
Monitoring is something that can be done by a human or by a machine. Observability is something that assists incident triage. AIOps seeks to automate workflows from data collection to ticket creation and mitigation/remediation. AIOps solutions may support monitoring, observability/incident triage, and automation.