Complex systems are slow to change, for reasons that include vested interests, multi-vendor considerations, new operations training, and technology maturity. By 2025, most networks will not be autonomous, especially service provider underlay networks. Network automation is only now starting to emerge, and it is in its early forms. Autonomy is not a realistic expectation in the next few years. The next period of networking will be characterized by augmented routing, where the distributed IP control plane is still significant, usually primary, but centralized / regionalized systems augment the control plane with post-event optimizations, and also play a role in configuration, analytics, and service automation. Some domain-specific autonomy, for example campus WiFi, may emerge and evolve faster than large-scale WANs, but overall, there is a significant journey ahead.
What is Augmented Routing?
Augmented routing is assistance provided to a distributed control plane. The distributed control plane operates as it always has, discovering and advertising topology / paths, but when there is an optimization opportunity, the augmented routing controller overrides the choices that would otherwise be made by the distributed control plane.
The chief use case today is the creation and maintenance of disjoint paths. For example, there may be redundant links to a customer location, but traffic to and from these links should follow different paths through the network, so that if one link fails, traffic switched to the other link has a truly redundant path through the network. In practice, this requires more information than a traditional IP control plane has: knowledge of the underlying physical infrastructure, including fiber ducts. That is a gap today, but a) some types of network failures are more common/likely than others and b) networking solution suppliers are looking to address this gap with capabilities such as identifying shared risk link groups (SRLGs) and percolating information up from the optical layer.
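To make the disjoint-path use case concrete, here is a minimal sketch of how a controller might pick a backup path that shares no physical risk with the primary. The topology, node names, and duct identifiers are invented for illustration, and the approach (compute the shortest path, then recompute while banning every risk group the primary touches) is a simple heuristic that can miss valid pairs on some topologies. It is not a description of any vendor's implementation.

```python
import heapq

# Hypothetical topology: (node_a, node_b, IGP cost, shared-risk group ids).
# Links that list the same id are assumed to share a physical risk,
# such as a fiber duct. At most one link per node pair is assumed.
LINKS = [
    ("PE1", "P1", 1, {"duct-1"}),
    ("PE1", "P2", 1, {"duct-2"}),
    ("P1", "PE2", 1, {"duct-3"}),
    ("P2", "PE2", 1, {"duct-3"}),  # different routers, same duct as P1-PE2
    ("P2", "P3", 1, {"duct-4"}),
    ("P3", "PE2", 1, {"duct-5"}),
]

def shortest_path(links, src, dst, banned=frozenset()):
    """Dijkstra that ignores any link touching a banned risk group."""
    adj = {}
    for a, b, cost, srlgs in links:
        if srlgs & banned:
            continue
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    heap = [(0, [src])]
    done = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt, cost in adj.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (dist + cost, path + [nxt]))
    return None

def srlgs_on_path(links, path):
    """Collect every risk group used by the links along a node path."""
    hops = {frozenset(hop) for hop in zip(path, path[1:])}
    risks = set()
    for a, b, _cost, srlgs in links:
        if frozenset((a, b)) in hops:
            risks |= srlgs
    return risks

def disjoint_pair(links, src, dst):
    """Return (primary, backup); backup is None if no risk-disjoint path exists."""
    primary = shortest_path(links, src, dst)
    if primary is None:
        return None, None
    banned = frozenset(srlgs_on_path(links, primary))
    return primary, shortest_path(links, src, dst, banned)

primary, backup = disjoint_pair(LINKS, "PE1", "PE2")
print("primary:", primary)
print("backup: ", backup)
```

In this toy topology, a path through P2 directly to PE2 would be node-disjoint from the primary yet would still ride the same duct. Only the physical-layer information lets the controller prefer the longer path through P3.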
Another use case is post-recovery optimization. After a link failure, it may be necessary to move traffic to a new path. There may be fast reroute options already installed, or there may be post-failure decisions made by the distributed control plane based on its best understanding of the working topology. Either way, many different sources may choose common path elements / segments to send traffic on, and the result may benefit from optimization after all the re-routing has occurred. Fast, local, distributed action is a good thing, providing for the fastest restoration of paths, but a controller with a global view of the network, and with access to global intent/policy and analytics, may be able to compute a more optimal distribution of traffic.
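The effect described above can be illustrated with a small sketch. Everything here is an assumption made for the example: the link names, capacities, demands, and precomputed candidate paths are invented, and the controller logic is a simple greedy heuristic (place the largest demand first on the candidate that minimizes the worst link utilization), not an optimal solver or any product's algorithm.

```python
# Hypothetical link capacities in Mb/s for the post-failure topology.
CAPACITY = {"A-C": 100, "B-C": 100, "C-D": 100,
            "A-E": 100, "B-E": 100, "E-D": 100}

# Hypothetical demands: (name, Mb/s, candidate paths ordered by IGP cost).
DEMANDS = [
    ("A->D", 60, [["A-C", "C-D"], ["A-E", "E-D"]]),
    ("B->D", 70, [["B-C", "C-D"], ["B-E", "E-D"]]),
]

def max_utilization(placement):
    """Worst load/capacity ratio over all links for a set of placed demands."""
    load = {link: 0 for link in CAPACITY}
    for _name, mbps, path in placement:
        for link in path:
            load[link] += mbps
    return max(load[link] / CAPACITY[link] for link in CAPACITY)

def distributed_choice(demands):
    """Each source independently takes its lowest-cost candidate path."""
    return [(name, mbps, paths[0]) for name, mbps, paths in demands]

def controller_choice(demands):
    """Greedy global placement: largest demand first, least-congesting path."""
    placement = []
    for name, mbps, paths in sorted(demands, key=lambda d: -d[1]):
        best = min(
            paths,
            key=lambda p: max_utilization(placement + [(name, mbps, p)]),
        )
        placement.append((name, mbps, best))
    return placement

print("distributed:", max_utilization(distributed_choice(DEMANDS)))
print("controller: ", max_utilization(controller_choice(DEMANDS)))
```

With these made-up numbers, both sources independently select the path through link C-D and oversubscribe it, while the global placement moves one demand onto the alternate path and keeps every link under capacity.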
Whether the controller can optimize better than a distributed control plane will depend on a number of factors. Is the compute equation more processor / memory intensive than what edge routers are able to complete without impacting network operation? Does the controller have access to more information than a distributed control plane (either information about the network or policy information)? Can a controller coordinate in a better fashion than distributed routers?
The answers to these questions depend partly on the implementation and the age of the equipment. Coordination is an (in)famous economics problem, an economy being just another type of network. Today, coordination among routers is loose at best. There is a mechanism for distributed learning that provides a best current and common view of the network topology / network paths, but each router ultimately makes routing decisions by looking at the same information as other routers and deciding which route/path is the best. In fact, pre-MPLS (RSVP-TE), networks would fail if each router did not come to the same conclusion about what the best path was. It is this potential convergence on the same view of the best path that may lead to suboptimal routing from an overall customer experience perspective, and where coordination may add value. The extent to which a controller adds value will be implementation and information dependent, information being the foundation for coming to different/better conclusions.
Why Not Autonomous?
This year, tech media has been running articles on how Uber “wasted” $2.5B on developing autonomous vehicle technology, and now rumors are being printed that Uber is in talks to sell its Advanced Technology Group (ATG). Autonomous anything is not simple, and it requires significant investment over a long period of time. There has not been enough investment in autonomous networks to realize reliable solutions, and there certainly have not been enough touch points to decrease uncertainty / increase confidence. This vision of networking is unlikely to be realized by 2025, even if some isolated domains of networking make significant progress.
The car industry came to the realization that just driving around does not generate enough information to dramatically accelerate the learning process. Simulation, reinforcement learning, and other techniques are likely required. While networking can certainly learn from that experience, there will still likely be a learning curve. There is also increasing awareness of the simulation gap in networking, for multiple use cases. Some highly skilled and operations-focused groups have asserted their ability to create simulators, but in general, it feels like this is a gap in networking today.
The Internet is critical national / international infrastructure. Faults have consequences. The criticality of the infrastructure will be accompanied by some conservatism with respect to change. This is true for both automation(1) and autonomy(2). Automation is still in its early life when it comes to most networks.
Dynamic routing is a difficult problem. Very difficult. Neither event sequence nor topology information can be guaranteed to be the same at every router, or every controller. That is an unsolvable problem because the universe has a speed limit.
Network operators have spent decades honing their craft, learning how to reduce potential networking problems. What CLI commands to execute to diagnose problems. How to divide a network into areas, where supported, to reduce the span of topology advertisement. What ratio between capacity and load seems to provide a good network experience, most of the time. Much more than this. That comfort and familiarity with the way things are done today won’t easily translate to rapid conversion to controller-based networking, and certainly not to fully autonomous networks. Some network managers are using controllers today. Some network managers are using controllers in a monitor-only mode. Some network managers are letting controllers sit in a virtual corner and are not doing much with them. Change creates uncertainty. Uncertainty requires information (why, how, and who has done it successfully). Information takes time.
Companies that promise customers grand sci-fi-like visions are likely to find customers disappointed, and their brand damaged, as customers come to grips with the realities of those promises. Companies that focus on specific, pragmatic, augmented routing opportunities may lose the hype battles but, on the upside, win the customer experience wars.
Conclusion
People are reluctant to change. End-to-end autonomy will take time. So, the next five years will be characterized by augmented routing. A highly capable distributed control plane, taking fast action when a failure occurs, augmented by centralized scale-out intelligence, receiving an increasing amount of data, and learning how to turn that data into value.
Notes:
(1) This article defines automation as the efficient execution of repeatable tasks. Automation is characterized by workflows.
(2) This article defines autonomy as the human-independent operation of a network, including intent-based recovery from failure. Autonomy is characterized by learning and self-directed adjustment to changing conditions.