Routing Will Never Be A Solved Problem
The Important
There will never be perfect information
There will always be an area of futzing to find optimization points across multiple competing objectives
We are just beginning to dive into some areas of networking, so there is much discovery yet to happen
For all these reasons, routing will never be a solved problem, and that is ok. We move forward with the good, accepting there may never be a perfect, and addressing new problems as new uses of networking emerge, new types of topologies appear, and the use of networks stretches what we have already discovered about networking.
[Nov 11, 2022: See the later written “Mitigation vs Remediation” for an explanation of what is meant by a “solved problem”]
[Image: Voltaire. Source: Wikipedia]
Physics Prevents Some Problems From Ever Being Solved
A takeaway in the article “Network Science: Reducing Entropy, for a Given Capability, With a Desired Outcome, Through Space” was “Focusing on an (unattainable) perfect, delays the realization of a good”. This has not happened, to date, thankfully. The IETF/Internet ethos of rough consensus and running code has kept the ball moving forward for decades now. However, it could happen if the industry ever became too obsessed with perfection.
For starters, the industry has to acknowledge that some problems are not, as far as science currently knows, solvable.
The universe is full of space, crossing that space always involves latency, and the universe has a speed limit; therefore no router can have the same knowledge as all other routers at the same time. Best case, nothing changes for a while and routers do have similar knowledge, but when an event occurs, some routers will know before others. Or, as stated in the work-in-progress article on Information Axioms: No information function is omniscient or omnipotent - which should be a humbling reminder to all of us the next time we become too full of ourselves. You don’t know everything, and neither does anything else.
This is not a solvable problem, and as a result there will always be packet loss, micro loops, and/or higher-than-usual delays. We cannot overcome what we believe to be the laws of physics. Not even quantum networking can currently beat the universe’s speed limit, because, today, time-delaying corrections still have to be made.
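To put a rough number on this floor: light in fiber travels at roughly two-thirds the speed of light in a vacuum, about 200,000 km/s, so distance alone bounds how quickly any router can even hear about a remote event. The sketch below is purely illustrative - the distances are hypothetical and only propagation delay is counted, no queuing, serialization, or processing.

```python
# Rough, illustrative sketch: minimum one-way propagation delay over fiber.
# Assumes signals in fiber travel at ~2/3 of c (~200,000 km/s); the distances
# are hypothetical examples, not measurements of any real network.

SPEED_IN_FIBER_KM_PER_S = 200_000  # approx. 2/3 of the speed of light in vacuum

def min_one_way_delay_ms(distance_km: float) -> float:
    """Lower bound on one-way latency from geometry alone
    (no queuing, serialization, or processing delay included)."""
    return distance_km / SPEED_IN_FIBER_KM_PER_S * 1000

for name, km in [("metro ring", 80), ("coast to coast", 4_500), ("transoceanic", 10_000)]:
    print(f"{name:>15}: >= {min_one_way_delay_ms(km):.1f} ms before any other router can learn of an event")
```

However small, that window is never zero, and during it different routers necessarily hold different views of the network.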
A related issue is that routers cannot all agree on when an event occurs. Event information could conceivably be timestamped using a highly precise clock that all routers reference and are synchronized to. Currently, this is not done for control plane exchanges, and there does not appear to be much energy behind adding that complexity. Perhaps as AI/ML/self-driving networks evolve, a compelling reason for doing this will emerge.
Even if this is done, the same laws of physics prevent all routers from seeing event information in the same sequence. Routers distributed in space experience different latencies from an event source, so one router may see one sequence of events and another router may see a different sequence. That is not even considering the added complexity of how flap dampening leads to further timer divergence across a network. What are the implications of this for today’s control planes, let alone tomorrow’s AI/ML capabilities?
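As a minimal illustration of this relativity of ordering, consider two hypothetical routers, R1 and R2, observing two link events. Every delay and timestamp below is invented purely to show how position alone changes the perceived sequence.

```python
# Minimal sketch: two observers, two events, different perceived orderings.
# All delays and timestamps are invented for illustration only.

# Event: (name, time it occurs in ms, location it occurs at)
events = [("link A-B down", 0.0, "A"), ("link C-D down", 1.0, "C")]

# One-way propagation delay (ms) from each event location to each router.
delay_ms = {
    ("A", "R1"): 1.0,  ("C", "R1"): 30.0,   # R1 sits near A, far from C
    ("A", "R2"): 30.0, ("C", "R2"): 1.0,    # R2 sits near C, far from A
}

for router in ("R1", "R2"):
    arrivals = sorted(
        (occurs_at + delay_ms[(loc, router)], name)
        for name, occurs_at, loc in events
    )
    order = " then ".join(name for _, name in arrivals)
    print(f"{router} perceives: {order}")

# R1 perceives: link A-B down then link C-D down
# R2 perceives: link C-D down then link A-B down
```

Both routers are behaving correctly; they simply cannot see the same history at the same time.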
You can’t solve the laws of physics, as far as we know.
Some problems are in the realm of customer preferences
Does a network operator want to overbuild or run a network hot?
Does a network operator want to deliver good enough quality at one price point or premium quality at another price point?
Network operators have varying views on capacity (storage, compute, network), capabilities (policy, traffic engineering, QoS, …), and quality (customer experience at a given price point). IP routing, in its many approaches, actually deals with this quite well. The bottom line, however, is that there will always be levers and knobs for different network operators to optimize and manage their networks according to their value-chain-aligned view of networking. Not everyone wants a datagram service, and not everyone needs ultra-low latency. There will also be some level of added complexity emanating from a) resources not being infinite and b) varying operational optimizations and desired business outcomes, based on tradeoffs between operations, the network, and service capabilities, capacity, and quality.
There is no one-size-fits-all reality here, just as there often is not in markets.
Solutions to some problems have not been discovered yet
The Internet ethos started with a network that automates itself. For example, the best route to a prefix and/or the topology of the network are discovered by the routers themselves. So here is a network that was able to achieve global scale, supporting billions of devices, through self-automation of its core function - determining the best way to get from A to B.
However, we are currently in an era, somewhat amplified by the movement to segment routing, where it is assumed that the network will be automated by a controller. Maybe the controller leverages the network control plane, but still, it is doing the automation. There are numerous arguments for and against this approach.
Example for: only a controller can plan disjoint paths, for the same service, from multiple provider edge devices (see the sketch below) - this may actually be the strongest argument for controllers (notwithstanding that creative engineers may be able to find a way for edge devices to cooperate on achieving the same).
Example against: the same information relativity problems that challenge routers also challenge controllers, and controllers become a centralized failure point that has not existed before, so we, as an industry, don’t really know the consequences of basing all networks on this approach (notwithstanding that controllers supporting IP/MPLS are already deployed).
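To make the “for” example concrete, here is a toy sketch on an invented topology: two provider edges, PE1 and PE2, each independently pick their shortest path to an egress E and end up sharing a link, while a controller-style computation with a global view can keep the two paths link-disjoint. Node names, weights, and the greedy “ban the links already used” step are all assumptions for illustration, not how any particular controller works.

```python
# Toy sketch (invented topology): why per-PE shortest-path decisions can
# collide on shared links, while a global view can keep paths disjoint.
from heapq import heappush, heappop

# Undirected weighted graph: both PE1 and PE2 reach the same egress E.
graph = {
    "PE1": {"M": 1, "X": 3},
    "PE2": {"M": 1, "Y": 3},
    "M":   {"PE1": 1, "PE2": 1, "E": 1},
    "X":   {"PE1": 3, "E": 3},
    "Y":   {"PE2": 3, "E": 3},
    "E":   {"M": 1, "X": 3, "Y": 3},
}

def shortest_path(graph, src, dst, banned_links=frozenset()):
    """Plain Dijkstra; 'banned_links' lets a global view exclude links
    already claimed by another path."""
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heappop(heap)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr, weight in graph[node].items():
            if frozenset((node, nbr)) in banned_links:
                continue
            heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None

# Independent decisions: both PEs pick the cheap path through the M-E link.
p1 = shortest_path(graph, "PE1", "E")
p2 = shortest_path(graph, "PE2", "E")
print("independent:", p1, p2)            # both traverse M-E

# Controller with a global view: compute PE1's path, then forbid its links
# when computing PE2's path, yielding link-disjoint paths.
used = {frozenset((a, b)) for a, b in zip(p1, p1[1:])}
p2_disjoint = shortest_path(graph, "PE2", "E", banned_links=used)
print("coordinated:", p1, p2_disjoint)   # PE2 shifts onto the Y-E path
```

Real controllers use far more sophisticated path computation than this greedy two-pass approach, but the underlying advantage being claimed is the same: a single global view of which resources each path consumes.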
Just as industry sentiment has shifted to controller-based paradigms, approaches like RIFT come along with a different vision: automation in networks is not like automating a server; automation need not be at the unit level but at the network level, and the network has to be self-automating (autonomic).
So there is this whole new design space in which to discover approaches, optimization points, customer preferences, unintended consequences, etc.
Conclusion
We could dumb down networks to having only two routing options, a primary and a backup: when the primary fails, the backup automagically cuts in, with very little lost time. In fact, ring topologies are somewhat like this anyway, and it is questionable how much they should be complicated beyond what is minimally needed. On the other hand, if we want rich connectivity, with many redundancy options, that is survivable across multiple failures, then perhaps “dumbing down” the network is not the right way to go, and we should accept some of the limitations of routed networks.
Due to the laws of physics, some problems are unsolvable.
Some problems need to be responded to with varying optimization points so network operators can pursue different value propositions / business models.
Some problems the industry is in the early stages of learning about.
If we sat down and said we are not doing anything until we can implement a perfect network with perfect information, we would never get anything done. An unattainable perfect would prevent the realization of so much good. In the colloquial, the perfect is the enemy of the good. What we have in the Internet is not just good, it is great, warts and all. We live lives that none of us could have forecast just a couple of short decades ago.