The Important
While packet networks, underlay and overlay, are moving away from fine-grained, hop-by-hop QoS / bandwidth reservations, the Telecom standards bodies and ecosystem are doubling down on new complexity to achieve better / more deterministic performance, intended to support Ultra Reliable and Low Latency Communications (URLLC).
The packet trend is the inevitable result of decades of experience with the services that have been available up to now.
Someone’s got it wrong and/or these different approaches will exist in different parts of the network.
Using the three olive martini network design / architecture model, it is clear that one of the options is to add more capacity. Adding capacity simplifies many problems.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fecc7d6ae-2b56-4594-89d8-48353783a026_1200x696.jpeg)
At the dawn of the century there was a networking protocol/approach called ATM. It was on its way out, though that was not obvious to everyone, so it was still hanging on. In addition to asserted jitter benefits from equal-sized packets, one of its claims to fame was the ability to make bandwidth reservations, end to end, for all clients of the network. There was also an IP-layer architecture called IntServ, which used RSVP and likewise asserted the ability to do bandwidth reservations. You don’t read much about ATM or IntServ today, except in history books.
The label switching protocol that succeeded ATM was, of course, IP/MPLS, which combined an IP control plane with a label-switched forwarding plane, mapping IP prefixes directly to labels. IP/MPLS has two different label distribution protocols, LDP and RSVP-TE. RSVP-TE supports end-to-end bandwidth reservations; LDP does not. IP/MPLS has been a phenomenal success, and both LDP and RSVP-TE are deployed.
Segment Routing for MPLS (SR-MPLS) is the next generation of label switching, and it is gaining pretty good momentum, if not mind share. Segment Routing for IPv6 (SRv6) is part of this narrative as well. Segment Routing does not, today, have the ability to do bandwidth reservations. If you want to know what a network can do, look at the information in the forwarding and control planes. The Segment Routing forwarding plane does not carry state for each pair of clients on the network, so it can’t apply bandwidth reservations at that level of granularity. Maybe bandwidth reservations could be done at some coarser granularity of segments, for example for different virtual topologies. This is true whether using Segment Routing’s distributed control plane by itself, or using a centralized controller.
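To make the state argument concrete, here is a toy sketch (illustrative only, with made-up router names and label values) of the difference in per-hop state between an RSVP-TE style reservation and an SR-MPLS label stack, where the path lives in the packet and transit nodes keep no per-flow state.

```python
# Toy illustration, not a protocol implementation.
# RSVP-TE style: every router along the LSP must hold per-LSP state,
# including the reserved bandwidth, before traffic can flow.
rsvp_te_state = {
    "R1": {"lsp-101": {"next_hop": "R2", "reserved_mbps": 50}},
    "R2": {"lsp-101": {"next_hop": "R3", "reserved_mbps": 50}},
    "R3": {"lsp-101": {"next_hop": "R4", "reserved_mbps": 50}},
}

# SR-MPLS style: the ingress pushes a label (segment) stack that spells
# out the path; transit routers only know how to handle their own segments.
sr_packet = {
    "label_stack": [16002, 16003, 16004],  # node segments for R2, R3, R4 (made up)
    "payload": "...",
}
sr_transit_state = {
    "R2": {16002: "pop, forward toward R3"},
    "R3": {16003: "pop, forward toward R4"},
}

# Note there is nowhere in the SR transit state to record
# "50 Mbps reserved for this particular flow" -- which is exactly
# the granularity-of-reservations point made above.
print(len(rsvp_te_state["R2"]), len(sr_transit_state["R2"]))
```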
A centralized controller does get more information than distributed routers, for example telemetry, but it cannot solve the problem of information that does not exist. A centralized controller may be able to optimize what paths traffic takes in a network, but it can’t implement capabilities for which it does not have the necessary information. The good thing about Segment Routing is that the control plane does not know about every end-to-end path or every flow in the network. The bad thing about Segment Routing is that the control plane does not know about every end-to-end path or every flow in the network.
Unless of course, such knowledge and such granularity of QoS is just not needed or worth the tradeoffs.
Before the era of IP VPNs, the two most prominent technologies for Enterprise network services were ATM and Frame Relay. Technically, both Frame Relay and ATM Enterprise services could have used bandwidth reservations, especially when PNNI was being used. In the end, however, it was easier for operators to oversubscribe the network by a heuristic factor, for example 4-10 times, that would result in customers getting reasonably good service, most of the time. IP/MPLS RSVP-TE could have been widely deployed with bandwidth reservations as well, along with other advanced features, but it was mostly used to create tunnels that steered traffic along engineered paths; with the right capacity dimensioning, most users of the network receive reasonably good service, most of the time.
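The back-of-the-envelope arithmetic behind that heuristic looks something like the sketch below (all numbers are assumptions chosen only to illustrate the 4-10x range mentioned above).

```python
# Oversubscription arithmetic with assumed, illustrative numbers.
trunk_mbps = 1000        # physical trunk capacity (assumed)
customers = 40           # access circuits sharing the trunk (assumed)
committed_mbps = 100     # rate sold to each customer (assumed)

sold_mbps = customers * committed_mbps       # 4000 Mbps of commitments
oversubscription = sold_mbps / trunk_mbps    # 4.0x, the low end of 4-10x

# If each customer is active roughly 20% of the time (assumption),
# the expected offered load still fits under the trunk rate.
expected_load_mbps = sold_mbps * 0.20        # 800 Mbps

print(f"oversubscription: {oversubscription:.1f}x, "
      f"expected load {expected_load_mbps:.0f} of {trunk_mbps} Mbps")
```

Most of the time the bet pays off; the cost is the occasional busy period when everyone peaks at once.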
In the three olive martini model I use for network design / architecture, there are options/tradeoff relationships between service quality, network capabilities and network capacity.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8c76b980-e587-4881-8716-e84df4a176c5_526x148.png)
Figure 1. Options/Tradeoffs between operations, network, and services.
The more capacity an operator has, the fewer traffic management capabilities are needed, and the better the overall service quality. What is the better choice, more capacity or sophisticated QoS mechanisms at every hop? The challenge with QoS is that it is easy to understand how QoS should be implemented at one router, but when there is a complex topology, a complex service/application mix, and complex arrangements between network operator and customer, understanding and managing the alignment of end-to-end QoS settings at each router becomes, well, complex.
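A minimal queueing sketch shows why capacity buys quality. It uses the textbook M/M/1 delay formula purely to illustrate the shape of the curve, not as a model of any real network; the link speeds, load, and packet size are all assumptions.

```python
# Illustrative only: mean queuing + service delay on a single link,
# using the M/M/1 formula W = 1 / (mu - lambda).
def mm1_delay_ms(link_mbps: float, offered_mbps: float,
                 avg_pkt_bits: float = 12000) -> float:
    """Mean time in system, in milliseconds, for an M/M/1 queue."""
    service_rate = link_mbps * 1e6 / avg_pkt_bits    # packets per second
    arrival_rate = offered_mbps * 1e6 / avg_pkt_bits
    if arrival_rate >= service_rate:
        return float("inf")                          # saturated link
    return 1000.0 / (service_rate - arrival_rate)

offered = 800  # Mbps of offered traffic (assumed)
for capacity in (1000, 2000, 10000):                 # 1G, 2G, 10G links
    print(f"{capacity:>5} Mbps link -> {mm1_delay_ms(capacity, offered):.3f} ms")
```

Delay collapses as headroom grows, which is the "just add capacity" argument in one loop.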
With artificial intelligence/machine learning, managing this fine-grained complexity *might* be possible one day, but for now the easier path is to deploy more capacity, some of which is needed anyway to carry traffic when there is a failure. And when there is a failure, priority-based queuing (DiffServ), even without bandwidth reservations, can be used to protect the most important traffic.
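A minimal sketch of that priority-queuing idea, assuming a strict-priority scheduler with two classes (the class names and packets are made up): the important queue is always drained first, and no per-flow bandwidth reservation is involved.

```python
from collections import deque
from typing import Optional

# Two classes only, for illustration: "priority" and "best_effort".
queues = {"priority": deque(), "best_effort": deque()}

def enqueue(pkt: str, important: bool) -> None:
    queues["priority" if important else "best_effort"].append(pkt)

def dequeue() -> Optional[str]:
    """Strict priority: serve best effort only when the priority queue is empty."""
    for name in ("priority", "best_effort"):
        if queues[name]:
            return queues[name].popleft()
    return None

enqueue("voice-1", important=True)
enqueue("backup-1", important=False)
enqueue("voice-2", important=True)
print([dequeue() for _ in range(3)])   # ['voice-1', 'voice-2', 'backup-1']
```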
Underlay networks are headed away from bandwidth reservation. Overlay networks, like SD-WAN, have also moved away from bandwidth reservations. Arguably, bandwidth reservations are meaningless in the SD-WAN paradigm anyway: certainly without underlay network interaction, and even with it, they may not be end to end in the new multi-cloud, multi-access world enterprises live in. SD-WAN is mostly about traffic steering and, ultimately, application/session traffic steering.
At this time, packet overlays and underlays are moving away from fine-grained QoS, relying instead on a formula that includes capacity, multiple alternative paths (some of them hopefully loop free), and traffic steering/engineering. 5G architectures are still pursuing Ultra Reliable Low Latency Communication (URLLC), with no doubt any number of associated complexities. 5G already includes Flex-E/G.MTN and TSN (IEEE Time Sensitive Networking; the IETF is also now working on TSN/deterministic networking), as well as other complexities not necessarily directly related to QoS, including CPRI, eCPRI (which uses less bandwidth), CPRI to O-RAN, and synchronization. Huawei is also promoting a “new” IP within the ITU to support use cases like URLLC. Is all this heading in a different direction from where IP overlay and underlay networks are going?
Despite all these potential knobs on 5G access/aggregation networks, the Edge cloud conversation continues to be about getting closer to the customer. This narrative seems to assume that the biggest problem is not the lack of knobs on access/aggregation devices, but the speed limit of electromagnetic “waves” in a fiber or over the air; in other words, the distance between the customer and the edge cloud. Are all these knobs in complicated 5G access units going to solve the end-to-end problem if segment routing has such a coarse level of QoS / traffic engineering? I can imagine at least two answers to that question. One, there is more excess capacity in the core than there is in access, so less fine-grained bandwidth management is needed there. Two, the Huawei response: the world needs a new IP. [Added after original publication: a third option is to change/evolve segment routing, which is a whole other discussion].
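The propagation-delay arithmetic behind the "distance is the problem" argument is simple enough to sketch; the only assumption is the usual rule of thumb that light covers roughly 200 km per millisecond in fiber (about c divided by a refractive index of ~1.5).

```python
# Rough fiber propagation delay, ignoring queuing, processing, and radio legs.
FIBER_KM_PER_MS = 200.0   # rule-of-thumb speed of light in fiber

def round_trip_ms(distance_km: float) -> float:
    """Fiber round-trip propagation time in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_MS

for km in (10, 100, 1000):   # edge site, regional site, distant core (assumed)
    print(f"{km:>5} km -> {round_trip_ms(km):.2f} ms RTT")
```

No amount of per-hop QoS machinery removes the 10 ms you pay just to reach a data center 1,000 km away.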
Maybe the answer will be to get the Edge cloud as close as possible to the customer, and constrain the complexity to the segment of the network between the customer and the Edge, because the traffic that cares about latency & jitter will be hitting the Edge cloud. This of course is likely not entirely true when it comes to remotely controlling robots, cars, surgery equipment, etc.
Data networks have had the ability to provide bandwidth reservations for many decades, through multiple technology deployment and maturity cycles. Yet data networks have not discovered a broad-based need for them. The SD-WAN trend is putting an exclamation mark on that point. The “voluntary” adjustment of video coding by Netflix and others, to use less Internet bandwidth when COVID-19 first hit, also touches on this issue: what functionality rightly belongs at the application and TCP layers of the stack?
Despite the clear direction packet networking is going in (traffic engineering without hop-by-hop bandwidth reservations), Telecom bodies like the ITU, OIF, and others continue to dig deeper into the Quality of Service / traffic management well, and at the same time are pulling the IEEE and IETF in as well, with TSN and deterministic networking [Added after initial publication: and TEAS/VPN+].
How will this play out? In the short term, all the 5G complexity will be asked for, developed, and delivered. The 5G ecosystem is in transition, and arguably has not even fully embraced O-RAN yet. In the long term? Well, I go back to the three olive martini model. Adding capacity simplifies many things.