Colin Bannon, CTO of BT Business, explains why “slow is the new down” and what networks will look like for the next 20 years in his keynote
Colin Bannon, CTO of BT Business, began his keynote conversation, The challenges and gains of operating an end-to-end digital infrastructure [playback here], by explaining why the ‘end-to-end’ aspect is critical. He has radically different ideas from many about what resilience really means and why sticking to the letter of a service level agreement isn’t necessarily enough to avoid a brownout for a customer.
Why end-to-end matters
He was speaking at Mobile Europe’s recent Digital Telco virtual conference, and said that most large enterprises, like banks, have digitalised or are digitalising their front ends and want to extend that into their supply chain – including the services they buy from telcos – so they can match customers’ demands dynamically, while lowering opex.
From the telcos’ side, the key element to achieve this is automation, but open APIs, operators collaborating with each other and adopting standards are critical too, so that the intent of the traffic is transparent.
Biggest pitfalls
He said the pitfalls in building and operating end-to-end digital infrastructure are many but highlighted two. Firstly, “getting everybody’s head and mindset in the culture of letting go” as telcos move to common platforms to achieve scale, when people have owned their own ‘stovepipe’ products for years.
He said, “It’s like you’ve owned your own house for years and you could paint it any colour but now you’ve moved into a condo and suddenly there are rules – but you get far better unit economics.”
Secondly, getting the skills mix right is tricky – protocols have largely stayed the same for 30 years but now we are adding layers of abstraction on top of each other. Bannon said, “We realised we had to double down on those skills…engineers no longer log on and make a direct change…It’s sort of a NetSec DevOps” all in one box, whereas before each of them had a separate team.
Big upsides are that telcos generally are greatly trusted by their customers, who feel that BT Business is “in the trenches” with them, helping them achieve what they need and that the company “has their back”. Bannon stated that in commercial relationships, these things are highly valued.
Most interesting lesson
“Ensuring your data model is clean,” Bannon said. “When you’re in mid-transformation, it’s very tempting to launch a product early or get a product out on time with compromises: ‘There’s just this one step, we haven’t fully automated yet, let’s put people in back and do it the old way’.” But those manual changes mess up the ‘clean room’ data environment. He added, “Hermetically sealed and fully ready so that it delivers maximum value for all of your customers. I think that’s one of the most interesting learnings.”
Distributed workloads, sovereignty and AI
He went on to talk about designing products for highly distributed workloads, and thinking about sovereignty right from the start too, which, in the meantime, has shot up the agenda for almost every organisation on the planet due to geopolitical shifts and uncertainties.
Something that couldn’t have been foreseen was the arrival of new kinds of AI that have had profound effects.
In response to a question from the audience, Colin gave a detailed explanation of the global fabric BT Business has built, which offers services globally to BT’s international customers. It’s a masterclass in the changing meaning of resilience – which is becoming an ever bigger and redefined topic in telecoms. Watch playback or read the lightly edited transcript below.
Running the world’s biggest cloud fabric
“If you think about the fabric concept in itself. What is it? We had 4 different cores, arguably 5 different cores internationally. We had E-LAN, E-Line, Internet, MPLS and also a media and broadcast core. We were using sometimes separate subsea cables; we were definitely using different stacks. We may have been using the same hardware, but they were on different racks in different parts of exchanges around the world.
Invariably, technology has an end-of-life challenge. You had different parts dealing with end of life. We saw a real need from our customers to have hybrid products. Also, on this stack, there is the concept of DDoS, and further in the roadmap is the ability to spin up virtual overlay services, whether that be SSE, SD-WAN, SASE at the head end. That’s essentially a compute platform that sits behind the PE routers of the telco.
Now, all of those were on different stacks – I alluded to some of the challenges that we had previously – the new platform uses 76% less electricity but has 10 to 100x new performance. And it makes it seamless and easy to change between your portfolio.
So now the port is sticky for the lifetime of the customer, and you can add multiple services, fully digitally, without humans getting in the way, without people unplugging things, without having to wait for somebody to get access to something in the middle of the night. And that really transforms the way we can serve our customers.
The second one is we wanted to design it from the cloud back out and so, it is the largest, most interconnected cloud fabric in the world that we’ve designed. Not only is it empirically, in other words, the most numbers of touch points that we’re building, but qualitatively the best. We are working with each of the hyperscalers, to differentiate the services that we’re doing.
In terms of closed-loop automation, where they can see trouble on their side of the network, we can switch things over. The investment that we’ve made in, I’ll give you an example. Being resilient and diverse, we have dual interconnection points in each availability zone, so if you have a failure on the hyperscaler side or at the CNF’s location, today you typically fail out of the country, out of the zone, then you have to come back in via another zone.
Slow is the new down
Now you can have immediate failover in a single zone. This is an argument that I’ve made for a long time: we had a thesis that slow is the new down. So we really wanted to make sure that it’s not just about people’s SLAs and contracts and us saying, ‘Well, the network is still up’, when actually, you’ve just added 150 milliseconds round-trip because now, your link in Japan, to say Azure 1, has gone down.
So now you have to go all the way to Singapore Azure and come back on Azure’s backbone to Japan then trombone all the way back down to Singapore, and then back up to Japan. That’s not a failure mode. Yes, technically, according to your contract, you’re up, but from an experiential point of view, that’s a brownout.
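As an illustration that is not part of the keynote, the "slow is the new down" idea can be sketched as a health check that looks at added round-trip time rather than only at reachability. The names `PathSample` and `classify` and the 50 ms threshold are assumptions made for this sketch; only the 150 ms figure comes from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathSample:
    baseline_rtt_ms: float           # RTT on the normal, in-region path
    current_rtt_ms: Optional[float]  # None means no reply at all


def classify(sample: PathSample, max_added_ms: float = 50.0) -> str:
    """Return 'down', 'brownout' or 'up' for one measurement.

    max_added_ms is a hypothetical experience threshold, not a BT figure.
    """
    if sample.current_rtt_ms is None:
        return "down"
    added = sample.current_rtt_ms - sample.baseline_rtt_ms
    return "brownout" if added > max_added_ms else "up"


# A link that still answers, but with 150 ms of extra round-trip time.
tromboned = PathSample(baseline_rtt_ms=10.0, current_rtt_ms=160.0)
print(classify(tromboned))
```

A contract that only measures reachability would report this path as up; a check of this kind reports it as a brownout.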
We wanted to make sure that we designed a higher quality network that is beyond the [five] nines so you have modern experience that is cloud-based from that front as well. Hopefully that gives you an example of modern commercials.
The final point I’ll say that is different is that previously, on a telco network, first of all, the core of the telco networks, if you look at most telcos around the world, they’re a black box. And what people and technologies have tried to do is compensate for that with a thing called SD-WAN.
The SD-WAN has some brilliant capabilities where it can choose the path by monitoring the application’s process, but it really only has control over the first hop, then it goes into a telco’s black box core for the next 26 hops, and it spits out at the other side. Most telcos’ cores were fundamentally ignorant of that traffic, or that customer’s need or their context, or their business intent, other than perhaps a tagging on the MPLS was the closest equivalent.
In the traditional way, each node would calculate the next forward based on a restrictive set of rules. In this new Network as a Service, we’ve abstracted the control plane from the data plane so the path controllers are abstracted now. And with those central, sort of AI brain-controlled, path controllers, we’ve created a language that is exposed, that opens up and shines a light on the core.
That allows business intent to be injected into that black box so that you can do all sorts of things that we could never do as a telco before. Like taking two really important data flows, and ensuring that they have disjointed paths through the mesh globally, meaning they never touch – a bit like Ghostbusters – the two lines never cross each other to give full resilience paths through there at lowest costs, lowest latency, fastest reconverge.
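To make the idea of two flows that never touch concrete, here is a minimal sketch that is not taken from the keynote. It uses a simple greedy heuristic: find the cheapest path, then search again while avoiding every intermediate node and link the first path used. This is not BT's path controller, and the greedy approach can miss a disjoint pair that an exact method such as Suurballe's algorithm would find. The city codes and link costs are hypothetical.

```python
import heapq


def shortest_path(graph, src, dst, banned_nodes=frozenset(), banned_edges=frozenset()):
    """Dijkstra over {node: {neighbour: cost}}, skipping banned nodes and edges."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, cost in graph.get(u, {}).items():
            if v in done or v in banned_nodes or (u, v) in banned_edges:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def two_disjoint_paths(graph, src, dst):
    """Greedy heuristic: second path shares no interior node or link with the first."""
    first = shortest_path(graph, src, dst)
    if first is None:
        return None
    interior = frozenset(first[1:-1])
    used = set()
    for a, b in zip(first, first[1:]):
        used.add((a, b))
        used.add((b, a))  # links are treated as bidirectional
    second = shortest_path(graph, src, dst, interior, frozenset(used))
    return (first, second) if second is not None else None


# Hypothetical four-node mesh with symmetric link costs.
mesh = {
    "LON": {"AMS": 1, "PAR": 2},
    "AMS": {"LON": 1, "FRA": 1},
    "PAR": {"LON": 2, "FRA": 2},
    "FRA": {"AMS": 1, "PAR": 2},
}
primary, backup = two_disjoint_paths(mesh, "LON", "FRA")
print(primary, backup)
```

In this toy mesh the primary flow goes via AMS and the backup via PAR, so a single node or link failure between the endpoints cannot take out both.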
Here is what you can’t do with SD-WAN. What happens if I have some sovereign data in a country, and I have a sovereign user, and I have a sovereign data centre – but then transport is over the internet? The beauty of the internet is that it is self-healing, but the problem is that it was designed to survive a nuclear war but is fundamentally ignorant of geography.
We see examples on the internet every day where between a local user and a local data centre, there’s failures within the internet, like cut cables all the time. When you reconverge and it heals itself, it may be healing itself via another country because there’s a physicality of how all these meshes are connected up.
We see traffic going from Germany via Russia back to Germany, and that’s not a great look – certainly not from the regulators’ points of view in these days. Creating a new language, a new paradigm, a new control level at the underlay level is the gold standard because the overlay could not control this. You were at the mercy of your ISPs that underpinned you.
Our underlay now is software-defined for every hop, so you can inject your business intent and can do ‘no-flight’ zones. You can ensure your packets never go via that country, or you can ensure that your data in motion stays sovereign and stays inside a country.
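A ‘no-flight’ zone can be pictured as a constraint on path selection. The sketch below is an illustration only, not the language BT exposes: it runs a breadth-first search for a fewest-hop route and refuses to step onto any node located in an excluded country. The function name `route_avoiding`, the node codes and the topology are all hypothetical.

```python
from collections import deque


def route_avoiding(links, country_of, src, dst, no_flight):
    """Fewest-hop path from src to dst that never enters a no_flight country."""
    if country_of[src] in no_flight or country_of[dst] in no_flight:
        return None
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in links.get(node, ()):
            if nxt not in prev and country_of[nxt] not in no_flight:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical topology: the shortest route between two German sites
# happens to transit a node in another country.
links = {
    "FRA": ["MOW", "MUC"],
    "MOW": ["FRA", "BER"],
    "MUC": ["FRA", "HAM"],
    "HAM": ["MUC", "BER"],
    "BER": ["MOW", "HAM"],
}
country_of = {"FRA": "DE", "MUC": "DE", "HAM": "DE", "BER": "DE", "MOW": "RU"}

unconstrained = route_avoiding(links, country_of, "FRA", "BER", set())
sovereign = route_avoiding(links, country_of, "FRA", "BER", {"RU"})
print(unconstrained, sovereign)
```

Without the constraint the search picks the two-hop route through the foreign node; with it, the route is one hop longer but stays inside the country.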
We’ve seen regulation for blocs of countries that involve communities of interest, so you could slice the network via certain vertical industries, pharma or banking, or you could see traffic flows that must stay within their state. I even saw a city make some rulings around this as well. We’re going to see this micro-segmentation.
If you think about what you had in the data centre before AI came and cloud came, what was so great about data centres? Well, you had great micro-segmentation east-west. You could segment up traffic and protect it. You had pretty much infinite bandwidth. You could scale it up. You had great control because everything was controlled from a system, and you had great visibility. Normally, you could make changes quickly.
Then you get to the WAN, where it’s just pieces of string, really, originally. Now, our thesis is that the wide area network, as you start to connect up these AI nodes together – highly distributed mixes between public, private, on-prem and in the devices – we get agent-to-agent discussion or flows between public and private.
It’s very, very complex and getting more and more complex every day. Therefore, there’s a lot more east-west traffic, so our thesis is that the WAN is the new data centre LAN, and that the wide area network, the network itself is now part of the computer. That was one of the fundamental reimaginings of ‘How do you get that control in your wide area network’?
That’s where the name fabric came from originally – you had your data centre fabric and now you have your wide area network fabric. All those principles that came from the data centre have been extended out into the wide area network.
It’s an incredibly exciting time in telecommunications, and I’m very privileged to be part of the trailblazing work that BT’s doing to reimagine what networking will look like for the next 20 years.”