RO2.THN - Repeated BGP Crash
Updates
The routers in Telehouse North have been restarted and traffic has been moved over to Telehouse East and West where possible. Some traffic remains in Telehouse North; however, the network has returned to stable operation.
A similar incident has recurred, this time on another router in THN. This router appears to have crashed completely after a dark fibre backhaul circuit was taken down.
The NOC are investigating this fault and an update will be posted shortly.
We’re really sorry for the disruption caused by last night’s network outage, which affected internet service for a portion of our customer base (around 2,000 connections). Service has now been restored, and we want to explain what happened and what we’re doing to prevent it from happening again.
What Happened:
One of our core routers experienced a failure in its BGP (Border Gateway Protocol) process: the BGP daemon ran out of available threads. We suspect the router was recalculating whether it should advertise a default route every time a route was added or withdrawn by any internal or external peer, and under heavy route churn this became overwhelming. The loop maxed out a single CPU core despite there being 8 processors available, and eventually caused routing instability. All packet forwarding is done in hardware, but this relies on the CPU to process inbound routes and push them to the FIB (Forwarding Information Base); when BGPd crashes, the FIB is no longer valid and must be rebuilt. External peers can take up to 3 minutes to drop the stale routes because of their hold timers, and it can sometimes take 10-15 minutes for the network to fully converge.
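As a rough illustration (with made-up numbers, not taken from the router itself), the difference between a conditional default route that is re-evaluated on every BGP update and one that is originated unconditionally can be sketched as:

```python
# Hypothetical back-of-envelope model, not the router's actual code:
# a conditional default route re-runs its policy check on every BGP
# add/withdraw event, while an always-originated default never does.

def policy_evaluations(churn_events: int, conditional: bool) -> int:
    """Times the default-route policy runs over a period of route churn."""
    if conditional:
        return churn_events  # one re-evaluation per add/withdraw
    return 1                 # originated once, never re-evaluated

# Illustrative churn rate during instability: 5,000 updates per minute.
busy_minute = 5_000

print(policy_evaluations(busy_minute, conditional=True))   # 5000
print(policy_evaluations(busy_minute, conditional=False))  # 1
```

Because those evaluations all land on the single thread handling BGP updates, the conditional policy scales with churn while the unconditional one stays constant, which is why removing the conditional logic relieved the pressure on that core.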
What We’ve Done:
We’ve now removed the conditional default route logic and replaced it with a consistently originated default route into the network. This fix has stabilised the system and service has returned to normal.
What’s Next:
We’re already planning infrastructure upgrades over the coming months to provide more headroom, resilience, and faster failover during events like this.
If You’ve Raised a Ticket:
Our support team is working through all open tickets today and will be in touch with everyone affected. If you’re still experiencing issues, please do let us know — we’re here to help.
Thank you again for your patience, and we’re sorry for the inconvenience this caused.
1310 Team
One of the core routers in the 1310 network has had a series of BGPd failures in the past week.
It has failed again this evening, blackholing traffic routed through Telehouse North.
We have stopped this router's advertisements and the network is reconverging onto our redundant connections in Telehouse East and West.
If your traffic wasn’t going through THN it won’t have been affected.
We are really sorry for the repeated incidents and are working hard to implement a new London core network.