What happened to my BlackBerry?

Computer scientist addresses RIM outage

Writer Anjum Nayyar talked with computer science professor Eyal de Lara about the technological issues surrounding the recent failure of BlackBerry mobile devices.

1. The terminology around the recent BlackBerry outage is very confusing. What is the difference between a switching or routing problem and an engineering problem?

RIM is being very cryptic. A switching problem is a hardware problem. They are running a large network with many incoming links. RIM’s data centre has connections to all mobile carriers. When people use their BlackBerry to send a message, the message is first delivered to the cell phone provider; the provider forwards it to RIM, then RIM switches it to the intended recipient. When looking at the problem there are two issues. One is the underlying network. Think of it as a set of pipes that connect the world. To create a message in the U.K. and send it through intermediate routers or switches to the U.S., there’s a set of cables and a set of switches that lets you do that. If one of those switches goes down, then the path is broken. That is a network issue.
 
At the same time, when I send a message, they actually need to do some processing. They need to figure out who that message is supposed to be sent to. That’s what RIM is referring to as an engineering problem or an infrastructure problem. Basically the software that specifically runs BBM has to process each one of the messages.
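The network side of that distinction can be illustrated with a small sketch. The topology and node names below are hypothetical and are not RIM's actual network; the point is only that removing one switch from the only route breaks the path.

```python
from collections import deque


def has_path(links, source, target, failed=frozenset()):
    """Breadth-first search that ignores any node listed in `failed`."""
    if source in failed or target in failed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical one-way route from a U.K. carrier to a U.S. carrier.
LINKS = {
    "uk_carrier": ["uk_switch"],
    "uk_switch": ["us_switch"],
    "us_switch": ["us_carrier"],
    "us_carrier": [],
}

print(has_path(LINKS, "uk_carrier", "us_carrier"))                        # True
print(has_path(LINKS, "uk_carrier", "us_carrier", failed={"uk_switch"}))  # False
```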

2. Why is it taking so long for BlackBerry to determine the cause of the outage?

They’re running a large distributed system. They have 70 million users. It’s a lot of users and a lot of data and a lot of equipment spread out in a bunch of data centres throughout the world. It’s a complex system. It seems that’s one of the problems they’re having.

When the system started failing, they actually started batching or storing messages. From what I can figure out, a switch went down in the U.K. and the structure was not able to forward messages. Instead of dropping the messages, they stored them in the hope the switch would come back up and they could deliver them eventually. When network connectivity was re-established and they tried to clear the backlog, the system failed. This is an educated guess on my part.
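The store-and-forward behaviour described in this answer can be sketched roughly as follows. This is a minimal illustration, not RIM's software: the class name, the `capacity_per_tick` limit, and the idea of draining the backlog in bounded batches are all assumptions made for the example.

```python
from collections import deque


class StoreAndForwardRelay:
    """Hypothetical relay that queues messages while its link is down."""

    def __init__(self, capacity_per_tick):
        self.capacity_per_tick = capacity_per_tick  # assumed processing limit
        self.link_up = True
        self.backlog = deque()
        self.delivered = []

    def submit(self, message):
        """Store the message instead of dropping it; tick() delivers it."""
        self.backlog.append(message)

    def tick(self):
        """Deliver at most capacity_per_tick queued messages if the link is up."""
        if not self.link_up:
            return 0
        sent = 0
        while self.backlog and sent < self.capacity_per_tick:
            self.delivered.append(self.backlog.popleft())
            sent += 1
        return sent


relay = StoreAndForwardRelay(capacity_per_tick=2)
relay.link_up = False              # the switch goes down
for message_id in range(5):
    relay.submit(message_id)       # messages pile up instead of being dropped
print(relay.tick())                # 0, nothing can be forwarded yet

relay.link_up = True               # connectivity is re-established
print([relay.tick() for _ in range(4)])  # [2, 2, 1, 0]
```

Draining in bounded batches is one way to keep a recovered link from being hit with the whole backlog at once. Whether anything like this was in place during the outage is not known from the interview.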

3. How is it possible that the crash of servers at a facility in the U.K. can cause such a backlog of email messages? Can you shed some light on this?

It’s a very connected world. It basically tells you that the U.K. is a hub for all of their European operations. People in the U.K. send messages to people in America. Even consumers use BBM to talk to their friends in different countries and different continents.  So they had a backlog and as they were trying to clear the backlog, the system failed.

4. Some reports mention that a backup also failed in this outage; how common is that?

They have a switch which is a core switch. It’s basically a big piece of hardware, which in this case connects Europe with the rest of the world. It’s a very important piece of infrastructure. The way you engineer the system is that you assume that the hardware is reliable but that it will fail once in a while. So you add a second switch as a backup so if the first one fails, you can very quickly replace it with the one that’s on standby so the disruption is minimal. This is common practice. In this particular case both switches failed.
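The primary-plus-standby arrangement described here can be sketched in a few lines. The switch names and the health-check callback are hypothetical; the sketch only shows the selection logic, including the case where both units are down.

```python
class AllSwitchesDown(Exception):
    """Raised when neither the primary nor any standby switch is healthy."""


def pick_active_switch(switches, is_healthy):
    """Return the first healthy switch, trying them in priority order."""
    for name in switches:
        if is_healthy(name):
            return name
    raise AllSwitchesDown("no healthy switch among: " + ", ".join(switches))


SWITCHES = ["primary", "standby"]  # hypothetical names, in priority order

# Normal failover: the primary is down, so traffic moves to the standby.
health = {"primary": False, "standby": True}
print(pick_active_switch(SWITCHES, health.get))  # standby

# The case described in the interview: both switches fail.
health = {"primary": False, "standby": False}
try:
    pick_active_switch(SWITCHES, health.get)
except AllSwitchesDown as error:
    print("outage:", error)
```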

5. Can you talk about how the BlackBerry technology differs from the system used by an iPhone or Android phones?

BlackBerry runs on two different networks. They have a consumer network and a business network, so email and corporate applications are segregated into two different systems. Android phones and iPhones don’t do that. For BlackBerry users, it appears that corporate users saw less of a disruption. The business users were really not affected.

6. Can an outage like this occur again, and what technology or measures would it take to prevent it?

Yes, absolutely it can occur again. What measures can they take? It depends on what the problem was in the first place. They have to figure out why their recovery process failed and brought the whole network down. It might very well be that whatever applications they’re using to handle the data couldn’t handle the sudden spike. If that’s the case they have to change that. But if it happens again, it will happen because a different component failed. My guess is RIM knows they have this problem and they’ll fix it, but there’s no guarantee that there are no problems somewhere else in the network.
 
One thing to know is that this doesn’t just apply to BlackBerry. It’s true of any network. A few years ago we had that huge disruption in the electrical system in Ontario and the northeast U.S. That’s an example of a network going down. These are very complex systems and they’re engineered to deal with any problem we can think of, but once in a while there are conditions that are highly unlikely to happen that do happen.