The company said that it designed the system to be as simple as possible, and built software without any bells and whistles to reduce the risk complexity brings.
Do it yourself
“The more complicated a component the more likely it is to have an issue,” global infrastructure VP Peter DeSantis said at the Amazon Web Services (AWS) Infrastructure Keynote.
“UPS systems have lots of very complex electronics, but the software is where things get really complex. The UPS has a hard job to start with, and vendors have jam-packed the UPS with features over the last 20 years. Now we already disable many of these features, but it still adds complexity.”
For years, AWS has used the standard UPS setup of dedicated rooms filled with lead-acid batteries. “We’re not the only ones that have come to the conclusion that a single UPS is not reliable enough,” DeSantis said. “Lots of smart people have worked on solutions, the common approach is to throw more redundancy at the design, usually by adding a second UPS. Often this is done by using a feature of the UPS that allows it to be paralleled with other UPSs. But you still have a big complicated component keeping you awake at night, you’ve just added one more integrated complicated component.”
The company’s servers are powered by two independent power line ups, DeSantis explained. “Each line up has its own switchgear, its own generator, its own in rack UPS, even its own distribution wires. By keeping these line ups completely independent, all the way down to the rack, we’re able to provide very high availability and protect ourselves from issues with the UPS.
“Our data centers running this design achieve availability of almost seven nines (99.99997 percent).”
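To put that figure in perspective, the short Python sketch below converts an availability percentage into expected downtime per year, and shows the textbook formula for combining redundant paths. It is illustrative arithmetic only, not AWS's own methodology: the function names are hypothetical, the 99.9 percent per-lineup figure is an invented example value, and the redundancy formula assumes the two lineups fail completely independently. At 99.99997 percent, the result is roughly 9.5 seconds of downtime a year.

```python
# Illustrative arithmetic only; names and the per-lineup figure are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected unavailable seconds per year for a given availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def combined_availability(single_path: float, paths: int = 2) -> float:
    """Availability of redundant paths, assuming failures are fully independent.

    The system is down only when every path is down at the same time.
    """
    return 1 - (1 - single_path) ** paths


if __name__ == "__main__":
    # Figure quoted in the keynote.
    print(f"{downtime_seconds_per_year(99.99997):.2f} seconds of downtime per year")

    # Hypothetical example: two independent lineups at 99.9 percent each.
    print(f"{combined_availability(0.999) * 100:.4f} percent combined availability")
```

Real-world failures are rarely perfectly independent, so the combined figure is best read as an upper bound.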
But that’s still not enough, DeSantis said, as large UPS systems still pose a risk. “Rather than using a big, third party UPS, we now use small battery packs and custom power supplies that we integrate into every rack,” DeSantis said. “You can think about this as a micro-UPS, but it’s far less complicated. And because we designed it ourselves, we know everything about it, and we control all the pieces of the software – this allows us to eliminate complexity from features we don’t need, and we can iterate at Amazon speed to improve the design.
“The batteries can also be removed and replaced in seconds rather than hours, and you can do this without turning off the system. This allows us to drastically reduce the risk of maintenance we need to do to the battery shelves. This design is giving us even better availability than 99.99997 percent.
“This is the exact sort of design that lets me sleep like a baby.”
AWS has built other systems in-house as well. DeSantis detailed how it developed its own switchgear software, something the company revealed in 2016.
“Switchgear is fairly uncomplicated equipment – it’s big and super important, but it’s really just a bunch of mechanical circuit breakers, some power sensing equipment, and a simple software control system,” DeSantis said. “That control system is simple, but it is software. Most vendors will refer to it as firmware, but that just means it is embedded software that gets saved to a persistent memory module. And software that you don’t own that is in your infrastructure can cause problems.”
For example, if AWS finds a bug, it could end up spending weeks working with the vendor to reproduce that bug in the vendor’s environment. “And then you wait months for the vendor to produce a fix and validate that fix. And in the infrastructure world, you have to take that fix and apply it to all these devices, and you might have to send a technician to manually do that. And by the time you’re done, it can easily take a year to fix an issue. This just won’t work to operate the way we want.”
The other issue is that switchgear firmware is developed for numerous use cases, so it comes with extraneous features that don’t make sense for an AWS facility. “So many years ago, we developed our own switchgear control system. Now this may look fairly simple, and indeed we’ve invested heavily in keeping it as simple as possible, we don’t add fancy features to our controller – instead, we focus on ensuring it does its very important job perfectly.
“Today, we use dozens of different makes and models of switchgear from several partners. But they’re all controlled by our firmware, and this means we can operate our global data center exactly the same way everywhere.”