Inside IBM's Power Server Testing Lab

Key Points

  • The IBM facility in Austin is a System Integration and Test Development Center where brand‑new Power Systems are assembled, powered on for the first time, and run through comprehensive “smoke‑check” tests to ensure they meet reliability standards.
  • Engineers conduct continuous, high‑intensity stress testing—including firmware, software, and hardware integration checks—so the servers can operate flawlessly under real‑world workloads.
  • The “guard band” area acts as an extreme‑environment chamber where servers are subjected to harsh conditions such as severe heat, freezing temperatures, high/low voltage, and other stressors to verify they remain functional.
  • Additional durability tests include simulated earthquakes and other environmental challenges, effectively making the lab a “torture chamber” that guarantees enterprise‑grade reliability for IBM’s Power server customers.
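
The guard-band idea in the points above (confirming a server keeps running at the edges of its temperature and voltage range) can be pictured as a small test matrix. The sketch below is only an illustration: it assumes the stresses are combined pairwise, which the video does not state, and `run_corner` and its `still_running` callback are hypothetical stand-ins for a real health check, not IBM tooling.

```python
from itertools import product

# Stress axes named in the summary. These are labels only; no real limits are implied.
TEMPERATURES = ["extreme cold", "extreme heat"]
VOLTAGES = ["low voltage", "high voltage"]


def corner_cases():
    """Return every temperature/voltage combination as a list of tuples."""
    return list(product(TEMPERATURES, VOLTAGES))


def run_corner(temperature, voltage, still_running):
    """Record whether the server stayed up at one corner.

    `still_running` is a hypothetical callable standing in for a real health check.
    """
    return {
        "temperature": temperature,
        "voltage": voltage,
        "passed": bool(still_running(temperature, voltage)),
    }


if __name__ == "__main__":
    # Placeholder check that always reports success.
    for temp, volt in corner_cases():
        print(run_corner(temp, volt, lambda t, v: True))
```

Enumerating the corners up front makes it easy to see that no combination was skipped, whatever the real chamber procedure looks like.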

**Source:** [https://www.youtube.com/watch?v=7ZdsWebj9Jw](https://www.youtube.com/watch?v=7ZdsWebj9Jw)

**Duration:** 00:09:33

## Sections

- [00:00:00](https://www.youtube.com/watch?v=7ZdsWebj9Jw&t=0s) **Inside IBM's Power Server Test Lab** - A walkthrough shows IBM engineers assembling and rigorously testing Power Systems to ensure reliability before customer delivery.

## Full Transcript
[0:00] Host: Standing here outside this admittedly generic and boring-looking building in Austin, Texas. But inside, IBM develops and tests Power servers. Now, the lab is usually off limits to everyone except the core team, but I got us a special invite to check it out. Follow me.

[Music]

[0:24] Host: Wayne, thank you so much for having us today. I'm excited for you to show me around this data center.

Wayne: It's not a data center.

Host: Okay, what is it?

Wayne: It's a system integration and test development center for our Power Systems.

Host: Okay, what does that mean?

[0:38] Wayne: This is where we put all the systems together for the first time: put the power supplies, the backplanes, the chassis, memory, processors all together. We're the first ones to turn it on. We do what we call a system power-on and then check for smoke.

Host: And by "check for smoke," you mean make sure that nothing's broken?

Wayne: Correct. We run, soup to nuts, all the testing you can possibly do to make sure that the customers are getting the most reliable server that they can possibly get.

Host: I love that. I'd love to see what you got. Can you show me?

Wayne: Let's go.

[1:07] Wayne: So as you can see, there's a noise warning for this lab. Inside right now, it's about 65 decibels. It's about to get a lot louder.

Host: Although I can barely hear him over the sound of hundreds of enterprise-grade systems buzzing away. Whoa. Whoa.

[1:20] Host: From here, Wayne explained that this is one of the labs where the reliability and availability of Power servers are tested. That includes checking both firmware and software integrations. In essence, IBM scientists and engineers run these Power servers nonstop at their limit to monitor for performance dips. This rigor helps them build machines that businesses can rely on for their most important work, even when they encounter extreme conditions. And that takes us to our next stop.

[1:44] Host: Where are we now, with these things?

Wayne: This is called the guard band. This is where they do the external testing on the system. We'll run it through extreme heat, extreme cold, voltage (high voltage, low voltage), and make sure that it'll always continue running, so our reliability is what a customer expects.

[2:04] Host: Okay, it's basically a torture chamber for servers.

Wayne: Yeah, you could call it that.

Host: Like one of these things here?

Wayne: Yeah. In fact, that's one of our small ones. This is the small; I've got a bigger one back here.

Host: This thing here?

Wayne: Yeah.

Host: It's massive.

[2:18] Wayne: Right now we're doing cold testing on the Power E1080.

Host: Cool, check it out. It's freezing in here.

Wayne: Yeah.

Host: How cold is it?

Wayne: Right now it's 10 degrees C. It's probably going to go a little bit colder than this.

Host: Okay, but the servers have to run at cold temperatures, so you basically leave it in here, turn on the ice, and get it as cold as possible to make sure that they can run?

Wayne: Actually, they like it colder like this.

Host: Okay.

Wayne: But they need to run when it gets really cold.

[2:46] Host: And in the same chamber, can you also turn up the heat, so you test for heat?

Wayne: We're going to go probably up to 85 degrees in here.

Host: Wow. So what other types of testing are you putting these servers through?

Wayne: We also do earthquake testing, for systems that go to, like, California, where they might be subject to an earthquake. We also do RF testing, where we subject the system to RF signals and also try to see what kind of RF signals come out. Then there's noise testing too: how much noise is actually coming out, or put out by the system, versus how much noise it could take.

[3:12] Host: It sounds like you've got to consider a ton of things when you're designing an architecture.

Wayne: Oh yeah.

Host: Is there someone I could talk to about that?

Wayne: Well, Kava is a main engineer, so he'd be a good one to talk to.

Host: I'd love to meet [Kava].

[3:25] Host: All right, [Kava], so this is the E1080 server, right?
[3:27] Kava: Yes, that's correct. So this has four Power10 processors, but the E1080 comes in multiple flavors. Customers could order it up to a 16-socket system, which means four of these stacked on top of each other, four of these big boys, and then cabled up together from the back.

Host: Gotcha. Let's take a look inside.

Kava: Before we do that, let's put our static [straps] on.

Host: Oh, okay. And what are these for?

Kava: These are to make sure we discharge our body to the system, so we don't harm any electronics.

Host: Cool. I don't want to break anything, so I'll do that. Set here? Okay.

[4:01] Kava: We've got our four processors here.

Host: So these four things are processors?

Kava: No, no, no. This is a heat sink. This is what cools the processor.

Host: Okay.

Kava: Underneath the heat sinks we actually have four of these. This is the actual Power10 E1080 processor.

Host: Okay, so this is like the brains of the server.

Kava: This is the brain of the system, yes.

Host: And what's special about this Power10 processor here?

Kava: Actually, inside what you're holding is a Power10 chip.

Host: Okay, we're getting smaller and smaller here, right?

[4:29] Kava: So this is based on a seven-nanometer technology, and there are 18 billion devices in what you're holding.

Host: 18 million or billion?

Kava: 18 billion.

Host: 18 billion devices on this chip?

Kava: Correct.

Host: That's amazing. So what are some of the unique properties of the chip?

[4:45] Kava: So we have built-in AI inferencing inside the chip, and it's about 5x faster than what our previous-generation Power9 processors have.

Host: Not bad. And what does AI inferencing do for the user?

Kava: It's used for machine learning. In other systems that use GPUs and FPGAs for machine learning, you have to transfer data from your processor to somewhere else, get it processed, and move it back. In our processor, everything is done in the processor itself.

Host: So they don't have to move the data as far, so it's faster and it's more reliable.
[5:19] Kava: Correct, and also a lot more secure, because you're not moving the data somewhere else to get processed and come back.

[5:27] Kava: These are actually cables to bring the signal from the top of our processors, to bring the high-speed signals out via these cables to the back of the system. And that's how we are connecting four of these systems to each other. So when they're stacked on top of each other, it comes out through this snake node, comes directly from the top of a processor, which we are the only one in the industry doing.

Host: Wow.

Kava: It goes through these copper cables, internal cables, goes through an external cable like this through the back, and goes to another node. That's how we managed to connect four of these together to create one gigantic 16-socket system.

[6:08] Kava: Another really important part is our memory, our memory subsystem. We have a buffer chip in our memory.

Host: Okay.

Kava: And every single communication between the processor and the memory is fully encrypted through the hardware. It also has a spare DRAM as well, so if any of these DRAMs that you can see fail, it can actually switch to the spare and keep going without losing any performance, without the system coming down. And because this system is all about reliability and availability, we have an extra power phase on all the DIMMs.

Host: Very cool.

[6:46] Kava: And this system, fully loaded, can have up to 64 terabytes of memory.

Host: 64 terabytes? That's a lot.

Kava: That's a lot. And because of that, and because of the performance the system provides, we actually hold the world record for SAP HANA. The largest SAP HANA database that's certified is run on this system.

Host: Wow.

[7:06] Kava: In front of the memory we have all the regulators. These are VRMs, or what we call voltage regulators. Every different component in the system uses different voltages.

Host: Okay.

Kava: So the regulators like this bring the 12 volts that our power supply generates down to the different voltage levels that other components need. Because this is our most reliable system, every single regulator is N+2 phase. That means there are two extra phases.

Host: Wow.

Kava: Just sitting there as a backup. Just in case anything goes wrong, the system automatically switches to the backup and keeps going.

[7:39] Kava: And there are other components here in the front as well. For example, this is our clock card. This card generates the reference clock for the processor and the memory, for different components in the system. And if you notice, there are two of them, identical. The reason for that is, again, this is our most reliable system: for every single component there is a redundant one, a backup.

Host: That's fair. So there's a lot of stuff going on in here, and I think it gets pretty hot. I know we've got the heat sink, but we also have fans up front?

[8:08] Kava: Correct. In the front of the system we have five fan assemblies. These are all concurrently maintainable. What that means is, while the system is up and running, if there's any issue with any of the fans, we can service them without the system going down. The whole system is up and running.

[8:26] Host: So everything you and Wayne have shown me today means this machine is super powerful, it's super reliable.

Kava: Yes, that's correct. You can do a lot of work with a lot fewer of these machines, so because of that you can have a much smaller carbon footprint. Things that you could potentially do with 100 servers, you could probably do with two of these.

Host: So you're lowering your carbon footprint and probably doing some cost savings as well?

Kava: Of course. You save costs by less power getting used, and also on licensing. Typically, licensed things get charged by how many cores you have, and because this machine can do a lot more with a lot fewer systems, a lot fewer cores, you save a lot of money on licensing fees as well.

[9:10] Host: So what's next for you guys? What are you working on now?

Kava: So we're already working on the next generation of these systems.

Host: So no rest, no downtime. You're already doing the next generation, and you're going to look to beat the records you set with this one?

Kava: Of course.

[Music]
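
The scaling and redundancy figures quoted in the conversation can be checked with a little arithmetic. The following is a minimal sketch, not a model of the real hardware: the constants come from the transcript, the even split of memory across sockets is an assumption the video does not make, and the `Regulator` class with its `required_phases` value is hypothetical, chosen only to show what "N+2" means.

```python
from dataclasses import dataclass

# Figures quoted in the transcript.
PROCESSORS_PER_NODE = 4   # "this has four Power10 processors"
MAX_NODES = 4             # "four of these stacked on top of each other"
MAX_MEMORY_TB = 64        # "up to 64 terabytes of memory"
SPARE_PHASES = 2          # "N+2 phase ... two extra phases"


def max_sockets() -> int:
    """Sockets in the largest configuration: nodes times processors per node."""
    return PROCESSORS_PER_NODE * MAX_NODES


def memory_per_socket_tb() -> float:
    """Memory per socket, ASSUMING an even split (not stated in the video)."""
    return MAX_MEMORY_TB / max_sockets()


@dataclass
class Regulator:
    """Toy N+2 regulator; `required_phases` (N) is a hypothetical value."""
    required_phases: int
    failed_phases: int = 0

    def healthy_phases(self) -> int:
        # Installed phases are N plus the two spares, minus any that failed.
        return self.required_phases + SPARE_PHASES - self.failed_phases

    def is_operational(self) -> bool:
        # The regulator keeps working while at least N phases remain healthy.
        return self.healthy_phases() >= self.required_phases


if __name__ == "__main__":
    print(max_sockets(), "sockets;", memory_per_socket_tb(), "TB per socket")
    for failures in range(4):
        reg = Regulator(required_phases=6, failed_phases=failures)
        print(failures, "failed phases ->", "up" if reg.is_operational() else "down")
```

Under these assumptions an N+2 regulator tolerates two failed phases and goes down on the third, which matches the "two extra phases just sitting there as a backup" description.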