Inside IBM's Power Server Testing Lab
Key Points
- The IBM facility in Austin is a System Integration and Test Development Center where brand‑new Power Systems are assembled, powered on for the first time, and run through comprehensive “smoke‑check” tests to ensure they meet reliability standards.
- Engineers conduct continuous, high‑intensity stress testing—including firmware, software, and hardware integration checks—so the servers can operate flawlessly under real‑world workloads.
- The “guard band” area acts as an extreme‑environment chamber where servers are subjected to harsh conditions such as severe heat, freezing temperatures, high/low voltage, and other stressors to verify they remain functional.
- Additional durability tests include simulated earthquakes and other environmental challenges, effectively making the lab a “torture chamber” that guarantees enterprise‑grade reliability for IBM’s Power server customers.
**Source:** [https://www.youtube.com/watch?v=7ZdsWebj9Jw](https://www.youtube.com/watch?v=7ZdsWebj9Jw)
**Duration:** 00:09:33
## Sections
- [00:00:00](https://www.youtube.com/watch?v=7ZdsWebj9Jw&t=0s) **Inside IBM's Power Server Test Lab** - A walkthrough shows IBM engineers assembling and rigorously testing Power Systems to ensure reliability before customer delivery.
## Full Transcript
standing here outside this admittedly
generic and boring looking building in
Austin Texas but inside IBM develops and
tests Power servers now the lab is
usually off limits to everyone except
the core team but I got us a special
invite to check it out follow me
[Music]
Wayne thank you so much for having us
today I'm excited for you to show me
around this data center it's not a data
center okay what is it it's a system
integration and test Development Center
for our Power Systems okay what does
that mean this is where we put all the
systems together
for the first time put the power supplies
the backplanes the chassis memory
processors all together we're the first
ones to turn it on we do what we call a
system power on and then check for smoke
and by check for smoke you mean make
sure that nothing's broken correct we
run soup to nuts all the testing you can
possibly do to make sure that the
customers are getting the most reliable
server that they can possibly get I love
that I'd love to see what you got can
you show me let's go
so as you can see there's a noise
warning for this lab inside right now
about 65 decibels it's about to get a
lot louder although I can barely hear
him over the sound of hundreds of
enterprise-grade systems buzzing away
whoa
whoa from here Wayne explained that this
is one of the labs where the reliability
and availability of Power servers are
tested that includes checking both
firmware and software integrations in
essence IBM scientists and engineers run
these Power servers non-stop at their
limit to monitor for performance dips
this rigor helps them build machines
that businesses can rely on for their
most important work even when they
encounter extreme conditions and that
takes us to our next stop where are we
now with these things this is called the
guard band this is where they do the
external testing on the system we'll run
it through extreme heat extreme cold
voltage high voltage low voltage and
make sure that it'll always continue
running so our reliability is what a
customer expects okay it's basically a
torture chamber for servers yeah you
could call it that like one of these
things here yeah in fact that's one of
our small ones this this is the small
I've got a bigger one back here this
thing here yeah
it's massive right now we're doing
cold testing on the Power uh E1080
cool check it out
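The guard-band testing described above, running a system at combinations of temperature and voltage extremes, can be pictured as a small test matrix. The following is only an illustrative sketch: the `Corner` type, the `build_corners` helper, and the voltage margins are hypothetical and not IBM's actual procedure. The 10 and 85 degree figures are the ones quoted in this walkthrough.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Corner:
    """One environmental test condition (hypothetical structure)."""
    temperature_c: float
    voltage_scale: float  # 1.0 means nominal supply voltage


def build_corners(temps_c, voltage_scales):
    """Return every temperature/voltage combination to exercise."""
    return [Corner(t, v) for t, v in product(temps_c, voltage_scales)]


# Temperatures quoted in the tour; voltage margins are made-up placeholders.
TEMPS_C = [10.0, 85.0]
VOLTAGE_SCALES = [0.9, 1.0, 1.1]  # hypothetical low / nominal / high

corners = build_corners(TEMPS_C, VOLTAGE_SCALES)
for corner in corners:
    print(
        f"run workload at {corner.temperature_c} degrees C, "
        f"{corner.voltage_scale:.0%} of nominal voltage"
    )
```

Enumerating the full cross product means no temperature extreme is checked at only one voltage level, which is the point of testing the corners rather than each stressor in isolation.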
it's freezing in here yeah how cold
is it uh right now it's 10 degrees C
it's probably going to go a little bit
colder than this okay but the servers
have to run at cold temperatures so you
basically leave it in here turn on the
ice and get it as close as possible to
make sure that they can run actually
they like it colder like this okay but
uh they need to run when it gets really
cold and in the same chamber can you
also turn up the heat so you test your
heat we're going to go probably up to 85
degrees in here wow so what other type
of testing are you putting these
servers through we also do earthquake testing
for systems that go to places like California
where they might be subject to an earthquake
we also do RF testing where we subject
the system to RF signals and also try to
see what kind of RF signals come out
then there's noise testing too how
much noise is actually coming out or put
out by the system versus how much noise
it could take it sounds like you got to
consider a ton of things when you're
like designing an architecture oh yeah
is there someone I could talk to about
that well Kava is a main engineer so
he'd be a good one to talk to I'd love
to meet Kava
all right Kava so this is the E1080
server right yes that's correct so this
has four Power10 processors but the E1080
comes in multiple flavors customers
could order it up to a 16 socket system
which means four of these stacked on top
of each other four of these big boys and
then cabled up together from the back
gotcha let's take a look inside before
we do that let's put on our static straps oh
okay and what are these for these are to
make sure we discharge our body to the
system so we don't harm any electronics
cool I don't want to break anything so
I'll do that
sat here okay
we got our four processors here so these
four things are processors no no no this
is this is a heatsink this is what cools
the processor okay underneath the heat
sinks we actually have four of these
this is the actual Power10 E1080
processor okay so this is like the
brains of the server this is the brain
of the system yes and what's special
about this Power10 processor here
actually inside what you're holding is
actually a Power10 chip okay we're
getting smaller and smaller here right
so this is based on a seven nanometer
technology and there are 18 billion
devices in what you're holding right
18 million or billion 18 billion 18
billion devices on this chip correct
that's amazing so what are some of the
unique
properties of the chip so we have
built-in AI inferencing inside the chip
and it's about 5x faster than what our
previous generation POWER9 processors
have not bad and what does AI
inferencing do for the user it's used
for machine learning okay and other
systems uh for machine learning use GPUs
and FPGAs you have to transfer data from
your processor to somewhere else get it
processed and move it back in our
processor everything is done in the
processor itself so they don't have to
move the data as far so it's faster and
it's more reliable correct and also a
lot more secure because
you're not moving the data somewhere
else to get processed and come back
these are actually cables to bring the
signal from top of our processors
bring the high speed signals out via
these cables to the back of the system
and that's how we are connecting four of
these systems to each other so when
they're stacked on top of each other
comes out through this snake node comes
directly from top of a processor which
we are the only ones in the industry
doing that wow goes through these copper
cables internal cables go through
an external cable like this
through the back go to another node that
that's how we managed to connect four of
these together to create one gigantic 16
socket system another really important
part
our memory our memory subsystem we have
a buffer chip in our memory okay and
every single communication between the
processor and the memory is fully
encrypted through the hardware there's also a
spare DRAM as well so if any
of these DRAMs that you can see
fails it can
actually switch to the spare and keep
going without losing any performance
without the system coming down and
because this system it's all about
reliability and availability we have an
extra power phase on all the DIMMs
very cool and this system fully loaded
can have up to 64 terabytes of memory 64
terabytes that's a lot that's a lot and
because of that and because of the
performance the system provides we
actually hold the world record for SAP
HANA the largest SAP HANA database
that's certified it's run on this system
wow in front of the memory we have all
the regulators these are VRMs or what we
call voltage regulators every different
component in the system uses different
voltages okay so the regulators like
this bring the 12 volts that our
power supply generates down to different
voltage levels that other components
need because this is our most reliable
system every single regulator is N
plus two phase that means there are two
extra phases wow just sitting there as a
backup just in case anything goes wrong
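The N plus two idea mentioned here can be sketched as a simple capacity check: a regulator that needs N phases to carry its load, and has two spares, keeps working until a third phase fails. This is a minimal sketch under stated assumptions; the function name and the value chosen for N are hypothetical, and only the "plus two" comes from the conversation.

```python
def regulator_can_supply(required_phases: int, spare_phases: int, failed_phases: int) -> bool:
    """Return True if enough healthy phases remain to carry the full load."""
    healthy = required_phases + spare_phases - failed_phases
    return healthy >= required_phases


REQUIRED = 6  # hypothetical N; the real phase count is not given in the source
SPARES = 2    # the "plus two" from the conversation

for failed in range(4):
    status = "ok" if regulator_can_supply(REQUIRED, SPARES, failed) else "insufficient"
    print(f"{failed} failed phase(s): {status}")
```

With these placeholder values the regulator still meets its load with zero, one, or two failed phases and falls short only at three.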
the system automatically switches to the
backup and keeps going and there are
other components here in the front as
well for example
this is our clock card
this card generates the reference clock for
the processor and the memory for
different components in the system and
if you notice there are two of them
identical
and the reason for that is again this is
our most reliable system for every
single component there is a redundant
one a backup that's fair so there's a
lot of stuff going on in here and I
think it gets pretty hot I know we got
the heatsink but we also have fans up
front correct in the front of a system
we have five fan assemblies
these are all concurrently maintainable
what that means is while the system is
up and running if there's any issue with
any of the fans we can service them
without the system going down while the
system is up and running so everything
you and Wayne have shown me today means
this machine is super powerful it's
super reliable yes that's correct you
can do a lot of work with a lot less of
these machines so because of that you
can have a lot smaller carbon footprint
things that you could do potentially
with 100 servers you could do probably
with two of these so you're lowering
your carbon footprint and probably
doing some cost savings as well
of course you save costs by
less power getting used and also on
licensing typically licenses
get charged by how many
cores you have and because this machine
can do a lot more with a lot fewer systems
a lot fewer cores you save a lot of money
from licensing fees as well so what's
next for you guys what are you working
on now so we're already working on the
next generation of these systems so no
rest no downtime you're already doing
the next generation and you're going to
look to beat the records you set with
this one of course
[Music]
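The per-core licensing point made near the end of the conversation can be illustrated with some simple arithmetic. This is a rough sketch, not a pricing model: the server counts (100 versus 2) are the ones mentioned in the conversation, while the cores per server and the fee per core are hypothetical placeholders chosen only to make the calculation concrete.

```python
def per_core_license_cost(servers: int, cores_per_server: int, fee_per_core: float) -> float:
    """Total license cost when software is charged per core."""
    return servers * cores_per_server * fee_per_core


FEE_PER_CORE = 1000.0  # hypothetical fee, in arbitrary currency units

# Server counts from the conversation; core counts are made-up placeholders.
many_small = per_core_license_cost(servers=100, cores_per_server=16, fee_per_core=FEE_PER_CORE)
few_large = per_core_license_cost(servers=2, cores_per_server=200, fee_per_core=FEE_PER_CORE)
savings = many_small - few_large

print(f"100 small servers: {many_small:,.0f}")
print(f"2 large servers:   {few_large:,.0f}")
print(f"difference:        {savings:,.0f}")
```

The saving in this sketch comes entirely from the lower total core count, so it holds only if the consolidated systems really do the same work with fewer cores, as the engineer suggests.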