In-Memory Computing for Energy-Efficient AI
Key Points
- AI powers everyday services like speech‑to‑text and chatbots, but the data movement between memory and CPU consumes a large share of the energy used by these systems.
- Training massive deep‑learning models (e.g., large language models) can emit as much carbon as five cars and may take weeks in cloud clusters, highlighting the urgency for more energy‑efficient compute.
- AI progress is categorized as narrow, broad, and general; as we move toward broader and more complex models, the demand for faster, greener hardware will only increase.
- In‑memory computing proposes to fuse memory and processing, eliminating the costly back‑and‑forth data transfers that dominate runtime and power usage in traditional architectures.
- By treating memory arrays as networks of resistive elements that can perform calculations directly, in‑memory computing offers a promising route to boost speed while dramatically lowering AI’s energy footprint.
Sections
- In-Memory Computing for Green AI - Nicole explains that everyday AI services consume large amounts of energy—primarily due to data transfers between CPU and memory—and presents in‑memory computing as a promising approach to make AI more energy‑efficient.
- In-Memory Computing for Efficient AI - The speaker explains how integrating memory and compute via resistive crossbar arrays can eliminate data movement, thereby increasing speed and energy efficiency for larger, more complex AI models.
- Mapping Neural Networks onto Crossbar Arrays - The passage explains how a neural‑network layer is realized in hardware by programming the crossbar’s conductance matrix to match layer weights, encoding inputs as voltage vectors, and reading column currents to perform the required matrix‑vector multiplication before applying the activation function.
- Contribute to Energy‑Efficient AI - The presenter encourages viewers to help create more energy‑efficient AI by participating, subscribing, liking, and accessing the open‑source analog AI hardware toolkit via the provided links.
Full Transcript
Source: https://www.youtube.com/watch?v=BTnr8z-ePR4 (Duration: 00:09:57)
Section timestamps: 00:00:00 In-Memory Computing for Green AI; 00:03:13 In-Memory Computing for Efficient AI; 00:06:22 Mapping Neural Networks onto Crossbar Arrays; 00:09:36 Contribute to Energy-Efficient AI
How many times a day do you use AI?
You may be surprised to find
that AI powers many of the tech services you use throughout your day.
Any time you use speech to text on your phone
or you use a chatbot for customer service, you're using AI.
Behind the scenes, these existing technologies are consuming lots of energy.
One very exciting field has emerged to try and make AI more energy efficient,
and that's in-memory computing.
But you may be wondering: why is energy-efficient AI desirable?
My name is Nicole Saulnier,
and I'm a researcher with IBM working on in-memory computing.
In a traditional computer there are two main blocks.
A memory, and a CPU or Central Processing Unit.
These are connected together by a bus.
And data is transferred back and forth
to execute instructions and perform computations.
As transistors have continued to scale,
the CPU has become faster and more energy efficient.
This has made the remaining bottleneck more significant: the speed and energy
consumed transferring data back and forth
between the memory and the CPU.
In data intensive computation such as deep learning,
the data communication is actually dominating the model runtimes and the energy consumption.
To put that into perspective,
we can think about some commonly used models today.
To train just one very large natural language processing model,
we consume energy with a carbon footprint roughly equivalent to that of five cars.
And even in a cloud environment where many computers are working together to solve the same problem,
it can take over one week to train.
To appreciate these energy and time constraints,
we have to look at the field of AI and look at the trends.
We can categorize AI by dividing it into three different categories.
In narrow AI, we're able to solve a single task with superhuman accuracy.
In broad AI, we're performing multiple tasks within the same domain.
Things like diagnosing a patient with cancer and providing a treatment plan for them.
And then in general AI, we're working across domains,
applying learning from one area to another with ease and often without any supervision.
Today we're here somewhere between narrow and broad AI,
and we know if we want to move further to the right,
the size and complexity of our models are going to be increasing.
This is going to drive a need for innovation
and for more energy efficient AI compute.
And that's where in-memory computing comes in.
Instead of spending all this time, transferring our data back and forth,
what if we could design a system that eliminated this data movement
and actually perform the functions of both the memory and the CPU?
Then we could potentially have an increase in our speed and our energy performance.
In order to think about this,
it helps to break it down and to start thinking about what types of computations a memory could perform.
One can think of a memory as an array of resistive elements,
and each of these resistive elements can be programmed to some conductance value "G",
where G is just going to be the inverse of our resistance.
If we have a simple crossbar of two metal wires
and we put one of our resistive elements between them,
this can be programmed to conductance G1.
We can apply some voltage V1 across it,
and then we can calculate the current I1 flowing through our device.
And this will just be I1 is equal to V1 times G1.
And this is just dictated by Ohm's Law.
Now, if we extend our array and we add a second row of devices,
the current through this device can be expressed as I2,
and we can calculate the current coming out at the bottom of our column
as I is equal to I1 plus I2.
And this is just Kirchhoff's Law, which has performed an accumulation operation for us.
So, we're able to perform different operations: a multiplication with Ohm's Law,
and an addition with Kirchhoff's law, by using these types of devices.
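The two laws the speaker describes can be sketched numerically as a sanity check; the conductance and voltage values here are arbitrary illustrations, not taken from the video.

```python
# One column of a crossbar: two resistive devices sharing an output wire.
# Ohm's Law gives each device's current; Kirchhoff's current law sums them.
G1, G2 = 0.5, 0.25   # programmed conductances (siemens), illustrative values
V1, V2 = 1.0, 2.0    # applied row voltages (volts)

I1 = V1 * G1         # Ohm's Law: current through device 1
I2 = V2 * G2         # Ohm's Law: current through device 2
I = I1 + I2          # Kirchhoff's current law: the column accumulates currents

print(I)             # 0.5 + 0.5 = 1.0
```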
We can now take our simple single column and expand it out into an array.
We can put an element at each cross point
between the various metal wires,
and we can represent these as different conductance values.
Each can be programmed to a different value.
And we can represent this as a matrix G,
and that consists of all of our different elements.
We can then apply different voltages to each row of our array,
and we can represent that as a vector V of our input voltages.
Then our currents coming out the bottom of our columns
can be represented by the resultant vector I,
which is equal to our voltage vector times our conductance matrix G.
And this is just a matrix vector multiplication or an MVM,
and this is super convenient because it turns out that in AI inference workloads
around 60 to 90 percent of our operations are going to be these MVM operations.
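The full-array picture above amounts to a single matrix-vector product; a minimal sketch, with an illustrative 3×2 conductance matrix of my own choosing:

```python
import numpy as np

# A 3x2 crossbar: conductance matrix G (rows = input wires, columns = outputs).
G = np.array([[0.1, 0.4],
              [0.2, 0.5],
              [0.3, 0.6]])      # illustrative conductance values

V = np.array([1.0, 2.0, 3.0])  # input voltages, one per row

# Each output wire sums V_i * G_ij down its column, which is exactly
# the matrix-vector multiplication (MVM) I = V @ G.
I = V @ G
print(I)  # [1.4, 3.2]
```

In the analog array this product happens in one physical step, rather than as the loop of multiply-accumulates a digital processor would perform.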
So we have that basic building block available to us.
How do we actually map our neural network onto our hardware?
Well, a layer of a neural network consists of many output neurons.
And each of those output neurons, for instance neuron N,
is going to be driven by a set of input neurons through a set of weights.
And if we have our input to our neural network layer as X,
we can express the output from the layer mathematically as this equation.
Now we have to map this equation onto our memory array.
The first thing we can do is take all of our conductances
and program them such that our conductance matrix G is equal to our weights of our layer.
And then we can encode the inputs to our neural network layer X as a vector of input voltages V.
And finally, we can collect the current coming out of each column of the array
and apply our activation function F to our current.
And that is going to give us the output from our neural network layer Y.
And this way, we can use these concepts to map our neural network onto our memory array
to perform analog in-memory computing for more energy efficient AI.
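The three mapping steps can be sketched end to end; the weights, inputs, and ReLU activation below are hypothetical choices for illustration, and note that real analog hardware typically represents negative weights with pairs of devices, a detail this sketch ignores.

```python
import numpy as np

# Hypothetical layer: 3 input neurons, 2 output neurons, ReLU activation.
W = np.array([[ 0.5, -1.0],
              [ 0.25, 0.5],
              [-0.5,  1.0]])   # layer weights, illustrative values

X = np.array([2.0, 4.0, 1.0])  # layer inputs

# Step 1: program the crossbar so its conductance matrix equals the weights.
G = W
# Step 2: encode the layer inputs as the vector of row voltages.
V = X
# Step 3: read the column currents (the analog MVM), then apply the
# activation function f (here ReLU) to get the layer output Y.
I = V @ G
Y = np.maximum(I, 0.0)
print(Y)  # [1.5, 1.0]
```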
There are a lot of details that go into the design, the build,
and the usage of these analog in-memory computing chips.
You can join us and check out our AI hardware toolkit
to learn more about different neural networks and simulate those,
and you can also explore various memory elements, which we have included.
And the best part is you can contribute
and join us to help make AI more energy efficient.
Thanks for watching.
If you like the video, don't forget to like and subscribe to the channel
and also check out the links below for access to our open source analog AI hardware toolkit.