# Building a 160‑Core M.2 Supercluster

**Source:** [https://www.youtube.com/watch?v=HRfbQJ6FdF0](https://www.youtube.com/watch?v=HRfbQJ6FdF0)
**Duration:** 00:14:32

## Summary

- The creator upgraded from a 256‑core, four‑layer PCB with cramped surface‑mount pin headers to a 160‑core RISC‑V supercluster built around modular M.2‑style edge connectors, drastically reducing board size to 22 × 26 mm.
- By stacking ten vertical M.2 slots, each hosting its own MCU module, the design achieved dense routing of power, I/O, and PCIe lanes while keeping within standard PCB manufacturing constraints.
- Switching to the newer QFN‑package CH32V003 chips and using 0.5 mm‑pitch edge connectors solved earlier mechanical issues but introduced new challenges sourcing affordable vertical connectors, ultimately requiring manual assembly from low‑cost AliExpress parts.
- Despite the increased component count (64 devices, 515 pads, 413 vias) and higher per‑module cost (≈ $15 for a batch of 20), the final prototype proved functional and demonstrated the feasibility—and the headaches—of scaling to a high‑core, modular supercluster.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=0s) **Building a 160‑Core M.2 Supercluster** - The creator recounts the challenges and design choices behind assembling a densely packed 160‑core computer cluster using M.2 edge connectors, PCI Express lanes, and compact MCU packages.
- [00:03:31](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=211s) **Soldering Tight Connectors & Multi-MCU Programming** - The speaker details the challenge of reaching inaccessible pins on a dense board by wrapping a wire around the soldering iron, then explains how shared‑bus programming of many MCUs led to unintended current paths, heating, and the eventual solution of powering and flashing all units together.
- [00:06:58](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=418s) **Spam Overload Sparks Data‑Removal Pitch** - The narrator vents about relentless spam and data‑broker privacy violations, advertises Incogni's service that forces brokers to delete personal information with various subscription plans and a discount code, then abruptly switches to describing a technical demo of distributed rendering across multiple MCUs.
- [00:10:07](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=607s) **Debugging 160‑MCU Open‑Drain Bus** - The presenter troubleshoots a 160‑MCU board, replaces overused GPIO pins, and discovers that 10 kΩ pull‑up resistors on the open‑drain bus are too weak for megahertz signals, causing slow rise times and corrupted color data, prompting a switch to stronger pull‑ups.
- [00:13:32](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=812s) **Microcontroller SHA-256 Benchmark Success** - The creator built a 160‑microcontroller SHA‑256 hashing cluster that outperformed an 8‑core desktop while consuming under 4 W, shares the design files, reflects on the challenges, and asks viewers how the low‑power rig could be used.

## Full Transcript
I worked on this for months. At first
glance, it might look like a heavily
botched SSD. But wait, what is this?
Yeah, you guessed it. It's a 160‑core RISC‑V supercluster. What can it
actually do?
Oh, yeah.
And how did building it nearly break me?
Some strange state.
Find out in today's episode.
I don't know how this should ever work
with 160 cores.
[Music]
This project technically started a few years ago when I built a 256‑core megacluster. Those were my first four‑layer PCBs and a serious challenge. I tried to cram 16 MCUs onto a tiny module. Here's the result: a 4 × 4 cm board packed with traces across all layers. What I didn't like: the densely packed surface‑mount pin headers. They bend easily, are expensive, and make the assembly a pain. Back then, the CH32V003 chip was only available in a small SOP package with tiny legs. So, that size was as compact as it could get. Soon after, they released a smaller QFN version of the same chip. Still 48 MHz, 2 kB of SRAM, and 16 kB of flash. I
tried a new design, but never made a
video about it. From that point on, I
knew next time I'm going to use edge
connectors. They are cheap and dense,
especially the ones with a 0.5 mm pin
pitch like those in M.2 form factors.
This year, I finally revisited the idea.
I fell in love with the possibilities
offered by M.2's exposed PCI Express
lanes. I had tested the idea with my
last M.2 matrix project using a PCI
Express serial interface chip. Serial is
easy to access from the browser thanks
to web serial. So, I reused that
interface and placed it on the bottom
side of my board. But instead of jamming
32 MCUs on the top, I had a different
idea. 10 vertical M.2 slots, each with
its own modular board of MCUs. Each
module could have a different
configuration if needed. Spoiler, that
turned out to be a terrible idea for
many reasons, but more on that in a bit.
I started laying out the modules. The goal: route the bus, I/Os, and power within a standard PCB budget. Exceeding those specs makes the manufacturing cost grow exponentially. In the end, I squeezed everything onto a four‑layer PCB: 64 components, 515 pads, 413 vias, and countless 100‑micrometer traces as thin as a human hair. After days of iteration, I got it down to 22 × 26 mm.
And it looks amazing. About a quarter of
the size of my old design. The 22 mm width fulfills the M.2 spec, but the final height of the cluster probably doesn't. Unfortunately, this complexity
added some costs. Each module comes at
about 15 bucks when ordering a batch of
20, but honestly, totally worth it. Now,
these modules aren't pin compatible with
the M.2 interface. They just use the
same physical connector. So, I planned
to use vertical connectors, and this is
where the real trouble began. JLCPCB would only be able to source entire reels of these, and that would cost 500 bucks. I
ended up finding a seller on AliExpress
who offered them individually for a few
bucks each, which meant, yep, manual
assembly. Piece of cake. Not quite.
[Music]
[Music]
All right. I usually reflow one side of the board, then solder the other by hand. With minor success: many pins didn't make contact, and I had to
retouch them all. But here, the pins on
the vertical connectors were completely
unreachable. Who designed this?
Eventually, I wrapped a 1 mm wire around
my soldering iron tip to sneak between
the connectors. This was finicky, but it
worked.
You might have noticed some chips are
missing and there is a whole story to
that. You have to rewind time a bit.
Before I started with the cluster board,
I made the programmer board to flash
individual MCUs by powering them one at
a time as they are sharing the
programming pin.
Okay, let's plug it in. I hope I hope I
didn't mirror the pins or whatever.
I thought low‑side power switching would be a good choice.
Did it crash?
It wasn't. Even though the MCUs had
individual grounds, they were still
connected via bus lines and shared reset
lines. What happened? The current found
a way, likely through body diodes. MCUs started heating up randomly, but luckily
none were permanently damaged.
Some strange state. All grounds are
connected now.
So, I had to power them all at once and
try programming them together. I had
done this before, and while it was
tricky back then, this time it worked
surprisingly well.
All right. Complicated design to make
LEDs blink. Yeah.
I wanted the host MCU to talk to all the 160 small ones individually over a single open‑drain bus without interrupting the rest. When I designed the programmer
board first, I used the GPIO zero pin as
a trigger for the start of a packet.
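The shared‑bus scheme described here, with a trigger pin marking the start of a packet and every MCU deciding from an address field whether the traffic is for it, can be sketched roughly as below. The packet layout, field names, and function are my own illustration, not the actual firmware:

```cpp
#include <cstdint>

// Hypothetical packet framing on the shared open-drain bus: after the
// trigger pin signals a packet start, every MCU reads this header, and
// only the one whose stored index matches the address keeps listening.
struct PacketHeader {
    uint8_t address;  // target MCU index, 0..159
    uint8_t length;   // number of payload bytes that follow
};

// Each MCU compares the address with its own fixed index
// (kept in the option bytes on the real hardware).
bool packetIsForMe(const PacketHeader& hdr, uint8_t myIndex) {
    return hdr.address == myIndex;
}
```

A broadcast could be handled the same way with a reserved address, though the video doesn't say whether the real protocol has one.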
However, during the design of the
cluster board, I changed my decision to
another pin, but the programmer board
was already ordered, so I botched it.
Luckily, the GPIO's I freed up from the
now defunct lowside switches came to the
rescue. I reused all 16 of those to make
the programmer behave just like the
cluster. All right, here we go.
Yeah,
I pre-programmed each MCU with a fixed
index in the option bytes that represents
its physical position in the array. Now,
we could even use the whole thing as a
3D LED matrix or for debugging, we could
identify which of the physical MCUs
stopped responding. You know who else
stopped responding? Me, to anonymous
phone calls. My car warranty expired 20
years ago and I'm still not interested
in extending it. If you're tired of being
spammed every single day, calls, emails,
even physical mail, then today's
sponsor Incogni is here to help. Now, I
make plenty of questionable decisions in
my lab and I fully embrace them for your
entertainment. But I will admit I was
sloppy with my personal data in the
past. I filled out one of those sketchy
win‑a‑car forms. I didn't win the car,
but data brokers definitely won my info.
Suddenly, I was getting spammed from
every direction. It got so bad I snapped
at the real business caller thinking it
was spam again. I'm still ashamed of
that moment. And it's not just annoying.
Some of these data brokers collect
everything, email addresses, home
address, work, and financial history,
and resell it to advertisers, scammers,
and people search sites that expose it
literally to anyone online. Incogni puts
an end to that. They use data privacy
laws to force over 230 data brokers to
delete your data. You just sign up and
they handle the rest. I tested it myself
and honestly I was surprised. Within
minutes of signing up, they already had
processed a bunch of removal requests
and even started suppressing new ones
before they could even access my info.
There are multiple plans. The standard
plan covers you with automated removals,
monthly updates, and 24/7 support. The
family plan extends that up to four more
people in your household. And the
ultimate plan gives you custom removal
requests from any data broker or site.
It's risk-free to try for 30 days and
cancel anytime. Take your personal data
back with Incogni. Use code bitluni at
the link below and get 60% off an annual
plan. And now back to the project.
Activated.
To show what the cluster is capable of,
I implemented a ray marcher, and that was a rabbit hole on its own. Moment of truth.
What? No. The plan? Render a scene with
reflections and shadows distributed
across all MCUs. Each MCU would take
care of a pixel at a time. In the
browser, this worked fine. Translating
it to C++ for the MCUs. Way harder.
These tiny RISC‑V cores had a
rudimentary instruction set. No floating
point, no square root, not even integer
multiplication or division. But with
fixed point math, I got it working.
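The fixed‑point approach can be illustrated with a small sketch: numbers are stored as integers scaled by 256 (Q8.8), and multiplication is built from shifts and adds since the core has none in hardware. The format and code are my illustration, not the project's actual implementation:

```cpp
#include <cstdint>

// Q8.8 fixed point: the low 8 bits are the fraction, so 1.0 is stored as 256.
using fix16 = int16_t;

constexpr fix16 toFix(int whole) { return static_cast<fix16>(whole * 256); }

// Multiply without a hardware multiplier: add shifted copies of |a|
// for every set bit of |b|, then drop the extra 8 fractional bits.
fix16 fixMul(fix16 a, fix16 b) {
    bool negative = (a < 0) != (b < 0);
    uint32_t ua = static_cast<uint32_t>(a < 0 ? -a : a);
    uint32_t ub = static_cast<uint32_t>(b < 0 ? -b : b);
    uint32_t acc = 0;
    while (ub) {
        if (ub & 1) acc += ua;
        ua <<= 1;
        ub >>= 1;
    }
    int32_t result = static_cast<int32_t>(acc >> 8);  // rescale back to Q8.8
    return static_cast<fix16>(negative ? -result : result);
}
```

Division and square root would need similar bit‑by‑bit routines; this only shows the multiply.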
I would say for one MCU, that's not bad.
Kind of.
It breaks when sending and only like
four are working at the same time. I had
each MCU light up its LED while
rendering a pixel. But remember those
hair thin traces? Too many LEDs at once
caused brownouts.
It died completely.
No time out anymore. What? That took me
a while to discover.
I don't know how this should ever work
with 160 cores. Uh oh.
Oh no.
I was using a 200 ohm LED resistor which
drew too much current. So I switched the
LED GPIOs from push‑pull to using the internal pull‑up instead. The LED lit up
faintly, but the current draw was
dramatically reduced and it solved so
many issues I haven't even diagnosed
yet. Now I uploaded the code without the
blink. So here LED timeout is off. Ping
again. Is it connected? Ping. Bam.
Everything works.
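Rough numbers show why the internal pull‑up helped. By Ohm's law, each LED behind a 200 Ω resistor draws several milliamps, while an internal pull‑up in the tens of kilo‑ohms passes well under a tenth of a milliamp. The 2 V LED forward drop and 40 kΩ pull‑up below are my assumptions, not measured values:

```cpp
// Per-LED current by Ohm's law: I = (Vsupply - Vforward) / R.
constexpr double ledCurrentMa(double vSupply, double vForward, double rOhm) {
    return (vSupply - vForward) / rOhm * 1000.0;
}

// 200 ohm series resistor: ~6.5 mA per LED, over 1 A if all 160 light up.
constexpr double pushPullMa = ledCurrentMa(3.3, 2.0, 200.0);

// Assumed ~40 kohm internal pull-up: ~0.03 mA per LED, a few mA total.
constexpr double pullUpMa = ledCurrentMa(3.3, 2.0, 40000.0);
```

With hair‑thin power traces, the difference between over an amp and a few milliamps is exactly the brownout margin described here.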
What?
Oh rendering error here. What? We are
done.
This is this is already too fast. This
is real time. I can see in real time how
the image is built.
Should we try this board with 160 MCUs
at the same time?
I never tried it though.
[Music]
Something was still off. Using more than
one module caused the system to crash or
reset.
We lose like two cores at the end or
even more. Sometimes the last three MCUs
on each module wouldn't respond. That
triggered a memory: GPIOs C13 to C15 are special, limited in speed and current. I
had used them due to GPIO shortage. Big
mistake. To fix it, I repurposed GPIOs from the old low‑side driver circuit that I didn't populate and replaced the problematic pins. Ping is again on another level.
Okay, let's try
the ping.
Finally, for the first time, all 160
MCUs responded correctly.
All green. Yes.
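Since the 160 cores are ten modules of sixteen MCUs, a stored index maps back to a physical slot with integer division. The sketch below assumes indices are assigned module by module, which the video doesn't spell out:

```cpp
#include <cstdint>

// 160 cores = 10 M.2 modules x 16 MCUs each (layout assumption).
struct Position {
    uint8_t module;  // which vertical M.2 slot, 0..9
    uint8_t mcu;     // position within that module, 0..15
};

// Recover the physical position from the index stored in the option bytes,
// e.g. to point at the exact chip that stopped responding.
constexpr Position locate(uint8_t index) {
    return Position{static_cast<uint8_t>(index / 16),
                    static_cast<uint8_t>(index % 16)};
}
```

This is what makes the debugging trick mentioned earlier work: a dead index translates directly into a board location.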
But did they perform? The ray marcher ran
but the colors were wrong. It was
devastating.
And we have some flipped bits here.
But after some thought, I realized the
view vectors sent to the MCUs were
correct. The scene geometry was fine.
Only the returned color data was
corrupted. I used my debug tool to ping
an MCU with extra data attached. Testing
each of the data lines. And sure enough,
some bits wouldn't echo properly,
specifically low‑to‑high transitions. Here's the issue. I used an
open drain bus where each GPIO can only
pull low. There is an external pull-up
resistor to bring the signal high. This
avoids shorts when multiple MCUs try
talking at once. But the 10k pull-ups
were too weak for the megahertz signals. The signal rise time was too slow, resulting in false lows. I tried 1
kiloohm pull-ups, but the already
underpowered MCUs couldn't handle the
current and stopped working entirely.
Yeah, that's not working. My compromise: add a delay on the host side to let the signal settle.
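The slow edges can be estimated from the RC time constant of the pull‑up charging the bus capacitance: the 10 % to 90 % rise time is roughly 2.2 · R · C. The 500 pF total for 160 pins plus traces is my assumption, not a measurement:

```cpp
// Rise time of an open-drain line recharged through pull-up R into
// total bus capacitance C: t(10%..90%) ~= 2.2 * R * C.
constexpr double riseTimeUs(double rOhm, double cFarad) {
    return 2.2 * rOhm * cFarad * 1e6;  // seconds -> microseconds
}

// 10 kohm pull-up with an assumed 500 pF bus: ~11 us per rising edge,
// hopeless when one bit lasts about a microsecond.
constexpr double weakPullUpUs = riseTimeUs(10e3, 500e-12);

// 1 kohm pull-up: ~1.1 us, ten times faster but still marginal, which
// matches the compromise of adding a settling delay on the host side.
constexpr double strongPullUpUs = riseTimeUs(1e3, 500e-12);
```

Falling edges don't have this problem, since the GPIO actively pulls the line low, which is why only low‑to‑high transitions were corrupted.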
That worked and all colors were now
correct.
But still, I had two modules dropping
out. After testing and swapping slots, I
suspected bad solder joints. Patching up a few cold‑looking pins fixed the issue
eventually and finally the cluster was
fully functional. Oh yeah.
Until this point, several times I wanted
to give up. I'm glad I didn't. I guess after solving those puzzles and going through this emotional roller coaster, the success is even more rewarding. But how
successful is it? The ray marcher didn't
look any faster than running on a single
module, and that's because the serial
port is the bottleneck, 115 kilobits per
second. The interface chip should
support up to 8 megabits, but I haven't
been able to get it working yet. If
anyone out there has figured this out,
let me know. Until then, I can either
increase the rendering complexity to
utilize the MCUs more or try compute‑heavy, low‑bandwidth tasks like hashing.
So, I built a SHA 256 hashing benchmark.
SHA 256 is an algorithm that crypto
miners use. Not that I support that
waste of energy, but it's a good
benchmark and it worked amazingly. The
160 MCUs combined actually outperformed
my 8 core desktop CPU. At just 7
milliamps per core at 3.3 volts, the
whole cluster draws under 4 W. That's
pretty competitive for the hash rate.
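The power claim checks out with P = n · I · V, using the figures quoted in the video:

```cpp
// Total cluster power from per-core current: P = n * I * V.
constexpr double clusterPowerW(int cores, double mAPerCore, double volts) {
    return cores * (mAPerCore / 1000.0) * volts;
}

// 160 cores * 7 mA * 3.3 V = 3.696 W, comfortably under the 4 W quoted.
constexpr double totalW = clusterPowerW(160, 7.0, 3.3);
```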
So, what do you think we should use it
for? This project took way more time,
effort, and emotional energy than I
expected. There were so many failures,
but just as many lessons learned. I hope
I was able to inspire you and maybe even
teach a few things through my mistakes.
You will find all the design files and
code linked below. Just don't send the
design to a fab before fixing it. If you
appreciate this kind of content,
consider subscribing or sharing with a
friend. And thanks to my supporters. You
have been incredibly patient waiting for
signs of life. I'll see you next time. Bye.