
# Building a 160‑Core M.2 Supercluster

**Source:** [https://www.youtube.com/watch?v=HRfbQJ6FdF0](https://www.youtube.com/watch?v=HRfbQJ6FdF0)

**Duration:** 00:14:32

## Summary

- The creator moved from an earlier 256‑core cluster, built on four‑layer PCBs with cramped surface‑mount pin headers, to a 160‑core RISC‑V supercluster built around modular M.2‑style edge connectors, shrinking each module to 22 × 26 mm (about a quarter of the old design).
- Ten vertical M.2 slots each host their own MCU module. Every module routes the bus, I/O, and power within standard PCB manufacturing limits, while the host board talks to the PC through a PCI Express serial interface chip.
- Switching to the newer QFN‑package CH32V003 chips and 0.5 mm‑pitch edge connectors allowed a much denser layout, but affordable vertical connectors were hard to source: the assembler could only supply full reels (about $500), so connectors bought individually on AliExpress had to be soldered by hand.
- Despite the module complexity (64 components, 515 pads, 413 vias) and a per‑module cost of about $15 in a batch of 20, the final prototype worked and demonstrated both the feasibility and the headaches of scaling to a high‑core, modular supercluster.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=0s) **Building a 160‑Core M.2 Supercluster** - The creator recounts the challenges and design choices behind assembling a densely packed 160‑core computer cluster using M.2 edge connectors, PCI Express lanes, and compact MCU packages.
- [00:03:31](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=211s) **Soldering Tight Connectors & Multi-MCU Programming** - The speaker details the challenge of reaching inaccessible pins on a dense board by wrapping a wire around the soldering iron tip, then explains how powering MCUs one at a time on a shared programming bus led to unintended current paths and heating, and how powering and flashing all units together solved it.
- [00:06:58](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=418s) **Spam Overload Sparks Data‑Removal Pitch** - The narrator vents about relentless spam and data‑broker privacy violations, advertises Incogni's service that forces brokers to delete personal information with various subscription plans and a discount code, then switches to a technical demo of distributed rendering across multiple MCUs.
- [00:10:07](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=607s) **Debugging 160‑MCU Open‑Drain Bus** - The presenter troubleshoots the 160‑MCU board, replaces speed‑ and current‑limited GPIO pins, and discovers that 10 kΩ pull‑up resistors on the open‑drain bus are too weak for megahertz signals, causing slow rise times and corrupted color data. Because 1 kΩ pull‑ups overload the MCUs, the fix is a host‑side delay that lets the signal settle.
- [00:13:32](https://www.youtube.com/watch?v=HRfbQJ6FdF0&t=812s) **Microcontroller SHA-256 Benchmark Success** - The creator runs a SHA‑256 hashing benchmark on the 160‑microcontroller cluster, which outperforms an 8‑core desktop CPU while drawing under 4 W, shares the design files, reflects on the challenges, and asks viewers how the low‑power rig could be used.

## Full Transcript
0:00 I worked on this for months. At first glance, it might look like a heavily botched SSD. But wait, what is this? Yeah, you guessed it. It's a 160-core RISC-V supercluster. What can it actually do?

0:14 Oh, yeah.

0:16 And how did building it nearly break me?

0:18 Some strange state.

0:19 Find out in today's episode.

0:21 I don't know how this should ever work with 160 cores.

0:25 [Music]

0:28 This project technically started a few years ago when I built a 256-core mega cluster. Those were my first four-layer PCBs and a serious challenge. I tried to cram 16 MCUs onto a tiny module. Here's the result: a 4 × 4 cm board packed with traces across all layers. What I didn't like: the densely packed surface-mount pin headers. They bend easily, are expensive, and make the assembly a pain. Back then, the CH32V003 chip was only available in a small SOP package with tiny legs, so that size was as compact as it could get.

1:03 Soon after, they released a smaller QFN version of the same chip. Still 48 MHz, 2 KB SRAM, and 16 KB flash. I tried a new design, but never made a video about it. From that point on, I knew: next time I'm going to use edge connectors. They are cheap and dense, especially the ones with a 0.5 mm pin pitch like those in M.2 form factors.

1:29 This year, I finally revisited the idea. I fell in love with the possibilities offered by M.2's exposed PCI Express lanes. I had tested the idea with my last M.2 matrix project using a PCI Express serial interface chip. Serial is easy to access from the browser thanks to Web Serial, so I reused that interface and placed it on the bottom side of my board. But instead of jamming 32 MCUs on the top, I had a different idea: 10 vertical M.2 slots, each with its own modular board of MCUs. Each module could have a different configuration if needed.
Spoiler: that turned out to be a terrible idea for many reasons, but more on that in a bit.

2:08 I started laying out the modules. The goal: route the bus, I/Os, and power within a standard PCB budget. Exceeding those specs makes the manufacturing cost grow exponentially. In the end, I squeezed everything onto a four-layer PCB: 64 components, 515 pads, 413 vias, and countless 100-micrometer traces as thin as a human hair. After days of iteration, I got it down to 22 × 26 mm. And it looks amazing. About a quarter of the size of my old design. The 22 mm width fulfills the M.2 spec, but the final height of the cluster probably does not.

2:46 Unfortunately, this complexity added some costs. Each module comes at about 15 bucks when ordering a batch of 20, but honestly, totally worth it. Now, these modules aren't pin-compatible with the M.2 interface. They just use the same physical connector. So, I planned to use vertical connectors, and this is where the real trouble began. JLC would only be able to source entire reels of these, and that would cost 500 bucks. I ended up finding a seller on AliExpress who offered them individually for a few bucks each, which meant, yep, manual assembly. Piece of cake? Not quite.

3:27 [Music]

3:47 All right. I usually reflow one side of the board, then solder the other by hand. That went with minor success: many pins didn't make contact, and I had to retouch them all. But here, the pins on the vertical connectors were completely unreachable. Who designed this? Eventually, I wrapped a 1 mm wire around my soldering iron tip to sneak between the connectors. This was finicky, but it worked.

4:24 You might have noticed some chips are missing, and there is a whole story to that. You have to rewind the time.

4:32 Before I started with the cluster board, I made the programmer board to flash individual MCUs by powering them one at a time, as they share the programming pin.

4:41 Okay, let's plug it in. I hope I didn't mirror the pins or whatever.

4:46 I thought low-side power switching would be a good choice.

4:49 Did it crash?

4:51 It wasn't. Even though the MCUs had individual grounds, they were still connected via bus lines and shared reset lines. What happened? The current found a way, likely through body diodes. MCUs started heating up randomly, but luckily none were permanently damaged.

5:06 Some strange state. All grounds are connected now.

5:11 So, I had to power them all at once and try programming them together. I had done this before, and while it was tricky back then, this time it worked surprisingly well.

5:21 All right. Complicated design to make LEDs blink. Yeah.

5:26 I wanted the host MCU to talk to all the 160 small ones individually over a single open-drain bus without interrupting the rest. When I designed the programmer board first, I used the GPIO 0 pin as a trigger for the start of a packet. However, during the design of the cluster board, I changed my decision to another pin, but the programmer board was already ordered, so I botched it. Luckily, the GPIOs I freed up from the now-defunct low-side switches came to the rescue. I reused all 16 of those to make the programmer behave just like the cluster. All right, here we go.

6:03 Yeah.

6:07 I pre-programmed each MCU with a fixed index in the option bytes that represents its physical position in the array. Now, we could even use the whole thing as a 3D LED matrix or, for debugging, we could identify which of the physical MCUs stopped responding. You know who else stopped responding? Me, to anonymous phone calls.
My car warranty expired 20 years ago and I'm still not interested in extending it. If you're tired of being spammed every single day (calls, emails, even physical mail), then today's sponsor, Incogni, is here to help.

6:38 Now, I make plenty of questionable decisions in my lab and I fully embrace them for your entertainment. But I will admit I was sloppy with my personal data in the past. I filled out one of those sketchy win-a-car forms. I didn't win the car, but data brokers definitely won my info. Suddenly, I was getting spammed from every direction. It got so bad I snapped at a real business caller, thinking it was spam again. I'm still ashamed of that moment.

7:05 And it's not just annoying. Some of these data brokers collect everything (email addresses, home address, work and financial history) and resell it to advertisers, scammers, and people-search sites that expose it to literally anyone online.

7:17 Incogni puts an end to that. They use data privacy laws to force over 230 data brokers to delete your data. You just sign up and they handle the rest. I tested it myself and honestly I was surprised. Within minutes of signing up, they had already processed a bunch of removal requests and even started suppressing new ones before they could even access my info.

7:39 There are multiple plans. The standard plan covers you with automated removals, monthly updates, and 24/7 support. The family plan extends that to up to four more people in your household. And the ultimate plan gives you custom removal requests from any data broker or site. It's risk-free to try for 30 days, and you can cancel anytime. Take your personal data back with Incogni. Use code bitluni at the link below and get 60% off an annual plan. And now back to the project.

8:08 Activated.

8:09 To show what the cluster is capable of, I implemented a ray marcher. That was a rabbit hole on its own.

8:13 Moment of truth. What? No.

8:17 The plan? Render a scene with reflections and shadows, distributed across all MCUs. Each MCU would take care of one pixel at a time. In the browser, this worked fine. Translating it to C++ for the MCUs? Way harder. These tiny RISC-V cores have a rudimentary instruction set: no floating point, no square root, not even integer multiplication or division. But with fixed-point math, I got it working.

8:44 I would say for one MCU, that's not bad. Kind of.

8:50 It breaks when sending, and only like four are working at the same time.

8:53 I had each MCU light up its LED while rendering a pixel. But remember those hair-thin traces? Too many LEDs at once caused brownouts.

9:05 It died completely. No timeout anymore. What?

9:07 That took me a while to discover.

9:11 I don't know how this should ever work with 160 cores. Uh oh. Oh no.

9:20 I was using a 200 ohm LED resistor, which drew too much current. So I switched the LED GPIOs from push-pull to using the internal pull-up instead. The LED lit up faintly, but the current draw was dramatically reduced, and it solved so many issues I hadn't even diagnosed yet.

9:36 Now I uploaded the code without the blink. So here, LED timeout is off. Ping again. Is it connected? Ping. Bam. Everything works.

9:51 What? Oh, rendering error here. What? We are done.

9:58 This is already too fast. This is real time. I can see in real time how the image is built.

10:09 Should we try this board with 160 MCUs at the same time? I never tried it, though.

10:20 [Music]

10:32 Something was still off. Using more than one module caused the system to crash or reset.

10:38 We lose like two cores at the end, or even more.
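
The fixed-point approach mentioned above for the ray marcher can be sketched as follows. This is only an illustration, not the code from the video: the Q16.16 format, all function names, and the sphere scene are assumptions, and the actual firmware was written in C++. The sketch restricts itself to shifts, adds, and compares to mirror a core without multiply, divide, or square-root instructions (on a real toolchain the compiler would normally supply a software multiply routine).

```python
FRAC_BITS = 16              # ASSUMPTION: Q16.16 format; the video does not name one
ONE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Convert a float to Q16.16 (host-side helper only)."""
    return int(round(x * ONE))


def to_float(f: int) -> float:
    """Convert Q16.16 back to a float (host-side helper only)."""
    return f / ONE


def mul_shift_add(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result


def fx_mul(a: int, b: int) -> int:
    """Q16.16 multiply. On a 32-bit MCU the intermediate needs 64 bits."""
    negative = (a < 0) != (b < 0)
    product = mul_shift_add(abs(a), abs(b)) >> FRAC_BITS
    return -product if negative else product


def isqrt_bitwise(n: int) -> int:
    """Floor of the integer square root, using shifts, adds and compares."""
    root = 0
    bit = 1 << 62           # largest power of four considered
    while bit > n:
        bit >>= 2
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


def fx_sqrt(a: int) -> int:
    """Q16.16 square root: sqrt(a / 2^16) * 2^16 equals isqrt(a << 16)."""
    return isqrt_bitwise(a << FRAC_BITS)


def sphere_distance(px: int, py: int, pz: int, radius: int) -> int:
    """Signed distance from point p to a sphere centred at the origin."""
    length = fx_sqrt(fx_mul(px, px) + fx_mul(py, py) + fx_mul(pz, pz))
    return length - radius


d = sphere_distance(to_fixed(3), to_fixed(4), to_fixed(0), to_fixed(1))
print(to_float(d))
```

A ray marcher would call a distance function like `sphere_distance` repeatedly, stepping along the view ray by the returned distance until it gets close to a surface; the step loop and shading are omitted here.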
10:40 Sometimes the last three MCUs on each module wouldn't respond. That triggered a memory: GPIO C13 to C15 are special, limited in speed and current. I had used them due to a GPIO shortage. Big mistake. To fix it, I repurposed GPIOs from the old low-side driver circuit that I didn't populate and replaced the problematic pins.

11:02 Ping is again another level. Okay, let's try the ping.

11:09 Finally, for the first time, all 160 MCUs responded correctly.

11:13 All green. Yes.

11:18 But did they perform? The ray marcher ran, but the colors were wrong. It was devastating.

11:26 And we have some flipped bytes here, or...

11:29 But after some thought, I realized the view vectors sent to the MCUs were correct. The scene geometry was fine. Only the returned color data was corrupted. I used my debug tool to ping an MCU with extra data attached, testing each of the data lines. And sure enough, some bits wouldn't echo properly, specifically on low-to-high transitions.

11:50 Here's the issue. I used an open-drain bus where each GPIO can only pull low. There is an external pull-up resistor to bring the signal high. This avoids shorts when multiple MCUs try talking at once. But the 10k pull-ups were too weak for the megahertz signals. The signal rise time was too slow, resulting in false lows. I tried 1 kilohm pull-ups, but the already underpowered MCUs couldn't handle the current and stopped working entirely.

12:19 Yeah, that's not working. My compromise: add a delay on the host side to let the signal settle.

12:29 That worked, and all colors were now correct.

12:33 But still, I had two modules dropping out. After testing and swapping slots, I suspected bad solder joints. Patching up a few cold-looking pins fixed the issue eventually, and finally the cluster was fully functional. Oh yeah.
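
The pull-up trade-off described above can be illustrated with a simple RC estimate. This is a rough sketch under stated assumptions, not a measurement from the video: the bus capacitance (100 pF), the input-high threshold (70 % of the supply), and the 1 MHz bit rate are all assumed values, and the function names are made up for this example. Only the 3.3 V supply and the 10 kΩ and 1 kΩ resistor values come from the transcript.

```python
import math

VDD = 3.3  # supply voltage in volts (figure quoted in the video)


def time_to_threshold(r_pullup, c_bus, v_threshold, vdd=VDD):
    """Seconds for an RC-charged line to rise from 0 V to v_threshold.

    Solves v_threshold = vdd * (1 - exp(-t / (R * C))) for t.
    """
    return -r_pullup * c_bus * math.log(1.0 - v_threshold / vdd)


def low_level_current(r_pullup, vdd=VDD):
    """Current a driver must sink through the pull-up while holding the line low."""
    return vdd / r_pullup


C_BUS = 100e-12      # ASSUMPTION: 100 pF total capacitance on one bus line
V_HIGH = 0.7 * VDD   # ASSUMPTION: input is read as high at 70 % of VDD
BIT_PERIOD = 1e-6    # ASSUMPTION: 1 MHz signalling

for r in (10_000, 1_000):
    t = time_to_threshold(r, C_BUS, V_HIGH)
    print(f"{r:>6} ohm: rise {t * 1e9:7.1f} ns, "
          f"{t / BIT_PERIOD:5.2f} bit periods, "
          f"sink {low_level_current(r) * 1e3:.2f} mA per line")
```

Under these assumptions the 10 kΩ pull-up needs roughly 1.2 µs to reach the threshold, which is longer than a 1 µs bit, so a receiver sampling on time would still read a low. The 1 kΩ pull-up rises ten times faster but makes every driver sink ten times the current per line, which is consistent with the direction of both failures described in the video. A host-side delay effectively lengthens the time available for the slow rise.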

12:49 Until this point, several times I wanted to give up. I'm glad I didn't. I guess after solving those puzzles and going through this emotional roller coaster, a success is even more rewarding. But how successful is it? The ray marcher didn't look any faster than running on a single module, and that's because the serial port is the bottleneck: 115 kilobits per second. The interface chip should support up to 8 megabits, but I haven't been able to get it working yet. If anyone out there has figured this out, let me know. Until then, I can either increase the rendering complexity to utilize the MCUs more or try compute-heavy, low-bandwidth tasks like hashing.

13:28 So, I built a SHA-256 hashing benchmark. SHA-256 is an algorithm that crypto miners use. Not that I support that waste of energy, but it's a good benchmark, and it worked amazingly. The 160 MCUs combined actually outperformed my 8-core desktop CPU. At just 7 milliamps per core at 3.3 volts, the whole cluster draws under 4 W. That's pretty competitive for the hash rate. So, what do you think we should use it for?

13:57 This project took way more time, effort, and emotional energy than I expected. There were so many failures, but just as many lessons learned. I hope I was able to inspire you and maybe even teach a few things through my mistakes. You will find all the design files and code linked below. Just don't send the design to a fab before fixing it. If you appreciate this kind of content, consider subscribing or sharing with a friend. And thanks to my supporters. You have been incredibly patient waiting for signs of life. See you next time. Bye.
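
The power figure and the serial bottleneck quoted in the transcript can be checked with a few lines of arithmetic. The numbers for core count, current, voltage, and baud rate come from the video (taking "115 kilobits" as the common 115200 baud setting); the 8N1 framing, the 160 × 120 image size, the 3 bytes per pixel, and all function names are assumptions made purely for illustration.

```python
def cluster_power_watts(n_mcus, amps_per_mcu, volts):
    """Total power if every MCU draws the same current at the same voltage."""
    return n_mcus * amps_per_mcu * volts


def payload_bytes_per_second(baud, bits_per_byte=10):
    """Usable bytes per second on a UART link.

    ASSUMPTION: 8N1 framing, i.e. 1 start bit + 8 data bits + 1 stop bit.
    """
    return baud / bits_per_byte


def frame_seconds(width, height, bytes_per_pixel, baud):
    """Time to return one frame of pixel data over the serial link."""
    return width * height * bytes_per_pixel / payload_bytes_per_second(baud)


# Figures from the video: 160 MCUs, 7 mA each, 3.3 V, 115200 baud.
power = cluster_power_watts(160, 0.007, 3.3)
print(f"cluster power: {power:.3f} W")

# ASSUMPTION: a hypothetical 160 x 120 image with 3 color bytes per pixel.
seconds = frame_seconds(160, 120, 3, 115_200)
print(f"one frame over serial: {seconds:.1f} s")
```

The product 160 × 7 mA × 3.3 V comes to about 3.7 W, matching the "under 4 W" claim. For the assumed image size, returning the color data alone would take about 5 s at 115200 baud no matter how fast the MCUs render, which illustrates why a low-bandwidth task such as hashing shows the cluster's speed better than the ray marcher.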