TensorScience

Building your own deep learning machine in 2023: some reflections

Published on: November 12, 2023

My experience building a DIY deep learning machine in 2023.

Introduction

This year I dove into the world of DIY deep learning rigs. There's something exciting about piecing together a powerful machine that can churn through terabytes of data to train and apply neural nets. These nets can now see, speak, and write, and with the advent of generative AI they may soon generate their own ideas as well. When I first started, I knew little about PCIe lanes, bandwidth, or why I couldn't just put any four GPUs into a motherboard and call it a day. I found out that you need to pay attention to, for instance, the PCIe version and the number of lanes: x8 versus x16 configurations can bottleneck your performance if you're not careful. (P.S. you haven't felt true frustration until you've tried fitting a square peg into a metaphorical round hole, PCI Express style.)
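To make the x8 versus x16 point concrete, here is a rough sketch that computes theoretical one-direction PCIe bandwidth. The per-lane transfer rates and the 128b/130b encoding are the published figures for PCIe 3.0 and later; the function name is my own, and real-world throughput will be lower because protocol overhead is ignored.

```python
# Approximate PCIe bandwidth per direction, by generation and lane count.
# Transfer rates are in GT/s per lane; PCIe 3.0 and later use 128b/130b encoding.
TRANSFER_RATE_GT_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (protocol overhead ignored)."""
    gigabits_per_lane = TRANSFER_RATE_GT_S[generation] * ENCODING_EFFICIENCY
    return gigabits_per_lane / 8 * lanes


for gen in (3, 4):
    for lanes in (8, 16):
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.1f} GB/s")
```

Note that a PCIe 4.0 x8 slot comes out the same as a PCIe 3.0 x16 slot, which is why both the version and the lane count matter.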

Deep Learning: Hardware Fundamentals

I also discovered another layer of complexity when it came to power delivery and cooling. No one mentions that, along with performance specs, you almost need to be an amateur electrician to figure out whether your rig is going to trip breakers. I had a fun time learning about amperage and surge protectors (and I once accidentally turned my room into a sauna at night, which wasn't appreciated by the rest of my family). Talk about 'hands-on learning'.
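The breaker question comes down to simple arithmetic: current equals power divided by voltage. The sketch below assumes a 120 V, 15 A household circuit and the common guideline of keeping continuous loads under 80% of the breaker rating. Those defaults and the wattages are illustrative assumptions, not measurements from my setup, so check your own panel and local electrical code.

```python
def circuit_load(total_watts: float, volts: float = 120.0,
                 breaker_amps: float = 15.0,
                 continuous_fraction: float = 0.8) -> dict:
    """Estimate current draw and compare it with a continuous-load limit.

    The defaults (120 V, 15 A, 80% guideline) are assumptions for illustration.
    """
    amps = total_watts / volts
    safe_amps = breaker_amps * continuous_fraction
    return {"amps": amps, "safe_amps": safe_amps, "within_limit": amps <= safe_amps}


# Hypothetical total wall draw for everything on the same circuit.
for watts in (1300, 1600):
    result = circuit_load(watts)
    print(f"{watts} W -> {result['amps']:.1f} A, within limit: {result['within_limit']}")
```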

As for the hardware, GPUs blow my mind as to how they can be repurposed from gaming cards into engines for deep learning. The price tags are hefty though. If you're not on the lookout for the newest, spendiest models and are okay with last gen's cards, you can still get quite a bit of bang for your buck. There are often good deals on /r/hardwareswap and eBay.

Especially for ML tasks, the amount of memory on a graphics card is a game-changer for training more complex models. It can mean the difference between waiting days versus hours for training runs to finish. Even though I'm not chasing after state-of-the-art deep learning models, having a GPU like the RTX 3080 makes a huge difference for my projects (such as running my own local chatbot). Having enough GPU memory (VRAM) is essential, so don't skimp on it. If you're curious about squeezing every ounce of performance out of your GPUs, I recommend keeping an eye on the Deep Learning Performance Guide by Hugging Face. This guide is often updated and includes the latest advances on the models available on the platform.

Even with the hardware specs down, don't underestimate the software side either: frameworks like PyTorch and TensorFlow and libraries for CUDA optimization can make or break your setup. And you only understand that once you have a model freeze because you forgot to update your CUDA libraries (best to always keep that stuff up to date as the field is moving fast).
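A quick way to see which versions are in play is to ask PyTorch directly. This is a minimal sketch that assumes PyTorch is the framework in use; the `cuda_report` helper is a name I made up, and it falls back gracefully when PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch not installed on this machine
    torch = None


def cuda_report() -> dict:
    """Collect the version details that most often drift out of sync."""
    if torch is None:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["cudnn_version"] = torch.backends.cudnn.version()
    return report


print(cuda_report())
```

If `cuda_available` is False on a machine with an NVIDIA card, a mismatch between the driver and the CUDA build is a likely place to start looking.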

The biggest takeaway for me has been understanding and respecting the hardware: each component has its role in a finely tuned ecosystem. I've done quite a bit of research, which I will go into below, including following GitHub projects like tortoise-tts. This keeps me on top of the latest ML developments and doubles as a learning opportunity.

Building my own deep learning system isn't just about the end result for me - I see it as being about the knowledge and experience gained throughout the process. It's satisfying to see a pile of boxes and wires turn into a powerhouse capable of helping me develop AI applications.

Self-Built Versus Cloud

I've been going back and forth on the economics of building a personal deep learning rig versus leveraging the cloud - the ROI dilemma is not easy. When I first thought about putting together my own setup, the initial costs looked like a steep climb – we're talking about a heap of cash upfront. But I liked the idea of having total control over my hardware and to some extent learning about assembling and putting the different components together (see, for instance: my experience using an external GPU (eGPU) for deep learning).

Looking at the cloud, it's convenient. You pay for what you use, and you get access to cutting-edge hardware with a few clicks. But here's the thing: cost-efficiency swings both ways. On one hand, cloud services like AWS, Google Cloud, or vast.ai can rack up a hefty tab if you're training models frequently or working with large datasets. Check out their pricing and a simple calculator will show you that the numbers can escalate quickly (AWS Pricing for instance).

Now, on the flip side, if I look at my DIY rig, I've invested in some solid equipment – a rig that packs a punch with an Intel Core i7, an RTX 3080, and ample RAM. It wasn't cheap, but it wasn't break-the-bank expensive either. I've snagged myself a space where I can train models without sweating over a ticking cost meter. Plus, if I need a break from deep learning tasks, I can use the same machine for some pretty hefty gaming.
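One way to frame the comparison is a break-even estimate: how many training hours until the hardware has paid for itself relative to renting? The sketch below uses made-up numbers for the hardware cost, cloud rate, and power draw; only the electricity price is the figure quoted later in this post. Depreciation and resale value are ignored.

```python
def break_even_hours(hardware_cost: float, cloud_rate_per_hour: float,
                     power_watts: float, electricity_per_kwh: float) -> float:
    """Hours of use after which owning is cheaper than renting."""
    local_rate = power_watts / 1000 * electricity_per_kwh  # cost per hour at home
    saving_per_hour = cloud_rate_per_hour - local_rate
    if saving_per_hour <= 0:
        raise ValueError("cloud is not more expensive per hour than running locally")
    return hardware_cost / saving_per_hour


# Hypothetical inputs: $2000 build, $0.50/h cloud GPU, 450 W draw under load.
hours = break_even_hours(2000, 0.50, 450, 0.0875)
print(f"Break-even after about {hours:.0f} hours ({hours / 24:.0f} days of continuous use)")
```

The result is very sensitive to the cloud rate you compare against, so plug in the actual price of the instance type you would otherwise rent.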

Of course, there's the whole depreciation argument – GPUs can hold their value, but that's assuming the market doesn't shift drastically. When crypto crashed, everyone tried to offload their mining rigs. I was one of those lucky ones who actually managed to sell an older GPU for more than I paid for it.

Moreover, we can't ignore the educational value that building your own rig brings. As I mentioned earlier, I've learned stacks about hardware intricacies – PCIe lanes, bottlenecking, even the subtle art of optimizing thermal paste application. Each lesson makes me a better techie, and that's something you just can't put a price on.

Then there's the impact factor. I've been able to train models for things that matter to me – we're talking personal projects that could, eventually, turn into something bigger. The satisfaction I get when a model I've been working on starts to perform well is great.

I still use the cloud though for heavy lifting when needed. It's the flexibility for me – my rig for day-to-day tasks, ready-to-use cloud firepower for those resource-hungry models. It's all about balancing cost and convenience. I go through the hosted repositories on places like GitHub for frameworks or models that are a notch above my home setup.

At the end of the day, I'd say it's not just about return on investment in the financial sense; it's also about return on intellect and the joy of mastering a craft. Sure, my electricity bill looks a tad inflated, and my room sometimes doubles as a sauna (free-of-charge!), but each day with my rig is a new adventure. So, when you break it down, the "DIY vs. cloud" dilemma isn't just a dollars-and-cents question; it cuts into the fabric of what makes us passionate about tech. And can you really put a price tag on that?

Technical Challenges: Cooling and Power

I quickly learned that cooling and power weren't just trivial details but vital aspects of running a performant and sustainable setup.

When I tried cramming multiple GPUs (a second RTX 3080 that I borrowed) into my deep learning rig, the biggest headache was the heat they produced. It's like having a small sun tucked away in your garage. I had to get innovative; standard cooling just wasn't cutting it. Delving into the world of PCIe bifurcation and swamp coolers transformed my space, and these discoveries were a total game-changer for my cooling strategy. While online resources, such as various GitHub and Hugging Face repositories, were awash with software solutions, the cooling setup became a personal experimentation lab.

I turned my setup into something resembling a large computer tower: intake at the bottom, exhaust at the top. At the same time, I developed a love-hate relationship with my electricity bill. Initially, paying $0.0875/kWh, I was living the dream.
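For a sense of scale, the monthly electricity cost is just average power times hours times price. The rate below is the one quoted above; the 700 W average draw and 24/7 usage are assumptions for illustration.

```python
def monthly_energy_cost(avg_watts: float, hours_per_day: float,
                        price_per_kwh: float, days: int = 30) -> float:
    """Electricity cost for one month at a constant average power draw."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# Assumed: 700 W average wall draw, running around the clock.
cost = monthly_energy_cost(700, 24, 0.0875)
print(f"${cost:.2f} per month")  # prints "$44.10 per month"
```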

But when you're seeing returns, it's hard not to justify the power (and the surge protectors come in handy). With swarm learning, the thought of renting my GPUs out was tempting, especially when I saw monthly profits hit the $600-800 range. But having personal hardware had its perks, like being able to tweak every inch of the system for peak efficiency. Those 3080s and 3090s, the latter with NVLink support, held more potential in my hands than in any cloud setup, despite their insistence on high-speed interconnects like 100GbE.

When looking into bringing costs down, I stumbled upon MikroTik switches, which offer a 4x100Gbps switch for $800. What a deal, right? It's kind of mind-boggling that we now live in an age where I can have such high bandwidth in my house without breaking the bank. Back in the day, 100 Mbps cost us a small fortune!

All this work, the personalization—the tinkering—it's been about getting that upper edge in data parallel training without getting bottlenecked. The journey from worrying about potentially frying my system to seeing those GPUs slice through data like a hot knife through butter? Priceless. And who knew that humidity control would be such a significant player in effective cooling? The dry air in Utah was an unexpected ally, reducing the need for costly mini-splits. Sometimes nature gives you a freebie.

Diving into this DIY deep learning world, I've developed skills that aren't just about understanding the tech but mastering the environment it lives in. It's easy to get caught up in the allure of putting together a powerhouse rig, but without considering these vital behind-the-scenes players, you're setting up for a meltdown—quite literally.

Whether you're a fellow DIY enthusiast or someone curious about the world of deep learning hardware, keep in mind that tuning those machines is like a dance, an exquisite balance between power draw, cooling innovation, and an eye for economical solutions. I've fitted fans, parsed power supplies, crunched numbers on ROI, and it's been one heck of a learning curve—not just building the system but becoming its maestro.

Scalability and Performance: Harnessing Multiple GPUs

Spreading the computational load across several units really pushed my projects' boundaries (P.S. if you're interested in getting started with hands-on deep learning as well, check out my introduction to PyTorch in Python). It's convenient when you're working on a convolutional neural network (CNN) for image classification, and you can finish it because you've got enough GPU power to experiment with larger datasets without grinding to a halt. The same goes for training language models and watching the loss curve steadily dive thanks to the combined power of multiple GPUs. It's like each GPU is a brain cell, and together, they're a hyper-brain tackling problems en masse.

One core aspect I focused on while expanding my setup was ensuring a balanced architecture. I couldn't slap together a bunch of high-end cards and pray for the best—learning about PCIe lanes and bottlenecks was crucial (I would lose sleep if one card didn't perform just because it was getting choked on a slower connection).

Speaking of connections, technologies like NVIDIA's NVLink bridge deal with precisely this issue. It's a cozy little piece of tech that lets GPUs communicate directly, bypassing the need for data to shuffle back and forth through the CPU, which is, you know, super handy for reducing overhead. For instance, PyTorch's DistributedDataParallel (DDP) handles multi-GPU training very effectively, especially when combined with NVLink or high-speed InfiniBand networks.
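The reason interconnect speed matters is that data-parallel training has every GPU compute gradients on its own slice of the batch and then average them across GPUs at each step. The toy below is a pure-Python simulation of that idea for a one-parameter linear model; it is a conceptual sketch, not the PyTorch DDP API, and all names in it are my own.

```python
def mse_gradient(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w: float, xs: list, ys: list, num_workers: int) -> float:
    """Split the batch into equal shards, compute one gradient per shard, then average."""
    if len(xs) % num_workers != 0:
        raise ValueError("batch size must divide evenly across workers")
    shard = len(xs) // num_workers
    grads = [
        mse_gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Stand-in for the all-reduce step that averages gradients across GPUs.
    return sum(grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
split = data_parallel_gradient(0.5, xs, ys, num_workers=2)
print(full, split)  # both gradients are identical
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, so the result is the same as single-GPU training; the cost is that every step has to move gradients between cards.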

Another thing that struck me was the impact of good old software optimization. Even if you've got a beast of a hardware setup, it's the software tweaks that can squeeze that extra juice out of your system. Keeping in sync with updates from CUDA or the latest libraries like TensorFlow or PyTorch can lead to dramatic performance boosts without changing a single screw in your rig.

Honestly, I'm a tad envious of those academic folks who have access to supercomputers or colossal amounts of cloud credits, but there's something profoundly rewarding about building your own system. The sense of ownership and the knowledge that your machine, with its intricacies and customizations, is churning out those results is priceless.

Ultimately, for those of us without the clout to commandeer a supercomputing cluster, the question becomes how to pool enough GPU horsepower intelligently. And sure, while you might hit a point of diminishing returns with hardware expansion due to scaling inefficiencies or power draw, a well-tuned multi-GPU system can still be a budget-friendlier doorway to serious deep learning horsepower.
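One common way to reason about those diminishing returns is Amdahl's law, which I'm bringing in here as an illustration. It says the speedup from N GPUs is capped by the fraction of the work that cannot be parallelized (data loading, gradient synchronization, and so on). The 90% parallel fraction below is an assumed value, not something I measured.

```python
def amdahl_speedup(parallel_fraction: float, num_gpus: int) -> float:
    """Ideal speedup when only part of the workload scales with GPU count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_gpus)


# Assumed: 90% of each training step parallelizes perfectly.
for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): {amdahl_speedup(0.9, n):.2f}x")
```

Under that assumption, going from four to eight GPUs adds far less than doubling, which matches the intuition that a well-tuned small setup can beat a poorly balanced large one on cost.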

And hey, when you're a part of online communities like r/MachineLearning, sharing your multi-GPU triumphs (and woes) feels like contributing to a wider knowledge base. We're all in this whirlwind of deep learning discovery together, after all. It's like I can throw my findings into the ring and maybe help someone out there—it's that shared learning that drives innovation forward.

Building vs Buying: Component Choices

Building or buying a deep learning rig can be a bit of a rabbit hole, but man, is it a fascinating one! I've been there, tweaking and tuning, and I've come to appreciate the delicate balance of component choices. It's this incredible puzzle where every piece impacts your final outcome, and honestly, I've learned a ton along the way.

First off, let's chat about the GPU. It's the heart of your DL system, and skimping here is like trying to win a race with a go-kart at the Grand Prix. I've seen folks go for something like an RTX 3080 because of budget constraints or availability issues. I get it, prices can be wild, but if you can, snag a GPU with more VRAM. It's a game-changer when you start training more complex models. The RTX 3080 with 12GB VRAM is a sweet spot if you manage to find a good deal.

Moving on, the CPU and RAM. The guidelines can be a bit looser here, but they're still crucial. I chose an Intel Core i7 for my latest build – beefy enough to juggle data preprocessing, but not so overkill that I'm crying into my wallet. Honestly, if you're purely into deep learning, you could even downshift to an 8-core CPU to save some dollar bills for other parts. And on the RAM front? 16GB might seem enough until Chrome decides to munch on it like a midnight snack. Jump to 32GB if you can. It's like opening the windows to let your PC breathe.

Now, let's chat about the unsung hero – the Power Supply. My mantra is, "Don't mess with the PSU." Get a solid one, like I did with my 750W. It seems hefty, but you do not want to skimp on stable power. Trust me, those GPUs are thirsty.
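A simple way to sanity-check a PSU choice is to add up the estimated draw of each component and leave headroom for transient spikes. In this sketch the GPU figure is roughly the reference board power of a 10GB RTX 3080, while the CPU and "everything else" numbers and the 30% headroom are assumptions; check the spec sheets for your actual parts.

```python
def recommended_psu_watts(component_watts: dict, headroom: float = 0.3) -> float:
    """Sum of component power estimates plus a safety margin."""
    return sum(component_watts.values()) * (1 + headroom)


# Illustrative estimates in watts; replace with values from your own spec sheets.
build = {"gpu": 320, "cpu": 125, "drives_fans_motherboard": 75}
print(f"Suggested PSU size: at least {recommended_psu_watts(build):.0f} W")
```

With these assumed numbers the estimate lands a bit under 700 W, so a 750 W unit leaves a reasonable margin for a single-GPU build; a second card would change the picture quickly.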

A quirky little thing I love about building my own rig is the cooling system. I honestly think water cooling is one of humanity's best inventions. It's sleek, keeps your system chilly, and the humming noise is now my geeky lullaby.

I've also seen a trend of questions about finetuning smaller models at home versus going ham grappling with 7B parameter LLMs. Guys, context is key. For the big guns, you want every ounce of power. For lighter tasks and learning, your home-built fortress will do just fine.
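To see why a 7B model is a different league from a small one, it helps to estimate memory from the parameter count. The sketch below uses the usual bytes-per-parameter figures and a common rule of thumb for full fine-tuning with Adam in fp32 (weights, gradients, and two optimizer buffers, so 16 bytes per parameter). Activations, KV cache, and framework overhead are left out, so treat the results as lower bounds.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def adam_training_gb(num_params: float) -> float:
    """fp32 weights + gradients + two Adam moment buffers; activations excluded."""
    return num_params * 16 / 1e9


params = 7e9  # a 7B-parameter model
print(f"fp16 weights only: {weights_gb(params):.0f} GB")
print(f"int8 weights only: {weights_gb(params, 'int8'):.0f} GB")
print(f"full fp32 Adam fine-tuning: {adam_training_gb(params):.0f} GB")
```

Even the fp16 weights alone exceed the 12GB on an RTX 3080, which is why quantization or parameter-efficient fine-tuning comes up so often for home setups.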

And now for the DIY part. Some say why build when you can buy. But where's the fun in that? Every component I’ve chosen, from the NVMe SSD that zips through data, to the cool RGB fans that make my rig look like a mini disco, was meticulously picked. I started off small, but later scored an NVMe drive that made data I/O as smooth as butter. Plus, as my needs grew, so did my rig, bit by bit.

In the end, it all boils down to what you’re diving into. Consider your project requirements, budget, and maybe even the climate you're in! It's your canvas, and you're the da Vinci of your deep learning masterpiece. Check out some of the awesome projects out there, like Tortoise TTS for speech synthesis, to get an idea of what you might need.

To wrap this up, building a deep learning system is a blend of art and science. It's the personal touch, the thrill of the build, the sweet victory when your model trains without a hiccup. It's an investment – in money, sure, but also in knowledge and satisfaction. Whether you're a student, a hobbyist, or someone looking to make a mark in the ML space, rolling up your sleeves and building your own system can be one of the most gratifying adventures in tech. Just remember, what you're creating is more than a machine; it’s a bridge to the future of AI.