Renderjuice Blog

How to Build a Render Farm

This started as a reply to a forum poster asking for advice on how to set up a render farm. Two hours into writing, I realized there was too much valuable information to leave on Reddit. So here it is.

Preface

Questions about setting up your own render farm come up quite frequently. I have a lot of experience with this and spent a good couple of years building the very render farm site you’re on now! Despite that obvious bias, I’m hoping to shed some light as even-handedly as I can and save folks some time and headaches.

But I'll tell you, it ain't easy! Especially if you're working with a distributed team of artists with many different Blender versions, different connectivity speeds, and no training in using render farms.

If you're in it for the learning experience, all power to you! You will definitely learn a lot through the struggle. But here are the major things I've learned that should help guide you. I've marked different sections for different readers: those who just want to render faster and focus on their render quality, those trying to use Deadline, and those who sit somewhere in between.

Can’t I just use Deadline?

I see the recommendation to use Deadline a lot. I do not recommend Deadline anymore. Take it from me, I've tried it, done it, many many times.

Here’s why:

Deadline is difficult and complex - and setting up a render farm is already difficult and complex.
In fact, AWS used to publish a guide on how to properly set up Deadline. The guide was over 225 pages long, and I read it through four or five times. It does not hide that this is no small task. It took me a very long time to grok what was happening at most steps of the way, and after about a month and a half of grinding through it, I still didn't quite understand it. The core of its architecture, with a central scheduler, remote connection servers, and a non-relational database, all of which need to be run securely, is a lot for any artist or small studio just looking to render faster.
Deadline is tarnished by AWS’ impure motives - On top of the already complex architecture, it became obvious as I attempted to set up Deadline that a lot of the complexity stemmed from motives to drive AWS profits. It appeared to me that every single component was intended to eventually run on AWS. Guess how many AWS services you were paying for if you followed this guide? A whopping ten at minimum. Don’t believe me?

Here they are:
1. Shared network drives - “AWS FSx”
2. Blob Storage - “AWS S3”
3. Serverless Functions - “AWS Lambda”
4. Container Instances - “AWS Elastic Container Service”
5. Team Workspaces - “AWS WorkSpaces”
6. Windows Active Directory, but via AWS - “AWS Managed Microsoft AD”
7. Compute Instances - “AWS EC2”
8. Queue Services - “AWS SQS”
9. Logging Services - “AWS CloudWatch”
10. Document Database - “AWS DocumentDB”
It’s not hard to see why this would be the case. After paying a surely hefty price to acquire Thinkbox, it wouldn’t make sense not to try to earn some business for Amazon.
If you’re not that familiar with the acquisition game, all you need to know is that AWS is not an acquirer to pin your highest hopes on. They’re well known in the tech space for doing exactly this: locking users in to pay for expensive services that drive their profits, and leaving users hanging.
Finally, AWS services are not cheap. Not even close to cheap. If you have a massive budget and run a well-financed studio, it may be viable, but I can confidently say that it would be one of the most expensive ways to render faster.

It's risky long term - Setting up a render farm is not a small project, and if you want to build on your solution for years to come, you're in a tricky spot because AWS runs it. It's only a matter of time before Deadline is pretty much impossible to use without AWS and doing things "the AWS way". This is a known problem in tech overall. Large companies and VC funds have the cash to purchase “free” or open source products and will drive initial traction to a product, knowing that they'll eventually lock in users. And once you're nice and snugly locked in to their product and ecosystem, they change the once "open source" license to something that is, at best, semi-proprietary, and you're shit out of luck.

If you sink years into a piece of software, learning its intricacies and studying it religiously, it's not a good feeling when the rug gets pulled out from under you. Furthermore, it’ll take a concerted and painful effort to migrate your workloads away.

This might read as a bit of a conspiracy, "tinfoil hat" moment, especially since Thinkbox’s actual team originally had good intentions. However, if you work with software on a daily basis, this license flip and lock-in is a very well known problem and it happens to the very best founders and companies. In fact, it is happening frighteningly fast to the most important pieces of software right now:
Exhibit A: Redis drops its BSD license for a source-available one - Ars Technica article - Redis license change and forking are a mess. Hacker News reactions

Exhibit B: Terraform's license changes to BSL, and not long after, IBM announces it is acquiring HashiCorp (Terraform's maker) - article. Read reactions to the news of the acquisition here from developers. And just this title: "HashiCorp's Licensing Change is only the latest challenge to Open Source". The emphasis in that title is on "only the latest", to make it clear that it's not a one-time thing.

Exhibit C: The death of CentOS - which used to be one of the most popular Linux distros for servers.

AWS and Deadline are merchants of complexity - Going back to point 1, I've been doing software for over a decade and a half now, and I'd say I'm pretty competent at what I do. Yet setting up Deadline stumped me. The major reason was that AWS was clearly coupling parts of their own paid services to Deadline. For the folks here who want to utilize their own GPU hardware, it will take a mighty effort to separate the components of Deadline that are intended for AWS from those that are for you and your own hardware. That experience was about 3 years ago, and AWS is continuing to progress in the VFX space, but with a "managed" solution running on their own compute. I'm hoping you can see where this is going.
Let's say AWS doesn't change the license because they realize it's bad for their reputation amongst developers. AWS can still make running Deadline on your own, without their help, incredibly difficult.
To be clear, I'm not anti-AWS per se, but from AWS' perspective, more complexity is good. It allows them to push certifications and more and more paid solutions onto you. You can see it in their impossible-to-decipher pricing models, their terrible documentation, and how every AWS service needs another AWS service. This is good for them: when things get too complex, AWS can charge you to help decipher the complexity they made. So, to conclude, would you trust a public company that sells support, is known for bad developer experience, and is incentivized to push their own compute onto you? They do not have aligned incentives to make this easy for you.
Prices - The final thing is that AWS GPU compute is very expensive, and storage via S3 is among the priciest options in the biz once you add up storage and data transfer. Good luck storing large simulation cache files!

Ok, you're convinced not to use Deadline, what are the alternatives?

If you have more than 10 rendering nodes on hand, need weighted load-balancing algorithms and zero rendering downtime, and it's not just you and 2-3 other people submitting renders, you might actually want a lightweight render manager (the component that schedules renders on multiple machines and helps parallelize them).

I'd recommend Flamenco from Blender themselves. We know the Blender Foundation is trustworthy and their governance is good (donate to keep it running so well!). Their codebase is properly open source and they use their own products for their own animations. It's come a pretty long way since I originally saw it, and I have high hopes, especially since it's built for Blender and, unlike AWS, they don't have the same misaligned incentive structure. It's still in the works though, so a few other alternatives are OpenCue and CGRU. I've dabbled in setting up both but don't know them as thoroughly as I do Deadline. I've heard good things, although they won't have that same polish. But who cares, save the polish for your actual renders.

Note these other tips: Try to unify as much of the hardware and software that you're using as you can. Shoot for the same GPUs if possible; if not, at least match VRAM specs, CUDA versions, and OpenGL drivers. Support only a limited set of Blender versions. If you're going to use add-ons, take note of which ones you must support. Learn a bit about how to set up Blender files for submission to render farms in general, because it's not the same as rendering locally due to the distributed nature of a farm (in the sense that it's multiple machines, not one). Note that your file I/O from the shared drive needs to be fast, or your render speeds will crawl when your nodes are trying to load the file into memory.
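To make the "same Blender version everywhere" tip concrete, here's a minimal sketch of a version check across nodes. It assumes passwordless SSH and that blender is on each node's PATH; the host names and the sample output text are made up for illustration, and the parser only looks for a line starting with "Blender" followed by a version number.

```python
import re
import subprocess
from collections import defaultdict
from typing import Dict, List, Optional

# Matches a line such as "Blender 4.1.1" in the output of `blender --version`.
VERSION_RE = re.compile(r"^Blender\s+(\d+\.\d+(?:\.\d+)?)", re.MULTILINE)


def parse_blender_version(version_output: str) -> Optional[str]:
    """Return the version string from `blender --version` output, or None."""
    match = VERSION_RE.search(version_output)
    return match.group(1) if match else None


def fetch_version_output(host: str) -> str:
    """Run `blender --version` on a node over SSH (assumes key-based login)."""
    result = subprocess.run(
        ["ssh", host, "blender", "--version"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def group_hosts_by_version(outputs: Dict[str, str]) -> Dict[Optional[str], List[str]]:
    """Group host names by the Blender version each one reported."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for host, text in outputs.items():
        groups[parse_blender_version(text)].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative sample text; on a real farm you would build this dict with
    # {host: fetch_version_output(host) for host in your_hosts}.
    sample = {
        "node-a": "Blender 4.1.1\n\tbuild date: (example)\n",
        "node-b": "Blender 4.1.1\n\tbuild date: (example)\n",
        "node-c": "Blender 3.6.5\n\tbuild date: (example)\n",
    }
    groups = group_hosts_by_version(sample)
    if len(groups) > 1:
        print("Version mismatch:", groups)
```

More than one key in the result means at least one node is out of step (a None key means Blender didn't answer at all on that host).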

If you're going to use Docker, you're going to have to fiddle around with NVIDIA's container support (the NVIDIA Container Toolkit) to get CUDA running smoothly.

If you're going down this path, I recommend reading the sections below, as it's good to know the workflow (at a high level, it's the same with or without a render manager) before diving in.

The infrastructure setup (your GPUs, their connectivity to your shared storage solution, your team's network access) is more important than the choice of render management solution, and thinking at a high level first can save you some headaches.

Here's the real underrated advice, just don’t use a central scheduler:

If you're just trying to render for yourself and a friend or two or even three, I do not believe that it's worth it to use one of the render managers (Deadline, OpenCue, etc). Using scripting and some glue code is significantly easier and far more intuitive than sticking complex scheduling software in the middle of it all. The Blender CLI is not hard to navigate, and Blender can be run with its GUI or headlessly over SSH.

The flow at a high level is this:

Set up storage - Set up a shared NAS drive or file server that all your machines can reach.
Ensure connectivity - Ensure your machines can be accessed whether that's over SSH or by just walking over.
Render - Split your frame range up to correspond to the rough speed of the GPUs.
Output to the NAS - We still do this all the time at our own render farm, Renderjuice, when we're testing and debugging things for other customers' renders.

You can use some fairly simple Ansible playbooks and Bash scripts to make that happen. SSH into the machines and run the render script, then quit. Loosely, you mostly just need to split the frames up equally across your machines (if they're equally powered). This is definitely the easiest way to set up your own render farm. It's still a farm whether or not it has shiny manager software in the middle. One tip here though is to ensure the Blender version is the same across all your machines and to support only the few versions that you need. Mismatched versions are probably the top mistake.
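Here's a rough sketch of that glue code, in Python rather than Bash just for readability. It splits a frame range into contiguous chunks weighted by each machine's rough speed and builds the matching Blender command for each chunk. The host names, weights, and NAS paths are hypothetical; the Blender flags used are -b (background), -o (output path), -s and -e (start and end frame), and -a (render the animation, which has to come last).

```python
import shlex
from typing import Dict, List, Tuple


def split_frames(start: int, end: int, weights: Dict[str, float]) -> Dict[str, Tuple[int, int]]:
    """Split the inclusive frame range [start, end] into contiguous chunks,
    sized roughly in proportion to each machine's (positive) weight."""
    chunks: Dict[str, Tuple[int, int]] = {}
    hosts = list(weights)
    remaining_weight = sum(weights.values())
    cursor = start
    for index, host in enumerate(hosts):
        remaining_frames = end - cursor + 1
        if index == len(hosts) - 1:
            # The last machine takes whatever is left, so no frame is dropped.
            count = remaining_frames
        else:
            count = round(remaining_frames * weights[host] / remaining_weight)
            count = min(count, remaining_frames)
        remaining_weight -= weights[host]
        if count > 0:
            chunks[host] = (cursor, cursor + count - 1)
            cursor += count
    return chunks


def blender_command(blend_file: str, output_pattern: str, first: int, last: int) -> List[str]:
    """Build a headless Blender render command for one frame chunk."""
    return [
        "blender", "-b", blend_file,
        "-o", output_pattern,
        "-s", str(first), "-e", str(last),
        "-a",  # render the animation; must follow -s/-e/-o
    ]


if __name__ == "__main__":
    # Hypothetical machines: node-a is roughly twice as fast as the others.
    weights = {"node-a": 2.0, "node-b": 1.0, "node-c": 1.0}
    for host, (first, last) in split_frames(1, 240, weights).items():
        command = blender_command(
            "/mnt/nas/shots/scene.blend", "/mnt/nas/renders/frame_####", first, last
        )
        # Print the SSH invocation instead of running it, so you can eyeball it first.
        print(f"ssh {host} {shlex.quote(shlex.join(command))}")
```

With equal weights this degrades to the plain "split equally" case. Printing the commands rather than executing them keeps the sketch safe to run anywhere; piping them into a shell or an Ansible task is the obvious next step.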

It's far more intuitive to operate this way and it's easy to tell when something goes wrong. It's easy to check on your jobs and you get the errors right there! The CLI output tells you right on the screen. You can just opt out of the hassle of aggregating logs from multiple machines into a separate logging server, which you have to do when using render managers like Deadline. And you don't have to think about power and electricity management! You can literally walk over to your machines and pull the plug or type shutdown on the CLI. Most importantly though, this method feels similar to working on your local workstation, unlike putting the job into a render manager. If your machine is hooked up to a display and your render fails, you can pull up the Blender GUI like you always do and look through your file (it's probably packed wrong lol). There are a ton of other headaches that this method makes far easier to debug than having a manager in the middle. Did you know Deadline doesn't have GPU rendering capabilities for Blender by default? Or that race conditions within CUDA's code can cause OptiX to fail? Or that degenerate polygons can pass through some parts of rendering and then stop it in the middle?

This lets you focus more on the Blender part and less on the rendering part, and still reap 80% of the time savings that you hypothetically would have gotten with a render manager. If you need more and more speed, I do still recommend eventually using a render farm service like Renderjuice with more resources, especially if you need a rushed render or a very heavy render done.

If you just want to render and focus on the Blender part, want an out-of-the-box solution, or are non-technical and don't fit neatly above:
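On the "easy to check on your jobs" point, one check worth scripting is which frames never landed on the NAS. This is a small sketch, assuming your outputs are named with a trailing frame number before the extension (for example frame_0001.png, which is what an output pattern like frame_#### produces); the directory path in the usage note is hypothetical.

```python
import re
from typing import Iterable, List, Set

# Captures the trailing frame number in names like "frame_0042.png".
FRAME_RE = re.compile(r"(\d+)\.\w+$")


def rendered_frames(filenames: Iterable[str]) -> Set[int]:
    """Collect the frame numbers found in a list of output file names."""
    frames: Set[int] = set()
    for name in filenames:
        match = FRAME_RE.search(name)
        if match:
            frames.add(int(match.group(1)))
    return frames


def missing_frames(filenames: Iterable[str], start: int, end: int) -> List[int]:
    """Return the frames in [start, end] that have no output file yet."""
    done = rendered_frames(filenames)
    return [frame for frame in range(start, end + 1) if frame not in done]
```

To point it at a real folder, something like missing_frames((p.name for p in Path("/mnt/nas/renders").iterdir()), 1, 240) with pathlib's Path would do; whatever comes back is what you re-run on whichever machine is free.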

I do recommend you check out what we've done at renderjuice.com - we're a small indie, self-funded team and consequently had to work our tails off to get all the above working, but that's good, because we've solved most of the complex kinks (of which there are many) for you! Plus you don't have to maintain the damn thing, and it'll constantly get better off our backs. Some things in the mix that we're working on are immediate render startups, thorough add-on integration, and we've got ACES support in beta :).

I know I'm biased, but hopefully the information above demonstrates some competency and earns some trust. I do think we came up with a phenomenal rendering solution that works very well. We're responsive and helpful wherever we can be. And we've been solely focused on supporting Blender as best we can. We've had some pretty well-known renders go through our pipes and hopefully it'll work for you.

There's a bunch of other things you gotta look out for if you're going down the render manager route and/or have a distributed team and more complex needs. For example: file read I/O speeds, network latency, CUDA drivers (rip my hair out), Blender version management, Blender's Python management (Blender ships with its own Python), and VRAM management.

We've built solutions around as many of them as we can, so hopefully you'll give it a spin.
