Running large language models (LLMs) on your own hardware, often called "private" or "on-premise" AI, keeps your data inside your building, replaces per-user cloud fees with a one-time purchase, and removes vendor lock-in. The catch is sizing it right. AI servers are priced largely by GPU memory (VRAM), the onboard memory where the model runs, and buying too little (or too much) is an expensive mistake.

This guide covers the numbers that actually matter: how much VRAM your models need, why system RAM and the CPU matter too, when self-hosting beats a cloud API, and how to spec a server. eRacks has built custom open-source servers since 1999, and we will size one to your workload at no charge. Configure an AI server →

Why run AI on your own hardware?

  • Privacy and compliance. Some data legally or contractually cannot leave your control: protected health information (HIPAA), attorney-client material, classified or controlled government data, unpublished research, source code. A private server means your prompts and documents never transit a third party's infrastructure.
  • Predictable cost. A one-time purchase instead of per-seat or per-token billing that grows with every user and every query.
  • Control. Your models, your uptime, no rate limits, and no vendor quietly deprecating the model you built a workflow around.

For light or occasional use, a cloud API is cheaper and simpler. Private AI wins when you have data you cannot send out, or when usage is steady, day after day.

The first number: GPU memory (VRAM)

A model has to fit in GPU memory to run at full speed. How much you need is set by the model's parameter count and its quantization, which compresses the model's weights to fewer bits each. Q4 is about 4 bits per weight and usually costs little quality on most tasks, Q8 is about 8 bits, and fp16 is the unquantized 16-bit format at 2 bytes per weight (often loosely called full precision). Add headroom for context length and concurrent users.

| Model size (example) | Q4 (4-bit) | Q8 (8-bit) | Good for |
|---|---|---|---|
| 7 to 8B (Llama 3.1 8B, Mistral 7B) | ~6 GB | ~10 GB | chat, RAG, coding assist |
| 32 to 34B (Qwen 2.5 32B) | ~22 GB | ~38 GB | strong reasoning, agents |
| 70B (Llama 3.3 70B, DeepSeek-R1 70B distill) | ~42 GB | ~80 GB | frontier-class open models |
| 120B+ or several models at once | 70 GB+ | 140 GB+ | heavy or multi-tenant |

A quick rule: VRAM in GB is roughly the parameter count in billions times 0.6 for Q4, or times 1.1 for Q8, with context headroom included. (RAG, or retrieval-augmented generation, feeds the model your own documents at query time.)
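The quick rule can be written as a small calculator. This is a minimal sketch, not a sizing tool: the 0.6 and 1.1 multipliers come from the rule above, while the function names and the weights-only comparison (parameters times bits per weight, divided by 8) are illustrative additions.

```python
# Rule-of-thumb multipliers from the text: GB of VRAM per billion parameters,
# with context headroom already included.
GB_PER_BILLION = {"q4": 0.6, "q8": 1.1}

# Approximate bits per weight, used only for the raw weights-only figure.
BITS_PER_WEIGHT = {"q4": 4, "q8": 8, "fp16": 16}


def weights_only_gb(params_billions: float, quant: str) -> float:
    """Size of the weights alone: parameters x bits per weight / 8 bits per byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Rule-of-thumb VRAM estimate with headroom (the rule covers Q4 and Q8 only)."""
    return params_billions * GB_PER_BILLION[quant]


if __name__ == "__main__":
    for size in (8, 32, 70):
        print(
            f"{size}B: weights-only Q4 {weights_only_gb(size, 'q4'):.1f} GB, "
            f"rule of thumb Q4 {estimate_vram_gb(size, 'q4'):.1f} GB, "
            f"Q8 {estimate_vram_gb(size, 'q8'):.1f} GB"
        )
```

The gap between the weights-only figure and the rule-of-thumb figure is the headroom the rule reserves for context and runtime overhead.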

It is not only VRAM: system RAM and CPU matter too

GPU memory holds the model, but two other things shape a real AI server.

System RAM. The server needs enough conventional RAM to stage models into the GPUs, to run the model server and your data pipeline (RAG retrieval, embeddings, document parsing), and to spill over when a model is a little too big for VRAM. Frameworks like llama.cpp and vLLM can offload some layers to the CPU and system RAM, which is slower but keeps the model running. A good rule is to size system RAM at roughly 1.5 to 2 times your total VRAM. If you run a model entirely on the CPU with no GPU at all, the whole model lives in system RAM, so a 70B model at Q4 needs about 42 GB of RAM for the model alone, plus room for the operating system and everything else running on the box.
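To get a feel for partial offload, the sketch below estimates how many layers of a model fit on the GPU, with the remainder spilling to system RAM. It assumes every layer is the same size and that a fixed reserve is held back for context, which is a simplification; the function name, the 2 GB reserve, and the 80-layer figure for a 70B Llama-family model are assumptions for illustration. llama.cpp exposes the resulting number through its `--n-gpu-layers` option.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is VRAM held back for context and runtime overhead
    (an assumed placeholder, not a measured value).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical case: a ~42 GB Q4 70B model (80 layers assumed) on one 32 GB card.
    on_gpu = gpu_layers_that_fit(model_gb=42, n_layers=80, vram_gb=32)
    print(f"{on_gpu} layers on the GPU, {80 - on_gpu} offloaded to system RAM")
```

Every layer left on the CPU side slows generation, so this is a way to keep a slightly oversized model running, not a substitute for enough VRAM.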

CPU and PCIe lanes. The processor feeds the GPUs through PCIe lanes, so a multi-GPU server needs a CPU with enough lanes to drive every card at full bandwidth. That is a big reason eRacks builds on server-class AMD EPYC and Intel Xeon processors rather than desktop chips: they provide far more PCIe lanes and support ECC (error-correcting) memory. The CPU also runs the model server, schedules concurrent users, and does the embedding and retrieval work in a RAG setup. For a single-GPU chat box the CPU is rarely the bottleneck; for multi-GPU, multi-user, or RAG-heavy workloads, cores and lanes start to matter.

The short version: size the VRAM to your largest model, the system RAM to about 1.5 to 2 times the VRAM, and choose a server CPU with the lanes for your GPU count.
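The short version can be turned into a rough spec calculator. This is a sketch under stated assumptions: the 1.5 to 2 times RAM range comes from the text, while the function name and the figure of 16 PCIe lanes per GPU (a full-width x16 slot) are assumptions, since actual lane needs depend on the card and the motherboard.

```python
import math


def size_server(required_vram_gb: float, gb_per_gpu: float,
                lanes_per_gpu: int = 16) -> dict:
    """Rough server spec from a VRAM requirement.

    lanes_per_gpu=16 assumes each card sits in a full x16 slot (an assumption).
    """
    gpu_count = math.ceil(required_vram_gb / gb_per_gpu)
    total_vram_gb = gpu_count * gb_per_gpu
    return {
        "gpu_count": gpu_count,
        "total_vram_gb": total_vram_gb,
        "ram_min_gb": 1.5 * total_vram_gb,  # low end of the 1.5x to 2x rule
        "ram_max_gb": 2.0 * total_vram_gb,  # high end of the rule
        "pcie_lanes_for_gpus": gpu_count * lanes_per_gpu,
    }


if __name__ == "__main__":
    # A 70B model at Q4 (~42 GB) on 32 GB cards.
    print(size_server(required_vram_gb=42, gb_per_gpu=32))
```

For the 70B Q4 case on 32 GB cards, this works out to two GPUs, 64 GB of VRAM, 96 to 128 GB of system RAM, and 32 lanes for the GPUs alone, before storage and networking take their share.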

When self-hosting beats the cloud

The arithmetic is direct. A cloud subscription such as ChatGPT Team runs about $30 per user per month. For a 30-person team that is roughly $10,800 a year, every year, with your prompts on someone else's servers. An on-premise eRacks AILSA at $5,995 covers the same everyday inference on hardware you own, and pays for itself in about seven months. After that, the remaining costs are mainly power and upkeep.

The framework: self-hosting wins when the annual cloud fees, multiplied by the years you will use the server, exceed the cost of the hardware plus power, or when any data simply cannot go to a third party. In practice that means roughly 5 to 10 regular users and up, or any privacy mandate.
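The break-even framework reduces to a few lines of arithmetic. A minimal sketch, using the $30 per user per month and $5,995 figures from the example above; the function names are illustrative, and the monthly power cost is a placeholder you would replace with your own estimate.

```python
from typing import Optional


def annual_cloud_cost(users: int, per_user_month: float) -> float:
    """Yearly subscription cost for a team."""
    return users * per_user_month * 12


def payback_months(hardware_cost: float, users: int, per_user_month: float,
                   power_per_month: float = 0.0) -> Optional[float]:
    """Months until the hardware cost is recovered from avoided cloud fees.

    power_per_month defaults to 0 as a placeholder; supply your own figure.
    Returns None when the server never pays for itself.
    """
    monthly_saving = users * per_user_month - power_per_month
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    print(f"Cloud cost per year: ${annual_cloud_cost(30, 30):,.0f}")
    print(f"Payback, 30 users: {payback_months(5995, 30, 30):.1f} months")
    print(f"Payback, 5 users: {payback_months(5995, 5, 30):.1f} months")
```

With power left at zero, 30 users recover the $5,995 in under seven months, while 5 users take a little over three years, which is why the threshold sits in the 5 to 10 user range.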

The GPUs: VRAM without the NVIDIA tax

You do not need flagship NVIDIA silicon to run these models. You need VRAM.

  • Intel Arc Pro B50 16GB (low-profile, about $349 to $399): the value pick. Four of them give 64 GB for well under $2,000 of GPU.
  • Intel Arc Pro B70 32GB (about $949): roughly half the price per gigabyte of VRAM of comparable NVIDIA professional cards. Four give 128 GB.
  • NVIDIA RTX PRO 4000 Blackwell SFF 24GB (configure-to-order): when you need the CUDA ecosystem and ECC memory in a small-form-factor, 70-watt card.
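Price per gigabyte of VRAM is easy to check from the list prices above. The sketch below uses the upper B50 price and the B70 price quoted in this guide; the NVIDIA card is left out because it is configure-to-order and no price is given. Street prices vary, so treat the output as a snapshot.

```python
# (name, VRAM in GB, approximate price in USD) taken from the list above.
CARDS = [
    ("Intel Arc Pro B50", 16, 399),
    ("Intel Arc Pro B70", 32, 949),
]


def price_per_gb(vram_gb: int, price_usd: float) -> float:
    """Dollars paid per gigabyte of VRAM."""
    return price_usd / vram_gb


if __name__ == "__main__":
    for name, vram_gb, price_usd in CARDS:
        print(
            f"{name}: ${price_per_gb(vram_gb, price_usd):.2f} per GB, "
            f"four cards = {4 * vram_gb} GB for ${4 * price_usd:,}"
        )
```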

Sizing an eRacks server to your models

| Server | GPU memory | Comfortably runs | From |
|---|---|---|---|
| AILSA (2U) | up to 96 GB (low-profile to flagship) | Llama 3.3 70B (Q4), Qwen 2.5 32B, several smaller models | $5,995 |
| AIDAN (2U) | 32 GB (1 Arc Pro B70) | 32 to 34B models, 8B at full precision | $13,895 |
| AINSLEY (4U) | 128 GB (4 Arc Pro B70) | 70B with room for long context | $21,995 |
| AISHA (4U) | up to 256 GB (8 Arc Pro B70) | 70B at Q8, or several models, multi-tenant | $30,995 |
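The table can be read as a simple capacity filter. This sketch lists which servers have enough maximum GPU memory for a given VRAM requirement; the data is copied from the table, the function name is illustrative, and the starting prices are assumed to describe base configurations that may carry less than the maximum GPU memory shown.

```python
# (server, maximum GPU memory in GB, starting price in USD) from the table above.
SERVERS = [
    ("AILSA", 96, 5995),
    ("AIDAN", 32, 13895),
    ("AINSLEY", 128, 21995),
    ("AISHA", 256, 30995),
]


def servers_that_fit(required_vram_gb: float) -> list:
    """Names of servers whose maximum GPU memory covers the requirement."""
    return [name for name, max_vram_gb, _ in SERVERS
            if max_vram_gb >= required_vram_gb]


if __name__ == "__main__":
    for need in (22, 42, 140):
        print(f"{need} GB needed: {servers_that_fit(need)}")
```

A capacity match is only the first cut; context length, concurrent users, and room to grow still decide between the candidates.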

Every eRacks AI server ships with Ubuntu LTS and a complete open-source AI stack (Ollama, Open WebUI, vLLM, llama.cpp, PyTorch) pre-installed and tested. Staff reach the AI from a browser on day one. There are no per-seat or per-token fees, you own the hardware outright, and your data never leaves the building.
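Beyond the browser, a stack like this can also be scripted. The sketch below sends one prompt to Ollama's `/api/generate` endpoint using only the Python standard library. It assumes Ollama is listening on its default port 11434 on the same machine and that a model tagged `llama3.1:8b` has been pulled; both are assumptions about your setup, not statements about how a given server ships.

```python
import json
import urllib.request

# Assumed address: Ollama's default port on the local machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request (the model tag is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Summarize our retention policy in two sentences.")` on the server itself returns the model's reply as a string, and the prompt never leaves the machine.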

Bottom line

  1. Start from the model, not the GPU. Decide the largest model you will run and at what quantization, size the VRAM (in GB, about the parameter count in billions times 0.6 for Q4), then add system RAM at 1.5 to 2 times that, and a server CPU with the lanes for your GPUs.
  2. If privacy is the driver, you are already done. On-premise is the answer, and the only question is which size.
  3. The entry price is lower than people expect. A private server for a 70B-class model starts at $5,995.

eRacks will spec a private AI server to your exact models and user count at no charge, including the open-source software stack, burned in and tested. Configure an AI server → or email joe@eracks.com.

Last updated: June 27, 2026