Why run AI on your own hardware instead of a cloud API?

Three reasons: privacy and compliance (data like protected health information, attorney-client material, government data, or unpublished research that cannot legally or contractually leave your control), predictable cost (a one-time purchase instead of per-seat or per-token billing), and control (your models, your uptime, no rate limits). For light or occasional use a cloud API is cheaper and simpler; private AI wins when you have data you cannot send out, or when usage is steady and everyday.

How much GPU memory (VRAM) do I need to run a large language model?

VRAM in gigabytes is roughly the parameter count in billions times 0.6 for 4-bit quantization (Q4), or times 1.1 for 8-bit (Q8), with context headroom included. So a 7 to 8B model needs about 6 GB at Q4, a 32 to 34B model about 22 GB, and a 70B model about 42 GB at Q4 or about 80 GB at Q8. The model has to fit in GPU memory to run at full speed.

How much system RAM does an AI server need?

Size system RAM at roughly 1.5 to 2 times your total GPU VRAM. The server uses conventional RAM to stage models into the GPUs, to run the model server and your data pipeline (retrieval and embeddings), and to spill over when a model is slightly too big for VRAM, since frameworks like llama.cpp and vLLM can offload layers to the CPU and system RAM. If you run a model entirely on the CPU with no GPU, the whole model lives in system RAM, so a 70B model at Q4 needs about 48 GB of RAM by itself.

Does the CPU matter for AI inference, or only the GPU?

The CPU feeds the GPUs through PCIe lanes, so a multi-GPU server needs a processor with enough lanes to drive every card at full bandwidth. That is why eRacks builds on server-class AMD EPYC and Intel Xeon processors rather than desktop chips: they provide far more PCIe lanes and support ECC memory. The CPU also runs the model server, schedules concurrent users, and does embedding and retrieval work. For a single-GPU chat box the CPU is rarely the bottleneck; for multi-GPU, multi-user, or retrieval-heavy workloads, cores and lanes start to matter.

When is self-hosting AI cheaper than a cloud subscription?

Self-hosting wins when the annual cloud fees, multiplied by the years you will use the system, exceed the cost of the hardware plus power, or when any data simply cannot go to a third party. In practice that means roughly 5 to 10 regular users or more, or any privacy mandate. A cloud subscription such as ChatGPT Team runs about $30 per user per month, so a 30-person team spends about $10,800 a year; an on-premise eRacks AILSA at $5,995 covers the same everyday inference and pays for itself in under a year.

Can I run a 70B model privately, and what does it cost?

Yes. A 70-billion-parameter model such as Llama 3.3 70B runs privately at 4-bit quantization on the eRacks AILSA, a 2U server starting at $5,995 with up to 96 GB of GPU memory. It ships with Ubuntu, Ollama, and Open WebUI, so staff use it from a browser on day one, with no per-seat or per-token fees and no data leaving the building.

Which GPUs run large language models without paying the NVIDIA premium?

You need VRAM, not flagship silicon. The Intel Arc Pro B50 16GB (about $349 to $399, low-profile) is the value pick; four give 64 GB. The Intel Arc Pro B70 32GB (about $949) costs roughly half as much per gigabyte of VRAM as comparable NVIDIA professional cards; four give 128 GB. The NVIDIA RTX PRO 4000 Blackwell SFF 24GB is available when you need the CUDA ecosystem and ECC memory.

Private AI, Sized & Priced

Running large language models (LLMs) on your own hardware, often called "private" or "on-premise" AI, keeps your data inside your building, replaces per-user cloud fees with a one-time purchase, and removes vendor lock-in. The catch is sizing it right. AI servers are priced largely by GPU memory (VRAM), the onboard memory where the model runs, and buying too little (or too much) is an expensive mistake.

This guide covers the numbers that actually matter: how much VRAM your models need, why system RAM and the CPU matter too, when self-hosting beats a cloud API, and how to spec a server. eRacks has built custom open-source servers since 1999, and we will size one to your workload at no charge. Configure an AI server →

Why run AI on your own hardware?

Privacy and compliance. Some data legally or contractually cannot leave your control: protected health information (HIPAA), attorney-client material, classified or controlled government data, unpublished research, source code. A private server means your prompts and documents never transit a third party's infrastructure.
Predictable cost. A one-time purchase instead of per-seat or per-token billing that grows with every user and every query.
Control. Your models, your uptime, no rate limits, and no vendor quietly deprecating the model you built a workflow around.

For light or occasional use, a cloud API is cheaper and simpler. Private AI wins when you have data you cannot send out, or when usage is steady and everyday.

The first number: GPU memory (VRAM)

A model has to fit in GPU memory to run at full speed. How much you need is set by the model's parameter count and its quantization, which compresses the model's weights to fewer bits each (Q4 is about 4 bits per weight and is close to lossless for most tasks, Q8 is about 8 bits, and fp16 is the unquantized 16-bit format, usually called "full precision" in this context, at 2 bytes per weight). Add headroom for context length and concurrent users.
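The weights-only part of that arithmetic can be sketched in a few lines. This is a minimal illustration, assuming exactly 4, 8, and 16 bits per weight and counting 1 GB as one billion bytes; real quantization formats carry a little extra overhead, and the function name is only for this example.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    This excludes context (KV cache) and per-user headroom, which must be
    added on top of the figure returned here.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight


# Assumed bit widths: the "about 4 bits" and "about 8 bits" from the text.
for name, bits in [("Q4", 4), ("Q8", 8), ("fp16", 16)]:
    gb = weight_footprint_gb(70, bits)
    print(f"70B at {name}: ~{gb:.0f} GB of weights before headroom")
```

For a 70B model this gives 35 GB at Q4, 70 GB at Q8, and 140 GB at fp16 for the weights alone, which is why the practical VRAM figures sit somewhat above these numbers once headroom is included.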

Model size (example)	Q4 (4-bit)	Q8 (8-bit)	Good for
7 to 8B (Llama 3.1 8B, Mistral)	~6 GB	~10 GB	chat, RAG, coding assist
32 to 34B (Qwen 2.5 32B)	~22 GB	~38 GB	strong reasoning, agents
70B (Llama 3.3 70B, DeepSeek-R1-Distill-Llama-70B)	~42 GB	~80 GB	frontier-class open models
120B+ or several models at once	70 GB+	140 GB+	heavy or multi-tenant

A quick rule: VRAM in GB is roughly the parameter count in billions times 0.6 for Q4, or times 1.1 for Q8, with context headroom included. (RAG, or retrieval-augmented generation, feeds the model your own documents at query time.)
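As a rough sketch, the quick rule fits in one small function. The multipliers are the 0.6 and 1.1 from the rule; the function and dictionary names are illustrative only. The table's entries for smaller models run a little higher than the bare multiplier, so treat the result as a floor rather than an exact figure.

```python
# Multipliers from the rule of thumb: GB of VRAM per billion parameters,
# with context headroom already included.
VRAM_GB_PER_BILLION_PARAMS = {"Q4": 0.6, "Q8": 1.1}


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Rule-of-thumb VRAM estimate in GB for a model size and quantization."""
    if quant not in VRAM_GB_PER_BILLION_PARAMS:
        raise ValueError(f"no rule-of-thumb multiplier for {quant!r}")
    return params_billions * VRAM_GB_PER_BILLION_PARAMS[quant]


for size in (8, 32, 70):
    q4 = estimate_vram_gb(size, "Q4")
    q8 = estimate_vram_gb(size, "Q8")
    print(f"{size}B model: ~{q4:.0f} GB at Q4, ~{q8:.0f} GB at Q8")
```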

It is not only VRAM: system RAM and CPU matter too

GPU memory holds the model, but two other things shape a real AI server.

System RAM. The server needs enough conventional RAM to stage models into the GPUs, to run the model server and your data pipeline (RAG retrieval, embeddings, document parsing), and to spill over when a model is a little too big for VRAM. Frameworks like llama.cpp and vLLM can offload some layers to the CPU and system RAM, which is slower but keeps the model running. A good rule is to size system RAM at roughly 1.5 to 2 times your total VRAM. If you run a model entirely on the CPU with no GPU at all, the whole model lives in system RAM, so a 70B model at Q4 needs about 48 GB of RAM by itself.

CPU and PCIe lanes. The processor feeds the GPUs through PCIe lanes, so a multi-GPU server needs a CPU with enough lanes to drive every card at full bandwidth. That is a big reason eRacks builds on server-class AMD EPYC and Intel Xeon processors rather than desktop chips: they provide far more PCIe lanes and support ECC (error-correcting) memory. The CPU also runs the model server, schedules concurrent users, and does the embedding and retrieval work in a RAG setup. For a single-GPU chat box the CPU is rarely the bottleneck; for multi-GPU, multi-user, or RAG-heavy workloads, cores and lanes start to matter.

The short version: size the VRAM to your largest model, the system RAM to about 1.5 to 2 times the VRAM, and choose a server CPU with the lanes for your GPU count.
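The short version can be written as a small sizing check. This is a sketch under stated assumptions: the 1.5 to 2 times ratio comes from the text, while the 16 lanes per GPU and the example CPU lane counts (128 for a server part, 24 for a desktop part) are illustrative values, not figures from this guide; check the spec sheet of the actual processor and cards.

```python
LANES_PER_GPU = 16  # assumption: every card gets a full x16 PCIe link


def system_ram_range_gb(total_vram_gb: float) -> tuple[float, float]:
    """System RAM range: roughly 1.5 to 2 times total VRAM."""
    return 1.5 * total_vram_gb, 2.0 * total_vram_gb


def lanes_needed(gpu_count: int, lanes_per_gpu: int = LANES_PER_GPU) -> int:
    """PCIe lanes required to run every GPU at full link width."""
    return gpu_count * lanes_per_gpu


def cpu_has_enough_lanes(cpu_lanes: int, gpu_count: int) -> bool:
    """True if the CPU can give every GPU a full-width link.

    Ignores lanes consumed by NVMe drives and network cards, so leave margin.
    """
    return cpu_lanes >= lanes_needed(gpu_count)


# Example: four 32 GB cards, i.e. 128 GB of total VRAM.
low, high = system_ram_range_gb(4 * 32)
print(f"System RAM: {low:.0f} to {high:.0f} GB")
print(f"PCIe lanes needed for 4 GPUs: {lanes_needed(4)}")
print("128-lane server CPU is enough:", cpu_has_enough_lanes(128, 4))
print("24-lane desktop CPU is enough:", cpu_has_enough_lanes(24, 4))
```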

When self-hosting beats the cloud

The arithmetic is direct. A cloud subscription such as ChatGPT Team runs about $30 per user per month. For a 30-person team that is roughly $10,800 a year, every year, with your prompts on someone else's servers. An on-premise eRacks AILSA at $5,995 covers the same everyday inference on hardware you own, and pays for itself in under a year. After that, the ongoing cost is mostly power.

The framework: self-hosting wins when the annual cloud fees, multiplied by the years you will use the system, exceed the cost of the hardware plus power, or when any data simply cannot go to a third party. In practice that means roughly 5 to 10 regular users or more, or any privacy mandate.
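The framework above reduces to a break-even calculation. The sketch below uses the $30 per user per month and $5,995 figures from the text; the annual power cost is a parameter you would supply yourself, since this guide does not give one, and the function names are illustrative.

```python
from typing import Optional


def annual_cloud_cost(users: int, per_user_month: float = 30.0) -> float:
    """Yearly subscription spend for a team."""
    return users * per_user_month * 12


def breakeven_years(hardware_cost: float, users: int,
                    per_user_month: float = 30.0,
                    annual_power_cost: float = 0.0) -> Optional[float]:
    """Years until owned hardware costs less than the subscription.

    annual_power_cost is an assumption the caller provides.
    Returns None when self-hosting never breaks even on cost alone.
    """
    yearly_saving = annual_cloud_cost(users, per_user_month) - annual_power_cost
    if yearly_saving <= 0:
        return None
    return hardware_cost / yearly_saving


for team in (5, 10, 30):
    years = breakeven_years(5995, team)
    print(f"{team} users: ${annual_cloud_cost(team):,.0f}/year in cloud fees, "
          f"break-even in {years:.1f} years")
```

With power left at zero, 30 users break even in a little over half a year, and 5 users in a little over three years, which matches the "5 to 10 regular users" threshold for hardware kept in service for several years.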

The GPUs: VRAM without the NVIDIA tax

You do not need flagship NVIDIA silicon to run these models. You need VRAM.

Intel Arc Pro B50 16GB (low-profile, about $349 to $399): the value pick. Four of them give 64 GB for about $1,400 to $1,600 of GPU.
Intel Arc Pro B70 32GB (about $949): roughly half the per-gigabyte VRAM price of comparable NVIDIA professional cards. Four give 128 GB.
NVIDIA RTX PRO 4000 Blackwell SFF 24GB (configure-to-order): when you need the CUDA ecosystem and ECC memory in a small-form-factor, 70-watt card.
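To compare cards on the number that matters here, a short sketch can compute price per gigabyte and how many cards a target needs. It uses only the approximate Intel prices quoted above; no NVIDIA price is given in this guide, so that card is left out, and the helper names are illustrative.

```python
import math

# (name, VRAM in GB, low price, high price) from the approximate prices above.
CARDS = [
    ("Intel Arc Pro B50", 16, 349, 399),
    ("Intel Arc Pro B70", 32, 949, 949),
]


def price_per_gb(price: float, vram_gb: int) -> float:
    """Dollars per gigabyte of VRAM."""
    return price / vram_gb


def cards_needed(target_vram_gb: float, vram_per_card_gb: int) -> int:
    """Smallest number of identical cards whose combined VRAM meets the target."""
    return math.ceil(target_vram_gb / vram_per_card_gb)


target = 42  # GB, the Q4 figure for a 70B model
for name, vram, low, high in CARDS:
    n = cards_needed(target, vram)
    print(f"{name}: ${price_per_gb(low, vram):.2f} to "
          f"${price_per_gb(high, vram):.2f} per GB; "
          f"{n} cards ({n * vram} GB) for a {target} GB model")
```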

Sizing an eRacks server to your models

Server	GPU memory	Comfortably runs	From
AILSA (2U)	up to 96 GB (low-profile to flagship)	Llama 3.3 70B (Q4), Qwen 2.5 32B, several smaller models	$5,995
AIDAN (2U)	32 GB (1 Arc Pro B70)	32 to 34B models, 8B at full precision	$13,895
AINSLEY (4U)	128 GB (4 Arc Pro B70)	70B with room for long context	$21,995
AISHA (4U)	up to 256 GB (8 Arc Pro B70)	70B at Q8, or several models, multi-tenant	$30,995
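As a rough illustration, the table can be treated as a lookup: given a VRAM requirement, list the servers whose maximum GPU memory covers it. The figures are copied from the table; note that the "From" prices are starting prices and may not correspond to the maximum-memory configuration, so this sketch filters on memory only.

```python
# (server, maximum GPU memory in GB, "From" price in USD) copied from the table.
SERVERS = [
    ("AILSA", 96, 5995),
    ("AIDAN", 32, 13895),
    ("AINSLEY", 128, 21995),
    ("AISHA", 256, 30995),
]


def servers_that_fit(required_vram_gb: float) -> list[str]:
    """Names of servers whose maximum GPU memory meets the requirement."""
    return [name for name, max_vram, _ in SERVERS if max_vram >= required_vram_gb]


# Requirements taken from the VRAM table: 32B at Q4, 70B at Q4, 70B at Q8,
# and the 140 GB+ heavy or multi-tenant tier.
for label, need in [("32B Q4", 22), ("70B Q4", 42), ("70B Q8", 80), ("120B+ Q8", 140)]:
    print(f"{label} (~{need} GB): {', '.join(servers_that_fit(need)) or 'none'}")
```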

Every eRacks AI server ships with Ubuntu LTS and a complete open-source AI stack (Ollama, Open WebUI, vLLM, llama.cpp, PyTorch) pre-installed and tested. Staff reach the AI from a browser on day one. There are no per-seat or per-token fees, you own the hardware outright, and your data never leaves the building.

Bottom line

Start from the model, not the GPU. Decide the largest model you will run and at what quantization, size the VRAM (in GB, about the parameter count in billions times 0.6 for Q4), then add system RAM at 1.5 to 2 times that, and a server CPU with the lanes for your GPUs.
If privacy is the driver, you are already done. On-premise is the answer, and the only question is which size.
The entry price is lower than people expect. A 70B-class model, private, from $5,995.

eRacks will spec a private AI server to your exact models and user count at no charge, including the open-source software stack, burned in and tested. Configure an AI server → or email joe@eracks.com.
