Lessons from the Edge: Turning the M4 Mac Mini into a Local LLM Workhorse
Introduction
I promised to document my experience—here it is. Over the past month I ran a base-spec M4 Mac Mini (512 GB internal SSD, 30 GB unified memory*¹) with a 4 TB external SSD as both a desktop and an OpenAI-compatible inference server. Spoiler: it’s quiet, frugal, and strong enough for day-to-day private AI workloads.
*¹ Apple hasn’t published granular GPU limits; Activity Monitor shows ~22 GB addressable by the GPU. Your mileage may vary.
1 Why Local LLMs?
- Data sovereignty & GDPR – nothing leaves the building
- Cost certainty – no usage-based billing
- Air-gap option – disable outbound traffic entirely
Telemetry caveat: Ollama and LM Studio collect anonymous stats unless you disable them (`OLLAMA_DISABLE_TELEMETRY=1`; LM Studio → Settings → Privacy).
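If you script the server start-up, the opt-out can be baked into the process environment. A minimal sketch, assuming Ollama is installed and on your PATH (the variable is the one named above):

```python
import os
import subprocess

# Copy the current environment and add the Ollama telemetry opt-out.
env = os.environ.copy()
env["OLLAMA_DISABLE_TELEMETRY"] = "1"

# Start the Ollama server as a child process with the modified environment.
subprocess.Popen(["ollama", "serve"], env=env)
```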
2 Boot Options: Internal vs External macOS
I installed macOS 14.5 on a 4 TB USB4 SSD so I could hot-swap configurations.
- Pros: instant roll-back, sandboxed experiments
- Cons: macOS updates can block booting from the drive if Secure Boot is set to Full.
- Fix: Reboot → hold ⌘R → Startup Security Utility → Reduced Security + Allow booting from external media.
Booting internally avoids this hurdle entirely.
3 Toolchain
Runner | Why I Use It | Key Tweaks |
---|---|---|
Ollama | Fast CLI, good model library | Watch for mismatches when you change the context size, even with re-configured models. |
LM Studio | GUI + OpenAI API in one app, good model library | Edit `~/.cache/lm-studio/.internal/http-server-config.json` and set `"networkInterface": "0.0.0.0"` to bind to all NICs. |
llama.cpp | Lowest overhead, script-friendly | Compile with `make LLAMA_METAL=1 LLAMA_METAL_EMBED=1`. Logs are minimal by default; run with `-v` for more detail. |
For experimentation, testing, and both interactive and server use, LM Studio was my preferred choice.
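A quick way to confirm the LM Studio server answers from another machine is to list its models over the OpenAI-compatible API. A minimal sketch, assuming the default port 1234 and the `requests` package:

```python
import requests

# LM Studio's local server defaults to port 1234; adjust host/port to your setup.
BASE_URL = "http://brainbox.local:1234/v1"

# List the models the server currently exposes.
resp = requests.get(f"{BASE_URL}/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```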
4 Networking Cheat-Sheet
- Assign the Mac Mini (`brainbox.local`) a static IP (e.g., `192.168.0.42`).
- Add it on clients: `192.168.0.42 brainbox.local` in `/etc/hosts`.
- Bonjour mostly works within the same subnet, but a static mapping survives VLANs & VPNs; a quick reachability check is sketched below.
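To verify the mapping from a client, a small Python check (host and port are assumptions; 1234 is LM Studio’s default, Ollama listens on 11434):

```python
import socket

# Resolve the hostname (consults /etc/hosts first, then DNS/Bonjour) ...
HOST, PORT = "brainbox.local", 1234
addr = socket.gethostbyname(HOST)
print(f"{HOST} resolves to {addr}")

# ... and confirm the inference server's TCP port actually answers.
with socket.create_connection((addr, PORT), timeout=3):
    print(f"Port {PORT} is reachable")
```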
5 Performance Snapshot (measured with `llama-bench`)
Model | Quant. | VRAM (GB) | Tokens / s | Power (W) |
---|---|---|---|---|
Mistral-7B-Instruct | Q4_K_M | ≈ 10 | 54 | 28 |
Qwen-14B-Chat | Q5_K_S | ≈ 19 | 25 | 29 |
Llama-3-24B | Q4_K_M | ≈ 22 | 12 | 30 |

(Batch = 1, 4096 ctx, Metal backend)
Above 24B parameters, the context window or swap spills into host RAM and responsiveness drops sharply.
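For a rough cross-check without `llama-bench`, you can time a completion over the OpenAI-compatible API. A sketch under assumptions (host, port, model id, and the `usage` field depend on your server):

```python
import time
import requests

# Time one completion and derive a rough tokens/s figure (not a llama-bench replacement).
URL = "http://brainbox.local:1234/v1/chat/completions"
payload = {
    "model": "mistral-7b-instruct",  # placeholder: use the model id your server reports
    "messages": [{"role": "user", "content": "Explain unified memory in 200 words."}],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.time() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tok/s")
```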
6 System Characteristics
- Power: 6 W idle → 30 W sustained load
- Noise: ≤ 20 dB(A); the fan stays below 2000 RPM
- Footprint: 19 cm square—fits under a monitor stand
7 Typical In-House Use Cases
- Private chat sessions with AnythingLLM and RAG
- Virtual assistant “Agent Kim” replying to e-mails (an agentic system tasked via [local] e-mail and a watched folder)
- Batch embedding for a local semantic-search index (sketched below)
- Document conversion & watermarking
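For the batch-embedding case, a minimal sketch against the local `/v1/embeddings` endpoint (URL and embedding-model id are placeholders for whatever you have loaded):

```python
import requests

# Embed a small batch of documents via the local OpenAI-compatible endpoint.
URL = "http://brainbox.local:1234/v1/embeddings"
docs = ["First paragraph to index.", "Second paragraph to index."]

resp = requests.post(URL, json={"model": "nomic-embed-text", "input": docs}, timeout=60)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```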
8 CrewAI Configuration Example for Local Processing (most OpenAI-compatible clients work the same way)
```python
import os

# Point CrewAI's LiteLLM backend at the local LM Studio server.
os.environ["OPENAI_API_BASE"] = "http://brainbox:1234/v1"
os.environ["OPENAI_MODEL_NAME"] = "openai/qwen2.5-coder-7b-instruct"
os.environ["OPENAI_API_KEY"] = "lmstudio_placeholder"  # dummy; the local server ignores it
os.environ["LITELLM_PROVIDER"] = "openai-chat"
```
Swap `brainbox` for your Mini’s IP or hostname.
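With those variables set, a minimal crew can run entirely against the local endpoint. A sketch assuming a recent CrewAI release (field names may vary slightly between versions):

```python
from crewai import Agent, Task, Crew

# One agent, one task; all completions go to the locally hosted model configured above.
writer = Agent(
    role="Technical writer",
    goal="Summarize text clearly and concisely",
    backstory="You write short, precise summaries for engineers.",
)

task = Task(
    description="Summarize why local LLM inference helps with data sovereignty.",
    expected_output="A three-sentence summary.",
    agent=writer,
)

crew = Crew(agents=[writer], tasks=[task])
print(crew.kickoff())
```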
9 Takeaways
The base M4 Mac Mini isn’t a GPU monster, but for 7B-to-13B models it feels like a dedicated inference appliance, drawing less power than many laptop chargers. If your data can’t leave the premises (or you’re simply done paying per token), this little box earns a spot on the desk for under €1,500 in hardware.