Local AI Models for Development: Running LLMs on Your Mac

Adrian Saycon
March 2, 2026 · Updated March 4, 2026 · 3 min read

Every API call to GPT-4 or Claude costs money, has rate limits, and sends your code to someone else’s servers. For quick dev tasks — drafting commit messages, explaining regex, generating boilerplate — running a model locally makes more sense than you’d think.

The Hardware Reality

You need a Mac with Apple Silicon (M1 or later) and at least 16GB of unified memory. 8GB works for tiny models, but you’ll be limited to 3B-parameter models that produce mediocre output. With 32GB+ you can run 13B–34B models comfortably, which is where local models start to be genuinely useful.

The key metric is how much memory a model needs. A 7B model at Q4 quantization uses roughly 4-5GB. A 13B model needs ~8GB. Apple Silicon’s unified memory architecture means GPU and CPU share the same RAM, so your model gets the full pool.
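As a back-of-the-envelope check, you can estimate the footprint yourself from parameter count and bits per weight. This is my own rough sketch (the ~20% overhead factor for the KV cache and runtime buffers is an assumption; real usage varies with context length):

```python
def model_memory_gb(params_billions: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough estimate of a quantized model's RAM footprint in GB.

    overhead (~20%) approximates the KV cache and runtime buffers on top
    of the raw weights; actual usage depends on context length.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Q4_K_M averages roughly 4.5 bits per weight:
print(model_memory_gb(7, 4.5))    # a 7B model lands in the 4-5GB range
print(model_memory_gb(13, 4.5))   # a 13B model lands around 8-9GB
```

The numbers line up with the figures above, which is a handy sanity check before pulling a multi-gigabyte download.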

Ollama: The Easiest Starting Point

Install Ollama and you can have a local model running in under two minutes:

brew install ollama
ollama serve &
ollama pull llama3.2
ollama run llama3.2

That gives you Meta’s Llama 3.2 (the 3B variant by default; a smaller 1B variant is also available). For coding tasks, I recommend pulling a code-specialized model:

ollama pull deepseek-coder-v2:16b
ollama pull codellama:13b

Ollama runs a local API server on port 11434, so you can integrate it into scripts:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:16b",
  "prompt": "Write a TypeScript function that debounces any callback",
  "stream": false
}' | jq -r '.response'
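With `"stream": true` (the default), Ollama instead returns one JSON object per line, each carrying a fragment of the response. Here is a hypothetical stdlib-only helper for stitching that stream together in a script — a sketch assuming the Ollama server is running on its default port, not an official client:

```python
import json
import urllib.request

def collect_stream(lines) -> str:
    """Join the 'response' fragments from Ollama's streaming NDJSON output."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final object signals the end of the stream
            break
    return "".join(parts)

def generate(prompt: str, model: str = "deepseek-coder-v2:16b") -> str:
    """Stream a completion from the local Ollama server at port 11434."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)
```

Streaming matters for interactive use: you can print fragments as they arrive instead of waiting several seconds for the full completion.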

LM Studio: When You Want a GUI

LM Studio gives you a ChatGPT-like interface for local models. Download it from lmstudio.ai, browse their model catalog, and download whatever catches your eye. It handles quantization formats (GGUF) and lets you adjust parameters like temperature and context length with sliders.

What sets LM Studio apart is its built-in OpenAI-compatible API server. Start the server, and any tool that works with the OpenAI API works with your local model — just point it to http://localhost:1234/v1.
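For example, you can talk to LM Studio's server with nothing but the standard library. This is a sketch under two assumptions: the server is running on its default port 1234, and the model name is a placeholder (LM Studio answers with whichever model is currently loaded):

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1"  # LM Studio's default server address

def build_chat_request(prompt: str, model: str = "local-model") -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request for the local server."""
    body = json.dumps({
        "model": model,  # placeholder; the loaded model responds regardless
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    return urllib.request.Request(
        f"{LMSTUDIO_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send a prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the request shape is plain OpenAI chat-completions JSON, swapping between the local server and a cloud provider is just a matter of changing the base URL.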

llama.cpp: Maximum Control

If you want to fine-tune performance or integrate into a C/C++ project, llama.cpp is the engine that powers both Ollama and LM Studio under the hood. Building from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1
./llama-cli -m models/codellama-13b.Q4_K_M.gguf \
  -p 'Explain this regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' \
  -n 512

The LLAMA_METAL=1 flag enables Apple GPU acceleration, which doubles or triples token generation speed on M-series chips.

Model Recommendations for Dev Tasks

  • Code generation/review: DeepSeek Coder V2 16B — best coding model that fits in 32GB
  • General questions: Llama 3.1 8B — fast, good at explanations
  • Commit messages/docs: Mistral 7B — small, quick, good at following instructions
  • Complex reasoning: Qwen 2.5 32B — needs 64GB RAM but impressively capable

Practical Dev Workflow Integration

I keep a shell alias that pipes code to a local model for quick review:

# ~/.zshrc
function ai-review() {
  local file=$1
  # Build the JSON payload with jq so quotes and newlines in the
  # source file are escaped properly instead of breaking the request
  jq -n --rawfile content "$file" '{
    model: "deepseek-coder-v2:16b",
    prompt: ("Review this code for bugs and improvements:\n\n" + $content),
    stream: false
  }' | curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
}

# Usage: ai-review src/utils/parser.ts

What Local Models Can’t Do (Yet)

Don’t expect GPT-4 or Claude-level reasoning from a 13B model. They struggle with multi-step architectural decisions, nuanced code refactoring across files, and maintaining context over long conversations. Use them for focused, single-purpose tasks. The moment you need deep reasoning, switch to a cloud model.

That said, for the 80% of dev tasks that are mechanical — formatting, simple generation, quick lookups — local models save real money and keep your code private.

Written by Adrian Saycon

A developer with a passion for emerging technologies, Adrian Saycon focuses on transforming the latest tech trends into great, functional products.
