My Local AI Setup on the M4 Pro: A Journey with Qwen 3.6
How I worked through the Qwen model family and Mac memory limits to see what's possible with local AI today.
I recently got my hands on a new MacBook Pro with the M4 Pro chip and 48GB of unified memory. Naturally, my first goal was to see if I could build a powerful local AI setup.
It’s not that I plan to abandon cloud APIs completely—tools like Gemini via API (pay-as-you-go) still have their place in my workflow. But I wanted a true, capable alternative. I’m leaning toward a hybrid approach, using cloud APIs for heavy orchestration or vision tasks, and relying on a fast, private, local setup for the 90% of my day spent writing and refactoring Go code.
I didn't start as an expert in local models. I started with a wish list and my trusty Gemini CLI agent. This is the story of how my AI assistant and I navigated the confusing world of model versions and memory limits to find a setup that actually feels like the future of coding.
Starting with the "Safe" Recommendations
When I first set out, the initial advice my Gemini CLI gave me was to stick with the Qwen 2.5 family. It’s a very stable, well-regarded series. The suggestion was to run a Qwen 2.5 Chat model for reasoning, the Qwen 2.5 Coder for autocomplete, and nomic-embed-text for handling project context (embeddings).
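For anyone following along, pulling that baseline stack is a single Ollama command per model. The size tag on the chat model below is just an illustrative pick (my agent didn't prescribe one); grab whatever fits your memory:

```bash
# The "safe" starting stack (size tag on the chat model is illustrative)
ollama pull qwen2.5:14b          # general chat / reasoning
ollama pull qwen2.5-coder:7b     # fill-in-the-middle autocomplete
ollama pull nomic-embed-text     # embeddings for project context
```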
It sounded like a solid plan. But while I was looking through the Ollama library, I noticed something: there were newer versions. Qwen 3.0, 3.5, and even 3.6 were listed.
I started pushing back and asking my AI assistant questions. Why settle for 2.5 if 3.6 is out? We started a long chat session that led me down a rabbit hole of understanding the differences between these generations. My agent explained that 3.0 introduced a "Thinking" mode, 3.5 brought native multimodality, and 3.6 was the absolute latest, optimized for "agentic" tasks like the ones I wanted to do in my IDE.
The 48GB Memory Trap
Through our research, I quickly learned that even with 48GB of RAM, you can't just run the biggest version of everything. Unified memory is shared between the CPU, the GPU, and everything else running on the machine. If you don't budget it, your Mac starts swapping to the SSD, and suddenly your "fast" local AI feels like a dial-up modem.
My agent helped me figure out a realistic budget:
- macOS + VS Code: ~10 GB
- Docker (where my DB and Go app live): 4 GB
- The "Brain" (Chat model): Needs to be smart, so around 17GB.
- The "Fingers" (Autocomplete): Needs to be fast, maybe 5GB.
That adds up to roughly 36 GB, which left me a safe buffer of around 12 GB for browser tabs and the LLM's own "memory" (the KV cache).
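Once things are loaded, you can sanity-check the budget instead of guessing. `ollama ps` lists what's resident and how much memory each model is actually taking, and Ollama reads an OLLAMA_KEEP_ALIVE environment variable that controls how long an idle model stays loaded (the 30m value below is just my preference, not a requirement):

```bash
# Show which models are currently loaded and how much memory they occupy
ollama ps

# Keep idle models in memory longer (or shorter) than the default
OLLAMA_KEEP_ALIVE=30m ollama serve
```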
Settling on the Stack
After testing and chatting through the options with Gemini, I landed on a split-generation setup:
- For Chat (Reasoning): qwen3.6:27b. This is the latest flagship. It’s smart enough to understand my Go architecture and it can even "see" images if I need it to.
- For Autocomplete: qwen2.5-coder:7b. Even though there are newer general models, the 2.5 Coder is a specialist: it's trained for fill-in-the-middle (FIM) completion, so it's better at slotting code into an existing block than the newer generalists. The pull commands for both models are just below.
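If you want to mirror the setup, the pulls look like this (the 3.6 tag is exactly as it appeared in the Ollama library listing I was browsing; it may change over time):

```bash
# Split-generation stack: newest flagship for chat, 2.5 Coder for autocomplete
ollama pull qwen3.6:27b
ollama pull qwen2.5-coder:7b
```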
The Friction: When "Smart" is Too Slow
I set everything up in VS Code using two different tools:
- Continue: I use this for the simple "Tab" autocomplete. It uses the Qwen 2.5 Coder 7B and feels very light; the config sketch below the list shows how it's wired up.
- Cline: This is my "Agent." I use it for the heavy lifting—like asking it to refactor a whole package or find a bug across multiple files.
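For reference, pointing Continue's tab autocomplete at the local Coder model is only a few lines of config. The snippet below matches the older config.json layout; Continue's config format has changed across versions, so treat it as a sketch rather than the exact schema for whatever version you're running:

```json
{
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```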
But I hit a wall. When I used the standard qwen3.6:27b in Cline, it was smart, but it was slow. It was generating code at about 20 tokens per second. For a chat, that's fine. But when an agent is rewriting 200 lines of Go code, you find yourself staring at the screen, waiting.
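If you want to measure this on your own machine rather than eyeballing it, Ollama prints throughput stats when you start a session with --verbose; the "eval rate" line after each response is the generation speed in tokens per second:

```bash
# Start an interactive session that prints timing stats after each response
ollama run qwen3.6:27b --verbose
```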
The Pivot to NVFP4
I went back to Gemini for more research and found a specialized version of the model: qwen3.6:27b-coding-nvfp4.
My agent explained that this model is a bit special for Macs. It uses a format called NVFP4 that Ollama translates into Apple's native MLX framework. I’m no expert on the math behind it, but the result was immediate: because it talks more directly to the M4's GPU, the speed nearly doubled.
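Swapping it in was just another pull, plus an rm to reclaim the disk space from the general model (keeping both around is fine too if you have the room):

```bash
# Swap in the NVFP4 coding variant and drop the general 27B to save disk space
ollama pull qwen3.6:27b-coding-nvfp4
ollama rm qwen3.6:27b
```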
I went from watching the AI type to seeing the code almost "appear" on the screen. I did have to sacrifice the vision (multimodality) to get this speed, but again, that's where the hybrid approach shines. For the vast majority of my day spent writing code, the speed of the NVFP4 model is much more valuable. When I need an AI to "see" a diagram or coordinate a massive architectural review, I fall back to my cloud APIs.
The Result
My setup now feels perfectly dialed in. By pushing back on the "safe" suggestions and doing the research with my AI assistant, I found a way to use the newest models without melting my machine.
- Brain: Qwen 3.6 27B NVFP4 (Local & Fast)
- Autocomplete: Qwen 2.5 Coder 7B (Local & Precise)
- Tools: Continue for the small stuff, Cline for the big agentic tasks.
If the journey taught me one thing, it's that speed is often just as important as intelligence when you are trying to stay in the "flow state" of coding.
References & Resources
- Ollama: The engine running my local models.
- Qwen 3.6 Models: The latest flagship series from Alibaba Cloud.
- Cline (formerly Claude Dev): The autonomous agent extension for VS Code.
- Continue: The IDE extension for FIM autocomplete and chat.
- MLX Framework: Apple’s framework for high-performance machine learning on Silicon.
- Gemini CLI: My cloud-based orchestration agent that helped me research this setup.