Local LLM Setup
Robota works with any OpenAI-compatible local inference server. This guide covers the two most popular options — Ollama and LM Studio — plus any llama.cpp server or custom endpoint.
No API key required. Local models run entirely on your machine. Your code, prompts, and conversation history never leave your device.
Quick Start
# 1. Start your local model server (see below)
# 2. Configure Robota to use it
robota --configure
# Select "Gemma / LM Studio" and enter your server URL:
# Ollama: http://localhost:11434/v1
# LM Studio: http://localhost:1234/v1Option 1: Ollama
Ollama is the easiest way to run local models. It manages downloads and serves an OpenAI-compatible API automatically.
Install and start
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (examples)
ollama pull llama3.2 # 3B — fast, good for most tasks
ollama pull codellama # Optimized for code
ollama pull qwen2.5-coder # Strong at TypeScript/Python
ollama pull mistral # Good general purpose
# Verify the server is running
curl http://localhost:11434/v1/modelsOllama starts automatically and listens on http://localhost:11434.
Configure Robota
robota --configureWhen prompted:
- Provider: Select
Gemma / LM Studio(uses OpenAI-compatible API) - Base URL:
http://localhost:11434/v1 - Model: Enter the model name exactly as in
ollama list(e.g.llama3.2,codellama) - API key: Enter any value (e.g.
ollama) — Ollama does not validate it
Or set via environment variables:
ROBOTA_PROVIDER=gemma \
ROBOTA_BASE_URL=http://localhost:11434/v1 \
ROBOTA_MODEL=llama3.2 \
ROBOTA_API_KEY=ollama \
robotaRecommended models for coding
| Model | Size | Best for |
|---|---|---|
qwen2.5-coder:7b | 7B | TypeScript, Python, code review |
codellama:13b | 13B | General code generation |
llama3.2:3b | 3B | Fast responses, simple tasks |
deepseek-coder-v2:16b | 16B | Complex reasoning, refactoring |
Larger models produce better results but require more RAM and run slower.
Option 2: LM Studio
LM Studio provides a GUI for downloading and running models, with a built-in local API server.
Install and start
- Download LM Studio from lmstudio.ai
- In the Discover tab, search for and download a model (e.g.
Gemma 3,Llama 3.2,Qwen 2.5 Coder) - In the Local Server tab, click Start Server
The server runs on http://localhost:1234 by default.
Configure Robota
robota --configureWhen prompted:
- Provider: Select
Gemma / LM Studio - Base URL:
http://localhost:1234/v1 - Model: Enter the model name exactly as shown in LM Studio's Local Server tab
- API key:
lm-studio(placeholder — LM Studio does not validate it)
Option 3: llama.cpp Server
If you compile and run llama.cpp directly:
# Start llama.cpp server
./llama-server -m models/your-model.gguf --port 8080
# Configure Robota
robota --configure
# Base URL: http://localhost:8080/v1
# API key: any value (e.g. "local")Troubleshooting
"Connection refused" or "Network error"
- Verify the server is running:
curl http://localhost:11434/v1/models - Check the port number matches your configuration
- On some systems, the server may bind to
127.0.0.1only — tryhttp://127.0.0.1:11434/v1
Model not responding
- Ensure the model name in Robota's config exactly matches the model loaded in your server
- For Ollama, run
ollama listto see available model names - For LM Studio, the model name is shown in the Local Server tab header
Slow responses
- Local models are slower than cloud APIs, especially on CPU
- Use smaller quantized models (e.g.
q4_k_mvariants) for faster inference - Enable GPU acceleration in Ollama or LM Studio settings if available
Tool calling not working
Some local models do not support the OpenAI function-calling format. If tool calls fail silently:
- Try a model known to support tools:
qwen2.5-coder,llama3.2,mistral-nemo - In Robota, run
/mode planto switch to plan mode, which requires fewer tool calls - Check the model's documentation for tool-call support
Tips for Local Model Usage
- Context windows are smaller. Most local models have 4K–32K token context vs 200K+ for cloud models. Use
/compactwhen the context fills up. - Response quality varies. Code quality from 7B models is good for straightforward tasks; use 13B+ for complex refactoring.
- No rate limits. You can run as many sessions as your hardware supports.
- Offline capable. Once the model is downloaded, no internet connection is required.