# Providers
Mirror Mate uses external providers for LLM (large language model), TTS (text-to-speech), STT (speech-to-text), VLM (vision language model), and Embedding (vector generation). Providers are configured in `config/providers.yaml`.
## Configuration
```yaml
providers:
  llm:
    enabled: true
    provider: ollama # openai or ollama
    # ...
  tts:
    enabled: true
    provider: voicevox # openai or voicevox
    # ...
  stt:
    enabled: true
    provider: web # openai, local, or web
    # ...
  vlm:
    enabled: true
    provider: ollama
    # ...
  embedding:
    enabled: true
    provider: ollama
    # ...
  memory:
    enabled: true
    # ...
```

## LLM Providers
| Provider | Description | API Key Required |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | Yes |
| Ollama | Local LLM hosting | No |
### OpenAI
1. Get an API key from OpenAI
2. Add it to `.env`:

   ```
   OPENAI_API_KEY=sk-your-api-key-here
   ```

3. Configure in `providers.yaml`:

   ```yaml
   providers:
     llm:
       enabled: true
       provider: openai
       openai:
         model: gpt-4o-mini # or gpt-4o
         maxTokens: 300
         temperature: 0.7
   ```
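Before launching Mirror Mate, you can sanity-check the key with a direct call to OpenAI's models endpoint (a quick standalone check, not part of Mirror Mate itself):

```bash
# Returns a JSON list of available models if the key is valid;
# an "invalid_api_key" error means the .env value is wrong
curl -s https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```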
#### Models

| Model | Description | Speed | Cost |
|---|---|---|---|
| `gpt-4o` | Most capable | Medium | Higher |
| `gpt-4o-mini` | Fast and efficient | Fast | Lower |
### Ollama
Ollama allows running LLMs locally without API costs.
1. Install Ollama:

   ```bash
   # macOS
   brew install ollama

   # Linux
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. Start the Ollama server:

   ```bash
   ollama serve
   ```

3. Pull a model:

   ```bash
   ollama pull gpt-oss:20b
   ```

4. Configure in `providers.yaml`:

   ```yaml
   providers:
     llm:
       enabled: true
       provider: ollama
       ollama:
         model: "gpt-oss:20b"
         baseUrl: "http://localhost:11434"
         maxTokens: 300
         temperature: 0.7
   ```
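Once the server is up, you can confirm the model is installed and responding through Ollama's REST API:

```bash
# List installed models; "gpt-oss:20b" should appear in the output
curl -s http://localhost:11434/api/tags

# One-off generation test (non-streaming)
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:20b", "prompt": "こんにちは", "stream": false}'
```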
#### Recommended Models for Japanese

| Model | Size | Japanese Quality | Tool Calling | Speed |
|---|---|---|---|---|
| `gpt-oss:20b` | 20B | Excellent | Native | Medium |
| `qwen2.5:14b` | 14B | Very Good | Yes | Medium |
| `qwen2.5:32b` | 32B | Excellent | Yes | Slow |
### LLM Options

| Option | Type | Description | Default |
|---|---|---|---|
| `provider` | string | `openai` or `ollama` | `openai` |
| `model` | string | Model name/ID | varies |
| `maxTokens` | number | Maximum response length | `300` |
| `temperature` | number | Creativity (0.0-1.0) | `0.7` |
| `baseUrl` | string | API endpoint (Ollama only) | `http://localhost:11434` |
## TTS Providers
| Provider | Description | API Key Required |
|---|---|---|
| OpenAI | OpenAI TTS API | Yes |
| VOICEVOX | Free, local, Japanese voices | No |
### OpenAI TTS

```yaml
providers:
  tts:
    enabled: true
    provider: openai
    openai:
      voice: shimmer # alloy, echo, fable, onyx, nova, shimmer
      model: tts-1 # tts-1 or tts-1-hd
      speed: 0.95
```
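To preview a voice before committing to it, you can call the speech endpoint directly (writes `sample.mp3`; the input text is arbitrary):

```bash
# Synthesize a short phrase with the configured voice
curl -s https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "shimmer", "input": "Hello from Mirror Mate"}' \
  --output sample.mp3
```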
#### Voices

| Voice | Description |
|---|---|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `fable` | Expressive, narrative |
| `onyx` | Deep, authoritative |
| `nova` | Friendly, upbeat |
| `shimmer` | Clear, gentle (default) |
### VOICEVOX

1. Download and install VOICEVOX from voicevox.hiroshiba.jp
2. Start VOICEVOX (the engine listens on port 50021 by default)
3. Configure:

   ```yaml
   providers:
     tts:
       enabled: true
       provider: voicevox
       voicevox:
         speaker: 3 # Speaker ID
         baseUrl: "http://localhost:50021"
   ```
#### Speaker IDs (Common)

| ID | Character |
|---|---|
| 0 | 四国めたん (Shikoku Metan, sweet) |
| 1 | ずんだもん (Zundamon, sweet) |
| 2 | 四国めたん (Shikoku Metan, normal) |
| 3 | ずんだもん (Zundamon, normal) |
| 8 | 春日部つむぎ (Kasukabe Tsumugi) |
| 9 | 波音リツ (Namine Ritsu) |
## STT Providers (Speech-to-Text)
STT providers enable speech recognition for voice input. Mirror Mate supports multiple providers with automatic silence detection.
Note: STT language settings can be configured automatically from your app locale using Locale Presets. When you change the locale (e.g., `ja` to `en`), the STT language is updated to match.
| Provider | Description | API Key Required | Accuracy |
|---|---|---|---|
| Web Speech API | Browser native (Chrome/Edge) | No | Good |
| OpenAI Whisper | Cloud API | Yes | Excellent |
| Local Whisper | Self-hosted (faster-whisper) | No | Excellent |
### Web Speech API (Default)
Uses the browser's built-in speech recognition. Best for quick setup with no additional configuration.
```yaml
providers:
  stt:
    enabled: true
    provider: web
    web:
      language: ja-JP # BCP 47 language tag
```

Pros: Zero cost, instant setup, real-time interim results.
Cons: Browser-dependent quality; requires Chrome or Edge.
### OpenAI Whisper
High-accuracy speech recognition using OpenAI's Whisper API.
```yaml
providers:
  stt:
    enabled: true
    provider: openai
    openai:
      model: whisper-1
      language: ja # ISO 639-1 code (or omit for auto-detect)
      temperature: 0
```

Pros: Excellent accuracy (especially for Japanese), 99+ languages.
Cons: API cost ($0.006/minute); requires an internet connection.
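You can test the endpoint directly with any short audio clip (here a hypothetical `sample.wav`):

```bash
# Transcribe a local clip with the Whisper API
curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@sample.wav \
  -F model=whisper-1 \
  -F language=ja
```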
### Local Whisper (faster-whisper)
Self-hosted Whisper for privacy and cost savings, using faster-whisper-server with an OpenAI-compatible API.
```yaml
providers:
  stt:
    enabled: true
    provider: local
    local:
      baseUrl: "http://studio:8080" # Your whisper server
      model: large-v3 # tiny, base, small, medium, large-v3
      language: ja
```

#### Setup with Docker

```bash
# On Mac Studio (or any server)
docker compose -f compose.studio.yaml up -d faster-whisper
```

See the Docker Documentation for details.
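Because the server is OpenAI-compatible, the same transcription call works against it; only the base URL changes (hostname `studio` comes from the config above, and exact model identifiers can vary by server build):

```bash
# Same OpenAI-style call, pointed at the self-hosted server (no API key needed)
curl -s http://studio:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=large-v3 \
  -F language=ja
```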
#### Models

| Model | Parameters | Accuracy | Speed (30s audio) |
|---|---|---|---|
| `tiny` | 39M | Low | ~2s |
| `base` | 74M | Medium | ~4s |
| `small` | 244M | Good | ~8s |
| `medium` | 769M | Very Good | ~12s |
| `large-v3` | 1.5B | Excellent | ~15s |

Speed measured on Apple M1/M2 Ultra (CPU mode).
### Silence Detection
All STT providers support automatic silence detection to determine when the user has finished speaking.
```yaml
providers:
  stt:
    silenceDetection:
      silenceThreshold: 1.5 # Seconds of silence before sending
      volumeThreshold: 0.02 # RMS volume threshold (0-1)
      minRecordingDuration: 500 # Minimum recording time (ms)
      maxRecordingDuration: 60000 # Maximum recording time (ms)
```

| Option | Type | Description | Default |
|---|---|---|---|
| `silenceThreshold` | number | Seconds of silence before auto-send | `1.5` |
| `volumeThreshold` | number | RMS volume below which audio counts as silence | `0.02` |
| `minRecordingDuration` | number | Minimum time before silence detection starts (ms) | `500` |
| `maxRecordingDuration` | number | Maximum recording duration (ms) | `60000` |
### STT Options Summary

| Option | Type | Description | Default |
|---|---|---|---|
| `provider` | string | `web`, `openai`, or `local` | `web` |
| `openai.model` | string | Whisper model | `whisper-1` |
| `openai.language` | string | Language code (ISO 639-1) | auto |
| `local.baseUrl` | string | Whisper server URL | `http://localhost:8080` |
| `local.model` | string | Model name | `base` |
| `local.language` | string | Language code | auto |
## VLM Providers (Vision Language Model)
VLM providers enable visual understanding through the `see_camera` tool.
| Provider | Description | API Key Required |
|---|---|---|
| Ollama | Local vision models (llava, moondream) | No |
### Ollama VLM
```yaml
providers:
  vlm:
    enabled: true
    provider: ollama
    ollama:
      model: llava:7b # or moondream, granite3.2-vision
      baseUrl: "http://localhost:11434"
```

#### Recommended Vision Models

| Model | Size | Description | Speed |
|---|---|---|---|
| `moondream` | 1.8B | Lightweight, edge-friendly | Fast |
| `llava:7b` | 7B | Good balance of quality and speed | Medium |
| `granite3.2-vision` | 2B | Document understanding | Medium |
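To confirm the vision model works before Mirror Mate routes `see_camera` calls to it, you can send a test image straight to Ollama (`photo.jpg` is any local image):

```bash
# Ask the vision model about a local image via Ollama's generate endpoint
# (base64 flag differs by OS: macOS uses -i, Linux uses -w0)
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llava:7b\",
  \"prompt\": \"What is in this picture?\",
  \"stream\": false,
  \"images\": [\"$(base64 -i photo.jpg)\"]
}"
```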
### Usage

When VLM is enabled and the user asks a visual question, the LLM calls the `see_camera` tool:

```
User: "何を持ってるかわかる?" (Can you tell what I'm holding?)
AI: [calls see_camera tool]
AI: "スマートフォンを持っていますね!" (You're holding a smartphone!)
```

## Embedding Providers
Embedding providers generate vector representations of text for semantic search.
| Provider | Description | API Key Required |
|---|---|---|
| Ollama | Local embedding models | No |
### Ollama Embedding
```yaml
providers:
  embedding:
    enabled: true
    provider: ollama # PLaMo server provides an Ollama-compatible API
    ollama:
      model: plamo-embedding-1b
      baseUrl: "http://studio:8000" # PLaMo embedding server
```
#### Recommended Embedding Models

| Model | Dimensions | Description |
|---|---|---|
| `plamo-embedding-1b` | 2048 | Japanese-optimized, top JMTEB scores (recommended) |
| `bge-m3` | 1024 | Multilingual, good quality (alternative) |
| `nomic-embed-text` | 768 | Fast, English-focused |
#### Setup

**Option 1: PLaMo-Embedding-1B (recommended for Japanese)**

PLaMo-Embedding-1B provides superior Japanese text embedding. See Recommended Setup for full instructions.

```bash
# On Mac Studio
docker compose -f compose.studio.yaml up -d
```

**Option 2: Ollama with bge-m3 (alternative)**

```bash
ollama serve
ollama pull bge-m3
```

```yaml
providers:
  embedding:
    enabled: true
    provider: ollama
    ollama:
      model: bge-m3
      baseUrl: "http://localhost:11434"
```
## Memory Configuration

The memory system enables persistent user context through RAG (Retrieval-Augmented Generation).
```yaml
providers:
  memory:
    enabled: true
    # RAG settings
    rag:
      topK: 8 # Max memories to retrieve
      threshold: 0.3 # Minimum similarity score (0.0-1.0)
    # Memory extraction settings
    extraction:
      autoExtract: true # Auto-extract from conversations
      minConfidence: 0.5 # Minimum confidence for extraction
```
### Options

| Option | Type | Description | Default |
|---|---|---|---|
| `enabled` | boolean | Enable the memory system | `true` |
| `rag.topK` | number | Max memories to retrieve per query | `8` |
| `rag.threshold` | number | Similarity threshold (0.0-1.0) | `0.3` |
| `extraction.autoExtract` | boolean | Auto-extract memories from conversations | `true` |
| `extraction.minConfidence` | number | Minimum confidence for extraction | `0.5` |
### Memory Types

| Type | Description |
|---|---|
| `profile` | User preferences, traits, persistent info |
| `episode` | Recent interactions and events |
| `knowledge` | Facts and learned information |

See the Memory Documentation for details.
## Remote Server Configuration
Recommended setup: Run heavy services (Ollama, VOICEVOX, PLaMo) on a powerful server (e.g., Mac Studio) and connect via Tailscale:
```yaml
# config/providers.yaml
providers:
  llm:
    provider: ollama
    ollama:
      model: "gpt-oss:20b"
      baseUrl: "http://studio:11434" # Tailscale hostname
  tts:
    provider: voicevox
    voicevox:
      speaker: 3
      baseUrl: "http://studio:50021" # Tailscale hostname
  embedding:
    enabled: true
    provider: ollama # PLaMo server provides an Ollama-compatible API
    ollama:
      model: plamo-embedding-1b
      baseUrl: "http://studio:8000" # PLaMo embedding server
  memory:
    enabled: true
    rag:
      topK: 8
      threshold: 0.3
    extraction:
      autoExtract: true
      minConfidence: 0.5
```

See the Docker Documentation for details.
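Before pointing Mirror Mate at the remote host, it helps to confirm each service is reachable over Tailscale (the PLaMo check again presumes Ollama compatibility):

```bash
# Quick reachability checks from the Mirror Mate machine
curl -s http://studio:11434/api/tags   # Ollama: lists installed models
curl -s http://studio:50021/speakers   # VOICEVOX: lists speakers
curl -s http://studio:8000/api/tags    # PLaMo server (assumed Ollama-compatible)
```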
