YMCA (Yaacov's MCp Agent) is a local-first AI agent framework designed to run entirely on your own hardware. It leverages the Model Context Protocol (MCP) to connect to various tool servers while using small language models (SLMs) with aggressive quantization for efficient local inference. This approach ensures privacy, reduces costs, and eliminates dependencies on cloud services.
YMCA combines the capability of small language models (SLMs) with the flexibility of MCP servers, all while maintaining a small footprint through quantized models. The framework uses llama.cpp for inference and includes built-in memory capabilities, allowing the agent to maintain context across conversations using semantic search and vector embeddings.
Run small AI models directly on your hardware without sending data to external servers. YMCA uses llama.cpp for fast, efficient inference on both CPU and GPU, supporting a wide range of quantized model formats.
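As a minimal sketch of what this looks like at the library level, using llama-cpp-python (the llama.cpp Python bindings that YMCA's installation uses); the model path and parameters below are illustrative, not YMCA's internal API:
# Minimal local inference via llama-cpp-python.
# Assumes ./models/model.gguf exists; any quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",
    n_ctx=4096,        # context window size in tokens
    n_gpu_layers=-1,   # offload all layers to GPU if a GPU build is installed (0 = CPU only)
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain MCP in one sentence."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])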
Connect to Model Context Protocol servers to extend your agent's capabilities. MCP provides a standardized way to integrate tools, data sources, and services, allowing your agent to interact with databases, APIs, file systems, and more.
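To make this concrete, here is a hypothetical MCP tool server written with the official mcp Python SDK's FastMCP helper; the server name and tool are illustrative examples, not YMCA components:
# A minimal MCP server exposing one tool over stdio, using the official
# `mcp` Python SDK. Tool name and logic are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an MCP client can spawn this process
An agent could then attach it using the --mcp-server "name:command" syntax shown later in this document, e.g. ymca-chat --mcp-server "demo:python server.py".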
Support for heavily quantized models (4-bit, 5-bit, 8-bit) enables running capable models on modest hardware. This makes it possible to deploy AI agents on laptops, workstations, or edge devices without requiring expensive GPU infrastructure.
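As a rough rule of thumb, weight memory is parameter count × bits per weight ÷ 8. The effective bits-per-weight figures in this sketch are approximations for common GGUF quantizations, and real usage adds KV-cache and runtime overhead on top:
# Back-of-envelope weight-memory estimate for quantized models.
# Bits-per-weight values are approximate; actual GGUF files vary slightly.
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for quant, bpw in [("q4_k_m", 4.8), ("q5_k_m", 5.7), ("q8_0", 8.5)]:
    print(f"7B model at {quant}: ~{approx_weight_gb(7.0, bpw):.1f} GB")
# -> q4_k_m ~4.2 GB, q5_k_m ~5.0 GB, q8_0 ~7.4 GB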
Built-in memory system using ChromaDB for vector storage and sentence transformers for embeddings. The agent can remember and recall information from past conversations, creating a persistent context that improves over time.
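A minimal sketch of this pattern with ChromaDB's default sentence-transformers embedding function (the collection name, storage path, and stored text are illustrative, not YMCA's actual schema):
# Store and recall memories by semantic similarity with ChromaDB.
import chromadb

client = chromadb.PersistentClient(path="./memory_db")
memories = client.get_or_create_collection("conversation_memory")

# Adding a document embeds it automatically with the default model.
memories.add(
    ids=["mem-001"],
    documents=["The user prefers q4_k_m quantized models on Apple Silicon."],
)

# Recall works by meaning, not exact keywords.
hits = memories.query(query_texts=["Which quantization does the user like?"], n_results=1)
print(hits["documents"][0][0])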
All processing happens locally. Your conversations, data, and model interactions never leave your machine, ensuring complete privacy and data sovereignty.
YMCA implements selective context loading (also known as tool selection) to optimize context window usage. Instead of loading all available tools into the model's context, the system uses semantic search to identify and provide only the most relevant tools for each query. This technique significantly reduces the token count in system prompts, enabling more efficient use of limited context windows.
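A simplified sketch of the idea using sentence-transformers directly (the embedding model, tool descriptions, and top-k value are assumptions for illustration; YMCA's internals may differ):
# Rank available tools by semantic similarity to the query and keep only
# the top-k, so the system prompt carries a handful of tool schemas
# instead of all of them.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

tools = {
    "kubectl_get_pods": "List Kubernetes pods in a namespace",
    "read_file": "Read the contents of a file from disk",
    "query_database": "Run a SQL query against a database",
}
tool_vecs = model.encode(list(tools.values()))

def select_tools(query: str, top_k: int = 2) -> list[str]:
    scores = util.cos_sim(model.encode([query]), tool_vecs)[0]
    ranked = sorted(zip(tools, scores), key=lambda t: float(t[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

print(select_tools("show me the pods in the default namespace"))
# -> ["kubectl_get_pods", ...]; only these tools' schemas enter the prompt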
An automatic post-processing step improves response clarity and formatting while preserving technical accuracy: responses are polished for better readability, structure, and presentation. It is enabled by default (disable it with the --no-refine-answers flag, or by unchecking the corresponding checkbox in the web UI).
It's recommended to use a virtual environment to avoid conflicts with system packages:
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
With the virtual environment activated:
# Basic installation
pip install -e .
# Or with development tools (testing, linting)
pip install -e ".[dev]"
For best performance, install llama-cpp-python with hardware-specific acceleration. The basic installation builds for CPU only, but you can enable GPU acceleration and other optimizations by setting CMAKE_ARGS before installation:
Apple Silicon (M1, M2, M3) - Metal Acceleration:
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dirNVIDIA GPUs - CUDA Acceleration:
# Build from source with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
# Check your CUDA version first:
nvidia-smi
# Or use pre-built wheels (faster, replace cu121 with your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
AMD GPUs - ROCm/HIP Acceleration:
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dirCPU Optimization - OpenBLAS:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dirCross-Platform GPU - Vulkan:
CMAKE_ARGS="-DGGML_VULKAN=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dirNote: After installing with hardware acceleration, ensure your virtual environment is activated, then reinstall YMCA:
pip install -e .
Start an interactive chat session with the agent:
ymca-chat
Launch the web-based chat interface:
ymca-web
The web server will start on http://localhost:8000 by default.
Web Interface Documentation - Complete guide to the web UI, API endpoints, deployment, and customization.
Interact with the agent's memory system:
ymca-memory
Use this to query, add, or manage stored memories and embeddings.
Memory Tool Documentation - Comprehensive guide to storing, retrieving, and managing semantic memories.
Convert Hugging Face models to GGUF format for use with llama.cpp:
ymca-convert model-name --quantize q4_k_m
Model Converter Documentation - Complete guide to downloading, converting, and quantizing models.
Connect to MCP servers and use custom system prompts:
# Use with MCP server and specialized system prompt
ymca-chat --mcp-server "mtv:kubectl-mtv mcp-server" --system-prompt @docs/mtv-system-prompt.txt
# Multiple MCP servers
ymca-chat --mcp-server "server1:cmd1" --mcp-server "server2:cmd2"See docs/mtv-mcp-setup.md for detailed MTV MCP server configuration.
