A microservice for managing short-term memory in LLM-based conversational agents, with guaranteed consistency and no gaps or duplicates in conversation history.
The Problem: Chatbots lose context or waste tokens
- ❌ Long conversations hit token limits → forced to truncate history
- ❌ Keeping all messages → expensive API calls, slow responses
- ❌ Manual context management → buggy, inconsistent results
The Solution: Automatic sliding window that just works
- ✅ Recent messages stay full (high detail where it matters)
- ✅ Old messages auto-compress into summaries (save up to 80% of tokens)
- ✅ Guaranteed consistency (zero gaps, zero duplicates)
- ✅ Set it and forget it (background processing, no manual work)
| Approach | Token Usage | Consistency | Latency | Setup |
|---|---|---|---|---|
| This Project | ✅ Low (~70% avg, up to 80% on 100+ messages) | ✅ Guaranteed | ✅ <200ms | ✅ 5 min |
| Manual truncation | | ❌ Gaps likely | ✅ Fast | |
| Full history | ❌ Very High | ✅ Complete | ❌ Slow | ✅ Easy |
| Custom solution | | ❌ Error-prone | | ❌ Days/weeks |
Get started in 5 minutes: See QUICKSTART.md for detailed setup instructions.
Prerequisites: Docker, Docker Compose, OpenAI API key
What you'll get:
- ✅ REST API running on localhost:3000
- ✅ PostgreSQL database with memory storage
- ✅ Ready to integrate with your chatbot
Step-by-step:
- Clone and set up:

  ```bash
  git clone <repo-url>
  cd llm-sliding-window-memory
  cp .env.example .env  # Add your OPENAI_API_KEY
  ```
- Start services:

  ```bash
  docker-compose up -d
  ```
- Initialize database:

  ```bash
  docker exec -i memory-core-postgres psql -U memoryuser -d memory_core < src/storage/migrations/001_initial_schema.sql
  ```
- Verify it works:

  ```bash
  curl http://localhost:3000/api/health
  # Should return: {"status":"ok"}
  ```
- Zero Configuration - Works out of the box with sensible defaults
- Save 80% on API Costs - Automatic compression reduces token usage dramatically
- Low Latency - Summaries are pre-generated in the background, so context retrieval is instant
- Bulletproof Consistency - State machine guarantees no gaps or duplicates
- Production Ready - Full Docker support, tested with real users
- Easy Integration - Simple REST API, works with any LLM
| Metric | Value |
|---|---|
| Add Message | < 200ms |
| Get Context | < 100ms |
| Token Savings | ~80% |
| Compression | GPT-5 Nano |
| Storage | PostgreSQL + pgvector |
| Deployment | Docker ready |
- Message Pairs: User-assistant message exchanges with indices
- Blocks: Ranges of message pairs that get summarized
- Summaries: Compressed representations of blocks
- State Machine: Controls block lifecycle transitions
- Window Manager: Core logic for memory window management
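For orientation, here is a rough sketch of how these concepts could be modeled as TypeScript types (illustrative only; the field names are assumptions, not the project's actual src/core/types.ts):

```typescript
// Illustrative sketch only; field names are assumptions, not the actual src/core/types.ts.
type BlockState =
  | 'NOT_CREATED'
  | 'SUMMARY_PENDING'
  | 'SUMMARY_PREPARED'
  | 'SUMMARY_ACTIVE'
  | 'ARCHIVED';

interface MessagePair {
  pairIndex: number;        // 1-based position in the conversation
  userMessage: string;
  assistantMessage: string;
}

interface Block {
  startPairIndex: number;   // first pair covered by this block
  endPairIndex: number;     // last pair covered by this block
  state: BlockState;
  summaryText?: string;     // present once a summary has been generated
}
```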
```
NOT_CREATED → SUMMARY_PENDING → SUMMARY_PREPARED → SUMMARY_ACTIVE → ARCHIVED
                      ↓
                NOT_CREATED (rollback on error)
```
No Gaps: Every message pair index from 1 to N is accounted for in the effective context, either as a raw pair in the window OR covered by an active summary.
No Duplicates: No message pair appears both in raw window AND in an active summary simultaneously.
Atomic Transitions: Block state transitions use row-level locking to prevent race conditions.
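A minimal sketch of how these guarantees can be checked, reusing the `Block` and `BlockState` shapes from the sketch above (illustrative only; the exact rollback edges and function names are assumptions, not the project's state-machine or window-manager code):

```typescript
// Illustrative sketch; rollback edges are an assumption, not the actual state machine.
const ALLOWED_TRANSITIONS: Record<BlockState, BlockState[]> = {
  NOT_CREATED: ['SUMMARY_PENDING'],
  SUMMARY_PENDING: ['SUMMARY_PREPARED', 'NOT_CREATED'],  // rollback on error (assumed)
  SUMMARY_PREPARED: ['SUMMARY_ACTIVE', 'NOT_CREATED'],   // rollback on error (assumed)
  SUMMARY_ACTIVE: ['ARCHIVED'],
  ARCHIVED: [],
};

// "No gaps, no duplicates": every pair index 1..N is covered exactly once,
// either by an active summary's block range or by a raw pair in the window.
function checkCoverage(
  activeBlocks: Block[],
  rawPairIndices: number[],
  totalPairs: number
): boolean {
  const seen = new Set<number>();
  const cover = (i: number) => {
    if (seen.has(i)) return false;  // duplicate coverage
    seen.add(i);
    return true;
  };
  for (const b of activeBlocks) {
    for (let i = b.startPairIndex; i <= b.endPairIndex; i++) {
      if (!cover(i)) return false;
    }
  }
  for (const i of rawPairIndices) {
    if (!cover(i)) return false;
  }
  return seen.size === totalPairs;  // no gaps
}
```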
Customize memory behavior by editing .env:
```bash
# Memory Window Settings
DEFAULT_RAW_WINDOW_SIZE=10           # Recent message pairs kept in full (default: 10, recommended: 10-15)
DEFAULT_BLOCK_SIZE=5                 # Pairs compressed per summary (default: 5)
DEFAULT_MAX_SUMMARIES_IN_CONTEXT=3   # Maximum active summaries (default: 3)
DEFAULT_TRIGGER_OFFSET=1             # Extra messages before compression (default: 1)
```

How it works:

- When the raw window reaches `RAW_WINDOW_SIZE + TRIGGER_OFFSET`, the oldest `BLOCK_SIZE` pairs compress into a summary (see the sketch below)
- `RAW_WINDOW_SIZE`: use `10` for faster compression/lower memory, `15` for more context retention
- The system maintains the `MAX_SUMMARIES_IN_CONTEXT` most recent summaries + raw pairs
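For intuition, with the defaults above (`RAW_WINDOW_SIZE=10`, `BLOCK_SIZE=5`, `TRIGGER_OFFSET=1`): when pair 11 arrives the raw window reaches 10 + 1, so pairs 1-5 are compressed into one summary and pairs 6-11 stay raw. A minimal sketch of that arithmetic (illustrative only, not the actual window-manager code):

```typescript
// Illustrative sketch of the trigger arithmetic; the real logic lives in the window manager.
function pairsToCompress(
  rawPairCount: number,       // pairs currently held in full
  rawWindowSize = 10,         // DEFAULT_RAW_WINDOW_SIZE
  blockSize = 5,              // DEFAULT_BLOCK_SIZE
  triggerOffset = 1           // DEFAULT_TRIGGER_OFFSET
): number {
  // Compression triggers once the raw window reaches RAW_WINDOW_SIZE + TRIGGER_OFFSET.
  return rawPairCount >= rawWindowSize + triggerOffset ? blockSize : 0;
}

pairsToCompress(10); // 0 -> still within the window
pairsToCompress(11); // 5 -> pairs 1-5 become one summary, pairs 6-11 stay raw
```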
After changing .env:
```bash
docker-compose down && docker-compose up -d
```

Automatically returns the existing conversation for a user or creates a new one. This is the recommended endpoint for integrations like Telegram bots.
```http
POST /api/conversations/get-or-create
Content-Type: application/json

{
  "userId": "telegram_383946741",
  "metadata": {"source": "telegram"}
}
```

Response (existing conversation):
```json
{
  "success": true,
  "data": {
    "conversationId": "uuid",
    "isNew": false,
    "conversation": {
      "id": "uuid",
      "userId": "telegram_383946741",
      "createdAt": "2025-12-02T...",
      "metadata": {"source": "telegram"}
    }
  }
}
```

Response (new conversation):
```json
{
  "success": true,
  "data": {
    "conversationId": "uuid",
    "isNew": true,
    "conversation": {...}
  }
}
```
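A minimal client-side sketch of calling this endpoint, assuming the service runs on http://localhost:3000 as in the quickstart (the helper name and userId format are illustrative):

```typescript
// Hypothetical client-side sketch; the base URL and userId format are assumptions.
const baseUrl = 'http://localhost:3000';

async function getOrCreateConversation(telegramUserId: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/conversations/get-or-create`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      userId: `telegram_${telegramUserId}`,
      metadata: { source: 'telegram' }
    })
  });
  const { data } = await res.json();
  // data.isNew indicates whether a new conversation was created
  return data.conversationId;
}
```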
```http
POST /api/conversations
Content-Type: application/json

{
  "userId": "user-123",
  "metadata": {"source": "telegram"}
}
```

Response:
```json
{
  "success": true,
  "data": {
    "id": "uuid",
    "userId": "user-123",
    "createdAt": "2025-12-02T...",
    "metadata": {"source": "telegram"}
  }
}
```

```http
POST /api/conversations/:conversationId/messages
Content-Type: application/json

{
  "userMessage": "What is the capital of France?",
  "assistantMessage": "The capital of France is Paris."
}
```

Response:
```json
{
  "success": true,
  "data": {
    "pair": {...},
    "pairIndex": 1,
    "blocksCreated": 0,
    "blocksActivated": 0
  }
}
```

```http
GET /api/conversations/:conversationId/context
```

Response:
```json
{
  "success": true,
  "data": {
    "activeSummaries": [
      {
        "blockRange": [1, 5],
        "summaryText": "User asked about weather, assistant provided forecast..."
      }
    ],
    "rawPairs": [
      {
        "pairIndex": 6,
        "userMessage": "...",
        "assistantMessage": "..."
      }
    ],
    "totalEffectivePairs": 15
  }
}
```

```http
GET /api/conversations/:conversationId/diagnostics
```

Returns detailed memory state information including all blocks, their states, and coverage analysis.
```http
PATCH /api/conversations/:conversationId/config
Content-Type: application/json

{
  "rawWindowSize": 20,
  "blockSize": 5,
  "maxSummariesInContext": 3,
  "triggerOffset": 2
}
```
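A hedged example of calling this endpoint from TypeScript, assuming the quickstart's http://localhost:3000 base URL and the same `{ success, data }` response envelope as the other endpoints:

```typescript
// Hypothetical sketch; base URL and response envelope are assumptions.
async function updateMemoryConfig(conversationId: string) {
  const res = await fetch(
    `http://localhost:3000/api/conversations/${conversationId}/config`,
    {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        rawWindowSize: 20,
        blockSize: 5,
        maxSummariesInContext: 3,
        triggerOffset: 2
      })
    }
  );
  return res.json();
}
```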
```
llm-sliding-window-memory/
├── src/
│   ├── core/                  # Core business logic
│   │   ├── types.ts
│   │   ├── state-machine.ts
│   │   ├── window-manager.ts
│   │   └── summary-generator.ts
│   ├── storage/               # Database layer
│   │   ├── database.ts
│   │   ├── repositories/
│   │   └── migrations/
│   ├── api/                   # REST API
│   │   ├── routes/
│   │   └── middleware/
│   ├── services/              # Service orchestration
│   └── server.ts              # Entry point
├── tests/
│   ├── unit/
│   ├── integration/
│   └── scenarios/
└── docker-compose.yml
```
- Add Message: < 200ms (excluding async summary generation)
- Get Context: < 100ms
- Summary Generation: ~5s per 5-pair block (GPT-5 Nano with minimal reasoning)
- Maintain conversation history for chat bots with automatic context management.
- Keep track of customer interactions with intelligent summarization.
- Power conversational AI agents with reliable memory management.
- Remember student progress and past lessons across sessions.
- Maintain patient conversation history with HIPAA-compliant storage.
- Track client conversations and preferences over time.
```typescript
// 1. Create a conversation when the user starts a chat
const convRes = await fetch('http://api/conversations', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    userId: telegramUserId,
    metadata: { chatId: telegramChatId }
  })
});
const convId = (await convRes.json()).data.id;

// 2. Get context before sending to the LLM
const ctxRes = await fetch(`http://api/conversations/${convId}/context`);
const { activeSummaries, rawPairs } = (await ctxRes.json()).data;

// 3. Format context for the LLM
const messages = [
  ...activeSummaries.map(s => ({ role: 'system', content: s.summaryText })),
  ...rawPairs.flatMap(p => [
    { role: 'user', content: p.userMessage },
    { role: 'assistant', content: p.assistantMessage }
  ]),
  { role: 'user', content: newUserMessage }
];

// 4. Send to the LLM and save the new exchange
const llmResponse = await callLLM(messages);
await fetch(`http://api/conversations/${convId}/messages`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    userMessage: newUserMessage,
    assistantMessage: llmResponse
  })
});
```

```bash
# Run all tests
npm test
# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:e2e
```

- WebSocket streaming for real-time updates
- Bull queue for reliable background jobs
- Metrics and monitoring dashboard
- Summary consolidation (merge multiple summaries)
- RAG integration with embeddings
- Multi-tenant isolation
- Multi-language support
- Cloud deployment templates (AWS, GCP, Azure)
- GraphQL API
- Analytics dashboard
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Write tests
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Join our community to get help, share ideas, and stay updated:
- Discussions - Ask questions, share ideas
- Issues - Report bugs, request features
- Star this repo - Show your support
- Watch releases - Get notified of updates
For issues and questions, please open a GitHub issue.
- QUICKSTART.md - Setup instructions
- ARCHITECTURE.md - Detailed architecture
- DIAGRAMS.md - System diagrams