Production RAG Architecture & 99.7% Cost Optimization
Capabilities / Domains
Timeline
01/2026 – 02/2026
Built a production Retrieval-Augmented Generation (RAG) system powering an AI knowledge agent on my website — re-architected from an enterprise-grade OpenSearch implementation to a cost-optimized, serverless AWS-native design using S3-hosted embeddings and in-memory cosine similarity search.
The Challenge
The initial OpenSearch cluster cost $360–720/month to serve a corpus of ~200 document chunks and modest daily traffic. The architecture was sound, but economically wrong for the scale. The goal: maintain production-grade RAG quality while eliminating the vector database entirely.
The Solution
Re-architected the system using Amazon Bedrock Titan Text Embeddings V2 with serialized embedding storage in S3 and in-memory cosine similarity search within AWS Lambda. Precomputed embeddings are loaded into memory on cold start and reused across warm invocations, eliminating the need for a vector database. Implemented tiered model routing via Amazon Bedrock, DynamoDB-backed rate limiting for budget governance, and full dev/prod environment separation to preserve production reliability.
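The retrieval path above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the chunk texts and vectors are stubbed as literals where production code would deserialize a precomputed embedding object from S3, and all names are hypothetical. The key idea is the module-level cache, which Lambda populates once per cold start and reuses across warm invocations.

```python
import math

# Module-level cache: filled once per Lambda cold start, reused while the
# execution environment stays warm. In production this would hold the
# deserialized S3 object containing (chunk_text, embedding) pairs.
_EMBEDDING_CACHE = None


def load_embeddings():
    """Load (chunk_text, vector) pairs into memory once per container."""
    global _EMBEDDING_CACHE
    if _EMBEDDING_CACHE is None:
        # Stand-in for s3.get_object(...) plus deserialization; vectors
        # here are toy 3-dim examples, not real Titan V2 embeddings.
        _EMBEDDING_CACHE = [
            ("Chunk about RAG architecture", [0.9, 0.1, 0.0]),
            ("Chunk about cost optimization", [0.1, 0.9, 0.2]),
        ]
    return _EMBEDDING_CACHE


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, k=2):
    """Rank all cached chunks by similarity to the query embedding."""
    scored = [
        (cosine_similarity(query_vector, vec), text)
        for text, vec in load_embeddings()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```

At ~200 chunks, a full linear scan like this is microseconds of work, which is why dropping the vector database costs nothing in retrieval quality at this scale.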
The Result
Eliminated the $360/month OpenSearch baseline. Total runtime cost: $1.12/month — a 99.7% reduction. Warm response latency ~200ms. Production-grade RAG with rate governance, model abstraction, and full dev/prod separation.
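The rate governance mentioned above could be sketched as a fixed-window counter. This is an assumed scheme (the source only states "DynamoDB-backed rate limiting"), with a plain dict standing in for the DynamoDB table; in production the increment would be a conditional UpdateItem so concurrent Lambdas cannot exceed the budget.

```python
import time

WINDOW_SECONDS = 3600  # assumed window size, not from the source
MAX_REQUESTS = 50      # assumed per-window budget, not from the source

# Stand-in for the DynamoDB table: (client_id, window) -> request count.
_table = {}


def allow_request(client_id, now=None):
    """Return True if the client is within its budget for the current window."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    key = (client_id, window)
    count = _table.get(key, 0)
    if count >= MAX_REQUESTS:
        # In DynamoDB, a ConditionExpression on the counter would fail here.
        return False
    _table[key] = count + 1
    return True
```

Because each window gets its own key, old counters simply stop being read when the window rolls over; in DynamoDB a TTL attribute would reclaim them.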