AI SYSTEMS ARCHITECT
    at StephanieSpanjian.com

    Production RAG Architecture & 99.7% Cost Optimization

    99.7%
    AI Cost Reduction
    $1.12
    Per Month
    ↓ $360
    Monthly Cost Reduction
    ~200ms
    In-Memory Search Latency

    Capabilities / Domains

    Cost Optimization
    AI Systems Architecture
    RAG (Retrieval-Augmented Generation)
    Amazon Bedrock
    LLM Orchestration
    Titan Embeddings
    Amazon S3
    Amazon Web Services (AWS)
    AWS Lambda
    DynamoDB
    Vector Search
    TypeScript
    Next.js
    React

    Timeline

    01/2026 – 02/2026

    Built a production Retrieval-Augmented Generation (RAG) system powering an AI knowledge agent on my website — re-architected from an enterprise-grade OpenSearch implementation to a cost-optimized, serverless AWS-native design using S3-hosted embeddings and in-memory cosine similarity search.

    The Challenge

The initial OpenSearch cluster cost $360–720/month for a corpus of ~200 document chunks and modest daily traffic. Architecturally sound, but economically wrong for the scale. The goal: maintain production-grade RAG quality while eliminating the vector database entirely.

    The Solution

    Re-architected the system using Amazon Bedrock Titan Text Embeddings V2 with serialized embedding storage in S3 and in-memory cosine similarity search within AWS Lambda. Precomputed embeddings are loaded into memory on cold start and reused across warm invocations, eliminating the need for a vector database. Implemented tiered model routing via Amazon Bedrock, DynamoDB-backed rate limiting for budget governance, and full dev/prod environment separation to preserve production reliability.
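The retrieval path above can be sketched in TypeScript. This is an illustrative minimal version, not the production code: the `DocChunk` shape and function names are assumptions, and the S3 load on cold start is indicated in comments rather than implemented (in practice, a serialized embeddings file would be fetched once and parsed into the module-level cache).

```typescript
// Illustrative sketch of in-memory vector search inside a Lambda handler.
// Embeddings are precomputed (e.g. via Bedrock Titan Text Embeddings V2),
// stored in S3, and loaded into this module-level cache on cold start so
// warm invocations skip the fetch entirely -- no vector database needed.

interface DocChunk {
  id: string;
  text: string;
  embedding: number[]; // precomputed embedding vector
}

// Module-level cache: survives across warm invocations of the same
// Lambda execution environment. Populated from S3 on cold start.
let chunkCache: DocChunk[] | null = null;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and return the k best.
// With ~200 chunks this is a trivial linear scan, well under the
// ~200ms warm-latency budget.
function topK(query: number[], chunks: DocChunk[], k: number): DocChunk[] {
  return chunks
    .map((c) => ({ chunk: c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((s) => s.chunk);
}
```

A brute-force scan like this is exactly the trade the re-architecture makes: at a few hundred chunks, O(n) similarity in memory is cheaper and simpler than any managed index.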

    The Result

    Eliminated the $360/month OpenSearch baseline. Total runtime cost: $1.12/month — a 99.7% reduction. Warm response latency ~200ms. Production-grade RAG with rate governance, model abstraction, and full dev/prod separation.
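The rate-governance layer can be sketched as a fixed-window counter. The production system is described as DynamoDB-backed; here the atomic-counter semantics (in DynamoDB, an `UpdateItem` with an `ADD` expression guarded by a condition) are modeled as a pure function so the decision logic is visible. The window size and limit values are illustrative assumptions, not the deployed configuration.

```typescript
// Sketch of fixed-window rate limiting for budget governance.
// In the DynamoDB-backed version, the counter lives in a table item
// keyed by client and window, and the increment-plus-check below is
// done atomically with a conditional UpdateItem; this pure function
// models only the decision logic.

interface WindowState {
  windowStart: number; // epoch ms when the current window opened
  count: number;       // requests counted in this window
}

function checkRateLimit(
  state: WindowState,
  nowMs: number,
  limit: number,
  windowMs: number
): { allowed: boolean; state: WindowState } {
  // Window expired: start a fresh one and count this request.
  // (With DynamoDB, a TTL attribute or a windowStart-keyed conditional
  // write gives the same reset behavior.)
  if (nowMs - state.windowStart >= windowMs) {
    return { allowed: true, state: { windowStart: nowMs, count: 1 } };
  }
  // Over budget for this window: reject without mutating state.
  if (state.count >= limit) {
    return { allowed: false, state };
  }
  // Within budget: count the request and allow it.
  return { allowed: true, state: { ...state, count: state.count + 1 } };
}
```

Capping requests per window bounds worst-case Bedrock spend, which is what keeps the $1.12/month figure predictable rather than merely typical.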