How we built a frugal, sovereign AI site: dual-tier models, Infisical secrets, Docker Hub and Hugging Face for Wagmi, EU AI Act gates, and a stack that runs for under $100/month.
This site is the demo. Not a portfolio beside the product: the website you are reading is the proof—architecture, security, inference economics, and regulatory discipline in one shipped artifact. It was hand-coded with Cursor [1], without low-code or page builders.
The story we wanted to tell is deliberate: build an economic, frugal, and sovereign application—owned code, owned models, inference backends we can swap—while scaling intelligence only when it is worth it, and always balancing cost against user experience. Anonymous visitors get a fast, cheap CPU model; authenticated users unlock a larger GPU tier when the extra capability justifies the bill. Production stays predictable; staging proves scale-to-zero without betting the product on a single vendor’s runtime.
What follows is a CTO-level view: principles, stack roles, guardrails, monthly economics, and the roadmap—including continuous training, EU AI Act gates, and a feedback loop so users can flag inappropriate answers before we grant the large model broader powers (email, calendar, documents, internal agents).
Frugal by default. Static marketing and blog on Cloudflare Pages (edge, near-zero marginal cost). The Node runtime on Koyeb carries APIs and chat; the small Wagmi model (Qwen 2.5 1.5B, Q4) runs on CPU inside the same container for anonymous traffic—no paid GPU until someone authenticates. GPU inference scales to zero when idle.
Sovereign and portable. One TypeScript codebase, Docker images we build and pin, models we fine-tune and bake—not a black-box SaaS chat widget. We point LLM_API_URL at co-located or remote llama.cpp OpenAI-compatible endpoints without rewriting the app.
Inference-agnostic. The Vercel AI SDK and Zod-validated env (LLM_RUNTIME, LLM_RUNTIME_CPU, LLM_RUNTIME_GPU, model aliases) separate product logic from runtime. Staging validates the full server image; production serves static assets at the edge and proxies API traffic to Koyeb.
Rights follow capability. The small model answers from public knowledge (RAG over wagmi-skills.md, ai.txt, blog-derived facts). The large model will earn broader tools—and stricter oversight—as we add email, calendar, documents, and internal agents.
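The portability principle above rests on the OpenAI-compatible API that llama.cpp's server exposes at `/v1/chat/completions`. The following is a minimal sketch of how a request could be built from a base URL, not the site's actual code: `buildChatRequest`, the `ChatRequest` shape, the example remote host, and the assumption that `LLM_API_URL` holds a base URL without the `/v1` suffix are all illustrative.

```typescript
// Sketch only: helper and type names are hypothetical, not taken from the site's code.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  baseUrl: string, // assumed to be the value of LLM_API_URL, without "/v1"
  model: string, // model alias, e.g. "wagmi-sft"
  messages: ChatMessage[],
  maxTokens: number,
): ChatRequest {
  // Trim trailing slashes so "http://host" and "http://host/" behave the same.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // max_tokens and stream are standard OpenAI-compatible fields.
      body: JSON.stringify({ model, messages, max_tokens: maxTokens, stream: true }),
    },
  };
}

const question: ChatMessage[] = [{ role: "user", content: "Hello" }];

// Swapping backends changes only the base URL and the model alias.
const local = buildChatRequest("http://127.0.0.1:11434/", "wagmi-sft", question, 300);
const remote = buildChatRequest("https://gpu.example.com", "wagmi-sft-14b", question, 1024);

console.log(local.url);
console.log(remote.url);
```

Because the request shape is identical for both backends, the product code that consumes the stream does not need to know which runtime answered.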
| Layer | Tool | Role |
|---|---|---|
| Secrets | Infisical (EU) | Single source of truth for dev, staging, prod—database URLs, Supabase keys, LLM endpoints, API keys. Local dev via infisical run; Koyeb and GitHub Actions sync from the same namespaces. No secrets in git. |
| Data | PostgreSQL + Drizzle ORM | Typed schema, migrations in drizzle/, transactional writes. Convenience without ORM magic strings. |
| Auth | Supabase | Email OTP (6-digit), SSR-safe session cookies. We use Supabase where it excels; the rest stays in our app. |
| Content | Content Collections + Zod | Blog in content/blog/; build-time Markdown, bilingual EN/FR. |
| Images | Docker Hub (jeanbapt/…) | Registry for the unified web image (Next.js + CPU llama-server + baked GGUF) and separate GPU images (~10 GB). Koyeb pulls via a private registry secret. |
| Model factory | Hugging Face | Private GGUF repos, SFT datasets (datasets/wagmi-sft/), training and eval on Hub—where we cook the Wagmi models (wagmi-sft, wagmi-sft-14b). |
| Staging runtime | Koyeb | Full-stack validation: Standard xlarge, scale-to-zero, same Docker image as a production fallback. |
| Production front | Cloudflare Pages | Static export + Pages Functions proxy /api/* and auth routes to Koyeb. |
| Production API / AI | Koyeb | eco-xlarge, always-on web + scale-to-zero GPU service for auth tier. |
| CI/CD | GitHub Actions | Lint, tests, Docker build/push, deploy workflows, release gate, AI Act gate (see below). |
| IDE | Cursor | AI-assisted implementation; architecture and review remain human. |
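The "Production front" row describes a Pages Functions proxy that forwards `/api/*` to Koyeb. A rough sketch of the URL rewrite such a proxy needs is shown here; `toUpstreamUrl`, `proxyToKoyeb`, and the `KOYEB_ORIGIN` binding are hypothetical names, and the real function may handle headers, auth routes, and errors differently.

```typescript
// Sketch only: KOYEB_ORIGIN and both function names are assumptions for illustration.

// Keep the incoming path and query string, but point them at the upstream origin.
function toUpstreamUrl(requestUrl: string, upstreamOrigin: string): string {
  const incoming = new URL(requestUrl);
  const upstream = new URL(upstreamOrigin);
  upstream.pathname = incoming.pathname;
  upstream.search = incoming.search;
  return upstream.toString();
}

// In a Pages Functions file, a handler like this would be exported as `onRequest`.
async function proxyToKoyeb(context: {
  request: Request;
  env: { KOYEB_ORIGIN: string };
}): Promise<Response> {
  const target = toUpstreamUrl(context.request.url, context.env.KOYEB_ORIGIN);
  // new Request(url, request) carries method, headers and body over to the new URL.
  return fetch(new Request(target, context.request));
}

console.log(toUpstreamUrl("https://example.com/api/chat?lang=fr", "https://api.example.koyeb.app"));
// prints "https://api.example.koyeb.app/api/chat?lang=fr"
```

Keeping the rewrite in a pure function makes it easy to unit-test without running the edge runtime.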

| Anonymous visitor | Authenticated user |
|---|---|
| Small model (1.5B) | Large model (14B) |
| CPU, loopback llama-server | GPU service, scale-to-zero |
| Fast, ~$0 marginal GPU | Wake + deeper context |
| BM25 RAG grounding | CPU-first reply, then GPU upgrade |
wagmi-sft on co-located llama-server (127.0.0.1:11434): zero network hop, predictable latency for first token after warm-up.
wagmi-sft-14b on a dedicated GPU image; LLM_GPU_WAKE_URL wakes scale-to-zero from the public edge when private mesh alone is insufficient.
LLM_MODEL_AUTH_FALLBACKS and CPU-first chat (CHAT_AUTH_CPU_FIRST) keep the product usable if the GPU is cold or down.
We cap maxTokens (300 small / 1024 large) because tiny models often fail to emit EOS and repeat forever. A behavioural benchmark (scripts/benchmark-rag-qwen15b.ts) blocks shipping a small model that hallucinates or ignores auth nudges.
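The tier rules above can be summarised as a small decision function. This is a sketch under stated assumptions, not the site's routing code: `planReply`, `TierPlan`, and the exact behaviour when CPU-first is disabled are illustrative guesses. Only the model aliases and the token caps come from the text.

```typescript
// Sketch only: function and type names are hypothetical.
type Tier = "cpu" | "gpu";

interface TierPlan {
  tier: Tier;
  model: string;
  maxTokens: number;
  wakeGpu: boolean; // whether to ping the GPU wake URL for later turns
}

// Model aliases and caps as stated in the article.
const SMALL = { model: "wagmi-sft", maxTokens: 300 };
const LARGE = { model: "wagmi-sft-14b", maxTokens: 1024 };

function planReply(isAuthenticated: boolean, gpuReady: boolean, cpuFirst: boolean): TierPlan {
  // Anonymous traffic never touches the GPU.
  if (!isAuthenticated) return { tier: "cpu", ...SMALL, wakeGpu: false };
  // Authenticated and GPU warm: use the large model directly.
  if (gpuReady) return { tier: "gpu", ...LARGE, wakeGpu: false };
  // GPU cold and CPU-first enabled: answer now on CPU, wake the GPU in the background.
  if (cpuFirst) return { tier: "cpu", ...SMALL, wakeGpu: true };
  // Assumed behaviour when CPU-first is off: wait for the GPU to wake.
  return { tier: "gpu", ...LARGE, wakeGpu: true };
}

console.log(planReply(false, true, true));
console.log(planReply(true, false, true));
```

The design point is that every branch returns a usable plan, so a cold or unavailable GPU degrades capability rather than availability.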
Why we did not split CPU LLM into a per-session micro-service on Koyeb: platform scale-to-zero is idle-traffic-based (minutes), not “chat session ended”; splitting would add a second cold start without matching session granularity. The monolithic web image (~1.5 GB) is the pragmatic trade-off for anonymous UX on scale-to-zero staging.
Before features, we unified configuration:
.infisical.json; CLI against EU (https://eu.infisical.com).
dev, staging, prod mirror how we think about risk, not ad-hoc .env files on each laptop.
Placeholder files (scripts/infisical/placeholders-*.env) document required keys; pnpm run infisical:push-placeholders seeds empty shapes without committing values.
pnpm run dev wraps infisical run then Next.js so local parity matches staging.
GitHub Actions and Koyeb receive only what each surface needs (registry tokens, API tokens). Runtime LLM and database URLs live on the Koyeb service env, synced from Infisical prod; they are not re-injected on every deploy, which would wipe the full env list.
PostgreSQL via Drizzle: schema in src/lib/db/schema.ts, migrations under drizzle/, pooling URL from Infisical, withTransaction for writes.
Supabase: OTP auth, cookie sessions (HttpOnly, Secure, SameSite). Chat persistence only after explicit consent; anonymous chats are not stored.
APIs (App Router): streaming POST /api/chat, health, LLM status, contact classification, auth callback. Contact is chat-only—no separate form.
Local RAG (local-rag.ts): BM25-style token overlap over verified chunks—no vector DB for the small tier. SFT pipeline: scripts/generate-wagmi-sft-dataset.ts → datasets/wagmi-sft/ for Hub training.
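To make the "BM25-style token overlap" idea concrete, here is a compact, dependency-free sketch of standard Okapi BM25 ranking over text chunks. It is not the contents of local-rag.ts: `tokenize`, `bm25Rank`, the ASCII-only tokenizer, and the default parameters `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```typescript
// Sketch only: a textbook BM25 scorer, not the site's local-rag.ts.

// Lowercase ASCII tokenizer; a bilingual EN/FR corpus would need accent handling.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Rank(
  query: string,
  chunks: string[],
  k1 = 1.2,
  b = 0.75,
): { index: number; score: number }[] {
  const docs = chunks.map(tokenize);
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);

  // Document frequency: in how many chunks each term appears.
  const df = new Map<string, number>();
  docs.forEach((d) => {
    Array.from(new Set(d)).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  });

  const queryTerms = Array.from(new Set(tokenize(query)));

  return docs
    .map((d, index) => {
      // Term frequency within this chunk.
      const tf = new Map<string, number>();
      d.forEach((t) => tf.set(t, (tf.get(t) ?? 0) + 1));

      let score = 0;
      for (const term of queryTerms) {
        const f = tf.get(term) ?? 0;
        if (f === 0) continue;
        const n = df.get(term) ?? 0;
        const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
        score += (idf * f * (k1 + 1)) / (f + k1 * (1 - b + (b * d.length) / avgLen));
      }
      return { index, score };
    })
    .sort((x, y) => y.score - x.score);
}

const ranked = bm25Rank("which model runs on GPU", [
  "Anonymous visitors get the small CPU model.",
  "Authenticated users unlock the large GPU model.",
  "The blog is built at build time from Markdown.",
]);
console.log(ranked[0].index); // prints 1
```

With a corpus this small, everything fits in memory and ranking is a single pass, which is why a vector database is not required for the small tier.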
Docker Hub hosts immutable tags (staging-<sha>, prod-<sha>). Multi-stage Dockerfile: Node 24 bookworm-slim, non-root nextjs, standalone Next output, llama-server + baked wagmi-sft GGUF (see docs/ci-gguf-registry.md). GPU images build on their own cadence when docker/llamacpp-wagmi-gpu/** changes—not on every copy tweak.
Hybrid production:
Cloudflare Pages: out/ static assets, global CDN, DDoS edge; ideal for marketing and blog.
Staging: push to dev or manual workflow → tests → Docker push → Koyeb deal-ex-machina-staging/web, min-scale 0, health grace 360 s for cold VM + model warm-up.
Hugging Face is where models are trained, versioned, and stored as private GGUF artifacts. CI fetches weights into a BuildKit context or uses a companion blob image so deploys do not re-download multi-GB files from scratch every time.
In production today:
Release gate (scripts/release/release-gate.mjs): health, chat SSE rounds, optional Lighthouse.
AI Act gate (scripts/release/ai-act-gate.mjs): wired in deploy-production.yml; checks model family, llamacpp runtime, and evidence URLs for red-team / tool-eval artifacts when switching families (e.g. LFM2). We are extending this into a fuller evaluation loop (automated eval sets, artifact retention, stricter no-go).
Roadmap, always with user protection:
Security (production headers): CSP, HSTS, X-Frame-Options: DENY, nosniff, strict Referrer-Policy, Permissions-Policy without camera/mic/geo. CORS allowlist for mutating requests. Rate limits (e.g. chat 20 req / 15 min). Zod on all inputs; Drizzle only for SQL; profanity filter on chat. Health returns { "status": "ok" } only—no env leakage.
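The chat rate limit quoted above (20 requests per 15 minutes) can be illustrated with a sliding-window limiter. This is a minimal in-memory sketch, assuming a single process and a caller-supplied key such as a session or IP identifier; the class name and structure are hypothetical and the production limiter may be implemented differently.

```typescript
// Sketch only: in-memory, single-process limiter; names are hypothetical.
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the hit if the key is under its limit.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have fallen out of the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}

// 20 requests per 15 minutes, as stated for chat.
const chatLimiter = new SlidingWindowLimiter(20, 15 * 60 * 1000);

let accepted = 0;
for (let i = 0; i < 25; i++) {
  if (chatLimiter.allow("visitor-a", 1000)) accepted++;
}
console.log(accepted); // prints 20
```

Passing `now` explicitly keeps the limiter deterministic in tests; in a request handler the default `Date.now()` applies.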
Privacy / GDPR: Data minimization; 7-day retention for consented chat; essential cookies only; EN/FR privacy pages and explicit consent checkbox before persistence.
EU AI Act: Transparency (users know they talk to AI); limited-risk posture; no prohibited practices; terms and privacy reference AI-generated content limits; incident path via contact. Technical documentation and human oversight expand as tool use grows.
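The AI Act gate described earlier (runtime check, plus evidence URLs when the model family changes) can be pictured as a pure function that returns a list of blocking reasons. The sketch below is an assumption-laden illustration, not the contents of scripts/release/ai-act-gate.mjs: the `GateInput` fields, the HTTPS requirement, and the failure messages are all hypothetical.

```typescript
// Sketch only: field names and rules are assumptions, not the real gate script.
interface GateInput {
  currentFamily: string; // model family currently in production
  candidateFamily: string; // model family about to be deployed
  runtime: string;
  redTeamEvidenceUrl?: string;
  toolEvalEvidenceUrl?: string;
}

function isHttpsUrl(value: string | undefined): boolean {
  if (!value) return false;
  try {
    return new URL(value).protocol === "https:";
  } catch {
    return false; // not a parseable URL
  }
}

// Returns an empty array when the deploy may proceed.
function aiActGate(input: GateInput): string[] {
  const failures: string[] = [];
  if (input.runtime !== "llamacpp") {
    failures.push(`unexpected runtime: ${input.runtime}`);
  }
  // Evidence is only demanded when the model family changes.
  if (input.candidateFamily !== input.currentFamily) {
    if (!isHttpsUrl(input.redTeamEvidenceUrl)) failures.push("missing red-team evidence URL");
    if (!isHttpsUrl(input.toolEvalEvidenceUrl)) failures.push("missing tool-eval evidence URL");
  }
  return failures;
}

console.log(aiActGate({ currentFamily: "qwen2.5", candidateFamily: "lfm2", runtime: "llamacpp" }));
```

A CI step built this way would exit non-zero whenever the returned list is non-empty, which turns the compliance check into an ordinary failing job.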
Quality: Biome, Vitest, Playwright, Lighthouse CI on PRs; frozen lockfile in CI; Dependabot; supply-chain pre-commit guard on workflows.
We target under ~$100/month for the public site + staging bursts + occasional GPU—not “infinite scale” pricing, but controlled sovereignty:
| Item | Typical cost | Notes |
|---|---|---|
| Koyeb web (prod) | ~$43/mo | eco-xlarge (8 GB), min=1—always-on CPU + small model resident |
| Koyeb GPU (prod) | $0–30/mo variable | Scale-to-zero; ~$0.50/h class GPU only when auth users wake it |
| Koyeb staging | $0–15/mo | Standard xlarge, min=0; pay per second when awake |
| Cloudflare Pages | $0 | Within free tier for static + Functions proxy |
| Supabase + Postgres | $0–25/mo | Free/low tiers for OTP + modest DB; pooler URL in Infisical |
| Infisical | $0 | Team/dev tier for secret management |
| Docker Hub + GitHub | $0 | Public/private repos within free allowances |
| Hugging Face | $0 | Hub storage + CI fetch; training jobs episodic |
Baseline always-on: roughly $45–55/month (prod web + DB). Peak month with GPU wake traffic and staging deploys still often stays under $100 because GPU and staging are seconds-priced, not 24/7. The design choice—small model for everyone, big model only after auth—is what makes that envelope possible.
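The envelope above is simple arithmetic, and it helps to see how many GPU hours the budget actually leaves. The sketch below uses the table's figures (about $43 for the always-on web service, about $0.50 per GPU hour, a $100 target); the function names and the sample database and staging amounts are illustrative choices within the table's ranges.

```typescript
// Figures from the cost table; function names and sample inputs are illustrative.
const BUDGET_USD = 100;
const WEB_USD = 43; // Koyeb web (prod), always on
const GPU_USD_PER_HOUR = 0.5;

// Total monthly bill for a given amount of GPU time and variable spend.
function monthlyCost(gpuHours: number, dbUsd: number, stagingUsd: number): number {
  return WEB_USD + dbUsd + stagingUsd + gpuHours * GPU_USD_PER_HOUR;
}

// GPU hours that still fit under the budget after fixed and variable costs.
function gpuHoursWithinBudget(dbUsd: number, stagingUsd: number): number {
  return Math.max(0, (BUDGET_USD - WEB_USD - dbUsd - stagingUsd) / GPU_USD_PER_HOUR);
}

console.log(monthlyCost(20, 10, 5)); // prints 68
console.log(gpuHoursWithinBudget(25, 15)); // prints 34
console.log(monthlyCost(60, 25, 15)); // prints 113
```

The last line shows why the target is described as "often" met: if database, staging, and GPU all hit the top of their ranges in the same month, the total reaches about $113, so staying under $100 depends on at least one of them remaining below its ceiling.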
Node 24, TypeScript 5.9 strict, Next.js 16 App Router (standalone + static export), React 19, Tailwind, Radix, next-intl EN/FR, Assistant UI + Vercel AI SDK. Critical CSS, lazy chat bundle, AVIF/WebP, JSON-LD, sitemaps, llms.txt / ai.txt. Performance targets: Lighthouse performance ~96%, CLS 0, aggressive code splitting—details in docs/PERFORMANCE_OPTIMIZATIONS_FINAL_STATUS.md.
No WordPress/Wix/low-code CMS for the core site. No secrets in git or client bundles. No arbitrary any. No shipping without CI lint, tests, and type-check. No health endpoint introspection in production. No third-party analytics cookies.
This stack demonstrates that a consultancy-grade AI surface can be:
Inspect the repo and the running site: they are the deliverable. The roadmap does not replace that discipline—it extends it, one measured step at a time, with more power for the large model only where responsibility keeps pace.
Stack summary: Infisical (secrets) · Supabase + Drizzle + Postgres (auth/data) · Docker Hub (images) · Hugging Face (Wagmi models) · Koyeb (API + CPU/GPU inference) · Cloudflare Pages (production static) · GitHub Actions (CI, release gate, AI Act gate) · dual-tier LLM · target ops < ~$100/month.