The Website Is the Demo

This site is the demo. Not a portfolio beside the product: the website you are reading is the proof—architecture, security, inference economics, and regulatory discipline in one shipped artifact. It was hand-coded with Cursor [1], without low-code or page builders.

The story we wanted to tell is deliberate: build an economical, frugal, and sovereign application (owned code, owned models, inference backends we can swap) while scaling intelligence only when it is worth it, and always balancing cost against user experience. Anonymous visitors get a fast, cheap CPU model; authenticated users unlock a larger GPU tier when the extra capability justifies the bill. Production stays predictable; staging proves scale-to-zero without betting the product on a single vendor’s runtime.

What follows is a CTO-level view: principles, stack roles, guardrails, monthly economics, and the roadmap—including continuous training, EU AI Act gates, and a feedback loop so users can flag inappropriate answers before we grant the large model broader powers (email, calendar, documents, internal agents).

1. Design principles

Frugal by default. Static marketing and blog on Cloudflare Pages (edge, near-zero marginal cost). The Node runtime on Koyeb carries APIs and chat; the small Wagmi model (Qwen 2.5 1.5B, Q4) runs on CPU inside the same container for anonymous traffic—no paid GPU until someone authenticates. GPU inference scales to zero when idle.

Sovereign and portable. One TypeScript codebase, Docker images we build and pin, models we fine-tune and bake—not a black-box SaaS chat widget. We point LLM_API_URL at co-located or remote llama.cpp OpenAI-compatible endpoints without rewriting the app.
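As a rough illustration of that portability, the sketch below builds a chat request for an OpenAI-compatible `/v1/chat/completions` endpoint, such as the one llama-server exposes. It assumes LLM_API_URL holds a base URL without the `/v1` suffix. `buildChatRequest`, its types, and the remote host name are hypothetical and not taken from the site's code.

```typescript
// Hypothetical helper: builds a request for an OpenAI-compatible
// /v1/chat/completions endpoint. Names and shapes are illustrative only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface BuiltRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  baseUrl: string, // assumption: the value of LLM_API_URL, without a trailing /v1
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
): BuiltRequest {
  // Strip trailing slashes so both "http://host" and "http://host/" work.
  const url = `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`;
  return {
    url,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, max_tokens: maxTokens, stream: true }),
    },
  };
}

// Swapping backends only changes the base URL; the payload shape stays the same.
const hello: ChatMessage[] = [{ role: "user", content: "Hello" }];
const localRequest = buildChatRequest("http://127.0.0.1:11434", "wagmi-sft", hello, 300);
const remoteRequest = buildChatRequest("https://gpu.example.internal/", "wagmi-sft-14b", hello, 1024);
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`; keeping the builder pure makes the backend swap easy to unit-test.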

Inference-agnostic. The Vercel AI SDK and Zod-validated env (LLM_RUNTIME, LLM_RUNTIME_CPU, LLM_RUNTIME_GPU, model aliases) separate product logic from runtime. Staging validates the full server image; production serves static assets at the edge and proxies API traffic to Koyeb.
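The separation between product logic and runtime can be pictured as a single configuration-parsing step at boot. The site uses Zod for this; the sketch below swaps in plain TypeScript to stay dependency-free. The variable names come from the text above, while the accepted runtime value, the fallback rules, and `parseLlmEnv` itself are assumptions made for illustration.

```typescript
// Hypothetical, dependency-free stand-in for the Zod-validated env described above.
const RUNTIMES = ["llamacpp"] as const; // assumption: the only runtime named in the article
type Runtime = (typeof RUNTIMES)[number];

interface LlmConfig {
  apiUrl: string;
  runtime: Runtime;
  cpuRuntime: Runtime;
  gpuRuntime: Runtime;
}

type Env = Record<string, string | undefined>;

function isRuntime(value: string): value is Runtime {
  return (RUNTIMES as readonly string[]).includes(value);
}

function parseRuntime(name: string, value: string | undefined, fallback: Runtime): Runtime {
  if (value === undefined || value === "") return fallback;
  if (!isRuntime(value)) throw new Error(`${name}: unsupported runtime "${value}"`);
  return value;
}

function parseLlmEnv(env: Env): LlmConfig {
  const apiUrl = env.LLM_API_URL;
  if (!apiUrl) throw new Error("LLM_API_URL is required");
  new URL(apiUrl); // throws on malformed input, so bad endpoints fail at boot

  // Assumed rule: the tier-specific runtimes default to the global one.
  const runtime = parseRuntime("LLM_RUNTIME", env.LLM_RUNTIME, "llamacpp");
  return {
    apiUrl,
    runtime,
    cpuRuntime: parseRuntime("LLM_RUNTIME_CPU", env.LLM_RUNTIME_CPU, runtime),
    gpuRuntime: parseRuntime("LLM_RUNTIME_GPU", env.LLM_RUNTIME_GPU, runtime),
  };
}
```

Passing the environment in as an argument (for example `parseLlmEnv(process.env)`) keeps the parser testable and lets the rest of the app depend only on the typed `LlmConfig`.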

Rights follow capability. The small model answers from public knowledge (RAG over wagmi-skills.md, ai.txt, blog-derived facts). The large model will earn broader tools—and stricter oversight—as we add email, calendar, documents, and internal agents.

2. Stack at a glance: who does what

Layer	Tool	Role
Secrets	Infisical (EU)	Single source of truth for dev, staging, prod—database URLs, Supabase keys, LLM endpoints, API keys. Local dev via `infisical run`; Koyeb and GitHub Actions sync from the same namespaces. No secrets in git.
Data	PostgreSQL + Drizzle ORM	Typed schema, migrations in `drizzle/`, transactional writes. Convenience without ORM magic strings.
Auth	Supabase	Email OTP (6-digit), SSR-safe session cookies. We use Supabase where it excels; the rest stays in our app.
Content	Content Collections + Zod	Blog in `content/blog/`; build-time Markdown, bilingual EN/FR.
Images	Docker Hub (`jeanbapt/…`)	Registry for the unified web image (Next.js + CPU llama-server + baked GGUF) and separate GPU images (~10 GB). Koyeb pulls via a private registry secret.
Model factory	Hugging Face	Private GGUF repos, SFT datasets (`datasets/wagmi-sft/`), training and eval on Hub—where we cook the Wagmi models (`wagmi-sft`, `wagmi-sft-14b`).
Staging runtime	Koyeb	Full-stack validation: Standard xlarge, scale-to-zero, same Docker image as a production fallback.
Production front	Cloudflare Pages	Static export + Pages Functions proxy `/api/*` and auth routes to Koyeb.
Production API / AI	Koyeb	eco-xlarge, always-on `web` + scale-to-zero GPU service for auth tier.
CI/CD	GitHub Actions	Lint, tests, Docker build/push, deploy workflows, release gate, AI Act gate (see below).
IDE	Cursor	AI-assisted implementation; architecture and review remain human.

3. Two-tier intelligence: cost, speed, and when to “go large”

Anonymous visitor                Authenticated user
        │                                │
        ▼                                ▼
  Small model (1.5B)               Large model (14B)
  CPU, loopback llama-server       GPU service, scale-to-zero
  Fast, ~$0 marginal GPU           Wake + deeper context
  BM25 RAG grounding               CPU-first reply, then GPU upgrade

Anonymous tier: wagmi-sft on a co-located llama-server (127.0.0.1:11434), so there is no network hop and first-token latency is predictable after warm-up.
Auth tier: wagmi-sft-14b on a dedicated GPU image; LLM_GPU_WAKE_URL wakes the scale-to-zero service from the public edge when the private mesh alone is insufficient.
Degraded path: LLM_MODEL_AUTH_FALLBACKS and CPU-first chat (CHAT_AUTH_CPU_FIRST) keep the product usable if the GPU is cold or down.

We cap maxTokens (300 small / 1024 large) because tiny models often fail to emit EOS and repeat forever. A behavioural benchmark (scripts/benchmark-rag-qwen15b.ts) blocks shipping a small model that hallucinates or ignores auth nudges.
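The tier rules and token caps above can be sketched as one small routing function. The model aliases and the 300 / 1024 caps come from the text; the `gpuReady` flag, the type names, and `chooseModel` are hypothetical. The sketch only models the degraded fallback, not the full "CPU-first reply, then GPU upgrade" flow.

```typescript
// Hypothetical routing sketch; not the site's actual implementation.
interface ChatContext {
  authenticated: boolean;
  gpuReady: boolean; // assumption: result of some health or wake probe
}

interface ModelChoice {
  model: "wagmi-sft" | "wagmi-sft-14b";
  tier: "cpu" | "gpu";
  maxTokens: number;
  degraded: boolean; // true when an authenticated user is served by the CPU fallback
}

function chooseModel(ctx: ChatContext): ModelChoice {
  const small = { model: "wagmi-sft", tier: "cpu", maxTokens: 300 } as const;
  const large = { model: "wagmi-sft-14b", tier: "gpu", maxTokens: 1024 } as const;

  // Anonymous traffic never touches the GPU.
  if (!ctx.authenticated) return { ...small, degraded: false };
  // Authenticated traffic uses the large model once the GPU service is awake.
  if (ctx.gpuReady) return { ...large, degraded: false };
  // GPU cold or down: stay usable on the small model and flag the downgrade.
  return { ...small, degraded: true };
}
```

Returning the cap together with the model keeps the two in lockstep, so a caller cannot pair the small model with the large model's token budget by accident.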

Why we did not split CPU LLM into a per-session micro-service on Koyeb: platform scale-to-zero is idle-traffic-based (minutes), not “chat session ended”; splitting would add a second cold start without matching session granularity. The monolithic web image (~1.5 GB) is the pragmatic trade-off for anonymous UX on scale-to-zero staging.

4. Protecting the environment: Infisical

Before features, we unified configuration:

Project linked in .infisical.json; CLI against EU (https://eu.infisical.com).
Environments dev, staging, prod mirror how we think about risk, instead of ad-hoc .env files on each laptop.
Placeholder scripts (scripts/infisical/placeholders-*.env) document required keys; pnpm run infisical:push-placeholders seeds empty shapes without committing values.
pnpm run dev wraps Next.js in infisical run, so local configuration stays at parity with staging.

GitHub Actions and Koyeb receive only what each surface needs (registry tokens, API tokens). Runtime LLM and database URLs live on the Koyeb service env, synced from Infisical prod; they are not re-injected on every deploy, because doing so would wipe the full env list.

5. Data and backend (concise)

PostgreSQL via Drizzle: schema in src/lib/db/schema.ts, migrations under drizzle/, pooling URL from Infisical, withTransaction for writes.

Supabase: OTP auth, cookie sessions (HttpOnly, Secure, SameSite). Chat persistence only after explicit consent; anonymous chats are not stored.

APIs (App Router): streaming POST /api/chat, health, LLM status, contact classification, auth callback. Contact is chat-only—no separate form.

Local RAG (local-rag.ts): BM25-style token overlap over verified chunks—no vector DB for the small tier. SFT pipeline: scripts/generate-wagmi-sft-dataset.ts → datasets/wagmi-sft/ for Hub training.
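To make "BM25-style token overlap" concrete, here is a minimal sketch of standard Okapi BM25 over in-memory chunks. It is not the contents of local-rag.ts: the tokenizer, the `Chunk` shape, and the common default parameters k1 = 1.2 and b = 0.75 are all assumptions.

```typescript
// Minimal Okapi BM25 sketch over in-memory chunks (illustrative only).
interface Chunk {
  id: string;
  text: string;
}

// Lowercase, strip diacritics (useful for EN/FR content), keep alphanumeric runs.
function tokenize(text: string): string[] {
  return (
    text
      .toLowerCase()
      .normalize("NFD")
      .replace(/[\u0300-\u036f]/g, "")
      .match(/[a-z0-9]+/g) ?? []
  );
}

function bm25Rank(query: string, chunks: Chunk[], k1 = 1.2, b = 0.75): { id: string; score: number }[] {
  const docs = chunks.map((chunk) => ({ chunk, tokens: tokenize(chunk.text) }));
  const avgLen = docs.reduce((sum, d) => sum + d.tokens.length, 0) / Math.max(docs.length, 1) || 1;

  // Document frequency: in how many chunks each term appears.
  const df = new Map<string, number>();
  for (const d of docs) {
    for (const term of new Set(d.tokens)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  const queryTerms = [...new Set(tokenize(query))];
  return docs
    .map(({ chunk, tokens }) => {
      const tf = new Map<string, number>();
      for (const term of tokens) tf.set(term, (tf.get(term) ?? 0) + 1);

      let score = 0;
      for (const term of queryTerms) {
        const f = tf.get(term) ?? 0;
        if (f === 0) continue;
        const n = df.get(term) ?? 0;
        const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
        score += (idf * f * (k1 + 1)) / (f + k1 * (1 - b + (b * tokens.length) / avgLen));
      }
      return { id: chunk.id, score };
    })
    .filter((result) => result.score > 0)
    .sort((x, y) => y.score - x.score);
}
```

For a corpus of a few hundred verified chunks, recomputing these statistics per query is cheap, which is one reason a lexical scorer can stand in for a vector database on the small tier.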

6. Build, ship, and run

Docker Hub hosts immutable tags (staging-<sha>, prod-<sha>). Multi-stage Dockerfile: Node 24 bookworm-slim, non-root nextjs, standalone Next output, llama-server + baked wagmi-sft GGUF (see docs/ci-gguf-registry.md). GPU images build on their own cadence when docker/llamacpp-wagmi-gpu/** changes—not on every copy tweak.

Hybrid production:

Cloudflare Pages: out/ static assets, global CDN, DDoS edge—ideal for marketing and blog.
Koyeb: Node APIs, SSE chat, Postgres, co-located CPU inference; the same image could take over from the Pages front within hours.

Staging: push to dev or manual workflow → tests → Docker push → Koyeb deal-ex-machina-staging/web, min-scale 0, health grace 360 s for cold VM + model warm-up.

7. Model factory, quality gates, and EU AI Act

Hugging Face is where models are trained, versioned, and stored as private GGUF artifacts. CI fetches weights into a BuildKit context or uses a companion blob image so deploys do not re-download multi-GB files from scratch every time.

In production today:

Lint, unit/integration/security tests, type-check, Playwright E2E.
Release gate (scripts/release/release-gate.mjs): health, chat SSE rounds, optional Lighthouse.
AI Act gate (scripts/release/ai-act-gate.mjs): wired in deploy-production.yml—checks model family, llamacpp runtime, and evidence URLs for red-team / tool-eval artifacts when switching families (e.g. LFM2). We are extending this into a fuller evaluation loop (automated eval sets, artifact retention, stricter no-go).
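The three checks named for the AI Act gate (model family, llamacpp runtime, evidence when switching families) could look roughly like the function below. This is not the content of ai-act-gate.mjs: the input shape, the family identifiers, and the HTTPS-only evidence rule are assumptions chosen for illustration.

```typescript
// Hypothetical gate sketch; the real script's inputs and rules may differ.
interface GateInput {
  previousFamily: string;
  candidateFamily: string;
  runtime: string;
  evidenceUrls: string[]; // red-team / tool-eval artifacts
}

interface GateResult {
  ok: boolean;
  reasons: string[];
}

function aiActGate(input: GateInput, allowedFamilies: string[]): GateResult {
  const reasons: string[] = [];

  if (!allowedFamilies.includes(input.candidateFamily)) {
    reasons.push(`model family "${input.candidateFamily}" is not on the allowlist`);
  }
  if (input.runtime !== "llamacpp") {
    reasons.push(`runtime "${input.runtime}" is not llamacpp`);
  }

  // Evidence is only demanded when the deploy changes model family.
  const switching = input.previousFamily !== input.candidateFamily;
  const evidence = input.evidenceUrls.filter((url) => url.startsWith("https://"));
  if (switching && evidence.length === 0) {
    reasons.push("family switch requires at least one evidence URL");
  }

  return { ok: reasons.length === 0, reasons };
}
```

Collecting every failure reason, instead of stopping at the first, gives the deploy log a complete no-go explanation in one run.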

Roadmap—always with user protection:

Continuous training on new blog content, feedback signals, and eval regressions—never “ship weights because Friday.”
User feedback loop: in-chat reporting when Wagmi is wrong, off-brand, or unsafe; feeds dataset hygiene and incident review.
Graduated agency for the large model: email drafts, calendar holds, document retrieval, orchestration of internal agents—each capability gated by role, audit logs, and AI Act documentation. The small model stays read-only over public facts; the large model’s powers scale with accountability.

8. Guardrails: security, privacy, quality

Security (production headers): CSP, HSTS, X-Frame-Options: DENY, nosniff, strict Referrer-Policy, Permissions-Policy without camera/mic/geo. CORS allowlist for mutating requests. Rate limits (e.g. chat 20 req / 15 min). Zod on all inputs; Drizzle only for SQL; profanity filter on chat. Health returns { "status": "ok" } only—no env leakage.
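The chat limit quoted above (20 requests per 15 minutes) can be sketched as a sliding-window limiter. The class below is an in-memory illustration under assumed names; the site's actual limiter, its key (IP, session, or user), and its storage are not described in this article.

```typescript
// Illustrative sliding-window rate limiter; state lives in process memory.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the hit if the key is still under its limit.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}

// 20 requests per 15 minutes, as quoted for chat.
const chatLimiter = new SlidingWindowLimiter(20, 15 * 60 * 1000);
```

Because the counters are per process, this form only suits a single always-on instance; several replicas would need a shared store.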

Privacy / GDPR: Data minimization; 7-day retention for consented chat; essential cookies only; EN/FR privacy pages and explicit consent checkbox before persistence.
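The retention rule can be expressed as a small purge selector. The 7-day figure and the consent requirement come from the text; the `StoredChat` shape and function names are hypothetical, and the real cleanup presumably runs as a database query.

```typescript
// Hypothetical retention sketch for consented chat history.
const RETENTION_DAYS = 7; // from the article
const DAY_MS = 24 * 60 * 60 * 1000;

interface StoredChat {
  id: string;
  consented: boolean;
  createdAt: Date;
}

function isExpired(chat: StoredChat, now: Date): boolean {
  return now.getTime() - chat.createdAt.getTime() > RETENTION_DAYS * DAY_MS;
}

// Rows without consent should never have been stored, so they are purged too.
function selectForDeletion(chats: StoredChat[], now: Date): string[] {
  return chats.filter((chat) => !chat.consented || isExpired(chat, now)).map((chat) => chat.id);
}
```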

EU AI Act: Transparency (users know they talk to AI); limited-risk posture; no prohibited practices; terms and privacy reference AI-generated content limits; incident path via contact. Technical documentation and human oversight expand as tool use grows.

Quality: Biome, Vitest, Playwright, Lighthouse CI on PRs; frozen lockfile in CI; Dependabot; supply-chain pre-commit guard on workflows.

9. Economics: running for under $100/month

We target under ~$100/month for the public site + staging bursts + occasional GPU—not “infinite scale” pricing, but controlled sovereignty:

Item	Typical cost	Notes
Koyeb web (prod)	~$43/mo	`eco-xlarge` (8 GB), `min=1`—always-on CPU + small model resident
Koyeb GPU (prod)	$0–30/mo variable	Scale-to-zero; ~$0.50/h class GPU only when auth users wake it
Koyeb staging	$0–15/mo	Standard xlarge, `min=0`; pay per second when awake
Cloudflare Pages	$0	Within free tier for static + Functions proxy
Supabase + Postgres	$0–25/mo	Free/low tiers for OTP + modest DB; pooler URL in Infisical
Infisical	$0	Team/dev tier for secret management
Docker Hub + GitHub	$0	Public/private repos within free allowances
Hugging Face	$0	Hub storage + CI fetch; training jobs episodic

Baseline always-on: roughly $45–55/month (prod web + DB). Peak month with GPU wake traffic and staging deploys still often stays under $100 because GPU and staging are seconds-priced, not 24/7. The design choice—small model for everyone, big model only after auth—is what makes that envelope possible.
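The envelope can be checked by summing the table's ranges. The figures below are copied from the table; the helper names are illustrative.

```typescript
// Monthly cost envelope from the table above (USD).
interface CostLine {
  item: string;
  min: number;
  max: number;
}

const costLines: CostLine[] = [
  { item: "Koyeb web (prod)", min: 43, max: 43 },
  { item: "Koyeb GPU (prod)", min: 0, max: 30 },
  { item: "Koyeb staging", min: 0, max: 15 },
  { item: "Cloudflare Pages", min: 0, max: 0 },
  { item: "Supabase + Postgres", min: 0, max: 25 },
  { item: "Infisical", min: 0, max: 0 },
  { item: "Docker Hub + GitHub", min: 0, max: 0 },
  { item: "Hugging Face", min: 0, max: 0 },
];

function envelope(lines: CostLine[]): { min: number; max: number } {
  return {
    min: lines.reduce((sum, line) => sum + line.min, 0),
    max: lines.reduce((sum, line) => sum + line.max, 0),
  };
}

// How many GPU hours a given budget buys at an hourly rate.
function gpuHours(budgetUsd: number, hourlyUsd: number): number {
  return budgetUsd / hourlyUsd;
}

const total = envelope(costLines); // { min: 43, max: 113 }
const hoursAtCeiling = gpuHours(30, 0.5); // 60 hours at ~$0.50/h
```

The floor is $43 and the theoretical ceiling is $113, so the sub-$100 target relies on the variable lines (GPU, staging, database tier) not all peaking in the same month. At roughly $0.50/h, the $30 GPU ceiling corresponds to about 60 awake hours.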

10. Runtimes and front end (summary)

Node 24, TypeScript 5.9 strict, Next.js 16 App Router (standalone + static export), React 19, Tailwind, Radix, next-intl EN/FR, Assistant UI + Vercel AI SDK. Critical CSS, lazy chat bundle, AVIF/WebP, JSON-LD, sitemaps, llms.txt / ai.txt. Performance targets: Lighthouse performance score ~96, CLS 0, aggressive code splitting; details in docs/PERFORMANCE_OPTIMIZATIONS_FINAL_STATUS.md.

11. What we explicitly avoid

No WordPress/Wix/low-code CMS for the core site. No secrets in git or client bundles. No arbitrary any. No shipping without CI lint, tests, and type-check. No health endpoint introspection in production. No third-party analytics cookies.

12. Why the web is still the demo

This stack demonstrates that a consultancy-grade AI surface can be:

Frugal (tiered models, scale-to-zero GPU, edge static),
Sovereign (our code, our weights, swappable runtimes),
Governed (Infisical, GDPR, AI Act gate, forthcoming feedback loop),
and Affordable (a sub-$100/month operational envelope).

Inspect the repo and the running site: they are the deliverable. The roadmap does not replace that discipline—it extends it, one measured step at a time, with more power for the large model only where responsibility keeps pace.

References

[1] Cursor
[2] Node.js
[3] TypeScript
[4] Next.js
[5] PostgreSQL
[6] Drizzle ORM
[7] Supabase
[8] Content Collections
[9] Zod
[10] Vercel AI SDK
[11] Assistant UI
[12] Infisical
[13] Docker Hub
[14] Hugging Face
[15] Koyeb
[16] Cloudflare Pages
[17] GitHub Actions
[18] Biome · Vitest · Playwright · Lighthouse CI

Stack summary: Infisical (secrets) · Supabase + Drizzle + Postgres (auth/data) · Docker Hub (images) · Hugging Face (Wagmi models) · Koyeb (API + CPU/GPU inference) · Cloudflare Pages (production static) · GitHub Actions (CI, release gate, AI Act gate) · dual-tier LLM · target ops under ~$100/month.