
How we reduced AI reply costs by 94% without hurting quality

We rebuilt the reply pipeline from scratch: 25 templates, a Redis cache, and Groq for the hard cases. Here's exactly how the math works.

May 15, 2026 · 8 min read · ReviewBox Engineering

When we launched ReviewBox, every AI reply draft went straight to Groq. One review in, one Groq call out. It was simple, it worked, and it cost roughly $0.003 per reply, which sounds cheap until you run the numbers at scale.

At 1,000 replies/day that's $90/month. At 10,000 it's $900/month. For a product where AI replies are a core feature included in every plan, that margin problem gets worse the more successful you become. We needed a better architecture.

The result

94% cost reduction
~70% served from templates
~15% served from cache
~15% hit Groq

The problem with naive AI

Most of our reviews are not unique. “Great app!”, “Love this”, “Keeps crashing after the update”: the bulk of app store reviews is heavily repetitive. Sending each one to an LLM is like hiring a novelist to write birthday card messages. The model is massively over-qualified for most of the work.

We audited 10,000 reviews across our beta customers. 68% had a clear match to one of 20 templates we could write by hand. Another 12% were near-identical to a review we'd seen in the last 7 days. Only ~20% actually needed fresh generation.

That audit defined our architecture.

Tier 1: 25 templates (0 tokens)

We wrote 25 reply templates covering the most common review patterns: crash reports, billing disputes, 5-star reviews, feature requests, login issues, performance complaints, release regressions, and more. Each template has 2–3 variants, and the variant is chosen deterministically based on the review text length and rating: no randomness, no AI.
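As a rough illustration of deterministic variant selection, the sketch below maps review length and rating to a variant index. The post does not give the actual formula, so `pickVariant` and its modulo rule are assumptions. The only property it is meant to show is that the same review always gets the same variant.

```typescript
// Hypothetical helper: the real selection rule is not described in this post.
// Assumption: combine text length and rating, then wrap around the variant count.
function pickVariant<T>(variants: readonly T[], text: string, rating: number): T {
  if (variants.length === 0) {
    throw new Error("template has no variants");
  }
  const index = (text.length + rating) % variants.length;
  return variants[index];
}

// Placeholder variant texts, for illustration only.
const crashVariants = ["variant A", "variant B", "variant C"];
const chosen = pickVariant(crashVariants, "Keeps crashing after the update", 1);
console.log(chosen);
```

Because the index depends only on the inputs, re-running the pipeline on the same review never flips the reply, which keeps drafts stable across retries.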

Matching is handled by a rules engine (src/lib/rules-engine.ts) that runs regex patterns against the review text. The patterns are simple but surprisingly effective:

Tag detection patterns

crash:    /crash|force.?close|stops.?working|keeps.?crashing/i
billing:  /charge|payment|refund|subscription|money|paid/i
login:    /log.?in|sign.?in|password|can't.?access|locked.?out/i
perf:     /slow|lag|freeze|hang|battery|drain|memory/i

A 5-star review with no issue tags → positive_5star_short or positive_5star_detailed template. A 1-star review with a crash keyword → crash_critical template. No API call. No latency. Cost: $0.
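The tag patterns and the two routing examples above could be wired together roughly as follows. This is a minimal sketch, not the contents of src/lib/rules-engine.ts: the function names, the `SHORT_REVIEW_MAX_CHARS` threshold, and the fall-through behaviour are assumptions. The regexes and template names are taken from this post.

```typescript
type Tag = "crash" | "billing" | "login" | "perf";

// Patterns copied from the tag detection list in this post.
const TAG_PATTERNS: Record<Tag, RegExp> = {
  crash: /crash|force.?close|stops.?working|keeps.?crashing/i,
  billing: /charge|payment|refund|subscription|money|paid/i,
  login: /log.?in|sign.?in|password|can't.?access|locked.?out/i,
  perf: /slow|lag|freeze|hang|battery|drain|memory/i,
};

// Assumed cut-off between the "short" and "detailed" 5-star templates.
const SHORT_REVIEW_MAX_CHARS = 80;

function detectTags(text: string): Tag[] {
  return (Object.keys(TAG_PATTERNS) as Tag[]).filter((tag) =>
    TAG_PATTERNS[tag].test(text),
  );
}

// Returns a template id, or null when the review should fall through to the cache tier.
function pickTemplate(text: string, rating: number): string | null {
  const tags = detectTags(text);
  if (rating === 5 && tags.length === 0) {
    return text.length <= SHORT_REVIEW_MAX_CHARS
      ? "positive_5star_short"
      : "positive_5star_detailed";
  }
  if (rating === 1 && tags.includes("crash")) {
    return "crash_critical";
  }
  return null; // only two of the 25 templates are sketched here
}

console.log(pickTemplate("Great app!", 5));
console.log(pickTemplate("Keeps crashing after the update", 1));
```

One caveat with unanchored patterns: `perf` also matches the "hang" inside "change". If false positives like that show up, adding `\b` word boundaries is one way to tighten the match.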

We verified quality by having three team members independently rate 200 template-matched replies against AI-generated ones. The templates scored 4.1/5 vs 4.3/5 for AI, close enough to be indistinguishable in practice, especially for the high-volume patterns where tone consistency matters more than creativity.

Tier 2: Redis reply cache (0 tokens)

For reviews that don't match a template, we check a Redis cache before touching an LLM. The cache key is a SHA-256 hash of the first 200 characters of the review text, the rating, and the workspace's tone setting:

key = "reply_cache:" + SHA256(text.slice(0,200) + rating + tone)
TTL = 604800  // 7 days
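In TypeScript on Node.js, the key derivation above might look like the sketch below. It uses `createHash` from the built-in `node:crypto` module. The function name `replyCacheKey` is an assumption, and the fields are concatenated without separators only because that is how the pseudocode is written.

```typescript
import { createHash } from "node:crypto";

const CACHE_TTL_SECONDS = 7 * 24 * 60 * 60; // 604800, i.e. 7 days

// Hypothetical helper mirroring the pseudocode: hash the first 200 characters,
// the rating, and the tone into one SHA-256 hex digest.
function replyCacheKey(text: string, rating: number, tone: string): string {
  const material = text.slice(0, 200) + rating + tone;
  const digest = createHash("sha256").update(material, "utf8").digest("hex");
  return "reply_cache:" + digest;
}

const key = replyCacheKey(
  "The app is unusable since the last update, please fix asap",
  1,
  "Professional",
);
console.log(key.length); // 12-character prefix + 64 hex characters
```

Two consequences follow from this key design. Reviews that differ only after the 200th character share a cache entry, and any whitespace or casing difference inside the first 200 characters produces a miss, since the pseudocode shows no normalisation step.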

When a user runs AI on “The app is unusable since the last update, please fix asap” and we generate a reply, that reply is cached. The next time a review with the same text, rating, and tone hits the endpoint (or the same customer asks us to regenerate), it gets the cached reply in ~5ms. Groq never sees it.

About 40–60% of non-template reviews hit the cache within their 7-day window. That's because app store reviews cluster: the same complaints appear in waves, especially after releases.
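The lookup flow described in this tier is a standard cache-aside pattern. The sketch below shows its shape under some assumptions: `ReplyCache`, `InMemoryReplyCache`, and `getReply` are hypothetical names, and the in-memory map only stands in for Redis so the example runs without a server. The `source` field mirrors the source indicator shown in the draft dialog.

```typescript
type ReplySource = "cache" | "groq";

// Minimal cache interface; a Redis-backed implementation would satisfy the same shape.
interface ReplyCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for Redis, with per-entry expiry.
class InMemoryReplyCache implements ReplyCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

const CACHE_TTL_SECONDS = 604800; // 7 days

// Cache-aside: return the cached reply if present, otherwise generate and store it.
async function getReply(
  cacheKey: string,
  cache: ReplyCache,
  generate: () => Promise<string>,
): Promise<{ reply: string; source: ReplySource }> {
  const cached = await cache.get(cacheKey);
  if (cached !== null) {
    return { reply: cached, source: "cache" };
  }
  const reply = await generate(); // the LLM call happens only on a miss
  await cache.set(cacheKey, reply, CACHE_TTL_SECONDS);
  return { reply, source: "groq" };
}
```

Passing `generate` as a callback keeps the LLM client out of the cache logic, so the same function can be exercised in tests with a stub.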

Tier 3: Compressed Groq prompt

The reviews that make it to Groq get a pre-processed, compressed input. Before sending, we run the review text through compressReviewText(), which strips 30 common filler phrases:

Stripped phrases (sorted longest-first)

“I've been using”, “to be honest”, “just wanted to say”, “first of all”, “for what it's worth”, “in my opinion”, “I'm writing this review”, “long story short”, “all in all”, “needless to say” ...

Input: “I've been using this app for 3 months. Honestly the app keeps crashing.”
Output: “app keeps crashing.”
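A simplified sketch of the phrase-stripping step is shown below. It is not the real `compressReviewText()`: it removes only the ten phrases listed above (the full list has 30), and it will not reproduce the example output, which also drops text that is not in the visible list. The longest-first sort ensures that a longer phrase wins over any shorter phrase it contains.

```typescript
// Only the phrases quoted in this post; the remaining entries are not listed there.
const FILLER_PHRASES: string[] = [
  "I've been using",
  "to be honest",
  "just wanted to say",
  "first of all",
  "for what it's worth",
  "in my opinion",
  "I'm writing this review",
  "long story short",
  "all in all",
  "needless to say",
];

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Longest-first alternation, case-insensitive, applied globally.
const FILLER_PATTERN = new RegExp(
  [...FILLER_PHRASES]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|"),
  "gi",
);

function compressReviewText(text: string): string {
  return text
    .replace(FILLER_PATTERN, "") // drop filler phrases
    .replace(/\s+([,.!?])/g, "$1") // no space before punctuation
    .replace(/^[\s,]+/, "") // drop leading commas/spaces left behind
    .replace(/\s{2,}/g, " ") // collapse repeated whitespace
    .trim();
}

console.log(compressReviewText("To be honest, the app keeps crashing."));
```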

We also reduced the system prompt from ~245 tokens (it included 3 KB of knowledge base context) to ~40 tokens. The knowledge base now sends at most 1 entry, trimmed to 80 characters.
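The knowledge base limit can be expressed in a few lines. In this sketch, `trimKnowledgeBase` is a hypothetical name, and the entries are assumed to arrive already ordered by relevance, since the post does not say how the single entry is chosen.

```typescript
const MAX_KB_ENTRIES = 1;
const MAX_KB_ENTRY_CHARS = 80;

// Keep at most one entry and cut it to 80 characters before it goes into the prompt.
function trimKnowledgeBase(entries: readonly string[]): string[] {
  return entries
    .slice(0, MAX_KB_ENTRIES)
    .map((entry) => entry.trim().slice(0, MAX_KB_ENTRY_CHARS));
}

const context = trimKnowledgeBase([
  "Crash on launch after v4.2 is fixed in v4.2.1; ask users to update from the store and restart.",
  "Refunds are handled through the store, not in-app.",
]);
console.log(context.length, context[0].length);
```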

The combined compression reduces average input from ~500 tokens to ~230 tokens per call, a 54% reduction before we even start counting the tier-1 and tier-2 savings.

The full math

Metric	Before	After
Tokens per 100 requests	~50,000	~2,800
% requests hitting Groq	~100%	~15%
Tokens per Groq call	~500	~230
Cost per 1,000 replies	~$3.00	~$0.17
Total token reduction	n/a	94%
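The blended figure comes from multiplying the share of requests that reach Groq by the tokens per call. The sketch below reruns that arithmetic with the rounded values from the table. The function and constant names are illustrative only.

```typescript
const BASELINE_TOKENS_PER_CALL = 500;
const COMPRESSED_TOKENS_PER_CALL = 230;
const GROQ_SHARE_AFTER = 0.15; // rounded share from the table

// Tokens consumed per 100 incoming requests, given the share that reaches the LLM.
function tokensPer100Requests(llmShare: number, tokensPerCall: number): number {
  return 100 * llmShare * tokensPerCall;
}

const before = tokensPer100Requests(1, BASELINE_TOKENS_PER_CALL);
const after = tokensPer100Requests(GROQ_SHARE_AFTER, COMPRESSED_TOKENS_PER_CALL);
const reduction = 1 - after / before;

// Share of requests that would reproduce the table's ~2,800 tokens exactly.
const impliedShare = 2800 / (100 * COMPRESSED_TOKENS_PER_CALL);

console.log({ before, after, reduction, impliedShare });
```

With the rounded inputs this gives about 3,450 tokens per 100 requests, a reduction of roughly 93%. The table's ~2,800 tokens and 94% correspond to a Groq share nearer 12%, so the ~15% figure is presumably rounded up from the measured value.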

What we gave up

Template replies are less personalised. A crash reply for “I love the app but it keeps crashing on my Pixel 8” will not reference the Pixel 8; it will give a generic crash acknowledgement. For most users that's fine. For high-value power users who write detailed reviews, we recommend using the Groq tier manually by clicking “Regenerate with AI” in the draft dialog.

Changing tone resets the cache. The cache key includes the tone setting, so if you switch your AI tone from “Professional” to “Friendly” mid-week, replies cached under the old tone are never served for the new one. The cost is a cold cache: every non-template review misses until new entries build up, and Groq generates fresh replies in the meantime. The old entries simply expire after 7 days.

What's next

We're exploring using Gemini 2.0 Flash (free tier: 1,500 requests/day) for sentiment analysis on ambiguous 3-star reviews, the only class where our rules engine is genuinely uncertain. Batching up to 50 reviews per Gemini call makes this essentially free.
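To put the batching claim in numbers, here is a small sketch of the chunking step and the resulting daily ceiling. The feature is still exploratory, so `chunk` and the constants are illustrative, and no Gemini API call is shown.

```typescript
const MAX_REVIEWS_PER_CALL = 50;
const FREE_TIER_REQUESTS_PER_DAY = 1500;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Upper bound on reviews analysed per day if every call carries a full batch.
const maxReviewsPerDay = MAX_REVIEWS_PER_CALL * FREE_TIER_REQUESTS_PER_DAY;

const reviews = Array.from({ length: 120 }, (_, i) => `review ${i + 1}`);
const batches = chunk(reviews, MAX_REVIEWS_PER_CALL);
console.log(batches.map((b) => b.length), maxReviewsPerDay);
```

At 50 reviews per call, the 1,500 daily requests cover up to 75,000 reviews, which is why a feature limited to ambiguous 3-star reviews is unlikely to leave the free tier.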

We're also considering adding a Tier 4: fine-tuned replies based on edits users make to AI drafts. When you edit a draft before publishing, we store the delta. Over time, those deltas become a fine-tuning dataset. The goal is to eventually get the template hit rate from 70% to 85%.

If you want to see the pipeline in action, start a free trial. The source indicator in the AI draft dialog tells you whether your reply came from a template, cache, or Groq.

