| title | date | draft |
|---|---|---|
| From Gemini-3-Flash to T5-Gemma-2: A Journey in Distilling a Family Finance LLM | 2025-12-27 | false |
Running a family finance system is surprisingly complex. What starts as a simple spreadsheet often evolves into a web of rules, exceptions, and "wait, was this dinner or vacation dinner?" questions.
For years, I relied on a rule-based system to categorize our credit card transactions. It worked... mostly. But maintaining `if "UBER" in description and amount > 50` style rules is a never-ending battle against the entropy of merchant names and changing habits.
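For flavor, the old system looked roughly like this (a reconstructed sketch, not the actual code; the category names are illustrative):

```python
# Reconstructed sketch of the old rule-based categorizer (illustrative, not the real code).
# Every new merchant or changed habit meant another hand-written branch.
def categorize(description: str, amount: float) -> str:
    desc = description.upper()
    if "UBER" in desc and amount > 50:
        return "Travel"
    if "UBER" in desc:
        return "Transportation"
    if "MCDONALDS" in desc:
        return "Dining"  # ...unless it was a road-trip stop
    return "Uncategorized"
```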
Recently, I decided to modernize this stack using Large Language Models (LLMs). This post details the technical journey from using an off-the-shelf commercial model to distilling that knowledge into a small, efficient local model (google/t5gemma-2-270m) that runs on my own hardware while maintaining high accuracy.
Phase 1: The Proof of Concept with Commercial LLMs
My first step was to replace the spaghetti code of regex rules with a prompt. I used Gemini-3-Flash (via litellm) as my categorization engine.
The core challenge was context. A transaction like `MCDONALDS` could be:
- Dining: A quick lunch during work.
- Travel-Dining: A meal while on a road trip.
To solve this, I integrated my private Google Calendar (via .ics export). The prompt doesn't just see the transaction; it sees where I was and what I was doing on that day.
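Concretely, the pipeline looks up the calendar events that overlap a transaction's date and injects their summaries into the prompt. A minimal sketch, assuming the `icalendar` package and a hypothetical `calendar.ics` export:

```python
from datetime import date
from icalendar import Calendar  # assumes the `icalendar` package

def events_on(ics_path: str, day: date) -> list[str]:
    """Return summaries of calendar events that cover the given day."""
    with open(ics_path, "rb") as f:
        cal = Calendar.from_ical(f.read())

    summaries = []
    for event in cal.walk("VEVENT"):
        start, end = event.get("DTSTART").dt, event.get("DTEND").dt
        # All-day events yield dates, timed events yield datetimes; normalize to dates.
        start_d = start.date() if hasattr(start, "date") else start
        end_d = end.date() if hasattr(end, "date") else end
        if start_d <= day <= end_d:
            summaries.append(str(event.get("SUMMARY")))
    return summaries

# e.g. events_on("calendar.ics", some_trip_day) might return ["Trip: 34TH ARCH CANYON 2025"]
```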
The "God Prompt"
The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., Dining, Travel, Bills) and Sub-Categories (e.g., Travel -> Accommodation).
```json
{
  "Category": "Travel",
  "Travel Category": "Dining",
  "Reasoning": "User is on 'Trip: 34TH ARCH CANYON 2025', distinguishing this from regular dining."
}
```
This worked well. The "Reasoning" field even gave me explanations for why it flagged something as Entertainment vs Shopping. But relying on an external API for every single transaction felt like overkill for a personal project, and I wanted to own the stack.
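For reference, the call itself is only a few lines with litellm's OpenAI-style `completion` API; a minimal sketch (the model string, prompt, and helper names are placeholders):

```python
import json
import litellm

SYSTEM_PROMPT = "You are a transaction categorizer. Respond with strict JSON ..."  # abbreviated

def categorize_with_teacher(transaction: str, calendar_events: list[str]) -> dict:
    response = litellm.completion(
        model="gemini/gemini-3-flash",  # placeholder model string
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Transaction: {transaction}\n"
                f"Calendar events that day: {', '.join(calendar_events) or 'none'}"
            )},
        ],
        response_format={"type": "json_object"},  # ask for strict JSON
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```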
Phase 2: Distilling Knowledge
I wanted to train a smaller model to mimic Gemini's performance. But I didn't want to manually label thousands of transactions.
Consistency Filtering
I had a massive CSV of historical transactions (years of data). However, that data was "noisy"—some manual labels were outdated or inconsistent.
I built a Distillation Pipeline (distill_reasoning.py) that uses the Teacher Model (Gemini) to re-label the historical data. But here's the twist: I only added a data point to my training set if the Teacher's prediction matched the Historical Ground Truth.
```python
# Pseudo-code for consistency filtering
teacher_pred = gemini.categorize(transaction)
historical_label = row['Category']

if teacher_pred.category == historical_label:
    # High-confidence sample!
    training_data.append({
        "input": format_transaction(transaction),
        "output": teacher_pred.to_json(),
    })
else:
    # Discard: either history is wrong OR the teacher hallucinated.
    log_fail(transaction)
```
This filtered out the noise, leaving me with ~2,000 high-quality, "verified" examples where both the human (me, years ago) and the AI agreed.
Phase 3: Training the Little Guy
For the local model, I chose google/t5gemma-2-270m. This is a Seq2Seq model, which fits the "Text-to-JSON" task perfectly, and it's tiny (270M parameters), meaning it can run on almost anything.
The Stack
- Library: `transformers`, `peft`, `bitsandbytes`
- Technique: LoRA (Low-Rank Adaptation). I targeted all linear layers (`q_proj`, `k_proj`, `v_proj`, etc.) with `r=16` (see the sketch below).
- Optimization: `AdamW` with linear decay.
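Wiring that up with `peft` looks roughly like the following (a sketch using the standard `transformers`/`peft` APIs; `lora_alpha` and the exact list of target modules are assumptions on my part):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_id = "google/t5gemma-2-270m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                 # low-rank dimension
    lora_alpha=32,        # assumed scaling factor
    lora_dropout=0.1,     # see Pitfall #3 below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # "all linear layers" (list assumed)
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```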
Pitfall #1: The "Loss is 0" Initial Panic
My first training run reported a loss of exactly 0.000 almost immediately. In deep learning, if it looks too good to be true, it's a bug.
It turned out to be a syntax error in the arguments I passed to the Trainer (or rather, to my custom loop). Once fixed, the loss looked "healthy"—starting high and decaying noisily.
Pitfall #2: Stability vs. Noise
The loss curve was initially extremely erratic. The batch size on my GPU was limited (Physical Batch Size = 4).
The Fix: I implemented Gradient Accumulation (accumulating over 8 steps) to simulate a batch size of 32. This smoothed out the optimization landscape significantly.
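In a custom loop, gradient accumulation is just scaling each micro-batch loss and stepping the optimizer every N batches; a minimal sketch (the loader, optimizer, and scheduler are assumed to exist):

```python
ACCUM_STEPS = 8  # 4 (physical batch) x 8 (accumulation) = effective batch size of 32

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    outputs = model(**batch)
    # Scale the loss so the accumulated gradient matches a true batch of 32.
    (outputs.loss / ACCUM_STEPS).backward()

    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        scheduler.step()       # linear decay
        optimizer.zero_grad()
```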

Pitfall #3: Overfitting
With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:
- Data Quality First: The "Consistency Filtering" phase was the most critical step. By discarding ambiguous samples where the teacher model disagreed with history, I prevented the model from memorizing noise.
- Model Regularization:
  - LoRA Dropout: I set `lora_dropout=0.1`, randomly dropping 10% of the trainable adapter connections during training to force robust feature learning.
  - Gradient Clipping: I capped the gradient norm at `1.0`. This prevents the "exploding gradient" problem and keeps weight updates stable.
  - AdamW: Using the AdamW optimizer adds decoupled weight decay, implicitly penalizing overly complex weights.
I also set up a rigorous evaluation loop (10% validation split, eval every 50 steps) to monitor the Train Loss vs Eval Loss in real-time. The final curves showed them tracking downwards together, confirming generalization.
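Hooked into that same loop, the clipping and periodic evaluation amount to a few lines (a sketch; `eval_loader` and the step bookkeeping are assumed):

```python
import torch

EVAL_EVERY = 50  # optimizer steps between evaluations

@torch.no_grad()
def eval_loss(model, eval_loader) -> float:
    """Average loss over the 10% held-out validation split."""
    model.eval()
    total = sum(model(**batch).loss.item() for batch in eval_loader)
    model.train()
    return total / len(eval_loader)

# Inside the accumulation step from Pitfall #2, just before optimizer.step():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# and every EVAL_EVERY optimizer steps, log eval_loss(model, eval_loader)
# next to the running train loss to watch for divergence.
```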
Phase 4: Results and The "Travel" Edge Case
The distilled model is surprisingly capable. It learned the JSON schema very well. Although I included a regex fallback in the inference script as a safety net, the model generates valid JSON the vast majority of the time.
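The inference path is a plain `generate` call followed by a defensive parse; a sketch of the fallback idea (the regex is illustrative, not the exact one from my script):

```python
import json
import re

def generate_category(model, tokenizer, prompt: str) -> dict:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    try:
        return json.loads(text)  # happy path: the model emits valid JSON
    except json.JSONDecodeError:
        # Safety net: grab the outermost {...} span and try again.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```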
Head-to-Head: Local Model vs Gemini-Flash
I ran a blind evaluation on 20 random unseen transactions.
- Gemini-3-Flash Accuracy: 90% (18/20)
- Local T5-Gemma-2 Accuracy: 85% (17/20)
The gap is surprisingly small. In fact, the local model sometimes outperformed the API because it was fine-tuned on my specific data distribution.
Win for Local Model:
- Transaction: `XX RANCH #1702`
- Local Prediction: `Groceries` (Correct)
- API Prediction: `Gas` (Incorrect)
- Local Reasoning: "XX RANCH refers to a well-known supermarket chain."
- API Reasoning: "XX RANCH is a known convenience store and gas station chain."

Analysis: The local model "knows" (from training data) that XX Ranch is an Asian grocery store I frequent, whereas the general-purpose API assumed it was a gas station based on the name pattern.
Win for API (World Knowledge):
- Transaction: `LOVE'S #0792`
- Local Prediction: `Dining` (Hallucination)
- API Prediction: `Travel-Gas` (Correct)
- Local Reasoning: "Love's is a well-known restaurant chain, which falls under the Dining category."
- API Reasoning: "Love's is a well-known gas station chain, and the transaction occurred during a trip to Moab, categorizing it as travel-related fuel."

Analysis: The API knows "Love's" is a major gas station chain. The small local model lacks this world knowledge and hallucinates it as a restaurant, highlighting the pure "Knowledge Gap" between a 270M and a 70B+ model. Additionally, Gemini Flash has Google Search grounding enabled, allowing it to verify real-world entities in real time, a capability our isolated local model intrinsically lacks.
Surprise Win: JSON Stability
One pleasant surprise was the format adherence. I initially feared I'd need constrained generation tools like outlines or a simplified schema for a 270M parameter model. However, the distilled T5-Gemma model followed the complex JSON schema (including nested fields) with near-perfect reliability, proving that specific structure can be learned effectively through fine-tuning alone.
Key Lesson: The "Noisy Ground Truth" Trap
Since this is a distillation (SFT) pipeline, not Reinforcement Learning, the model has no way to "unlearn" bad habits via negative rewards. It relies entirely on the quality of the teacher's reasoning.
- Transaction: `[TRAVEL] SWEETHOME KITCHEN`
- Local Prediction: `Dining`
- API Prediction: `Travel-Dining`
- Local Reasoning: "The description 'SWEETHOME KITCHEN' indicates a restaurant or dining establishment, which falls under the Dining category."
- API Reasoning: "The transaction is for a kitchen/restaurant and occurred while the user was traveling to Pfeiffer Big Sur SP, making it a travel-related dining expense."
In this case, the API correctly used the calendar context ("User is in Big Sur"). The local model missed this link. This highlights that simply having the data isn't enough—the reasoning in the training set must explicitly force the model to look at the context, or it will revert to simple pattern matching (Kitchen = Dining).
Conclusion
We often think we need 70B-parameter models for everything. This project shows that for a specific, well-defined task with consistent formatting, a 270M-parameter model, fine-tuned on high-quality distilled data, can punch way above its weight class.
The key was data quality over quantity. By using the commercial model to "verify" my historical data, I created a dataset that was cleaner than either source alone.