---
title: "From Gemini-3-Flash to T5-Gemma-2: A Journey in Distilling a Family Finance LLM"
date: 2025-12-27
draft: false
---

Running a family finance system is surprisingly complex. What starts as a simple spreadsheet often evolves into a web of rules, exceptions, and "wait, was this dinner or vacation dinner?" questions.

For years, I relied on a rule-based system to categorize our credit card transactions. It worked... mostly. But maintaining rules like `if "UBER" in description and amount > 50` is a never-ending battle against the entropy of merchant names and changing habits.

Recently, I decided to modernize this stack using Large Language Models (LLMs). This post details the technical journey from using an off-the-shelf commercial model to distilling that knowledge into a small, efficient local model (google/t5gemma-2-270m) that runs on my own hardware while maintaining high accuracy.

Phase 1: The Proof of Concept with Commercial LLMs

My first step was to replace the spaghetti code of regex rules with a prompt. I used Gemini-3-Flash (via litellm) as my categorization engine.

The core challenge was context. A transaction like MCDONALDS could be:

  • Dining: A quick lunch during work.
  • Travel-Dining: A meal while on a road trip.

To solve this, I integrated my private Google Calendar (via .ics export). The prompt doesn't just see the transaction; it sees where I was and what I was doing on that day.
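To give a flavor of that integration, here is a minimal sketch of the calendar lookup (simplified; the helper name and file path are placeholders, not the actual pipeline code):

```python
# Simplified sketch: collect the titles of calendar events on a given day
# from the .ics export, so they can be injected into the prompt as context.
from datetime import date
from icalendar import Calendar

def events_on(day: date, ics_path: str = "calendar.ics") -> list[str]:
    with open(ics_path, "rb") as f:
        cal = Calendar.from_ical(f.read())
    titles = []
    for event in cal.walk("VEVENT"):
        start = event.get("dtstart").dt
        # DTSTART may be a date or a datetime; normalize before comparing.
        start_day = start.date() if hasattr(start, "date") else start
        if start_day == day:
            titles.append(str(event.get("summary")))
    return titles
```

(Only start-day matching is shown; multi-day trips need a start/end range check.)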

The "God Prompt"

The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., Dining, Travel, Bills) and Sub-Categories (e.g., Travel -> Accommodation).

```json
{
  "Category": "Travel",
  "Travel Category": "Dining",
  "Reasoning": "User is on 'Trip: 34TH ARCH CANYON 2025', distinguishing this from regular dining."
}
```

This worked well. The "Reasoning" field even gave me explanations for why it flagged something as Entertainment vs Shopping. But relying on an external API for every single transaction felt like overkill for a personal project, and I wanted to own the stack.
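For reference, the whole "engine" at this stage was little more than a single call (a minimal sketch assuming litellm's completion API; the prompt text and model string below are placeholders, not the production values):

```python
# Sketch of the categorization call: strict-JSON system prompt + transaction
# + same-day calendar events, parsed straight into a dict.
import json
from litellm import completion

SYSTEM_PROMPT = (
    "You are a personal-finance categorizer. Return strict JSON with "
    "'Category', an optional sub-category field, and 'Reasoning'."
)  # placeholder for the real "God Prompt"

def categorize(transaction_text: str, calendar_events: list[str]) -> dict:
    user_msg = (
        f"Transaction: {transaction_text}\n"
        f"Calendar events that day: {', '.join(calendar_events) or 'none'}"
    )
    response = completion(
        model="gemini/gemini-flash",  # placeholder model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.0,  # deterministic labels are easier to audit
    )
    return json.loads(response.choices[0].message.content)
```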

Phase 2: Distilling Knowledge

I wanted to train a smaller model to mimic Gemini's performance. But I didn't want to manually label thousands of transactions.

Consistency Filtering

I had a massive CSV of historical transactions (years of data). However, that data was "noisy"—some manual labels were outdated or inconsistent.

I built a Distillation Pipeline (distill_reasoning.py) that uses the Teacher Model (Gemini) to re-label the historical data. But here's the twist: I only added a data point to my training set if the Teacher's prediction matched the Historical Ground Truth.

```python
# Pseudo-code for consistency filtering: keep a sample only when the
# teacher model and my old manual label agree.
training_data = []
for _, row in history_df.iterrows():            # years of labeled transactions
    transaction = row_to_transaction(row)
    teacher_pred = gemini.categorize(transaction)   # teacher = Gemini
    historical_label = row['Category']              # my label from years ago

    if teacher_pred.category == historical_label:
        # High-confidence sample: human and teacher agree.
        training_data.append({
            "input": format_transaction(transaction),
            "output": teacher_pred.to_json()
        })
    else:
        # Discard: either history is wrong OR the teacher hallucinated.
        log_fail(transaction)
```

This filtered out the noise, leaving me with ~2,000 high-quality, "verified" examples where both the human (me, years ago) and the AI agreed.

Phase 3: Training the Little Guy

For the local model, I chose google/t5gemma-2-270m. This is a Seq2Seq model, which fits the "Text-to-JSON" task perfectly, and it's tiny (270M parameters), meaning it can run on almost anything.

The Stack

  • Library: transformers, peft, bitsandbytes
  • Technique: LoRA (Low-Rank Adaptation). I targeted all linear layers (q_proj, k_proj, v_proj, etc.) with r=16; a configuration sketch follows this list.
  • Optimization: AdamW with linear decay.
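Wired together with peft, the setup looks roughly like this (a sketch: r and dropout mirror the values above and in Pitfall #3, while lora_alpha and the exact module list are illustrative):

```python
# Sketch of the LoRA setup for the Seq2Seq model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "google/t5gemma-2-270m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,             # illustrative value
    lora_dropout=0.1,          # see Pitfall #3
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # "all linear layers"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints the trainable-parameter fraction
```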

Pitfall #1: The "Loss is 0" Initial Panic

My first training run showed a loss of exactly 0.000 almost immediately. In deep learning, if it looks too good to be true, it's a bug. It turned out to be a syntax error in the arguments passed to my training setup (a custom loop rather than the stock Trainer). Once fixed, the loss looked "healthy": starting high and decaying noisily.

Pitfall #2: Stability vs. Noise

The loss curve was initially extremely erratic; the physical batch size my GPU could handle was only 4. The fix: I implemented gradient accumulation (accumulating over 8 steps) to simulate an effective batch size of 32, which smoothed out the optimization landscape significantly.
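In a manual loop, the pattern is straightforward (sketch only; the dataloader, optimizer, and scheduler are assumed to already exist):

```python
# Gradient accumulation: batches of 4, accumulated over 8 steps, so each
# optimizer update sees an effective batch of 32.
import torch

ACCUM_STEPS = 8

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss / ACCUM_STEPS   # scale so gradients average correctly
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping (Pitfall #3)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```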

Pitfall #3: Overfitting

With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:

  1. Data Quality First: The "Consistency Filtering" phase was the most critical step. By discarding ambiguous samples where the teacher model disagreed with history, I prevented the model from memorizing noise.
  2. Model Regularization:
    • LoRA Dropout: I set lora_dropout=0.1, randomly dropping 10% of the trainable adapter connections during training to force robust feature learning.
    • Gradient Clipping: We capped the gradient norm at 1.0. This prevents the "exploding gradient" problem and keeps weight updates stable.
    • AdamW: Using the AdamW optimizer adds decoupled weight decay, implicitly penalizing overly complex weights.
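The optimizer piece pairs with the linear decay schedule mentioned in the stack above and is only a few lines of setup (a sketch; the learning rate, weight decay, and step count are illustrative, not the exact values):

```python
# AdamW (decoupled weight decay) paired with a linear decay schedule.
import torch
from transformers import get_linear_schedule_with_warmup

total_update_steps = 1_000   # optimizer updates, i.e. counted after accumulation
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_update_steps
)
```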

I also set up a rigorous evaluation loop (10% validation split, eval every 50 steps) to monitor the Train Loss vs Eval Loss in real-time. The final curves showed them tracking downwards together, confirming generalization.
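The split itself is a one-liner with the datasets library (a sketch; training_data is the filtered list from Phase 2):

```python
# Hold out 10% of the distilled samples for validation.
from datasets import Dataset

splits = Dataset.from_list(training_data).train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
EVAL_EVERY_STEPS = 50   # measure eval loss alongside train loss every 50 steps
```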

Phase 4: Results and The "Travel" Edge Case

The distilled model is surprisingly capable. It learned the JSON schema very well. Although I included a regex fallback in the inference script as a safety net, the model generates valid JSON the vast majority of the time.
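The fallback is nothing exotic: try strict parsing first, then pull out the first JSON-looking block (a simplified sketch, not the actual inference script):

```python
# Parse the model's raw output: strict JSON first, regex extraction second.
import json
import re

def parse_prediction(raw_output: str) -> dict:
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw_output, re.DOTALL)  # outermost braces
        if match:
            return json.loads(match.group(0))
        raise ValueError(f"Could not parse model output: {raw_output!r}")
```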

Head-to-Head: Local Model vs Gemini-Flash

I ran a blind evaluation on 20 random unseen transactions.

  • Gemini-3-Flash Accuracy: 90% (18/20)
  • Local T5-Gemma-2 Accuracy: 85% (17/20)

The gap is surprisingly small. In fact, the local model sometimes outperformed the API because it was fine-tuned on my specific data distribution.

Win for Local Model:

Transaction: XX RANCH #1702

  • Local Prediction: Groceries (Correct)
  • API Prediction: Gas (Incorrect)
  • Local Reasoning: "XX RANCH refers to a well-known supermarket chain."
  • API Reasoning: "XX RANCH is a known convenience store and gas station chain."

Analysis: The local model "knows" (from training data) that XX Ranch is an Asian grocery store I frequent, whereas the general-purpose API assumed it was a gas station based on the name pattern.

Win for API (World Knowledge):

Transaction: LOVE'S #0792

  • Local Prediction: Dining (Hallucination)
  • API Prediction: Travel-Gas (Correct)
  • Local Reasoning: "Love's is a well-known restaurant chain, which falls under the Dining category."
  • API Reasoning: "Love's is a well-known gas station chain, and the transaction occurred during a trip to Moab, categorizing it as travel-related fuel."

Analysis: The API knows "Love's" is a major gas station chain. The small local model lacks this world knowledge and hallucinates it as a restaurant, highlighting the pure "Knowledge Gap" between a 270M and a 70B+ model. Additionally, Gemini Flash has Google Search grounding enabled, allowing it to verify real-world entities in real time, a capability our isolated local model intrinsically lacks.

Surprise Win: JSON Stability

One pleasant surprise was the format adherence. I initially feared I'd need constrained generation tools like outlines or a simplified schema for a 270M parameter model. However, the distilled T5-Gemma model followed the complex JSON schema (including nested fields) with near-perfect reliability, proving that specific structure can be learned effectively through fine-tuning alone.

Key Lesson: The "Noisy Ground Truth" Trap

Since this is a distillation (SFT) pipeline, not Reinforcement Learning, the model has no way to "unlearn" bad habits via negative rewards. It relies entirely on the quality of the teacher's reasoning.

Transaction: [TRAVEL] SWEETHOME KITCHEN

  • Local Prediction: Dining
  • API Prediction: Travel-Dining
  • Local Reasoning: "The description 'SWEETHOME KITCHEN' indicates a restaurant or dining establishment, which falls under the Dining category."
  • API Reasoning: "The transaction is for a kitchen/restaurant and occurred while the user was traveling to Pfeiffer Big Sur SP, making it a travel-related dining expense."

In this case, the API correctly used the calendar context ("User is in Big Sur"). The local model missed this link. This highlights that simply having the data isn't enough—the reasoning in the training set must explicitly force the model to look at the context, or it will revert to simple pattern matching (Kitchen = Dining).

Conclusion

We often think we need 70B parameter models for everything. This project shows that for a specific, well-defined task with consistent formatting, a 270M parameter model, fine-tuned on high-quality, distilled data, can punch way above its weight class.

The key was data quality over quantity. By using the commercial model to "verify" my historical data, I created a dataset that was cleaner than either source alone.