📚 Auto-publish: Add/update 3 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 3m29s
Generated on: Sat Dec 27 21:18:10 UTC 2025 Source: md-personal repository

---
title: "From Gemini-3-Flash to T5-Gemma-2: A Journey in Distilling a Family Finance LLM"
date: 2025-12-08
draft: false
---

Running a family finance system is surprisingly complex. What starts as a simple spreadsheet often evolves into a web of rules, exceptions, and "wait, was this dinner or *vacation* dinner?" questions.

For years, I relied on a rule-based system to categorize our credit card transactions. It worked... mostly. But maintaining `if "UBER" in description and amount > 50` style rules is a never-ending battle against the entropy of merchant names and changing habits.

Recently, I decided to modernize this stack using Large Language Models (LLMs). This post details the technical journey from using an off-the-shelf commercial model to distilling that knowledge into a small, efficient local model (`google/t5gemma-2-270m`) that runs on my own hardware while maintaining high accuracy.

## Phase 1: The Proof of Concept with Commercial LLMs

My first step was to replace the spaghetti code of regex rules with a prompt. I used **Gemini-3-Flash** (via `litellm`) as my categorization engine.

The core challenge was context. A transaction like `MCDONALDS` could be:

- **Dining**: A quick lunch during work.
- **Travel-Dining**: A meal while on a road trip.

To solve this, I integrated my **private Google Calendar** (via `.ics` export). The prompt doesn't just see the transaction; it sees *where I was* and *what I was doing* on that day.

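To make that concrete, here is a minimal sketch of what the teacher call can look like. The prompt wording, helper signature, and the `litellm` model string are illustrative assumptions, not the production script; it also assumes the events for the transaction date have already been pulled out of the `.ics` export.

```python
# Illustrative sketch of the teacher call (not the exact production code).
# Assumes `calendar_events` was already extracted from the .ics export for the
# transaction date, e.g. ["Trip: 34TH ARCH CANYON 2025"].
import json

import litellm

SYSTEM_PROMPT = (
    "You are a family finance categorizer. Return strict JSON with the fields "
    '"Category", "Travel Category" (when applicable) and "Reasoning", using only '
    "the allowed categories (Dining, Travel, Bills, ...)."
)

def categorize_with_teacher(description: str, amount: float,
                            tx_date: str, calendar_events: list[str]) -> dict:
    user_prompt = (
        f"Transaction: {description}\n"
        f"Amount: {amount}\n"
        f"Date: {tx_date}\n"
        f"Calendar events that day: {', '.join(calendar_events) or 'none'}"
    )
    response = litellm.completion(
        model="gemini/gemini-3-flash",  # model alias is an assumption; use whatever you have configured
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},  # request strict JSON where the provider supports it
    )
    return json.loads(response.choices[0].message.content)
```

The real prompt is much longer (it carries the full category and sub-category schema), but the shape is the same: transaction fields plus calendar context in, strict JSON out.
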
### The "God Prompt"

The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., `Dining`, `Travel`, `Bills`) and Sub-Categories (e.g., `Travel` -> `Accommodation`).

```json
{
  "Category": "Travel",
  "Travel Category": "Dining",
  "Reasoning": "User is on 'Trip: 34TH ARCH CANYON 2025', distinguishing this from regular dining."
}
```

This worked well. The "Reasoning" field even gave me explanations for why it flagged something as `Entertainment` vs `Shopping`. But relying on an external API for every single transaction felt like overkill for a personal project, and I wanted to own the stack.

## Phase 2: Distilling Knowledge

I wanted to train a smaller model to mimic Gemini's performance. But I didn't want to manually label thousands of transactions.

### Consistency Filtering

I had a massive CSV of historical transactions (years of data). However, that data was "noisy"—some manual labels were outdated or inconsistent.

I built a **Distillation Pipeline** (`distill_reasoning.py`) that uses the Teacher Model (Gemini) to re-label the historical data. But here's the twist: I only added a data point to my training set if the **Teacher's prediction matched the Historical Ground Truth**.

```python
# Pseudo-code for consistency filtering (the real logic lives in distill_reasoning.py)
import pandas as pd

rows = pd.read_csv("historical_transactions.csv")  # file name illustrative
training_data = []

for _, row in rows.iterrows():
    transaction = row.to_dict()
    teacher_pred = gemini.categorize(transaction)  # teacher call from Phase 1
    historical_label = row["Category"]

    if teacher_pred.category == historical_label:
        # High confidence sample! Human label and teacher agree.
        training_data.append({
            "input": format_transaction(transaction),
            "output": teacher_pred.to_json(),
        })
    else:
        # Discard: Either history is wrong OR teacher hallucinated.
        log_fail(transaction)
```

This filtered out the noise, leaving me with ~2,000 high-quality, "verified" examples where both the human (me, years ago) and the AI agreed.

## Phase 3: Training the Little Guy

For the local model, I chose **google/t5gemma-2-270m**. This is a Seq2Seq model, which fits the "Text-to-JSON" task perfectly, and it's tiny (270M parameters), meaning it can run on almost anything.

### The Stack

- **Library**: `transformers`, `peft`, `bitsandbytes`
- **Technique**: **LoRA** (Low-Rank Adaptation). I targeted all linear layers (`q_proj`, `k_proj`, `v_proj`, etc.) with `r=16` (see the sketch below).
- **Optimization**: `AdamW` with linear decay.

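For reference, the adapter setup can be sketched as below, assuming the Hugging Face `peft` and `transformers` APIs (and a recent `transformers` release with T5Gemma support). Values not stated in this post, like `lora_alpha`, are illustrative defaults, and the `bitsandbytes` quantization is omitted for brevity.

```python
# Minimal LoRA setup sketch; hyperparameters not mentioned in the post are illustrative.
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE_MODEL = "google/t5gemma-2-270m"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                      # rank used in this project
    lora_alpha=32,             # assumption: a common 2*r default
    lora_dropout=0.1,          # see Pitfall #3 below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # "all linear layers" in practice
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable
```
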
### Pitfall #1: The "Loss is 0" Initial Panic

My first training run showed a loss of exactly `0.000` almost immediately. In deep learning, if it looks too good to be true, it's a bug.

It turned out to be a syntax error in the arguments I passed to the `Trainer` (or rather, to my custom training loop). Once fixed, the loss looked "healthy"—starting high and decaying noisily.

### Pitfall #2: Stability vs. Noise

The loss curve was initially extremely erratic. The batch size my GPU could handle was limited (physical batch size = 4).

**The Fix**: I implemented **Gradient Accumulation** (accumulating over 8 steps) to simulate an effective batch size of 32. This smoothed out the gradient estimates (and the loss curve) significantly.



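In a custom loop, the accumulation itself is only a few lines. The fragment below is a sketch of the pattern, assuming `model`, `optimizer`, `scheduler`, and `train_loader` were set up as described above; the names and the placement of the clipping call are illustrative.

```python
# Gradient accumulation sketch: physical batch size 4, accumulated over 8 steps.
import torch

ACCUM_STEPS = 8  # 4 * 8 = effective batch size of 32

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    outputs = model(**batch)              # seq2seq forward pass returns the loss
    loss = outputs.loss / ACCUM_STEPS     # scale so accumulated gradients average correctly
    loss.backward()

    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # see Pitfall #3
        optimizer.step()
        scheduler.step()                  # linear decay schedule from "The Stack"
        optimizer.zero_grad()
```
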
### Pitfall #3: Overfitting

With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:

1. **Data Quality First**: The "Consistency Filtering" phase was the most critical step. By discarding ambiguous samples where the teacher model disagreed with history, I prevented the model from memorizing noise.
2. **Model Regularization**:
    * **LoRA Dropout**: I set `lora_dropout=0.1`, randomly dropping 10% of the trainable adapter connections during training to force robust feature learning.
    * **Gradient Clipping**: I capped the gradient norm at `1.0`. This prevents the "exploding gradient" problem and keeps weight updates stable.
    * **AdamW**: The AdamW optimizer adds decoupled weight decay, implicitly penalizing large weights.

I also set up a rigorous evaluation loop (10% validation split, eval every 50 steps) to monitor the `Train Loss` vs `Eval Loss` in real time. The final curves showed them tracking downwards together, confirming the model was generalizing rather than memorizing.

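The evaluation harness is nothing exotic. A hedged sketch of the split-and-monitor pattern, using placeholder names and assuming the `datasets` library, looks like this:

```python
# Sketch of the 90/10 split and periodic eval-loss monitoring (placeholder names).
import torch
from datasets import Dataset

splits = Dataset.from_list(training_data).train_test_split(test_size=0.1, seed=42)
train_set, eval_set = splits["train"], splits["test"]

EVAL_EVERY = 50  # optimizer steps between eval passes

def eval_loss(model, eval_loader) -> float:
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for batch in eval_loader:
            total += model(**batch).loss.item()
            batches += 1
    model.train()
    return total / max(batches, 1)

# Inside the training loop, after each optimizer.step():
#   if global_step % EVAL_EVERY == 0:
#       print(global_step, running_train_loss, eval_loss(model, eval_loader))
```
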
## Phase 4: Results and The "Travel" Edge Case

The distilled model is surprisingly capable. It learned the JSON schema very well. Although I included a regex fallback in the inference script as a safety net, the model generates valid JSON the vast majority of the time.

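For the curious, the safety net is just defensive parsing. Here is a sketch of inference plus the regex fallback; the adapter path, prompt format, and generation settings are assumptions, not the exact script.

```python
# Sketch of inference with the distilled model plus a regex fallback for JSON parsing.
import json
import re

import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE_MODEL = "google/t5gemma-2-270m"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, "finance-categorizer-lora")  # adapter path is an assumption

def categorize_local(prompt: str) -> dict:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Regex safety net: grab the outermost {...} block if the model added stray text.
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```
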
### Head-to-Head: Local Model vs Gemini-Flash

I ran a blind evaluation on 20 random unseen transactions.

- **Gemini-3-Flash Accuracy**: 90% (18/20)
- **Local T5-Gemma-2 Accuracy**: 85% (17/20)

The gap is surprisingly small. In fact, the local model sometimes outperformed the API because it was fine-tuned on *my* specific data distribution.

**Win for Local Model:**

> **Transaction**: `XX RANCH #1702`
> **Local Prediction**: `Groceries` (Correct)
> **API Prediction**: `Gas` (Incorrect)
> **Local Reasoning**: "XX RANCH refers to a well-known supermarket chain."
> **API Reasoning**: "XX RANCH is a known convenience store and gas station chain."
> **Analysis**: The local model "knows" (from training data) that XX Ranch is an Asian grocery store I frequent, whereas the general-purpose API assumed it was a gas station based on the name pattern.

**Win for API (World Knowledge):**

> **Transaction**: `LOVE'S #0792`
> **Local Prediction**: `Dining` (Hallucination)
> **API Prediction**: `Travel-Gas` (Correct)
> **Local Reasoning**: "Love's is a well-known restaurant chain, which falls under the Dining category."
> **API Reasoning**: "Love's is a well-known gas station chain, and the transaction occurred during a trip to Moab, categorizing it as travel-related fuel."
> **Analysis**: The API knows "Love's" is a major gas station chain. The small local model lacks this world knowledge and hallucinates it as a restaurant, highlighting the pure "Knowledge Gap" between a 270M and a 70B+ model. Additionally, Gemini Flash has **Google Search grounding** enabled, allowing it to verify real-world entities in real time—a capability my isolated local model intrinsically lacks.

### Surprise Win: JSON Stability

One pleasant surprise was the **format adherence**. I initially feared I'd need constrained generation tools like `outlines` or a simplified schema for a 270M parameter model. However, the distilled T5-Gemma model followed the complex JSON schema (including nested fields) with near-perfect reliability, proving that specific structure can be learned effectively through fine-tuning alone.

### Key Lesson: The "Noisy Ground Truth" Trap

Since this is a **distillation (SFT)** pipeline, not Reinforcement Learning, the model has no way to "unlearn" bad habits via negative rewards. It relies entirely on the quality of the teacher's reasoning.

> **Transaction**: `[TRAVEL] SWEETHOME KITCHEN`
> **Local Prediction**: `Dining`
> **API Prediction**: `Travel-Dining`
> **Local Reasoning**: "The description 'SWEETHOME KITCHEN' indicates a restaurant or dining establishment, which falls under the Dining category."
> **API Reasoning**: "The transaction is for a kitchen/restaurant and occurred while the user was traveling to Pfeiffer Big Sur SP, making it a travel-related dining expense."

In this case, the API correctly used the calendar context ("User is in Big Sur"). The local model missed this link. This highlights that simply having the data isn't enough—the *reasoning* in the training set must explicitly force the model to look at the context, or it will revert to simple pattern matching (Kitchen = Dining).

## Conclusion

We often think we need 70B parameter models for everything. This project shows that for a specific, well-defined task with consistent formatting, a **270M parameter model**—fine-tuned on high-quality, distilled data—can punch way above its weight class.

The key was **data quality over quantity**. By using the commercial model to "verify" my historical data, I created a dataset that was cleaner than either source alone.