📚 Auto-publish: Add/update 3 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 3m29s
Generated on: Sat Dec 27 21:18:10 UTC 2025 Source: md-personal repository

---
title: "From Gemini-3-Flash to T5-Gemma-2: A Journey in Distilling a Family Finance LLM"
date: 2025-12-08
draft: false
---

Running a family finance system is surprisingly complex. What starts as a simple spreadsheet often evolves into a web of rules, exceptions, and "wait, was this dinner or *vacation* dinner?" questions.

For years, I relied on a rule-based system to categorize our credit card transactions. It worked... mostly. But maintaining `if "UBER" in description and amount > 50` style rules is a never-ending battle against the entropy of merchant names and changing habits.

Recently, I decided to modernize this stack using Large Language Models (LLMs). This post details the technical journey from using an off-the-shelf commercial model to distilling that knowledge into a small, efficient local model (`google/t5gemma-2-270m`) that runs on my own hardware while maintaining high accuracy.

## Phase 1: The Proof of Concept with Commercial LLMs

My first step was to replace the spaghetti code of regex rules with a prompt. I used **Gemini-3-Flash** (via `litellm`) as my categorization engine.

The core challenge was context. A transaction like `MCDONALDS` could be:

- **Dining**: A quick lunch during work.
- **Travel-Dining**: A meal while on a road trip.

To solve this, I integrated my **private Google Calendar** (via `.ics` export). The prompt doesn't just see the transaction; it sees *where I was* and *what I was doing* on that day.

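To make that concrete, here is a minimal sketch of what the teacher call can look like. The prompt wording, helper signature, and the `litellm` model string are illustrative assumptions, not the production script; it also assumes the events for the transaction date have already been pulled out of the `.ics` export.

```python
# Illustrative sketch of the teacher call (not the exact production code).
# Assumes `calendar_events` was already extracted from the .ics export for the
# transaction date, e.g. ["Trip: 34TH ARCH CANYON 2025"].
import json

import litellm

SYSTEM_PROMPT = (
    "You are a family finance categorizer. Return strict JSON with the fields "
    '"Category", "Travel Category" (when applicable) and "Reasoning", using only '
    "the allowed categories (Dining, Travel, Bills, ...)."
)

def categorize_with_teacher(description: str, amount: float,
                            tx_date: str, calendar_events: list[str]) -> dict:
    user_prompt = (
        f"Transaction: {description}\n"
        f"Amount: {amount}\n"
        f"Date: {tx_date}\n"
        f"Calendar events that day: {', '.join(calendar_events) or 'none'}"
    )
    response = litellm.completion(
        model="gemini/gemini-3-flash",  # model alias is an assumption; use whatever you have configured
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},  # request strict JSON where the provider supports it
    )
    return json.loads(response.choices[0].message.content)
```

The real prompt is much longer (it carries the full category and sub-category schema), but the shape is the same: transaction fields plus calendar context in, strict JSON out.
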
### The "God Prompt"

The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., `Dining`, `Travel`, `Bills`) and Sub-Categories (e.g., `Travel` -> `Accommodation`).

```json
{
  "Category": "Travel",
  "Travel Category": "Dining",
  "Reasoning": "User is on 'Trip: 34TH ARCH CANYON 2025', distinguishing this from regular dining."
}
```

This worked well. The "Reasoning" field even gave me explanations for why it flagged something as `Entertainment` vs `Shopping`. But relying on an external API for every single transaction felt like overkill for a personal project, and I wanted to own the stack.

## Phase 2: Distilling Knowledge

I wanted to train a smaller model to mimic Gemini's performance. But I didn't want to manually label thousands of transactions.

### Consistency Filtering

I had a massive CSV of historical transactions (years of data). However, that data was "noisy"—some manual labels were outdated or inconsistent.

I built a **Distillation Pipeline** (`distill_reasoning.py`) that uses the Teacher Model (Gemini) to re-label the historical data. But here's the twist: I only added a data point to my training set if the **Teacher's prediction matched the Historical Ground Truth**.

```python
# Pseudo-code for consistency filtering (the real logic lives in distill_reasoning.py)
import pandas as pd

rows = pd.read_csv("historical_transactions.csv")  # file name illustrative
training_data = []

for _, row in rows.iterrows():
    transaction = row.to_dict()
    teacher_pred = gemini.categorize(transaction)  # teacher call from Phase 1
    historical_label = row["Category"]

    if teacher_pred.category == historical_label:
        # High confidence sample! Human label and teacher agree.
        training_data.append({
            "input": format_transaction(transaction),
            "output": teacher_pred.to_json(),
        })
    else:
        # Discard: Either history is wrong OR teacher hallucinated.
        log_fail(transaction)
```

This filtered out the noise, leaving me with ~2,000 high-quality, "verified" examples where both the human (me, years ago) and the AI agreed.

## Phase 3: Training the Little Guy

For the local model, I chose **google/t5gemma-2-270m**. This is a Seq2Seq model, which fits the "Text-to-JSON" task perfectly, and it's tiny (270M parameters), meaning it can run on almost anything.

### The Stack

- **Library**: `transformers`, `peft`, `bitsandbytes`
- **Technique**: **LoRA** (Low-Rank Adaptation). I targeted all linear layers (`q_proj`, `k_proj`, `v_proj`, etc.) with `r=16` (see the sketch below).
- **Optimization**: `AdamW` with linear decay.

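For reference, the adapter setup can be sketched as below, assuming the Hugging Face `peft` and `transformers` APIs (and a recent `transformers` release with T5Gemma support). Values not stated in this post, like `lora_alpha`, are illustrative defaults, and the `bitsandbytes` quantization is omitted for brevity.

```python
# Minimal LoRA setup sketch; hyperparameters not mentioned in the post are illustrative.
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE_MODEL = "google/t5gemma-2-270m"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                      # rank used in this project
    lora_alpha=32,             # assumption: a common 2*r default
    lora_dropout=0.1,          # see Pitfall #3 below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # "all linear layers" in practice
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable
```
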
### Pitfall #1: The "Loss is 0" Initial Panic

My first training run showed a loss of exactly `0.000` almost immediately. In deep learning, if it looks too good to be true, it's a bug.

It turned out to be a syntax error in the arguments I passed to the `Trainer` (or rather, to my custom training loop). Once fixed, the loss looked "healthy"—starting high and decaying noisily.

### Pitfall #2: Stability vs. Noise

The loss curve was initially extremely erratic. The batch size my GPU could handle was limited (physical batch size = 4).

**The Fix**: I implemented **Gradient Accumulation** (accumulating over 8 steps) to simulate an effective batch size of 32. This smoothed out the gradient estimates (and the loss curve) significantly.



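In a custom loop, the accumulation itself is only a few lines. The fragment below is a sketch of the pattern, assuming `model`, `optimizer`, `scheduler`, and `train_loader` were set up as described above; the names and the placement of the clipping call are illustrative.

```python
# Gradient accumulation sketch: physical batch size 4, accumulated over 8 steps.
import torch

ACCUM_STEPS = 8  # 4 * 8 = effective batch size of 32

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    outputs = model(**batch)              # seq2seq forward pass returns the loss
    loss = outputs.loss / ACCUM_STEPS     # scale so accumulated gradients average correctly
    loss.backward()

    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # see Pitfall #3
        optimizer.step()
        scheduler.step()                  # linear decay schedule from "The Stack"
        optimizer.zero_grad()
```
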
### Pitfall #3: Overfitting

With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:

1. **Data Quality First**: The "Consistency Filtering" phase was the most critical step. By discarding ambiguous samples where the teacher model disagreed with history, I prevented the model from memorizing noise.
2. **Model Regularization**:
    * **LoRA Dropout**: I set `lora_dropout=0.1`, randomly dropping 10% of the trainable adapter connections during training to force robust feature learning.
    * **Gradient Clipping**: I capped the gradient norm at `1.0`. This prevents the "exploding gradient" problem and keeps weight updates stable.
    * **AdamW**: The AdamW optimizer adds decoupled weight decay, implicitly penalizing large weights.

I also set up a rigorous evaluation loop (10% validation split, eval every 50 steps) to monitor the `Train Loss` vs `Eval Loss` in real time. The final curves showed them tracking downwards together, confirming the model was generalizing rather than memorizing.

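The evaluation harness is nothing exotic. A hedged sketch of the split-and-monitor pattern, using placeholder names and assuming the `datasets` library, looks like this:

```python
# Sketch of the 90/10 split and periodic eval-loss monitoring (placeholder names).
import torch
from datasets import Dataset

splits = Dataset.from_list(training_data).train_test_split(test_size=0.1, seed=42)
train_set, eval_set = splits["train"], splits["test"]

EVAL_EVERY = 50  # optimizer steps between eval passes

def eval_loss(model, eval_loader) -> float:
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for batch in eval_loader:
            total += model(**batch).loss.item()
            batches += 1
    model.train()
    return total / max(batches, 1)

# Inside the training loop, after each optimizer.step():
#   if global_step % EVAL_EVERY == 0:
#       print(global_step, running_train_loss, eval_loss(model, eval_loader))
```
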
## Phase 4: Results and The "Travel" Edge Case

The distilled model is surprisingly capable. It learned the JSON schema very well. Although I included a regex fallback in the inference script as a safety net, the model generates valid JSON the vast majority of the time.

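For the curious, the safety net is just defensive parsing. Here is a sketch of inference plus the regex fallback; the adapter path, prompt format, and generation settings are assumptions, not the exact script.

```python
# Sketch of inference with the distilled model plus a regex fallback for JSON parsing.
import json
import re

import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE_MODEL = "google/t5gemma-2-270m"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, "finance-categorizer-lora")  # adapter path is an assumption

def categorize_local(prompt: str) -> dict:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Regex safety net: grab the outermost {...} block if the model added stray text.
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```
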
### Head-to-Head: Local Model vs Gemini-Flash

I ran a blind evaluation on 20 random unseen transactions.

- **Gemini-3-Flash Accuracy**: 90% (18/20)
- **Local T5-Gemma-2 Accuracy**: 85% (17/20)

The gap is surprisingly small. In fact, the local model sometimes outperformed the API because it was fine-tuned on *my* specific data distribution.

**Win for Local Model:**

> **Transaction**: `XX RANCH #1702`
> **Local Prediction**: `Groceries` (Correct)
> **API Prediction**: `Gas` (Incorrect)
> **Local Reasoning**: "XX RANCH refers to a well-known supermarket chain."
> **API Reasoning**: "XX RANCH is a known convenience store and gas station chain."
> **Analysis**: The local model "knows" (from training data) that XX Ranch is an Asian grocery store I frequent, whereas the general-purpose API assumed it was a gas station based on the name pattern.

**Win for API (World Knowledge):**

> **Transaction**: `LOVE'S #0792`
> **Local Prediction**: `Dining` (Hallucination)
> **API Prediction**: `Travel-Gas` (Correct)
> **Local Reasoning**: "Love's is a well-known restaurant chain, which falls under the Dining category."
> **API Reasoning**: "Love's is a well-known gas station chain, and the transaction occurred during a trip to Moab, categorizing it as travel-related fuel."
> **Analysis**: The API knows "Love's" is a major gas station chain. The small local model lacks this world knowledge and hallucinates it as a restaurant, highlighting the pure "Knowledge Gap" between a 270M and a 70B+ model. Additionally, Gemini Flash has **Google Search grounding** enabled, allowing it to verify real-world entities in real time—a capability my isolated local model intrinsically lacks.

### Surprise Win: JSON Stability

One pleasant surprise was the **format adherence**. I initially feared I'd need constrained generation tools like `outlines` or a simplified schema for a 270M parameter model. However, the distilled T5-Gemma model followed the complex JSON schema (including nested fields) with near-perfect reliability, proving that specific structure can be learned effectively through fine-tuning alone.

### Key Lesson: The "Noisy Ground Truth" Trap

Since this is a **distillation (SFT)** pipeline, not Reinforcement Learning, the model has no way to "unlearn" bad habits via negative rewards. It relies entirely on the quality of the teacher's reasoning.

> **Transaction**: `[TRAVEL] SWEETHOME KITCHEN`
> **Local Prediction**: `Dining`
> **API Prediction**: `Travel-Dining`
> **Local Reasoning**: "The description 'SWEETHOME KITCHEN' indicates a restaurant or dining establishment, which falls under the Dining category."
> **API Reasoning**: "The transaction is for a kitchen/restaurant and occurred while the user was traveling to Pfeiffer Big Sur SP, making it a travel-related dining expense."

In this case, the API correctly used the calendar context ("User is in Big Sur"). The local model missed this link. This highlights that simply having the data isn't enough—the *reasoning* in the training set must explicitly force the model to look at the context, or it will revert to simple pattern matching (Kitchen = Dining).

## Conclusion

We often think we need 70B parameter models for everything. This project shows that for a specific, well-defined task with consistent formatting, a **270M parameter model**—fine-tuned on high-quality, distilled data—can punch way above its weight class.

The key was **data quality over quantity**. By using the commercial model to "verify" my historical data, I created a dataset that was cleaner than either source alone.