diff --git a/.image_mappings/technical-deep-dive-llm-categorization.txt b/.image_mappings/technical-deep-dive-llm-categorization.txt
new file mode 100644
index 0000000..3369f6c
--- /dev/null
+++ b/.image_mappings/technical-deep-dive-llm-categorization.txt
@@ -0,0 +1 @@
+image-1b23344ea5541d156e5ac20823d12d7c6723b691.png|eedb3be8259a4a70aa7029b78a029364.png|e0fb329f437f21bc3385472bfeb91597
diff --git a/content/posts/technical-deep-dive-llm-categorization.md b/content/posts/technical-deep-dive-llm-categorization.md
new file mode 100644
index 0000000..573daf1
--- /dev/null
+++ b/content/posts/technical-deep-dive-llm-categorization.md
@@ -0,0 +1,141 @@
---
title: "From Gemini-3-Flash to T5-Gemma-2: A Journey in Distilling a Family Finance LLM"
date: 2025-12-08
draft: false
---

Running a family finance system is surprisingly complex. What starts as a simple spreadsheet often evolves into a web of rules, exceptions, and "wait, was this dinner or *vacation* dinner?" questions.

For years, I relied on a rule-based system to categorize our credit card transactions. It worked... mostly. But maintaining `if "UBER" in description and amount > 50`-style rules is a never-ending battle against the entropy of merchant names and changing habits.

Recently, I decided to modernize this stack using Large Language Models (LLMs). This post details the technical journey from using an off-the-shelf commercial model to distilling that knowledge into a small, efficient local model (`google/t5gemma-2-270m`) that runs on my own hardware while maintaining high accuracy.

## Phase 1: The Proof of Concept with Commercial LLMs

My first step was to replace the spaghetti code of regex rules with a prompt. I used **Gemini-3-Flash** (via `litellm`) as my categorization engine.

The core challenge was context. A transaction like `MCDONALDS` could be:
- **Dining**: A quick lunch during work.
- **Travel-Dining**: A meal while on a road trip.

To solve this, I integrated my **private Google Calendar** (via `.ics` export). The prompt doesn't just see the transaction; it sees *where I was* and *what I was doing* on that day.

### The "God Prompt"
The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., `Dining`, `Travel`, `Bills`) and Sub-Categories (e.g., `Travel` -> `Accommodation`).

```json
{
  "Category": "Travel",
  "Travel Category": "Dining",
  "Reasoning": "User is on 'Trip: 34TH ARCH CANYON 2025', distinguishing this from regular dining."
}
```

This worked well. The "Reasoning" field even gave me explanations for why it flagged something as `Entertainment` vs `Shopping`. But relying on an external API for every single transaction felt like overkill for a personal project, and I wanted to own the stack.

## Phase 2: Distilling Knowledge

I wanted to train a smaller model to mimic Gemini's performance. But I didn't want to manually label thousands of transactions.

### Consistency Filtering
I had a massive CSV of historical transactions (years of data). However, that data was "noisy"—some manual labels were outdated or inconsistent.

I built a **Distillation Pipeline** (`distill_reasoning.py`) that uses the Teacher Model (Gemini) to re-label the historical data. But here's the twist: I only added a data point to my training set if the **Teacher's prediction matched the Historical Ground Truth**.
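For context, here is a minimal sketch of what the teacher call might look like, assuming `litellm`'s OpenAI-style `completion` API with JSON-mode output. The model string, prompt, and helper name are illustrative, not the actual contents of `distill_reasoning.py`:

```python
# Hypothetical sketch of the teacher-side call; the model id, prompt,
# and helper name are illustrative, not the real distill_reasoning.py.
import json
import litellm

SYSTEM_PROMPT = (
    "You categorize credit card transactions. Return strict JSON with "
    '"Category", "Travel Category" (if applicable), and "Reasoning" fields.'
)

def categorize(transaction: str, calendar_context: str) -> dict:
    """Ask the teacher model for a category, given the day's calendar context."""
    response = litellm.completion(
        model="gemini/gemini-3-flash",  # illustrative model id
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Transaction: {transaction}\nCalendar: {calendar_context}",
            },
        ],
        response_format={"type": "json_object"},  # request strict JSON
    )
    return json.loads(response.choices[0].message.content)
```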

```python
# Pseudo-code for consistency filtering
teacher_pred = gemini.categorize(transaction)  # teacher call, as sketched above
historical_label = row['Category']

if teacher_pred.category == historical_label:
    # High-confidence sample: human and teacher agree.
    training_data.append({
        "input": format_transaction(transaction),
        "output": teacher_pred.to_json()
    })
else:
    # Discard: either the history is wrong or the teacher hallucinated.
    log_fail(transaction)
```

This filtered out the noise, leaving me with ~2,000 high-quality, "verified" examples where both the human (me, years ago) and the AI agreed.

## Phase 3: Training the Little Guy

For the local model, I chose **google/t5gemma-2-270m**. This is a Seq2Seq model, which fits the "Text-to-JSON" task perfectly, and it's tiny (270M parameters), meaning it can run on almost anything.

### The Stack
- **Library**: `transformers`, `peft`, `bitsandbytes`
- **Technique**: **LoRA** (Low-Rank Adaptation). I targeted all linear layers (`q_proj`, `k_proj`, `v_proj`, etc.) with `r=16`.
- **Optimization**: `AdamW` with linear learning rate decay.

### Pitfall #1: The "Loss is 0" Initial Panic
My first training run showed a loss of exactly `0.000` almost immediately. In deep learning, if it looks too good to be true, it's a bug.
It turned out to be a mistake in the arguments I was passing to the `Trainer` (or rather, to my custom training loop). Once fixed, the loss looked "healthy"—starting high and decaying noisily.

### Pitfall #2: Stability vs. Noise
The loss curve was initially extremely erratic, because my GPU limited the physical batch size to 4.
**The Fix**: I implemented **Gradient Accumulation** (accumulating gradients over 8 steps) to simulate an effective batch size of 32. This smoothed out the optimization landscape significantly.
![Training loss curve](/images/technical-deep-dive-llm-categorization/eedb3be8259a4a70aa7029b78a029364.png)

### Pitfall #3: Overfitting
With a small dataset (~2k samples), overfitting is a real risk. I employed a multi-layered defense strategy:

1. **Data Quality First**: The "Consistency Filtering" phase was the most critical step. By discarding ambiguous samples where the teacher model disagreed with history, I prevented the model from memorizing noise.
2. **Model Regularization**:
    * **LoRA Dropout**: I set `lora_dropout=0.1`, randomly dropping 10% of the trainable adapter connections during training to force robust feature learning.
    * **Gradient Clipping**: I capped the gradient norm at `1.0`. This prevents the "exploding gradient" problem and keeps weight updates stable.
    * **AdamW**: Using the AdamW optimizer adds decoupled weight decay, implicitly penalizing overly complex weights.

I also set up a rigorous evaluation loop (10% validation split, eval every 50 steps) to monitor the `Train Loss` vs `Eval Loss` in real time. The final curves showed them tracking downwards together, confirming generalization.
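Pulling all of that together, the setup is small enough to show in one place. Here is a sketch of an equivalent configuration using the stock Hugging Face `Trainer` stack; my actual run used a custom loop and the dataset wiring is elided, so treat this as a hedged reconstruction of the hyperparameters above, not the real script:

```python
# Sketch of the LoRA + training configuration described above.
# Assumes the Hugging Face Trainer API; the real run used a custom loop,
# so names here are illustrative and dataset loading is elided.
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5gemma-2-270m")

lora_config = LoraConfig(
    r=16,
    lora_dropout=0.1,  # drop 10% of adapter connections (regularization)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # linear layers
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_config)

args = Seq2SeqTrainingArguments(
    output_dir="t5gemma-finance",
    per_device_train_batch_size=4,   # physical batch size (GPU limit)
    gradient_accumulation_steps=8,   # effective batch size = 32
    max_grad_norm=1.0,               # gradient clipping
    optim="adamw_torch",             # AdamW (decoupled weight decay)
    lr_scheduler_type="linear",      # linear decay
    eval_strategy="steps",           # older transformers: evaluation_strategy
    eval_steps=50,                   # monitor train vs. eval loss
)
```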

## Phase 4: Results and The "Travel" Edge Case

The distilled model is surprisingly capable. It learned the JSON schema very well. Although I included a regex fallback in the inference script as a safety net, the model generates valid JSON the vast majority of the time.

### Head-to-Head: Local Model vs Gemini-Flash

I ran a blind evaluation on 20 random unseen transactions.
- **Gemini-3-Flash Accuracy**: 90% (18/20)
- **Local T5-Gemma-2 Accuracy**: 85% (17/20)

The gap is remarkably small. In fact, the local model sometimes outperformed the API because it was fine-tuned on *my* specific data distribution.

**Win for Local Model:**
> **Transaction**: `XX RANCH #1702`
> **Local Prediction**: `Groceries` (Correct)
> **API Prediction**: `Gas` (Incorrect)
> **Local Reasoning**: "XX RANCH refers to a well-known supermarket chain."
> **API Reasoning**: "XX RANCH is a known convenience store and gas station chain."
> **Analysis**: The local model "knows" (from training data) that XX Ranch is an Asian grocery store I frequent, whereas the general-purpose API assumed it was a gas station based on the name pattern.

**Win for API (World Knowledge):**
> **Transaction**: `LOVE'S #0792`
> **Local Prediction**: `Dining` (Hallucination)
> **API Prediction**: `Travel-Gas` (Correct)
> **Local Reasoning**: "Love's is a well-known restaurant chain, which falls under the Dining category."
> **API Reasoning**: "Love's is a well-known gas station chain, and the transaction occurred during a trip to Moab, categorizing it as travel-related fuel."
> **Analysis**: The API knows "Love's" is a major gas station chain. The small local model lacks this world knowledge and hallucinates it as a restaurant, highlighting the pure "Knowledge Gap" between a 270M and a 70B+ model. Additionally, Gemini Flash has **Google Search grounding** enabled, allowing it to verify real-world entities in real time—a capability our isolated local model intrinsically lacks.

### Surprise Win: JSON Stability

One pleasant surprise was the **format adherence**. I initially feared I'd need constrained generation tools like `outlines` or a simplified schema for a 270M-parameter model. However, the distilled T5-Gemma model followed the complex JSON schema (including nested fields) with near-perfect reliability, showing that a specific output structure can be learned effectively through fine-tuning alone.

### Key Lesson: The "Noisy Ground Truth" Trap

Since this is a **distillation (SFT)** pipeline, not Reinforcement Learning, the model has no way to "unlearn" bad habits via negative rewards. It relies entirely on the quality of the teacher's reasoning.

> **Transaction**: `[TRAVEL] SWEETHOME KITCHEN`
> **Local Prediction**: `Dining`
> **API Prediction**: `Travel-Dining`
> **Local Reasoning**: "The description 'SWEETHOME KITCHEN' indicates a restaurant or dining establishment, which falls under the Dining category."
> **API Reasoning**: "The transaction is for a kitchen/restaurant and occurred while the user was traveling to Pfeiffer Big Sur SP, making it a travel-related dining expense."

In this case, the API correctly used the calendar context ("User is in Big Sur"). The local model missed this link. This highlights that simply having the data isn't enough—the *reasoning* in the training set must explicitly force the model to look at the context, or it will revert to simple pattern matching (Kitchen = Dining).

## Conclusion

We often think we need 70B-parameter models for everything. This project shows that for a specific, well-defined task with consistent formatting, a **270M-parameter model**—fine-tuned on high-quality, distilled data—can punch way above its weight class.

The key was **data quality over quantity**. By using the commercial model to "verify" my historical data, I created a dataset that was cleaner than either source alone.
diff --git a/static/images/technical-deep-dive-llm-categorization/eedb3be8259a4a70aa7029b78a029364.png b/static/images/technical-deep-dive-llm-categorization/eedb3be8259a4a70aa7029b78a029364.png
new file mode 100644
index 0000000..bbef16a
Binary files /dev/null and b/static/images/technical-deep-dive-llm-categorization/eedb3be8259a4a70aa7029b78a029364.png differ