This commit is contained in:
eric
2026-02-04 06:20:15 +00:00
parent bd862cb238
commit 7de3b87680
38 changed files with 173 additions and 104 deletions

View File

@@ -10,7 +10,7 @@ For years, I relied on a rule-based system to categorize our credit card transac
<a class=heading-link href=#phase-1-the-proof-of-concept-with-commercial-llms><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>My first step was to replace the spaghetti code of regex rules with a prompt. I used <strong>Gemini-3-Flash</strong> (via <code>litellm</code>) as my categorization engine.</p><p>The core challenge was context. A transaction like <code>MCDONALDS</code> could be:</p><ul><li><strong>Dining</strong>: A quick lunch during work.</li><li><strong>Travel-Dining</strong>: A meal while on a road trip.</li></ul><p>To solve this, I integrated my <strong>private Google Calendar</strong> (via <code>.ics</code> export). The prompt doesn&rsquo;t just see the transaction; it sees <em>where I was</em> and <em>what I was doing</em> on that day.</p><h3 id=the-god-prompt>The &ldquo;God Prompt&rdquo;
<a class=heading-link href=#the-god-prompt><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., <code>Dining</code>, <code>Travel</code>, <code>Bills</code>) and Sub-Categories (e.g., <code>Travel</code> -> <code>Accommodation</code>).</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-json data-lang=json><span style=display:flex><span>{
<span class=sr-only>Link to heading</span></a></h3><p>The system prompt was designed to return strict JSON, adhering to a schema of Categories (e.g., <code>Dining</code>, <code>Travel</code>, <code>Bills</code>) and Sub-Categories (e.g., <code>Travel</code> -> <code>Accommodation</code>).</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-json data-lang=json><span style=display:flex><span>{
</span></span><span style=display:flex><span> <span style=color:#7ee787>&#34;Category&#34;</span>: <span style=color:#a5d6ff>&#34;Travel&#34;</span>,
</span></span><span style=display:flex><span> <span style=color:#7ee787>&#34;Travel Category&#34;</span>: <span style=color:#a5d6ff>&#34;Dining&#34;</span>,
</span></span><span style=display:flex><span> <span style=color:#7ee787>&#34;Reasoning&#34;</span>: <span style=color:#a5d6ff>&#34;User is on &#39;Trip: 34TH ARCH CANYON 2025&#39;, distinguishing this from regular dining.&#34;</span>
@@ -19,7 +19,7 @@ For years, I relied on a rule-based system to categorize our credit card transac
<a class=heading-link href=#phase-2-distilling-knowledge><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h2><p>I wanted to train a smaller model to mimic Gemini&rsquo;s performance. But I didn&rsquo;t want to manually label thousands of transactions.</p><h3 id=consistency-filtering>Consistency Filtering
<a class=heading-link href=#consistency-filtering><i class="fa-solid fa-link" aria-hidden=true title="Link to heading"></i>
<span class=sr-only>Link to heading</span></a></h3><p>I had a massive CSV of historical transactions (years of data). However, that data was &ldquo;noisy&rdquo;—some manual labels were outdated or inconsistent.</p><p>I built a <strong>Distillation Pipeline</strong> (<code>distill_reasoning.py</code>) that uses the Teacher Model (Gemini) to re-label the historical data. But here&rsquo;s the twist: I only added a data point to my training set if the <strong>Teacher&rsquo;s prediction matched the Historical Ground Truth</strong>.</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-python data-lang=python><span style=display:flex><span><span style=color:#8b949e;font-style:italic># Pseudo-code for consistency filtering</span>
<span class=sr-only>Link to heading</span></a></h3><p>I had a massive CSV of historical transactions (years of data). However, that data was &ldquo;noisy&rdquo;—some manual labels were outdated or inconsistent.</p><p>I built a <strong>Distillation Pipeline</strong> (<code>distill_reasoning.py</code>) that uses the Teacher Model (Gemini) to re-label the historical data. But here&rsquo;s the twist: I only added a data point to my training set if the <strong>Teacher&rsquo;s prediction matched the Historical Ground Truth</strong>.</p><div class=highlight><pre tabindex=0 style=color:#e6edf3;background-color:#0d1117;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-python data-lang=python><span style=display:flex><span><span style=color:#8b949e;font-style:italic># Pseudo-code for consistency filtering</span>
</span></span><span style=display:flex><span>teacher_pred <span style=color:#ff7b72;font-weight:700>=</span> gemini<span style=color:#ff7b72;font-weight:700>.</span>categorize(transaction)
</span></span><span style=display:flex><span>historical_label <span style=color:#ff7b72;font-weight:700>=</span> row[<span style=color:#a5d6ff>&#39;Category&#39;</span>]
</span></span><span style=display:flex><span>
@@ -73,4 +73,4 @@ It turned out to be a syntax error in my arguments passed to the <code>Trainer</
2016 -
2026
Eric X. Liu
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/6100dca">[6100dca]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>
<a href="https://git.ericxliu.me/eric/ericxliu-me/commit/45629c5">[45629c5]</a></section></footer></main><script src=/js/coder.min.6ae284be93d2d19dad1f02b0039508d9aab3180a12a06dcc71b0b0ef7825a317.js integrity="sha256-auKEvpPS0Z2tHwKwA5UI2aqzGAoSoG3McbCw73gloxc="></script><script defer src=https://static.cloudflareinsights.com/beacon.min.js data-cf-beacon='{"token": "987638e636ce4dbb932d038af74c17d1"}'></script></body></html>