How a 0.03ms model learned to tag user posts almost as well as a multi-gigabyte language model
Here is the honest version of how this project started.
We had 41,000 real user posts. We needed to tag each one with interest categories — things like “Economy”, “Health”, “Environment”. Multiple tags per post, because real posts don’t fit neatly into one box.
The “smart” solution was obvious: throw a large language model at it. Let the LLM read each post and decide what it’s about. Done.
And it worked. The LLM was great at this. It understood context, handled short or messy posts, and produced sensible labels even when the text was a few words of slang or an emoji-heavy neighborhood update.
The problem? Each batch of 5 posts took about 12 seconds. At that rate, the full 41,000 posts would take more than a day of continuous inference. And if we ever wanted this in production — tagging posts in real time as users publish them — an LLM sitting behind an API endpoint was not going to cut it.
So we asked a different question.
What if we use the LLM to generate our training labels, then train a much smaller, faster model on those labels?
That’s the experiment this post is about.
The Setup
The LLM we used was Gemma 4 (specifically gemma4:e4b), running locally via Ollama.
We asked it to assign 2 to 5 interest labels to each post from a fixed set of 20 categories. The model was doing this offline, as a one-time labeling job — not in production.
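For context, the labeling loop looked roughly like this. This is a sketch, not our exact prompt: the 3-category subset is illustrative (the real set has 20), and the `ollama` client call is shown as a comment because it needs a running local server.

```python
CATEGORIES = ["Economy", "Health", "Environment"]  # illustrative; the real set has 20

def build_prompt(post: str) -> str:
    """Ask for 2 to 5 labels drawn strictly from the fixed category set."""
    return (
        "Assign 2 to 5 interest labels to the post below. "
        f"Choose only from: {', '.join(CATEGORIES)}. "
        "Reply with a comma-separated list and nothing else.\n\n"
        f"Post: {post}"
    )

def parse_labels(response: str) -> list[str]:
    """Keep only known labels; LLMs occasionally invent new categories."""
    return [lbl.strip() for lbl in response.split(",") if lbl.strip() in CATEGORIES]

# The actual call, roughly (requires the `ollama` package and a local server):
#   import ollama
#   reply = ollama.chat(model="gemma4:e4b",
#                       messages=[{"role": "user", "content": build_prompt(post)}])
#   labels = parse_labels(reply["message"]["content"])
```

The strict parse step matters: anything the model returns outside the fixed category set gets silently dropped rather than polluting the label space.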
The resulting labeled dataset had 1,670 posts with an average of 3.36 labels per post.
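To train on multi-label targets like these, the per-post tag lists get converted into a binary indicator matrix. A minimal sketch with scikit-learn's `MultiLabelBinarizer` and made-up tags:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tag lists; the real dataset averaged 3.36 labels per post.
tags = [
    ["Economy", "Environment"],
    ["Health"],
    ["Economy", "Health", "Environment"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)   # one row per post, one column per category
# mlb.classes_ holds the categories in sorted order
```

Every model below consumes the same `Y` matrix, which is what makes swapping classifiers cheap.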
Then we trained four different lightweight models on those LLM-generated labels and measured how close each one could get.
Why This Even Makes Sense
Before getting into numbers, let me explain the core idea — because “distillation” sounds complicated but the intuition is simple.
An LLM is expensive to run. It needs gigabytes of RAM, takes seconds per request, and costs money at scale.
But an LLM is very good at labeling. It understands nuance, handles weird edge cases, and rarely makes obviously wrong calls.
So the trick is: let it label once, then train a tiny model on those labels. The tiny model doesn’t need to be as smart as the LLM. It just needs to be good enough for your specific task.
Think of it like hiring a very expensive consultant to create a training manual, then using that manual to train cheaper, faster staff.
The Models We Tried
1. TF-IDF (word n-grams) + Logistic Regression
The classic baseline. Count word frequencies, feed them to a linear classifier. Micro F1: 0.58 — a solid starting point.
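A minimal sketch of this baseline, as a scikit-learn pipeline on toy data (the real training set was the 1,670 LLM-labeled posts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy posts standing in for the real labeled ones.
posts = [
    "gas prices are up again this month",
    "new clinic opened near the park",
    "recycling pickup schedule changed",
    "budget cuts hit the local hospital",
]
tags = [["Economy"], ["Health"], ["Environment"], ["Economy", "Health"]]

Y = MultiLabelBinarizer().fit_transform(tags)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),               # word unigrams + bigrams
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(posts, Y)
probs = model.predict_proba(posts)   # one probability per (post, category)
```

One-vs-rest fits one binary logistic regression per category, which is what makes multiple tags per post possible.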
2. Sentence Embeddings + Logistic Regression
Here we tried something more “modern”: encode the whole post as a dense semantic vector using a pre-trained transformer, then classify on top of that. Micro F1: 0.46 — actually worse than the basic baseline.
3. Sentence Embeddings + XGBoost
Maybe the problem was the classifier, not the embeddings? Swapped Logistic Regression for XGBoost. Micro F1: 0.46 — same story. No improvement.
4. TF-IDF (word + character n-grams) + Logistic Regression
This one added character-level features on top of word features. Micro F1: 0.64 — best result, by a clear margin.
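The winning setup can be sketched as a `FeatureUnion` of a word-level and a character-level vectorizer; the n-gram ranges and toy data here are illustrative, not the exact values we tuned:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, make_pipeline

posts = [
    "hava çok sıcak bugün",
    "hava çok sıcaakk 🥵",
    "trafik yine kilitlendi",
    "yeni park açıldı!!!",
]
# Two toy categories; the column assignment is illustrative.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])
model = make_pipeline(
    features, OneVsRestClassifier(LogisticRegression(max_iter=1000))
)
model.fit(posts, Y)
probs = model.predict_proba(posts)
```

`char_wb` restricts character n-grams to word boundaries, which keeps the feature space from exploding while still capturing spelling variants.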
The Surprising Winner
The best model wasn’t the one using transformer embeddings. It was the one that paid attention to how words are spelled, not just what they mean.
Why does that matter? Because our posts are messy: slang and informal writing, inconsistent spelling, repeated punctuation (“!!!!”), emojis mid-sentence, and neighborhood-specific shorthand. Character n-grams capture the texture of noisy text in a way that general-purpose semantic embeddings don’t. Those embeddings were trained on cleaner data and don’t know what to do with “hava çok sıcaakk 🥵” or repeated punctuation.
The Number That Matters Most: Speed
This is where the gap becomes impossible to ignore. From the training logs:
- Gemma 4 (LLM): ~12 seconds per batch of 5 posts → ~2,400 ms per post
- TF-IDF + Logistic Regression: 302 test posts predicted in 0.01 seconds → ~0.033 ms per post
That’s a 73,000× speed difference.
To make this concrete: if you had a social platform with 10,000 posts published per hour, the LLM approach would need roughly 6.7 hours just to tag one hour’s worth of posts — it can never catch up. The traditional model tags that same hour in about a third of a second.
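The back-of-envelope math, for anyone who wants to check it:

```python
# Per-post latencies taken from the training logs above.
posts_per_hour = 10_000
llm_ms, tiny_ms = 2_400, 0.033

llm_hours = posts_per_hour * llm_ms / 1000 / 3600   # hours to tag one hour of posts
tiny_seconds = posts_per_hour * tiny_ms / 1000      # seconds for the same workload
speedup = llm_ms / tiny_ms                          # the headline ratio
```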
And this isn’t just about speed. It’s about:
- No GPU required at inference time. The TF-IDF model is pure CPU, a few megabytes, loads instantly.
- No API costs. Once trained, it runs free.
- Deterministic. Same input always gives same output. No temperature, no randomness, easy to debug.
- Deployable anywhere. A Python script and a joblib file. That’s the entire production system.
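Concretely, the whole deployment story fits in a few lines. A sketch with toy data; in practice the pipeline is dumped once after training and only ever loaded in production:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real labeled posts.
posts = ["prices up again", "clinic opened", "park cleanup day", "hospital budget cut"]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])

model = make_pipeline(
    TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000))
)
model.fit(posts, Y)

# Dump the entire pipeline (vectorizer + classifier) to a single file...
path = os.path.join(tempfile.mkdtemp(), "tagger.joblib")
joblib.dump(model, path)

# ...and load it anywhere with CPU-only scikit-learn installed.
loaded = joblib.load(path)
```

Because the vectorizer travels inside the pipeline, the loaded artifact accepts raw text directly; there is no separate preprocessing step to keep in sync.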
The Results in Full
Evaluated on a held-out test set of 302 posts:
Threshold-based: Micro F1 0.6412 | Macro F1 0.6342 | Exact match 0.0695
Top-k (k=3): Micro F1 0.6063 | Macro F1 0.5967 | Exact match 0.0960
The threshold-based approach wins on F1. In production, threshold-based is the right call — it adapts to how many categories actually fit the post, rather than always picking exactly 3.
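The two decoding strategies look like this on a made-up probability matrix (the 0.5 threshold is illustrative; a production threshold would be tuned on validation data):

```python
import numpy as np

probs = np.array([
    [0.91, 0.72, 0.10, 0.05],   # clearly two categories
    [0.55, 0.52, 0.51, 0.50],   # many borderline categories
])

def threshold_labels(p, t=0.5):
    """Take every category above the threshold; label count varies per post."""
    return [np.flatnonzero(row >= t).tolist() for row in p]

def topk_labels(p, k=3):
    """Always take the k highest-probability categories."""
    return [np.argsort(row)[::-1][:k].tolist() for row in p]

# threshold: post 0 gets 2 labels, post 1 gets 4; top-k: both get exactly 3
```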
Does the Test Split Actually Represent the Data?
Before trusting these numbers, it’s worth checking whether the test split looks like the full labeled dataset. We plotted each interest category as one dot, its test-set frequency against its full-dataset frequency, with a dashed line marking perfect alignment. Points close to the line mean the split is representative — and they mostly are.
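The same check can also be run numerically: compare each category's frequency across splits and look at the gaps. A toy sketch:

```python
from collections import Counter

def label_freqs(tag_lists):
    """Fraction of posts carrying each label."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = len(tag_lists)
    return {label: n / total for label, n in counts.items()}

# Toy splits; in the real check each of the 20 categories is one point.
train = [["Economy"], ["Economy", "Health"], ["Environment"], ["Health"]]
test = [["Economy"], ["Health"]]

train_f, test_f = label_freqs(train), label_freqs(test)
# A representative split keeps these gaps small for every category.
gaps = {label: abs(train_f[label] - test_f.get(label, 0.0)) for label in train_f}
```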
Where the Model Still Struggles
Short, context-free posts are genuinely hard. A post that says “Good morning neighbors” carries almost no signal. The LLM labeled it based on platform context; the small model can only see words.
Posts that fit many categories at once are tricky. A local community event can be Culture + Socializing + Neighborhood Pride all at once. The model sometimes gets two out of three.
Platform-specific interpretations are hard to generalize. Some interest assignments reflect a judgment call about what this particular community cares about.
A 0.64 F1 score in multi-label classification with 20 classes and noisy, short social posts is genuinely useful — but it’s not perfect.
So What’s the Actual Takeaway?
There’s a pattern here worth naming clearly:
Use the expensive, heavy model to do the hard thinking once. Then distill that thinking into something you can actually deploy.
The LLM is good at labeling. It’s bad at production. The traditional model is not as smart, but it learned from the LLM’s labels, and it runs 73,000× faster.
The right question for many real-world NLP problems isn’t “which model is best?” — it’s “which model can I actually afford to run at the scale and speed I need?”
For interest tagging on a social platform, a file that fits on a USB drive and runs in 0.033ms answered that question better than a multi-gigabyte language model waiting on GPU memory.
What’s Next
- A manually-reviewed gold set. Right now the “ground truth” is what Gemma 4 said. A human-reviewed subset would tell us how much error lives in the teacher versus the student.
- A hybrid system. Use the traditional model by default. For low-confidence predictions, fall back to the LLM. Best of both worlds.
- Image features. Many posts have a photo. A simple image captioning step could add signal for posts where the text is too short to classify on its own.
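The hybrid idea above is simple to sketch: route each post on the fast model's confidence, and only escalate the ambiguous ones. The threshold here is illustrative:

```python
def route(category_probs, confidence_threshold=0.6):
    """Send a post to the LLM only when the fast model is unsure everywhere."""
    return "fast_model" if max(category_probs) >= confidence_threshold else "llm"

route([0.91, 0.10, 0.05])   # confident: stay on the fast path
route([0.40, 0.38, 0.35])   # everything borderline: escalate to the LLM
```

If only a small fraction of posts fall below the threshold, the LLM's cost is paid on exactly the posts where it adds the most value.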
Quick Stats
- Raw posts in corpus: 41,094
- LLM-labeled training posts: 1,670
- Interest categories: 20
- Average labels per post: 3.36
- Best model: TF-IDF word+char + Logistic Regression
- Best F1: 0.6412 micro / 0.6342 macro
- LLM inference time: ~2,400 ms/post
- Traditional model inference time: ~0.033 ms/post
- Speed advantage: 73,000×
And finally, thanks to Ali Alpaslan and Engin Tutlu at AYT Technologies for their support on this project.
