When the dataset is small and the failure modes are describable in words, a vision LLM with a tight prompt beats a fine-tune, with a fraction of the eval overhead. A note on when to reach for which.
Pallet pooling depots receive thousands of returned pallets a day. A small fraction have damaged blocks (the four wooden cubes that bear the load) and have to be pulled before they go back into circulation. Damaged blocks aren't visually subtle: cracks, missing chunks, deep splinters. A human can call it in a second. The question is how to do it at depot throughput without hiring twenty more humans.
The instinct is to train a small CNN: take 50,000 images, label each block as OK/damaged, train a model that runs on cheap hardware. This works. It's also a six-month project with a labeling vendor, an MLOps pipeline, and a retraining cadence the depot doesn't want to own.
Skip the training. Feed an image of each block face to a vision LLM. State in the prompt what counts as damage: visible crack longer than X cm, missing chunk larger than Y, etc. Provide three or four few-shot examples (labeled images of borderline cases) directly in the prompt. Get back structured JSON: { status: "ok" | "damaged", confidence: 0-1, reason: string }.
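A minimal sketch of the two pieces that sit around the model call: rendering a depot's criteria into the prompt text, and validating the JSON that comes back. The function names, the threshold parameters (`crack_cm`, `chunk_cm`) and the choice of centimetres for the chunk size are assumptions for illustration. The model call itself is left out because it depends on the provider, and so are the few-shot images, which would be attached alongside this text.

```python
import json


def build_prompt(crack_cm: float, chunk_cm: float) -> str:
    """Render one depot's damage criteria into the instruction text.

    Both thresholds are hypothetical parameters; each depot sets its own.
    """
    return (
        "You are inspecting one face of a wooden pallet block.\n"
        f"Mark it damaged if there is a visible crack longer than {crack_cm} cm "
        f"or a missing chunk larger than {chunk_cm} cm across.\n"
        'Reply with JSON only: {"status": "ok" | "damaged", '
        '"confidence": <number 0-1>, "reason": <string>}'
    )


def parse_verdict(raw: str) -> dict:
    """Parse the model's reply; raise ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    status = data.get("status")
    if status not in ("ok", "damaged"):
        raise ValueError(f"unexpected status: {status!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence out of range")

    reason = data.get("reason")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")

    return {"status": status, "confidence": float(confidence), "reason": reason}
```

Rejecting off-schema replies up front means a malformed answer surfaces as an error to retry or route to a human, rather than as a silent "ok".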
Two things matter for this to work cheaply at depot throughput. First, the prompt itself is the model. Every depot's damage criteria are slightly different, and the prompt can encode that without retraining. Second, the structured output makes the eval loop trivial: run the model on a held-out set of human-labeled images, count agreements, and iterate on the prompt where it disagrees.
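The eval loop described above can be sketched in a few lines. This assumes a hypothetical `classify` callable that takes an image identifier and returns `"ok"` or `"damaged"`; in practice it would wrap the vision LLM call, but here a lookup table stands in for it so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def evaluate(
    classify: Callable[[str], str],
    labeled: Iterable[Tuple[str, str]],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Compare predictions with human labels.

    Returns the agreement rate and a list of
    (image_id, human_label, predicted_label) for every disagreement.
    """
    total = 0
    disagreements = []
    for image_id, human_label in labeled:
        total += 1
        predicted = classify(image_id)
        if predicted != human_label:
            disagreements.append((image_id, human_label, predicted))
    rate = (total - len(disagreements)) / total if total else 0.0
    return rate, disagreements


# Illustrative data only: four hand-labeled images and a stub classifier.
held_out = [
    ("img_001", "ok"),
    ("img_002", "damaged"),
    ("img_003", "damaged"),
    ("img_004", "ok"),
]
stub_predictions = {
    "img_001": "ok",
    "img_002": "damaged",
    "img_003": "ok",  # the stub misses this one
    "img_004": "ok",
}

rate, misses = evaluate(stub_predictions.__getitem__, held_out)
print(rate)    # 0.75
print(misses)  # [('img_003', 'damaged', 'ok')]
```

The disagreement list is the useful part: each entry is an image to look at when deciding which sentence of the prompt to tighten.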
The approach stops paying off in three situations. Sub-100ms latency requirements (vision LLMs are slower than a CNN running on-device). Criteria that can't be verbalized (subtle defects only a trained eye catches). Massive throughput where the per-image cost stacks up faster than a one-time training investment would.
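The throughput trade-off is simple arithmetic: the fine-tune wins once the cumulative per-image saving exceeds its one-time cost. A rough sketch, with all parameter names hypothetical and all costs to be supplied by the reader (none of the figures below come from a real deployment):

```python
import math


def break_even_images(
    training_cost: float,
    llm_cost_per_image: float,
    cnn_cost_per_image: float = 0.0,
) -> float:
    """Number of images after which a trained model becomes cheaper.

    training_cost is the one-time investment (labeling, pipeline, training).
    Returns math.inf if the LLM is not more expensive per image.
    """
    saving_per_image = llm_cost_per_image - cnn_cost_per_image
    if saving_per_image <= 0:
        return math.inf
    return training_cost / saving_per_image


# Placeholder numbers purely to show the call; substitute real quotes.
images = break_even_images(training_cost=1000.0, llm_cost_per_image=0.25)
print(images)
```

Dividing the result by the depot's daily image count gives the payback period in days, which is the number to weigh against the retraining cadence the depot would have to own.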