The TTB COLA Registry has millions of label images. The problem? All that visual information is locked up in JPEGs. The TTB doesn't extract text from labels or scan barcodes. They just store the images.
We run every image through computer vision to pull out the useful stuff. Here's how it works.
What we extract with Google Vision AI
Every label image goes through Google's Vision AI. We get three main things out of it:
- All the text on the label (OCR), so you can search for words that appear on products
- Barcodes, both UPC and EAN, with their exact position on the image
- Image metadata like dimensions and quality scores
So far we've processed 4.6 million images and pulled out over 470,000 barcodes. Those barcodes are gold for matching products across systems.
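Before matching on an extracted barcode, it's worth sanity-checking it: UPC-A (12 digits) and EAN-13 (13 digits) both end in a check digit computed with alternating 1/3 weights. Here's a minimal validator sketch of that check (our own illustration, not the pipeline's actual code):

```python
def valid_barcode(code: str) -> bool:
    """Validate the check digit of a UPC-A (12-digit) or EAN-13 (13-digit) code."""
    if not code.isdigit() or len(code) not in (12, 13):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # For both symbologies, weights alternate 3, 1, 3, 1, ... starting
    # from the rightmost digit of the body.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

A failed check usually means an OCR-style misread of one digit, which is exactly the case you want to catch before joining against another product database.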
Why TTB categories aren't enough
The TTB gives you three options: wine, malt beverage, or distilled spirits. That's it. Fine for regulatory purposes. Not so useful if you're trying to analyze the bourbon market or track hard seltzer trends.
We use AI to read the label text and assign more granular categories. A bourbon might get tagged as:
Liquor > Whiskey > Bourbon > Single Barrel
Now you can filter at whatever level makes sense for your analysis. Want all whiskey? Easy. Just single barrel bourbons? Also easy.
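Path-style categories make that filtering trivial: split on `>` and match a prefix. A quick sketch with made-up example rows (the product names and paths here are illustrative):

```python
def in_category(category_path: str, prefix: str) -> bool:
    """True if a product's category path falls under the given prefix."""
    path = [p.strip() for p in category_path.split(">")]
    want = [p.strip() for p in prefix.split(">")]
    return path[:len(want)] == want

labels = [
    ("Example Single Barrel", "Liquor > Whiskey > Bourbon > Single Barrel"),
    ("Example Single Malt", "Liquor > Whiskey > Scotch > Single Malt"),
    ("Example Seltzer", "Malt Beverage > Hard Seltzer"),
]

all_whiskey = [n for n, c in labels if in_category(c, "Liquor > Whiskey")]
single_barrel = [n for n, c in labels if in_category(c, "Liquor > Whiskey > Bourbon > Single Barrel")]
```

The same function handles every level of the hierarchy, so "all whiskey" and "just single barrel bourbons" are the same one-liner with a longer prefix.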
Other attributes we pull from labels
Categories are just the start. We also extract:
- ABV (alcohol percentage)
- Container size, normalized to standard units
- Tasting notes, when they're on the label
- Container type (bottle, can, keg, box)
- Age statements for spirits
Not every label has all of these. Some labels are minimal; others have paragraphs of marketing copy. We extract what's there.
How accurate is it?
Look, AI isn't perfect. OCR gets confused by fancy fonts. Categorization models sometimes guess wrong on edge cases. A "whiskey-style" non-alcoholic product might get tagged as whiskey.
That's why we're transparent about provenance. AI-derived fields are clearly marked in exports and the interface. You can always pull up the original label image and check for yourself. The images are right there next to the data.
For most use cases, the AI extraction is accurate enough to be useful. For business-critical decisions, verify against the source.
The pipeline runs daily
New COLAs get approved every day. Our enrichment pipeline picks them up within 24 hours. You get fresh data with all the AI fields populated, not just the raw TTB fields.
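The incremental pickup is simple in principle: query for COLAs approved since the last run and enrich only those. A minimal sketch (the record structure, field names, and IDs below are invented for illustration):

```python
from datetime import date

def new_colas(records: list[dict], last_run: date) -> list[dict]:
    """Return only the records approved after the last pipeline run."""
    return [r for r in records if r["approval_date"] > last_run]

# Hypothetical records; real COLA data has many more fields.
records = [
    {"ttb_id": "example-0001", "approval_date": date(2024, 5, 1)},
    {"ttb_id": "example-0002", "approval_date": date(2024, 5, 3)},
]
pending = new_colas(records, last_run=date(2024, 5, 2))
```

Only the records that cleared approval after the cutoff go through enrichment, which keeps the daily batch small.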
We're also adding new extraction capabilities based on what customers ask for. If you have a use case that needs something we don't currently extract, let us know. We're always interested in hearing what would make this data more useful.