April 2026

Data Modeling an Alcohol Product Catalog

Here I'll outline a data model for alcohol products, and discuss every nuance I know in the space.

What is an alcohol product?

I'll define an alcohol product as an item you can find on a retailer's shelf. Alcohol products are a nice little niche because they share many characteristics, and their differences are not so vast that they can't be lumped together.

The entity model

An alcohol product isn't a single entity. It's a composite of several related concepts, each adding one dimension of specificity until you arrive at the thing on the shelf.

Company
  └─ Brand
       └─ Liquid ─────── category, ABV, region, producer
            └─ Vintage ── year (nullable)
                 └─ Container ── volume, type, material
                      ├─ Barcode(s)
                      ├─ Label image(s)
                      └─ Packaging ── count, format
                           └─ Product image(s)
                           = "a product"
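As a rough sketch, the hierarchy above could be written as nested Python dataclasses. The class and field names are my own illustration rather than a fixed schema, and the brand is flattened to a string on the liquid to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Liquid:
    brand: str                      # a fuller model would reference a Brand entity
    name: str
    category: str
    abv: Optional[float] = None     # percent alcohol by volume, if known
    region: Optional[str] = None


@dataclass(frozen=True)
class Vintage:
    liquid: Liquid
    year: Optional[int] = None      # None models the "NULL vintage"


@dataclass(frozen=True)
class Container:
    vintage: Vintage
    volume: float
    volume_unit: str                # e.g. "oz" or "ml"
    type: str                       # can, bottle, keg
    material: str                   # glass, plastic, aluminum


@dataclass(frozen=True)
class Packaging:
    container: Container
    count: int                      # number of containers in the pack
    format: str                     # box, can rings, carrier


# One "product": a six-pack of 12 oz Bud Light cans, with no vintage.
bud_light = Liquid(brand="Bud Light", name="Bud Light", category="beer")
can_12oz = Container(Vintage(bud_light), 12, "oz", "can", "aluminum")
six_pack = Packaging(can_12oz, count=6, format="can rings")
```

Each level holds a reference to its parent, so walking up from the packaging recovers the full product description.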

Liquid

Bud Light is Bud Light. Many of the most interesting consumer-oriented dimensions of a product live at this grain. The liquid has a category, a producer, a region of origin, an ABV, and so on. We could say it has a density, but not a volume. It's the idea of the liquid.

Category deserves a special mention because alcohol categorization is genuinely messy. The class/type taxonomy used by the TTB (the Alcohol and Tobacco Tax and Trade Bureau, the US federal regulator) doesn't map to how consumers or retailers think about products. Is a hard seltzer a beer, a malt beverage, or a flavored malt beverage? The answer depends on who you ask: the TTB, a retailer's planogram, or a shopper. Any catalog needs to decide whether to adopt the regulatory taxonomy, invent its own, or maintain a mapping between the two.
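The "maintain a mapping" option might look something like the sketch below. The regulatory strings and catalog categories are placeholders I made up for illustration, not the TTB's actual class/type values, and catalog_category is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative only: placeholder values, not the TTB's real class/type list.
REGULATORY_TO_CATALOG: Dict[str, str] = {
    "malt beverage": "beer",
    "flavored malt beverage": "hard seltzer",
}


def catalog_category(regulatory_class: str,
                     overrides: Optional[Dict[str, str]] = None) -> str:
    """Translate a regulatory class into the catalog's own category.

    Unknown classes fall through to "uncategorized" so that gaps in the
    mapping stay visible instead of being silently guessed.
    """
    key = regulatory_class.strip().lower()
    if overrides and key in overrides:
        return overrides[key]
    return REGULATORY_TO_CATALOG.get(key, "uncategorized")
```

Keeping the regulatory value on the record and deriving the consumer-facing category through a table like this means the mapping can change without rewriting the source data.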

Vintage

A liquid can come in many vintages. Wines (and some spirits and beers) are further distinguished by a year; for wine, that is the year the grapes were harvested.

A liquid can also come in a NULL vintage: Bud Light is just Bud Light. A single liquid can even have a mix of specified and unspecified vintages.

Vintage is essentially a thin junction between a liquid and a year. There's not much to say about 2021 the year that applies to all associated liquids. Some platforms model this out into regional seasonal dimensions so they can say that year X in region Y had certain characteristics. You could do the same for specific producers, but that's beyond the scope of the "product catalog" described here.

Container

A liquid-vintage has many containers. Bud Light comes in a 12 oz can and a 16 oz can and a 12 oz bottle and even a 12 oz bottle with a regional sports team themed label.

The key dimensions of a container are:

  • Volume
  • Volume units
  • Type (can, bottle, keg)
  • Material (glass, plastic, aluminum)

You could go further, but there are diminishing returns to delineating different types of plastic bottles, no matter who the catalog's intended audience is. Unlike vintages, containers may be worth maintaining as a separate table, because they have several dimensions to track that would be ugly to denormalize.
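Storing volume and units separately usually implies a normalization step when comparing containers. Here is a minimal sketch, assuming milliliters as the canonical unit and US fluid ounces for "oz"; the function name and the unit table are my own.

```python
# Milliliters per unit. "oz" is assumed to mean the US fluid ounce.
ML_PER_UNIT = {
    "ml": 1.0,
    "l": 1000.0,
    "oz": 29.5735295625,
}


def to_milliliters(volume: float, unit: str) -> float:
    """Convert a container volume to milliliters, rounded to 0.1 ml."""
    try:
        factor = ML_PER_UNIT[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown volume unit: {unit!r}") from None
    return round(volume * factor, 1)


print(to_milliliters(12, "oz"))   # 354.9
print(to_milliliters(750, "ml"))  # 750.0
```

Rounding keeps a "12 oz" record and a "355 ml" record close enough to compare with a small tolerance, though they will not be exactly equal.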

And excitingly for COLA Cloud: a container has label image(s).

Packaging

A "six pack" is a type of packaging that contains six containers. A single bottle of wine is a 1-pack. And a four-pack can come in a box, or joined using can rings.

The key dimensions of packaging are:

  • Number of containers
  • Type (box, can rings, carrier)
  • Product images

With packaging, we've finally reached an item you find on a retailer's shelf: "a product."

The variety pack problem. Can a packaging contain multiple liquid-vintage-containers? This increases complexity considerably, but the only alternative is to call "Variety Pack Seltzer" a liquid, which seems similarly incorrect. In practice, treating the variety pack as its own liquid — with its own brand association, category, and label — is a reasonable compromise. You lose the relationship between the pack and its constituent flavors, but you avoid a many-to-many join that most access patterns don't need.

Pragmatism over completeness

If the entity model above made you nervous about scope, good. Data modeling encourages a completeness of thought: enumerate every possible feature of a product. But in the real world, information is never fully available, and not every feature creates value for your business.

Ultimately, the worth of a data model lies in what you actually do with it, not in its infinite reconfigurability. Plenty of successful alcohol marketplaces have shipped without properly modeling vintages. Their customers noticed occasionally, but it wasn't a dealbreaker.

If you're trying to sell products online, you might get better results putting all your effort into photos, and very little into ABVs and vintages. Mapping the whole space is a waste of time. Unknowns need to be prioritized.

I will add, however, that we increasingly live in a landscape of cheap, automatic digital omniscience. In 2022 if you wanted to increase the coverage of details in your catalog, you likely had to pay a real person to spend time running Google searches and typing into spreadsheets. This can now be automated with AI for several orders of magnitude less cost, and likely with more accurate results.

If the answer exists digitally, then accessing it is really just a dollar cost.

Barcodes

This is the first real foray from "a catalog of products" toward a component of a larger system that needs to integrate with other systems.

Ideally, each combination of liquid-vintage-container-packaging has one or more barcodes, and each barcode identifies exactly one such combination. That would be a great world to live in. The key dimensions of a barcode are its type (UPC is one of several) and value.

QR codes could arguably fall under this same umbrella. They can appear on a container's label or packaging, and typically resolve to a URL. QR codes can also contain a data payload, like a name and phone number.

Unfortunately, UPC barcodes are a privately organized system: GS1, a not-for-profit standards organization, licenses number ranges to companies for a fee. The biggest enforcers of sanitary UPC practices are major retailers like Costco, who won't stock products that don't follow sane UPC conventions because it would be too much of an operational headache to sort them out.

Here are several UPC bad practices, in descending order of infamy:

  1. Seasonal beers sharing a UPC year-round. Different liquids, same UPC.
  2. Vintages sharing the same UPC across years. Different vintage, same UPC.
  3. Container UPC matching the packaging UPC. Cans joined with can rings where each individual can has the same barcode as the pack.
  4. Truncation and/or padding of leading and trailing zeros in point-of-sale systems.
  5. Truncation of leading zeros by Excel auto-detection. An absolute classic.
  6. Self-assigned barcodes. Producers who make up their own UPC rather than registering with GS1.
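  7. International format differences. Europe uses EAN-13, a 13-digit format that is different from, and arguably superior to, the 12-digit UPC-A. A UPC-A becomes a valid EAN-13 when prefixed with a zero, so the two are easy to confuse after padding or truncation.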
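Some of the zero-related problems above can be blunted by normalizing every barcode to a common width and verifying its check digit. The sketch below left-pads to 14 digits and applies the standard GS1 check-digit rule (weights alternate 3 and 1 from the right). The function names are mine, and it assumes plain UPC-A/EAN-13 values: compressed UPC-E codes and truncated trailing zeros are not handled.

```python
from typing import Optional


def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def normalize_gtin(raw: str) -> Optional[str]:
    """Left-pad to 14 digits; return None if the value fails validation."""
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 14:
        return None
    padded = digits.zfill(14)
    if gtin_check_digit(padded[:-1]) != int(padded[-1]):
        return None
    return padded


# A UPC-A, the same value after Excel dropped its leading zero, and the
# EAN-13 form all normalize to one key.
print(normalize_gtin("036000291452"))   # 00036000291452
print(normalize_gtin("36000291452"))    # 00036000291452
print(normalize_gtin("0036000291452"))  # 00036000291452
```

Left-padding works because leading zeros contribute nothing to the weighted sum, so the check digit is unchanged. A valid check digit only shows the number is well-formed; it says nothing about whether the producer actually licensed it.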

Product photos

In ecommerce, photos are the most valuable single piece of information on a page. They communicate to the user that this product really does exist in a way that ABV and a description fail to.

As mentioned in the packaging section, a packaging has photos. A photo might depict solely a container, but that can be considered the NULL packaging. And a container also has label images, which can serve as an image of last resort.

Photos feel a little like barcodes. Ultimately the access pattern for images may involve fallbacks: from the specific packaging photo, to any container photo, to a label image, to anything associated with the liquid, paired with a notice that the depiction may be inaccurate. Photos are valuable enough to warrant such treatment.
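The fallback chain could be sketched as a simple ordered lookup. This assumes images are already grouped by the level they belong to; the level names and pick_image are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Most specific first, following the fallback order described above.
FALLBACK_ORDER = ("packaging", "container", "label", "liquid")


def pick_image(images: Dict[str, List[str]]) -> Optional[Tuple[str, bool]]:
    """Return (url, is_exact) for the most specific image available.

    is_exact is False for any fallback, which is the cue to show the
    "depiction may be inaccurate" notice.
    """
    for level in FALLBACK_ORDER:
        urls = images.get(level) or []
        if urls:
            return urls[0], level == "packaging"
    return None


print(pick_image({"label": ["label.png"], "liquid": ["generic.png"]}))
# ('label.png', False)
```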

Brand and company

This is the least fleshed-out area of this post. It could be structured as brands belonging to companies belonging to conglomerates, and it would need to cope with brands changing hands, as they often do in this industry.

The minimal implementation would be to cover brands, and relate liquids to brands. A nice simple relationship. Several dimensions of a liquid could be relocated to the brand. A brand has a geographic relationship that generally supersedes that of a liquid. A brand also has a year when it was established, and a collection of products that a consumer might be interested in. Note that brands do change hands — an acquisition can rename or reorganize a brand, which means the liquid-to-brand relationship isn't as permanent as it first appears.
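One way to cope with brands changing hands is to record ownership as date ranges rather than as a single foreign key. This is a minimal sketch with made-up brand and company names; Ownership and owner_on are illustrative, not part of any existing schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Ownership:
    brand: str
    company: str
    start: date
    end: Optional[date] = None   # None means "still the owner"


def owner_on(history: List[Ownership], brand: str, day: date) -> Optional[str]:
    """Return the company that owned the brand on a given day, if known."""
    for row in history:
        # Ranges are half-open: the end date belongs to the next owner.
        if row.brand == brand and row.start <= day and (row.end is None or day < row.end):
            return row.company
    return None


history = [
    Ownership("Example Brand", "Company A", date(2010, 1, 1), date(2020, 6, 1)),
    Ownership("Example Brand", "Company B", date(2020, 6, 1)),
]
print(owner_on(history, "Example Brand", date(2015, 3, 1)))  # Company A
print(owner_on(history, "Example Brand", date(2021, 3, 1)))  # Company B
```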

If you're building a B2B service that serves supplier companies, then a company model becomes very important — perhaps even more so than liquids. Useful dimensions of a company might include their number of brands, number of products, age, and ownership structure.

Identity and deduplication

This is arguably the hardest problem in building an alcohol catalog, and it cuts across every entity described above. There is no universal product identifier for alcohol.

UPCs come close, but as the barcodes section illustrates, they're unreliable. A liquid has no natural key at all — "Bud Light" is identifiable by name, but a Côtes du Rhône from a small producer might appear in your data under several spellings, with or without accents, sometimes under the producer's name and sometimes under the importer's.

In practice, identity resolution becomes a matching problem. You define a set of candidate keys (UPC, brand + product name + volume, permit number + serial number) and build a pipeline that proposes matches, scores confidence, and surfaces ambiguity for review. The catalog's deduplication strategy will shape more downstream decisions than any individual entity's schema.
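A toy version of that pipeline is sketched below: normalize names so that accents and spacing don't block a match, score a pair of records against the candidate keys, then route the result. The record fields, the scores, and the thresholds are all placeholders I chose for illustration; real values would need tuning against labeled data.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Case-fold, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def match_score(a: dict, b: dict) -> float:
    """Score two records against the candidate keys, strongest key first."""
    if a.get("upc") and a.get("upc") == b.get("upc"):
        return 0.9  # placeholder: strong, but UPCs do get reused
    same_name = (
        normalize_name(a["brand"]) == normalize_name(b["brand"])
        and normalize_name(a["name"]) == normalize_name(b["name"])
    )
    same_volume = (
        a.get("volume_ml") is not None
        and a.get("volume_ml") == b.get("volume_ml")
    )
    if same_name and same_volume:
        return 0.7  # placeholder
    return 0.3 if same_name else 0.0


def triage(score: float) -> str:
    """Route a proposed match: accept, send to human review, or reject."""
    if score >= 0.85:
        return "accept"
    return "review" if score >= 0.5 else "reject"


a = {"brand": "Domaine Exemple", "name": "Côtes du Rhône", "volume_ml": 750}
b = {"brand": "DOMAINE EXEMPLE", "name": "Cotes du  Rhone", "volume_ml": 750}
print(triage(match_score(a, b)))  # review
```

Exact comparison after normalization is the simplest possible matcher; misspellings and the producer-versus-importer naming problem would need fuzzy matching on top of it.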

How COLAs fit in

If "a product" is a combination of entity relationships, then where does a COLA (a Certificate of Label Approval, issued by the TTB) fit in?

The data model described above is the scaffolding of a product catalog. Once the model is defined, the actual data needs to be filled in, and this is where COLAs come into play.

The registry

The COLA Registry contains over 2.9 million label approvals for alcohol products sold in the United States. It grows by approximately 3,000 approvals each week. Every COLA contains a product label, which fits into this model at the granularity of a container:

┌───────────────────── COLA Record ─────────────────────┐
│                                                       │
│  Brand name, permit info   ────────→  Brand           │
│  Class/type, product name  ────────→  Liquid          │
│  Vintage (if present)      ────────→  Vintage         │
│  Container size            ────────→  Container       │
│  Label images              ────────→  Images          │
│  Barcodes (via OCR)        ────────→  Barcodes        │
│                                                       │
│  Note: Packaging is NOT in a COLA record              │
└───────────────────────────────────────────────────────┘

The label contains information about the liquid, vintage, container, brand, and company. Labels often contain UPC and QR barcodes — incidental, not required by the TTB, but they provide a high degree of confidence for catalog integration.

Catalog integration

The most basic integration is a UPC join: match COLA records against existing catalog entries on barcode value. This gets you label images, structured product details, and a regulatory paper trail for every match. Match rates vary — expect higher coverage for wine and spirits than for beer, where UPC hygiene is worse and in-state-only products are more common.
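In its simplest form, the UPC join is a dictionary lookup. The sketch below assumes catalog items carry a single "upc" field and COLA records carry a list of "upcs" read from the label; these field names and the sample values are placeholders, not COLA Cloud's actual record format.

```python
def join_on_upc(catalog, colas):
    """Return (matched, unmatched); matched pairs each item with its COLAs."""
    colas_by_upc = {}
    for cola in colas:
        for upc in cola.get("upcs", []):
            colas_by_upc.setdefault(upc, []).append(cola)

    matched, unmatched = [], []
    for item in catalog:
        hits = colas_by_upc.get(item.get("upc"), [])
        if hits:
            matched.append((item, hits))
        else:
            unmatched.append(item)
    return matched, unmatched


catalog = [
    {"sku": "A", "upc": "00036000291452"},
    {"sku": "B", "upc": None},
]
colas = [{"cola_id": "example-1", "upcs": ["00036000291452"]}]

matched, unmatched = join_on_upc(catalog, colas)
match_rate = len(matched) / len(catalog)
print(match_rate)  # 0.5
```

Each item maps to a list of COLAs rather than a single record because one UPC can legitimately appear on several approvals.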

A more ambitious integration would use COLA data to insert new records into a catalog. Newly approved COLAs represent products that likely haven't reached the market yet. For wine and spirits, packaging is almost always a single bottle, so a COLA maps nearly 1:1 to a product. For beer, you'll need to account for the fact that a single liquid might appear across many container and packaging combinations, only one of which is represented by the COLA.

COLA caveats

There are a couple of caveats to be aware of when likening COLAs to products.

Not every product requires a COLA. There are law practices that specialize in these details, so take my list as the biggest generalities. This is data advice, not legal advice!

  1. Non-material label changes. Changes to graphics, colors, vintage years, or UPCs do not require a new COLA.
  2. Minor container size differences. Slightly different container sizes do not need separate COLAs.
  3. In-state-only products. Products sold only within one US state. The TTB is a federal agency, after all.

Imports create duplicates. Imported products are submitted to the TTB by their US-based importers. Duplicate COLAs can be created when different importers submit applications for the same imported product's label. Imported products make up a large portion of the dataset (and the market itself), and they need to be considered with extra care.

Mail-in submissions still exist. Around 0.3% of approvals will display as large scans of physical documents, rather than structured electronic submissions. In these scans, the "label images" are often physically attached to the sheet of paper. COLA Cloud's AI enrichment pipeline processes these to extract structured data, but they're inherently noisier than their electronic counterparts.

Closing thoughts

Alcohol products are deceptively complex to model. The entity hierarchy from liquid down to packaging is intuitive enough, but the real challenges — taxonomy, identity, barcode hygiene, brand ownership — are the kind that only surface once real data hits the schema. Start with the simplest model that serves your access patterns, and let the edge cases tell you where to invest next.