Data Modeling an Alcohol Product Catalog

By Jay Sobel · April 22, 2026

This is the data model I reach for when I think about alcohol products, plus the caveats that make the model annoying in real life.

What is an alcohol product?

For this post, an alcohol product is an item you can find on a retailer's shelf. Alcohol is a useful niche because most products share the same broad shape, even though the details get messy quickly.

The entity model

An alcohol product isn't a single entity. It's a composite of several related concepts, each adding specificity until you arrive at the thing someone can buy.

Company
  └─ Brand
       └─ Liquid ─────── category, ABV, region, producer
            └─ Vintage ── year (nullable)
                 └─ Container ── volume, type, material
                      ├─ Barcode(s)
                      ├─ Label image(s)
                      └─ Packaging ── count, format
                           └─ Product image(s)
                           = "a product"
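As a rough sketch, the hierarchy above could be written as Python dataclasses. The class and field names are illustrative assumptions rather than a reference schema, and barcodes and images are reduced to plain strings to keep it short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Brand:
    name: str
    company: Optional[str] = None    # owning company, if known


@dataclass(frozen=True)
class Liquid:
    brand: Brand
    name: str
    category: str
    abv: Optional[float] = None      # percent alcohol by volume
    region: Optional[str] = None
    producer: Optional[str] = None


@dataclass(frozen=True)
class Vintage:
    liquid: Liquid
    year: Optional[int] = None       # None means "no vintage"


@dataclass(frozen=True)
class Container:
    vintage: Vintage
    volume: float
    volume_unit: str                 # e.g. "oz", "ml"
    kind: str                        # can, bottle, keg
    material: str                    # glass, plastic, aluminum
    barcodes: Tuple[str, ...] = ()
    label_images: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Product:
    """Packaging around a container: the thing on the shelf."""
    container: Container
    count: int
    pack_format: str                 # box, can rings, carrier
    product_images: Tuple[str, ...] = ()

    def describe(self) -> str:
        c = self.container
        v = c.vintage
        year = f" {v.year}" if v.year is not None else ""
        return f"{v.liquid.name}{year}, {self.count} x {c.volume:g} {c.volume_unit} {c.kind}"


bud_light = Liquid(Brand("Bud Light"), "Bud Light", "beer")
can = Container(Vintage(bud_light), 12, "oz", "can", "aluminum")
six_pack = Product(can, 6, "can rings")
print(six_pack.describe())
```

Each level holds a reference to its parent, so a product can always be walked back up to its liquid and brand.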

Liquid

Bud Light is Bud Light. Many of the most interesting consumer-oriented dimensions of a product live at this grain. The liquid has a category, a producer, a region of origin, an ABV, and so on. We could say it has a density, but not a volume. It's the idea of the liquid.

Category is where the first mess shows up. The TTB's own class/type taxonomy doesn't map to how consumers or retailers think about products. Is a hard seltzer a beer, a malt beverage, or a flavored malt beverage? The answer depends on who you ask — the TTB, a retailer's planogram, or a shopper. Any catalog needs to decide whether to adopt the regulatory taxonomy, invent its own, or maintain a mapping between both.

Vintage

A liquid can come in many vintages. Wines (and some spirits and beers) are further distinguished by the year they were produced.

A liquid can also have a NULL vintage: Bud Light is just Bud Light. Some liquids even have a mix of specified and unspecified vintages.

Vintage is essentially a thin junction between a liquid and a year. There's not much to say about "2021" that applies to every liquid with that vintage. Some platforms model this out into regional seasonal dimensions so they can say that year X in region Y had certain characteristics. You could do the same for specific producers, but that's beyond the scope of the "product catalog" described here.
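One practical wrinkle with NULL vintages, if the catalog lives in SQL: a plain UNIQUE constraint treats NULLs as distinct, so the same liquid can pick up several "no vintage" rows. Here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the sentinel value 0 are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vintage (
        id        INTEGER PRIMARY KEY,
        liquid_id INTEGER NOT NULL,
        year      INTEGER            -- NULL means "no vintage"
    );
    -- A plain UNIQUE (liquid_id, year) would allow many (liquid, NULL) rows,
    -- because SQLite treats NULLs as distinct. Index on a sentinel instead.
    CREATE UNIQUE INDEX ux_vintage
        ON vintage (liquid_id, COALESCE(year, 0));
""")

conn.execute("INSERT INTO vintage (liquid_id, year) VALUES (1, NULL)")
conn.execute("INSERT INTO vintage (liquid_id, year) VALUES (1, 2021)")

try:
    # Second "no vintage" row for the same liquid should be rejected.
    conn.execute("INSERT INTO vintage (liquid_id, year) VALUES (1, NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vintage").fetchone()[0]
print(duplicate_rejected, row_count)
```

Other databases handle this differently, so treat the expression index as one option among several.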

Container

A liquid-vintage has many containers. Bud Light comes in a 12 oz can, a 16 oz can, a 12 oz bottle, and even a 12 oz bottle with a regional sports-team-themed label.

The key dimensions of a container are:

Volume
Volume units
Type (can, bottle, keg)
Material (glass, plastic, aluminum)

You could keep going, but there are diminishing returns to delineating different types of plastic bottles, no matter who the catalog's intended audience is. Unlike vintages, containers probably deserve their own table: they carry several dimensions that would be ugly to denormalize.
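Storing volume and units separately means comparisons need a common unit. A small sketch of that normalization, assuming US fluid ounces and a hypothetical `to_milliliters` helper:

```python
# Conversion factors to milliliters. US customary fluid ounces are assumed.
ML_PER_UNIT = {
    "ml": 1.0,
    "cl": 10.0,
    "l": 1000.0,
    "oz": 29.5735295625,   # US fluid ounce
}


def to_milliliters(volume: float, unit: str) -> float:
    """Convert a container volume to milliliters for comparison."""
    # Normalize spellings like "mL", "FL. OZ." or "fl oz" to a lookup key.
    key = unit.strip().lower().replace(".", "").replace("fl oz", "oz")
    try:
        return volume * ML_PER_UNIT[key]
    except KeyError:
        raise ValueError(f"unknown volume unit: {unit!r}") from None


print(round(to_milliliters(12, "oz")))   # a 12 oz can, in mL
print(to_milliliters(750, "mL"))         # a standard wine bottle
```

Keeping the original volume and unit alongside the normalized value preserves what the label actually says.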

And excitingly for COLA Cloud: a container has label image(s).

Packaging

A "six pack" is a type of packaging that contains six containers. A single bottle of wine is a 1-pack. A four-pack can come in a box or be joined with can rings.

The key dimensions of packaging are:

Number of containers
Type (box, can rings, carrier)
Product images

With packaging, we've finally reached an item you find on a retailer's shelf: "a product."

The variety pack problem. Can a packaging contain multiple liquid-vintage-containers? This increases complexity considerably, but the only alternative is to call "Variety Pack Seltzer" a liquid, which seems similarly incorrect. In practice, treating the variety pack as its own liquid — with its own brand association, category, and label — is a reasonable compromise. You lose the relationship between the pack and its constituent flavors, but you avoid a many-to-many join that most access patterns don't need.

Pragmatism over completeness

If the entity model above made you nervous about scope, good. Data modeling encourages completism: enumerate every possible feature of a product. Production systems punish that instinct. Information is never fully available, and not every feature creates value for your business.

The worth of a data model is what you can do with it — not how many theoretical futures it supports. Plenty of successful alcohol marketplaces have shipped without properly modeling vintages. Customers notice sometimes, but it usually is not the thing that decides whether the business works.

If you're trying to sell products online, you might get better results putting all your effort into photos, and very little into ABVs and vintages. Mapping the whole space is a waste of time. Unknowns need to be prioritized.

The calculus has changed, though. In 2022, increasing detail coverage in your catalog usually meant paying a person to run Google searches and type into spreadsheets. A lot of that work can now be automated with AI for dramatically less money, and often with better consistency.

If the answer exists digitally, then accessing it is really just a dollar cost.

Barcodes

Barcodes are where the catalog stops being a tidy internal model and starts touching other people's systems.

Ideally, one or more barcodes relate to a combination of liquid-vintage-container-packaging, and uniquely identify that thing. That would be a nice world to live in. The key dimensions of a barcode are its type (UPC is one of several) and value.

QR codes can sit under the same umbrella. They can appear on a container's label or packaging, and typically resolve to a URL. QR codes can also contain a data payload, like a name and phone number.

Unfortunately, UPC barcodes are a privately run system: GS1, a not-for-profit standards organization, licenses number ranges for a fee. The biggest enforcers of sanitary UPC practices are major retailers like Costco, who won't stock products that don't follow sane UPC conventions because sorting them out would be too much of an operational headache.

Here are several UPC bad practices, in descending order of infamy:

Seasonal beers sharing a UPC year-round. Different liquids, same UPC.
Vintages sharing the same UPC across years. Different vintage, same UPC.
Container UPC matching the packaging UPC. Cans joined with can rings where each individual can has the same barcode as the pack.
Truncation and/or padding of leading and trailing zeros in point-of-sale systems.
Truncation of leading zeros by Excel auto-detection. An absolute classic.
Self-assigned barcodes. Producers who make up their own UPC rather than registering with GS1.
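International format differences. Europe uses EAN-13, a 13-digit format that is usually cleaner in practice than UPC. A 12-digit UPC-A becomes a valid EAN-13 once a leading zero is added, so the same product can show up under either form.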
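Some of these problems, such as dropped leading zeros and UPC versus EAN formatting, can be caught mechanically. The sketch below pads values to 14 digits and verifies the GS1 check digit. The function names are hypothetical, the minimum length of 8 is an assumption, and the barcode values are illustrative samples. It does not expand zero-suppressed UPC-E codes, and it cannot recover truncated trailing zeros.

```python
from typing import Optional


def gs1_check_digit(body: str) -> int:
    """Check digit for a GTIN body (all digits except the last one)."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        weight = 3 if i % 2 == 0 else 1   # rightmost body digit gets weight 3
        total += int(ch) * weight
    return (10 - total % 10) % 10


def normalize_gtin(raw: str) -> Optional[str]:
    """Return a 14-digit GTIN string, or None if the value cannot be trusted."""
    digits = "".join(ch for ch in raw if "0" <= ch <= "9")
    if not 8 <= len(digits) <= 14:
        return None
    padded = digits.zfill(14)   # restores leading zeros; check digit unaffected
    if gs1_check_digit(padded[:-1]) != int(padded[-1]):
        return None
    return padded


print(normalize_gtin("036000291452"))   # 12-digit UPC-A
print(normalize_gtin("36000291452"))    # same code after a spreadsheet ate the zero
print(normalize_gtin("036000291453"))   # bad check digit
```

Normalizing to one padded form lets the UPC-A and EAN-13 spellings of the same code compare equal. A valid check digit says nothing about whether the code was reused across liquids or vintages.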

Product photos

In ecommerce, photos are the most valuable single piece of information on a page. They do a job ABV and description text cannot: prove to the shopper that this exact thing exists.

As mentioned in the packaging section, a packaging has photos. A photo might depict solely a container, but that can be considered the NULL packaging. And a container also has label images, which can serve as an image of last resort.

Photos feel a little like barcodes. The practical access pattern often needs fallbacks: from the specific packaging photo, to any container photo, to a label image, to anything associated with the liquid, paired with a notice that the depiction may be inaccurate. Photos are valuable enough to deserve that machinery.
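The fallback chain can be sketched as a function that walks from the most specific image source to the least specific. The function name and the list-of-strings representation are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple


def pick_image(
    packaging_photos: Sequence[str],
    container_photos: Sequence[str],
    label_images: Sequence[str],
    liquid_images: Sequence[str],
) -> Tuple[Optional[str], bool]:
    """Return (image, may_be_inaccurate) from the most specific source available."""
    tiers = [
        (packaging_photos, False),  # exact product
        (container_photos, True),   # right container, packaging unknown
        (label_images, True),       # label only
        (liquid_images, True),      # anything tied to the liquid
    ]
    for images, inexact in tiers:
        if images:
            return images[0], inexact
    return None, False


image, show_notice = pick_image([], [], ["label.png"], ["other.png"])
print(image, show_notice)
```

The boolean is what drives the "depiction may be inaccurate" notice on the page.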

Brand and company

This is the least fleshed-out area of this post. You could model brands belonging to companies belonging to conglomerates, and then deal with brands changing hands, which happens constantly in this industry.

The minimal implementation would be to cover brands, and relate liquids to brands. A nice simple relationship. Several dimensions of a liquid could be relocated to the brand. A brand has a geographic relationship that generally supersedes that of a liquid. A brand also has a year when it was established, and a collection of products that a consumer might be interested in. Because an acquisition can rename or reorganize a brand, the liquid-to-brand relationship isn't as permanent as it first appears.
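If ownership changes matter to your use case, one option is to date-stamp the brand-to-company relationship instead of storing a single owner. This is a sketch under that assumption. The class, the function, and the brand and company names are all made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class BrandOwnership:
    brand: str
    company: str
    start: date
    end: Optional[date] = None   # None means "still the owner"


def owner_on(history: Iterable[BrandOwnership], brand: str, day: date) -> Optional[str]:
    """Return the company that owned `brand` on `day`, if the history covers it."""
    for row in history:
        if row.brand == brand and row.start <= day and (row.end is None or day < row.end):
            return row.company
    return None


# Hypothetical history: one brand, one acquisition.
history = [
    BrandOwnership("Example Ale", "Founder Brewing Co", date(2010, 1, 1), date(2021, 6, 1)),
    BrandOwnership("Example Ale", "Big Beverage Group", date(2021, 6, 1)),
]
print(owner_on(history, "Example Ale", date(2015, 1, 1)))
```

For a consumer-facing catalog this is probably overkill; a single current owner per brand may be all you need.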

If you're building a B2B service that serves supplier companies, then a company model becomes very important, perhaps even more so than liquids. Useful dimensions of a company might include its number of brands, number of products, age, and ownership structure.

Identity and deduplication

This is probably the hardest problem in building an alcohol catalog, and it cuts across every entity described above. There is no universal product identifier for alcohol.

UPCs come close, but as the barcodes section illustrates, they're unreliable. A liquid has no natural key at all — "Bud Light" is identifiable by name, but a Côtes du Rhône from a small producer might appear in your data under several spellings, with or without accents, sometimes under the producer's name and sometimes under the importer's.

In practice, identity resolution is a matching problem. You define a set of candidate keys (UPC, brand + product name + volume, permit number + serial number) and build a pipeline that proposes matches, scores confidence, and surfaces ambiguity for review. The catalog's deduplication strategy will shape more downstream decisions than any individual entity's schema.
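Here is a minimal sketch of one piece of such a pipeline: scoring two records on the "brand + product name + volume" candidate key. The record fields, the thresholds, and the producer name are assumptions, and difflib's similarity ratio stands in for whatever matcher you would actually tune.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def match_score(a: dict, b: dict) -> float:
    """Score two records in [0, 1] on brand + product name, blocked by volume."""
    vol_a, vol_b = a.get("volume_ml"), b.get("volume_ml")
    if vol_a is not None and vol_b is not None and vol_a != vol_b:
        return 0.0   # different container sizes are different items at this grain
    name_a = normalize_name(f"{a.get('brand', '')} {a.get('name', '')}")
    name_b = normalize_name(f"{b.get('brand', '')} {b.get('name', '')}")
    return SequenceMatcher(None, name_a, name_b).ratio()


def classify(score: float, accept: float = 0.92, review: float = 0.75) -> str:
    """Turn a score into an action; thresholds are placeholders to tune."""
    if score >= accept:
        return "match"
    if score >= review:
        return "review"
    return "distinct"


# Hypothetical records for the same wine, spelled two ways.
rec_a = {"brand": "Domaine Exemple", "name": "Côtes du Rhône", "volume_ml": 750}
rec_b = {"brand": "DOMAINE EXEMPLE", "name": "Cotes du  Rhone", "volume_ml": 750}
print(classify(match_score(rec_a, rec_b)))
```

The middle "review" band is where ambiguous pairs get surfaced to a person instead of being merged or split automatically.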

How COLAs fit in

If "a product" is a combination of entity relationships, then where does a COLA fit in?

The data model described above is just scaffolding. Once the model exists, you still need rows. This is where COLAs come in.

The registry

The COLA Registry contains over 2.9 million label approvals for alcohol products sold in the United States. It grows by approximately 3,000 approvals each week. Every COLA contains a product label, which fits into this model at the granularity of a container:

┌───────────────────── COLA Record ─────────────────────┐
│                                                       │
│  Brand name, permit info    ────────→  Brand          │
│  Class/type, product name   ────────→  Liquid         │
│  Vintage (if present)       ────────→  Vintage        │
│  Container size             ────────→  Container      │
│  Label images               ────────→  Images         │
│  Barcodes (via OCR)         ────────→  Barcodes       │
│                                                       │
│  Note: Packaging is NOT in a COLA record              │
└───────────────────────────────────────────────────────┘

The label contains information about the liquid, vintage, container, brand, and company. Labels often contain UPC and QR barcodes as well. These are incidental, not required by the TTB, but they provide a high-confidence key for catalog integration.

Catalog integration

The most basic integration is a UPC join: match COLA records against existing catalog entries on barcode value. This gets you label images, structured product details, and a regulatory paper trail for every match. Match rates vary — expect higher coverage for wine and spirits than for beer, where UPC hygiene is worse and in-state-only products are more common.
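A UPC join can be sketched in a few lines. The record shapes here (`sku`, `upc`, `cola_id`, `barcodes`) are hypothetical field names, not the actual COLA schema, and the barcode values are illustrative samples.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def barcode_key(barcode: str) -> str:
    """Digits only, left-padded to 14, so UPC-A and EAN-13 forms line up."""
    digits = "".join(ch for ch in barcode if "0" <= ch <= "9")
    return digits.zfill(14) if digits else ""


def upc_join(catalog: List[dict], colas: List[dict]) -> Tuple[Dict[str, List[str]], float]:
    """Map catalog SKUs to matching COLA ids and report the match rate."""
    by_barcode = defaultdict(list)
    for cola in colas:
        for barcode in cola.get("barcodes", []):
            key = barcode_key(barcode)
            if key:
                by_barcode[key].append(cola["cola_id"])

    matches = {}
    for item in catalog:
        key = barcode_key(item.get("upc", ""))
        if key and key in by_barcode:
            matches[item["sku"]] = by_barcode[key]   # may be several COLAs
    match_rate = len(matches) / len(catalog) if catalog else 0.0
    return matches, match_rate


catalog = [
    {"sku": "A1", "upc": "036000291452"},
    {"sku": "B2", "upc": "4006381333931"},
]
colas = [
    {"cola_id": "C-1", "barcodes": ["0036000291452"]},   # EAN-13 form of A1's UPC
    {"cola_id": "C-2", "barcodes": []},
]
matches, match_rate = upc_join(catalog, colas)
print(matches, match_rate)
```

The join keeps a list of COLA ids per SKU because more than one approval can carry the same barcode.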

A more ambitious integration would use COLA data to insert new records into a catalog. Newly approved COLAs represent products that likely haven't reached the market yet. For wine and spirits, packaging is almost always a single bottle, so a COLA maps nearly 1:1 to a product. For beer, you'll need to account for the fact that a single liquid might appear across many container and packaging combinations, only one of which is represented by the COLA.

COLA caveats

When comparing COLAs to products, keep a few caveats in view.

Not every product requires a COLA. Regulatory lawyers spend whole careers on these details, so take this list as broad data guidance, not legal advice.

Non-material label changes. Changes to graphics, colors, vintage years, or UPCs do not require a new COLA.
Minor container size differences. Slightly different container sizes do not need separate COLAs.
In-state-only products. Products sold only within one US state. The TTB is a federal agency, after all.

Imports create duplicates. Imported products are submitted to the TTB by their US-based importers. Duplicate COLAs can be created when different importers submit applications for the same imported product's label. Imported products make up a large portion of the dataset (and the market itself), and they need to be considered with extra care.

Mail-in submissions still exist. Around 0.3% of approvals will display as large scans of physical documents, rather than structured electronic submissions. In these scans, the "label images" are often physically attached to the sheet of paper. COLA Cloud's AI enrichment pipeline processes these to extract structured data, but they're inherently noisier than their electronic counterparts.

Closing thoughts

Alcohol products are deceptively complex to model. The entity hierarchy from liquid down to packaging is intuitive enough. The real problems — taxonomy, identity, barcode hygiene, brand ownership — only become obvious once real data hits the schema. Start with the simplest model that serves your access patterns, and let the data tell you where the model needs more structure.