Semantics don't stop at the lake

Part 1 ended on a claim that is easy to nod at and hard to build. The data does not need to be in one place. The semantics do.

This is where the claim meets the engineering, and the hard part is not the rules. Lifting “active customer” out of a warehouse query is a morning’s work. The hard part is everything the rule depends on, and most of that sits in the lake you are reaching past.

A semantic layer is usually a description of tables

In most organizations the semantic layer is a set of definitions inside the warehouse. Models, a metrics layer, a stack of views. “Active customer” is a WHERE clause, “qualified pipeline” is a join, and it all works because everything it governs sits still where it is defined. The definition and the data share an address.

An event on the integration plane does not live at that address. When the service platform emits “ticket escalated,” nothing in that payload knows what your warehouse decided an account is. The definition lives in the lake while the data is already moving past it, which is why a table-bound semantic layer cannot govern data it cannot see. So do not copy the definitions outward, because copies drift. Make them callable instead, and then reckon with what it takes to feed them.

A definition is only as good as the data it runs on

Make “active customer” callable and you have lifted the rule out of the query. The rule still needs inputs the event does not carry: ninety days of purchase history, a churn flag, a contract status. All of it lives in the lake, and a callable definition does not help if the data it reads is hours behind.

This is the problem feature stores were built for: serve the same definition at rest and in flight, from two stores fed by the same logic. The offline store computes from the lake, and the online store serves the same features at the edge. One definition, two stores.
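A minimal sketch of “one definition, two stores,” in Python. Everything named here is hypothetical: the `CustomerFeatures` fields, the rule inside `is_active_customer`, and the two dictionaries standing in for the offline and online stores. The only point is that both read paths call the same function.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class CustomerFeatures:
    last_purchase: Optional[date]  # most recent purchase, None if never
    churn_flag: bool
    contract_status: str           # assumed values: "active", "expired"


def is_active_customer(f: CustomerFeatures, as_of: date) -> bool:
    """Assumed rule: bought within 90 days, not flagged, contract active."""
    if f.last_purchase is None:
        return False
    recent = as_of - f.last_purchase <= timedelta(days=90)
    return recent and not f.churn_flag and f.contract_status == "active"


# Dictionaries stand in for the offline (lake) and online (edge) stores.
offline_store = {"acct-1": CustomerFeatures(date(2024, 5, 1), False, "active")}
online_store = {"acct-1": CustomerFeatures(date(2024, 5, 1), False, "active")}


def active_at_rest(account_id: str, as_of: date) -> bool:
    return is_active_customer(offline_store[account_id], as_of)


def active_in_flight(event: dict, as_of: date) -> bool:
    # The event carries only an identifier; the inputs come from the online store.
    return is_active_customer(online_store[event["account_id"]], as_of)
```

The rule is written once. What differs between the two callers is where the inputs come from, and that is where the rest of the work goes.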

That world also learned something this series should not pretend away. The stores drift. The same feature reads one value online and another offline, and the name for it is training-serving skew. You do not eliminate it. You measure it, bound it, and decide how much each decision can tolerate.
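Measuring the drift can start small. The sketch below, under the assumption that both stores can be sampled as plain `{entity_id: value}` dictionaries for one numeric feature, counts how many shared entities differ by more than a tolerance. The function name, the sample values, and the tolerance are all illustrative.

```python
def skew_report(online: dict, offline: dict, tolerance: float) -> dict:
    """Compare one numeric feature across both stores, keyed by entity id."""
    shared = online.keys() & offline.keys()
    breaches = {
        key: (online[key], offline[key])
        for key in shared
        if abs(online[key] - offline[key]) > tolerance
    }
    return {
        "compared": len(shared),
        "breaches": breaches,
        "breach_rate": len(breaches) / len(shared) if shared else 0.0,
    }


# Hypothetical 90-day purchase counts for a few accounts.
online = {"acct-1": 3, "acct-2": 5, "acct-3": 1}
offline = {"acct-1": 3, "acct-2": 9, "acct-4": 2}
report = skew_report(online, offline, tolerance=1)
```

Here two accounts appear in both stores and one of them breaches the tolerance, so `breach_rate` is 0.5. The number each decision can live with is a business call, not something the code can supply.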

A definition that answers synchronously in the event path also makes its uptime your integration plane’s uptime. When it is down, does the event wait, drop, or pass through ungoverned? Decide that before you ship.
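One way to make that decision explicit is to put the failure policy in the call signature, so nobody ships without choosing. This is a sketch, not a prescription: `DefinitionUnavailable`, `OnFailure`, and `govern` are hypothetical names, and a real retry queue would be durable rather than a list.

```python
from enum import Enum


class DefinitionUnavailable(Exception):
    """Raised by the (hypothetical) definition service when it cannot answer."""


class OnFailure(Enum):
    WAIT = "wait"                  # park the event for a later retry
    DROP = "drop"                  # discard the event
    PASS_THROUGH = "pass_through"  # forward it, marked as ungoverned


def govern(event: dict, definition, policy: OnFailure, retry_queue: list):
    """Return the event to forward, or None if it is held or dropped."""
    try:
        verdict = definition(event)
    except DefinitionUnavailable:
        if policy is OnFailure.WAIT:
            retry_queue.append(event)
            return None
        if policy is OnFailure.DROP:
            return None
        return {**event, "governed": False}
    return {**event, "governed": True, "verdict": verdict}


def always_down(event: dict) -> bool:
    # Stand-in for a definition service that is offline.
    raise DefinitionUnavailable("service offline")
```

Marking a passed-through event with `governed: False` keeps the gap visible downstream instead of letting an ungoverned event look like a governed one.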

Resolve identity in flight, and expect to revise it

This is the piece I most often see underestimated. Take a real signal: a customer escalates for the third time this week. On the integration plane that is three events, each carrying whatever identifier the service platform holds. A contact. An email. An account string.

In the lake, identity resolves at rest. Overnight you match records into a golden customer using everything: the full population, transitive links, late records that redraw the cluster. By morning the three escalations resolve to one account, or to three, and you know which.

The edge does not have until morning. It decides now, on one event, whether this is the customer that escalated twice already. Sometimes it is right. Sometimes it cannot be, because the third ticket came from a contact at a subsidiary the lake already merged, and only the lake knows that. So the honest version is not “same logic, same answer.” The edge resolves with its best information now, and the lake may revise it later, which means the edge’s answer is provisional and the lake’s is the record. Build for the revision.
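“Build for the revision” can be sketched as a resolver that remembers what it answered and reports which answers a newer lake snapshot overturns. The class and field names are hypothetical, and the snapshot is assumed to be a simple mapping from identifier to golden account id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolution:
    account_id: str
    provisional: bool  # True for every edge answer; the lake's is the record


class EdgeResolver:
    """Resolves identifiers against the last identity snapshot from the lake."""

    def __init__(self, snapshot: dict):
        self.snapshot = dict(snapshot)  # identifier -> golden account id
        self.issued = {}                # identifier -> account id the edge answered

    def resolve(self, identifier: str) -> Resolution:
        # Unknown identifiers are treated as their own account for now.
        account_id = self.snapshot.get(identifier, identifier)
        self.issued[identifier] = account_id
        return Resolution(account_id, provisional=True)

    def revise(self, new_snapshot: dict) -> dict:
        """Load a newer snapshot; return {identifier: (old, new)} for changed answers."""
        self.snapshot = dict(new_snapshot)
        changed = {}
        for identifier, old in list(self.issued.items()):
            new = self.snapshot.get(identifier, identifier)
            if new != old:
                changed[identifier] = (old, new)
                self.issued[identifier] = new
        return changed


resolver = EdgeResolver({"jo@parent.example": "acct-parent"})
first = resolver.resolve("jo@parent.example")
third = resolver.resolve("sam@subsidiary.example")  # not in the snapshot yet
overturned = resolver.revise({
    "jo@parent.example": "acct-parent",
    "sam@subsidiary.example": "acct-parent",  # the lake merged the subsidiary
})
```

The edge answered that the subsidiary contact was a separate account. After the revision, `overturned` lists exactly that identifier, which is the hook for whatever correction follows.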

When the two planes disagree

They will. Stay with the escalation. The edge, reading three tickets as one account, fires a save play that alerts the account team and pauses the renewal email. Overnight the lake reconciles and finds three different customers. The signal that triggered the action was never real. That is the dead-account list from “What an Agent Is,” in real time instead of in a demo.

So name the rule before you need it. The lake is the system of record, the edge a provisional actor. The bridge does not promise one truth at one instant. It promises the same definition in two places, a bounded window in which they can differ, and a way to reconcile when they do. Eventual consistency is the model whether you plan for it or not. Planning for it means the system notices when an edge decision and the later record disagree, and acts: reverse, flag, or escalate to a human.
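The reconciliation step can be sketched as a function that compares one edge decision with the lake's later record and returns what to do. The decision fields, the `pending` state for a record that has not caught up, and the mapping of outcomes to reverse, flag, or escalate are assumptions for illustration, not a standard.

```python
def reconcile(decision: dict, record: dict) -> str:
    """Compare one edge decision with the lake's later record.

    decision: {"assumed_account", "event_ids", "reversible", "high_impact"}
    record:   event id -> account id, as resolved by the lake
    """
    missing = [e for e in decision["event_ids"] if e not in record]
    if missing:
        return "pending"  # the record has not caught up; check again later
    actual = {record[e] for e in decision["event_ids"]}
    if actual == {decision["assumed_account"]}:
        return "confirmed"
    if decision["reversible"]:
        return "reverse"
    return "escalate" if decision["high_impact"] else "flag"


# The save play from the escalation example: three tickets read as one account.
save_play = {
    "assumed_account": "acct-9",
    "event_ids": ["t1", "t2", "t3"],
    "reversible": True,
    "high_impact": True,
}
lake_record = {"t1": "acct-9", "t2": "acct-12", "t3": "acct-31"}
outcome = reconcile(save_play, lake_record)
```

The lake found three different customers, so `outcome` is `"reverse"`: resume the renewal email and tell the account team the alert was wrong. Had the action been irreversible, the same disagreement would go to a human.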

If your honest answer to “what happens when they disagree” is “they will not,” you have not built the bridge. You have hidden the seam and called it solved.

Version the meaning, not just the schema

Point-to-point integrations rot two ways. The structural way is easy to catch: a field renamed, a type changed. A schema registry with compatibility checks handles that, and you should have one.

The semantic way is the one that hurts. The schema does not change. The field still says “closed,” but the team upstream has changed what “closed” means, and now three systems downstream are wrong with nothing failing a check, because the shape of the data never moved. A schema registry will not catch this, and it is not built to.

So the versioned artifact cannot only be the schema. It has to be the definition: assertions about meaning, tests that fail when “closed” starts behaving differently, an owner who signs off when the meaning changes. This is the discipline that made the lake trustworthy in the first place, with definitions owned and changes deliberate. A governed definition has somewhere to keep its meaning and something to version. A scattered one does not.
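As a sketch of what versioning the meaning could look like, the definition below carries its owner, its version, and a handful of pinned examples that act as assertions. The structure, the owner name, and the two rules for “closed” are hypothetical. The pinned examples fail when the behavior changes, even though the record shape does not.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Definition:
    name: str
    version: str
    owner: str                    # who signs off when the meaning changes
    rule: Callable[[dict], bool]
    examples: Tuple[Tuple[dict, bool], ...]  # (record, expected) pairs


def failed_examples(defn: Definition) -> list:
    """Return the pinned examples the rule no longer satisfies."""
    return [(rec, want) for rec, want in defn.examples if defn.rule(rec) != want]


PINNED = (
    ({"status": "closed", "resolution": "fixed"}, True),
    ({"status": "closed", "resolution": None}, False),  # closed without a fix
)

closed_v1 = Definition(
    "closed", "1.0.0", "support-ops",
    rule=lambda r: r["status"] == "closed" and r["resolution"] is not None,
    examples=PINNED,
)

# Upstream quietly treats any closed status as closed. Same schema, new meaning.
closed_drifted = Definition(
    "closed", "1.0.0", "support-ops",
    rule=lambda r: r["status"] == "closed",
    examples=PINNED,
)
```

`failed_examples(closed_v1)` is empty. `failed_examples(closed_drifted)` returns the second pinned example, which is the check a schema registry has no way to run. A deliberate change would bump the version and update the examples with the owner's sign-off.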

Put the decision where its signal lives, and where you can trust it

Perishability tells you where the signal arrives first. Some signals decay in minutes: an escalation, a deal going quiet, a delivery that just missed its window. The integration plane sees those before the lake does. That is a reason to act there.

It is not reason enough. The edge is the freshest data you have and the thinnest. Even at rest, with the full picture in hand, bad data produces confident, wrong action. At the edge, with less context and a provisional identity, that risk goes up, not down. So the second test is confidence. When a perishable signal reads high-confidence, act at the edge. When it reads low-confidence, tell a human and do not act. That is earn-trust-first, applied to where the decision runs.

And first, ask the cheaper question. Could you just make the lake fresh? If the signal can land in seconds and the decision tolerates that, stream it in and keep one governed home. One place you trust beats two you keep in sync. Build the bridge only when the signal lives app-to-app and never cleanly reaches the lake, or when the action must fire before any lake could catch up. Do not build it to paper over a batch pipeline you could have made faster.
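Put together, the placement tests read as a short decision function. This is a sketch under assumed inputs: the parameter names, the single confidence score, and the 0.9 threshold are placeholders, and real confidence rarely reduces to one number.

```python
def place_decision(
    reaches_lake: bool,
    lake_latency_s: float,
    tolerable_delay_s: float,
    confidence: float,
    act_threshold: float = 0.9,
) -> str:
    """Decide where a signal should be acted on."""
    # Cheaper question first: can the lake be fresh enough for this decision?
    if reaches_lake and lake_latency_s <= tolerable_delay_s:
        return "decide_in_lake"
    # Otherwise the edge sees it first; act only on a high-confidence read.
    if confidence >= act_threshold:
        return "act_at_edge"
    return "notify_human"
```

A signal that lands in the lake in 5 seconds against a 60-second tolerance stays in the lake regardless of confidence. An app-to-app signal that never reaches the lake goes to the edge at 0.95 confidence and to a human at 0.6.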

The test

So the test is not the one I would have written first. It is not “identical answers at every instant.” It is this: the agent asks whether something is a customer, an account at risk, a deal still real, and the answer comes from the same definition whether it reads a row at rest or an event in flight. When the planes disagree, and they will, the system knows which one is the record and how to reconcile.

The lake was never the problem, and neither was the boundary. The problem was letting the semantics stop at it.

They travel as a service, not a schema: a definition you can call, an identity you can resolve and revise, a meaning you can version. Build those once. After that it does not much matter where the agent is standing.

