
Why Data Matters with AI

AI security is only as good as its data. ORDR's CTO explains why fragmented, stale asset data is the hidden reason most AI security tools quietly fail—and what good data actually requires.

THE SHORT VERSION
Every conversation about AI in security right now focuses on the reasoning layer: the agents, the copilots, the conversational interfaces. Almost no one is talking about the foundation those reasoning layers are supposed to sit on. That foundation is the data layer, and it is where most AI security products quietly fail. You cannot throw shallow, duplicated, fragmented, stale data at an AI and expect intelligent answers. You get confident wrong answers, faster. The work of building a clean, accurate, comprehensive data layer is unglamorous, and it must be earned in the field: in the world’s best hospitals, the largest manufacturers, the largest airlines, and the largest retail banks. That is the work ORDR has done, and the result is what makes everything above the data layer actually work.

The Premise No One Wants to Talk About

Almost every AI security product launched in the last 18 months has been built on the same implicit assumption: that the underlying data about the customer’s environment is good enough for the AI to reason over. It usually isn’t. And when it isn’t, no amount of model sophistication makes up for it.

This is not a controversial claim if you have ever tried to operationalize asset data in a large enterprise. The data is messy. The same device shows up three different ways across three different tools. The names disagree. The attributes overlap. Devices appear and disappear from inventory without explanation. The CMDB says one thing, the vulnerability scanner says another, and the EDR says a third. None of them is wrong; each is seeing a different slice of the same environment through its own narrow lens.

This is the foundation most AI security tools are now being built on: a reasoning engine sitting on top of fragmented, duplicated, partially stale data. The model will dutifully produce an answer. The answer will look authoritative. And it will be wrong in ways that are hard to detect until something breaks.

The industry needs to take this seriously, because the gap between “we have AI” and “the AI is actually trustworthy” is almost entirely a data problem. And the data problem cannot be solved by throwing more data at the model.

What “Good Data” Actually Means

Before we can talk about how it’s built, we need to be precise about what good data is in this context. It is not just “we have an inventory.” An inventory tells you that something exists. Good data tells you what it is, how it behaves, what it can reach, who owns it, and how it relates to everything else, with enough fidelity that an AI can reason over it without producing nonsense. Concretely, that means five things:

  • Identity at the make, model, and firmware level. “Medical device” is not identity. “Siemens SOMATOM CT scanner running Windows 10 IoT, firmware 8.4.2” is identity. The difference is the difference between a generic firewall policy that breaks clinical workflow and a tailored one that doesn’t.
  • Behavioral truth. What does this specific device actually talk to, on which protocols, at what cadence, for what purpose? Not what the vendor's brochure says it should do, but what it is observed doing in production. Without behavioral truth, segmentation is guesswork.
  • Topology and reachability. Where does this device sit in the network graph? Who can reach it from inside? Who can reach it from outside? A vulnerability on an unreachable device is a different problem from the same vulnerability on a reachable one. Without reachability, prioritization is theater.
  • Continuity of identity over time. A device that disconnects, gets a new IP, swaps an adapter, randomizes its MAC, or moves from wired to wireless is still the same device. If the data layer can’t track that, every reconnection produces a duplicate; the inventory inflates, and the AI is reasoning over phantoms.
  • Organizational and operational context. Which department owns this device? Which clinical or business function does it serve? Who is on the hook for patching it? Without this, even a perfect technical recommendation has nowhere to land.
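The reachability point in particular is easy to make concrete. The following is a minimal Python sketch, not a description of any product's implementation. It assumes a hypothetical directed graph of permitted network paths (`edges`), invented device names, and a deliberately simple ranking rule: findings on devices reachable from an entry point sort ahead of findings on unreachable devices, and severity only breaks ties within each group.

```python
from collections import deque

def reachable_from(edges, sources):
    """Return every node reachable from any source via breadth-first search."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def prioritize(findings, edges, entry_points):
    """Sort findings so exposed devices come first, then by severity."""
    exposed = reachable_from(edges, entry_points)
    return sorted(
        findings,
        key=lambda f: (f["device"] in exposed, f["cvss"]),
        reverse=True,
    )

# Hypothetical topology: who is allowed to initiate traffic to whom.
edges = {
    "internet": ["dmz-gateway"],
    "dmz-gateway": ["pacs-workstation"],
    "pacs-workstation": ["ct-scanner"],
    "lab-vlan": ["isolated-analyzer"],
}
# Hypothetical findings; the scores are illustrative only.
findings = [
    {"device": "isolated-analyzer", "cvss": 9.8},
    {"device": "ct-scanner", "cvss": 7.5},
    {"device": "pacs-workstation", "cvss": 8.1},
]
ranked = prioritize(findings, edges, ["internet"])
print([f["device"] for f in ranked])
```

In this toy data the highest-scoring finding lands last because nothing on the modeled path from the internet can reach that device. If the topology snapshot is stale, the same code produces the same confident ordering from wrong inputs, which is the failure mode the list above describes.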

Most security tools claim to deliver some of this. Almost none deliver all five at scale, on heterogeneous installed-base environments, with the accuracy required for AI to reason over the data without producing confident wrong answers. The reason is that getting all five right is genuinely hard, and the work to get there is the kind of unglamorous engineering investment that does not produce a flashy demo on its own.

Why Real Environments Produce Bad Data

To understand what good data takes, you must understand what real environments do to data. They are not stable. They are not orderly. They generate duplicates, gaps, contradictions, and noise as a matter of normal operation. A short tour of the mess:

Devices come and go. Virtual machines, containers, smartphones, contractor laptops, IoT gadgets, guest Wi-Fi devices — modern environments are full of ephemeral assets that appear for hours or days and disappear. Scanners with cyclical schedules miss them. Agent-based tools cannot be installed on most of them. By the time the scan finishes, the device is gone, or it has changed IP, or it has been replaced by something else.

The same device shows up many ways. A Windows laptop that switches between wired and wireless connections produces two inventory entries. A medical device with multiple physical interfaces produces three. An Apple device with MAC randomization produces a new identity every time it joins the network. A remote worker’s laptop on a different IP every morning creates a new record every day. A network switch stack managed as individual switches duplicates itself in monitoring tools. None of these are bugs. They are the normal behavior of normal devices in normal environments, and every one of them inflates the inventory with duplicates of the same physical thing.

Different tools see different things and call them different names. EDR sees a laptop one way. MDM sees the same laptop another way. AD has a third record. The vulnerability scanner has a fourth. The CMDB has a fifth, entered manually two years ago and never updated. When these are joined together, the same device appears five times — sometimes with conflicting attributes, sometimes with missing ones, sometimes encoded in different formats. The OS field alone can be “OS,” “OS Version,” “OS Running,” or “OS Type,” with different case sensitivity and different string conventions across tools.
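The field-name problem can be sketched in a few lines of Python. This is an illustration under stated assumptions: the alias table, the sample rows, and the function name `normalize_record` are all hypothetical, and a real deployment would need a far larger mapping plus value-level normalization.

```python
# Hypothetical alias table: every spelling seen across tools maps to one canonical key.
FIELD_ALIASES = {
    "os": "os",
    "os version": "os",
    "os running": "os",
    "os type": "os",
    "hostname": "hostname",
    "host name": "hostname",
    "device name": "hostname",
}

def normalize_record(raw):
    """Map tool-specific field names onto canonical keys; set unknown fields aside."""
    normalized, unmapped = {}, {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None:
            unmapped[key] = value  # kept for review rather than silently dropped
        else:
            normalized[canonical] = str(value).strip()
    return normalized, unmapped

# Two invented rows describing the same laptop, as two tools might export it.
edr_row = {"OS Running": "Windows 11 ", "Host Name": "RAD-WS-014"}
cmdb_row = {"os type": "windows 11", "Device Name": "rad-ws-014", "Rack": "B2"}

print(normalize_record(edr_row))
print(normalize_record(cmdb_row))
```

After this step both rows expose `os` and `hostname`, but the values still differ in case. Matching field names is only the first layer of the reconciliation the paragraph describes.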

Data migration creates ghost records. When an organization moves from one asset system to another, or merges two systems after an acquisition, the residue is always duplicates and orphans — records that point to devices that no longer exist, devices that exist twice, devices that exist but are not linked to their organizational context.

Cloud and VDI environments add their own churn. VMware, Nutanix, AWS, Citrix, and similar platforms create and destroy machine instances continuously. Short-lived instances accumulate in inventory if not aggressively purged. Long-lived instances drift in configuration without updates if not actively re-observed.

This is not a corner case. This is the steady-state behavior of every large enterprise environment we have ever seen. Bad data is not the result of bad practice — it is the natural state of an asset inventory left alone. Producing good data requires continuous, active work against an environment that constantly degrades the data quality. That work is what most security platforms are skipping when they bolt AI on top.

You Cannot Solve This by Throwing Data at AI

There is a comforting idea floating around the industry: that AI is now powerful enough to make sense of messy data on its own. Just feed it the inventory, the scanner output, the EDR data, the CMDB dumps, the cloud metadata, and let the model figure it out.

This is wrong, and the reason it is wrong matters.

When you give an AI a fragmented, duplicated, contradictory data set, the AI doesn’t fix the data — it averages over it. It produces an answer that looks plausible, with a tone of confidence, and the answer is sometimes right and sometimes spectacularly wrong, and the user has no way to tell which is which. The hallucination is not a model bug. It is the predictable behavior of any reasoning system asked to reason over inputs that are themselves contradictory.

In a chat assistant, that’s a minor problem. The user asks a follow-up question and clarifies. In security, it’s a serious problem. The AI confidently tells you to segment a device that no longer exists. It recommends isolating a system that is actually three duplicates of the same machine. It generates a vulnerability prioritization based on a reachability assumption derived from a stale topology snapshot. Each of these answers looks correct on the surface, and each of them produces a different kind of operational damage when acted on.

The harder truth is this: the more capable the AI, the more dangerous bad data underneath becomes. A weak AI on bad data produces obvious nonsense that people ignore. A strong AI on bad data produces confident nonsense that people act on. The reasoning layer doesn’t make the data problem smaller. It makes the data problem more consequential.

Good data is what makes AI work. There is no shortcut around it. And good data, at the scale and complexity of a modern enterprise, is itself an AI problem — but it is an AI problem that requires deep domain expertise to solve, not just generic models trained on text.

How ORDR Builds the Data Layer

This is the work ORDR has done from the beginning. Before AI in security was a marketing term, before reasoning layers became the default product pitch, we were building the layer underneath — the unified device, behavior, and topology graph that proactive AI now sits on. We built it where the consequences of getting it wrong are highest: in the world’s best hospitals, the largest manufacturers, the largest global airlines, and the largest retail banks. Environments with the strictest regulatory constraints, the highest device diversity, and the lowest tolerance for operational disruption. Five things make our data layer different from what most platforms in the market deliver.

1. True device identity, established by AI over a seven-layer hierarchy

Every device in the ORDR platform is resolved to a true identity — a globally unique fingerprint that persists across IP changes, MAC randomization, adapter swaps, and network reconnections. The resolution is done by a predictive AI model trained on ORDR’s asset knowledge base, which has accumulated profiles on tens of millions of devices observed in production across the world’s best hospitals, the largest global airlines, the largest manufacturers, the largest retail banks, and critical infrastructure operators across the US, Europe, and Asia.

Underneath that identity is a seven-layer hierarchy that organizes every device by Bucket, Group, Category, Sub-category, Profile, Device Type, and Device Instance. This isn’t organizational scaffolding for its own sake — it is the structure that makes AI-driven deduplication actually work. When the model considers whether two records represent the same device, it does so with full context about what kind of device this is, how devices in this class typically behave, and which attributes are dispositive for identity versus incidental. A generic deduplication algorithm matching on MAC address alone would miss most of the real-world duplication scenarios we see in customer environments. The seven-layer hierarchy is what catches them.
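One way to picture the hierarchy is as a fixed-depth classification path. The Python sketch below is an assumption-laden illustration: the seven field names come from the layers listed above, but every value, the `Classification` class, and the `shared_depth` helper are invented here and do not reflect ORDR's actual taxonomy or model.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Classification:
    """One slot per layer named in the text, from broadest to most specific."""
    bucket: str
    group: str
    category: str
    sub_category: str
    profile: str
    device_type: str
    device_instance: str

def shared_depth(a, b):
    """Count how many leading layers two classifications have in common."""
    depth = 0
    for left, right in zip(astuple(a), astuple(b)):
        if left != right:
            break
        depth += 1
    return depth

# All layer values below are invented placeholders.
wired = Classification("IoMT", "Imaging", "CT", "CT Scanner",
                       "vendor-x-ct", "model-y", "instance-001")
wireless = Classification("IoMT", "Imaging", "CT", "CT Scanner",
                          "vendor-x-ct", "model-y", "instance-002")
laptop = Classification("IT", "Endpoints", "Laptop", "Windows Laptop",
                        "vendor-z-laptop", "model-q", "instance-117")

print(shared_depth(wired, wireless))  # 6
print(shared_depth(wired, laptop))    # 0
```

Two records that agree on six of seven layers are plausible candidates for being the same physical device, while records that diverge at the top layer are not worth comparing further. A real model would weigh much more than a prefix count, but the structure shows why class context narrows the search.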

2. Behavioral baselines from real packet-level observation

Inventory alone is not data. What turns inventory into actionable data is behavior: what each device actually does in production. ORDR observes every device at the packet level for weeks at a time, learning its real communication patterns: which destinations it talks to, on which protocols, at what cadence, with which counterparties, for what apparent purpose.

Most platforms claiming behavioral baselines are inferring them from NetFlow, from scanner output, or from third-party integrations. That is not the same thing. Packet-level observation captures the protocols, the sequence, the timing, and the content cues that distinguish a legitimate clinical workflow from anomalous activity. NetFlow tells you that two endpoints talked. Packet-level inspection tells you what they said, which is the only signal that lets you decide whether the conversation was supposed to happen.

Behavioral truth observed this way is what makes the data layer reliable enough for AI to reason over. It is also what makes downstream segmentation safe because the policy is anchored in what the device actually needs to talk to, not what someone’s threat model says it might want to talk to.
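The idea of anchoring policy in observed behavior can be sketched in Python. This is a simplified illustration, not the packet-level analysis described above: it assumes observations have already been summarized into hypothetical flow records with `dst`, `proto`, and `port` fields, and it uses an arbitrary `min_count` threshold to separate routine traffic from one-off noise.

```python
from collections import Counter

def build_baseline(flows, min_count=3):
    """Keep (destination, protocol, port) tuples seen at least min_count times."""
    counts = Counter((f["dst"], f["proto"], f["port"]) for f in flows)
    return {key for key, n in counts.items() if n >= min_count}

def unexpected(flows, baseline):
    """Return flows whose (destination, protocol, port) is outside the baseline."""
    return [f for f in flows if (f["dst"], f["proto"], f["port"]) not in baseline]

# Hypothetical learning-period observations for one imaging device.
training = (
    [{"dst": "pacs.example.internal", "proto": "tcp", "port": 104}] * 40
    + [{"dst": "ntp.example.internal", "proto": "udp", "port": 123}] * 12
    + [{"dst": "203.0.113.9", "proto": "tcp", "port": 445}] * 1
)
baseline = build_baseline(training)

# Hypothetical traffic seen after the learning period.
today = [
    {"dst": "pacs.example.internal", "proto": "tcp", "port": 104},
    {"dst": "198.51.100.20", "proto": "tcp", "port": 22},
]
print(sorted(baseline))
print(unexpected(today, baseline))
```

The baseline set doubles as a draft allowlist for segmentation: anything in it is traffic the device demonstrably needs, and anything outside it deserves a look before a policy is enforced. A tuple count carries none of the sequence, timing, or content cues the text attributes to packet inspection, so treat this as the shape of the idea rather than its substance.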

3. AI-driven deduplication with semantic similarity

Deduplication is where most asset platforms quietly fail at scale. Simple identifier-matching — by MAC, by serial number, by FQDN — works at small scale and falls apart in real enterprise environments where devices change identifiers continuously through normal operation. The Windows laptop on wired and wireless. The Apple device with MAC randomization. The remote worker on a new IP every morning. The medical device with multiple physical interfaces. The cloud instance that recycled its hostname yesterday. Each of these breaks identifier-based deduplication.

ORDR uses predictive large language models against the asset knowledge base, combined with vector database techniques for semantic similarity. The model doesn’t look for exact matches; it looks for equivalence classes. Two records that share no exact identifier but match strongly on device class, observed behavior, organizational context, and topological position are recognized as the same device. This is what catches the duplicates that simple algorithms miss, and it is only possible because the knowledge base behind it has been built from observing devices in the most demanding environments in the world: large hospital systems with hundreds of medical device types, global airlines with operational technology spanning continents, manufacturing floors with mixed legacy and modern equipment, retail banks with strict regulatory boundaries on every asset. Each of those environments taught the model something a generic dataset would never contain.

This is the kind of problem AI is genuinely well-suited to solve. It is also a problem AI cannot solve generically; it requires the domain expertise embedded in the training data, the hierarchical structure, and the behavioral baselines that come only from observing the real thing, in real environments, where the cost of being wrong is operational.
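The contrast between identifier matching and similarity matching can be shown with a small Python sketch. It is a stand-in, not the approach described above: instead of a language model and a vector database it uses hand-written four-number vectors (imagine them as encoded class, behavior, ownership, and topology features), brute-force pairwise cosine similarity, and a union-find structure to form equivalence classes. The record names, vectors, and the 0.95 threshold are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster(records, threshold=0.95):
    """Group record ids whose vectors are similar, using union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    # Brute-force comparison of every pair; fine for a sketch, not for scale.
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if cosine(a["vec"], b["vec"]) >= threshold:
                parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(sorted(g) for g in groups.values())

# Invented records: the two laptop entries share no identifier at all.
records = [
    {"id": "laptop-wired", "mac": "aa:aa", "vec": [0.9, 0.1, 0.8, 0.2]},
    {"id": "laptop-wifi", "mac": "bb:bb", "vec": [0.88, 0.12, 0.79, 0.22]},
    {"id": "infusion-pump", "mac": "cc:cc", "vec": [0.1, 0.9, 0.05, 0.7]},
]
print(cluster(records))
```

Matching on `mac` would report three devices. The similarity pass merges the two laptop records and leaves the pump alone. Everything hard is hidden in how the vectors are produced, which is where the domain knowledge discussed in this section has to come in.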

4. Dynamic Data Exchange (DDX) for schema normalization and attribute prioritization

When data flows in from multiple tools — EDR, MDM, AD, vulnerability scanners, CMDB, cloud platforms — each tool offers its own version of the truth. They disagree. The EDR says the device is running Windows 11; the MDM says Windows 10. The CMDB says it was assigned to John; AD says Susan. The vulnerability scanner reports a different OS version than either. Resolving these conflicts requires deciding, for each attribute, which source is authoritative — and the right answer is not the same for every attribute or every customer.

ORDR’s Dynamic Data Exchange (DDX) mechanism is the engine that handles this. It allows customers to configure attribute priority orders that match their specific environment — EDR is authoritative for OS, AD is authoritative for ownership, the vulnerability scanner is authoritative for installed software versions, and so on. Underneath, DDX supports regular expression parsing, schema translation, and granular field mapping so that the same logical attribute coming from different tools in different formats lands in the same place in the central data model. Raw data is preserved for auditability; normalized data is what the platform reasons over.
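A per-attribute priority order with regular-expression parsing might look roughly like the Python sketch below. It is not DDX's configuration format or API, which the text does not specify. The `PRIORITY` table, the `OS_PATTERN` expression, the function names, and the sample observations are all assumptions chosen to mirror the Windows 11 versus Windows 10 and John versus Susan conflicts mentioned earlier.

```python
import re

# Hypothetical per-attribute source priority, highest first.
PRIORITY = {
    "os": ["edr", "mdm", "scanner"],
    "owner": ["ad", "cmdb"],
}
# Hypothetical pattern that only understands Windows version strings.
OS_PATTERN = re.compile(r"(windows)\s*(\d+)", re.IGNORECASE)

def parse_os(value):
    """Reduce free-form OS strings to a canonical form such as 'Windows 11'."""
    match = OS_PATTERN.search(value)
    return f"Windows {match.group(2)}" if match else value.strip()

def resolve(observations):
    """Pick each attribute from its highest-priority source; keep raw input."""
    resolved = {}
    for attr, sources in PRIORITY.items():
        for source in sources:
            value = observations.get(source, {}).get(attr)
            if value:
                clean = parse_os(value) if attr == "os" else value.strip()
                resolved[attr] = {"value": clean, "source": source}
                break  # first source in priority order wins
    return {"normalized": resolved, "raw": observations}

# Invented conflicting reports about one device.
observations = {
    "edr": {"os": "Microsoft Windows 11 Enterprise"},
    "mdm": {"os": "windows10"},
    "cmdb": {"owner": "John"},
    "ad": {"owner": "Susan"},
}
result = resolve(observations)
print(result["normalized"])
```

Recording the winning `source` next to each value and returning the untouched `raw` input alongside it reflects the auditability point: a reviewer can see both what the platform concluded and what each tool originally said.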

This is unglamorous engineering. It is also the difference between a data layer that produces consistent insights across customer environments and one that produces different answers depending on which tool the customer happened to deploy first.

5. Continuous purging and organizational hierarchy mapping

A clean data layer doesn’t stay clean on its own. Devices retire. Cloud instances spin down. Guest Wi-Fi visitors leave. Random-MAC sessions end. Without active maintenance, the inventory bloats with stale records, degrading both performance and decision quality.

ORDR runs continuous purging with customer-configurable retention policies, with particularly aggressive handling of the highest-churn categories: random MACs, guest Wi-Fi devices, and transient cloud instances. Beyond purging, ORDR maintains an explicit organizational hierarchy that maps each device to a department, a cost center, an ownership chain, and a functional role. This is not just metadata. It is what makes the data layer operationally useful: when the platform produces a vulnerability remediation recommendation, it knows exactly which team is on the hook to act on it, and the work routes itself accordingly. Without organizational context, even a perfect technical recommendation has nowhere to go.
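Category-specific retention can be sketched in Python as follows. The retention windows, category labels, and record fields are hypothetical values picked for illustration. The text says only that policies are customer-configurable and more aggressive for high-churn categories.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows; shorter for high-churn categories.
RETENTION = {
    "random_mac": timedelta(days=1),
    "guest_wifi": timedelta(days=3),
    "cloud_instance": timedelta(days=7),
}
DEFAULT_RETENTION = timedelta(days=90)

def purge(records, now):
    """Split records into those kept and those past their category's retention."""
    kept, purged = [], []
    for record in records:
        limit = RETENTION.get(record["category"], DEFAULT_RETENTION)
        if now - record["last_seen"] > limit:
            purged.append(record)
        else:
            kept.append(record)
    return kept, purged

# Fixed timestamp so the example is deterministic.
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
records = [
    {"id": "phone-rand-1", "category": "random_mac",
     "last_seen": now - timedelta(days=2)},
    {"id": "vm-build-7", "category": "cloud_instance",
     "last_seen": now - timedelta(days=5)},
    {"id": "ct-scanner-1", "category": "medical",
     "last_seen": now - timedelta(days=30)},
]
kept, purged = purge(records, now)
print([r["id"] for r in kept])
print([r["id"] for r in purged])
```

Returning the purged records instead of deleting them in place leaves room for an audit log or a grace period before removal, a design choice that matters when a device that looked retired turns out to have been powered off for maintenance.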

The Domain Expertise That Makes It Work

There is a temptation, watching what foundation models can now do, to assume that any AI problem is now solvable by anyone with access to a sufficiently capable model. This is partly true and substantially misleading.

Deduplicating asset data is an AI problem, yes. But it is an AI problem where the model is only as good as the domain knowledge baked into the training data, the hierarchy that organizes the inputs, and the behavioral baselines the model has learned to recognize. A generic LLM applied to a raw asset inventory will produce something that looks like deduplication and will be wrong in ways that experienced practitioners recognize immediately. It will merge devices that should be distinct, separate devices that should be merged, and assign confidence scores to both kinds of error indistinguishably.

What ORDR has, that a generic model does not, is field experience in the environments where getting this wrong has the highest cost. We have observed medical devices in the world’s best hospitals — imaging modalities, infusion pumps, patient monitors operating under FDA constraints where a wrong segmentation policy can disrupt a stroke workup. We have observed operational technology in the largest manufacturers — PLCs, HMIs, industrial controllers where availability is non-negotiable, and misclassification can stop production lines. We have observed corporate endpoints and back-office systems in the largest retail banks — environments under continuous regulatory scrutiny, where data sovereignty and segmentation are audit requirements rather than aspirations. We have observed mixed IT, IoT, and OT environments in global airlines — with footprints that span continents, time zones, and regulatory regimes. We know what a Siemens SOMATOM CT scanner looks like on the wire and what it doesn’t. We know which destinations a Philips PACS workstation legitimately talks to and which ones would be anomalous. We know that Apple’s MAC randomization produces a specific signature that distinguishes a real new device from a returning known device. We know that a network switch stack reports itself differently in SNMP than in NetFlow than in vendor APIs, and we know how to reconcile those three views into a single accurate record.

None of this is in a training set you can download. It lives in the heads of the people who have built ORDR alongside these customers, and in the structured form of the device intelligence graph, the asset knowledge base, the seven-layer hierarchy, and the behavioral profiles we have accumulated by running in the most demanding environments in the world. This is the moat. It is also the part of the AI security stack that cannot be replicated in 18 months, no matter how much VC funding is poured into the attempt.

Domain expertise applied through AI techniques is what makes good data possible. AI techniques applied without domain expertise produce confident garbage. The combination is rare, and it is what we have built — in the field, in the environments where the wrong answer has consequences far beyond a security ticket.

Why This Compounds Upward

Everything ORDR does at the data layer compounds into what happens at the layers above. We have written elsewhere that AI security has three layers — data, reasoning, and action. The depth of each layer compounds the value of the layers above and below it. The data layer is where the compounding starts.

Good data lets the reasoning layer ask better questions. When a CISO asks, “give me a segmentation plan for radiology,” the reasoning layer can only produce a defensible answer if the underlying data knows which devices are in radiology, what they actually talk to, which of them have FDA recalls, which switches they’re connected to, and which departments are responsible for them. Bad data at the bottom propagates upward as confident wrong answers. Good data at the bottom propagates upward as decisions that hold up under audit.

Good data also lets the action layer enforce safely. Pushing a segmentation policy directly to switches and firewalls only works if the data layer has correctly identified which devices, which protocols, which destinations are involved — and if the simulation that runs before enforcement is reasoning over an accurate model of the environment. The closed-loop architecture only delivers value when the foundation it closes the loop on is real.

This is the architectural reality the industry is going to have to confront in the next two years. Reasoning layers are commoditizing fast: the foundation models are getting better, the prompting techniques are converging, and the natural-language interfaces are catching up. What does not commoditize, what cannot be built in 18 months, and what determines whether the AI on top is real or aspirational is the data layer underneath. That is where the durable value lives. That is what ORDR has built.

The Bottom Line

Every AI security pitch you hear in the next two years will compete on the reasoning layer: the agents, the copilots, the conversational interfaces, and the natural-language planners. Most of those pitches will be running on data the vendor doesn’t own and didn’t build, and the gap between the demo and the production reality will be enormous.

The platforms that hold up under serious evaluation will be the ones that did the unglamorous work first. Identity at the make-and-model level. Behavior observed at the packet level. Deduplication powered by domain expertise and semantic similarity, not blind identifier matching. Schema normalization across the mess of real-world tooling. Organizational context, continuous purging, and a unified graph that connects all of it. This work doesn’t make a flashy demo, but it makes every demo above it work.

Throwing data at AI doesn’t solve security problems. Good data, organized by people who have seen this problem at scale for a decade, processed through AI techniques designed for this specific domain, and validated continuously against real environments: that is what makes AI work for security. It is also what we have been building at ORDR since long before “AI security” became a category.

Anyone evaluating AI security platforms in 2026 should ask one question above all the others: where does your data come from, and how do you know it is accurate? If the answer is some version of “we integrate with the customer’s existing tools and let the AI sort it out,” the platform is asking you to take a leap of faith that the data underneath is good enough. It usually isn’t. And the AI on top — no matter how impressive its demo — is only going to produce confident wrong answers faster.

Good data is the foundation. Everything else is built on top of it. That is the part of the AI security story most of the industry isn’t telling, and it is the part that will decide which platforms still matter five years from now.

Pandian Gnanaprakasam is the Co-Founder and CTO of ORDR. He has been building the device intelligence and data foundations that proactive AI security depends on, in the most demanding regulated environments in the world.
