Blockchain Analytics Data Quality: 10 Due-Diligence Questions
When a compliance team, regulator, or investigator acts on blockchain analytics output, the downstream consequences of bad data are severe: wasted resources on false leads, missed sanctions exposure, and, in the worst case, a single incorrect attribution that unravels an entire investigation or triggers a wrongful customer exit. Chainalysis has published a framework of ten questions that any organisation should put to a blockchain analytics provider before relying on its data for AML, sanctions screening, or enforcement work. The questions go well beyond feature comparisons and coverage claims. They probe methodology, evidentiary standards, and the safeguards that separate rigorous analysis from confident-sounding guesswork.
Why Data Quality Is the Real Risk in Blockchain Analytics
Blockchain analytics tools are only as useful as the conclusions they can actually support. A provider may claim broad coverage across dozens of chains, but if its attribution logic is opaque or its clustering methodology has never faced independent scrutiny, compliance teams are essentially building decisions on unverified assertions.
The stakes are concrete. Incorrect attribution can discredit hundreds of related insights at once. A cluster that collapses under examination can undermine an enforcement action that took months to build. Regulators and courts increasingly want to understand not just what a tool concluded, but how it reached that conclusion and whether that process has held up under external testing.
The gap between coverage claims and actual rigour
Many providers describe their methodology in general terms during sales conversations. The questions below are designed to move past that layer. If a provider cannot answer them clearly and specifically, that is itself a meaningful signal about the reliability of its outputs.
Ten Questions to Ask Before You Rely on the Data
Chainalysis groups its due-diligence questions around four core themes: clustering methodology, labelling and attribution, legal and external validation, and machine learning oversight. The breakdown below works through each theme and its practical compliance implication.
Clustering methodology
The first area covers how a provider groups addresses together to infer common ownership. Some techniques establish common ownership deterministically; others do so probabilistically. Both have legitimate uses, but a compliance analyst needs to know which is being applied and when, so they can calibrate how much weight to place on the output.
Providers should also be able to identify the known blind spots in their techniques. CoinJoin transactions, for example, need to be recognised and excluded from UTXO co-spending heuristics; otherwise the clustering logic produces false positives. A rigorous provider has mapped these edge cases and built explicit protections, rather than assuming errors are infrequent.
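A minimal sketch of the point above, assuming a simplified transaction record: the co-spending heuristic merges addresses that appear as inputs to the same transaction, and any transaction that looks like a CoinJoin is skipped rather than merged. The field names (`txid`, `inputs`, `output_values`), the function names, and the `looks_like_coinjoin` test (several distinct inputs plus several equal-valued outputs) are illustrative assumptions, not any provider's actual method; real CoinJoin detection is considerably more involved.

```python
from collections import Counter


def looks_like_coinjoin(tx, min_participants=3):
    """Crude illustrative test: several distinct inputs and several equal-valued outputs."""
    if len(set(tx["inputs"])) < min_participants:
        return False
    counts = Counter(tx["output_values"])
    return max(counts.values(), default=0) >= min_participants


class UnionFind:
    """Disjoint-set structure used to merge addresses into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_by_cospend(transactions):
    """Merge co-spent input addresses, skipping suspected CoinJoin transactions."""
    uf = UnionFind()
    skipped = []
    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            uf.find(addr)  # register every address, even if it is never merged
        if not inputs:
            continue
        if looks_like_coinjoin(tx):
            skipped.append(tx["txid"])  # record the exclusion instead of merging
            continue
        for addr in inputs[1:]:
            uf.union(inputs[0], addr)
    clusters = {}
    for addr in list(uf.parent):
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values()), skipped
```

Returning the skipped transaction IDs alongside the clusters keeps each exclusion visible for review, which is the kind of explicit protection the paragraph above describes.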
Chain architecture matters too. Bitcoin and Ethereum operate on fundamentally different transaction models. Grouping techniques that work well on one chain do not automatically transfer to another. If a provider uses identical terminology across chains without explaining how the underlying method adapts, that warrants a direct follow-up question.
Labelling and attribution standards
A label confirmed by law-enforcement-seized data carries very different evidentiary weight than one derived from a single uncorroborated report. Compliance teams should understand exactly what sources underpin the labels they see, and whether those sources can be disclosed or at least characterised by reliability tier.
Equally important is independence between clustering and labelling. If removing a label from an address cluster causes the cluster itself to fall apart, neither the grouping nor the label stands on its own. The two conclusions need to be independently supportable.
A subtler but important question concerns the distinction between user and custodian. When a customer deposits crypto at an exchange, the deposit address belongs to the customer in one sense but is controlled by the exchange. Failing to distinguish between who uses an address and who ultimately controls it produces attribution errors that can cascade through downstream analysis. The same logic applies to nested entities, where one business relies on another's custodial infrastructure. Understanding control, not just interaction, is the standard a credible provider should meet. This connects directly to the independent reconciliation practices that auditors now scrutinise when they review how firms source and validate on-chain data.
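One way to picture the distinction is a record that stores control and usage in separate fields, so neither can silently stand in for the other. The sketch below is hypothetical: the class `AddressAttribution`, its field names, and the `describe_roles` helper are assumptions made for illustration and do not reflect any provider's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddressAttribution:
    address: str
    controller: str                        # entity that holds the keys
    nested_service: Optional[str] = None   # business operating on the controller's infrastructure
    associated_user: Optional[str] = None  # customer the address was issued to, if known


def describe_roles(attr: AddressAttribution) -> dict:
    """Return each party's role so control and usage are never conflated."""
    roles = {"controls": attr.controller}
    if attr.nested_service:
        roles["operates_via_controller"] = attr.nested_service
    if attr.associated_user:
        roles["uses"] = attr.associated_user
    return roles
```

Keeping the controller mandatory and the other roles optional reflects the standard stated above: control must always be known, while usage and nesting are recorded only when they are supported.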
Legal and external validation
Legal proceedings are among the most demanding tests a methodology can face. A clustering or attribution method that has satisfied the Daubert standard in a US federal court has been examined for scientific validity, peer review, error rates, and general acceptance. That is a categorically different thing from a method that has never been challenged in any adversarial setting.
Equally revealing is how a provider responds when external validation becomes possible. When law enforcement seizes wallet infrastructure and empirical ground truth becomes available, a provider that welcomes that comparison is demonstrating confidence in its methods. One that avoids it is not. Whether blockchain analysis can support fraud recovery and enforcement outcomes depends directly on whether the underlying data has been stress-tested against real-world results.
Machine learning oversight
Machine learning is effective at identifying patterns at scale. The risk arises when probabilistic ML outputs are treated as confirmed facts rather than as signals that require further validation. If a provider cannot explain clearly where in its workflow ML is being used, and cannot confirm that those outputs are labelled distinctly from evidence-based conclusions, errors can propagate through attribution at speed.
For any specific cluster, a provider should be able to reconstruct how it was built and identify the evidence that supports it. If that audit trail is not available, the reliability of the cluster is unknown, regardless of how confident the interface looks.
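The two requirements in this section, distinct labelling of ML outputs and a reconstructable audit trail, can be sketched as a per-link provenance record. This is a hedged illustration only: the `Basis` categories, the `Link` fields, and the rule in `audit_cluster` (a link counts as confirmed only if it is deterministic and cites evidence) are assumptions chosen for the example, not a description of any real product.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"
    ML_SIGNAL = "ml_signal"


@dataclass(frozen=True)
class Link:
    address: str
    basis: Basis   # how the address was tied to the cluster
    evidence: str  # reference to the supporting record; empty if none was kept


def audit_cluster(links):
    """Split a cluster's links into confirmed ones and ones that need review."""
    confirmed, needs_review = [], []
    for link in links:
        if link.basis is Basis.DETERMINISTIC and link.evidence:
            confirmed.append(link)
        else:
            needs_review.append(link)  # ML, probabilistic, or undocumented
    return confirmed, needs_review
```

Under these assumptions, a link with no recorded evidence lands in the review list even if its basis is deterministic, mirroring the point that a cluster without an audit trail has unknown reliability.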
What These Questions Signal About Provider Quality
A provider that can answer all ten questions clearly, with specifics rather than generalities, is demonstrating transparency and accountability. A provider that deflects, provides only high-level answers, or cannot explain its methodology for a particular cluster on request is signalling limitations that compliance teams should factor into their risk assessment.
The same evidentiary standards that underpin a sound investigation should underpin the tools that feed it. Procurement decisions that treat blockchain analytics as a commodity rather than a methodological choice are a compliance risk in their own right.
What does "deterministic" versus "probabilistic" clustering mean in practice?
Deterministic clustering uses on-chain rules that produce a definitive conclusion, for example that two addresses must share a single owner based on how they appear together in a transaction. Probabilistic clustering infers likely common ownership based on statistical patterns but cannot rule out alternative explanations. Compliance teams should know which method underpins any given attribution so they can apply appropriate confidence levels.
Why does the user versus custodian distinction matter for sanctions screening?
A deposit address at an exchange is technically controlled by the exchange even though it is associated with a specific customer. If a tool misattributes control, a compliance team could flag or clear the wrong party. Getting this distinction right is especially important in nested-entity structures where multiple layers of custody are involved.
What is the Daubert standard and why is it relevant to blockchain analytics?
The Daubert standard is a US federal threshold for admissible expert evidence. A court applying it will examine whether the methodology has been tested, whether it has a known or estimable error rate, whether it has been peer reviewed, and whether it is generally accepted in the relevant field. A blockchain analytics methodology that has passed Daubert scrutiny has faced a level of independent challenge that most have not.
How should firms treat ML-generated attribution outputs differently from evidence-based ones?
ML outputs should be treated as probabilistic signals that require corroboration, not as confirmed facts. Providers should label ML-derived conclusions separately and clearly so that analysts can apply appropriate scepticism and seek additional evidence before acting on them in a compliance decision.
Can a single incorrect attribution really undermine an entire investigation?
Yes. Because clustering links addresses together, a wrong attribution can cascade: if address A is incorrectly linked to a cluster, every insight derived from that cluster inherits the error. In enforcement contexts, opposing counsel can use one demonstrated inaccuracy to challenge the reliability of the provider's methodology across all related outputs.
Source: Chainalysis
