
Why On-Chain Data Lakes Will Remain Ponds Without Standards

A cynical look at the fragmented state of on-chain data. Without enforced interoperability standards, every protocol's repository is a useless silo, dooming DeSci and advanced DeFi. We analyze the problem and the nascent solutions.

THE DATA

Introduction: The Great Data Swamp

On-chain data is a fragmented, unusable mess because the industry lacks universal standards for structuring and querying it.

Data is not information. Raw transaction logs from Ethereum or Solana are a swamp of low-level events. Without a universal schema, every team builds custom parsers for the same data, burning billions of dollars' worth of engineering hours on duplicated work.
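To make the duplication concrete, here is a minimal sketch (assuming ethers v6) of the kind of ad-hoc decoder nearly every team ends up writing, here for the one event the EVM world does agree on, the ERC-20 Transfer. Multiply this by every event of every protocol on every chain.

```typescript
import { Interface, LogDescription } from "ethers";

// The one fragment of shared schema EVM chains do have: the ABI of a single
// well-known event. Everything beyond this is per-team convention.
const erc20 = new Interface([
  "event Transfer(address indexed from, address indexed to, uint256 value)",
]);

// A raw log as returned by eth_getLogs: just topics and data, no semantics.
interface RawLog {
  topics: string[];
  data: string;
}

// Decode a raw log into a typed transfer, or null if it is some other event.
function decodeTransfer(
  log: RawLog,
): { from: string; to: string; value: bigint } | null {
  const parsed: LogDescription | null = erc20.parseLog(log);
  if (parsed === null || parsed.name !== "Transfer") return null;
  return {
    from: parsed.args.from as string,
    to: parsed.args.to as string,
    value: parsed.args.value as bigint,
  };
}
```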

The indexing problem is a standards problem. The fragmented landscape of The Graph, Covalent, and proprietary RPC providers proves the point. Each creates its own data model, forcing applications into vendor lock-in and preventing composable analytics.

Evidence: Over 80% of a DeFi protocol's backend code is data plumbing. Aave's risk models and Uniswap's fee optimization require teams to rebuild the same EVM event decoders from scratch.

THE DATA POND PROBLEM

The Core Argument: Interoperability is Non-Negotiable

Without universal data standards, on-chain data lakes remain isolated ponds, crippling cross-chain analytics and composability.

Data Silos Are Inevitable. Every L2 and alt-L1 creates its own data model. An Arbitrum NFT and a Solana NFT are fundamentally different objects. This fragmentation makes aggregated analytics, like tracking a wallet's total DeFi exposure, a manual, error-prone integration nightmare.
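To make "fundamentally different objects" concrete, here is a sketch of the two shapes under typical conventions; the `UnifiedNft` type at the end is hypothetical, the kind of normalization every aggregator currently invents for itself.

```typescript
// On an EVM chain (Arbitrum, etc.) an NFT is a (contract, tokenId) pair
// inside a shared ERC-721 contract; ownership lives in that contract's state.
interface EvmNft {
  chainId: number;       // e.g. 42161 for Arbitrum One
  contract: string;      // ERC-721 contract address
  tokenId: bigint;       // token index within the contract
  tokenUri: string;      // off-chain metadata pointer
}

// On Solana an NFT is its own SPL token mint (supply 1, 0 decimals), with
// metadata in a separate program-derived account (Metaplex convention).
interface SolanaNft {
  mint: string;            // base58 mint address; the NFT *is* this mint
  metadataAccount: string; // PDA holding name/symbol/URI
  updateAuthority: string; // who may edit the metadata
}

// A hypothetical unified view: every field below is a design decision each
// team has to make on its own, because no standard makes it for them.
interface UnifiedNft {
  globalId: string;   // e.g. "eip155:42161/erc721:0xabc.../42" vs "solana:<mint>"
  name: string;
  imageUrl: string;
}
```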

Standards Precede Scale. The internet scaled because of TCP/IP, not faster modems. Similarly, interoperability protocols like LayerZero and Axelar solve asset transfer, but not data coherence. Without a semantic layer, data lakes from The Graph or Goldsky remain isolated ponds.

Composability Demands Consistency. A cross-chain lending protocol cannot price collateral if the underlying asset's price feed on Polygon differs from its Avalanche feed. Oracle networks like Chainlink standardize price data, proving the model works but highlighting the gap for all other data types.
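Chainlink's AggregatorV3Interface shows what a minimal shared read surface buys: the same five-field answer on every chain. The sketch below (ethers v6; the RPC URL and feed address are placeholders) works unchanged against a Polygon feed or an Avalanche feed.

```typescript
import { Contract, JsonRpcProvider } from "ethers";

// The standardized read surface exposed by every Chainlink price feed.
const aggregatorV3Abi = [
  "function latestRoundData() view returns (uint80 roundId, int256 answer, uint256 startedAt, uint256 updatedAt, uint80 answeredInRound)",
  "function decimals() view returns (uint8)",
];

// Read the latest price from any Chainlink feed on any EVM chain.
async function readChainlinkFeed(rpcUrl: string, feedAddress: string) {
  const provider = new JsonRpcProvider(rpcUrl);
  const feed = new Contract(feedAddress, aggregatorV3Abi, provider);
  const [, answer, , updatedAt] = await feed.latestRoundData();
  const decimals = await feed.decimals();
  return {
    price: Number(answer) / 10 ** Number(decimals),
    updatedAt: new Date(Number(updatedAt) * 1000),
  };
}

// Same code, different chain: only the inputs change.
// readChainlinkFeed("https://polygon-rpc.com", "0x...ETH/USD feed address...");
```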

Evidence: The Graph indexes over 40 blockchains, but cross-chain querying requires building a separate subgraph for each chain and manually stitching the results together. That stitching is the technical debt of missing standards.

THE STANDARDS PROBLEM

Anatomy of a Data Pond: Why Silos Persist

On-chain data lakes are destined to remain isolated ponds without universal standards for indexing, formatting, and querying.

The Indexing Problem: Every chain uses a unique state model, forcing indexers like The Graph to deploy custom subgraphs for each environment. This creates fragmented data pipelines that cannot be composed, turning a potential lake into a collection of isolated ponds.

Formatting Incompatibility: Raw transaction logs from Ethereum, Solana, and Cosmos SDK chains are structurally incompatible. Without a unified data schema, cross-chain analytics for protocols like Uniswap or Aave require bespoke, brittle normalization layers.

Query Language Fragmentation: The ecosystem is split between SQL (Dune, Flipside), GraphQL (The Graph), and proprietary APIs. This query language war forces analysts to learn multiple systems, increasing the cost of comprehensive analysis.
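To see the split, consider one simple question, daily USDC transfer activity, asked twice. Both snippets are illustrative (the Dune table and subgraph entity names are assumptions about particular deployments), but the shape of the problem is real: same data, two incompatible dialects, zero portability.

```typescript
// Dialect 1: Dune-style SQL over decoded event tables (names are illustrative).
const duneQuery = `
  SELECT date_trunc('day', evt_block_time) AS day, count(*) AS transfers
  FROM erc20_ethereum.evt_Transfer
  WHERE contract_address = 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48 -- USDC
  GROUP BY 1
  ORDER BY 1 DESC
`;

// Dialect 2: The Graph-style GraphQL against a (hypothetical) token subgraph.
// Note: no aggregation, no joins; the schema author pre-decides what you can ask.
const subgraphQuery = `
  {
    transfers(
      where: { token: "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48" }
      orderBy: timestamp
      orderDirection: desc
      first: 1000
    ) {
      timestamp
      value
    }
  }
`;
```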

Evidence: The Graph hosts tens of thousands of subgraphs, but fewer than 5% are multi-chain. This ratio proves that custom per-chain work is the default, not the exception, making a universal data lake economically unviable without standards.

ON-CHAIN DATA LAKE INFRASTRUCTURE

Standard Showdown: Protocols, Promises, and Trade-offs

Comparison of leading data lake solutions, highlighting the critical role of standards in enabling composability and preventing vendor lock-in.

| Core Feature / Metric | Goldsky | The Graph | Subsquid | Ideal Standard |
| --- | --- | --- | --- | --- |
| Data Query Language | GraphQL | GraphQL | GraphQL | SQL (Postgres-compatible) |
| Data Provenance | Proprietary Indexing | Subgraph Indexing | Substrate/EVM Indexing | On-Chain Attestation |
| Cross-Chain Query | | | | |
| Query Latency (P95) | < 1 sec | 2-5 sec | 1-3 sec | < 500 ms |
| Historical Data Access | Full History | Subgraph-defined | Full History | Full History + Pruning |
| Data Schema Mutability | Managed Service | Immutable Subgraph | Mutable Dataset | Versioned Schema |
| Native Composability | | | | |
| Primary Use Case | Real-time Apps | DApp Data API | Analytics & Backfills | Universal Data Layer |

THE FRAGMENTATION ARGUMENT

Steelman: Maybe Ponds Are Fine?

A defense of the current fragmented state of on-chain data, arguing that specialized, isolated data lakes are a feature, not a bug.

Specialization drives efficiency. A monolithic data lake for all of crypto is a fantasy. An Ethereum L1 archive node serves a different purpose than a Solana validator's data plane. The query patterns for a DeFi analyst using Dune Analytics differ fundamentally from a Flashbots MEV searcher's real-time mempool stream. Forcing a single standard creates a lowest-common-denominator API that satisfies no one.

Competition creates better tools. The current ecosystem of The Graph, Covalent, Goldsky, and direct RPC providers like Alchemy and QuickNode is a competitive market. This forces continuous innovation in indexing speed, data freshness, and query language design. A mandated standard would stifle this, creating a data monopoly that slows progress and centralizes control over information access.

The cost of standardization is prohibitive. Enforcing a universal schema across thousands of protocols with unique state machines is a coordination nightmare. The governance overhead of updating a standard for every new primitive, from EIP-4844 blobs to Celestia data availability, would exceed the benefit. Protocol teams will always optimize for their own use cases first, making any top-down standard instantly obsolete.

Evidence: Look at the failure of universal blockchain APIs. Every major RPC provider has a proprietary API. The Ethereum JSON-RPC spec is a bare minimum; real innovation happens in the proprietary endpoints of Alchemy's Transact API or QuickNode's Marketplace. The market voted with its wallet for specialized performance over standardized mediocrity.

WHY ON-CHAIN DATA LAKES WILL REMAIN PONDS

Case Studies in Connectivity & Isolation

Without universal standards, isolated data silos cripple composability and prevent the emergence of a true on-chain data economy.

01. The Oracle Problem: A Fragmented Truth

Every major DeFi protocol runs its own oracle or relies on a single source like Chainlink, creating data silos and systemic risk. Without a standard for attestation and delivery, cross-protocol composability is brittle.

  • Key Consequence: Liquidations fail or cascade across protocols due to stale/divergent price feeds.
  • Key Limitation: Custom integration for each new data type (e.g., weather, sports) stifles innovation.
$10B+ TVL at Risk · ~10 Major Oracle Feeds
02. The MEV Searcher's Dilemma

Searchers operate in the dark, building private mempools and relying on fragmented data from RPC providers like Alchemy and Infura. This creates an information asymmetry that centralizes profit and harms end-users.

  • Key Consequence: Jito and Flashbots dominate because they control superior data access, not just better algorithms.
  • Key Limitation: No standard for real-time, permissionless access to global state and pending transactions.
$1B+ Annual MEV Extracted · ~500ms Data Advantage
03. Cross-Chain Is a Messy Graph

Projects like LayerZero, Axelar, and Wormhole build proprietary messaging layers, forcing developers to choose sides. This fragments liquidity and security models, turning the multi-chain vision into a walled garden archipelago.

  • Key Consequence: A protocol must deploy and maintain a separate integration for every bridge on every chain pair it supports, an explosion of overhead.
  • Key Limitation: No universal standard for verifiable message passing and state attestation between any two chains.
50+ Active Bridges · $2B+ Bridge Hacks (2022-24)
04. The Indexer Monopoly Problem

The Graph dominates on-chain data indexing but uses a proprietary subgraph language. This creates vendor lock-in and limits query flexibility, making complex, real-time analytics pipelines impossible.

  • Key Consequence: Developers cannot perform ad-hoc, SQL-like joins across protocols (e.g., Uniswap + Aave user behavior) without massive engineering effort.
  • Key Limitation: Indexed data is not a live, queryable stream but a cached snapshot, breaking real-time applications.
40k+ Subgraphs · ~2s Indexing Lag
05. ZK Proofs: Proving Everything, Sharing Nothing

ZK rollups like zkSync and StarkNet generate massive computational integrity proofs but treat verified state as a private output. This creates verified data silos where the proof of correctness is not itself a portable, composable data asset.

  • Key Consequence: A proven fact on one ZK rollup cannot be natively trusted or used by a smart contract on another chain without a separate, costly bridging protocol.
  • Key Limitation: No standard format for the output of a ZK proof to be consumed as universal data.
100KB+ Proof Size · ~10min Verification Time
06. The Intent-Based Dead End

Intent-based systems like UniswapX, CowSwap, and Across rely on solvers competing in private. Without a standard for expressing and fulfilling intents on a public data layer, these systems remain closed auctions rather than open markets.

  • Key Consequence: User intent (e.g., "swap X for Y at best price") is not a first-class, discoverable object that any solver can compete to fulfill.
  • Key Limitation: The lack of a public intent mempool prevents true price discovery and solver decentralization.
$10B+ Monthly Volume · 90% MEV Saved
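For contrast, here is a sketch of what a chain-agnostic, first-class intent object could look like if it lived on a public data layer. The field names are hypothetical, loosely modeled on UniswapX/CowSwap order shapes, not any live standard.

```typescript
// A hypothetical public intent: a signed, discoverable statement of desire
// that any solver could read from a shared mempool and compete to fulfill.
interface PublicIntent {
  intentId: string;     // globally unique, content-addressed ID
  chainId: number;      // settlement chain
  user: string;         // signer / beneficiary address
  sellToken: string;    // "swap X..."
  sellAmount: bigint;
  buyToken: string;     // "...for Y..."
  minBuyAmount: bigint; // "...at no worse than this price"
  deadline: number;     // unix timestamp after which the intent expires
  signature: string;    // user's authorization over all fields above
}

// In an open market, fulfillment is a function any solver can attempt:
type Solver = (
  intent: PublicIntent,
) => Promise<{ buyAmount: bigint; txHash: string }>;
```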
THE STANDARDS GAP

The Path to an Ocean: Predictions for 2024-2025

Without universal data standards, on-chain data lakes will remain fragmented ponds, limiting composability and utility.

Data lakes remain isolated ponds without universal schemas. Every protocol like Uniswap V3 or Aave defines its own event logs, forcing analysts to write custom parsers for each. This fragmentation prevents the creation of a unified on-chain data ocean.

The solution is not more indexing. Projects like The Graph and Covalent solve for querying, not semantic consistency. A standard like EIP-7484 for structured events is the prerequisite for a shared data layer that applications can build upon.

Evidence: The lack of standards is why Dune Analytics dashboards require constant maintenance. A schema change in a major protocol like Compound breaks thousands of queries, demonstrating the fragility of the current ad-hoc system.

THE DATA LAKE ILLUSION

TL;DR for Builders and Investors

On-chain data is abundant but trapped in siloed, non-standardized formats, preventing the composable intelligence required for the next wave of applications.

01. The Query Fragmentation Problem

Every major protocol—Uniswap, Aave, Compound—stores data in unique schemas. Building a cross-protocol dashboard requires stitching together dozens of custom subgraphs and RPC calls, a process that is slow, brittle, and expensive to maintain.

  • Result: ~80% dev time spent on data plumbing, not product logic.
  • Opportunity Cost: Missed alpha from cross-chain and cross-protocol correlations.
80% Dev Time Wasted · 100+ Custom Schemas
02. The Solution: Universal Schemas (Like ERC-20 for Data)

Adopt a canonical schema for core primitives: token transfers, liquidity events, governance votes. This is the data layer equivalent of ERC-20.

  • Enables: Instant composability. A DEX aggregator can query all pools uniformly.
  • Drives Value: Analytics platforms like Nansen and Dune become more powerful, attracting more users and fees to the protocols they index.
10x Faster Dev · ERC-20 Analogy
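A sketch of what such a canonical layer could look like, with hypothetical type names; the point is not these exact fields but that every protocol emits the same shapes, just as every ERC-20 exposes the same transfer function.

```typescript
// Hypothetical canonical records for the three core primitives named above.
// Chain- and protocol-identifying fields make the shapes aggregation-ready.

interface CanonicalBase {
  schemaVersion: "1.0.0"; // versioned, so consumers can evolve safely
  chainId: string;        // e.g. CAIP-2 style: "eip155:1"
  protocol: string;       // e.g. "uniswap-v3", "aave-v3"
  txHash: string;
  timestamp: number;
}

interface TokenTransfer extends CanonicalBase {
  kind: "token_transfer";
  token: string;
  from: string;
  to: string;
  amount: bigint;
}

interface LiquidityEvent extends CanonicalBase {
  kind: "liquidity_event";
  pool: string;
  direction: "add" | "remove";
  amounts: { token: string; amount: bigint }[];
}

interface GovernanceVote extends CanonicalBase {
  kind: "governance_vote";
  proposalId: string;
  voter: string;
  support: "for" | "against" | "abstain";
  weight: bigint;
}

// The payoff: one discriminated union any aggregator can consume uniformly.
type CanonicalEvent = TokenTransfer | LiquidityEvent | GovernanceVote;
```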
03. The Indexer Cartel Risk

Without open standards, data access is controlled by a few centralized indexers (The Graph) or infrastructure providers. This creates a single point of failure and rent extraction.

  • Vulnerability: $10B+ DeFi TVL relies on a handful of indexing services.
  • Strategic Move: Protocols that publish standardized data streams become more resilient and attractive to builders, reducing platform risk.
$10B+ TVL at Risk · 1-3 Dominant Indexers
04. The Investor Lens: Data Moats vs. Data Swamps

Investors currently bet on protocols with perceived data moats. In reality, these are data swamps—large but unusable by others. The real value accrual shifts to the standard-setters.

  • Back: Projects building data EIPs (e.g., EIP-7507) or universal adapters like Goldsky.
  • Avoid: Protocols that treat their data as a proprietary fortress; they will be bypassed.
EIP-7507 Key Standard · Moat → Swamp Paradigm Shift
05. The Performance Lie: "Raw Data Is Enough"

Providing raw block data via RPC is not a data product. The gap is in structured, real-time state. Applications need to know the current price of a Uniswap V3 position, not parse 100 logs.

  • Requirement: Sub-100ms access to derived state.
  • Who Wins: Infra that delivers structured streams (e.g., Chainbase, Subsquid) over raw block pipelines.
<100ms Latency Need · Derived State as the Real Product
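The Uniswap V3 example is concrete: the pool stores its price as a Q64.96 square root (sqrtPriceX96), which is meaningless until derived. A minimal sketch of that derivation follows (float math for readability; production code should stay in fixed-point to avoid precision loss).

```typescript
// Derive a human-readable price from Uniswap V3's slot0.sqrtPriceX96.
// The pool encodes sqrt(token1/token0) as a Q64.96 fixed-point number.
function sqrtPriceX96ToPrice(
  sqrtPriceX96: bigint,
  token0Decimals: number,
  token1Decimals: number,
): number {
  const sqrtPrice = Number(sqrtPriceX96) / 2 ** 96; // sqrt(token1/token0), raw units
  const rawPrice = sqrtPrice * sqrtPrice;           // token1 per token0, raw units
  return rawPrice * 10 ** (token0Decimals - token1Decimals); // decimal-adjusted
}

// e.g. a USDC/WETH pool (token0 = USDC, 6 decimals; token1 = WETH, 18 decimals):
// sqrtPriceX96ToPrice(sqrtPriceX96, 6, 18) yields WETH per USDC; invert for USD.
```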
06. Actionable Blueprint for Builders

  1. Instrument Your Protocol: Emit events using emerging standards (e.g., ERC-7507 for positions).
  2. Publish a Public Schema: Make your data model open and versioned.
  3. Support Multiple Indexers: Don't rely on a single subgraph on The Graph; foster competition. This turns your protocol from a data pond into a node in the universal data lake.
3-Step Blueprint · ERC-7507 First Step