Data collection & labelling is big business
Today’s deep dive is brought to you by DIN, a data intelligence network. Enjoy!
During the California Gold Rush of the mid-1800s, thousands of people chased the promise of untapped wealth in a new frontier.
People who had never been wealthy suddenly found themselves with fortunes, stories of rags to riches became commonplace, and entire industries and cities sprang up to support the rush. Infrastructure developed at a breathtaking pace, reshaping the American landscape.
The parallels with Crypto AI are hard to ignore.
Most Crypto AI products today are still in development or running on testnets—indicators that we’re firmly in the infrastructure build-out phase.
Investors and builders are laying the groundwork, positioning themselves for the potential growth surge. The tools, networks, and protocols being established now are the foundation for what could become a sprawling, decentralised AI ecosystem.
If the analogy holds, we’re witnessing the early stages of a digital gold rush—one that could be just as transformative as its 19th-century counterpart.
So imagine my surprise when I stumbled across a Crypto AI project claiming over 700,000 daily active users. Not monthly—daily. In a field as nascent as this, such user metrics are virtually unheard of. Naturally, I had to dig in and figure out what was actually happening under the hood.
This project? DIN, a “Data Intelligence Network.”
Source: Andy Scherpenbeg
I’ve been closely watching data networks in Crypto AI, and it’s clear they’re addressing a critical pain point in the AI landscape: access to valuable datasets.
Today, many of the most valuable data sources are tightly controlled by centralized entities, who charge steep fees to access them.
For example:
The message is clear: Corporates recognize that data is the new battleground, and they’re locking down control to maximize profit.
Crypto offers a potential solution—a way to break free from the centralized stranglehold over valuable datasets.
Crypto data networks take a fundamentally different approach, aiming to build high-quality, decentralized datasets without the bottlenecks of traditional models. By leveraging tokens, these networks can incentivize large-scale data labelling efforts, motivating individuals to contribute to mass data collection or even organising efforts to scrape the web for training data (did someone say… Grass?).
Meanwhile, blockchains provide transparency, creating a framework to track data ownership and provenance. This ensures that contributors are fairly compensated whenever their data is used, establishing a new paradigm where data value is shared rather than monopolized.
DIN is one of the teams that’s tackling the data problem head-on.
At its core, DIN is a data layer that collects and validates both on-chain and off-chain data—while using the blockchain as the settlement layer.
The big idea? Give ownership of data back to users and let them earn rewards for what they contribute to the system.
At first glance, this diagram might seem complex, but let’s break it down.
There are three main actors in the DIN network:
To get a better sense of how the data collectors and validators work, let’s dive into xData, DIN’s primary live product today.
xData is DIN’s flagship platform, designed to collect, organize, and store data from social media platforms like X without relying on X’s API. It operates on a decentralized network, ensuring ownership and privacy for users. It launched in April 2024 on opBNB (a Layer-2 on BNB Chain).
xData’s Chrome Extension
xData makes data collection for users fun and rewarding through gamified mechanics. Here’s a quick look at how it works:
The permissionless nature of xData means that any user worldwide can participate in data collection and annotation and earn rewards, regardless of nationality. For now, data collection happens off-chain, with tagged tweets stored on BNB Greenfield, a decentralized data layer on BNB Chain.
The next natural question is: how do you ensure the quality and integrity of user-submitted data? After all, someone could run an AI bot to randomly tag tweets that don’t match the specified labels just to maximize their wafer earnings.
Data labelling isn’t always straightforward either. Tweets often include nicknames, slang, and cultural references—for instance, Bitcoin is often referred to as “big biscuit” in Mandarin-language tweets.
This is where data validation comes into play.
Chipper nodes are DIN’s AI-driven data validation and processing nodes, responsible for validating and vectorizing data, while also enabling users to earn tokens ($xDIN and $DIN).
Behind the scenes, each user-operated node actually runs a small AI model locally to validate that the tweet’s content matches the attached label before storing it in the decentralized data layer. Users can operate these nodes on standard PCs, without the need for costly hardware setups.
The AI models validators use continuously improve as they process more validated data, allowing the network to get smarter and more accurate over time.
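To make the validation step concrete, here is a toy sketch of a label check. It is not DIN's model: a hand-written alias table stands in for the small AI model each node would run, and the names `LABEL_ALIASES` and `label_matches` are hypothetical. It only illustrates why slang such as “big biscuit” has to be mapped to the right label before a tagged tweet is accepted.

```python
# Hypothetical alias table standing in for a learned model.
# A real Chipper node would run an AI model here, not keyword matching.
LABEL_ALIASES = {
    "bitcoin": {"bitcoin", "btc", "big biscuit"},
}


def label_matches(tweet_text: str, label: str) -> bool:
    """Return True if the tweet mentions the label or one of its known aliases."""
    text = tweet_text.casefold()
    key = label.casefold()
    # Fall back to the bare label when no aliases are registered for it.
    aliases = LABEL_ALIASES.get(key, {key})
    return any(alias in text for alias in aliases)


print(label_matches("Big biscuit is pumping again", "Bitcoin"))
print(label_matches("Nice weather today", "Bitcoin"))
```

A submission that fails a check like this would be rejected rather than stored, which is what makes random tagging by bots unprofitable.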
Currently, DIN handles all data validation in-house, but the goal is to decentralize the validation process. Node testing is underway: users can run the node software on their local devices to test the network, with bug bounties in place as DIN prepares for its mainnet and token launch in the coming weeks.
Although not live yet, computation nodes represent DIN’s future plans for storing data privately and securely. Here’s how they’re planned to work:
No official announcement has been made, but in our research, we uncovered a DIN token on the BNB Chain testnet. This hints at future blockchain developments—potentially a sidechain or Layer 2 solution on BNB Chain.
DIN might feel like a new player, but the project’s origins trace back to late 2021. Initially launched as “Web3Go,” it began as an on-chain data analytics platform within the Polkadot ecosystem, securing grants from the Web3 Foundation and working with clients like Moonbeam and Oak Network.
In 2022, the team broadened its reach to the BNB Chain ecosystem, joining Binance Labs’ MVB incubator and securing investment to develop a “multi-chain open data analytics platform.”
By July 2023, they saw the writing on the wall: generative AI was booming, and the need for robust data infrastructure was becoming more pressing than ever. The team shifted gears to build a comprehensive “data intelligence layer for AI,” aligning their mission with the data demands of AI innovation. This evolution culminated in May 2024, when Web3Go officially rebranded to DIN, marking a bold new focus on data as the next wave of AI advancement.
Source: BNB Chain DappBay
According to DappBay, DIN averaged more than 700,000 daily users and more than 1.2M daily transactions over October. Most of these transactions come from xData users, who must make an on-chain transaction every 24 hours to activate the xData app and earn points.
Source: BNB Chain DappBay
DIN consistently ranks among the top 10 dApps on BNB Chain, and on many days, it’s the #1 app by users on the network. While I haven’t tracked the BNB Chain ecosystem as closely as chains like Solana and Base, this is no small feat—especially considering BNB Chain’s longevity and strong backing from Binance.
To put things in context, I took a look at some of the other top-ranked apps on BNB Chain to see what’s driving engagement there:
According to the team, DIN has collected and labeled over 100M tweets so far, with a user base exceeding 30M across opBNB and Mantle.
What stands out here is DIN’s ability to generate a massive, real-time dataset of relevant tweets quickly, leveraging its substantial user base. This process doesn’t rely on the X API at all.
While xData currently focuses on Twitter, the team plans to expand the data collection and labelling platform to other sources like Reddit, Facebook, Instagram, and essentially any user data platform with high-value information. To me, this is where the real gold lies.
Reiki is another product by DIN that ties neatly into the ongoing AI agent meta—in fact, DIN might have been ahead of its time, given the latent consumer interest in AI agents from what we saw with Truth Terminal and GOAT in recent weeks.
In January 2024, DIN launched Reiki, a platform where users could create AI agents (mainly chatbots) without coding experience. Users could also integrate their own knowledge base, allowing them to build engaging, personalized chatbots reminiscent of MyShell.
After launch, the platform quickly gained traction, becoming the #1 product on Product Hunt.
Reiki also gave creators several ways to monetize their bots, participate in reward programs, and even mint their bots as NFTs—adding a fun layer of ownership to the experience. Notably, BNB Chain’s Discord knowledge support bot is powered by Reiki.
Although the platform has been largely deprecated for now, the DIN team hasn’t ruled out bringing it back after their token launch. If revived, Reiki could provide an additional utility for the token and a way for AI agent creators to leverage the data collected by xData.
In August and September 2024, DIN held a Chipper Node sale, raising $2.5M. These Chipper Nodes will allow users to run validation software on their local devices, using models to ensure data is labelled accurately. The sale was a success, with all 25,112 Tier 2 nodes, priced at $99 each, selling out.
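As a quick sanity check on those figures, the snippet below multiplies out the Tier 2 numbers quoted above. Sales from other tiers are not given here, so this only checks that the Tier 2 sell-out is consistent with the roughly $2.5M headline figure.

```python
# Figures quoted in the text for the Tier 2 sell-out.
TIER2_NODES_SOLD = 25_112
TIER2_PRICE_USD = 99

# Revenue attributable to Tier 2 alone; other tiers are not included.
tier2_revenue_usd = TIER2_NODES_SOLD * TIER2_PRICE_USD

print(f"Tier 2 revenue: ${tier2_revenue_usd:,}")
```

Tier 2 alone works out to $2,486,088, which rounds to the reported $2.5M.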
Pre-TGE, xData users can convert their wafers (points) into xDIN—a pre-airdrop token. However, there will be a conversion fee ranging from 5–30%, with those fees distributed to Chipper Node owners. This conversion mechanism isn’t live yet but is expected to begin once node “pre-mining” is live later this month.
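Here is a minimal sketch of how such a conversion could be computed. Several things are assumptions, since the mechanism is not live: a 1:1 wafer-to-xDIN rate before fees, fees expressed in basis points, and the function name `convert_wafers`. How DIN will pick a fee within the 5–30% range is not stated, so the caller supplies it.

```python
MIN_FEE_BPS = 500    # 5%, lower bound quoted in the text
MAX_FEE_BPS = 3_000  # 30%, upper bound quoted in the text


def convert_wafers(wafers: int, fee_bps: int) -> tuple[int, int]:
    """Convert wafers to xDIN at an assumed 1:1 rate, minus a conversion fee.

    Returns (xdin_to_user, fee_to_node_owners).
    """
    if wafers < 0:
        raise ValueError("wafers must be non-negative")
    if not MIN_FEE_BPS <= fee_bps <= MAX_FEE_BPS:
        raise ValueError("fee must be between 5% and 30%")
    # Integer arithmetic; the fee is rounded down in the user's favour.
    fee = wafers * fee_bps // 10_000
    return wafers - fee, fee


for fee_bps in (MIN_FEE_BPS, MAX_FEE_BPS):
    net, fee = convert_wafers(1_000, fee_bps)
    print(f"fee {fee_bps / 100:.0f}%: user receives {net} xDIN, node owners receive {fee}")
```

Under these assumptions, converting 1,000 wafers leaves the user with 950 xDIN at the lowest fee and 700 xDIN at the highest.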
At TGE, users will receive an airdrop of DIN (the tradable token) in proportion to their xDIN holdings, fully released with no complex lock-up mechanisms.
After the TGE, 25% of the total DIN token supply will be reserved for Chipper Node rewards. Half of this allocation will unlock in the first year, with the remaining emissions halving yearly.
Note that this is a relatively fast unlock compared to other projects conducting node sales, where node rewards are distributed gradually over 3–4 years.
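To see how front-loaded this is, the sketch below models the schedule under one reading of the description: half of the node allocation is emitted in year one, and each following year emits half of the previous year's amount. Detailed tokenomics have not been published, so treat the schedule and the helper name `yearly_emissions` as assumptions.

```python
from fractions import Fraction

# 25% of total DIN supply reserved for Chipper Node rewards (from the text).
NODE_ALLOCATION = Fraction(25, 100)


def yearly_emissions(years: int) -> list[Fraction]:
    """Share of total supply emitted to nodes in each year.

    Assumes year 1 emits half the allocation and emissions halve every year after.
    """
    return [NODE_ALLOCATION * Fraction(1, 2 ** year) for year in range(1, years + 1)]


schedule = yearly_emissions(4)
for year, share in enumerate(schedule, start=1):
    emitted = sum(schedule[:year]) / NODE_ALLOCATION
    print(
        f"Year {year}: {float(share):.4%} of total supply, "
        f"{float(emitted):.2%} of the node allocation emitted so far"
    )
```

Under this reading, 87.5% of the node allocation would be emitted by the end of year three, which is what makes the unlock fast next to a linear 3–4 year schedule.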
Validator nodes will likely need to stake DIN tokens to participate in the network. In return, they’ll earn rewards for validating data, but they'll face slashing penalties if their outputs are inaccurate.
On the other end, data consumers must spend DIN tokens to access the network’s data. Since most Web2 businesses are still hesitant to engage with crypto, the company will need to facilitate these transactions to bridge the gap between traditional enterprises and the decentralized network.
We’re still awaiting detailed DIN tokenomics, which should be released closer to the TGE.
The core team at DIN brings together talent from Columbia University, University College London, and the University of Stuttgart, with a decade of expertise in AI and blockchain.
DIN’s founder, Hao Ding, holds a Master’s degree in Information Technology from the University of Stuttgart. Before diving into crypto, Hao served as Director of Research Development at the Suzhou Institute of Artificial Intelligence in China. He then took on the role of Vice-President at Litentry, an identity oracle network, before founding Web3Go.
I had the pleasure of meeting Hao in person, and we had great conversations about the future of AI. His belief? Data will be at the heart of it all. The DIN team currently consists of 16 members, primarily engineers.
DIN participated in Binance Labs’ MVB 5 accelerator program and raised $4M in a seed round in July 2023, led by Binance Labs, HashKey, NGC, and Shima Capital. In August 2024, DIN secured another $4M in funding, with participation from Manta Network, Moonbeam Network, Ankr, and Maxx Capital, bringing its total fundraising to $8M.
Source: https://sacra.com/c/scale-ai/
Data collection and labelling is big business.
Scale AI is the best-known player in this space, reporting annual recurring revenue of ~$1B. This is fuelled by heavy demand from foundational AI model companies like OpenAI, Anthropic, and Cohere, which are Scale’s main customers. As of May 2024, it was valued at a whopping $14B.
Let’s take a closer look at Scale AI’s business model.
Scale relies on a large, distributed workforce for its data labelling tasks, which involve manual tagging of videos, sorting photos, and transcribing audio.
It employs ~240,000 workers across several countries, actively recruiting in regions with high unemployment rates and lower costs of living. Kenya, for instance, has become a key recruitment hub in Africa, with in-person “boot camps” in Nairobi and targeted paid advertisements to attract workers.
The labelling process typically has two layers: a first layer of annotators who label data from scratch and a second layer of quality controllers who review the work, add missing annotations and correct errors. It’s very human-intensive, but it works because human labour costs are low, and its clients are willing to pay significant money.
Now, imagine scaling this model (pun intended) through decentralized networks. A global, permissionless workforce incentivized by tokens could allow anyone to participate, while a distributed network of validators ensures data accuracy and quality. Decentralization could open up new possibilities for scaling data labelling, turning it into a truly global, democratized process.
DIN’s primary advantage today lies in its large, engaged community—built up over two years of focused community-building efforts. With this network, DIN can rapidly mobilize data collection based on specific criteria. The challenge, however, is identifying where the true data demand lies, directing its users to collect and label the right datasets, and building sustainable revenue streams to support long-term growth.
Right now, much of the user engagement is driven by anticipation of token rewards once the token goes live. But if the team can’t drive sufficient demand for the token later on, usage could drop off as initial interest fades. Creating this demand will require both speculative interest and an established market of data consumers eager to buy these datasets.
DIN isn’t the only crypto team vying for a share of this market—projects like Sapiens, Grass, and Masa are also in the race. But the pie is substantial. Take GRASS, for instance, which currently has a market cap of $2.5 billion, underscoring the scale of opportunity in this sector.
One path for DIN to differentiate and stand out could be training and deploying proprietary AI models for data validation, reducing dependence on human labour. This automation-first approach could streamline operations, enhance scalability, and give DIN an edge over competitors still relying heavily on manual processes.
Data networks represent one of the most exciting frontiers at the intersection of AI and crypto. Unlike traditional centralized models, crypto-powered data networks leverage decentralized participation and incentives to build high-quality datasets at scale.
DIN is positioning itself as an early mover in this space, and it’ll be fascinating to see how the project develops. This is DIN’s opportunity to seize. I often tell people: data networks are one of the smartest areas to be building in right now.
Crypto is reshaping how data is collected, validated, and monetized—building the foundation for a new, decentralized data economy.
Cheers,
Teng Yan
This research was sponsored by DIN, with Chain of Thought receiving funding for this initiative. All insights and analysis are our own. We uphold strict standards of objectivity in all our viewpoints.
To learn more about our approach to sponsored Deep Dives, please see our note here.
We’re grateful to our research partners for helping us keep our Crypto AI research free and accessible to all.