Can Collective Wisdom reduce hallucinations & bias in AI?
Hallucination: an experience involving the apparent perception of something not present.
Andrej Karpathy calls AI “dream machines.” He believes that hallucinations—those moments when AI confidently generates things that aren’t real—are a feature, not a bug. It’s futile to try to eliminate them entirely. And honestly, there’s something poetic about that.
# On the "hallucination problem"
I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the… x.com/i/web/status/1…
— Andrej Karpathy (@karpathy)
1:35 AM • Dec 9, 2023
A large language model (LLM) is an artist, a creator. It dreams in code, generates ideas out of thin air, and spins meaning from data. But for AI to move from beautiful daydreams to practical, everyday applications, we must rein in those hallucinations.
Error rates for LLMs remain high across many tasks—often hovering around 30%. At that level, LLMs still require a human-in-the-loop to reach a usable standard of accuracy.
But when we hit that elusive 99.x% accuracy—where outputs are reliable without human oversight—magic happens. That’s the threshold where AI achieves human-level reliability, unlocking an endless universe of use cases previously out of reach.
Reaching that level of precision, however, is no small feat. It demands relentless engineering effort and innovation.
The story of Mira starts here. But before we dive in, let’s take a moment to talk about LLM development—and why verifications are shaping up to be the next big thing in AI.
LLM development is the latest iteration in the deep learning journey—distinct from the traditional software development practices we’ve honed over the past 50+ years. LLMs, which have only been around for about three years, flip the script completely, moving from deterministic thinking (if X, then Y) to probabilistic reasoning (if X, then… maybe Y?).
This means the infrastructure for an AI-driven world demands an entirely new set of tools and workflows. Yet many of these tools are still locked inside the research labs that created the LLMs.
The good news is that these tools are starting to trickle out into the public domain, opening up a world of possibilities for developers everywhere.
At the tail end of this new workflow lies a critical piece of the puzzle: evaluations & verifications. Today, our spotlight lands on these. They answer a fundamental question: Is the AI working well?
Trust is the foundation of any great AI product.
As AI becomes an increasingly integral part of our lives, the technology itself remains fragile. Mistakes happen, and when they do, trust erodes quickly. Users expect AI to be accurate, unbiased, and genuinely helpful, but without reliable systems in place to ensure that, frustration mounts—and frustration leads to churn.
This is where verifications come into play.
Verifications act as a safeguard. They are the quality assurance layer developers rely on to refine outputs and build systems that users can trust.
Mira is tackling a core Web2 problem with the trustless transparency of crypto. By leveraging a decentralized network of verifier nodes, Mira ensures that AI outputs are accurately and independently verified.
Let’s say you have a paragraph of output from an LLM about the city of Paris. How do you verify that it is accurate? It is hard to do so because there is so much nuance around everything from claims to the structure of the content to the writing style.
This is where Mira steps in.
Mira’s vision is bold: to create a layer-1 network that delivers trustless, scalable, and accurate verification of AI outputs. By harnessing collective wisdom, Mira reduces biases and hallucinations while proving how blockchain can truly enhance AI.
Early results are promising. In a recent study published on Arxiv, Mira demonstrated that using multiple models to generate outputs and requiring consensus between them significantly boosts accuracy. Precision reached 95.6% with three models, compared to 73.1% for a single model's output.
Two key design elements power Mira's approach: content transformation (sharding and binarization) and ensemble verification.
AI-generated outputs range from simple statements to sprawling essays, thanks to the near-zero cost of content generation. But this abundance of complexity creates a challenge: how do you ensure the accuracy of such diverse outputs?
Mira’s solution is simple: break it down.
Mira transforms complex AI-generated content into smaller, digestible pieces that AI models can objectively review, in a process called sharding.
By standardising outputs and breaking them into discrete, verifiable claims, Mira ensures every piece can be evaluated consistently, eliminating the ambiguity that often plagues evaluations.
For example, consider this compound statement:
“Photosynthesis occurs in plants to convert sunlight into energy, and bees play a critical role in pollination by transferring pollen between flowers.”
On the surface, it seems simple to verify. But when handed to multiple models, interpretation quirks might lead to different answers. Mira’s content transformation via sharding solves this by splitting the statement into two independent claims:
• Photosynthesis occurs in plants to convert sunlight into energy.
• Bees play a critical role in pollination by transferring pollen between flowers.
Once sharded, each claim undergoes binarization, where it’s converted into a multiple-choice question. These questions are distributed to a network of nodes running AI models. Using Mira’s ensemble verification method, the models collaborate to evaluate and confirm the validity of each claim.
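To make sharding and binarization concrete, here is a minimal Python sketch of the pattern. It is purely illustrative: the `shard` and `binarize` helpers and the naive split on the conjunction are stand-ins of my own, not Mira's implementation (in practice, claim extraction would itself be model-assisted).

```python
# Illustrative sketch of sharding + binarization (not Mira's actual API).
# A compound output is split into atomic claims, and each claim is turned
# into a multiple-choice question that any verifier model can answer.

from dataclasses import dataclass

@dataclass
class Claim:
    text: str

@dataclass
class BinaryQuestion:
    prompt: str
    options: tuple  # a fixed answer set keeps verifier responses comparable

def shard(compound_output: str) -> list[Claim]:
    # Toy claim extraction: split on the conjunction joining the two claims.
    parts = [p.strip().rstrip(".") for p in compound_output.split(", and ")]
    return [Claim(p) for p in parts]

def binarize(claim: Claim) -> BinaryQuestion:
    return BinaryQuestion(
        prompt=f'Is the following claim factually correct? "{claim.text}"',
        options=("A) Yes", "B) No"),
    )

output = ("Photosynthesis occurs in plants to convert sunlight into energy, "
          "and bees play a critical role in pollination by transferring pollen between flowers.")

for question in (binarize(c) for c in shard(output)):
    print(question.prompt, question.options)
```

The fixed answer set is the point of binarization: every verifier answers the same question in the same format, so responses can be compared and tallied consistently.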
Currently, Mira’s content sharding and binarization capabilities are focused on text inputs. By early 2025, these processes will expand to support multimodal inputs, such as images and videos.
Mira has developed an advanced verification system that combines the strengths of multiple AI models to assess the quality of AI outputs.
Let’s unpack that.
Traditional automated evaluations often rely on a single large language model (LLM), like GPT-4, as the ultimate arbiter of quality. While functional, this approach has significant flaws: it’s costly, prone to bias, and limited by the quirks and “personality” inherent in models.
Mira’s breakthrough is a shift from reliance on a single massive model to leveraging an ensemble of diverse LLMs. This ensemble excels in tasks where factual accuracy is more important than creative flair, reducing error rates and delivering more reliable, consistent verifications.
Ensemble techniques have been well-studied in machine learning tasks like classification, and Mira is now bringing this to verification.
At the heart of Mira’s system is the Panel of LLM verifiers (PoLL)—a collaborative network of models that work together to verify outputs. Think of it as a diverse panel of experts weighing in on a decision rather than leaving it to a single, potentially biased judge.
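As a rough mental model (not Mira's code), the sketch below shows the panel pattern: several hypothetical judge models answer the same binarized question, and a simple tally decides the outcome so no single judge's bias dominates.

```python
# Minimal sketch of a "panel of LLM verifiers": diverse judges vote on the same
# binarized question and their votes are aggregated. Model names and the mock
# votes are illustrative only.

from collections import Counter

def panel_verdict(votes: dict[str, str]) -> dict:
    """Aggregate per-model answers ('A' = claim correct, 'B' = incorrect)."""
    tally = Counter(votes.values())
    winner, count = tally.most_common(1)[0]
    return {
        "verdict": winner,
        "agreement": count / len(votes),   # how strongly the panel agrees
        "unanimous": count == len(votes),
    }

# Hypothetical responses from three smaller judge models to one question:
votes = {"gpt-3.5-turbo": "A", "claude-3-haiku": "A", "command-r": "A"}
print(panel_verdict(votes))   # {'verdict': 'A', 'agreement': 1.0, 'unanimous': True}
```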
And this is not just wishful thinking—it’s grounded in research. Take a look at the chart below:
Accuracy changes of different evaluation judges compared to human judges. PoLL (group of models, far right) showed the smallest spread in scores compared to human judges.
A Cohere study published in April 2024 demonstrated that a panel of three smaller models—GPT-3.5, Claude-3 Haiku, and Command R—aligned more closely with human judgments than GPT-4 alone. Remarkably, this ensemble method was also 7x cheaper.
Mira is now putting this research into action, deploying its ensemble verification method at scale. The internal results they have shared so far are compelling:
• Error rates reduced from 80% to 5% for complex reasoning tasks.
• 5x improvements in speed and cost compared to human verification.
This is no small feat. By employing consensus mechanisms, Mira’s diverse ensemble of models effectively filters out hallucinations and balances individual model biases. Together, they deliver something greater than the sum of their parts: verifications that are faster, cheaper, and more aligned with our needs.
To recap, Mira’s verification system is built on two fundamental design principles:
Maintaining a diverse set of models is essential for high-quality outputs, making Mira’s design ideal for a decentralised architecture. Eliminating single points of failure is crucial for any verification product.
Mira uses a blockchain-based approach to ensure no single entity can manipulate outcomes. The premise is simple: AI-generated outputs should be verified just like blockchain state changes.
Verification happens through a network of independent nodes, with operators economically incentivised to perform accurate verifications. By aligning rewards with honesty, Mira’s system discourages bad actors and ensures reliable results.
Here’s how it works:
Mira ensures data confidentiality by breaking input data into smaller pieces, ensuring no single node has access to the complete dataset.
For additional security, Mira supports dynamic privacy levels, allowing users to adjust the number of shards based on data sensitivity. While higher privacy levels require more sharding (and thus higher costs), they provide added confidentiality for users handling sensitive information.
Every verification a node performs is recorded on the blockchain, creating a transparent and auditable record of the verification process. This immutable ledger ensures trust and accountability that traditional, non-blockchain-based approaches cannot achieve.
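To picture what such an auditable record might contain, here is a hedged sketch. The field names and hashing scheme are assumptions for illustration; Mira has not published this data structure.

```python
# Sketch of an auditable verification record, one way to imagine what
# "every verification is recorded on-chain" could look like. Fields and
# hashing are illustrative assumptions, not Mira's specification.

import hashlib, json, time

def verification_record(claim_hash: str, node_id: str, verdict: str) -> dict:
    record = {
        "claim_hash": claim_hash,     # a hash referencing the shard, not its raw content
        "node_id": node_id,
        "verdict": verdict,           # e.g. "A" / "B" from the binarized question
        "timestamp": int(time.time()),
    }
    # The record's own hash is what would be anchored on-chain for auditability.
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

print(verification_record(claim_hash="3f7a...", node_id="node-17", verdict="A"))
```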
This sets a new standard for secure and unbiased AI verification.
In Mira’s decentralised network, honest work is rewarded.
Experts can deploy specialised AI models via node software and earn tokens for accurate verifications. AI developers, in turn, pay fees per verification, creating a self-sustaining economic loop between demand and supply.
This approach bridges real value from Web2 workflows into the Web3 ecosystem, directly rewarding participants such as inference providers and model creators.
But incentives come with challenges. In any decentralised system, bad actors will try to exploit the network, submitting fake results to earn rewards without doing the work.
So, how do we make sure nodes are actually performing their tasks accurately and honestly?
To maintain integrity, Mira employs Proof-of-Verification—a mechanism inspired by Bitcoin’s proof-of-work but designed for AI. Instead of mining blocks, nodes must prove they’ve completed verification tasks to participate in the consensus process.
Here’s how it works:
Proof-of-Verification creates a balanced system in which nodes are economically motivated to perform high-quality verifications. This mechanism ensures that the network remains secure and reliable over time.
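The section above does not spell out the mechanism itself, so the sketch below shows one standard pattern that fits the description, a commit-reveal scheme: a node first publishes a hash of its verdict, then reveals the verdict once all commitments are in, which stops a lazy node from simply copying other nodes' answers. This is an assumption about how such a proof could work, not Mira's documented design.

```python
# Hedged sketch of a commit-reveal pattern (an assumption, not Mira's spec):
# each node commits to its verdict before seeing anyone else's answer.

import hashlib, secrets

def commit(verdict: str) -> tuple[str, str]:
    nonce = secrets.token_hex(16)
    commitment = hashlib.sha256(f"{verdict}:{nonce}".encode()).hexdigest()
    return commitment, nonce          # commitment is published; nonce stays private

def reveal_is_valid(commitment: str, verdict: str, nonce: str) -> bool:
    return hashlib.sha256(f"{verdict}:{nonce}".encode()).hexdigest() == commitment

# A node commits before consensus begins...
c, n = commit("A")
# ...and later reveals; the network checks the reveal against the commitment.
assert reveal_is_valid(c, "A", n)
assert not reveal_is_valid(c, "B", n)
```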
Here’s the question: If Mira’s approach is so effective, why isn’t everyone doing it?
The answer lies in the trade-offs and complexities of implementing such a system in the real world. Achieving the perfect balance between fast, accurate evaluations and managing the intricacies of multiple models is no small feat.
One of Mira’s biggest hurdles is latency. While using ensembles of models allows verifications to run in parallel, synchronising results and reaching consensus introduces delays. The process is only as fast as the slowest node.
Currently, this makes Mira ideal for batch processing of AI outputs—use cases where real-time results aren’t required. As the network grows with more nodes and compute availability, the long-term goal is to achieve real-time verifications, expanding Mira’s applicability to a wider range of scenarios.
Beyond latency, other challenges include:
• Engineering Complexity: Orchestrating evaluations across multiple models and ensuring the consensus mechanism operates smoothly demands significant engineering effort.
• Higher Compute Requirements: Even when using smaller models, running them together in ensembles increases computational demands.
• Consensus Mechanism Design: The way consensus is achieved—through majority voting, weighted scoring, or other methods (sketched below)—plays a critical role in the system’s reliability. In ambiguous cases, ensembles may struggle to align, leading to inconsistent results.
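The sketch below contrasts the two consensus rules named in the last item, simple majority versus an accuracy-weighted vote, and shows how ambiguous cases can surface as "no consensus"; the weights and threshold are illustrative assumptions.

```python
# Two toy consensus rules over node votes ('A' = claim correct, 'B' = incorrect).
# Weights and the 0.6 threshold are illustrative assumptions.

def majority_vote(votes: dict[str, str]) -> str | None:
    yes = sum(v == "A" for v in votes.values())
    no = len(votes) - yes
    if yes == no:
        return None            # ambiguous: the panel failed to align
    return "A" if yes > no else "B"

def weighted_vote(votes: dict[str, str], weights: dict[str, float],
                  threshold: float = 0.6) -> str | None:
    total = sum(weights[n] for n in votes)
    yes_share = sum(weights[n] for n, v in votes.items() if v == "A") / total
    if yes_share >= threshold:
        return "A"
    if yes_share <= 1 - threshold:
        return "B"
    return None                # no strong consensus either way

votes = {"node-1": "A", "node-2": "B", "node-3": "A", "node-4": "B"}
weights = {"node-1": 0.9, "node-2": 0.4, "node-3": 0.8, "node-4": 0.5}
print(majority_vote(votes))              # None (a 2-2 tie)
print(weighted_vote(votes, weights))     # "A" (1.7 of 2.6 ~ 0.65 >= 0.6)
```

Returning "no consensus" rather than forcing a verdict is one way to surface the ambiguous cases mentioned above, for instance for escalation or human review.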
Mira's API integrates with any application as easily as OpenAI’s GPT-4o does. It serves consumer and B2B applications alike, making it a versatile solution across use cases. Today, over a dozen applications use Mira’s infrastructure.
On the consumer side, Mira is already powering AI verification for several early-stage AI apps:
Delphi Oracle is the latest and perhaps most advanced integration. This AI-powered research assistant allows Delphi Digital members to engage directly with research content, ask questions, clarify points, integrate price feeds, and adjust the content to various levels of complexity.
Delphi Oracle leverages Mira Network’s verification technology to deliver reliable and accurate responses. By verifying responses across multiple models, Mira reduces hallucination rates from ~30% to under 5%, ensuring a strong foundation of trust.
At the core of Delphi Oracle is a high-performance query router.
This smart routing system, combined with intelligent caching, ensures optimal performance by balancing latency, cost, and quality.
Mira’s testing revealed that smaller, cost-effective models could handle most queries almost as well as larger models. This has resulted in a 90% reduction in operational costs, all while maintaining the high-quality responses users expect.
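The routing-plus-caching pattern described above can be sketched as follows. The heuristic, model names, and cache size are simplified assumptions of mine, not Delphi Oracle's actual router.

```python
# Illustrative sketch of query routing with caching: cheap model by default,
# larger model for harder queries, repeated queries served from cache.

from functools import lru_cache

SMALL_MODEL = "small-fast-model"      # hypothetical cheap model
LARGE_MODEL = "large-capable-model"   # hypothetical stronger model

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for a real inference call to whichever provider is in use.
    return f"[{model}] answer to: {prompt[:40]}..."

def pick_model(query: str) -> str:
    # Toy heuristic: long or comparative questions go to the larger model.
    hard = len(query.split()) > 40 or "compare" in query.lower()
    return LARGE_MODEL if hard else SMALL_MODEL

@lru_cache(maxsize=1024)
def answer(query: str) -> str:
    # Identical queries hit the cache instead of triggering new inference.
    return call_llm(model=pick_model(query), prompt=query)

print(answer("What is the staking mechanism described in this report?"))
```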
Though many of these consumer apps are still early, they highlight Mira’s ability to integrate seamlessly and support large, active user bases. It’s not hard to imagine thousands of applications plugging into Mira’s ecosystem—so long as the developer experience remains simple and the value proposition stays clear.
On the B2B front, Mira is zeroing in on specialised integrations in industries where trust and precision are paramount, with an initial focus on healthcare and education.
Key applications include:
Mira’s ultimate goal is to offer natively verified generations—where users simply connect via an API, just like OpenAI or Anthropic, and receive pre-verified outputs before they’re returned.
They aim to replace existing model APIs by providing highly reliable versions of existing models (e.g., Mira-Claude-3.5-Sonnet or Mira-OpenAI-GPT-4o), enhanced with built-in, consensus-based reliability.
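If such an API were OpenAI-compatible (an assumption on my part, not a published interface), adopting it could look like little more than a model-name swap; the base URL, key, and model name below are hypothetical.

```python
# Hypothetical sketch of a "drop-in" verified generation, assuming an
# OpenAI-compatible endpoint. Nothing here is a published Mira interface.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.mira.example/v1",   # hypothetical endpoint
    api_key="YOUR_MIRA_API_KEY",
)

response = client.chat.completions.create(
    model="mira-claude-3.5-sonnet",            # verified variant of an existing model
    messages=[{"role": "user", "content": "Summarise the history of Paris in 3 bullet points."}],
)
print(response.choices[0].message.content)
```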
Generative AI is on a rocket ship. According to Bloomberg, the market is projected to grow at a jaw-dropping 42% CAGR, with revenue surpassing $1 trillion by 2030. Within this massive wave, tools that improve the speed, accuracy, and reliability of AI workflows will capture a meaningful slice.
As more enterprises integrate LLMs into their workflows—ranging from customer support chatbots to complex research assistants—the need for robust model verifications becomes more pressing.
Organisations will be seeking tools that can (1) measure model accuracy and reliability, (2) diagnose prompt and parameter inefficiencies, (3) continuously monitor performance and drift, and (4) ensure compliance with emerging regulatory frameworks around AI safety.
Sound familiar? It’s a playbook we’ve seen before with MLOps (short for “Machine Learning Operations”). As machine learning scaled in the 2010s, tools for deploying, tracking, and maintaining models became essential, creating a multi-billion-dollar market. With the rise of generative AI, LLMOps is following the same trajectory.
Capturing even a small slice of the trillion-dollar market could push this sub-sector to $100B+ by 2030.
Several Web2 startups are already positioning themselves, offering tools to annotate data, fine-tune models, and evaluate performance:
• Braintrust ($36M raised)
• Vellum AI ($5M raised)
• Humanloop ($2.8M raised)
These early movers are laying the groundwork, but the space is fluid. In 2025, we are likely to see a proliferation of startups in this sector. Some may specialise in niche evaluation metrics (e.g., bias detection and robustness testing), while others broaden their offerings to cover the entire AI development lifecycle.
Larger tech incumbents—such as major cloud providers and AI platforms—will likely bundle evaluation features into their offerings. Last month, OpenAI introduced evaluations directly on its platform. To stay competitive, startups must differentiate through specialization, ease of use, and advanced analytics.
Mira isn’t a direct competitor to these startups or incumbents. Instead, it’s an infrastructure provider that seamlessly integrates with both via APIs. The key? It just has to work.
Mira’s initial market size is tied to LLMOps, but its total addressable market will expand to all of AI because every AI application will need more reliable outputs.
From a game theory perspective, Mira is in a unique position. Unlike model providers such as OpenAI, which are locked into supporting their own systems, Mira can integrate across models. This positions Mira as the trust layer for AI, offering reliability that no single provider can match.
Mira’s 2025 roadmap aims to balance integrity, scalability, and community participation on its path to full decentralisation:
In the early stage, vetted node operators ensure network reliability. Well-known GPU compute providers serve as the first wave of operators, handling initial operations and laying a strong foundation for growth.
Mira introduces designed duplication, where multiple instances of the same verifier model process each request. While this increases verification costs, it’s essential for identifying and removing malicious operators. By comparing outputs across nodes, bad actors are caught early.
In its mature form, Mira will implement random sharding to distribute verification tasks. This makes collusion economically unviable and strengthens the network’s resilience and security as it scales.
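As a toy picture of random sharding, the sketch below draws a fresh random subset of verifier nodes per claim; the node list and subset size are made up for illustration, and the real network would need its randomness to be both unpredictable to operators and publicly auditable.

```python
# Toy sketch of random shard assignment: colluders cannot predict or arrange
# which nodes will verify a given claim. Node list and subset size are made up.

import random

NODES = [f"node-{i}" for i in range(1, 21)]   # hypothetical operator set

def assign_verifiers(k: int = 5) -> list[str]:
    # A fresh random subset per claim; production systems would need the draw
    # to be unpredictable to operators yet verifiable after the fact.
    return random.sample(NODES, k)

print(assign_verifiers())
```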
In the final stage, Mira will offer natively verified generations. Users will connect via API, similar to OpenAI or Anthropic, and receive pre-verified outputs—reliable, ready-to-use results without additional validation.
In the coming months, Mira is gearing up for several major milestones:
Mira is expanding opportunities for community involvement through its Node Delegator Program. This initiative makes supporting the network accessible to everyone—no technical expertise is required.
The process is simple: You can rent compute resources and delegate them to a curated group of node operators. Contributions can range from $35 to $750, and rewards are offered for supporting the network. Mira manages all the complex infrastructure, so node delegators can sit back, watch the network grow, and capture some upside.
You can use the following code exclusive to Chain of Thought readers (300 invites only, fastest fingers first) to whitelist yourself for the delegator program: COTR0
Today, Mira has a small but tight team that is largely engineering-focused.
There are 3 co-founders:
Together, they bring investment acumen, technical innovation, and product leadership to Mira’s vision for decentralised AI verification. Mira raised a $9M seed round in July 2024, led by BITKRAFT and Framework Ventures.
It’s refreshing to see a Crypto AI team tackling a fundamental Web2 AI problem—making AI better—rather than playing speculative games in crypto’s bubble.
The industry is waking up to the importance of verifications. Relying on “vibes” is no longer enough. Every AI application and workflow will soon need a proper verification process—and it’s not a stretch to imagine future regulations mandating these processes to ensure safety.
Mira’s approach leverages multiple models to independently verify outputs, avoiding reliance on a single centralised model. This decentralised framework enhances trust and reduces the risks of bias and manipulation.
And let’s consider what happens if we get to AGI in the next few years (a real possibility).
As Anand Iyer from Canonical points out, if AI can subtly manipulate decisions and code, how can we trust the systems testing for these behaviours? Smart people are thinking ahead. Anthropic’s research underscores the urgency, highlighting evaluations as a critical tool to identify potentially dangerous AI capabilities before they escalate into problems.
By enabling radical transparency, blockchains add a powerful layer of protection against rogue AI systems. Trustless consensus mechanisms ensure that safety evaluations are verified by thousands of independent nodes (like on Mira), drastically reducing the risk of Sybil attacks.
Mira is chasing a huge market with clear demand for a solution that works. But the challenges are real. Improving latency, precision, and cost efficiency will require relentless engineering effort and time. The team will need to consistently demonstrate that their approach is measurably better than existing alternatives.
The core innovation lies in Mira’s binarization and sharding process. This “secret sauce” promises to address scalability and trust challenges. For Mira to succeed, this technology needs to deliver on its promise.
In any decentralised network, token and incentive design are make-or-break factors. Mira’s success will depend on how well these mechanisms align participant interests while maintaining network integrity.
While the details of Mira’s tokenomics remain under wraps, I expect the team to reveal more as the token launch approaches in early 2025.
“We’ve found that engineering teams who implement great evaluations move significantly faster – up to 10 times faster – than those who are just watching what happens in production and trying to fix them ad-hoc,”
Ankur Goyal, Braintrust
In an AI-driven world, trust is everything.
As models become more complex, reliable verifications will underpin every great AI product. They help us tackle hallucinations, eliminate biases, and ensure AI outputs align with users' actual needs.
Mira automates verifications, cutting costs and reliance on human intervention. This unlocks faster iterations, real-time adjustments, and scalable solutions without bottlenecks.
Ultimately, Mira aims to be the API for trust—a decentralised verification framework that every AI developer and application can depend on for verified answers.
It’s bold, ambitious, and exactly what the AI world needs.
Thanks for reading,
Teng Yan
This research deep dive was sponsored by Mira, with Chain of Thought receiving funding for this initiative. All insights and analysis are our own. We uphold strict standards of objectivity in all our viewpoints.
To learn more about our approach to sponsored Deep Dives, please see our note here.
This report is intended solely for educational purposes and does not constitute financial advice. It is not an endorsement to buy or sell assets or make financial decisions. Always conduct your own research and exercise caution when making investment choices.