
The Genesis of a Revolutionary Change: From Attention to Power Retention
In 2017, the introduction of the transformer architecture in the paper “Attention Is All You Need” by researchers at Google marked a pivotal moment in artificial intelligence. It reshaped how machines process sequences, giving rise to powerful models like GPT, Claude, and Llama, all of which employ the attention mechanism. This innovation, while transformative, faces growing concerns about its computational and memory demands, particularly as context lengths expand.
Enter Manifest AI’s Brumby-14B-Base, a retrained variant of Qwen3 that aims to tackle these limitations head-on. Deviating from traditional attention mechanisms, Brumby harnesses a novel technique called Power Retention, designed to efficiently manage information across extended contexts without inflating memory requirements.
This article delves into the intricacies of this architectural shift, exploring its implications, practical applications, and potential to revolutionize the AI landscape.
Understanding Power Retention: A Paradigm Shift
Redefining Contextual Understanding
Traditional transformers leverage attention to assess the relevance of information across a sequence, but at what cost? The quadratic scaling of compute and memory with sequence length has become untenable. Power Retention reimagines this process by integrating a recurrent state update mechanism, akin to Recurrent Neural Networks (RNNs).
Power Retention maintains a memory matrix, updating it dynamically with incoming data and learned gating signals. This process compresses historical data into a fixed-size latent state, eliminating the need for exhaustive pairwise comparisons inherent in attention-based models. The result? Constant-time per-token computation, irrespective of context length.
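To make the mechanism concrete, here is a minimal sketch of a gated recurrent state update in the spirit of Power Retention. This is not Manifest AI’s implementation; the function name, dimensions, and the fixed scalar gate are illustrative assumptions.

```python
import numpy as np

def retention_step(S, q, k, v, gate):
    """One recurrent step of a retention-style layer (illustrative sketch).

    S    : (d_k, d_v) fixed-size memory matrix carried across tokens
    q, k : (d_k,) query/key features for the current token
    v    : (d_v,) value vector for the current token
    gate : scalar in (0, 1); learned in practice, fixed here
    """
    S = gate * S + np.outer(k, v)   # compress history into the state
    y = q @ S                       # read out a (d_v,) output for this token
    return S, y

# Per-token cost is O(d_k * d_v) no matter how many tokens came before.
d_k, d_v = 64, 64
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(1_000):              # stream 1,000 tokens
    q, k = rng.normal(size=d_k), rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S, y = retention_step(S, q, k, v, gate=0.99)
```

Note that the state S never grows: a million-token context occupies exactly the same memory as a ten-token one.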
Balancing Efficiency and Expressive Power
While Power Retention fundamentally diverges from conventional transformers, it preserves the expressive capabilities that make attention models successful. By involving tensor powers of the input, it captures higher-order dependencies, enabling the model to maintain long-term contextual relationships efficiently.
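The precise feature map is specified in Manifest AI’s Power Retention work; the toy sketch below illustrates only the general idea of a tensor-power expansion, here at degree p = 2, which maps a d-dimensional key to d² features so the fixed-size state can record pairwise interactions rather than single-feature ones.

```python
import numpy as np

def power_features(x, p=2):
    """Degree-p tensor-power feature map: flatten(x ⊗ x ⊗ ... ⊗ x).

    For p = 2 a d-dim vector becomes d*d features, so the recurrent
    state can capture higher-order (pairwise) dependencies.
    """
    phi = x
    for _ in range(p - 1):
        phi = np.outer(phi, x).ravel()
    return phi

k = np.arange(4.0)
print(power_features(k, p=2).shape)  # (16,): the state grows to d**p rows
```

Raising p enlarges the (still fixed-size) state in exchange for more expressive power, which, per the description above, is how the architecture buys capacity without an unbounded cache.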
This breakthrough not only ensures computational efficiency but also retains the transformative potential of attention, facilitating complex reasoning across extensive temporal or logical constructs.
Efficient Retraining: Leveraging Existing Models
The Cost-Effective Approach
Manifest AI’s approach to training Brumby-14B-Base stands out for its cost-effectiveness. The model was retrained in roughly 60 hours on 32 Nvidia H100 GPUs, at a total cost of about $4,000. This is a testament to the power of building on an existing transformer model; training from scratch would be significantly more expensive.
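A quick sanity check makes the figure plausible: 32 GPUs for 60 hours is 1,920 GPU-hours, and $4,000 spread over that span implies roughly $2 per H100-hour, in line with spot-market rental rates (the rate comparison is my assumption, not a figure from Manifest AI).

```python
gpus, hours, total_cost = 32, 60, 4_000   # figures reported for Brumby-14B-Base
gpu_hours = gpus * hours                  # 1,920 H100 GPU-hours
print(gpu_hours, round(total_cost / gpu_hours, 2))  # implies ~$2.08 per GPU-hour
```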
Jacob Buckman, founder of Manifest AI, emphasizes that leveraging pre-existing model architecture is crucial for accelerating the adoption of new paradigms like Power Retention. This strategy demonstrates that attention-free systems can achieve parity with transformers at a fraction of the cost.
Retraining: A Harmonious Transition
During retraining, Brumby adapted from Qwen3’s attention-oriented architecture to the retention-based system. This transformation involved recalibrating existing weights to align with the new computational framework. The process, reminiscent of teaching a pianist to play guitar, required approximately 3,000 training steps.
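Manifest AI has not published the conversion recipe in the material above, but a plausible mechanization is sketched here: swap each attention block for a retention block, carry the pretrained projection weights across, then fine-tune briefly. Every module and attribute name below is a hypothetical stand-in, not a real library’s API.

```python
import torch.nn as nn

def convert_to_retention(model, make_retention_layer):
    """Hypothetical sketch: replace attention blocks with retention blocks
    that reuse the original projection weights (attribute names assumed)."""
    for block in model.layers:
        attn = block.self_attn
        retn = make_retention_layer(attn.hidden_size)
        # Carrying over the pretrained Q/K/V/O projections means the new
        # layer starts close to the old one's function, so only a short
        # recalibration run (reportedly ~3,000 steps) is needed.
        retn.q_proj.load_state_dict(attn.q_proj.state_dict())
        retn.k_proj.load_state_dict(attn.k_proj.state_dict())
        retn.v_proj.load_state_dict(attn.v_proj.state_dict())
        retn.o_proj.load_state_dict(attn.o_proj.state_dict())
        block.self_attn = retn
    return model  # then fine-tune end-to-end for a few thousand steps
```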
The rapid convergence of Brumby’s training loss to that of Qwen3 highlights the efficiency of this transition. By the retraining phase’s end, Brumby had not only caught up but also surpassed Qwen3’s accuracy on several benchmarks, underscoring the potential of retention-based systems to inherit and enhance transformer capabilities.
Benchmarking Brumby: Performance Insights
Competing with Transformer Titans
On standard evaluation tasks, Brumby-14B-Base consistently matches or outperforms its transformer counterparts. Its strength is particularly evident in mathematical reasoning and long-context tasks, the latter being precisely where attention’s quadratic costs weigh heaviest.
For instance, on the GSM8K and HellaSwag benchmarks, Brumby demonstrated performance on par with or superior to Qwen3, GLM-4.5-Air, and Nemotron Nano. These results suggest that retention-based architectures may possess inherent advantages in handling tasks requiring extended temporal or logical dependencies.
Hardware Efficiency: A Game Changer
Power Retention’s hardware efficiency stands out as a significant advantage. The architecture’s reliance on local matrix operations for state updates translates to linear complexity in sequence length during inference. This efficiency is further enhanced by Manifest AI’s Vidrial CUDA framework, achieving hardware utilization rates up to 85%.
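To see why the fixed-size state pays off at inference time, consider a back-of-the-envelope comparison of per-token cost during generation. The dimensions are illustrative, not Brumby’s actual configuration.

```python
def per_token_cost(n_ctx, d_model=5_120, state_entries=262_144):
    """Rough multiply-adds per generated token, ignoring constants.

    Attention must revisit a KV cache that grows with context length;
    a retention layer only updates and reads its fixed-size state.
    """
    attention = 2 * n_ctx * d_model   # grows with context -> quadratic overall
    retention = 2 * state_entries     # constant per token -> linear overall
    return attention, retention

for n in (1_000, 100_000, 1_000_000):
    a, r = per_token_cost(n)
    print(f"{n:>9,}-token context: attention {a:.1e} vs retention {r:.1e}")
```

At a million tokens of context the per-token gap spans several orders of magnitude, which is exactly the regime where the reported 85% hardware utilization of the Vidrial kernels matters most.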
Compared to FlashAttention-2, and to Mamba, another emerging post-transformer architecture, Power Retention expends fewer floating-point operations and less memory, demonstrating remarkable potential for long-context processing without exotic hardware.
Economic and Strategic Implications
The Economics of AI Transformation
Brumby-14B-Base’s development cost has captured widespread attention, highlighting a potential paradigm shift in AI economics. With the ability to train large models at a fraction of traditional costs, the implications for smaller organizations and open research are profound.
As Buckman notes, the retraining cost scales favorably with model size, suggesting that larger models could be retrained economically, paving the way for democratized AI experimentation and innovation.
Deployment and Integration Potential
The integration of Power Retention into existing transformer models is designed for simplicity. A small modification to the architecture code lets companies adopt the retention mechanism and recover the model’s original performance with minimal additional training.
This streamlined process not only facilitates faster training and inference but also permits broader adoption of attention-free architectures, enhancing AI’s accessibility and scalability.
The Future of AI: Beyond Transformers
Manifest AI’s Vision
Manifest AI envisions a future where AI models transcend the limitations of current architectures, embodying continuous, efficient modeling of intelligent processes. The Brumby-14B release represents a step towards this vision, challenging the transformer monoculture and fostering architectural diversity.
By focusing on the intelligent processes behind human output, Manifest AI seeks to redefine AI’s potential, driving innovation and exploration in the field.
Industry Reception and Public Debate
Brumby’s launch has sparked discussions and debates within the AI community. Critics have raised concerns about the framing of Manifest AI’s claims, particularly regarding the $4,000 training cost. However, the broader implications of attention-free architectures and their potential to disrupt the transformer era remain a central focus.
Buckman acknowledges the discussions, reiterating that Brumby’s release signifies the beginning of a new era in AI, with Power Retention representing just one step in a longer journey.
Conclusion: Charting New AI Frontiers
Brumby-14B-Base, and its novel Power Retention architecture, marks a pivotal moment in the evolution of AI. By challenging the dominance of transformers and offering a cost-effective, efficient alternative, Manifest AI has opened the door to new possibilities for AI research and development.
The implications extend beyond technical advancements, promising to lower barriers for entry, foster innovation, and spur renewed theoretical exploration. As the AI community continues to explore and refine retention-based systems, the potential for groundbreaking advancements remains immense.
As we look to the future, questions linger: How will Power Retention influence AI’s trajectory? What new applications will emerge from this architectural shift? Only time will tell, but one thing is certain—AI’s landscape is poised for transformative change.
In closing, I encourage readers to explore how these developments could impact their own work and consider participating in the ongoing conversation. Could your next project benefit from attention-free architectures? What new frontiers might you explore with these tools at your disposal? The future of AI beckons, and your role in shaping it could be pivotal.
