What's Next After Transformers
I talk with Recursal AI founder Eugene Cheah about RWKV, a new architecture that aims to make AI more cost-effective, scalable, and accessible.
This essay is a part of my series, "AI in the Real World," where I talk with leading AI researchers about their groundbreaking work and how it's being applied in real businesses today. You can check out previous conversations in the series here.
I recently spoke with Eugene Cheah, a builder who's working to democratize AI by tackling some of the core constraints of transformers. The backbone of powerhouse models like GPT and Claude, transformers have fueled the generative AI boom. But they're not without drawbacks.
Enter RWKV (Receptance Weighted Key Value), an open-source architecture that Eugene and his team at Recursal AI are commercializing for enterprise use. Their goal is ambitious but clear: make AI more cost-effective, scalable, and universally accessible, regardless of a user's native language and access to compute.
Eugene's journey from nurturing RWKV's open-source ecosystem to founding Recursal AI reflects the potential he sees in this technology. In our conversation, he explains the technical challenges facing transformers and details how RWKV aims to overcome them. I left with a compelling picture of what a more democratic future for AI might look like – and what it would take to get there.
Here are my notes.
Is attention really all you need?
Introduced in the 2017 paper "Attention Is All You Need" by a group of Google Brain researchers, transformers are a deep learning architecture designed for natural language processing (NLP). One of their key innovations is self-attention: a mechanism that captures relationships between words regardless of their position in a sequence. This breakthrough has led to numerous advanced models, including BERT, GPT, and Claude.
Yet, despite their power, transformers face significant hurdles in cost and scalability. For each token (roughly equivalent to a short word or part of a longer word) processed, transformers essentially recompute attention over the entire context. This leads to quadratic scaling costs as the context length increases. In other words, doubling the input length quadruples the amount of compute required.
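To make that concrete, here is a minimal, back-of-the-envelope sketch (purely illustrative, not a benchmark of any real model) comparing how the number of token-pair comparisons in full self-attention grows against a linear, per-token scheme:

```python
# Illustrative only: counts of "units of work" as the context grows.
# Full self-attention compares every token with every other token (O(n^2));
# a recurrent / linear-attention scheme does a fixed amount of work per token (O(n)).

def self_attention_work(seq_len: int) -> int:
    """Token-pair comparisons in full self-attention."""
    return seq_len * seq_len

def linear_work(seq_len: int) -> int:
    """Per-token state updates in a linear/recurrent scheme."""
    return seq_len

for n in (1_000, 2_000, 4_000):
    print(f"{n=}: attention={self_attention_work(n):,} linear={linear_work(n):,}")

# Doubling the context length quadruples the attention term
# but only doubles the linear one.
```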
This inefficiency translates into enormous demands on compute. While exact figures are hard to come by, OpenAI reportedly uses over 300 Azure data centers just to serve 10% of the English-speaking market. Running transformers in production can cost hundreds of thousands or even millions of dollars per month, depending on their scale and usage.
Despite these steep scaling costs, transformers maintain their dominant position in the AI ecosystem. Stakeholders across all levels of the AI stack have invested substantial resources to build the infrastructure necessary to run these models in production. This investment has created a form of technological lock-in, resulting in strong resistance to change.
As my colleague Jaya explained: "The inertia around transformer architectures is real. Unless a new company bets big, we'll likely see incremental improvements rather than architectural revolutions. This is partly due to the massive investment in optimizing transformers at every level, from chip design to software frameworks. Breaking this inertia would require not just a superior architecture, but a willingness to rebuild the entire AI infrastructure stack."
Faced with such a herculean lift, most stakeholders opt for the familiar. Of course, this status quo is not set in stone. Eugene and the RWKV community certainly don't seem to think so.
RWKV: a potential alternative?
Instead of the all-to-all token comparisons of transformers, RWKV uses a linear attention mechanism that is applied sequentially. By maintaining a fixed-size state between tokens, RWKV achieves more efficient processing with linear compute costs. Eugene claims that this efficiency makes RWKV 10 to 100 times cheaper to run than transformers, especially for longer sequences.
RWKV's benefits extend beyond compute efficiency. Its recurrent architecture means it only needs to store and update a fixed-size hidden state as it processes each token. Compare this to transformers, which must juggle attention scores and intermediate representations for every possible token pair. The memory savings here could be substantial.
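Here's a heavily simplified sketch of the recurrent idea (my own toy illustration, not the actual RWKV equations, which add learned time-decay, receptance gating, and channel mixing). The point is that each token folds into one fixed-size state, so memory stays constant with sequence length and compute grows linearly:

```python
import numpy as np

D = 64  # hidden size (arbitrary for the example)

def rwkv_like_step(state: np.ndarray, k: np.ndarray, v: np.ndarray,
                   decay: float = 0.9) -> tuple[np.ndarray, np.ndarray]:
    """One recurrent update: fold the current token into a fixed-size state.

    The state is a decayed running sum of key-weighted values, so memory is
    O(D) regardless of how many tokens have been seen, and each step is O(D).
    """
    new_state = decay * state + k * v   # toy stand-in for RWKV's WKV update
    output = np.tanh(new_state)         # readout (real models use learned gating)
    return new_state, output

state = np.zeros(D)
for _ in range(1_000):                  # process 1,000 tokens with constant memory
    token = np.random.randn(D)
    state, out = rwkv_like_step(state, k=token, v=token)
```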
RWKV's performance compared to transformers remains a topic of active research and debate in the AI community. Its approach, while innovative, comes with its own set of challenges. The token relationships it builds, while more efficient to compute, aren't as rich as those in transformers. This can lead to difficulties with long-range dependencies and information retrieval. Moreover, RWKV is more sensitive to the order of input tokens, meaning small changes in how a prompt is structured can significantly alter the model's output.
Promising early signs
RWKV isn't just a concept on paper: it's being used in real applications today. Eugene cites a company processing over five million messages daily using RWKV for content moderation, achieving substantial cost savings compared to transformer-based alternatives.
Beyond cost-cutting, RWKV also promises to level the linguistic playing field. Its sequential processing method reduces the English-centric bias in many transformer-based models, which stems from their training data and tokenization methods, as well as the benchmarks by which they're judged. Currently, RWKV models can handle over 100 languages with high proficiency: a significant step toward more inclusive AI.
While direct comparisons are challenging due to differences in training data, the early results are impressive. Eugene reports that RWKV's 7B parameter model (trained on 1.7 trillion tokens) matches or outperforms Meta's LLaMA 2 (trained on 2 trillion tokens) across a variety of benchmarks, particularly in non-English evals. These results hint at superior scaling properties compared to transformers, though more research is needed to confirm this conclusively.
Beyond encouraging evals, RWKV also has the potential to break us out of the "architecture inertia" described by my partner Jaya. Eugene explains that RWKV's design allows for relatively simple integration into existing AI infrastructures. Training pipelines designed for transformers can be adapted for RWKV with minimal tweaks. Preprocessing steps like text normalization, tokenization, and batching also remain largely unchanged.
The primary adjustment needed when using RWKV comes at inference time. Unlike transformers, which treat each forward pass as stateless, RWKV carries a hidden state across time steps. To accommodate this, developers have to modify how hidden states are managed and passed through the model during inference. While this requires some changes to inference code, it's a relatively manageable adaptation: more of a shift in approach than a complete overhaul.
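As a rough sketch of what that shift looks like in practice, here is a hypothetical generation loop (the `model.init_state` / `model.forward` interface is a placeholder invented for illustration, not the real RWKV API). The essential change is that a fixed-size state is threaded through each step rather than re-feeding the full prompt:

```python
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Stateful, token-by-token generation with a recurrent model (sketch)."""
    state = model.init_state()                  # fixed-size recurrent state
    logits = None

    # Prime the state on the prompt, one token at a time.
    for tok in tokenizer.encode(prompt):
        logits, state = model.forward(tok, state)

    generated = []
    for _ in range(max_new_tokens):
        next_tok = int(logits.argmax())         # greedy decoding for brevity
        generated.append(next_tok)
        logits, state = model.forward(next_tok, state)  # carry the state forward

    return tokenizer.decode(generated)
```

Compare this with a typical transformer loop, which re-attends over a growing key/value cache at every step; here the per-step cost and memory stay flat no matter how long the conversation runs.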
Implications for the AI field
By improving efficiency and reducing costs, RWKV has the potential to broaden access to AI. Here are a few of the implications that Eugene highlighted:
1. Unleashing innovation through lower costs
Current transformer-based models impose prohibitive costs, particularly in developing economies. This financial hurdle stifles experimentation, limits product development, and constrains the growth of AI-powered businesses. By providing a more cost-effective alternative, RWKV could level the playing field, allowing a more diverse range of ideas and innovations to flourish.
This democratization extends to academia as well. The exponential growth in compute costs driven by transformers has hampered research efforts, particularly in regions with limited resources. By lowering these financial barriers, RWKV could catalyze more diverse contributions to AI research from top universities in India, China, and Latin America, for instance.
2. Breaking language barriers
Fewer than 20% of the world's population speaks English, yet, as discussed above, most transformer-based models are biased toward it. This limits users and applications, particularly in regions with multiple dialects and linguistic nuances.
RWKV's multilingual strength could be used to build products that solve these local problems. The Eagle 7B model, a specific implementation of RWKV, has shown impressive results on multilingual benchmarks, making it a potential contender for local NLP tasks. Eugene shared an example of an RWKV-powered content moderation tool capable of detecting bullying across multiple languages, illustrating the potential for more inclusive and culturally attuned AI applications.
3. Enhancing AI agent capabilities
As we venture further into the realm of AI agents and multi-agent systems, the efficiency of token generation becomes increasingly crucial. As agents converse, collaborate, and call external tools, these complex systems often generate thousands of tokens before returning an output to the user. RWKV's more efficient architecture could significantly enhance the capabilities of these agentic systems.
This efficiency gain isn't just about speed; it's about expanding the scope of what's possible. Faster token generation could allow for more complex reasoning, longer-term planning, and more nuanced interactions between AI agents.
4. Decentralizing AI
The concentration of AI power in the hands of a few tech giants has raised valid concerns about access and control. Many enterprises aspire to run AI models within their own environments, yet this goal often remains out of reach. RWKV's efficiency could make this aspiration a reality, allowing for a more decentralized AI ecosystem.
What's next for RWKV?
While the potential of RWKV is clear, its journey from promising technology to industry standard is far from guaranteed.
Currently, Eugene is focused on raising capital and securing the substantial compute power needed for larger training runs. He aims to keep pushing the boundaries of RWKV's model sizes and performance, and potentially expand into multimodal capabilities—combining text, audio, and vision into unified models. In parallel, the RWKV community is working on improving the quality and diversity of training datasets, with a particular emphasis on non-English languages.
Eugene is also excited about exploring other alternative architectures, such as diffusion models for text generation. His openness reflects a broader trend in the AI community: a recognition that the path forward requires novel ideas for model design.
While the long-term viability of these new architectures remains to be seen, democratizing AI is certainly a worthy goal. Lower costs, better multilingual capabilities, and easier deployment could enable AI to be used in a much wider range of applications and contexts, accelerating the pace of innovation in the field.
For founders interested in exploring these possibilities, Eugene recommends the RWKV Discord and wiki, as well as the EleutherAI Discord.
Claude Projects Vs ChatGPT AI Performance Compared
In recent months, two AI models have been leading the way in providing users with exceptional results. If you had not already guessed, these large language models are Anthropic's Claude and OpenAI's ChatGPT. Both are state-of-the-art AI models designed to provide intelligent assistance across various domains.
Claude Projects vs ChatGPT
These models are equipped with innovative natural language processing techniques to understand and generate human-like text, allowing them to engage in meaningful conversations and tackle complex problems. But which is more suited to your everyday needs? While they share a common foundation, the unique features and enhancements incorporated into each model give rise to their distinct capabilities and suitability for different scenarios. In the video below, AI Advantage takes a look at both, comparing the pros and cons of each to help you understand in more detail the differences between Claude Projects and ChatGPT and which one would be most suited to your needs.
Advanced Features
At the core of both ChatGPT and Claude lies the power of Generative Pre-trained Transformers (GPTs). These sophisticated language models have been trained on vast amounts of diverse text data, allowing them to generate coherent and contextually relevant responses based on the input they receive. This enables them to engage in natural conversations, answer questions, and provide insights across a wide range of topics.
In addition to their language generation capabilities, both models offer the convenience of Projects. This feature allows users to organize and manage their interactions with the AI in a structured manner. By creating separate projects, you can compartmentalize different tasks, conversations, or workflows, ensuring a more focused and efficient experience.
Advantages of ChatGPT
ChatGPT stands out for its versatile set of tools and features that cater to various use cases:
Despite its impressive capabilities, ChatGPT does have some limitations to consider:
Claude Projects brings its own set of advantages to the table:
While Claude Projects offers several benefits, it also has some drawbacks to keep in mind:
Understanding the strengths and limitations of each model is crucial for determining when to use them effectively:
When working with Claude Projects, it's important to be aware of the search function limitations. While Claude offers powerful language generation capabilities, its search functionality may not be as comprehensive as other tools specifically designed for information retrieval.
To get the most out of your AI interactions, consider leveraging custom instructions. Both ChatGPT and Claude allow you to provide specific guidelines and preferences to tailor the AI's responses to your needs. By investing time in crafting clear and detailed instructions, you can ensure more accurate and relevant outputs.
Another aspect to consider is the difference in project archiving between the two models. ChatGPT and Claude handle the storage and retrieval of past projects differently, which can impact how you organize and access your work over time. Familiarizing yourself with each model's archiving system will help you make informed decisions about long-term project management.
In conclusion, both ChatGPT and Claude offer powerful tools for a wide range of applications. By understanding their unique features, strengths, and limitations, you can make informed decisions when selecting the most appropriate model for your specific needs. Whether you require versatile tools, superior coding capabilities, or more natural language generation, these AI models provide the flexibility and intelligence to enhance your productivity and achieve your goals. By leveraging their capabilities effectively, you can unlock new possibilities and streamline your workflows in the ever-evolving landscape of artificial intelligence.