Complete, end-to-end AI developer toolkit with evaluations, tracing and monitoring, scoring, human feedback, and guardrails supports entire generative AI workflow
Weights & Biases, the AI developer platform, today announced the general availability of W&B Weave at the AWS re:Invent annual conference. Weave helps developers evaluate, monitor, and continuously iterate to deliver high-quality, performant generative AI applications. Weave is a lightweight, developer-friendly toolkit that supports the entire generative AI workflow, from experimentation to production to systematic iteration.
Since large language models (LLMs) emerged with their transformative potential, enterprises have been exploring ways to apply them to improve internal business operations and enhance how they serve customers. While creating a generative AI demo can be easy, moving to full-scale production with high-quality, performant applications is hard because LLMs are non-deterministic by nature.
Because of this, a new experimental developer workflow is required—one that Weave is purpose-built to support. The core components of this workflow are:
- Evaluations: Without an evaluation framework, developers are just guessing whether their generative AI application is improving in accuracy, latency, cost, and user experience. Weave offers rigorous, visual evaluations to move beyond vibe checks and spreadsheets. As developers try different techniques such as prompt engineering, RAG (Retrieval-Augmented Generation), agents, fine-tuning, and changing LLM providers, Weave evaluations help them understand which techniques improve their application (a minimal evaluation sketch follows this list). Weave lets developers group evaluations into leaderboards featuring the best performers and share that learning across their organization. To evaluate models and prompts without jumping into code, Weave offers a playground for quickly iterating on prompts and seeing how the LLM response changes.
- Tracing and monitoring: With a single line of code, developers can use the Weave Python and JavaScript/TypeScript SDKs to automatically log all the inputs, outputs, code, and metadata in their applications at a granular level (see the tracing sketch after this list). As LLMs become multi-modal, Weave also supports images and audio in addition to text and code. Weave acts as an AI system of record, organizing all the data into a trace tree that developers can easily navigate and analyze to debug issues. Customers need to monitor AI application quality in production, but running scorers on production machines can consume too much processing power and disrupt live application performance. Weave online evaluations run asynchronously on live incoming production traces without impacting the production environment, allowing developers to separate evaluation from core application processing. Weave online evaluations will be available in Q1.
- Scoring: Weave offers pre-built LLM-based scorers for common metrics like hallucination rate and context relevance, so developers can jumpstart their evaluations without starting from scratch. For more advanced evaluations, developers can plug in third-party scorers or build their own (a custom-scorer sketch follows this list). Weave supports LLMs scoring other LLMs, a pattern known as LLM-as-a-Judge. Developers can fine-tune LLMs for the specific attributes they want to evaluate in their application and then use those scores in Weave.
- Human feedback: LLM-based scorers need to be augmented with human feedback for robust evaluations, especially for qualitative outputs such as style, tone, and brand voice. Weave lets developers collect feedback directly from users in production or from internal domain experts and use that feedback to build high-quality evaluation datasets (see the feedback sketch after this list). Users can give thumbs-up or thumbs-down ratings, add emojis to express their sentiment, and comment with free-form text. With the Weave annotation template builder, developers can tailor the labeling interface so labelers know which elements to focus on, ensuring consistent annotations while improving the efficiency and quality of datasets.
- Guardrails: Due to the non-deterministic nature of LLMs, AI can sometimes behave inappropriately or leak private data, and malicious actors may attempt to jailbreak the system or inject malicious prompts. Enterprises need to protect their brand and safeguard the user experience. Weave offers out-of-the-box filters to detect harmful outputs and prompt attacks; once an issue is detected, pre- and post-hooks help trigger safeguards. Weave guardrails will be available in preview in Q1 next year.
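As a concrete illustration of the evaluation workflow described in the first bullet, here is a minimal Python sketch using the Weave SDK. The project name, dataset rows, stand-in model, and exact-match scorer are illustrative assumptions, not a shipped example; the `output` parameter name for scorers reflects recent SDK versions.

```python
# Minimal Weave evaluation sketch; dataset, model, and scorer are
# illustrative assumptions.
import asyncio
import weave

weave.init("weave-eval-demo")  # hypothetical project name

dataset = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Toy scorer: does the model output contain the expected answer?
    return {"correct": expected.lower() in output.lower()}

@weave.op()
def my_model(question: str) -> str:
    # Stand-in for a real LLM call
    return "4" if "2 + 2" in question else "Paris"

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(my_model))  # results appear in the Weave UI
```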
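The tracing bullet refers to one-line instrumentation. A minimal sketch, assuming the OpenAI client and a hypothetical project name, might look like this:

```python
# Minimal tracing sketch; the project name and OpenAI usage are assumptions.
import weave
from openai import OpenAI

weave.init("my-weave-project")  # the single line that enables tracing

@weave.op()  # logs this function's inputs, outputs, and code to Weave
def summarize(text: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

summarize("W&B Weave is now generally available.")
```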
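For the scoring bullet, a custom LLM-as-a-Judge scorer can be an ordinary Weave op that calls a judge model. The judge prompt, model name, and friendliness criterion below are assumptions for illustration:

```python
# Sketch of a custom LLM-as-a-Judge scorer; prompt and model are assumptions.
import weave
from openai import OpenAI

weave.init("weave-scorer-demo")  # hypothetical project name

@weave.op()
def tone_judge(output: str) -> dict:
    # Ask a judge LLM whether the application's output is friendly in tone.
    client = OpenAI()
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer YES or NO: is this text friendly in tone?\n\n{output}",
        }],
    )
    return {"friendly": "YES" in verdict.choices[0].message.content.upper()}

# A scorer like this can be passed to weave.Evaluation(scorers=[tone_judge]).
```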
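And for the human-feedback bullet, ratings and comments can be attached to a traced call programmatically. This sketch assumes a toy op and illustrative feedback values:

```python
# Sketch of attaching human feedback to a traced call; the op and
# feedback values are illustrative assumptions.
import weave

weave.init("weave-feedback-demo")  # hypothetical project name

@weave.op()
def answer(question: str) -> str:
    return "Paris"  # stand-in for a real LLM call

# .call() returns both the result and the Call record for the trace
result, call = answer.call("What is the capital of France?")

call.feedback.add_reaction("👍")                # thumbs-up rating
call.feedback.add_note("On-brand and concise")  # free-form comment
```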
“We’ve been working with customers for a year building Weave based on their feedback on the challenges of getting LLM-powered applications into production,” said Lukas Biewald, CEO and co-founder at Weights & Biases. “We focused on making it easy for developers to get started with one line of code that traces all their LLM calls, use pre-built scorers or customize their own, and then iterate quickly, guided by rich visual evaluations, to improve the accuracy, latency, cost, and user experience of their application. We’re excited to now make Weave generally available to all developers, whether they are developing internal text-based applications for their employees or high-volume production applications incorporating rich media for their customers.”
“I love Weave for a bunch of reasons, and it all goes back to trust,” said Mike Maloney, CDO and co-founder at Neuralift AI. “From day one, the reporting on all our input JSON and input tokens in Weave was fantastic, and now they have added features such as rich evaluation visualizations. Weave has helped us set a baseline for how the different LLM providers perform for our application and guides us on whether to switch the underlying model. Weave is featured heavily in how we aim to continuously build a high-quality applied AI product.”
Weave is framework- and LLM-agnostic, so developers do not need to write extra code to work with popular AI frameworks and LLMs, including Amazon Bedrock. To learn more about how to use Weave with Amazon Bedrock to evaluate LLMs for a text summarization use case, visit this tutorial. Weave is now generally available as a multi-tenant SaaS Cloud deployment or as a single-tenant AWS Dedicated Cloud deployment for enterprises with sensitive use cases requiring data residency, compute and storage isolation, private connectivity, and data-at-rest encryption.
To learn more about how Weave helps enterprises deliver generative AI applications with confidence, visit Weights & Biases at Booth #1520 on the AWS re:Invent Expo floor from December 2-5 or at https://wandb.ai/site/weave. Developers can get started with a single line of code at http://wandb.me/weave and start tracing their AI applications immediately.
About Weights & Biases
Weights & Biases is the AI developer platform powering the generative AI industry. Over 1,300 organizations worldwide — including AstraZeneca, Canva, NVIDIA, Snowflake, Square, Toyota, and Wayve — and more than 30 foundation model builders, such as OpenAI, Meta, and Cohere, rely on Weights & Biases as their system of record for training and fine-tuning AI models and developing AI applications with confidence. Headquartered in San Francisco with a global presence, Weights & Biases is backed by leading investors, including Coatue, Felicis Ventures, BOND, Insight Partners, Bloomberg Beta, and NVIDIA.
View source version on businesswire.com: https://www.businesswire.com/news/home/20241202944496/en/