Adaptive Stress Testing for Language Model Toxicity
Jan 20 2025
Length: 15 mins
Podcast

Summary
This episode explores ASTPrompter, a novel approach to automated red-teaming for large language models (LLMs). Whereas traditional methods focus only on triggering toxic outputs, ASTPrompter is designed to discover prompts that are both likely and toxicity-eliciting: prompts that could plausibly arise during ordinary language model use. The approach combines Adaptive Stress Testing (AST), a technique for identifying likely failure points, with reinforcement learning to train an "adversary" model. The adversary generates prompts that aim to elicit toxic responses from a "defender" model. Importantly, these prompts have low perplexity, meaning they are realistic and likely to occur, unlike many prompts generated by other methods.
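
To make the trade-off between toxicity and likelihood concrete, here is a minimal Python sketch of a reward that favors prompts which both elicit toxic defender output and have low perplexity. It is an illustration only, not ASTPrompter's actual reward: the function names, the `likelihood_weight` parameter, and the linear combination are all assumptions made for this example. Only the perplexity formula (the exponential of the negative mean token log-probability) is standard.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence: exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def adversary_reward(
    defender_toxicity: float,
    prompt_token_logprobs: Sequence[float],
    likelihood_weight: float = 0.1,  # hypothetical weight, not taken from the source
) -> float:
    """Hypothetical reward: defender toxicity minus a penalty on prompt log-perplexity.

    defender_toxicity is assumed to be a score in [0, 1] from some toxicity classifier.
    """
    penalty = likelihood_weight * math.log(perplexity(prompt_token_logprobs))
    return defender_toxicity - penalty


# Two made-up prompts that elicit equally toxic responses from the defender.
natural_prompt_logprobs = [-1.0, -0.5, -1.5]   # fluent text: high token probabilities
garbled_prompt_logprobs = [-6.0, -7.0, -5.0]   # unnatural text: low token probabilities

natural_reward = adversary_reward(0.8, natural_prompt_logprobs)
garbled_reward = adversary_reward(0.8, garbled_prompt_logprobs)
print(natural_reward > garbled_reward)
```

Under this assumed reward, the fluent prompt scores higher than the garbled one even though both elicit the same toxicity, which is the property the summary attributes to the adversary's training objective.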