
default to stagehand LLM clients for evals #669


Open
wants to merge 5 commits into main

Conversation


@seanmcguire12 seanmcguire12 commented Apr 15, 2025

why

  • we shouldn't use aiSDK (or any other external LLM clients) for evals by default, especially not in CI. We should be testing against our own custom clients
  • we still want to be able to wrap external clients so that we can easily get telemetry in braintrust

what changed

  • added a new CLI arg useExternalClients that defaults to false. If you want to run evals with external clients (to get telemetry), you can set -useExternalClients=true
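
For illustration, a minimal sketch of how the flag handling could look in evals/args.ts, assuming the entry point parses process.argv directly (the parsedArgs field names mirror ones referenced later in this PR; the surrounding structure is an assumption):

// default to internal Stagehand clients; external clients are opt-in
const parsedArgs: { useExternalClients: boolean; leftover: string[] } = {
  useExternalClients: false,
  leftover: [],
};

for (const arg of process.argv.slice(2)) {
  if (arg.startsWith("-useExternalClients=")) {
    const val = arg.split("=")[1]?.toLowerCase();
    parsedArgs.useExternalClients = val === "true";
  } else {
    parsedArgs.leftover.push(arg);
  }
}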

test plan

  • this is it


changeset-bot bot commented Apr 15, 2025

⚠️ No Changeset found

Latest commit: 7ac680c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


@greptile-apps greptile-apps bot left a comment


PR Summary

This PR introduces a significant change to Stagehand's evaluation system by making internal LLM clients the default while maintaining external client support through a new CLI flag.

Key changes:

  • Added new --useExternalClients CLI flag (defaults to false) in evals/args.ts
  • Implemented createLLMClient utility function in evals/utils.ts to handle both internal and external client creation
  • Changed default evaluation model to gpt-4o-mini in taskConfig.ts
  • Added comprehensive model provider support (OpenAI, Google, Anthropic, Groq, Cerebras) with proper API key handling
  • Introduced CreateLLMClientOptions interface in types/evals.ts for type safety and configuration
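
Based on the fields visible in this PR, CreateLLMClientOptions plausibly looks something like the sketch below (the exact declaration in types/evals.ts may differ; the logger parameter type is left generic here):

interface CreateLLMClientOptions {
  modelName: string;
  useExternalClients?: boolean;
  logger: (msg: unknown) => void; // actual log-line type omitted for brevity
  openAiKey?: string;
  googleKey?: string;
  anthropicKey?: string;
  groqKey?: string;
  cerebrasKey?: string;
  togetherKey?: string;
}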

Potential issues:

  • Code duplication in model name parsing between external and internal client paths
  • Inconsistent handling of Together.ai models compared to other providers
  • Missing validation for required API keys when specific providers are selected

5 file(s) reviewed, 6 comment(s)

Comment on lines +271 to +273
modelName: input.modelName,
useExternalClients: parsedArgs.useExternalClients === true,
logger: (msg) => logger.log(msg),

logic: Strict comparison with boolean could cause issues if parsedArgs.useExternalClients is undefined. Consider using !!parsedArgs.useExternalClients instead.

Comment on lines +199 to +209
if (modelName.includes("/")) {
return new CustomOpenAIClient({
modelName,
client: wrapOpenAI(
new OpenAI({
apiKey: togetherKey,
baseURL: "https://api.together.xyz/v1",
}),
),
});
}

style: This block is duplicated from the external clients section. Consider extracting Together.ai model handling into a separate function to avoid duplication.
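
One possible shape for that extraction, sketched under the assumption that the imports match the snippet above (the helper name createTogetherClient and the import path for CustomOpenAIClient are illustrative, not from the PR):

import OpenAI from "openai";
import { wrapOpenAI } from "braintrust";
import { CustomOpenAIClient } from "./external_clients/customOpenAI"; // path assumed

// single place that knows Together.ai is reached through an OpenAI-compatible API
function createTogetherClient(modelName: string, togetherKey?: string) {
  return new CustomOpenAIClient({
    modelName,
    client: wrapOpenAI(
      new OpenAI({
        apiKey: togetherKey,
        baseURL: "https://api.together.xyz/v1",
      }),
    ),
  });
}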

model: wrapAISDKModel(anthropic(modelName)),
});
} else if (isGroqModel) {
const groqModel = modelName.substring(modelName.indexOf("/") + 1);

logic: Potential error if '/' is not found in modelName. Add null check before substring operation.

Suggested change
- const groqModel = modelName.substring(modelName.indexOf("/") + 1);
+ const slashIndex = modelName.indexOf("/");
+ const groqModel = slashIndex === -1 ? modelName : modelName.substring(slashIndex + 1);

types/evals.ts Outdated
Comment on lines 85 to 90
openAiKey?: string;
googleKey?: string;
anthropicKey?: string;
groqKey?: string;
cerebrasKey?: string;
togetherKey?: string;

style: Consider grouping API keys into a separate interface/type to improve maintainability as new providers are added
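
A sketch of that grouping (the ProviderApiKeys name is illustrative, not from the PR):

// keys live in one interface, so adding a provider means touching one type
interface ProviderApiKeys {
  openAiKey?: string;
  googleKey?: string;
  anthropicKey?: string;
  groqKey?: string;
  cerebrasKey?: string;
  togetherKey?: string;
}

// the options type can then extend (or embed) the group:
interface CreateLLMClientOptions extends ProviderApiKeys {
  modelName: string;
  useExternalClients?: boolean;
  logger: (msg: unknown) => void;
}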

@seanmcguire12 seanmcguire12 marked this pull request as draft April 16, 2025 01:03
@seanmcguire12 seanmcguire12 marked this pull request as ready for review April 16, 2025 19:48
@seanmcguire12 seanmcguire12 requested a review from kamath April 16, 2025 19:49

@greptile-apps greptile-apps bot left a comment


PR Summary

(updates since last review)

This PR continues to refine the LLM client integration in Stagehand's evaluation system with several important updates:

  • Expanded the default evaluation models to include claude-3-5-sonnet-latest, gpt-4o-mini, and gpt-4o in taskConfig.ts
  • Added comprehensive model support across providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras) with ALL_EVAL_MODELS list
  • Implemented provider-specific model filtering in getModelList() function (see the sketch after this summary)
  • Removed direct Groq/Cerebras client imports in favor of accessing through Together API
  • Added error handling for Groq/Cerebras models when external clients are disabled

These changes build upon the previous implementation by providing more robust model support while maintaining the new external client opt-in behavior.
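
The getModelList() filtering mentioned above could look roughly like the sketch below; the real implementation in taskConfig.ts may use different prefix rules, so treat the provider matching here as an assumption:

const ALL_EVAL_MODELS: string[] = [
  "gpt-4o",
  "gpt-4o-mini",
  "claude-3-5-sonnet-latest",
  // ...plus the Google, Together, Groq, and Cerebras models
];

function getModelList(provider?: string): string[] {
  if (!provider) {
    return ALL_EVAL_MODELS;
  }
  // keep only the models that belong to the requested provider,
  // e.g. "groq" -> names prefixed with "groq/"
  return ALL_EVAL_MODELS.filter(
    (modelName) =>
      modelName.startsWith(`${provider}/`) ||
      (provider === "openai" && modelName.startsWith("gpt")) ||
      (provider === "anthropic" && modelName.startsWith("claude")),
  );
}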

5 file(s) reviewed, 2 comment(s)

Comment on lines +35 to 38
} else if (arg.startsWith("-useExternalClients=")) {
const val = arg.split("=")[1]?.toLowerCase();
parsedArgs.useExternalClients = val === "true";
} else {

logic: No default value set for useExternalClients. Should explicitly initialize to false in parsedArgs object to match PR description

Suggested change
  } else if (arg.startsWith("-useExternalClients=")) {
    const val = arg.split("=")[1]?.toLowerCase();
    parsedArgs.useExternalClients = val === "true";
  } else {
+ useExternalClients: false,
+ leftover: [],

anthropicKey,
togetherKey,
}: CreateLLMClientOptions): LLMClient {
const isOpenAIModel = modelName.startsWith("gpt");

logic: The OpenAI model detection logic was changed to remove || modelName.includes("/"), but this check is still used later in the code. This could cause inconsistent behavior.
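
One way to keep the detection consistent is to centralize it, as in this sketch (the prefix rules are assumptions based on the providers this PR names, not the PR's actual code):

type EvalProvider =
  | "openai"
  | "anthropic"
  | "google"
  | "groq"
  | "cerebras"
  | "together";

// classify the model name once, then branch on the result everywhere else
function detectProvider(modelName: string): EvalProvider {
  if (modelName.startsWith("gpt")) return "openai";
  if (modelName.startsWith("claude")) return "anthropic";
  if (modelName.startsWith("gemini")) return "google"; // Google naming assumed
  if (modelName.startsWith("groq/")) return "groq";
  if (modelName.startsWith("cerebras/")) return "cerebras";
  if (modelName.includes("/")) return "together"; // remaining slash-prefixed models go to Together
  throw new Error(`Unrecognized eval model: ${modelName}`);
}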
