
default to stagehand LLM clients for evals #669


Open
wants to merge 5 commits into main

Conversation


@seanmcguire12 seanmcguire12 commented Apr 15, 2025

why

  • we shouldn't use aiSDK (or any other external LLM clients) for evals by default, especially not in CI. We should be testing against our own custom clients
  • we still want to be able to wrap external clients so that we can easily get telemetry in braintrust

what changed

  • added a new CLI arg useExternalClients that defaults to false. If you want to run evals with external clients (to get telemetry), you can set -useExternalClients=true
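
For illustration, a minimal sketch of how the flag handling could look in evals/args.ts, assuming the entry point parses process.argv directly (the parsedArgs field names mirror ones referenced later in this PR; the surrounding structure is an assumption):

// default to internal Stagehand clients; external clients are opt-in
const parsedArgs: { useExternalClients: boolean; leftover: string[] } = {
  useExternalClients: false,
  leftover: [],
};

for (const arg of process.argv.slice(2)) {
  if (arg.startsWith("-useExternalClients=")) {
    const val = arg.split("=")[1]?.toLowerCase();
    parsedArgs.useExternalClients = val === "true";
  } else {
    parsedArgs.leftover.push(arg);
  }
}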

test plan

  • this is it


changeset-bot bot commented Apr 15, 2025

⚠️ No Changeset found

Latest commit: 7ac680c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


@greptile-apps greptile-apps bot left a comment


PR Summary

This PR introduces a significant change to Stagehand's evaluation system by making internal LLM clients the default while maintaining external client support through a new CLI flag.

Key changes:

  • Added new --useExternalClients CLI flag (defaults to false) in evals/args.ts
  • Implemented createLLMClient utility function in evals/utils.ts to handle both internal and external client creation
  • Changed default evaluation model to gpt-4o-mini in taskConfig.ts
  • Added comprehensive model provider support (OpenAI, Google, Anthropic, Groq, Cerebras) with proper API key handling
  • Introduced CreateLLMClientOptions interface in types/evals.ts for type safety and configuration
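
Based on the fields visible in this PR, CreateLLMClientOptions plausibly looks something like the sketch below (the exact declaration in types/evals.ts may differ; the logger parameter type is left generic here):

interface CreateLLMClientOptions {
  modelName: string;
  useExternalClients?: boolean;
  logger: (msg: unknown) => void; // actual log-line type omitted for brevity
  openAiKey?: string;
  googleKey?: string;
  anthropicKey?: string;
  groqKey?: string;
  cerebrasKey?: string;
  togetherKey?: string;
}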

Potential issues:

  • Code duplication in model name parsing between external and internal client paths
  • Inconsistent handling of Together.ai models compared to other providers
  • Missing validation for required API keys when specific providers are selected

5 file(s) reviewed, 6 comment(s)

Comment on lines +271 to +273
modelName: input.modelName,
useExternalClients: parsedArgs.useExternalClients === true,
logger: (msg) => logger.log(msg),

logic: Strict comparison with boolean could cause issues if parsedArgs.useExternalClients is undefined. Consider using !!parsedArgs.useExternalClients instead.

Comment on lines +199 to +209
if (modelName.includes("/")) {
return new CustomOpenAIClient({
modelName,
client: wrapOpenAI(
new OpenAI({
apiKey: togetherKey,
baseURL: "https://api.together.xyz/v1",
}),
),
});
}

style: This block is duplicated from the external clients section. Consider extracting Together.ai model handling into a separate function to avoid duplication.
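
One possible shape for that extraction, sketched under the assumption that the imports match the snippet above (the helper name createTogetherClient and the import path for CustomOpenAIClient are illustrative, not from the PR):

import OpenAI from "openai";
import { wrapOpenAI } from "braintrust";
import { CustomOpenAIClient } from "./external_clients/customOpenAI"; // path assumed

// single place that knows Together.ai is reached through an OpenAI-compatible API
function createTogetherClient(modelName: string, togetherKey?: string) {
  return new CustomOpenAIClient({
    modelName,
    client: wrapOpenAI(
      new OpenAI({
        apiKey: togetherKey,
        baseURL: "https://api.together.xyz/v1",
      }),
    ),
  });
}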

model: wrapAISDKModel(anthropic(modelName)),
});
} else if (isGroqModel) {
const groqModel = modelName.substring(modelName.indexOf("/") + 1);

logic: Potential error if '/' is not found in modelName. Add null check before substring operation.

Suggested change
- const groqModel = modelName.substring(modelName.indexOf("/") + 1);
+ const slashIndex = modelName.indexOf("/");
+ const groqModel = slashIndex === -1 ? modelName : modelName.substring(slashIndex + 1);

types/evals.ts Outdated
Comment on lines 85 to 90
openAiKey?: string;
googleKey?: string;
anthropicKey?: string;
groqKey?: string;
cerebrasKey?: string;
togetherKey?: string;

style: Consider grouping API keys into a separate interface/type to improve maintainability as new providers are added
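
A sketch of that grouping (the ProviderApiKeys name is illustrative, not from the PR):

// keys live in one interface, so adding a provider means touching one type
interface ProviderApiKeys {
  openAiKey?: string;
  googleKey?: string;
  anthropicKey?: string;
  groqKey?: string;
  cerebrasKey?: string;
  togetherKey?: string;
}

// the options type can then extend (or embed) the group:
interface CreateLLMClientOptions extends ProviderApiKeys {
  modelName: string;
  useExternalClients?: boolean;
  logger: (msg: unknown) => void;
}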

@seanmcguire12 seanmcguire12 marked this pull request as draft April 16, 2025 01:03
@seanmcguire12 seanmcguire12 marked this pull request as ready for review April 16, 2025 19:48
@seanmcguire12 seanmcguire12 requested a review from kamath April 16, 2025 19:49

@greptile-apps greptile-apps bot left a comment


PR Summary

(updates since last review)

This PR continues to refine the LLM client integration in Stagehand's evaluation system with several important updates:

  • Expanded the default evaluation models to include claude-3-5-sonnet-latest, gpt-4o-mini, and gpt-4o in taskConfig.ts
  • Added comprehensive model support across providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras) with ALL_EVAL_MODELS list
  • Implemented provider-specific model filtering in getModelList() function (see the sketch after this summary)
  • Removed direct Groq/Cerebras client imports in favor of accessing through Together API
  • Added error handling for Groq/Cerebras models when external clients are disabled

These changes build upon the previous implementation by providing more robust model support while maintaining the new external client opt-in behavior.
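
The getModelList() filtering mentioned above could look roughly like the sketch below; the real implementation in taskConfig.ts may use different prefix rules, so treat the provider matching here as an assumption:

const ALL_EVAL_MODELS: string[] = [
  "gpt-4o",
  "gpt-4o-mini",
  "claude-3-5-sonnet-latest",
  // ...plus the Google, Together, Groq, and Cerebras models
];

function getModelList(provider?: string): string[] {
  if (!provider) {
    return ALL_EVAL_MODELS;
  }
  // keep only the models that belong to the requested provider,
  // e.g. "groq" -> names prefixed with "groq/"
  return ALL_EVAL_MODELS.filter(
    (modelName) =>
      modelName.startsWith(`${provider}/`) ||
      (provider === "openai" && modelName.startsWith("gpt")) ||
      (provider === "anthropic" && modelName.startsWith("claude")),
  );
}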

5 file(s) reviewed, 2 comment(s)

Comment on lines +35 to 38
} else if (arg.startsWith("-useExternalClients=")) {
const val = arg.split("=")[1]?.toLowerCase();
parsedArgs.useExternalClients = val === "true";
} else {

logic: No default value set for useExternalClients. Should explicitly initialize to false in parsedArgs object to match PR description

Suggested change
  } else if (arg.startsWith("-useExternalClients=")) {
    const val = arg.split("=")[1]?.toLowerCase();
    parsedArgs.useExternalClients = val === "true";
  } else {
+ useExternalClients: false,
+ leftover: [],

anthropicKey,
togetherKey,
}: CreateLLMClientOptions): LLMClient {
const isOpenAIModel = modelName.startsWith("gpt");

logic: The OpenAI model detection logic was changed to remove || modelName.includes("/"), but this check is still used later in the code. This could cause inconsistent behavior.
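
One way to keep the detection consistent is to centralize it, as in this sketch (the prefix rules are assumptions based on the providers this PR names, not the PR's actual code):

type EvalProvider =
  | "openai"
  | "anthropic"
  | "google"
  | "groq"
  | "cerebras"
  | "together";

// classify the model name once, then branch on the result everywhere else
function detectProvider(modelName: string): EvalProvider {
  if (modelName.startsWith("gpt")) return "openai";
  if (modelName.startsWith("claude")) return "anthropic";
  if (modelName.startsWith("gemini")) return "google"; // Google naming assumed
  if (modelName.startsWith("groq/")) return "groq";
  if (modelName.startsWith("cerebras/")) return "cerebras";
  if (modelName.includes("/")) return "together"; // remaining slash-prefixed models go to Together
  throw new Error(`Unrecognized eval model: ${modelName}`);
}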
