-
Notifications
You must be signed in to change notification settings - Fork 598
default to stagehand LLM clients for evals #669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
default to stagehand LLM clients for evals #669
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR introduces a significant change to Stagehand's evaluation system by making internal LLM clients the default while maintaining external client support through a new CLI flag.
Key changes:
- Added new
--useExternalClients
CLI flag (defaults to false) inevals/args.ts
- Implemented
createLLMClient
utility function inevals/utils.ts
to handle both internal and external client creation - Changed default evaluation model to
gpt-4o-mini
intaskConfig.ts
- Added comprehensive model provider support (OpenAI, Google, Anthropic, Groq, Cerebras) with proper API key handling
- Introduced
CreateLLMClientOptions
interface intypes/evals.ts
for type safety and configuration
Potential issues:
- Code duplication in model name parsing between external and internal client paths
- Inconsistent handling of Together.ai models compared to other providers
- Missing validation for required API keys when specific providers are selected
5 file(s) reviewed, 6 comment(s)
Edit PR Review Bot Settings | Greptile
modelName: input.modelName, | ||
useExternalClients: parsedArgs.useExternalClients === true, | ||
logger: (msg) => logger.log(msg), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Strict comparison with boolean could cause issues if parsedArgs.useExternalClients is undefined. Consider using !!parsedArgs.useExternalClients
instead.
if (modelName.includes("/")) { | ||
return new CustomOpenAIClient({ | ||
modelName, | ||
client: wrapOpenAI( | ||
new OpenAI({ | ||
apiKey: togetherKey, | ||
baseURL: "https://api.together.xyz/v1", | ||
}), | ||
), | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: This block is duplicated from the external clients section. Consider extracting Together.ai model handling into a separate function to avoid duplication.
model: wrapAISDKModel(anthropic(modelName)), | ||
}); | ||
} else if (isGroqModel) { | ||
const groqModel = modelName.substring(modelName.indexOf("/") + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Potential error if '/' is not found in modelName. Add null check before substring operation.
const groqModel = modelName.substring(modelName.indexOf("/") + 1); | |
const slashIndex = modelName.indexOf("/"); | |
const groqModel = slashIndex === -1 ? modelName : modelName.substring(slashIndex + 1); |
types/evals.ts
Outdated
openAiKey?: string; | ||
googleKey?: string; | ||
anthropicKey?: string; | ||
groqKey?: string; | ||
cerebrasKey?: string; | ||
togetherKey?: string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider grouping API keys into a separate interface/type to improve maintainability as new providers are added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
(updates since last review)
This PR continues to refine the LLM client integration in Stagehand's evaluation system with several important updates:
- Expanded the default evaluation models to include
claude-3-5-sonnet-latest
,gpt-4o-mini
, andgpt-4o
intaskConfig.ts
- Added comprehensive model support across providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras) with
ALL_EVAL_MODELS
list - Implemented provider-specific model filtering in
getModelList()
function - Removed direct Groq/Cerebras client imports in favor of accessing through Together API
- Added error handling for Groq/Cerebras models when external clients are disabled
These changes build upon the previous implementation by providing more robust model support while maintaining the new external client opt-in behavior.
5 file(s) reviewed, 2 comment(s)
Edit PR Review Bot Settings | Greptile
} else if (arg.startsWith("-useExternalClients=")) { | ||
const val = arg.split("=")[1]?.toLowerCase(); | ||
parsedArgs.useExternalClients = val === "true"; | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: No default value set for useExternalClients. Should explicitly initialize to false in parsedArgs object to match PR description
} else if (arg.startsWith("-useExternalClients=")) { | |
const val = arg.split("=")[1]?.toLowerCase(); | |
parsedArgs.useExternalClients = val === "true"; | |
} else { | |
useExternalClients: false, | |
leftover: [], |
anthropicKey, | ||
togetherKey, | ||
}: CreateLLMClientOptions): LLMClient { | ||
const isOpenAIModel = modelName.startsWith("gpt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The OpenAI model detection logic was changed to remove || modelName.includes("/")
, but this check is still used later in the code. This could cause inconsistent behavior.
why
what changed
useExternalClients
that defaults to false. if you want to run evals with external clients (to get telemetry, you can set-useExternalClients=true
test plan