(EAI-897): Run full NL-to-mongosh benchmark + prompt maxxing #677

Status: Open. Wants to merge 73 commits into base: main.

Commits (73)
5568f43
init draft
Feb 24, 2025
686f3c6
add db node
Feb 24, 2025
775a727
Merge remote-tracking branch 'upstream/main' into tree_of_generation
Feb 26, 2025
d9a7ae7
NL generation hack
Feb 26, 2025
6262034
modularize prompt components
Feb 26, 2025
c001709
it compiles...but does it run? to be continued...
Feb 28, 2025
45cf169
start on annotating db
Mar 3, 2025
153c6a5
checkpoint before merge
Mar 4, 2025
e4f175f
Merge remote-tracking branch 'upstream/main' into EAI-894
Mar 4, 2025
d290074
push models to core
Mar 4, 2025
1d26140
checkpoint functional
Mar 4, 2025
03a18e2
working
Mar 4, 2025
3ecfaa5
sample outputs
Mar 4, 2025
5a06c83
checkpoint much working
Mar 5, 2025
dfad31f
tests passing for makeGenerateChildrenWithOpenAi
Mar 5, 2025
65b96fd
kinda works. nuff for today
Mar 5, 2025
f31896c
code execution nodes
Mar 6, 2025
9e3a9f5
working well e2e
Mar 7, 2025
8349446
clean up PR code
Mar 7, 2025
e322a1b
remove artifact outputs
Mar 7, 2025
7510882
update gen script
Mar 10, 2025
47dc9c5
skip tests in CI
Mar 10, 2025
29f518d
Fix build errr
Mar 11, 2025
c145269
fix test build err redux
Mar 11, 2025
cb5aeec
mostly works!
Mar 13, 2025
8e40f2d
Add metadata to queries
Mar 14, 2025
53fba7b
most of fuzzy match clustering
Mar 14, 2025
47688e6
clustering and filtering logic
Mar 24, 2025
c111508
standardize dataset entry
Mar 24, 2025
6ea7c3d
truncated arrays include values at end
Mar 25, 2025
a433a6d
better handling db output for LLM
Mar 25, 2025
ceb561f
truncate + agent mode
Mar 25, 2025
e1a2e82
working better
Mar 25, 2025
038a734
table res
Mar 25, 2025
73df5a5
upload to BT & prompt refinements
Mar 26, 2025
7bebe94
Fix ts err
Mar 26, 2025
9845a16
aggregate the good ones from the jump
Mar 26, 2025
29b2a29
var name update
Mar 26, 2025
d393bf0
dataset analysis
Mar 27, 2025
8ad5dec
implement nl feedback
Mar 28, 2025
1fafbef
Merge remote-tracking branch 'upstream/main' into EAI-895
Mar 28, 2025
585cccc
Merge remote-tracking branch 'upstream/main' into EAI-895
Mar 28, 2025
1843988
fix build err
Mar 28, 2025
58bb298
remove unused file
Mar 28, 2025
41f00a2
fix build errs
Mar 28, 2025
5c4578b
Remove filter datasets
Mar 28, 2025
0bb5e82
set repo up for multi nl to code evals
Apr 2, 2025
9cac4ef
working agentic task
Apr 2, 2025
665a6a6
add is reasonable check
Apr 4, 2025
6166249
working better yet!
Apr 4, 2025
5a08676
Add simplified document schema helper
Apr 7, 2025
6f4701a
Add simplified document schema helper
Apr 7, 2025
11e6ba7
Remove console log
Apr 7, 2025
f80998f
filter down dataset
Apr 8, 2025
16d2346
most things mostly work
Apr 8, 2025
8af5322
upload curr dataset to braintrust
Apr 8, 2025
94cd461
concurrent experiment execution
Apr 8, 2025
d52704f
daze work
Apr 9, 2025
fbaa330
upload filtered
Apr 10, 2025
31ad0c9
support gemini in benchmark
Apr 10, 2025
fce1fec
Merge remote-tracking branch 'upstream/main' into EAI-896
Apr 10, 2025
22ac523
Fix build errs
Apr 10, 2025
7329353
fix build err in test
Apr 10, 2025
02a209e
fix build err again
Apr 10, 2025
db4e836
agentic edits
Apr 10, 2025
cf0b8d2
more prompt maxxing
Apr 11, 2025
704d30f
modularize experiments
Apr 14, 2025
163933f
prep official benchmark
Apr 14, 2025
a8b6751
latest
Apr 15, 2025
7df269d
add gpt 4.1 models
Apr 15, 2025
da0c98e
Fix closure on tool call
Apr 16, 2025
a438a6f
Merge remote-tracking branch 'upstream/main' into EAI-897
Apr 17, 2025
77ef8b8
fix build errs
Apr 17, 2025
@@ -1,10 +1,10 @@
-import { models } from "mongodb-rag-core/models";
+import { getOpenAiEndpointAndApiKey, models } from "mongodb-rag-core/models";
 import "dotenv/config";
 import PromisePool from "@supercharge/promise-pool";
 import { runQuizQuestionEval } from "./QuizQuestionEval";
 import { getQuizQuestionEvalCasesFromBraintrust } from "./getQuizQuestionEvalCasesFromBraintrust";
 import { mongoDbQuizQuestionExamples } from "./mongoDbQuizQuestionExamples";
-import { openAiClientFactory } from "../openAiClients";
+import { OpenAI } from "mongodb-rag-core/openai";

 async function main() {
   const DEFAULT_MAX_CONCURRENCY = 15;
@@ -46,7 +46,9 @@ async function main() {
   await runQuizQuestionEval({
     projectName,
     model: modelInfo.deployment,
-    openaiClient: openAiClientFactory.makeOpenAiClient(modelInfo),
+    openaiClient: new OpenAI({
+      ...(await getOpenAiEndpointAndApiKey(modelInfo)),
+    }),
     experimentName,
     additionalMetadata: {
       ...modelInfo,
@@ -0,0 +1,15 @@
import { assertEnvVars, BRAINTRUST_ENV_VARS } from "mongodb-rag-core";
import { getBraintrustExperimentSummary } from "./getBraintrustExperimentSummary";

describe.skip("getBraintrustExperimentSummary", () => {
  it("should return the experiment summary", async () => {
    const { BRAINTRUST_API_KEY } = assertEnvVars(BRAINTRUST_ENV_VARS);
    const result = await getBraintrustExperimentSummary({
      experimentName:
        "mongosh-benchmark-official?experimentType=agentic&model=gemini-2.5-pro-preview-03-25",
      projectName: "natural-language-to-mongosh",
      apiKey: BRAINTRUST_API_KEY,
    });
    expect(result).toBeDefined();
  });
});
packages/benchmarks/src/reporting/getBraintrustExperimentSummary.ts (237 additions, 0 deletions)
@@ -0,0 +1,237 @@
import { init } from "mongodb-rag-core/braintrust";

export interface GetBraintrustExperimentSummary {
  experimentName: string;
  projectName: string;
  apiKey: string;
}

export async function getBraintrustExperimentSummary({
  projectName,
  experimentName,
  apiKey,
}: GetBraintrustExperimentSummary): Promise<unknown> {
  const experiment = await init(projectName, {
    experiment: experimentName,
    apiKey,
    open: true,
  });
  const id = await experiment.id;

  const metadata = (await fetch(
    `https://api.braintrust.dev/v1/experiment/${id}`,
    {
      headers: {
        Authorization: `Bearer ${apiKey}`,
      },
    }
  ).then((res) => res.json())) as GetExperimentMetadataResponse;

  const summary = (await fetch(
    `https://api.braintrust.dev/v1/experiment/${id}/summarize?summarize_scores=true&comparison_experiment_id=${id}`,
    {
      headers: {
        Authorization: `Bearer ${apiKey}`,
      },
    }
  ).then((res) => res.json())) as GetExperimentSummaryResponse;

  return { metadata, summary };
}

// ---
// Types from the Braintrust API docs
// ---

/**
  Metadata about the state of the repo when the experiment was created
 */
export type RepoInfo = {
  /**
    SHA of most recent commit
   */
  commit?: string | null;
  /**
    Name of the branch the most recent commit belongs to
   */
  branch?: string | null;
  /**
    Name of the tag on the most recent commit
   */
  tag?: string | null;
  /**
    Whether or not the repo had uncommitted changes when snapshotted
   */
  dirty?: boolean | null;
  /**
    Name of the author of the most recent commit
   */
  author_name?: string | null;
  /**
    Email of the author of the most recent commit
   */
  author_email?: string | null;
  /**
    Most recent commit message
   */
  commit_message?: string | null;
  /**
    Time of the most recent commit
   */
  commit_time?: string | null;
  /**
    If the repo was dirty when run, this includes the diff between the current state of the repo and the most recent commit.
   */
  git_diff?: string | null;
};

export interface GetExperimentMetadataResponse {
  /**
    Unique identifier for the experiment
   */
  id: string;
  /**
    Unique identifier for the project that the experiment belongs under
   */
  project_id: string;
  /**
    Name of the experiment. Within a project, experiment names are unique
   */
  name: string;
  /**
    Textual description of the experiment
   */
  description?: string | null;
  /**
    Date of experiment creation
   */
  created?: string | null;
  repo_info?: RepoInfo;
  /**
    Commit, taken directly from `repo_info.commit`
   */
  commit?: string | null;
  /**
    Id of default base experiment to compare against when viewing this experiment
   */
  base_exp_id?: string | null;
  /**
    Date of experiment deletion, or null if the experiment is still active
   */
  deleted_at?: string | null;
  /**
    Identifier of the linked dataset, or null if the experiment is not linked to a dataset
   */
  dataset_id?: string | null;
  /**
    Version number of the linked dataset the experiment was run against. This can be used to reproduce the experiment after the dataset has been modified.
   */
  dataset_version?: string | null;
  /**
    Whether or not the experiment is public. Public experiments can be viewed by anybody inside or outside the organization
   */
  public: boolean;
  /**
    Identifies the user who created the experiment
   */
  user_id?: string | null;
  /**
    User-controlled metadata about the experiment
   */
  metadata?: {
    [k: string]: {
      [k: string]: unknown;
    };
  } | null;
}

/**
  Summary of an experiment
 */
export interface GetExperimentSummaryResponse {
  /**
    Name of the project that the experiment belongs to
   */
  project_name: string;
  /**
    Name of the experiment
   */
  experiment_name: string;
  /**
    URL to the project's page in the Braintrust app
   */
  project_url: string;
  /**
    URL to the experiment's page in the Braintrust app
   */
  experiment_url: string;
  /**
    The experiment which scores are baselined against
   */
  comparison_experiment_name?: string | null;
  /**
    Summary of the experiment's scores
   */
  scores?: {
    [k: string]: ScoreSummary;
  } | null;
  /**
    Summary of the experiment's metrics
   */
  metrics?: {
    [k: string]: MetricSummary;
  } | null;
}

/**
  Summary of a score's performance
 */
export interface ScoreSummary {
  /**
    Name of the score
   */
  name: string;
  /**
    Average score across all examples
   */
  score: number;
  /**
    Difference in score between the current and comparison experiment
   */
  diff?: number;
  /**
    Number of improvements in the score
   */
  improvements: number;
  /**
    Number of regressions in the score
   */
  regressions: number;
}

/**
  Summary of a metric's performance
 */
export interface MetricSummary {
  /**
    Name of the metric
   */
  name: string;
  /**
    Average metric across all examples
   */
  metric: number;
  /**
    Unit label for the metric
   */
  unit: string;
  /**
    Difference in metric between the current and comparison experiment
   */
  diff?: number;
  /**
    Number of improvements in the metric
   */
  improvements: number;
  /**
    Number of regressions in the metric
   */
  regressions: number;
}
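As a usage sketch: the `ScoreSummary` map returned under `scores` could be rendered into a short benchmark report. The helper and the score name `CorrectOutputFuzzy` below are hypothetical (not part of this PR), shown only to illustrate consuming the added types:

```typescript
// Hypothetical helper: format Braintrust score summaries as report lines.
// Mirrors the ScoreSummary shape added in this PR.
interface ScoreSummary {
  name: string;
  score: number; // average score across all examples, 0..1
  diff?: number; // delta vs. comparison experiment, 0..1
  improvements: number;
  regressions: number;
}

function formatScores(scores: Record<string, ScoreSummary>): string[] {
  return Object.values(scores).map((s) => {
    // Render the 0..1 average as a right-aligned percentage.
    const pct = (s.score * 100).toFixed(1).padStart(5);
    // Only show the delta when a comparison experiment was set.
    const diff =
      s.diff !== undefined
        ? ` (${s.diff >= 0 ? "+" : ""}${(s.diff * 100).toFixed(1)}%)`
        : "";
    return `${s.name}: ${pct}%${diff} [+${s.improvements}/-${s.regressions}]`;
  });
}

// Example with a made-up scorer name and values:
const lines = formatScores({
  CorrectOutputFuzzy: {
    name: "CorrectOutputFuzzy",
    score: 0.82,
    diff: 0.05,
    improvements: 12,
    regressions: 3,
  },
});
```

Here `lines[0]` comes out as `"CorrectOutputFuzzy:  82.0% (+5.0%) [+12/-3]"`, one line per scorer, which is convenient to log after each benchmark run.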