(EAI-897): Run full NL-to-mongosh benchmark + prompt maxxing #677

mongodben · 2025-04-17T19:22:57Z

Jira: https://jira.mongodb.org/browse/EAI-897

Changes

Add methods for getting Braintrust experiment metadata
- ended up not using these, but figured they're good to include anyway since they're tested and functional
Clean up orchestration for running benchmarks on different prompt types
Only run benchmarks on a subset of all possible prompt combos (the full set would be unreasonably large: hundreds of experiments and over 100K individual evals)
Update prompts for optimized performance (prompt maxxing)
Standardize format for experiment names, inspired by URL query params. This makes searching for experiments in Braintrust quite easy.
Add a few more supported models: Gemini 2.5 Pro and the GPT-4.1 family
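
To illustrate the query-param-style experiment names mentioned above, here is a minimal sketch. The helper names (`makeExperimentName`, `parseExperimentName`) and the parameter keys (`model`, `schema`, `fewShot`) are hypothetical and chosen only for illustration; the exact format used in this PR may differ.

```typescript
// Hypothetical sketch: these helpers and keys are assumptions, not the PR's code.
type ExperimentParams = Record<string, string | number | boolean>;

// Build a name like "fewShot=true&model=gpt-4.1&schema=annotated".
// Keys are sorted so the same settings always produce the same name.
function makeExperimentName(params: ExperimentParams): string {
  return Object.keys(params)
    .sort()
    .map(
      (key) =>
        `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`
    )
    .join("&");
}

// Recover the settings from a name. All values come back as strings.
function parseExperimentName(name: string): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const pair of name.split("&")) {
    if (pair === "") continue;
    const idx = pair.indexOf("=");
    const key = idx === -1 ? pair : pair.slice(0, idx);
    const value = idx === -1 ? "" : pair.slice(idx + 1);
    parsed[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return parsed;
}

const exampleName = makeExperimentName({
  model: "gpt-4.1",
  schema: "annotated",
  fewShot: true,
});
```

With names in this shape, searching for a fragment such as `model=gpt-4.1` would match every experiment run with that setting, regardless of the other parameters.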

Notes

Ben Perlmutter added 30 commits February 24, 2025 10:12

init draft

5568f43

add db node

686f3c6

Merge remote-tracking branch 'upstream/main' into tree_of_generation

775a727

NL generation hack

d9a7ae7

modularize prompt components

6262034

it compiles...but does it run? to be continued...

c001709

start on annotating db

45cf169

checkpoint before merge

153c6a5

Merge remote-tracking branch 'upstream/main' into EAI-894

e4f175f

push models to core

d290074

checkpoint functional

1d26140

working

03a18e2

sample outputs

3ecfaa5

checkpoint much working

5a06c83

tests passing for makeGenerateChildrenWithOpenAi

dfad31f

kinda works. nuff for today

65b96fd

code execution nodes

f31896c

working well e2e

9e3a9f5

clean up PR code

8349446

remove artifact outputs

e322a1b

update gen script

7510882

skip tests in CI

47dc9c5

Fix build errr

29f518d

fix test build err redux

c145269

mostly works!

cb5aeec

Add metadata to queries

8e40f2d

most of fuzzy match clustering

53fba7b

clustering and filtering logic

47688e6

standardize dataset entry

c111508

truncated arrays include values at end

6ea7c3d

Ben Perlmutter added 28 commits March 28, 2025 16:12

fix build errs

41f00a2

Remove filter datasets

5c4578b

set repo up for multi nl to code evals

0bb5e82

working agentic task

9cac4ef

add is reasonable check

665a6a6

working better yet!

6166249

Add simplified document schema helper

5a08676

Add simplified document schema helper

6f4701a

Remove console log

11e6ba7

filter down dataset

f80998f

most things mostly work

16d2346

upload curr dataset to braintrust

8af5322

concurrent experiment execution

94cd461

daze work

d52704f

upload filtered

fbaa330

support gemini in benchmark

31ad0c9

Merge remote-tracking branch 'upstream/main' into EAI-896

fce1fec

Fix build errs

22ac523

fix build err in test

7329353

fix build err again

02a209e

agentic edits

db4e836

more prompt maxxing

cf0b8d2

modularize experiments

704d30f

prep official benchmark

163933f

latest

a8b6751

add gpt 4.1 models

7df269d

Fix closure on tool call

da0c98e

Merge remote-tracking branch 'upstream/main' into EAI-897

a438a6f

mongodben changed the title ~~(EAI-897): Run full NL-to-mongosh benchmark~~ (EAI-897): Run full NL-to-mongosh benchmark + prompt maxxing Apr 17, 2025

fix build errs

77ef8b8
