
Add Tools for Managing Agent Traces in Hugging Face Datasets #217


Open · wants to merge 11 commits into main
5 changes: 5 additions & 0 deletions src/agentlab/llm/traces/config.py
@@ -0,0 +1,5 @@
HF_USERNAME = "your_username"
HF_INDEX_DATASET = "your_username/agent_traces_index"
HF_TRACE_DATASET = "your_username/agent_traces_data"
WHITELISTED_BENCHMARKS = ["benchmark1", "benchmark2"]

29 changes: 29 additions & 0 deletions src/agentlab/llm/traces/query.py
@@ -0,0 +1,29 @@
from datasets import load_dataset
import requests

# Hugging Face dataset name for the index
INDEX_DATASET = "your_username/agent_traces_index"

# Query the index for traces matching an LLM and/or a benchmark
def query_traces(llm=None, benchmark=None):
    dataset = load_dataset(INDEX_DATASET, split="train")
    df = dataset.to_pandas()

    if llm:
        df = df[df["llm"] == llm]
    if benchmark:
        df = df[df["benchmark"] == benchmark]

    return df[["exp_id", "study_name", "trace_pointer"]].to_dict(orient="records")

# Download the trace archive associated with an exp_id
def download_trace(exp_id: str, save_path: str):
    dataset = load_dataset(INDEX_DATASET, split="train")
    df = dataset.to_pandas()
    matches = df[df["exp_id"] == exp_id]["trace_pointer"].values
    if len(matches) == 0:
        raise ValueError(f"No trace found for exp_id {exp_id}")
    trace_url = matches[0]

    response = requests.get(trace_url)
    response.raise_for_status()  # fail fast on HTTP errors instead of writing an error page to disk
    with open(save_path, "wb") as f:
        f.write(response.content)
    print(f"Downloaded trace {exp_id} to {save_path}")

57 changes: 57 additions & 0 deletions src/agentlab/llm/traces/uploads.py
@@ -0,0 +1,57 @@
from datasets import Dataset, load_dataset
from huggingface_hub import HfApi
import pandas as pd

# Hugging Face dataset names
INDEX_DATASET = "/agent_traces_index"
TRACE_DATASET = "/agent_traces_data"
Collaborator: This is a dev version so all good, but eventually we'd switch this to environment variables.

Author: Yeah, I understand. Once this PR is completed, I'll remove them and you can add them in your .env.

I agree, moving the Hugging Face dataset names INDEX_DATASET and TRACE_DATASET to environment variables would improve maintainability and security. It would also help us manage environment-specific settings. Thank you for addressing this.
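
As a rough sketch of the env-variable approach discussed above (the variable names AGENTLAB_HF_INDEX_DATASET and AGENTLAB_HF_TRACE_DATASET are illustrative, not part of this PR):

import os

# Hypothetical variable names; fall back to the current dev placeholders
INDEX_DATASET = os.environ.get("AGENTLAB_HF_INDEX_DATASET", "/agent_traces_index")
TRACE_DATASET = os.environ.get("AGENTLAB_HF_TRACE_DATASET", "/agent_traces_data")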


# Hugging Face API instance
api = HfApi()

def upload_index_data(index_df: pd.DataFrame):
    dataset = Dataset.from_pandas(index_df)
    dataset.push_to_hub(INDEX_DATASET, split="train")

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
Collaborator: Ideally, we would approve new content on our datasets. Would there be a way to make new uploads into a PR? I'm guessing that might be on the Hugging Face side, in the dataset settings.
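
For what it's worth, huggingface_hub's upload_file accepts a create_pr flag that opens a pull request on the dataset repo instead of committing directly to main. A minimal sketch (whether this fits the moderation workflow here is untested; the file names are illustrative):

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="trace.zip",   # local archive to upload
    path_in_repo="exp_123.zip",    # illustrative file name
    repo_id="/agent_traces_data",
    repo_type="dataset",
    create_pr=True,                # open a PR on the dataset instead of pushing to main
)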

Collaborator: It could be interesting to compress the file if needed, instead of requiring it to be zipped already.

Author: Got it.
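
A possible sketch of that idea using only the standard library (the compress-if-directory behavior is an assumption, not something this PR implements):

import os
import shutil

def ensure_zipped(path: str) -> str:
    """Return a path to a .zip archive, compressing a directory if needed."""
    if path.endswith(".zip"):
        return path  # already an archive
    if os.path.isdir(path):
        # shutil.make_archive appends the .zip extension itself
        return shutil.make_archive(path, "zip", root_dir=path)
    raise ValueError(f"Expected a .zip file or a directory, got {path}")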


def add_study(exp_id: str, study_name: str, llm: str, benchmark: str, trace_file: str):
    # Check if the benchmark is whitelisted
    WHITELISTED_BENCHMARKS = ["benchmark1", "benchmark2"]
    if benchmark not in WHITELISTED_BENCHMARKS:
        raise ValueError("Benchmark not whitelisted")

    # Assign a license based on LLM and benchmark
    LICENSES = {
        ("GPT-4", "benchmark1"): "MIT",
        ("Llama2", "benchmark2"): "Apache-2.0",
    }
    license_type = LICENSES.get((llm, benchmark), "Unknown")

    # Upload trace file
    upload_trace(trace_file, exp_id)

    # Create metadata entry
    index_entry = {
        "exp_id": exp_id,
        "study_name": study_name,
        "llm": llm,
        "benchmark": benchmark,
        "license": license_type,
        "trace_pointer": f"https://huggingface.co/datasets/{TRACE_DATASET}/resolve/main/{exp_id}.zip",
    }

    # Load the existing index dataset and add the new entry
    dataset = load_dataset(INDEX_DATASET, split="train")
Collaborator: I'm having issues on this line when trying to test things on my side, as the dataset version that's online is empty. Would there be a way to initialize it first?

Author: I guess we should have a test dataset online to test the functionality.

That's a valid concern, RohitP2005. Having an online test dataset would indeed help us verify the functionality in real-world settings. Would it be possible to initialize and upload a minimal test dataset that we could use for these purposes? We can involve the team in generating sample data if needed.

    df = dataset.to_pandas()
    # DataFrame.append was removed in pandas 2.0; append the new row with concat instead
    df = pd.concat([df, pd.DataFrame([index_entry])], ignore_index=True)
    upload_index_data(df)

    print(f"Study {exp_id} added successfully!")