Add Tools for Managing Agent Traces in Hugging Face Datasets #217


Open · wants to merge 11 commits into main

Conversation

RohitP2005

[Sample PR] Add Tools for Managing Agent Traces in Hugging Face Datasets

reference: #53

Summary

This PR introduces a foundational implementation for managing and uploading agent traces to Hugging Face datasets. It provides tools that simplify adding traces, maintain an index dataset for easy retrieval, and enforce whitelist-based constraints for legal compliance.

Key Features

1. Hugging Face Dataset Structure

  • Index Dataset: Stores metadata for each trace, allowing easy querying based on attributes.
  • Trace Dataset: Contains actual zipped trace files, which can be retrieved via pointers from the index.

2. Upload System

  • Functionality to upload traces one by one.
  • Automated grouping of traces by study.
  • Metadata generation (see the sketch after this list), including:
    • study_name, llm, benchmark, and license.
    • A reference (trace_pointer) to the actual trace file.
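As a rough sketch of how an index entry and the two-dataset upload flow could fit together (the repo ids, helper name, and field values below are illustrative assumptions, not the PR's final API):

from datasets import Dataset
from huggingface_hub import HfApi

INDEX_DATASET = "org/agent_traces_index"  # hypothetical repo ids
TRACE_DATASET = "org/agent_traces_data"

api = HfApi()

def upload_with_index(trace_file: str, exp_id: str, study_name: str,
                      llm: str, benchmark: str, license: str):
    # 1) Dump the zipped trace into the trace dataset.
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
    # 2) Publish one metadata row to the index dataset; trace_pointer
    #    references the zip file uploaded above. (A real version would
    #    merge with the existing index rather than overwrite the split.)
    row = {
        "exp_id": exp_id,
        "study_name": study_name,
        "llm": llm,
        "benchmark": benchmark,
        "license": license,
        "trace_pointer": f"{exp_id}.zip",
    }
    Dataset.from_dict({k: [v] for k, v in row.items()}).push_to_hub(
        INDEX_DATASET, split="train"
    )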

Notes

  • This is a sample template and can be expanded upon.
  • Future work may include better versioning, enhanced querying capabilities, and automated dataset updates.

Checklist

  • Upload functionality
  • Query functionality
  • Legal compliance checks
  • Documentation

# Load the existing index dataset and add new entry
dataset = load_dataset(INDEX_DATASET, split="train")
Collaborator

I'm having issues on this line when trying to test things on my side, as the dataset version that's online is empty. Would there be a way to initialize it first?

Author

I guess we should have a test dataset online to test the functionality.


That's a valid concern, RohitP2005. Having an online test dataset would indeed help us verify the functionality in real-world settings. Would it be possible to initialize and upload a minimal test dataset that we could use for these purposes? We can involve the team in generating sample data if needed.
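One way to do that, assuming the index schema from the PR description (the test repo id here is a placeholder):

from datasets import Dataset

# Seed a minimal index dataset so that load_dataset(..., split="train")
# has at least one row to return; all values are dummy test data.
seed = Dataset.from_dict({
    "exp_id": ["example_exp"],
    "study_name": ["example_study"],
    "llm": ["example-llm"],
    "benchmark": ["example-benchmark"],
    "license": ["MIT"],
    "trace_pointer": ["example_exp.zip"],
})
seed.push_to_hub("org/agent_traces_index_test", split="train")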

Comment on lines 12 to 22
def upload_index_data(index_df: pd.DataFrame):
    dataset = Dataset.from_pandas(index_df)
    dataset.push_to_hub(INDEX_DATASET, split="train")

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
Collaborator

Ideally, we would approve new content on our datasets. Would there be a way to make new uploads into a PR?
I'm guessing that might be on the Hugging Face side, in the dataset settings.
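For what it's worth, huggingface_hub's upload_file accepts a create_pr flag that turns the upload into a pull request on the dataset repo, which might cover this without any Hugging Face-side settings:

api.upload_file(
    path_or_fileobj=trace_file,
    path_in_repo=f"{exp_id}.zip",
    repo_id=TRACE_DATASET,
    repo_type="dataset",
    create_pr=True,  # the upload lands as a reviewable PR, not a direct commit
)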

Comment on lines 5 to 7
# Hugging Face dataset names
INDEX_DATASET = "/agent_traces_index"
TRACE_DATASET = "/agent_traces_data"
Collaborator

This is a dev version, so all good, but eventually we'd switch these to environment variables.

Author

Yeah, I understand. Once this PR is completed, I will remove them and you can add them to your .env.


I agree, moving the Hugging Face dataset names INDEX_DATASET and TRACE_DATASET to environment variables would enhance maintainability and security. It would also help us manage environment-specific settings. Thank you for addressing this.
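A minimal env-based version might look like this (the variable names are illustrative):

import os

# Hugging Face dataset names, read from the environment instead of being
# hard-coded; the .env file would define HF_INDEX_DATASET and HF_TRACE_DATASET.
INDEX_DATASET = os.environ["HF_INDEX_DATASET"]
TRACE_DATASET = os.environ["HF_TRACE_DATASET"]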

@TLSDC (Collaborator) commented Feb 18, 2025

Hello @RohitP2005, this looks very interesting, thank you!
Aside from the previous comments, there is a design aspect that needs to change.
At the moment you have two tables: one holds experiment metadata and points to the corresponding zipped experiment content.

Ideally, we would have a third table on top of this, with one entry per study (as in the reproducibility_journal.csv file), with a key. The entries in the experiment metadata table would point to that key.
This way we could query per llm/benchmark as you did, but also, very importantly, per study.
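To illustrate the suggested layout (field names are assumptions, not a spec):

# Study table: one entry per study, as in reproducibility_journal.csv.
study_row = {
    "study_key": "study_0001",        # primary key referenced by experiments
    "study_name": "example_study",
}

# Experiment metadata table: each row points to its study via study_key
# and to its zipped trace via trace_pointer.
experiment_row = {
    "exp_id": "exp_001",
    "study_key": "study_0001",        # foreign key into the study table
    "llm": "example-llm",
    "benchmark": "example-benchmark",
    "trace_pointer": "exp_001.zip",   # file in the trace dataset
}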

Comment on lines 16 to 22
def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
Collaborator

It could be interesting to compress the file if needed, instead of requiring it to be zipped already.
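Something along these lines could handle the compression transparently (the helper name is hypothetical):

import zipfile
from pathlib import Path

def ensure_zipped(trace_path: str) -> str:
    # Return the path unchanged if it is already a zip; otherwise archive
    # the trace directory next to itself and return the new zip's path.
    path = Path(trace_path)
    if path.suffix == ".zip":
        return str(path)
    zip_path = path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(path.rglob("*")):
            if file.is_file():
                zf.write(file, file.relative_to(path))
    return str(zip_path)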

Author

Got it

@RohitP2005 (Author)

> Hello @RohitP2005, this looks very interesting, thank you! Aside from the previous comments, there is a design aspect that needs to change. At the moment you have two tables: one holds experiment metadata and points to the corresponding zipped experiment content.
>
> Ideally, we would have a third table on top of this, with one entry per study (as in the reproducibility_journal.csv file), with a key. The entries in the experiment metadata table would point to that key. This way we could query per llm/benchmark as you did, but also, very importantly, per study.

Yeah, understood. I will look into it as soon as possible.

@RohitP2005 (Author) commented Feb 28, 2025

Hey @TLSDC @recursix

I thought of restructuring the trace uploads and creating Study and Experiment classes, with methods for their functionality; the functions are implemented in the utils files. Query functionality has also been added.

Kindly refer to Discord for a detailed description.
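A plausible shape for that structure (purely illustrative; the actual methods live in the utils files):

class Experiment:
    """One experiment: metadata plus a pointer to its zipped trace."""

    def __init__(self, exp_id: str, llm: str, benchmark: str, trace_file: str):
        self.exp_id = exp_id
        self.llm = llm
        self.benchmark = benchmark
        self.trace_file = trace_file

    def upload(self):
        # Would delegate to the upload helpers in utils.
        raise NotImplementedError


class Study:
    """A named group of experiments, uploaded and queried together."""

    def __init__(self, study_name: str, experiments: list):
        self.study_name = study_name
        self.experiments = experiments

    def upload(self):
        for exp in self.experiments:
            exp.upload()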

Comment on lines +95 to +100
try:
    dataset = load_dataset(trace_dataset, use_auth_token=hf_token, split="train")
    existing_data = {"exp_id": dataset["exp_id"], "zip_file": dataset["zip_file"]}
except Exception as e:
    print(f"Could not load existing dataset: {e}. Creating a new dataset.")
    existing_data = None
Collaborator

Loading the traces dataset is going to be a problem, as the traces are really heavy (200 GB for our TMLR paper).
Ideally we'd have something more similar to your original version:

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )

We would trust the index dataset to avoid duplicates, and use the trace dataset as a container in which we'd dump the zipfiles.
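Concretely, something like this, where only the lightweight index is ever loaded (names follow the snippets above):

from datasets import load_dataset

# Consult the small index dataset for duplicates; never load the heavy
# trace dataset itself.
index = load_dataset(INDEX_DATASET, split="train")
if exp_id in set(index["exp_id"]):
    print(f"{exp_id} already indexed; skipping upload.")
else:
    upload_trace(trace_file, exp_id)  # dump the zip into the trace container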

@TLSDC (Collaborator) commented Mar 25, 2025

I think we can safely remove this file now. It would be nice to have an equivalent of the upload method that merges all three levels of upload (study, index, traces).
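Roughly, a single entry point like this (upload_study and upload_index_entry are hypothetical helper names):

def upload_all(study_row: dict, experiment_rows: list, trace_files: dict):
    # Merge the three upload levels: study table, experiment index, traces.
    upload_study(study_row)                                      # study-level entry
    for row in experiment_rows:
        upload_index_entry(row)                                  # experiment metadata
        upload_trace(trace_files[row["exp_id"]], row["exp_id"])  # zipped trace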

Collaborator

I'll update this to recent changes in the upload methods
