Add Tools for Managing Agent Traces in Hugging Face Datasets #217
Conversation
src/agentlab/llm/traces/uploads.py
Outdated

# Load the existing index dataset and add new entry
dataset = load_dataset(INDEX_DATASET, split="train")
I'm having issues on this line when trying to test things on my side, as the dataset version that's online is empty. Would there be a way to initialize it first?
I guess we should have a test dataset online to test the functionality.
That's a valid concern, RohitP2005. Having an online test dataset would indeed help us verify the functionality in real-world settings. Would it be possible to initialize and upload a minimal test dataset that we could use for these purposes? We can involve the team in generating sample data if needed.
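A minimal sketch of what such an initialization could look like, assuming a hypothetical test repo name and the column names discussed in this PR (both are placeholders, not final):

from datasets import Dataset

TEST_INDEX_DATASET = "org/agent_traces_index_test"  # hypothetical test repo

# Seed one placeholder row so load_dataset(..., split="train") succeeds.
seed = Dataset.from_dict(
    {
        "exp_id": ["test-exp-0"],
        "study_name": ["smoke_test"],
        "llm": ["dummy-llm"],
        "benchmark": ["dummy-benchmark"],
        "license": ["MIT"],
        "trace_pointer": ["test-exp-0.zip"],
    }
)
seed.push_to_hub(TEST_INDEX_DATASET, split="train")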
src/agentlab/llm/traces/uploads.py
Outdated
def upload_index_data(index_df: pd.DataFrame): | ||
dataset = Dataset.from_pandas(index_df) | ||
dataset.push_to_hub(INDEX_DATASET, split="train") | ||
|
||
def upload_trace(trace_file: str, exp_id: str): | ||
api.upload_file( | ||
path_or_fileobj=trace_file, | ||
path_in_repo=f"{exp_id}.zip", | ||
repo_id=TRACE_DATASET, | ||
repo_type="dataset", | ||
) |
Ideally, we would approve new content on our datasets. Would there be a way to make new uploads into a PR?
I'm guessing that might be on the Hugging Face side, in the dataset settings.
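For reference, huggingface_hub also supports this client-side: upload_file accepts a create_pr flag that opens a pull request on the dataset repo instead of committing directly. A sketch based on the upload_trace function in this PR:

from huggingface_hub import HfApi

api = HfApi()

def upload_trace_as_pr(trace_file: str, exp_id: str):
    # create_pr=True opens a Hub pull request that maintainers
    # can review and merge, rather than writing to main directly.
    return api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
        create_pr=True,
    )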
src/agentlab/llm/traces/uploads.py
Outdated

# Hugging Face dataset names
INDEX_DATASET = "/agent_traces_index"
TRACE_DATASET = "/agent_traces_data"
This is a dev version, so all good, but eventually we'd switch this to env variables.
Yeah, I understand. Once this PR is completed, I will remove them and you can add them in your .env.
I agree, moving the Hugging Face dataset names INDEX_DATASET and TRACE_DATASET to environment variables would improve code maintainability and security. It would also help us manage environment-specific settings. Thank you for addressing this.
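A minimal sketch of that switch (the env variable names here are assumptions, not final):

import os

# Fall back to the current dev values when the env variables are unset.
INDEX_DATASET = os.environ.get("AGENTLAB_INDEX_DATASET", "/agent_traces_index")
TRACE_DATASET = os.environ.get("AGENTLAB_TRACE_DATASET", "/agent_traces_data")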
Hello @RohitP2005, this looks very interesting, thank you! Ideally, we would have a third table on top of this, with one entry per study (as in the reproducibility_journal.csv file), with a key. The entries in the experiment metadata table would point to that key.
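A rough sketch of the three-level layout being suggested, with column names that are guesses based on this thread rather than a final schema:

import pandas as pd

# One row per study, keyed by study_id (mirroring reproducibility_journal.csv).
study_df = pd.DataFrame(
    [{"study_id": "study-001", "study_name": "example_study", "date": "2024-01-01"}]
)

# One row per experiment; study_id is the foreign key into the study table.
index_df = pd.DataFrame(
    [
        {
            "exp_id": "exp-001",
            "study_id": "study-001",
            "llm": "dummy-llm",
            "benchmark": "dummy-benchmark",
            "license": "MIT",
            "trace_pointer": "exp-001.zip",
        }
    ]
)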
src/agentlab/llm/traces/uploads.py
Outdated

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
It could be interesting to compress the file if needed, instead of requiring it to be zipped already.
Got it.
Yeah, understood. I will look into it as soon as possible.
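One possible sketch, zipping a raw trace directory on the fly with the standard library before handing it to upload_trace (upload_trace_dir is a hypothetical wrapper, not part of this PR):

import os
import shutil

def upload_trace_dir(trace_path: str, exp_id: str):
    if os.path.isdir(trace_path):
        # shutil.make_archive appends ".zip" to the base name it is given.
        trace_file = shutil.make_archive(exp_id, "zip", root_dir=trace_path)
    else:
        trace_file = trace_path  # already a file; assume it is a zip
    upload_trace(trace_file, exp_id)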
I thought of restructuring the trace uploads and creating classes for Study and Experiment, with methods within them for their functionality. The functions are implemented in the utils files. Also, query functionality has been added. Kindly refer to Discord for a detailed description.
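A very rough skeleton of that restructuring (class and method names are guesses at the intent, not the actual implementation):

from dataclasses import dataclass, field

@dataclass
class Experiment:
    exp_id: str
    llm: str
    benchmark: str
    license: str

    def upload(self, trace_file: str):
        ...  # push the zipped trace and its index row

@dataclass
class Study:
    study_id: str
    study_name: str
    experiments: list = field(default_factory=list)

    def upload(self):
        ...  # push study metadata, then each experiment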
try:
    dataset = load_dataset(trace_dataset, use_auth_token=hf_token, split="train")
    existing_data = {"exp_id": dataset["exp_id"], "zip_file": dataset["zip_file"]}
except Exception as e:
    print(f"Could not load existing dataset: {e}. Creating a new dataset.")
    existing_data = None
Loading the traces dataset is going to be a problem, as the traces are really heavy (200 GB for our TMLR paper).
Ideally we'd have something more similar to your original version:

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
We would trust the index dataset to avoid duplicates, and use the trace dataset as a container in which we'd dump the zipfiles.
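A sketch of that dedup-by-index flow, reusing the names from this PR: consult only the lightweight index before pushing, and never load the heavy trace files themselves.

from datasets import load_dataset

def upload_trace_if_new(trace_file: str, exp_id: str):
    index = load_dataset(INDEX_DATASET, split="train")
    if exp_id in set(index["exp_id"]):
        print(f"{exp_id} already indexed, skipping upload.")
        return
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )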
I think we can safely remove this file now. It would be nice to have an equivalent of the upload method, to merge all 3 levels of upload (study, index, traces).
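Something like the following could merge the three levels behind one entry point (upload_study_data is a placeholder name; upload_index_data and upload_trace are the helpers from this PR):

def upload_all(study_row: dict, index_df, trace_files: dict):
    upload_study_data(study_row)          # 1. study-level metadata
    upload_index_data(index_df)           # 2. per-experiment index rows
    for exp_id, trace_file in trace_files.items():
        upload_trace(trace_file, exp_id)  # 3. heavy trace zips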
I'll update this to match the recent changes in the upload methods.
[Sample PR] Add Tools for Managing Agent Traces in Hugging Face Datasets
reference: #53
Summary
This PR introduces a foundational implementation for managing and uploading agent traces to Hugging Face datasets. It provides tools to simplify adding traces, maintain an index dataset for easy retrieval, and enforce whitelist-based constraints for legality.
Key Features
1. Hugging Face Dataset Structure
2. Upload System

The index dataset records metadata fields such as study_name, llm, benchmark, and license, along with a pointer (trace_pointer) to the actual trace file.

Notes
Checklist