-
Notifications
You must be signed in to change notification settings - Fork 3.8k
[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It also includes an exception handling, which I described here #10640 (comment) . But I think usability can be always simplified. For example, if you are not interested in returned keys, instead of doing: await foreach (var _ in this.Collection.UpsertBatchAsync(this.TestData))
{
} It's possible to do: await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync(); But this is topic about having If we are talking about "return something" vs "return nothing":
|
await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync(); Note that there's no ToListAsync() in before .NET 10 (which hasn't yet been released) - one has to reference the System.Linq.Async package in order to do that (and that's pretty undiscoverable - one has to someone know that one needs to reference that package). Even then, I can see lots of people simply omitting the ToListAsync(), and as a result their code wouldn't actually wait until the upsert is complete; that's a bit similar to forgetting an
That's true. I guess the thing I'm really interested in here, is why no native (non-relational) vector database supports key generation (AFAIK); the only databases which seem to support it are relational databases, and that's probably the case for other scenarios which have nothing to do with vectors usage. That feels like it means that database-generated keys may simply not make much sense in typical vector database usage. |
Just to chime in, for python a AsyncIterable is a much worse dev experience compared to a List as return, I would be open to returning |
@eavanvalkenburg that would simplify the .NET dev experience a bit more... For cases where you don't care about the IDs (because you're providing them and not db-generating them, because the database doesn't even support ID generation - like most cases), users currently have to consume the returned IAsyncEnumerable (e.g. await collection.UpsertBatchAsync().ToListAsync() - this also requires an external System.Linq.Async package currently). Plus actually streaming the returned IDs (as opposed to returning a non-streaming List) is unlikely to be very important - streaming is an important perf feature in cases where getting the 1st element back (before the rest) would allow you to start important work earlier, and/or when each item would take a long time, which doesn't seem to be the case for database row insertion. So changing the result type from IAsyncEnumerable to IReadOnlyList could be a sensible compromise in .NET too... |
Discussed this a bit more, next steps:
|
Thinking a bit more about our error behavior, I realized that our support for database-generated keys makes our Upsert API non-idempotent. That is, as long as the user provides key values, they can simply retry operations (e.g. because of timeout errors or whatever). But the moment database key generation is involved, retrying on error could insert an additional row (leading to a duplicate).
Overall I'd have preferred to simply not support database key generation in day 1, and only introduce it later based on user demand, but unfortunately this feature requires an API change (Upsert returning value(s)), so we can't do that without breaking. Hopefully it all works out. |
I did a pass through the different DBs we support or would still like to add support for, and this is my findings. The idempotent upsert issue is definitely a concern. I think we have a few options:
|
Thanks for doing the research @westey-m, that's super useful. It shows that there are multiple non-relational databases supporting database-generated keys (Weaviate, Milvus, MongoDB...). For me, this list is definitely enough to justify including the functionality in the abstraction.
We could, though at the end of the day the non-atomicity is a limitation of the database, and I think it's OK to let the user decide whether they're willing to live with non-idempotent inserts or not... In other words, we can clearly document this on the abstraction API (for batch IVectorStoreRecordCollection.UpsertAsync), and refer the user to look at their specific implementations documentation of batch UpsertAsync). Beyond that it's really their choice - I wouldn't deny them the option just to protect them from possible errors. Note that where users do want idempotency with database-generated keys, and the database doesn't support atomicity, they can still upsert one-by-one (the non-batching API). This allows them to precisely manage what behavior they want on errors etc., though it would have a performance cost where a database-native batching operation is available. As an aside, this is why I think it does make sense to have separate insert, update and upsert operations... That allows making things far clearer around these guarantees, which really can vary by operation. But that's not up to us.
We could, though it seems like we do need to decide now whether we are going to support this or not in the future, since it determines the return value of UpsertAsync... I wouldn't want us to leave the API in its current form (returning keys), and then later decide that we won't support this (because of the idempotency concern) and leave a dead return value (or break to remove it).
That's indeed the safest option :) But I think you've shown that enough databases support this that we should too. In summary:
|
This works for me. That's essentially where I was heading with Option 2 as well. I think we should have it on the interface and we can then add support later on a DB by DB basis where it makes sense. I also don't think we need any implementations for build. Why use |
I think that's just general practice when returning collection data from APIs when that data isn't meant to be mutable... It's mostly an indication to the user that this is returned data from the database, and that they can't mutate anything via that type. Otherwise some people might get an idea that the returned collection is somehow bound to the database, and modifying it actually does something. Is this something you feel strongly about? It's another possible good question to raise in API review. |
Really? Surely, that would be very strange for a list of returned ids when using an API that makes no claims to being bound to the DB, even in places where we deal with records.
Not really no. Modifying the response during further processing is probably very much an edge case, so I don't think it matters too much. |
I agree and it's indeed a bit contrived... In any case, since you're OK with it let's proceed with IReadOnlyList for now - but I'll raise this as a question for API review, I'm interested in the general response they give. |
As an aside, returning IAsyncEnumerable from UpsertBatch may be something we'd want to reexamine. First, it seems that only a small minority of databases actually support database-generated keys/ids (basically the relational databases like PG) - I'm not sure if we've fully mapped out the support, but if that's true, I'm not sure the abstraction should have an API feature which only works for the minority etc.
In addition, I'm not sure if database-generated keys correspond to the way that vector stores are generally used - this would probably be why no native vector database AFAIK actually supports this feature (please correct me if I'm wrong). If that's true, then we're supporting them in the vector database abstraction because relational databases happen to have them for other, non-vector scenarios.
Note that I wouldn't raise this if returning IAsyncEnumerable didn't create API usability issues. Currently, inserting requires the following code:
This is because one cannot simply
await
an IAsyncEnumerable. I want to make sure we're not forcing users to do this every time they want to add some records, only because some databases support database key generation, and even then, when vector store usage typically wouldn't require actually fetching the key back at the point where it's inserted.I'm very open to being convinced otherwise here, but the fact that dedicated vector stores don't support database-generated keys feels like a strong indication that they don't belong on the API surface of the abstraaction.
/cc @dmytrostruk @westey-m @adamsitnik
The text was updated successfully, but these errors were encountered: