Skip to content

[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
roji opened this issue Feb 26, 2025 · 12 comments
Assignees
Labels
Build Features planned for next Build conference memory connector memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData .NET Issue or Pull requests regarding .NET code

Comments

@roji
Copy link
Member

roji commented Feb 26, 2025

As an aside, returning IAsyncEnumerable from UpsertBatch may be something we'd want to reexamine. First, it seems that only a small minority of databases actually support database-generated keys/ids (basically the relational databases like PG) - I'm not sure if we've fully mapped out the support, but if that's true, I'm not sure the abstraction should have an API feature which only works for the minority etc.

In addition, I'm not sure if database-generated keys correspond to the way that vector stores are generally used - this would probably be why no native vector database AFAIK actually supports this feature (please correct me if I'm wrong). If that's true, then we're supporting them in the vector database abstraction because relational databases happen to have them for other, non-vector scenarios.

Note that I wouldn't raise this if returning IAsyncEnumerable didn't create API usability issues. Currently, inserting requires the following code:

await foreach (var _ in this.Collection.UpsertBatchAsync(this.TestData))
{
}

This is because one cannot simply await an IAsyncEnumerable. I want to make sure we're not forcing users to do this every time they want to add some records, only because some databases support database key generation, and even then, when vector store usage typically wouldn't require actually fetching the key back at the point where it's inserted.

I'm very open to being convinced otherwise here, but the fact that dedicated vector stores don't support database-generated keys feels like a strong indication that they don't belong on the API surface of the abstraaction.

/cc @dmytrostruk @westey-m @adamsitnik

@dmytrostruk
Copy link
Member

Note that I wouldn't raise this if returning IAsyncEnumerable didn't create API usability issues.

It also includes an exception handling, which I described here #10640 (comment) . But I think usability can be always simplified. For example, if you are not interested in returned keys, instead of doing:

await foreach (var _ in this.Collection.UpsertBatchAsync(this.TestData))
{
}

It's possible to do:

await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync();

But this is topic about having IAsyncEnumerable vs IReadOnlyList as a return type.

If we are talking about "return something" vs "return nothing":

  • If we have an API for upsert operation that returns keys - we cover both cases when DB supports key generation (in this case, we return them from database) and when DB doesn't support key generation (in this case, we return keys that were generated by user).
  • If we have an API that doesn't return anything, then it will be harder to operate with records with DB-generated keys. User upserts a record, but can't get it back by ID, since we didn't return it back, so the only way how to get that record would be to find it by some vector/non-vector property. So, this would impact also GetAsync method from IVectorStoreRecordCollection interface. But I still think that having an ability to get record by ID (whether it's generated by DB or by user) is very important.

@roji
Copy link
Member Author

roji commented Feb 27, 2025

instead of doing [...] It's possible to do [...]:

await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync();

Note that there's no ToListAsync() in before .NET 10 (which hasn't yet been released) - one has to reference the System.Linq.Async package in order to do that (and that's pretty undiscoverable - one has to someone know that one needs to reference that package).

Even then, I can see lots of people simply omitting the ToListAsync(), and as a result their code wouldn't actually wait until the upsert is complete; that's a bit similar to forgetting an await. It's true that modern .NET SDKs emit a warning for missing await (and also unconsumed IAsyncEnumerable returned from methods). But I'm still slightly worried that with this method, where users actually interested in the return value is going to be a pretty small minority (as most databases simply don't support it) will miss it.

If we have an API that doesn't return anything, then it will be harder to operate with records with DB-generated keys.

That's true. I guess the thing I'm really interested in here, is why no native (non-relational) vector database supports key generation (AFAIK); the only databases which seem to support it are relational databases, and that's probably the case for other scenarios which have nothing to do with vectors usage. That feels like it means that database-generated keys may simply not make much sense in typical vector database usage.

@eavanvalkenburg
Copy link
Member

Just to chime in, for python a AsyncIterable is a much worse dev experience compared to a List as return, I would be open to returning List | None since then we have the flexibility to implement based on the capabilities of the store, and that means in practice that most will not return anything. Of course that might make sense for python and not for dotnet!

@roji
Copy link
Member Author

roji commented Mar 14, 2025

@eavanvalkenburg that would simplify the .NET dev experience a bit more... For cases where you don't care about the IDs (because you're providing them and not db-generating them, because the database doesn't even support ID generation - like most cases), users currently have to consume the returned IAsyncEnumerable (e.g. await collection.UpsertBatchAsync().ToListAsync() - this also requires an external System.Linq.Async package currently). Plus actually streaming the returned IDs (as opposed to returning a non-streaming List) is unlikely to be very important - streaming is an important perf feature in cases where getting the 1st element back (before the rest) would allow you to start important work earlier, and/or when each item would take a long time, which doesn't seem to be the case for database row insertion.

So changing the result type from IAsyncEnumerable to IReadOnlyList could be a sensible compromise in .NET too...

@roji roji changed the title .Net MEVD: Discuss the returning of IAsyncEnumerable from UpsertBatchAsync .Net MEVD: Investigate and decide what we want to do with UpsertBatchAsync and returning IDs Mar 19, 2025
@roji
Copy link
Member Author

roji commented Mar 19, 2025

Discussed this a bit more, next steps:

  • Confirm that there actually are dedicated (non-relational) database that do database ID generation (@westey-m).
  • Assuming we keep the feature (likely), consider replacing the current IAsyncEnumerable with e.g. IReadOnlyList as suggested above by @eavanvalkenburg; this would improve the dev experience, and there's likely to be very little value in actually streaming the IDs coming back from the database.
  • Decide how the user should configure this in the record definition; options are to actually introduce AutoGenerated on VectorStoreRecordKeyProperty - this would add it for all connectors (most would ignore it). Alternatively, implement this later as a connector-specific extension (see #10359).
    • We'd still have to leave the return type on UpsertAsync and UpsertBatchAsync - even if no connector currently supports it - to not block this in the future.
  • Check the actual implementation of the feature across supported providers; IIRC PG and SQLite do not implement this.

@roji roji assigned westey-m and roji and unassigned roji Mar 19, 2025
@roji roji changed the title .Net MEVD: Investigate and decide what we want to do with UpsertBatchAsync and returning IDs [MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) Apr 7, 2025
@roji roji moved this from Backlog: Planned to Sprint: In Progress in Semantic Kernel Apr 7, 2025
@roji roji moved this from Sprint: In Progress to Sprint: Planned in Semantic Kernel Apr 7, 2025
@roji
Copy link
Member Author

roji commented Apr 7, 2025

Thinking a bit more about our error behavior, I realized that our support for database-generated keys makes our Upsert API non-idempotent. That is, as long as the user provides key values, they can simply retry operations (e.g. because of timeout errors or whatever). But the moment database key generation is involved, retrying on error could insert an additional row (leading to a duplicate).

Fortunately, as far as we know, the databases that actually support database key generation are relational (SQL) databases, which also provide strong consistency guarantees; in those cases, any error should cause the entire batch to be rolled back anyway. But @westey-m @adamsitnik I think someone mentioned that there may be other non-SQL databases supporting key generation as well... (UPDATED: There are)

Overall I'd have preferred to simply not support database key generation in day 1, and only introduce it later based on user demand, but unfortunately this feature requires an API change (Upsert returning value(s)), so we can't do that without breaking. Hopefully it all works out.

/cc @westey-m @dmytrostruk @adamsitnik

@westey-m
Copy link
Contributor

westey-m commented Apr 7, 2025

I did a pass through the different DBs we support or would still like to add support for, and this is my findings.
Out of the traditional VectorDBs, two can autogenerate, Milvus and Weaviate. The included links point at any relevant docs.

The idempotent upsert issue is definitely a concern. I think we have a few options:

  1. Only support id generation on upsert where the db guarantees strong consistency.
  2. Don't support id generation for now, but return ids from upsert to allow support later.
  3. Don't support id generation at all.
DB ID Status  
Azure AI Search Id required  
MongoDB Can auto generate https://www.geeksforgeeks.org/how-to-set-a-primary-key-in-mongoddb/
CosmosNoSQL Can auto generate GUID in SDK
CosmosMongoDB Can auto generate  
Chroma Id required chroma-core/chroma#979
InMemory Id required  
Milvus Can auto generate https://milvus.io/docs/primary-field.md
Pinecone Unclear, but seems required  
Postgres Can auto generate  
Qdrant Id required qdrant/qdrant#1535
Redis Unclear https://redis.io/docs/latest/commands/incr/
Sqlite Can auto generate https://www.sqlitetutorial.net/sqlite-autoincrement/
SqlServer Can auto generate https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity- property?view=sql-server-ver16#a-use-the-identity-property-with-create-table
Weaviate Can auto generate https://weaviate.io/developers/weaviate/manage-data/create#create-an-object-with-a-specified-id

CC @roji @dmytrostruk @adamsitnik

@roji
Copy link
Member Author

roji commented Apr 8, 2025

Thanks for doing the research @westey-m, that's super useful. It shows that there are multiple non-relational databases supporting database-generated keys (Weaviate, Milvus, MongoDB...). For me, this list is definitely enough to justify including the functionality in the abstraction.

The idempotent upsert issue is definitely a concern. I think we have a few options:
Only support id generation on upsert where the db guarantees strong consistency.

We could, though at the end of the day the non-atomicity is a limitation of the database, and I think it's OK to let the user decide whether they're willing to live with non-idempotent inserts or not... In other words, we can clearly document this on the abstraction API (for batch IVectorStoreRecordCollection.UpsertAsync), and refer the user to look at their specific implementations documentation of batch UpsertAsync). Beyond that it's really their choice - I wouldn't deny them the option just to protect them from possible errors.

Note that where users do want idempotency with database-generated keys, and the database doesn't support atomicity, they can still upsert one-by-one (the non-batching API). This allows them to precisely manage what behavior they want on errors etc., though it would have a performance cost where a database-native batching operation is available.

As an aside, this is why I think it does make sense to have separate insert, update and upsert operations... That allows making things far clearer around these guarantees, which really can vary by operation. But that's not up to us.

Don't support id generation for now, but return ids from upsert to allow support later.

We could, though it seems like we do need to decide now whether we are going to support this or not in the future, since it determines the return value of UpsertAsync... I wouldn't want us to leave the API in its current form (returning keys), and then later decide that we won't support this (because of the idempotency concern) and leave a dead return value (or break to remove it).

Don't support id generation at all.

That's indeed the safest option :) But I think you've shown that enough databases support this that we should too.

In summary:

  • I'm for including this functionality in MEVD.
  • I also think we shouldn't limit it for cases where idempotency can't be guaranteed - but we should document this very carefully.
  • Given that it's also quite pervasive, I think it's also OK to add an AutoGenerated flag on the record definition (and attribute). This would be ignored for databases where it isn't supported (that's OK).
  • As discussed before, IMHO we still should change the method to return IReadOnlyList instead of IAsyncEnumerable (that's #11402).
  • We don't strictly-speaking have to actually implement this for Build - having the correct API shape is the only thing we have to do. Though we can try doing this as a stretch goal.

@westey-m
Copy link
Contributor

westey-m commented Apr 8, 2025

This works for me. That's essentially where I was heading with Option 2 as well. I think we should have it on the interface and we can then add support later on a DB by DB basis where it makes sense. I also don't think we need any implementations for build.
And yes, let's change the response type.

Why use IReadOnlyList though? It's not like the connector requires it to stay stable after returning it, since the data is only specific to that call and not long lived. If the caller wants to modify it, why stop them? So, why not ICollection or IList?

@roji
Copy link
Member Author

roji commented Apr 8, 2025

Why use IReadOnlyList though? It's not like the connector requires it to stay stable after returning it, since the data is only specific to that call and not long lived. If the caller wants to modify it, why stop them? So, why not ICollection or IList?

I think that's just general practice when returning collection data from APIs when that data isn't meant to be mutable... It's mostly an indication to the user that this is returned data from the database, and that they can't mutate anything via that type. Otherwise some people might get an idea that the returned collection is somehow bound to the database, and modifying it actually does something.

Is this something you feel strongly about? It's another possible good question to raise in API review.

@roji roji closed this as completed Apr 8, 2025
@roji roji reopened this Apr 8, 2025
@roji roji removed the triage label Apr 8, 2025
@roji roji assigned roji and unassigned westey-m Apr 8, 2025
@roji roji moved this from Sprint: Planned to Sprint: In Progress in Semantic Kernel Apr 8, 2025
@westey-m
Copy link
Contributor

westey-m commented Apr 8, 2025

Otherwise some people might get an idea that the returned collection is somehow bound to the database

Really? Surely, that would be very strange for a list of returned ids when using an API that makes no claims to being bound to the DB, even in places where we deal with records.

Is this something you feel strongly about?

Not really no. Modifying the response during further processing is probably very much an edge case, so I don't think it matters too much.

@roji
Copy link
Member Author

roji commented Apr 8, 2025

Really? Surely, that would be very strange for a list of returned ids when using an API that makes no claims to being bound to the DB, even in places where we deal with records.

I agree and it's indeed a bit contrived...

In any case, since you're OK with it let's proceed with IReadOnlyList for now - but I'll raise this as a question for API review, I'm interested in the general response they give.

roji added a commit to roji/semantic-kernel that referenced this issue Apr 10, 2025
roji added a commit to roji/semantic-kernel that referenced this issue Apr 10, 2025
@roji roji closed this as completed Apr 10, 2025
@roji roji moved this from Sprint: In Progress to Sprint: Done in Semantic Kernel Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Features planned for next Build conference memory connector memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData .NET Issue or Pull requests regarding .NET code
Projects
Status: Sprint: Done
Development

No branches or pull requests

6 participants