[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

roji · 2025-02-26T10:56:02Z

As an aside, returning IAsyncEnumerable from UpsertBatch may be something we'd want to reexamine. First, it seems that only a small minority of databases actually support database-generated keys/ids (basically the relational databases like PG) - I'm not sure if we've fully mapped out the support, but if that's true, I'm not sure the abstraction should have an API feature which only works for the minority etc.

In addition, I'm not sure if database-generated keys correspond to the way that vector stores are generally used - this would probably be why no native vector database AFAIK actually supports this feature (please correct me if I'm wrong). If that's true, then we're supporting them in the vector database abstraction because relational databases happen to have them for other, non-vector scenarios.

Note that I wouldn't raise this if returning IAsyncEnumerable didn't create API usability issues. Currently, inserting requires the following code:

await foreach (var _ in this.Collection.UpsertBatchAsync(this.TestData))
{
}

This is because one cannot simply await an IAsyncEnumerable. I want to make sure we're not forcing users to do this every time they want to add some records, only because some databases support database key generation, and even then, when vector store usage typically wouldn't require actually fetching the key back at the point where it's inserted.

I'm very open to being convinced otherwise here, but the fact that dedicated vector stores don't support database-generated keys feels like a strong indication that they don't belong on the API surface of the abstraaction.

/cc @dmytrostruk @westey-m @adamsitnik

The text was updated successfully, but these errors were encountered:

dmytrostruk · 2025-02-26T18:47:26Z

Note that I wouldn't raise this if returning IAsyncEnumerable didn't create API usability issues.

It also includes an exception handling, which I described here #10640 (comment) . But I think usability can be always simplified. For example, if you are not interested in returned keys, instead of doing:

await foreach (var _ in this.Collection.UpsertBatchAsync(this.TestData))
{
}

It's possible to do:

await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync();

But this is topic about having IAsyncEnumerable vs IReadOnlyList as a return type.

If we are talking about "return something" vs "return nothing":

If we have an API for upsert operation that returns keys - we cover both cases when DB supports key generation (in this case, we return them from database) and when DB doesn't support key generation (in this case, we return keys that were generated by user).
If we have an API that doesn't return anything, then it will be harder to operate with records with DB-generated keys. User upserts a record, but can't get it back by ID, since we didn't return it back, so the only way how to get that record would be to find it by some vector/non-vector property. So, this would impact also GetAsync method from IVectorStoreRecordCollection interface. But I still think that having an ability to get record by ID (whether it's generated by DB or by user) is very important.

roji · 2025-02-27T06:55:56Z

instead of doing [...] It's possible to do [...]:

await this.Collection.UpsertBatchAsync(this.TestData).ToListAsync();

Note that there's no ToListAsync() in before .NET 10 (which hasn't yet been released) - one has to reference the System.Linq.Async package in order to do that (and that's pretty undiscoverable - one has to someone know that one needs to reference that package).

Even then, I can see lots of people simply omitting the ToListAsync(), and as a result their code wouldn't actually wait until the upsert is complete; that's a bit similar to forgetting an await. It's true that modern .NET SDKs emit a warning for missing await (and also unconsumed IAsyncEnumerable returned from methods). But I'm still slightly worried that with this method, where users actually interested in the return value is going to be a pretty small minority (as most databases simply don't support it) will miss it.

If we have an API that doesn't return anything, then it will be harder to operate with records with DB-generated keys.

That's true. I guess the thing I'm really interested in here, is why no native (non-relational) vector database supports key generation (AFAIK); the only databases which seem to support it are relational databases, and that's probably the case for other scenarios which have nothing to do with vectors usage. That feels like it means that database-generated keys may simply not make much sense in typical vector database usage.

eavanvalkenburg · 2025-03-14T14:28:28Z

Just to chime in, for python a AsyncIterable is a much worse dev experience compared to a List as return, I would be open to returning List | None since then we have the flexibility to implement based on the capabilities of the store, and that means in practice that most will not return anything. Of course that might make sense for python and not for dotnet!

roji · 2025-03-14T17:08:18Z

@eavanvalkenburg that would simplify the .NET dev experience a bit more... For cases where you don't care about the IDs (because you're providing them and not db-generating them, because the database doesn't even support ID generation - like most cases), users currently have to consume the returned IAsyncEnumerable (e.g. await collection.UpsertBatchAsync().ToListAsync() - this also requires an external System.Linq.Async package currently). Plus actually streaming the returned IDs (as opposed to returning a non-streaming List) is unlikely to be very important - streaming is an important perf feature in cases where getting the 1st element back (before the rest) would allow you to start important work earlier, and/or when each item would take a long time, which doesn't seem to be the case for database row insertion.

So changing the result type from IAsyncEnumerable to IReadOnlyList could be a sensible compromise in .NET too...

roji · 2025-03-19T22:50:58Z

Discussed this a bit more, next steps:

Confirm that there actually are dedicated (non-relational) database that do database ID generation (@westey-m).
Assuming we keep the feature (likely), consider replacing the current IAsyncEnumerable with e.g. IReadOnlyList as suggested above by @eavanvalkenburg; this would improve the dev experience, and there's likely to be very little value in actually streaming the IDs coming back from the database.
Decide how the user should configure this in the record definition; options are to actually introduce AutoGenerated on VectorStoreRecordKeyProperty - this would add it for all connectors (most would ignore it). Alternatively, implement this later as a connector-specific extension (see #10359).
- We'd still have to leave the return type on UpsertAsync and UpsertBatchAsync - even if no connector currently supports it - to not block this in the future.
Check the actual implementation of the feature across supported providers; IIRC PG and SQLite do not implement this.

roji · 2025-04-07T10:18:22Z

Thinking a bit more about our error behavior, I realized that our support for database-generated keys makes our Upsert API non-idempotent. That is, as long as the user provides key values, they can simply retry operations (e.g. because of timeout errors or whatever). But the moment database key generation is involved, retrying on error could insert an additional row (leading to a duplicate).

Fortunately, as far as we know, the databases that actually support database key generation are relational (SQL) databases, which also provide strong consistency guarantees; in those cases, any error should cause the entire batch to be rolled back anyway. But @westey-m @adamsitnik I think someone mentioned that there may be other non-SQL databases supporting key generation as well... (UPDATED: There are)

Overall I'd have preferred to simply not support database key generation in day 1, and only introduce it later based on user demand, but unfortunately this feature requires an API change (Upsert returning value(s)), so we can't do that without breaking. Hopefully it all works out.

/cc @westey-m @dmytrostruk @adamsitnik

Closes microsoft#10692

westey-m · 2025-04-07T16:06:38Z

I did a pass through the different DBs we support or would still like to add support for, and this is my findings.
Out of the traditional VectorDBs, two can autogenerate, Milvus and Weaviate. The included links point at any relevant docs.

The idempotent upsert issue is definitely a concern. I think we have a few options:

Only support id generation on upsert where the db guarantees strong consistency.
Don't support id generation for now, but return ids from upsert to allow support later.
Don't support id generation at all.

DB	ID Status
Azure AI Search	Id required
MongoDB	Can auto generate	https://www.geeksforgeeks.org/how-to-set-a-primary-key-in-mongoddb/
CosmosNoSQL	Can auto generate GUID in SDK
CosmosMongoDB	Can auto generate
Chroma	Id required	chroma-core/chroma#979
InMemory	Id required
Milvus	Can auto generate	https://milvus.io/docs/primary-field.md
Pinecone	Unclear, but seems required
Postgres	Can auto generate
Qdrant	Id required	qdrant/qdrant#1535
Redis	Unclear	https://redis.io/docs/latest/commands/incr/
Sqlite	Can auto generate	https://www.sqlitetutorial.net/sqlite-autoincrement/
SqlServer	Can auto generate	https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity- property?view=sql-server-ver16#a-use-the-identity-property-with-create-table
Weaviate	Can auto generate	https://weaviate.io/developers/weaviate/manage-data/create#create-an-object-with-a-specified-id

CC @roji @dmytrostruk @adamsitnik

roji · 2025-04-08T09:31:33Z

Thanks for doing the research @westey-m, that's super useful. It shows that there are multiple non-relational databases supporting database-generated keys (Weaviate, Milvus, MongoDB...). For me, this list is definitely enough to justify including the functionality in the abstraction.

The idempotent upsert issue is definitely a concern. I think we have a few options:
Only support id generation on upsert where the db guarantees strong consistency.

We could, though at the end of the day the non-atomicity is a limitation of the database, and I think it's OK to let the user decide whether they're willing to live with non-idempotent inserts or not... In other words, we can clearly document this on the abstraction API (for batch IVectorStoreRecordCollection.UpsertAsync), and refer the user to look at their specific implementations documentation of batch UpsertAsync). Beyond that it's really their choice - I wouldn't deny them the option just to protect them from possible errors.

Note that where users do want idempotency with database-generated keys, and the database doesn't support atomicity, they can still upsert one-by-one (the non-batching API). This allows them to precisely manage what behavior they want on errors etc., though it would have a performance cost where a database-native batching operation is available.

As an aside, this is why I think it does make sense to have separate insert, update and upsert operations... That allows making things far clearer around these guarantees, which really can vary by operation. But that's not up to us.

Don't support id generation for now, but return ids from upsert to allow support later.

We could, though it seems like we do need to decide now whether we are going to support this or not in the future, since it determines the return value of UpsertAsync... I wouldn't want us to leave the API in its current form (returning keys), and then later decide that we won't support this (because of the idempotency concern) and leave a dead return value (or break to remove it).

Don't support id generation at all.

That's indeed the safest option :) But I think you've shown that enough databases support this that we should too.

In summary:

I'm for including this functionality in MEVD.
I also think we shouldn't limit it for cases where idempotency can't be guaranteed - but we should document this very carefully.
Given that it's also quite pervasive, I think it's also OK to add an AutoGenerated flag on the record definition (and attribute). This would be ignored for databases where it isn't supported (that's OK).
As discussed before, IMHO we still should change the method to return IReadOnlyList instead of IAsyncEnumerable (that's #11402).
We don't strictly-speaking have to actually implement this for Build - having the correct API shape is the only thing we have to do. Though we can try doing this as a stretch goal.

westey-m · 2025-04-08T09:55:00Z

This works for me. That's essentially where I was heading with Option 2 as well. I think we should have it on the interface and we can then add support later on a DB by DB basis where it makes sense. I also don't think we need any implementations for build.
And yes, let's change the response type.

Why use IReadOnlyList though? It's not like the connector requires it to stay stable after returning it, since the data is only specific to that call and not long lived. If the caller wants to modify it, why stop them? So, why not ICollection or IList?

roji · 2025-04-08T10:18:26Z

Why use IReadOnlyList though? It's not like the connector requires it to stay stable after returning it, since the data is only specific to that call and not long lived. If the caller wants to modify it, why stop them? So, why not ICollection or IList?

I think that's just general practice when returning collection data from APIs when that data isn't meant to be mutable... It's mostly an indication to the user that this is returned data from the database, and that they can't mutate anything via that type. Otherwise some people might get an idea that the returned collection is somehow bound to the database, and modifying it actually does something.

Is this something you feel strongly about? It's another possible good question to raise in API review.

westey-m · 2025-04-08T11:02:41Z

Otherwise some people might get an idea that the returned collection is somehow bound to the database

Really? Surely, that would be very strange for a list of returned ids when using an API that makes no claims to being bound to the DB, even in places where we deal with records.

Is this something you feel strongly about?

Not really no. Modifying the response during further processing is probably very much an edge case, so I don't think it matters too much.

roji · 2025-04-08T12:22:13Z

Really? Surely, that would be very strange for a list of returned ids when using an API that makes no claims to being bound to the DB, even in places where we deal with records.

I agree and it's indeed a bit contrived...

In any case, since you're OK with it let's proceed with IReadOnlyList for now - but I'll raise this as a question for API review, I'm interested in the general response they give.

Closes microsoft#10692

Closes #10692

markwallace-microsoft added .NET Issue or Pull requests regarding .NET code triage labels Feb 26, 2025

roji mentioned this issue Feb 26, 2025

.Net MEVD: Decide and align error behavior when upserts fail #10640

Open

markwallace-microsoft added this to Semantic Kernel Feb 27, 2025

markwallace-microsoft removed the triage label Feb 27, 2025

sophialagerkranspandey assigned roji Feb 27, 2025

markwallace-microsoft added the msft.ext.vectordata Related to Microsoft.Extensions.VectorData label Feb 27, 2025

sophialagerkranspandey removed the msft.ext.vectordata Related to Microsoft.Extensions.VectorData label Feb 27, 2025

markwallace-microsoft moved this to Backlog: Planned in Semantic Kernel Feb 27, 2025

roji mentioned this issue Mar 4, 2025

.Net: IVectorStore implementation for Azure SQL #10623

Merged

roji added memory connector memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData Build Features planned for next Build conference labels Mar 5, 2025

adamsitnik mentioned this issue Mar 14, 2025

Bug: Upsert method of SQL Server connectors returns only inserted keys, instead of all keys #10979

Closed

roji changed the title ~~.Net MEVD: Discuss the returning of IAsyncEnumerable from UpsertBatchAsync~~ .Net MEVD: Investigate and decide what we want to do with UpsertBatchAsync and returning IDs Mar 19, 2025

roji assigned westey-m and roji and unassigned roji Mar 19, 2025

roji mentioned this issue Mar 31, 2025

.Net: [MEVD] Consider returning a simple list from IVectorStore.ListCollectionNamesAsync instead of IAsyncEnumerable #11291

Open

roji changed the title ~~.Net MEVD: Investigate and decide what we want to do with UpsertBatchAsync and returning IDs~~ [MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) Apr 7, 2025

roji unassigned westey-m Apr 7, 2025

roji moved this from Backlog: Planned to Sprint: In Progress in Semantic Kernel Apr 7, 2025

roji moved this from Sprint: In Progress to Sprint: Planned in Semantic Kernel Apr 7, 2025

roji mentioned this issue Apr 7, 2025

.Net: [MEVD] Figure out our policy for null keys in upserts #11401

Open

roji added a commit to roji/semantic-kernel that referenced this issue Apr 7, 2025

Change UpsertAsync return type to IReadOnlyList<TKey>

4a57c6d

Closes microsoft#10692

roji mentioned this issue Apr 7, 2025

.Net: Change UpsertAsync return type to IReadOnlyList<TKey> #11402

Merged

roji assigned westey-m and unassigned roji Apr 7, 2025

roji closed this as completed Apr 8, 2025

roji reopened this Apr 8, 2025

markwallace-microsoft added the triage label Apr 8, 2025

roji removed the triage label Apr 8, 2025

roji assigned roji and unassigned westey-m Apr 8, 2025

roji moved this from Sprint: Planned to Sprint: In Progress in Semantic Kernel Apr 8, 2025

roji mentioned this issue Apr 10, 2025

.Net: [MEVD] Implement store key generation where supported #11485

Open

8 tasks

roji added a commit to roji/semantic-kernel that referenced this issue Apr 10, 2025

Change UpsertAsync return type to IReadOnlyList<TKey>

e296ea7

Closes microsoft#10692

roji added a commit to roji/semantic-kernel that referenced this issue Apr 10, 2025

Change UpsertAsync return type to IReadOnlyList<TKey>

69ecf92

Closes microsoft#10692

roji added a commit that referenced this issue Apr 10, 2025

.Net: Change UpsertAsync return type to IReadOnlyList<TKey> (#11402)

ae0fa62

Closes #10692

roji closed this as completed Apr 10, 2025

roji moved this from Sprint: In Progress to Sprint: Done in Semantic Kernel Apr 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

roji commented Feb 26, 2025 •

edited

Loading

dmytrostruk commented Feb 26, 2025

roji commented Feb 27, 2025

eavanvalkenburg commented Mar 14, 2025

roji commented Mar 14, 2025

roji commented Mar 19, 2025

roji commented Apr 7, 2025 •

edited

Loading

westey-m commented Apr 7, 2025

roji commented Apr 8, 2025 •

edited

Loading

westey-m commented Apr 8, 2025

roji commented Apr 8, 2025

westey-m commented Apr 8, 2025 •

edited

Loading

roji commented Apr 8, 2025

[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

[MEVD] Have batching UpsertAsync return IReadOnlyList instead of IAsyncEnumerable (for database-generated IDs) #10692

Comments

roji commented Feb 26, 2025 • edited Loading

dmytrostruk commented Feb 26, 2025

roji commented Feb 27, 2025

eavanvalkenburg commented Mar 14, 2025

roji commented Mar 14, 2025

roji commented Mar 19, 2025

roji commented Apr 7, 2025 • edited Loading

westey-m commented Apr 7, 2025

roji commented Apr 8, 2025 • edited Loading

westey-m commented Apr 8, 2025

roji commented Apr 8, 2025

westey-m commented Apr 8, 2025 • edited Loading

roji commented Apr 8, 2025

roji commented Feb 26, 2025 •

edited

Loading

roji commented Apr 7, 2025 •

edited

Loading

roji commented Apr 8, 2025 •

edited

Loading

westey-m commented Apr 8, 2025 •

edited

Loading