-
Notifications
You must be signed in to change notification settings - Fork 3.7k
CASSANDRA-20567: SAI marks an index as non-empty when a partial partition/row modifications is flushed due to repair #4105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
@@ -218,7 +218,7 @@ private void flushVectorIndex(long startTime, Stopwatch stopwatch) throws IOExce | |||
private void completeIndexFlush(long cellCount, long startTime, Stopwatch stopwatch) throws IOException | |||
{ | |||
// create a completion marker indicating that the index is complete and not-empty | |||
ColumnCompletionMarkerUtil.create(indexDescriptor, indexIdentifier, false); | |||
ColumnCompletionMarkerUtil.create(indexDescriptor, indexIdentifier, cellCount == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the actual patch for the bug fix. Repair is causing multiple sstables to be generated, and the tests do partial partition/row updates, so SAI sees a column as having data, but it didn't write anything for this range... this leads to cellCount == 0
@@ -371,6 +371,8 @@ public void test() throws IOException | |||
.add(this::selectTokenRange)) | |||
.addIf(State::hasEnoughMemtable, StatefulASTBase::flushTable) | |||
.addIf(State::hasEnoughSSTables, StatefulASTBase::compactTable) | |||
.addAllIf(BaseState::allowRepair, b -> b.add(StatefulASTBase::incrementalRepair) | |||
.addIf(BaseState::isConsistent, StatefulASTBase::previewRepair)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preview repairs fails if it finds a mismatch, so we can only call it safely in the test when we know we should be consistent.
for accord, we flush during the prepare
phase, which is before we do any barriers, so we can't just say "accord makes us consistent", because we might not be (you need to go through accord to be consistent).
@@ -144,7 +149,7 @@ protected static String nextKeyspace() | |||
|
|||
protected void clusterConfig(IInstanceConfig config) | |||
{ | |||
|
|||
config.set("repair.retries.max_attempts", Integer.MAX_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are jvm-dtest clusters, retry non stop until it works... these tests don't do chaos testing, so its really the "happy path" case
@@ -25,7 +25,7 @@ | |||
import org.apache.cassandra.harry.gen.SchemaGenerators; | |||
import org.apache.cassandra.service.consensus.TransactionalMode; | |||
|
|||
@Ignore("CASSANDRA-20567: Repair is failing due to missing SAI index files when using zero copy streaming") | |||
@Ignore("It was believed that these tests were failing due to CASSANDRA-20567, but in fixing that issue it was found that the tests are still failing! Harry is detecting an incorrect response...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was really hoping fixing this patch would fix these tests, so we have to root cause more.
I had a ton of seeds from CI, and non reproduced locally =(
@@ -557,53 +554,12 @@ static RepairOption previewOption(RandomSource rs, Cluster.Node coordinator, Str | |||
|
|||
private static RepairOption repairOption(RandomSource rs, Cluster.Node coordinator, String ks, Gen<List<String>> tablesGen, Gen<RepairType> repairTypeGen, Gen<PreviewType> previewTypeGen, Gen<RepairParallelism> repairParallelismGen) | |||
{ | |||
RepairType type = repairTypeGen.next(rs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than having different tests randomize repair stuff, i pulled this out into RepairGenerators
so this logic could be reused
if (nextBoolean(rnd)) | ||
options.put(LeveledCompactionStrategy.LEVEL_FANOUT_SIZE_OPTION, SourceDSL.integers().between(1, 100).generate(rnd).toString()); | ||
{ | ||
// there is a relationship between sstable size and fanout, so respect it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one seed randomly hit this case, so fix it so we don't happen again...
i am going to move the fuzz test logic out of this patch, as it keeps being flakey due to ZCS allocating direct memory and failing. I am testing removing this allocation (which should make the test no longer flakey), but that has nothing to do with SAI so best to do as another ticket. |
0d6d5a8
to
77c6170
Compare
…tion/row modifications is flushed due to repair
db40871
to
bdbc5a1
Compare
No description provided.