Just to add a sense of scale to the problem, commits with adds take anywhere from 150-400 ms, and commits with removes take anywhere from 15-25 seconds, as measured on the waiting end of an HTTP connection to the repository.
I've been familiarizing myself more with the NativeStore code, and it looks like it should be relatively easy to maintain a SequentialRecordCache of all triples removed during a transaction, solely for the purpose of speeding up the commit (i.e. not changing any of the existing logic, just dumping the RecordIterator into an additional transaction-scoped cache in removeTriples(RecordIterator iter), and using that cache to produce the iterator of deletes in place of the expensive scan at commit time).
I may try to do this on my own, as the issue is serious for us. I'll post the patch here if I am successful.
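For anyone following along, the idea of a transaction-scoped removal cache can be sketched with a toy model. This is only an illustration under stated assumptions: ToyTripleIndex, commitWithScan, commitWithCache and visitedAtCommit are hypothetical names, an in-memory TreeMap stands in for the on-disk index, and none of it is the actual NativeStore or SequentialRecordCache code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model; NOT the real NativeStore index.
class ToyTripleIndex {
    // record -> "flagged as removed in the current transaction"
    private final TreeMap<String, Boolean> records = new TreeMap<>();
    // Transaction-scoped cache of the records flagged as removed
    private final List<String> txnRemoved = new ArrayList<>();
    // Number of records visited by the most recent commit
    int visitedAtCommit = 0;

    void add(String record) {
        records.put(record, Boolean.FALSE);
    }

    void remove(String record) {
        // Flag only records that exist and are not flagged yet
        if (Boolean.FALSE.equals(records.get(record))) {
            records.put(record, Boolean.TRUE);
            txnRemoved.add(record);
        }
    }

    // Commit by scanning every record and discarding the flagged ones
    void commitWithScan() {
        visitedAtCommit = 0;
        Iterator<Map.Entry<String, Boolean>> it = records.entrySet().iterator();
        while (it.hasNext()) {
            visitedAtCommit++;
            if (it.next().getValue()) {
                it.remove();
            }
        }
        txnRemoved.clear();
    }

    // Commit by visiting only the records remembered in the cache
    void commitWithCache() {
        visitedAtCommit = 0;
        for (String record : txnRemoved) {
            visitedAtCommit++;
            records.remove(record);
        }
        txnRemoved.clear();
    }

    int size() {
        return records.size();
    }
}
```

In this toy, the scan-based commit visits every record in the index, while the cache-based commit visits only the records removed in the transaction. That difference is the whole point of the proposal; the real store has on-disk indexes and more flags to deal with.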
I've attached one approach to speeding this up (modified from the 2.2.1 branch); please let me know if there's any reason this shouldn't work (it seems to be OK in my testing, but there are no unit tests for removing statements, and I don't know if there was a rationale for not taking this approach in the first place). Since the updatedRecordsCache already contains deleted records, I just wiped out the separate handling of deleted records in the commit() method, and now delete each record as it comes up in the updated iterator.
i.e. where once the code had this:
    if (removed) {
        // Record has been discarded earlier, do not put it back in!
        continue;
    }
    if (added || toggled) {
I changed it to this:
    if (removed) {
        btree.remove(data);
    }
    else if (added || toggled) {
and just completely removed the offending scan:
    if (txnRemovedTriples) {
        RecordIterator iter = getTriples(-1, -1, -1, -1, REMOVED_FLAG, REMOVED_FLAG);
        try {
            discardTriples(iter);
        }
        finally {
            txnRemovedTriples = false;
            iter.close();
        }
    }
Was there a reason the deletes were handled in a separate iteration this way? E.g., are there circumstances under which one would expect there to be deleted statements that were not already in the updatedRecordsCache, or performance implications to mixing deletes and updates in a single iteration?
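To make the change above easier to reason about, here is a minimal sketch of a commit that handles deletes and adds in a single pass over the updated records. It is a toy under stated assumptions: ToyFlaggedStore, ADDED_FLAG, updatedRecords and flagsOf are hypothetical names, only "added" and "removed" are modelled (no "toggled"), and a TreeMap stands in for the B-tree. It is not the NativeStore code.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model; NOT the real NativeStore TripleStore.
class ToyFlaggedStore {
    static final int ADDED_FLAG = 1;   // assumed flag value, for illustration
    static final int REMOVED_FLAG = 2; // assumed flag value, for illustration

    // record -> transaction flags (0 means committed, nothing pending)
    private final TreeMap<String, Integer> btree = new TreeMap<>();
    // Records touched in the current transaction, adds and removes alike
    private final Set<String> updatedRecords = new LinkedHashSet<>();

    void add(String record) {
        Integer flags = btree.get(record);
        if (flags == null) {
            btree.put(record, ADDED_FLAG);
            updatedRecords.add(record);
        } else if ((flags & REMOVED_FLAG) != 0) {
            // Re-adding in the same transaction cancels the pending removal
            btree.put(record, flags & ~REMOVED_FLAG);
        }
    }

    void remove(String record) {
        Integer flags = btree.get(record);
        if (flags != null && (flags & REMOVED_FLAG) == 0) {
            btree.put(record, flags | REMOVED_FLAG);
            updatedRecords.add(record);
        }
    }

    // Single pass: delete removed records, clear flags on added ones
    void commit() {
        for (String record : updatedRecords) {
            int flags = btree.get(record);
            if ((flags & REMOVED_FLAG) != 0) {
                btree.remove(record);
            } else if ((flags & ADDED_FLAG) != 0) {
                btree.put(record, 0);
            }
        }
        updatedRecords.clear();
    }

    boolean contains(String record) {
        return btree.containsKey(record);
    }

    // Returns -1 when the record is not in the store
    int flagsOf(String record) {
        Integer flags = btree.get(record);
        return flags == null ? -1 : flags;
    }

    int size() {
        return btree.size();
    }
}
```

The sketch only works because every removed record is also tracked in updatedRecords. A removed record missing from that set would survive the commit, which is exactly the situation the question above asks about.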
Just adding another data point: I've been testing the attached fix under load, and commit times with deletes are consistently under 1 second, which is much more acceptable for our usage (which at peak times involves batch updates every 20-30 seconds).
Attached changes as a patch file (for convenience).
This patch continues to be stable and much speedier for us!
Our server uses a NativeStore, and removing statements from this store via an HTTP repository connection takes an extremely long time. When we request the size of the store, it reports a little over 11 million triples, while we expect there to be a lot more. We are not using contexts in this store.
We have tried to implement the proposed fix from Alex Vigdor; however, with this fix the statements are not removed from the store, and the operation still takes a long time.
Just curious why the fix didn't work for you - what version of Sesame did you apply it to? How exactly did you implement the fix, manually based on my notes, or by applying the patch file? Can you post code showing how you use the store, especially how transactions are handled and committed? We've had this fix in production for months on multiple servers in multiple environments and it works like a charm for us - we're up to 70 million triples by now, from which we add and remove thousands every day.
We tried out the patch and it left triples behind that we didn't expect. That is, a query against the patched store would return more results than expected.
Carl, I'd repeat the same questions to you - what version of Sesame did you apply the patch to? Can you provide any code or query samples?
I'm also curious whether you're using contexts. All of our statements live in named contexts, so perhaps there's something different going on if you're not using contexts.
Just saw Alex's comment.
We are using 2.2.2
All triples are using named contexts.
We are executing a query on a connection and then calling remove on the same connection for each result. This works fine on a memory store, but we gave up waiting on a native store. There are around 10M triples in the native store at this time.
Using your patch, the operation completed in around 5 minutes; however, we started to see unexpected triples being returned in queries.
Sorry, we don't have any sample code, as we saw this in our product.
Carl,
You might want to try updating to 2.2.3 - there was a nasty bug fixed in a library underlying the NativeStore that could cause corruption in the indices, which I did see regularly, though I'm not sure whether that would explain the unexpected triples you're seeing.
We're also using a query to determine triples to add/remove. A potential optimization you might want to consider is rather than removing each result as you read it, read all the results into a Collection, then remove the entire collection. Our code looks like this:
    RepositoryConnection conn = repository.getConnection();
    try {
        conn.setAutoCommit(false);
        Graph newGraph = new GraphImpl();
        // some code here populates the new graph for a given resource
        // ...
        // the query builder returns a serql query that returns the old graph for the resource
        GraphQuery oldQuery = queryBuilder.buildQuery(resource);
        Graph oldGraph = new GraphImpl();
        Iterations.addAll(oldQuery.evaluate(), oldGraph);
        Collection<Statement> toDelete = CollectionUtils.subtract(oldGraph, newGraph);
        Collection<Statement> toAdd = CollectionUtils.subtract(newGraph, oldGraph);
        conn.remove(toDelete, context);
        conn.add(toAdd, context);
        conn.commit();
    }
    catch (Throwable e) {
        try {
            conn.rollback();
        }
        catch (Throwable e1) {}
        if (e instanceof Exception) {
            throw (Exception) e;
        }
        throw new Exception(e);
    }
    finally {
        try {
            conn.close();
        }
        catch (Throwable e1) {}
    }
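In case it helps anyone adapting this without Commons Collections on the classpath, the two subtract steps can be sketched with plain java.util sets. This assumes set semantics (no duplicate statements), whereas CollectionUtils.subtract respects cardinality. GraphDiff is a hypothetical helper name, and in this sketch any element type with proper equals/hashCode can stand in for Statement.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; a stdlib-only stand-in for the subtract calls above.
class GraphDiff {
    // Returns the elements of a that are not in b, leaving both inputs unchanged
    static <T> Set<T> subtract(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

With this helper, toDelete would be GraphDiff.subtract(oldStatements, newStatements) and toAdd would be GraphDiff.subtract(newStatements, oldStatements), so statements present in both graphs are never touched.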
Removal of a (relatively) small set of statements has been sped up by using the updatedTriplesCache for this too.