
Key: SES-628
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Arjohn Kampman
Reporter: Alex Vigdor
Votes: 1
Watchers: 1
Sesame

Removing statements is extremely slow on large repositories

Created: 29/Oct/08 10:24 PM   Updated: 20/Oct/09 05:23 PM
Component/s: Native Sail
Affects Version/s: 2.2.1, 2.1.4, 2.2, 2.1.3
Fix Version/s: 2.3.0

File Attachments: 1. Java Source File TripleStore.java (28 kb)
2. Text File TripleStore.patch (1 kb)



 Description   
In working with a large repository (50 million triples), I have noticed that committing transactions is quite speedy when just adding statements, but removing even a handful of statements is extremely slow. Looking at the TripleStore, I see that removed statements cause this call to be made during commit():

RecordIterator iter = getTriples(-1, -1, -1, -1, REMOVED_FLAG, REMOVED_FLAG);

This essentially forces a sequential scan of the entire repository for statements marked as removed; needless to say this is extremely inefficient when trying to remove 10 out of 50 million statements! There should be some sort of explicit index for removed triples so that this can be handled efficiently.

Comment by Alex Vigdor [29/Oct/08 10:41 PM]
Just to add a sense of scale to the problem, commits with adds take anywhere from 150-400 ms, and commits with removes take anywhere from 15-25 seconds, as measured on the waiting end of an HTTP connection to the repository.

Comment by Alex Vigdor [30/Oct/08 02:23 PM]
I've been familiarizing myself more with the NativeStore code, and it looks like it should be relatively easy to maintain a SequentialRecordCache for all triples removed during a transaction, solely for the purpose of speeding up the commit (i.e. not changing any of the existing logic, just dumping the RecordIterator into an additional transaction-scoped cache in removeTriples(RecordIterator iter), and using that cache to produce the iterator of deletes in place of the expensive scan at commit time).

I may try to do this on my own, as the issue is serious for us. I'll post the patch here if I am successful.
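
The idea above can be illustrated with a toy model. This is not NativeStore code: the class `ToyTripleStore` and all of its methods are hypothetical, a `TreeMap` stands in for the on-disk B-tree, and a plain list stands in for the transaction-scoped cache. The sketch only contrasts how many records each commit strategy has to examine.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for the B-tree index and a Boolean
// value stands in for the REMOVED flag. All names here are hypothetical.
class ToyTripleStore {
    private final TreeMap<Long, Boolean> index = new TreeMap<>(); // key -> removed?
    private final List<Long> removedInTxn = new ArrayList<>();    // transaction-scoped cache

    void add(long key) {
        index.put(key, Boolean.FALSE);
    }

    // Flags a record as removed and remembers its key for the commit.
    void remove(long key) {
        Boolean removed = index.get(key);
        if (removed != null && !removed) {
            index.put(key, Boolean.TRUE);
            removedInTxn.add(key);
        }
    }

    // Commit by scanning every record for the removed flag.
    // Returns the number of records examined.
    int commitByScan() {
        int examined = 0;
        Iterator<Map.Entry<Long, Boolean>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            examined++;
            if (it.next().getValue()) {
                it.remove();
            }
        }
        removedInTxn.clear();
        return examined;
    }

    // Commit by visiting only the keys remembered during the transaction.
    // Returns the number of records examined.
    int commitByCache() {
        int examined = 0;
        for (Long key : removedInTxn) {
            examined++;
            index.remove(key);
        }
        removedInTxn.clear();
        return examined;
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        ToyTripleStore scan = new ToyTripleStore();
        ToyTripleStore cached = new ToyTripleStore();
        for (long k = 0; k < 100_000; k++) {
            scan.add(k);
            cached.add(k);
        }
        for (long k = 0; k < 10; k++) {
            scan.remove(k);
            cached.remove(k);
        }
        System.out.println("scan examined:  " + scan.commitByScan());
        System.out.println("cache examined: " + cached.commitByCache());
    }
}
```

In this model the scan touches every record in the index, while the cache-based commit touches only the records removed in the transaction. The real store has several indexes and on-disk records, so actual costs will differ.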

Comment by Alex Vigdor [30/Oct/08 04:05 PM]
I've attached one approach to speeding this up (modified from the 2.2.1 branch); please let me know if there's any reason this shouldn't work (seems to be OK in my testing, but there are no unit tests for removing statements, and I don't know if there was a rationale for not taking this approach in the first place). Since the updatedRecordsCache already contains deleted records, I just wiped out the separate handling of deleted records in the commit() method, and delete each record as it comes up in the updated iterator.

i.e. where once the code had this:

if (removed) {
    // Record has been discarded earlier, do not put it back in!
    continue;
}
if (added || toggled) {


I changed it to this:


if (removed) {
    btree.remove(data);
}
else if (added || toggled) {


and just completely removed the offending scan:

if (txnRemovedTriples) {
    RecordIterator iter = getTriples(-1, -1, -1, -1, REMOVED_FLAG, REMOVED_FLAG);
    try {
        discardTriples(iter);
    }
    finally {
        txnRemovedTriples = false;
        iter.close();
    }
}


Was there a reason the deletes were handled in a separate iteration this way? E.g., are there circumstances under which one would expect there to be deleted statements that were not already in the updatedRecordsCache, or performance implications to mixing deletes and updates in a single iteration?
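
To make the question concrete, here is a minimal sketch of a single-pass commit over the records touched in a transaction. It is a toy model, not the TripleStore implementation: the class `SinglePassCommit`, the flag values, and the `updatedRecords` list are all hypothetical and only loosely mirror the snippets quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a single-pass commit. Flag values and names are hypothetical.
class SinglePassCommit {
    static final int ADDED = 1;
    static final int REMOVED = 2;

    final TreeMap<String, Integer> btree = new TreeMap<>();  // record -> flags
    final List<String> updatedRecords = new ArrayList<>();   // touched in this txn

    void add(String record) {
        btree.put(record, ADDED);
        updatedRecords.add(record);
    }

    void remove(String record) {
        Integer flags = btree.get(record);
        if (flags != null && (flags & REMOVED) == 0) {
            btree.put(record, flags | REMOVED);
            updatedRecords.add(record);
        }
    }

    // One iteration handles deletes and updates; no scan of the whole index.
    void commit() {
        for (String record : updatedRecords) {
            Integer flags = btree.get(record);
            if (flags == null) {
                continue; // already deleted via an earlier entry for the same record
            }
            if ((flags & REMOVED) != 0) {
                btree.remove(record);
            } else if ((flags & ADDED) != 0) {
                btree.put(record, 0); // clear the transient flag
            }
        }
        updatedRecords.clear();
    }
}
```

The sketch is only correct if every record flagged as removed also appears in the list of updated records. Any record flagged through some other path would be left behind, which is exactly the circumstance the question above asks about.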

Comment by Alex Vigdor [01/Nov/08 02:21 AM]
Just adding another data point: I've been testing the attached fix under load, and commit times with deletes are consistently under 1 second, which is much more acceptable for our usage (which at heavy times involves batch updates every 20-30 seconds)

Comment by Alex Vigdor [13/Nov/08 08:28 PM]
Attached changes as a patch file (for convenience).

This patch continues to be stable and much speedier for us!

Comment by Semlab development [05/Mar/09 11:04 AM]
Our server uses a NativeStore, and removing statements from this store via an HTTP repository connection takes an extremely long time. When we request the size of the store, it reports a little over 11 million triples, while we expect there to be a lot more. We are not using contexts in this store.

We have tried to implement the proposed fix from Alex Vigdor; however, with this fix the statements are not removed from the store, and it still takes a long time.

Comment by Alex Vigdor [06/Mar/09 12:33 AM]
Just curious why the fix didn't work for you - what version of Sesame did you apply it to? How exactly did you implement the fix, manually based on my notes, or by applying the patch file? Can you post code showing how you use the store, especially how transactions are handled and committed? We've had this fix in production for months on multiple servers in multiple environments and it works like a charm for us - we're up to 70 million triples by now, from which we add and remove thousands every day.

Comment by Carl Bray [25/Mar/09 01:52 PM]
We tried out the patch and it left triples behind that we didn't expect. That is, a query run with the patch applied would return more results than expected.

Comment by Alex Vigdor [01/Apr/09 08:40 PM]
Carl, I'd repeat the same questions to you - what version of Sesame did you apply the patch to? Can you provide any code or query samples?

I'm also curious whether you're using contexts. All of our statements live in named contexts; perhaps there's something different going on if you're not using contexts.

Comment by Carl Bray [14/Apr/09 04:57 PM]
Just saw Alex's comment.

We are using 2.2.2
All triples are using named contexts.

We are executing a query on a connection and then call remove on the same connection for each result. This works fine on a memory store but we gave up waiting on a native store. Around 10M triples in the native store at this time.

Using your patch, the operation completed in around 5 minutes; however, we started to see unexpected triples being returned in queries.

Sorry we don't have any sample code as we saw this in our product.

Comment by Alex Vigdor [14/Apr/09 08:15 PM]
Carl,
You might want to try updating to 2.2.3 - there was a nasty bug fixed in a library underlying the NativeStore that could cause corruption in the indices, which I did see regularly, though I'm not sure whether that would explain the unexpected triples you're seeing.

We're also using a query to determine triples to add/remove. A potential optimization you might want to consider: rather than removing each result as you read it, read all the results into a Collection, then remove the entire collection. Our code looks like this:

RepositoryConnection conn = repository.getConnection();
try {
    conn.setAutoCommit(false);
    Graph newGraph = new GraphImpl();
    // some code here populates the new graph for a given resource
    // ...
    // the query builder returns a serql query that returns the old graph for the resource
    GraphQuery oldQuery = queryBuilder.buildQuery(resource);
    Graph oldGraph = new GraphImpl();
    Iterations.addAll(oldQuery.evaluate(), oldGraph);
    Collection<Statement> toDelete = CollectionUtils.subtract(oldGraph, newGraph);
    Collection<Statement> toAdd = CollectionUtils.subtract(newGraph, oldGraph);
    conn.remove(toDelete, context);
    conn.add(toAdd, context);
    conn.commit();
}
catch (Throwable e) {
    try {
        conn.rollback();
    }
    catch (Throwable e1) {}
    if (e instanceof Exception) {
        throw (Exception) e;
    }
    throw new Exception(e);
}
finally {
    try {
        conn.close();
    }
    catch (Throwable e1) {}
}
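
The diffing step in the code above can be tried in isolation. The sketch below is an assumption-laden stand-in: it uses plain `java.util` sets instead of `CollectionUtils`, strings instead of `Statement` objects, and a hypothetical helper class `GraphDiff`. Unlike a bag-style subtraction, a set difference ignores duplicate elements, which should not matter when each statement occurs at most once per graph.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: computes which items to delete and which to add
// when replacing an old collection of statements with a new one.
class GraphDiff {
    // Elements of 'a' that do not occur in 'b', in the iteration order of 'a'.
    static <T> Set<T> subtract(Collection<T> a, Collection<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Strings stand in for statements here.
        List<String> oldGraph = Arrays.asList("s1", "s2", "s3");
        List<String> newGraph = Arrays.asList("s2", "s3", "s4");

        Set<String> toDelete = subtract(oldGraph, newGraph);
        Set<String> toAdd = subtract(newGraph, oldGraph);

        System.out.println("toDelete = " + toDelete); // prints "toDelete = [s1]"
        System.out.println("toAdd = " + toAdd);       // prints "toAdd = [s4]"
    }
}
```

Only the two differences are sent to the connection, so statements common to both graphs are never removed and re-added.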

Comment by Arjohn Kampman [20/Oct/09 05:23 PM]
Removal of a (relatively) small set of statements has been sped up by using the updatedTriplesCache for this too.