Issue Details

Key: ELMO-72
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: James Leigh
Reporter: Joshua Shinavier
Votes: 0
Watchers: 0

Elmo

huge collections of rdfs:Class URIs in ElmoEntityResolverImpl

Created: 23/Jul/10 01:33 AM   Updated: 03/Aug/10 12:09 AM
Component/s: None
Affects Version/s: 1.5
Fix Version/s: 1.6

Environment: 64-bit Linux (2.6.27)


 Description   
I'm using Elmo to feed high volumes of Twitter data into an AllegroGraph triple store which also contains a DBpedia dump. For its first few days of operation, the system behaved as expected. However, I now run into OutOfMemoryErrors more and more frequently. When I inspected a heap dump with jhat, I found millions of URIImpl instances created by Elmo, which make up the vast majority of the memory consumed. What is particularly strange is that nearly all of these instances encode the rdfs:Class URI. The path from the SesameManager to these URIImpls is as follows:

    manager.resources.resolver.multiples.segments[i].table[j].key.elementData[k]

where i, j, and k are any in-range array indices. There is a large amount of data in my triple store (around 250,000,000 statements) and it is updated very frequently (dozens of times per second), but there is nothing very unusual about the format of the data. Do you have any idea why this is happening?
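
The field names in that path (segments, table, elementData) match the internals of java.util.concurrent.ConcurrentHashMap in Java versions of that period and of java.util.ArrayList, which suggests a map keyed by lists of type URIs. That reading is an inference from the path, not something confirmed in this issue. Under that assumption, a minimal sketch of how such a map could grow is shown here; the class and method names are hypothetical and are not Elmo code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TypeListKeyDemo {

    // Builds a cache keyed by each resource's list of types and returns
    // the number of entries. Resource n carries n copies of rdfs:Class,
    // so every resource produces a distinct key.
    public static int cacheSizeAfter(int resources) {
        Map<List<String>, String> cache = new ConcurrentHashMap<>();
        for (int n = 1; n <= resources; n++) {
            List<String> types =
                new ArrayList<>(Collections.nCopies(n, "rdfs:Class"));
            // Lists of different lengths are unequal, so no entry is reused.
            cache.put(types, "resolved");
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(100));
    }
}
```

If the keys really are type lists, then resources that differ only in how many duplicate rdf:type statements they carry would each get their own entry, and each key would hold one reference per duplicate.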

 Comments
Comment by Joshua Shinavier [23/Jul/10 02:15 AM]
Note: since posting the above, I've found that there are a lot of redundant "x rdf:type rdfs:Class" statements in the triple store. I'm sure this is relevant to the problem. Consider this issue a feature request ("ignore duplicate types") rather than a bug.

Comment by Joshua Shinavier [23/Jul/10 03:06 AM]
Another comment: the redundant rdf:type statements apparently result from Elmo asserting the type of already-typed classes. This is not a problem on NativeStore (for example), which evidently ignores duplicate statements, but AllegroGraph's behavior is to maintain as many copies of a statement as there are calls which add it to the triple store.
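
The difference described above can be modeled with plain Java collections. This is only an illustration of set versus bag semantics, not how NativeStore or AllegroGraph are implemented; the class name, method name, and the ex:Tweet subject are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

public class DuplicateStatementDemo {

    // Adds the same (subject, rdf:type, rdfs:Class) triple `times` times
    // and returns how many copies the collection ends up holding.
    public static int addRepeatedly(Collection<List<String>> store, int times) {
        for (int i = 0; i < times; i++) {
            store.add(Arrays.asList("ex:Tweet", "rdf:type", "rdfs:Class"));
        }
        return store.size();
    }

    public static void main(String[] args) {
        // Set semantics: repeated adds collapse into a single statement.
        System.out.println(addRepeatedly(new LinkedHashSet<>(), 1000));
        // Bag semantics: every add leaves another copy behind.
        System.out.println(addRepeatedly(new ArrayList<>(), 1000));
    }
}
```

With set semantics the store holds one statement after 1000 adds; with bag semantics it holds 1000.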

Comment by Joshua Shinavier [23/Jul/10 04:10 AM]
One more comment: it was my calls to manager.designate() that caused the rdf:type statements to be added to the store repeatedly. Since I can't avoid duplicate statements in AllegroGraph, I've changed my code to avoid unnecessary calls to designate(). There is still the question of what Elmo should do when it encounters a resource with redundant rdf:type statements (of which there are many in the Billion Triples Challenge datasets, for example), but this is now less of a problem for me.
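
The comment does not say how the calling code was changed. One possible shape for such a guard, sketched with a hypothetical TypeAsserter interface standing in for the real designate() call (whose signature is not reproduced here), is to remember which (resource, type) pairs have already been asserted:

```java
import java.util.HashSet;
import java.util.Set;

public class DesignateGuard {

    /** Hypothetical stand-in for the call that asserts an rdf:type statement. */
    public interface TypeAsserter {
        void assertType(String resource, String type);
    }

    private final Set<String> seen = new HashSet<>();
    private final TypeAsserter delegate;

    public DesignateGuard(TypeAsserter delegate) {
        this.delegate = delegate;
    }

    /** Forwards only the first request for each pair; returns true if forwarded. */
    public boolean designateOnce(String resource, String type) {
        // URIs contain no spaces, so a space is a safe separator for the key.
        // Set.add returns false when the pair was already recorded.
        if (!seen.add(resource + " " + type)) {
            return false;
        }
        delegate.assertType(resource, type);
        return true;
    }
}
```

The set in this sketch grows without bound and only knows about calls made by the current process, so a high-volume feed would likely need a bounded cache or a check against the store instead.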

Comment by James Leigh [03/Aug/10 12:09 AM]
Thanks for pointing this out. As of the next release, rdf:type values will be filtered for duplicates in memory, which avoids this situation.

Fixed in revision 10497 and ported to AliBaba in revision 10481.
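
The revisions themselves are not shown in this issue. As a rough idea of what filtering rdf:type values for duplicates in memory can look like, here is a minimal sketch using a LinkedHashSet; the class and method names are hypothetical and this is not the actual Elmo change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class TypeDeduplicator {

    // Returns the types with duplicates removed, keeping first-seen order.
    public static List<String> distinctTypes(List<String> types) {
        return new ArrayList<>(new LinkedHashSet<>(types));
    }

    public static void main(String[] args) {
        // Types as they might come back from a store that keeps duplicates.
        List<String> fromStore =
            Arrays.asList("rdfs:Class", "rdfs:Class", "ex:Tweet", "rdfs:Class");
        System.out.println(distinctTypes(fromStore)); // [rdfs:Class, ex:Tweet]
    }
}
```

Filtering on read keeps the in-memory type list the same size however many copies of a statement the store holds.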