The included openrdf-elmo-codegen.jar can be used from the command line to create an RDF ontology file from existing JavaBeans or generate Elmo concepts from an RDF ontology file. The command below will search the given jar (example-entities.jar) for classes in the package com.example.entities and output an OWL DL ontology in example-ontology.owl using the given ontology URI and the same namespace.
java -jar openrdf-elmo-codegen.jar \
-b "com.example.entities=http://www.example.com/rdf/2007/model#" \
-r example-ontology.owl \
example-entities.jar
In the example below, the ontology example-ontology.owl will be imported, and Elmo concepts that are defined by the given ontology URI will be created and compiled in example-concepts.jar. This jar will then be ready to be used for development and deployment of an Elmo application.
java -jar openrdf-elmo-codegen.jar \
-b "com.example.concepts=http://www.example.com/rdf/2007/model#" \
-j example-concepts.jar \
example-ontology.owl
The openrdf-elmo-codegen.jar automatically imports the ontologies and concepts from openrdf-elmo-concepts.jar, so they do not need to be specified on the command line. Any other concept jar files that the ontology depends on should be listed at the end of the command.
The Elmo scutter is a generic RDF crawler that follows rdfs:seeAlso links in RDF documents, which typically point to other relevant RDF sources on the web. The Elmo scutter is based on original code by Matt Biddulph for Jena.
The rdfs:seeAlso property is also the mechanism used to link FOAF profiles, so, given a starting location, the scutter can be used to collect FOAF profiles from the Web. Several advanced features are provided to support this scenario:
Blacklisting: sites that produce FOAF profiles in large quantities are automatically placed on a blacklist. This is to avoid collecting large amounts of uninteresting FOAF data produced by social networking and blogging services or other dynamic sources.
Whitelisting: the crawler can be limited to a domain (defined by a URL pattern).
Metadata: the crawler can optionally store metadata about the collected statements.
Filtering: incoming statements can be filtered individually. This is useful to remove unnecessary information, such as statements from unknown namespaces.
Persistence: when the scutter is stopped, it saves its state to disk, so scuttering can later continue from the point where it left off. On startup, the scutter also tries to reload the list of visited URLs from the repository (this requires the saving of metadata to be turned on).
Logging: the scutter uses slf4j to provide detailed logging of the crawl.
The data collected by the scutter is stored in a Sesame repository. We recommend using a Native RDF repository for scuttering, because it provides the best performance for uploads.
The Scutter is available as a Java class as well as a Java servlet. The servlet provides access to all of the above features except filtering (which requires programming), and it can be deployed by placing the Elmo.war file in the web application directory of a Servlet/JSP container.
The servlet initialization parameters to be specified in the web.xml descriptor file are listed below. An example web.xml file is provided in the war file.
| Parameter name | Description | Required/Optional/Default |
| --- | --- | --- |
| server | URL of the Sesame server to store the collected data | Required |
| repository | Name of the repository on the server | Required |
| username | Username for access to the Sesame repository | Optional |
| password | Password for access to the Sesame repository | Optional |
| queue | Location of the file used to save the queue when the scutter is stopped | Required |
| start | URL(s) used to start scuttering. URLs should be separated by white space. | Optional |
| domain | Limits crawling to URLs that match the provided regular expression. | Optional |
| metadata | Produce reified statements containing information about the provenance of the statements and the time they were collected. Possible values: true/false | Optional, defaults to false. |
| autoblacklist | Enable/disable automatic blacklisting. Possible values: true/false | Optional, defaults to true (enabled). |
| vocab | Restrict crawling to FOAF-specific vocabularies (statements with predicates from the RDF, RDFS, FOAF or WGS_84 namespaces) | Optional, only possible value is 'foaf' |
| focused | Collect data about a specific set of target persons. The target persons are given as foaf:Person instances in the repository. | Optional, actual value is ignored |
| maxThreads | Maximum number of threads allowed to be running. Must be a positive integer. | Optional, defaults to 20. |
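To make the effect of the domain parameter concrete, the following is a minimal sketch of how a regular expression can restrict a list of candidate URLs. It is not the scutter's own code: the class DomainWhitelist, the example pattern, and the URLs are hypothetical, and the sketch assumes that the whole URL has to match the expression, which may differ from what the servlet actually does.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Elmo: illustrates how a "domain"
// regular expression could be applied to candidate URLs.
class DomainWhitelist {
    private final Pattern pattern;

    DomainWhitelist(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Assumption: the entire URL must match the expression.
    boolean allows(String url) {
        return pattern.matcher(url).matches();
    }

    // Keeps only the URLs that the pattern allows, preserving order.
    List<String> filter(List<String> urls) {
        return urls.stream().filter(this::allows).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example pattern: example.com and any of its subdomains.
        DomainWhitelist whitelist =
                new DomainWhitelist("https?://([^/]+\\.)?example\\.com/.*");
        List<String> candidates = List.of(
                "http://www.example.com/people/alice.rdf",
                "http://blog.example.org/foaf.rdf");
        System.out.println(whitelist.filter(candidates));
    }
}
```

When writing such a pattern for web.xml, remember to escape the dots in host names, otherwise they match any character.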
The request parameters accepted by the servlet are listed in the table below. For convenience, the distribution includes an HTML file for calling the various operations on the servlet.
| Parameter name | Description | Required/Optional/Default |
| --- | --- | --- |
| start | Try to load the set of visited URLs and start the scutter | Parameter value ignored. |
| stop | Stop the scutter, save the queue to disk | Parameter value ignored. |
| preloadQueue | Preload the queue from the saved file | Parameter value ignored. |
| clear | Clear the queue and the set of visited URLs | Parameter value ignored. |
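As an alternative to the HTML file, the request URLs can be assembled programmatically. The sketch below only builds the URL for one of the four operations listed above; it does not send the request. The class ScutterRequest and the deployment URL are hypothetical, and because the parameter value is ignored, a fixed placeholder value of true is used.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical helper, not part of Elmo: builds request URLs for the
// scutter servlet operations documented in the table above.
class ScutterRequest {
    // The four request parameters listed in the table.
    static final Set<String> OPERATIONS =
            Set.of("start", "stop", "preloadQueue", "clear");

    // The servlet ignores the parameter value, so "true" is only a placeholder.
    static URI forOperation(String servletUrl, String operation) {
        if (!OPERATIONS.contains(operation)) {
            throw new IllegalArgumentException("Unknown operation: " + operation);
        }
        return URI.create(servletUrl + "?" + operation + "=true");
    }

    public static void main(String[] args) {
        // Assumed deployment location; adjust to the actual servlet mapping.
        String servletUrl = "http://localhost:8080/elmo/scutter";
        System.out.println(forOperation(servletUrl, "start"));
        System.out.println(forOperation(servletUrl, "stop"));
    }
}
```

The resulting URI can be passed to any HTTP client or opened in a browser.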
Custom filtering of statements can be implemented by passing an instance of the StatementFilter interface to the setStatementFilter method of the Scutter class. See the JavaDoc for more details.
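The method signature of StatementFilter is not reproduced in this guide, so the following sketch shows only the kind of test a custom filter might apply: accepting a statement when its predicate belongs to a known namespace. The class KnownNamespaces and its accepts method are hypothetical and are not part of the Elmo API; the namespace URIs are the standard ones for RDF, RDFS, FOAF and WGS84.

```java
import java.util.List;

// Hypothetical helper, not part of Elmo: decides whether a predicate URI
// belongs to one of a fixed set of namespaces.
class KnownNamespaces {
    static final List<String> NAMESPACES = List.of(
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#", // RDF
            "http://www.w3.org/2000/01/rdf-schema#",       // RDFS
            "http://xmlns.com/foaf/0.1/",                  // FOAF
            "http://www.w3.org/2003/01/geo/wgs84_pos#");   // WGS84

    // Returns true when the predicate URI starts with a known namespace.
    static boolean accepts(String predicateUri) {
        for (String namespace : NAMESPACES) {
            if (predicateUri.startsWith(namespace)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://xmlns.com/foaf/0.1/knows"));
        System.out.println(accepts("http://purl.org/dc/elements/1.1/title"));
    }
}
```

A real filter would call a test like this from whatever method the StatementFilter interface defines, using the predicate of the incoming statement.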
The task of the Elmo smusher is to find equivalent instances in large sets of data. This is a very common problem when processing collections of FOAF profiles, as several sources on the Web may describe the same individual using different identifiers or blank nodes (which are always assumed to be different). While the servlet provided is specific to smushing foaf:Person instances, the underlying mechanism is generic.
The smusher uses instances of ResourceComparator for comparing instances. Implementations of ResourceComparator are given for foaf:Person and swrc:Publication.
The smusher reports the results (matching instances) by calling methods on registered listeners. Listeners implement the SmusherListener interface. Two implementations of SmusherListener are provided: one writes out results in text, while the other represents matches using the owl:sameAs relationship and uploads such statements to a Sesame repository. While Sesame does not directly support OWL semantics, the semantics of this relationship (the equivalence of property values) can be easily axiomatized using Sesame's custom rule language.
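The idea of reporting matches as owl:sameAs statements can be illustrated with a small, self-contained sketch. It does not use the ResourceComparator or SmusherListener interfaces, whose signatures are not shown here, and the matching criterion is an assumption: two resources are treated as equivalent when they share the same foaf:mbox_sha1sum value. The comparators shipped with Elmo may use different or additional criteria. The class SmushSketch, the URIs, and the hash values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Elmo: groups resources by a shared key
// and reports each match as an owl:sameAs statement.
class SmushSketch {
    static final String OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";

    // Input: resource URI -> foaf:mbox_sha1sum value (assumed matching key).
    // Output: one N-Triples style line per resource that matches an
    // earlier resource with the same key.
    static List<String> sameAsStatements(Map<String, String> hashByResource) {
        Map<String, String> firstResourceByHash = new LinkedHashMap<>();
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, String> entry : hashByResource.entrySet()) {
            // putIfAbsent returns the earlier resource if the key was seen before.
            String earlier = firstResourceByHash.putIfAbsent(entry.getValue(), entry.getKey());
            if (earlier != null) {
                statements.add("<" + earlier + "> <" + OWL_SAME_AS + "> <" + entry.getKey() + "> .");
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, String> input = new LinkedHashMap<>();
        input.put("http://example.com/people#alice", "hash-1");
        input.put("http://example.org/foaf#bob", "hash-2");
        input.put("http://example.net/me#a", "hash-1");
        sameAsStatements(input).forEach(System.out::println);
    }
}
```

Blank nodes are left out of this sketch because they need identifiers that are only meaningful within a single repository.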