org.openrdf.elmo.scutter
Class Scutter

java.lang.Object
  extended by org.openrdf.elmo.scutter.Scutter
All Implemented Interfaces:
Runnable

public class Scutter
extends Object
implements Runnable

Scutter is the main class of the RDF crawler (scutter). It can be invoked from the command line or built into applications. Scutter implements Runnable so that it can be run as a background thread.

Version:
$Revision: 1.20 $
Author:
Peter Mika (original version for Jena by Matt Biddulph.)

Field Summary
static int DEFAULT_MAXTHREADS
           
static int DEFAULT_SIZE_LIMIT
          Size limit (in KB) for files to be parsed.
static int LINKEDQUEUE_SIZE
           
static int STOP_TIME
          Time (in seconds) to wait for the Scutter to finish after it is stopped.
 
Constructor Summary
Scutter(RetrieverFactory factory)
          Create a new scutter.
 
Method Summary
 void addURL(URL url)
          Add a single URL to the queue.
protected  void addVisited(URL url)
           
 boolean clear()
          Clear the queue and visited lists.
 boolean getAutoBlackList()
          Determine whether automatic blacklisting is enabled.
 File getBlacklistFile()
           
 Resource[] getContext()
           
 Pattern getDomainPattern()
          Return the pattern used for the whitelist.
 int getMaxThreads()
           
 List getQueue()
          Get the queue.
 File getQueueFile()
           
 int getSizeLimit()
           
 boolean getStoreMetadata()
          Determine whether the scutter is set to produce metadata.
 Set getVisited()
          Get the set of URLs visited so far.
protected  boolean inBlacklist(URL url)
          Check if a URL is on the blacklist.
 void initQueue(String[] urls)
          Add a list of URLs to the queue.
 int loadBlacklist()
          Add prefixes to the blacklist from a file on the disk.
 int loadQueue()
          Add URLs to the queue from a file on the disk.
protected  void loadVisited()
          Load the URLs of sources visited so far from the repository.
static void main(String[] args)
          Scutter command-line tool.
 void run()
          Main method that loops indefinitely until the scutter is stopped.
 void saveBlacklist()
          Save the status of the blacklist to the disk.
 void saveQueue()
          Save the status of the queue to the disk.
 void setAutoBlackList(boolean blacklist)
          Set whether the scutter should automatically put a site on the blacklist after a number of profiles have been collected from that site.
 void setBlacklistFile(File blacklistFile)
           
 void setContext(Resource[] context)
           
 void setDomainPattern(Pattern p)
          Sets a pattern (regular expression) for limiting crawling to those URLs that match this pattern.
 void setMaxThreads(int threads)
           
 void setQueueFile(File queueFile)
           
 void setSizeLimit(int sizelimit)
          Set the size limit for files to be loaded.
 void setStoreMetadata(boolean metadata)
          Set whether the scutter should produce metadata.
 void stop()
          Stop the scutter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LINKEDQUEUE_SIZE

public static final int LINKEDQUEUE_SIZE
See Also:
EDU.oswego.cs.dl.util.concurrent.BoundedLinkedQueue, Constant Field Values

DEFAULT_MAXTHREADS

public static final int DEFAULT_MAXTHREADS
See Also:
EDU.oswego.cs.dl.util.concurrent.PooledExecutor, Constant Field Values

STOP_TIME

public static final int STOP_TIME
Time (in seconds) to wait for the Scutter to finish after it is stopped, before the queue is saved to disk. The time to wait depends on the size of the thread pool.

See Also:
Constant Field Values

DEFAULT_SIZE_LIMIT

public static final int DEFAULT_SIZE_LIMIT
Size limit (in KB) for files to be parsed. Files larger will not be loaded.

See Also:
Constant Field Values
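
The size rule described for DEFAULT_SIZE_LIMIT can be illustrated with a minimal sketch. This is not the Scutter source: the class SizeLimitSketch and the method withinLimit are hypothetical names, and the sketch assumes 1 KB = 1024 bytes and that a file exactly at the limit is still accepted.

```java
// Hypothetical helper, not part of Elmo: decides whether a file of a given
// size (in bytes) is within a size limit expressed in KB.
final class SizeLimitSketch {

    // Assumption: 1 KB = 1024 bytes, and a file exactly at the limit passes.
    static boolean withinLimit(long contentLengthBytes, int sizeLimitKb) {
        // Widen to long before multiplying so large limits cannot overflow.
        return contentLengthBytes <= (long) sizeLimitKb * 1024L;
    }
}
```

A caller would compare the reported length of a document against the configured limit before parsing it, and skip the document when the check fails.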
Constructor Detail

Scutter

public Scutter(RetrieverFactory factory)
        throws Exception
Create a new scutter.

Parameters:
factory - RetrieverFactory for this scutter (originally documented as repository: the Sesame repository to be used for storing the data)
Throws:
Exception
Method Detail

main

public static final void main(String[] args)
                       throws Exception
Scutter command-line tool.

Parameters:
args - The first argument is the URL of the Sesame server, the second argument is the repository name. Any remaining arguments are interpreted as URLs and are used to initialize the queue.
Throws:
Exception
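
The argument layout described for main can be sketched as follows. This only illustrates how the three groups of arguments are separated; the class ScutterArgs, its field names, and the usage message are hypothetical and do not come from the Elmo source.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Elmo: splits command-line arguments
// according to the layout documented for Scutter.main(String[]).
final class ScutterArgs {
    final String serverUrl;      // first argument: URL of the Sesame server
    final String repositoryName; // second argument: repository name
    final String[] seedUrls;     // remaining arguments: URLs for the queue

    ScutterArgs(String[] args) {
        if (args.length < 2) {
            // The real tool's behavior on missing arguments is not documented;
            // failing fast is an assumption of this sketch.
            throw new IllegalArgumentException(
                "usage: <server-url> <repository-name> [url ...]");
        }
        serverUrl = args[0];
        repositoryName = args[1];
        // May be empty when no seed URLs are given.
        seedUrls = Arrays.copyOfRange(args, 2, args.length);
    }
}
```

The seedUrls array has the same shape as the String[] that initQueue accepts.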

initQueue

public void initQueue(String[] urls)
Add a list of URLs to the queue.

Parameters:
urls - Array of strings (URLs)

getQueue

public List getQueue()
Get the queue. Beware: may mutate while the scutter is running.

Returns:
List of URL objects

getVisited

public Set getVisited()
Get the set of URLs visited so far. Beware: may mutate while the scutter is running.

Returns:
Set of URL objects

clear

public boolean clear()
Clear the queue and visited lists.

Returns:
false if clear fails

addVisited

protected void addVisited(URL url)

loadVisited

protected void loadVisited()
Load the URLs of sources visited so far from the repository.

Throws:
RepositoryException
QueryEvaluationException
MalformedQueryException

loadQueue

public int loadQueue()
Add URLs to the queue from a file on the disk. Format is one URL per line.

Returns:
number of URLs loaded
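
The "one URL per line" format can be read with plain JDK classes, as in the sketch below. It is not the Scutter implementation: the class QueueFileSketch and the method load are hypothetical, and skipping blank or malformed lines is an assumption, since the documentation does not say how such lines are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;

// Hypothetical helper, not part of Elmo: reads a queue file that holds
// one URL per line and appends the parsed URLs to an existing queue.
final class QueueFileSketch {

    // Returns the number of URLs added to the queue.
    static int load(Reader source, List<URL> queue) {
        int loaded = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // assumption: blank lines are ignored
                }
                try {
                    queue.add(new URI(line).toURL());
                    loaded++;
                } catch (URISyntaxException | MalformedURLException
                        | IllegalArgumentException e) {
                    // assumption: lines that are not absolute URLs are skipped
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return loaded;
    }
}
```

Passing a java.io.FileReader for the configured queue file would give the file-based behavior; a Reader parameter is used here only so the sketch can be exercised without touching the disk.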

saveQueue

public void saveQueue()
Save the status of the queue to the disk. Format is one URL per line.


loadBlacklist

public int loadBlacklist()
Add prefixes to the blacklist from a file on the disk. Format is one prefix per line.

Returns:
number of prefixes loaded

saveBlacklist

public void saveBlacklist()
                   throws IOException
Save the status of the blacklist to the disk. Format is one prefix per line.

Throws:
IOException

stop

public void stop()
          throws IOException
Stop the scutter. The thread will sleep for a specified time to allow the scutter to finish.

Throws:
IOException

setDomainPattern

public void setDomainPattern(Pattern p)
Sets a pattern (regular expression) for limiting crawling to those URLs that match this pattern. In other words, this pattern defines the whitelist.

Parameters:
p - Pattern to use
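
A whitelist check of this kind can be sketched with java.util.regex alone. The sketch is an assumption about usage, not the Scutter source: the documentation does not say whether the whole URL must match or whether a partial match is enough, so the hypothetical helper below requires a full match and treats a null pattern as "no restriction".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elmo: decides whether a URL passes
// a regular-expression whitelist.
final class DomainWhitelistSketch {

    static boolean allowed(Pattern domainPattern, String url) {
        // Assumption: no pattern configured means every URL is allowed.
        if (domainPattern == null) {
            return true;
        }
        // Assumption: the entire URL string must match the pattern.
        return domainPattern.matcher(url).matches();
    }
}
```

For instance, a pattern compiled from "https?://([^/]+\\.)?example\\.org/.*" would admit URLs on example.org and its subdomains and reject everything else. A pattern of this shape is what one would pass to setDomainPattern.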

getDomainPattern

public Pattern getDomainPattern()
Return the pattern used for the whitelist.

Returns:
Pattern

setStoreMetadata

public void setStoreMetadata(boolean metadata)
Set whether the scutter should produce metadata.

Parameters:
metadata - whether the scutter should produce metadata

getStoreMetadata

public boolean getStoreMetadata()
Determine whether the scutter is set to produce metadata.

Returns:
Flag indicating if the metadata feature is turned on

setAutoBlackList

public void setAutoBlackList(boolean blacklist)
Set whether the scutter should automatically put a site on the blacklist after a number of profiles have been collected from that site.

Parameters:
blacklist - whether automatic blacklisting should be enabled

getAutoBlackList

public boolean getAutoBlackList()
Determine whether automatic blacklisting is enabled.

Returns:
Flag indicating if automatic blacklisting is turned on

setSizeLimit

public void setSizeLimit(int sizelimit)
Set the size limit for files to be loaded.

Parameters:
sizelimit - size limit for files to be loaded

getSizeLimit

public int getSizeLimit()
Returns:
Size limit for files to be loaded

getMaxThreads

public int getMaxThreads()
Returns:
Returns the maxThreads.

setMaxThreads

public void setMaxThreads(int threads)
Parameters:
threads - The maxThreads to set.

run

public void run()
Main method that loops indefinitely until the scutter is stopped.

Specified by:
run in interface Runnable
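
The general pattern of a Runnable that loops until it is told to stop can be sketched as below. This is a single-threaded stand-in and not the Scutter implementation, which according to the field documentation uses a thread pool and waits STOP_TIME seconds on stop. The class StoppableWorkerSketch and all of its members are hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in, not part of Elmo: a worker that drains a queue
// in a background thread until stop() is called.
final class StoppableWorkerSketch implements Runnable {
    // volatile so that a stop() from another thread is seen by run()
    private volatile boolean stopped = false;
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger processed = new AtomicInteger();

    void add(String item) { queue.add(item); }

    void stop() { stopped = true; }

    int processedCount() { return processed.get(); }

    @Override
    public void run() {
        while (!stopped) {
            String next = queue.poll();
            if (next == null) {
                // Queue is empty: pause briefly instead of spinning.
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                continue;
            }
            // Stand-in for retrieving and parsing the document at this URL.
            processed.incrementAndGet();
        }
    }
}
```

A caller would wrap the worker in new Thread(worker), call start(), and later call stop() followed by join() to wait for the loop to exit.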

addURL

public void addURL(URL url)
Add a single URL to the queue.

Parameters:
url - URL to add to the queue

inBlacklist

protected boolean inBlacklist(URL url)
Check if a URL is on the blacklist.

Parameters:
url - URL to check
Returns:
true if the URL is on the blacklist
getBlacklistFile

public File getBlacklistFile()
Returns:
Returns the blacklistFile.

setBlacklistFile

public void setBlacklistFile(File blacklistFile)
Parameters:
blacklistFile - The blacklistFile to set.

getQueueFile

public File getQueueFile()
Returns:
Returns the queueFile.

setQueueFile

public void setQueueFile(File queueFile)
Parameters:
queueFile - The queueFile to set.

setContext

public void setContext(Resource[] context)

getContext

public Resource[] getContext()


Copyright © 2004-2008 Aduna. All Rights Reserved.