java.lang.Object
  org.openrdf.elmo.scutter.Scutter
public class Scutter
Scutter is the main class of the RDF crawler (scutter). It can be invoked from the command line or built into applications. Scutter implements Runnable so that it can be run as a background thread.
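Because the class implements Runnable, an application can pass an instance to a Thread and call stop() later. The following is a minimal sketch of that pattern only. `StoppableWorker` is a hypothetical stand-in, not Scutter itself, since constructing a real Scutter requires a RetrieverFactory that this page does not describe. The bounded `join` is likewise an assumption about how a caller might wait for shutdown.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a crawler; it only mimics the run()/stop() shape.
class StoppableWorker implements Runnable {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final AtomicInteger iterations = new AtomicInteger();

    @Override
    public void run() {
        // Loop until stop() flips the flag.
        while (running.get()) {
            iterations.incrementAndGet();
            try {
                Thread.sleep(5); // placeholder for fetching one URL
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() {
        running.set(false);
    }

    public int iterations() {
        return iterations.get();
    }

    // Starts the worker in the background, stops it after `millis`,
    // then waits a bounded time for the thread to end.
    static boolean runBriefly(StoppableWorker worker, long millis) {
        Thread t = new Thread(worker, "worker-sketch");
        t.start();
        try {
            Thread.sleep(millis);
            worker.stop();
            t.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            worker.stop();
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        StoppableWorker worker = new StoppableWorker();
        System.out.println("stopped cleanly: " + runBriefly(worker, 50));
    }
}
```

With a real Scutter the shape would presumably be the same: create the thread, start it, and call stop() from the controlling code when the crawl should end.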
| Field Summary | |
|---|---|
| static int | DEFAULT_MAXTHREADS |
| static int | DEFAULT_SIZE_LIMIT: Size limit (in KB) for files to be parsed. |
| static int | LINKEDQUEUE_SIZE |
| static int | STOP_TIME: Time to wait (in seconds) for Scutter to finish after it is stopped. |
| Constructor Summary | |
|---|---|
| Scutter(RetrieverFactory factory) | Create a new scutter. |
| Method Summary | |
|---|---|
| void | addURL(URL url): Add a single URL to the queue. |
| protected void | addVisited(URL url) |
| boolean | clear(): Clear the queue and visited lists. |
| boolean | getAutoBlackList(): Determine whether automatic blacklisting is enabled. |
| File | getBlacklistFile() |
| Resource[] | getContext() |
| Pattern | getDomainPattern(): Return the pattern used for the whitelist. |
| int | getMaxThreads() |
| List | getQueue(): Get the queue. |
| File | getQueueFile() |
| int | getSizeLimit() |
| boolean | getStoreMetadata(): Determine whether the scutter is set to produce metadata. |
| Set | getVisited(): Get the set of URLs visited so far. |
| protected boolean | inBlacklist(URL url): Check if a URL is on the blacklist. |
| void | initQueue(String[] urls): Add a list of URLs to the queue. |
| int | loadBlacklist(): Add prefixes to the blacklist from a file on disk. |
| int | loadQueue(): Add URLs to the queue from a file on disk. |
| protected void | loadVisited(): Load the URLs of sources visited so far from the repository. |
| static void | main(String[] args): Scutter command-line tool. |
| void | run(): Main method that loops until the scutter is stopped. |
| void | saveBlacklist(): Save the status of the blacklist to disk. |
| void | saveQueue(): Save the status of the queue to disk. |
| void | setAutoBlackList(boolean blacklist): Set whether the scutter should automatically put a site on the blacklist after a number of profiles have been collected from that site. |
| void | setBlacklistFile(File blacklistFile) |
| void | setContext(Resource[] context) |
| void | setDomainPattern(Pattern p): Set a pattern (regular expression) that limits crawling to URLs matching it. |
| void | setMaxThreads(int threads) |
| void | setQueueFile(File queueFile) |
| void | setSizeLimit(int sizelimit): Set the size limit for files to be loaded. |
| void | setStoreMetadata(boolean metadata): Set whether the scutter should produce metadata. |
| void | stop(): Stop the scutter. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final int LINKEDQUEUE_SIZE
See Also:
    EDU.oswego.cs.dl.util.concurrent.BoundedLinkedQueue, Constant Field Values

public static final int DEFAULT_MAXTHREADS
See Also:
    EDU.oswego.cs.dl.util.concurrent.PooledExecutor, Constant Field Values

public static final int STOP_TIME

public static final int DEFAULT_SIZE_LIMIT
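The field details point to BoundedLinkedQueue and PooledExecutor from the util.concurrent library, which suggests a bounded work queue feeding a pool of worker threads. This page does not show how Scutter wires them together, so the following is only a sketch of the general technique using the java.util.concurrent equivalents (LinkedBlockingQueue and ThreadPoolExecutor). The class name, the constant values, and the rejection policy are all assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a bounded queue in front of a fixed-size thread pool.
class BoundedPoolSketch {
    // Assumed values; this page does not list the actual constants.
    static final int QUEUE_SIZE = 100;
    static final int MAX_THREADS = 4;

    // Submits `tasks` trivial jobs and returns how many completed.
    static int runTasks(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                MAX_THREADS, MAX_THREADS, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(QUEUE_SIZE),
                // When the queue is full, the submitting thread runs the task
                // itself, which slows submission instead of dropping work.
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runTasks(500));
    }
}
```

A bounded queue keeps memory use predictable when URLs are discovered faster than they can be fetched.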
| Constructor Detail |
|---|
public Scutter(RetrieverFactory factory)
        throws Exception
Parameters:
    repository - Sesame repository to be used for storing the data
Throws:
    Exception

| Method Detail |
|---|
public static final void main(String[] args)
        throws Exception
Parameters:
    args - The first argument is the URL of the Sesame server and the second is the repository name. Any remaining arguments are interpreted as URLs and used to initialize the queue.
Throws:
    Exception

public void initQueue(String[] urls)
Parameters:
    urls - Array of strings (URLs)

public List getQueue()
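The argument layout described for main (server URL, repository name, then seed URLs) can be illustrated with a small parser. This is a sketch under assumptions: `CliArgs` is a hypothetical helper, and the usage message and error handling are invented for illustration rather than taken from Scutter.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the three kinds of command-line arguments.
class CliArgs {
    final String serverUrl;       // first argument: URL of the Sesame server
    final String repositoryName;  // second argument: repository name
    final List<URL> seeds = new ArrayList<>(); // remaining arguments: seed URLs

    CliArgs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                    "usage: <server-url> <repository> [seed-url ...]");
        }
        serverUrl = args[0];
        repositoryName = args[1];
        for (int i = 2; i < args.length; i++) {
            try {
                seeds.add(new URL(args[i]));
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("not a URL: " + args[i], e);
            }
        }
    }

    public static void main(String[] args) {
        CliArgs parsed = new CliArgs(new String[] {
                "http://localhost:8080/sesame", "testrepo",
                "http://example.org/foaf.rdf"});
        System.out.println(parsed.repositoryName + " seeds=" + parsed.seeds.size());
    }
}
```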
public Set getVisited()
public boolean clear()
protected void addVisited(URL url)
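The queue and the visited set together decide which URLs still need fetching. How Scutter implements this is not shown here, so the following is a minimal sketch of the usual technique, with a hypothetical `Frontier` class. It keys the visited set on the URL string, a deliberate choice because java.net.URL's equals and hashCode may resolve host names.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a crawl frontier: a queue plus a visited set.
class Frontier {
    private final List<URL> queue = new ArrayList<>();
    private final Set<String> visited = new HashSet<>();

    // Queues the URL unless it was already visited or is already queued.
    boolean addURL(URL url) {
        String key = url.toExternalForm();
        if (visited.contains(key)) {
            return false;
        }
        for (URL queued : queue) {
            if (queued.toExternalForm().equals(key)) {
                return false;
            }
        }
        queue.add(url);
        return true;
    }

    // Takes the next URL off the queue and marks it visited; null if empty.
    URL next() {
        if (queue.isEmpty()) {
            return null;
        }
        URL url = queue.remove(0);
        visited.add(url.toExternalForm());
        return url;
    }

    // Empties both the queue and the visited set.
    void clear() {
        queue.clear();
        visited.clear();
    }

    int queued() {
        return queue.size();
    }

    // Convenience wrapper that turns the checked exception into an unchecked one.
    static URL url(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```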
protected void loadVisited()
        throws RepositoryException, QueryEvaluationException, MalformedQueryException
Throws:
    RepositoryException
    QueryEvaluationException
    MalformedQueryException

public int loadQueue()
public void saveQueue()
public int loadBlacklist()
public void saveBlacklist()
        throws IOException
Throws:
    IOException

public void stop()
        throws IOException
Throws:
    IOException

public void setDomainPattern(Pattern p)
Parameters:
    p - Pattern to use

public Pattern getDomainPattern()
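A domain pattern acts as a whitelist: only URLs that match the regular expression are crawled. The sketch below shows one way such a check could work with java.util.regex.Pattern. `DomainFilter` is a hypothetical class, and both the full-match semantics (`matches()` rather than `find()`) and the "no pattern means no restriction" rule are assumptions, since this page does not say how Scutter applies the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical whitelist check based on a regular expression.
class DomainFilter {
    private Pattern domainPattern; // null means "no restriction" (assumption)

    void setDomainPattern(Pattern p) {
        domainPattern = p;
    }

    Pattern getDomainPattern() {
        return domainPattern;
    }

    // Returns true if the URL may be crawled under the current pattern.
    boolean allows(String url) {
        return domainPattern == null || domainPattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        DomainFilter filter = new DomainFilter();
        // Accept example.org and its subdomains over http or https.
        filter.setDomainPattern(
                Pattern.compile("^https?://([^/]+\\.)?example\\.org/.*"));
        System.out.println(filter.allows("http://www.example.org/foaf.rdf"));
        System.out.println(filter.allows("http://example.com/foaf.rdf"));
    }
}
```

Anchoring the host part of the expression, as above, avoids accepting URLs that merely contain the domain name somewhere in their path.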
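public void setStoreMetadata(boolean metadata)
Parameters:
    metadata - true if the scutter should produce metadata

public boolean getStoreMetadata()

public void setAutoBlackList(boolean blacklist)
Parameters:
    blacklist - true if sites should be blacklisted automatically

public boolean getAutoBlackList()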
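Automatic blacklisting is described as putting a site on the blacklist once a number of profiles has been collected from it. The page gives neither the threshold nor the bookkeeping, so the following is only a sketch of one plausible approach: a per-host counter with a configurable limit. `AutoBlacklist`, its threshold parameter, and the use of the host name as the key are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: blacklist a host after `threshold` collected profiles.
class AutoBlacklist {
    private final int threshold; // assumed; the real number is not documented here
    private final Map<String, Integer> profilesPerHost = new HashMap<>();
    private final Set<String> blacklistedHosts = new HashSet<>();
    private boolean enabled = true;

    AutoBlacklist(int threshold) {
        this.threshold = threshold;
    }

    void setAutoBlackList(boolean blacklist) {
        enabled = blacklist;
    }

    boolean getAutoBlackList() {
        return enabled;
    }

    // Records one collected profile and blacklists the host at the threshold.
    void recordProfile(String host) {
        if (!enabled) {
            return;
        }
        int count = profilesPerHost.merge(host, 1, Integer::sum);
        if (count >= threshold) {
            blacklistedHosts.add(host);
        }
    }

    boolean isBlacklisted(String host) {
        return blacklistedHosts.contains(host);
    }
}
```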
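public void setSizeLimit(int sizelimit)
Parameters:
    sizelimit - the size limit for files to be loaded

public int getSizeLimit()

public int getMaxThreads()

public void setMaxThreads(int threads)
Parameters:
    threads - The maxThreads to set.

public void run()
Specified by:
    run in interface Runnable

public void addURL(URL url)
Parameters:
    url - the URL to add to the queue

protected boolean inBlacklist(URL url)
Parameters:
    url - the URL to check against the blacklist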
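Since loadBlacklist() is described as adding prefixes to the blacklist, a blacklist check is presumably a prefix match against the URL. The sketch below illustrates that idea with a hypothetical `PrefixBlacklist` class. The one-prefix-per-line input and the meaning of the returned count (number of prefixes added) are assumptions, not documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a prefix-based URL blacklist.
class PrefixBlacklist {
    private final List<String> prefixes = new ArrayList<>();

    // Adds every non-blank line as a prefix and returns how many were added.
    int load(List<String> lines) {
        int added = 0;
        for (String line : lines) {
            String prefix = line.trim();
            if (!prefix.isEmpty()) {
                prefixes.add(prefix);
                added++;
            }
        }
        return added;
    }

    // A URL is blacklisted if it starts with any stored prefix.
    boolean inBlacklist(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixBlacklist blacklist = new PrefixBlacklist();
        int added = blacklist.load(Arrays.asList("http://spam.test/", "", "http://example.org/private/"));
        System.out.println("prefixes added: " + added);
        System.out.println(blacklist.inBlacklist("http://example.org/private/a.rdf"));
    }
}
```

In a real setup the lines would come from the blacklist file, for example via java.nio.file.Files.readAllLines.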
public File getBlacklistFile()

public void setBlacklistFile(File blacklistFile)
Parameters:
    blacklistFile - The blacklistFile to set.

public File getQueueFile()

public void setQueueFile(File queueFile)
Parameters:
    queueFile - The queueFile to set.

public void setContext(Resource[] context)

public Resource[] getContext()