Jan 24, 2005 10:17:49 AM

arjohn
OpenRDF project lead
The Netherlands
Joined: Jan 23, 2004
Posts: 1289
Turtle Tuples: Turtle-based query result format

Hi all,

What started as a search for ways to compact the XML-based query result format for Sesame resulted in an idea for a completely new format: one that is based on Turtle. Please let me know whether you think this is a good or a bad idea.

The simple idea is to represent a table-like query result as a document of tab- (or whitespace-) separated Turtle values. Each line of tab-separated values represents one row of the query result. Just like in Turtle, @prefix directives can be used to map namespaces to prefixes. A format-specific @header directive is used to give names to the table's columns and the '*' character is used to indicate NULL values. The following shows an example query result:
@header "Country"  "Name"  "NatRes".
@prefix ciafb: <http://www.odci.gov/cia/publications/factbook/geos/>.
ciafb:ag.html "Algeria" "natural gas".
ciafb:ag.html "Algeria" "petroleum".
ciafb:ag.html "Algeria" "iron ore".
ciafb:ly.html "Libya" "natural gas".
ciafb:ly.html "Libya" "petroleum".
ciafb:bs.html "Bassas da India" *.

(Please note that the above does not describe RDF triples, but tuples which happen to have 3 values.)
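To make the format concrete, here is a minimal Python sketch of a reader for it. It is only an illustration under simplifying assumptions that are not part of the proposal: one statement per line, no language tags or datatypes, and no decoding of escape sequences. The names `parse_turtle_tuples` and `decode` are made up for this example, and the sample assumes the prefix maps to the namespace ending in `geos/`.

```python
import re

# A token is a quoted literal, a <full URI>, or any other run of non-space characters.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<[^>]*>|\S+')


def decode(token, prefixes):
    """Turn one token into None (NULL), ("literal", text) or ("uri", text)."""
    if token == "*":
        return None
    if token.startswith('"'):
        return ("literal", token[1:-1])  # escape sequences are left undecoded
    if token.startswith("<"):
        return ("uri", token[1:-1])
    prefix, _, local = token.partition(":")
    return ("uri", prefixes[prefix] + local)  # expand prefix:local to a full URI


def parse_turtle_tuples(text):
    """Parse a document into (column names, list of rows)."""
    prefixes, header, rows = {}, [], []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("."):
            line = line[:-1]  # drop the statement terminator
        tokens = TOKEN.findall(line)
        if not tokens:
            continue  # blank line
        if tokens[0] == "@prefix":
            # tokens look like: '@prefix', 'ciafb:', '<http://...>'
            prefixes[tokens[1][:-1]] = tokens[2][1:-1]
        elif tokens[0] == "@header":
            header = [t[1:-1] for t in tokens[1:]]
        else:
            rows.append([decode(t, prefixes) for t in tokens])
    return header, rows


SAMPLE = '''@header "Country"  "Name"  "NatRes".
@prefix ciafb: <http://www.odci.gov/cia/publications/factbook/geos/>.
ciafb:ag.html "Algeria" "natural gas".
ciafb:bs.html "Bassas da India" *.
'''

header, rows = parse_turtle_tuples(SAMPLE)
```

Because @prefix lines are handled wherever they occur, the same loop also copes with prefixes that are declared halfway through the document.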

Clearly, this is a very simple format. Advantages of this format over an XML-based format are:
  • Less overhead. In some quick tests I found the overhead in the form of XML-tagging to be some 30% of the text.
  • Easy to parse format. This particular format is even simpler than Turtle itself thanks to its regular structure. A (bulky) XML parser is not needed, reducing the weight of code that is loaded over the net when using applets, for example.
  • Easy to write format. One important feature of Turtle that is not available in XML is the ability to introduce new namespace prefixes anywhere in the document, and not just in the header. This makes it much easier to use namespace prefixes when the document needs to be written in a single pass. Using namespace prefixes in an XML document requires the writer to know which namespaces are used at the time it writes the document header. Using namespace prefixes is an important mechanism to reduce the size of the query results document.
  • No issues with the encoding of 'weird' Unicode characters. XML has some restrictions on which Unicode characters are allowed, whereas Turtle does not. A related posting can be found here.
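The single-pass writing mentioned in the list above could look roughly like the following Python sketch. It is an illustration only: the class name `TupleWriter`, the generated `ns1`, `ns2`, ... prefixes, the rule of splitting a URI after its last '/' or '#', and the `example.org` URIs are all assumptions made for this example. It also does not check that the local part is a legal Turtle name.

```python
import io


class TupleWriter:
    """Single-pass writer: a namespace gets its @prefix line the first time it is used."""

    def __init__(self, out, columns):
        self.out = out
        self.prefixes = {}  # namespace URI -> generated prefix
        out.write("@header " + "\t".join('"%s"' % c for c in columns) + ".\n")

    def _uri(self, uri):
        # Assumed splitting rule: the namespace ends at the last '/' or '#'.
        cut = max(uri.rfind("/"), uri.rfind("#")) + 1
        namespace, local = uri[:cut], uri[cut:]
        if not namespace or not local:
            return "<%s>" % uri  # cannot be abbreviated, write it in full
        if namespace not in self.prefixes:
            self.prefixes[namespace] = "ns%d" % (len(self.prefixes) + 1)
            self.out.write("@prefix %s: <%s>.\n" % (self.prefixes[namespace], namespace))
        return "%s:%s" % (self.prefixes[namespace], local)

    def _cell(self, value):
        if value is None:
            return "*"  # NULL
        kind, text = value
        if kind == "uri":
            return self._uri(text)
        return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')

    def write_row(self, row):
        # Cells are rendered first, so any new @prefix line precedes the row using it.
        cells = [self._cell(v) for v in row]
        self.out.write("\t".join(cells) + ".\n")


out = io.StringIO()
writer = TupleWriter(out, ["Country", "Name"])
writer.write_row([("uri", "http://example.org/geos/ag.html"), ("literal", "Algeria")])
writer.write_row([("uri", "http://example.org/other#x"), None])
print(out.getvalue(), end="")
```

The writer never needs to look ahead or buffer rows: it only remembers which namespaces it has already declared.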
Further compaction of the query results document is possible. As can be observed in the example query results document, values in some of the columns are highly repetitive. In my experience, this is very common in table-like query results. This repetitiveness can be used to shrink the document's size even further by using a special token to indicate that the value concerned is equal to the value in the same column on the previous row. I've chosen the '=' character for this token:
@header "Country"  "Name"  "NatRes".
@prefix ciafb: <http://www.odci.gov/cia/publications/factbook/geos/>.
ciafb:ag.html "Algeria" "natural gas".
= = "petroleum".
= = "iron ore".
ciafb:ly.html "Libya" "natural gas".
= = "petroleum".
ciafb:bs.html "Bassas da India" *.
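The '=' substitution and its inverse can be sketched in a few lines of Python. This is an illustration, not an implementation from Sesame: the function names `compress` and `expand` are made up here, and rows are assumed to be lists of already serialized cell tokens. Leaving a repeated '*' untouched is a choice made for this sketch, since replacing one character with another saves nothing.

```python
SAME = "="   # "same value as in this column on the previous row"
NULL = "*"


def compress(rows):
    """Replace each cell that repeats the cell above it with the '=' token."""
    previous = None
    for row in rows:
        if previous is None:
            yield list(row)  # the first row has nothing to refer back to
        else:
            yield [
                SAME if cell == above and cell != NULL else cell
                for cell, above in zip(row, previous)
            ]
        previous = row


def expand(rows):
    """Inverse of compress: fill every '=' with the value from the row above."""
    previous = None
    for row in rows:
        if previous is None:
            full = list(row)
        else:
            full = [above if cell == SAME else cell for cell, above in zip(row, previous)]
        yield full
        previous = full  # always compare against the expanded row


ROWS = [
    ["ciafb:ag.html", '"Algeria"', '"natural gas"'],
    ["ciafb:ag.html", '"Algeria"', '"petroleum"'],
    ["ciafb:ag.html", '"Algeria"', '"iron ore"'],
    ["ciafb:ly.html", '"Libya"', '"natural gas"'],
    ["ciafb:ly.html", '"Libya"', '"petroleum"'],
    ["ciafb:bs.html", '"Bassas da India"', "*"],
]

for row in compress(ROWS):
    print(" ".join(row) + ".")
```

Both directions only need the previous row, so the scheme works in a streaming reader or writer without buffering the whole result.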


Well, what do you think?

Arjohn
----------------------------------------
Arjohn Kampman, OpenRDF project lead, Aduna
----------------------------------------
[Edit 1 times, last edit by arjohn at Jan 24, 2005 10:27:30 AM]
Feb 8, 2005 1:08:02 PM

arjohn
Re: Turtle Tuples: Turtle-based query result format

Over the past week I have implemented both the format described above and a special-purpose binary format. Performance tests show that the TurtleTuples format is sometimes slower and sometimes faster than the XML format. The binary format is faster than both of them.

One of the tests was based on querying a "remote" server running on localhost that contained the wordnet schema and nouns. The repository that was used was a non-inferencing memory repository. The performance was measured by sending queries from a client which read back the results and counted them. The query that was used on the wordnet data was:
select *
from {X} <wn:wordForm> {Y}
using namespace
wn = <!http://www.cogsci.princeton.edu/~wn/schema/>
This query yielded a query result table with 174002 rows. The times spent from the moment the query was sent to the server until the last row had been counted were:
XML           : 12688 ms (100%)
Turtle-Tuples :  9724 ms (77%)
Binary        :  5839 ms (46%)
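For reference, the percentages are each time relative to the XML time. The small Python sketch below recomputes them from the figures above and also derives a rows-per-second throughput, which is an extra number computed here and not one that was measured separately.

```python
ROWS = 174002  # number of rows in the query result
timings_ms = {"XML": 12688, "Turtle-Tuples": 9724, "Binary": 5839}

baseline = timings_ms["XML"]

# Time relative to the XML format, rounded to whole percent.
relative = {name: round(100 * ms / baseline) for name, ms in timings_ms.items()}

# Derived throughput: rows handled per second of wall-clock time.
rows_per_second = {name: ROWS / (ms / 1000) for name, ms in timings_ms.items()}

for name in timings_ms:
    print(f"{name:<14}: {relative[name]:>3}%  {rows_per_second[name]:>8.0f} rows/s")
```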
Based on these and other results, we have decided to include the binary result format in Sesame and to ignore the TurtleTuples format for now. The new stuff will be included in the upcoming Sesame 1.1.1 release. Documentation for the binary result format can be found in the javadoc of interface org.openrdf.sesame.query.BinaryTableResultConstants (available through CVS only, for now).
[Edit 1 times, last edit by arjohn at Feb 8, 2005 1:09:44 PM]