TREC Knowledge Base Acceleration

KBA Stream Corpus 2014

The KBA Stream Corpus 2014 was released in the third week of May 2014.

For full details on the KBA StreamCorpus, see the home page in S3: http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

The TREC KBA Stream Corpus 2014 has three improvements over the kba-stream-corpus-2013:

See streamcorpus.org for more info on tools for processing the corpus.

Every document has a time stamp in the range from October 2011 through mid-February 2013.

We will post a complete list of paths in S3 when it is available.

Directory Structure

The data is stored in a shallow directory hierarchy:

	  /.../YYYY-MM-DD-HH/<substream source name>.<item count>.<md5>.xz.gpg
	

'YYYY-MM-DD-HH' denotes a year-month-day-hour in UTC.

'md5' represents the md5 hexdigest of that chunk's decrypted, uncompressed data. The number of stream_item instances per file varies between a few dozen and five hundred. This directory structure enables a variety of processing approaches, including multi-step mapreduce jobs.
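For example, here is a small sketch (not part of the official tooling; the path below is made up) that pulls those pieces back out of a chunk path with a regular expression mirroring the naming scheme above:

   import re

   # Mirrors /.../YYYY-MM-DD-HH/<substream source name>.<item count>.<md5>.xz.gpg,
   # assuming the source name itself contains no dots.
   CHUNK_PATH = re.compile(
       r'(?P<hour>\d{4}-\d{2}-\d{2}-\d{2})/'     # year-month-day-hour, UTC
       r'(?P<source>[^.]+)\.'                    # substream source name
       r'(?P<count>\d+)\.'                       # number of stream_items in the chunk
       r'(?P<md5>[0-9a-f]{32})\.xz\.gpg$')       # md5 of the decrypted, uncompressed data

   # Hypothetical path, just to show the captured groups.
   m = CHUNK_PATH.search('2012-07-14-06/news.283.0123456789abcdef0123456789abcdef.xz.gpg')
   print(m.group('hour'), m.group('source'), m.group('count'), m.group('md5'))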

Compressed and Encrypted

After serializing the data with thrift, we have compressed each chunk with XZ and encrypted it with GPG. The GPG keys are provided by NIST to organizations that sign the data use agreements.
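As a rough sketch (assuming gpg is installed and the NIST-provided key has already been imported into your keyring; the file names are made up), a downloaded chunk can be turned back into a plain thrift chunk file like this:

   import lzma
   import subprocess

   def decrypt_and_decompress(path_in, path_out):
       # gpg removes the .gpg layer, lzma removes the .xz layer
       with open(path_in, 'rb') as f:
           encrypted = f.read()
       decrypted = subprocess.run(['gpg', '--decrypt'], input=encrypted,
                                  stdout=subprocess.PIPE, check=True).stdout
       with open(path_out, 'wb') as f:
           f.write(lzma.decompress(decrypted))

   # hypothetical file names
   decrypt_and_decompress('news.283.0123456789abcdef0123456789abcdef.xz.gpg',
                          'news.283.0123456789abcdef0123456789abcdef.sc')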

NER, dependency parsing, in-doc coref chains

We are running BBN's Serif named entity recognizer, within-doc coreference resolver, and dependency parser on the corpus.

Thrift Serialization

The data inside each chunk file is serialized with thrift, which you can use to generate client code in most programming languages. For KBA 2014, we have extended this thrift definition. Read here for more details: https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift.

See the streamcorpus project in github for tools to help you interact with streamcorpus chunks.
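For instance, with the Python streamcorpus package from that project installed, reading a chunk that has already been decrypted and decompressed might look roughly like this (the file name is made up):

   from streamcorpus import Chunk

   # iterate over the StreamItem instances serialized in one chunk file
   for stream_item in Chunk('news.283.0123456789abcdef0123456789abcdef.sc'):
       # each StreamItem carries a stream_id and the original URL, among other fields
       print(stream_item.stream_id, stream_item.abs_url)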

Coping with the Big Data

While the corpus is large, each individual hour contains only about 10^5 documents. Teams have exploited this in several ways, including:
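One such approach, sketched below under the assumption that the hour directories have already been synced to local disk (process_hour is a placeholder for whatever per-hour work your system does), is to hand whole hour directories to a pool of workers:

   import os
   from multiprocessing import Pool

   CORPUS_ROOT = '/path/to/kba-streamcorpus-2014'   # wherever you put the data

   def process_hour(hour_dir):
       # placeholder: e.g. decrypt, decompress, and tag every chunk in this hour
       chunks = os.listdir(os.path.join(CORPUS_ROOT, hour_dir))
       return hour_dir, len(chunks)

   if __name__ == '__main__':
       hours = sorted(os.listdir(CORPUS_ROOT))      # one directory per YYYY-MM-DD-HH
       with Pool(processes=8) as pool:
           for hour_dir, n_chunks in pool.imap_unordered(process_hour, hours):
               print(hour_dir, n_chunks)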

Serif

BBN's Serif tagger has been run to generate StreamItem.body.sentences['serif'] output on all English-like documents in the corpus. In addition to Token.entity_type, Token.mention_type, and Token.equiv_id (within-doc coref), this also provides Token.dependency_path and Token.parent_id.
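As a sketch (the chunk file name is made up; the field names follow the streamcorpus.thrift definition linked above), the Serif output can be read off each token like this:

   from streamcorpus import Chunk

   for stream_item in Chunk('news.283.0123456789abcdef0123456789abcdef.sc'):
       for sentence in stream_item.body.sentences.get('serif', []):
           for token in sentence.tokens:
               print(token.token,            # surface string
                     token.entity_type,      # NER label
                     token.mention_type,
                     token.equiv_id,         # within-doc coref chain id
                     token.parent_id,        # index of the word this one depends on
                     token.dependency_path)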

Each dependency path shows the complete path through the constituent parse tree to get from a word to the word it depends on. This path consists of two parts: walking up the tree from the source word to the common ancestor node; and then walking down the tree to the word that the source word depends on. These are expressed as follows:

   PATH     := PATH_UP CTAG PATH_DN
   PATH_UP  := "/" | "/" CTAG PATH_UP
   PATH_DN  := "\" | "\" CTAG PATH_DN
   CTAG     := S | NP | PP | VP | etc...

Where PATH_UP is the path up from the source word to the common ancestor, the "CTAG" in the "PATH" expansion is the constituent tag for the common ancestor, and PATH_DN is the path back down to the word that the source word depends on. The "root" node (the one whose "parent_id" is -1) will always have an empty PATH_DN. Here's an example sentence:

> 0 Comment    -1  /NPA/NP\
> 1 on          0  /PP/NP\NPA\
> 2 Third       3  /NPA\
> 3 Baby        1  /NPA/PP\
> 4 for         0  /PP/NP\NPA\
> 5 Umngani     4  /NPA/PP\
> 6 by          0  /PP/NP\NPA\
> 7 ALLE        6  /NPP/PP\

And here's the corresponding parse tree (with constituent tags, but *not* part of speech tags):

   (NP (NPA Comment)
       (PP on (NPA Third Baby))
       (PP for (NPA Umngani))
       (PP by (NPP ALLE)))

To take an example, if we start at the word "for" (word 4), and trace the path to the word it depends on (word 0), then we go up the tree from "for->PP->NP", and then back down the tree from "NP->NPA->Comment". Putting those paths together, we get "/PP/NP\NPA\", which is indeed the dependency label shown.
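Here is a tiny sketch (not part of the corpus tooling) of that decomposition, splitting a label into the tags walked up, the common ancestor, and the tags walked back down:

   def split_path(path):
       up_part, _, down_part = path.partition('\\')
       up_tags = [t for t in up_part.split('/') if t]        # word -> ancestor
       down_tags = [t for t in down_part.split('\\') if t]   # ancestor -> head word
       return up_tags[:-1], up_tags[-1], down_tags

   # the label for word 4, "for" (the backslashes are escaped in the string literal)
   print(split_path('/PP/NP\\NPA\\'))   # (['PP'], 'NP', ['NPA'])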

To reconstruct the parse tree that it came from, we can start with the root node ("Comment") -- based on its PATH_UP, we have:

     (NP (NPA Comment))

Then we can add in the words that are dependent on it (words 1, 4, and 6), to get:

     (NP (NPA Comment) (PP on) (PP for) (PP by))

(Note that the complete PATH_DN is actually more information than we really need to reconstruct the tree -- in particular, all we really need to know is how long PATH_DN is -- i.e., "how high" to attach the dependent). Moving on, we attach the words that are dependent on the prepositions to get:

   (NP (NPA Comment)
       (PP on (NPA Baby))
       (PP for (NPA Umngani))
       (PP by (NPP ALLE)))

And finally we attach "Third" to "Baby" using the very short path "/NPA\" to get back to the original tree:

   (NP (NPA Comment)
       (PP on (NPA Third Baby))
       (PP for (NPA Umngani))
       (PP by (NPP ALLE)))
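The whole procedure can be sketched in a few lines of Python (again, not part of the official tooling): build the root word's spine first, then attach every other word len(PATH_DN) + 1 levels above its head word.

   # The example tokens: (index, word, parent_id, dependency_path).
   TOKENS = [
       (0, 'Comment', -1, '/NPA/NP\\'),
       (1, 'on',       0, '/PP/NP\\NPA\\'),
       (2, 'Third',    3, '/NPA\\'),
       (3, 'Baby',     1, '/NPA/PP\\'),
       (4, 'for',      0, '/PP/NP\\NPA\\'),
       (5, 'Umngani',  4, '/NPA/PP\\'),
       (6, 'by',       0, '/PP/NP\\NPA\\'),
       (7, 'ALLE',     6, '/NPP/PP\\'),
   ]

   def split_path(path):
       # same decomposition as in the earlier sketch
       up_part, _, down_part = path.partition('\\')
       up_tags = [t for t in up_part.split('/') if t]
       down_tags = [t for t in down_part.split('\\') if t]
       return up_tags[:-1], up_tags[-1], down_tags

   class Node(object):
       def __init__(self, label):
           self.label, self.children = label, []
       def leftmost(self):
           # smallest token index under this node, used to order siblings
           return min(c[0] if isinstance(c, tuple) else c.leftmost()
                      for c in self.children)
       def render(self):
           kids = sorted(self.children,
                         key=lambda c: c[0] if isinstance(c, tuple) else c.leftmost())
           parts = [c[1] if isinstance(c, tuple) else c.render() for c in kids]
           return '(%s %s)' % (self.label, ' '.join(parts))

   def reconstruct(tokens):
       info = {i: (w, h, split_path(p)) for i, w, h, p in tokens}
       spine = {}  # spine[i] = ancestor Nodes of word i, innermost first

       def attach(i, ancestors):
           # build word i's own constituents (its PATH_UP) and hang them on the
           # lowest node in `ancestors`, which is its attachment point
           w, _, (up, _, _) = info[i]
           nodes = [Node(t) for t in up]            # innermost first
           for lower, upper in zip(nodes, nodes[1:]):
               upper.children.append(lower)
           leaf = (i, w)
           if nodes:
               nodes[0].children.append(leaf)
               ancestors[0].children.append(nodes[-1])
           else:
               ancestors[0].children.append(leaf)
           spine[i] = nodes + ancestors

       # the root word's common-ancestor tag is the top of the whole tree
       root_i = next(i for i, (w, h, p) in info.items() if h == -1)
       top = Node(info[root_i][2][1])
       attach(root_i, [top])

       # attach every other word once its head is in place; the attachment
       # point sits len(PATH_DN) + 1 levels above the head word
       pending = [i for i in info if i != root_i]
       while pending:
           i = next(j for j in pending if info[j][1] in spine)
           pending.remove(i)
           _, head, (_, _, down) = info[i]
           attach(i, spine[head][len(down):])
       return top

   print(reconstruct(TOKENS).render())
   # (NP (NPA Comment) (PP on (NPA Third Baby)) (PP for (NPA Umngani)) (PP by (NPP ALLE)))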

Obtaining the Corpus

See the corpus access page at NIST: http://trec.nist.gov/data/kba.html

The corpus is available in Amazon Web Services (AWS) S3: