TREC Knowledge Base Acceleration


KBA Stream Corpus 2013

This page is archival. The KBA Stream Corpus 2013 has been superseded by the KBA Stream Corpus 2014.

The TREC KBA Stream Corpus is the next generation of the kba-stream-corpus-2012.

To support the KBA tasks in TREC 2013, and also other tasks using the KBA corpus, we assembled a corpus in which every document has a time stamp in the time range from October 2011 through January 2013. This corpus is distributed by NIST (see below).

The content from the 2012 corpus is included in the 2013 corpus and provides the first 4973 hours of this extended corpus. The corpus is available in Amazon S3 and we recommend that you process the corpus using Amazon EC2.

Directory Structure

The chunk files are compressed, encrypted, and stored in a shallow directory hierarchy:

	  /stream-corpus/YYYY-MM-DD-HH/<substream>.<item count>.<md5>.xz.gpg

'YYYY-MM-DD-HH' denotes a year-month-day-hour in UTC.

'md5' represents the md5 hexdigest of that chunk's decrypted, uncompressed data. The number of stream_item instances per file varies between a few hundred and a few hundred thousand. This directory structure enables a variety of processing approaches, including multi-step MapReduce jobs.
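The path layout above can be split mechanically into its parts. The helper below is an illustrative sketch, not part of the corpus tooling; the regex and field names are assumptions based on the template shown:

```python
import re
from collections import namedtuple

# Hypothetical helper for the layout described above; the regex and
# the ChunkPath field names are assumptions, not official tooling.
ChunkPath = namedtuple("ChunkPath", "hour substream item_count md5")

_PATTERN = re.compile(
    r"/stream-corpus/(?P<hour>\d{4}-\d{2}-\d{2}-\d{2})/"
    r"(?P<substream>[^.]+)\.(?P<count>\d+)\.(?P<md5>[0-9a-f]{32})\.xz\.gpg$"
)

def parse_chunk_path(path):
    """Split a chunk path into its UTC hour directory and filename fields."""
    m = _PATTERN.search(path)
    if m is None:
        raise ValueError("not a stream-corpus chunk path: %r" % path)
    return ChunkPath(m.group("hour"), m.group("substream"),
                     int(m.group("count")), m.group("md5"))
```

Grouping chunks by the parsed hour field is one way to partition work across MapReduce tasks.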

Compressed and Encrypted

After serializing the data with thrift, we have compressed each chunk with XZ and encrypted it with GPG. The GPG keys are provided by NIST to organizations that sign the data use agreements.

Sentence Chunking, NER, and in-doc coref chains

We are preparing to re-run an entity tagger on the corpus to obtain sentence chunking and also in-doc coref chains. This will extend the 'ner' content from the 2012 corpus, which is:

TokenID, Token, Lemma, PartOfSpeechTag, EntityMentionType, StartOffset, EndOffset
Details will be posted here when the new NER tagging is defined.
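For working with the 2012-style 'ner' rows listed above, a parsing sketch may help. The tab-separated layout and the helper below are illustrative assumptions; only the seven field names come from the description above:

```python
from collections import namedtuple

# Field names follow the 2012 'ner' layout; the tab-separated
# format and this helper are illustrative assumptions.
NerToken = namedtuple(
    "NerToken",
    ["token_id", "token", "lemma", "pos", "mention_type",
     "start_offset", "end_offset"])

def parse_ner_row(line):
    """Parse one token row: TokenID, Token, Lemma, PartOfSpeechTag,
    EntityMentionType, StartOffset, EndOffset."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    tid, token, lemma, pos, mtype, start, end = fields
    return NerToken(int(tid), token, lemma, pos, mtype,
                    int(start), int(end))
```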

Thrift Serialization

The data inside each chunk file is serialized with thrift, which you can use to generate client code in most programming languages. For KBA 2012, we used the original v0_1_0 thrift definition. For KBA 2013, we have extended this thrift definition. Read here for more details:

See the streamcorpus project on GitHub for tools to help you interact with streamcorpus chunks.

This old script uses the original v0_1_0 thrift definition:

kba_thrift_verify uses the stats.json file that is present in every directory of the corpus to confirm that all of the data is present in that directory. Note that the kba_thrift_verify tool relies on four components that are not part of the Python standard library:
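The presence check can be sketched in a few lines. This is not the kba_thrift_verify tool itself; it assumes stats.json maps each chunk filename to its expected md5, which is a guess at the real schema:

```python
import json
import os

def verify_directory(dir_path):
    """Check that every chunk listed in stats.json is present.

    Assumes stats.json maps each chunk filename to its expected md5 --
    an assumption about the schema; consult the actual kba_thrift_verify
    tool for the authoritative format and the full integrity check.
    """
    with open(os.path.join(dir_path, "stats.json")) as f:
        stats = json.load(f)
    # Return the filenames listed in stats.json that are missing on disk;
    # an empty list means all listed chunks are present.
    return [name for name in stats
            if not os.path.exists(os.path.join(dir_path, name))]
```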

Obtaining the Corpus

See the corpus access page at NIST: