KBA Stream Corpus 2013
This page is archival. The KBA Stream Corpus 2013 has been superseded by the KBA Stream Corpus 2014.
The TREC KBA Stream Corpus 2013 is the next generation of the kba-stream-corpus-2012.
To support the KBA tasks in TREC 2013, as well as other tasks using the KBA corpus, we assembled a corpus in which every document has a time stamp in the range from October 2011 through January 2013. This corpus is distributed by NIST (see below).
The content from the 2012 corpus is included in the 2013 corpus and provides the first 4973 hours of this extended corpus. The corpus is available in Amazon S3, and we recommend processing it with Amazon EC2.
Directory Structure
The files contain thrift-serialized stream items (see below), compressed with XZ and encrypted with GPG, and are stored in a shallow directory hierarchy:
/stream-corpus/YYYY-MM-DD-HH/<substream>.<item count>.<md5>.xz.gpg
'YYYY-MM-DD-HH' denotes a year-month-day-hour in UTC.
'md5' is the md5 hexdigest of that chunk's decrypted, uncompressed data. The number of stream_item instances per file varies from a few hundred to a few hundred thousand. This directory structure enables a variety of processing approaches, including multi-step mapreduce jobs.
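As an illustration, here is a minimal sketch of splitting a chunk path into the fields described above; the helper names (ChunkPath, parse_chunk_path) are inventions of this example, not part of any corpus tooling:

    import os
    import re
    from collections import namedtuple

    # Fields follow /stream-corpus/YYYY-MM-DD-HH/<substream>.<item count>.<md5>.xz.gpg
    ChunkPath = namedtuple('ChunkPath', ['hour', 'substream', 'item_count', 'md5'])

    # <substream> may itself contain dots, so anchor on the trailing fields.
    FILENAME_RE = re.compile(
        r'^(?P<substream>.+)\.(?P<count>\d+)\.(?P<md5>[0-9a-f]{32})\.xz\.gpg$')

    def parse_chunk_path(path):
        hour = os.path.basename(os.path.dirname(path))  # e.g. '2011-10-07-14'
        match = FILENAME_RE.match(os.path.basename(path))
        if match is None:
            raise ValueError('not a chunk file: %r' % path)
        return ChunkPath(hour, match.group('substream'),
                         int(match.group('count')), match.group('md5'))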
Compressed and Encrypted
After serializing the data with thrift, we compressed each chunk with XZ and encrypted it with GPG. The GPG keys are provided by NIST to organizations that sign the data use agreements.
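A minimal sketch of reversing this pipeline, running gpg and xz as child processes (the same approach the kba_thrift_verify tool described below takes); it assumes the NIST-provided key has already been imported into your GPG keyring:

    import subprocess

    def decrypt_and_decompress(path):
        # gpg --decrypt writes the XZ-compressed bytes to stdout;
        # xz --decompress --stdout turns them back into raw thrift data.
        gpg = subprocess.Popen(['gpg', '--decrypt', path],
                               stdout=subprocess.PIPE)
        xz = subprocess.Popen(['xz', '--decompress', '--stdout'],
                              stdin=gpg.stdout, stdout=subprocess.PIPE)
        gpg.stdout.close()  # let gpg see SIGPIPE if xz exits early
        data, _ = xz.communicate()
        if gpg.wait() != 0 or xz.returncode != 0:
            raise RuntimeError('failed to decrypt/decompress %r' % path)
        return data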
Sentence Chunking, NER, and In-doc Coref Chains
We are preparing to re-run an entity tagger on the corpus to obtain sentence chunking and in-doc coref chains. This will extend the 'ner' content from the 2012 corpus, which consists of:
TokenID, Token, Lemma, PartOfSpeechTag, EntityMentionType, StartOffset, EndOffset
Details will be posted here when the new NER tagging is defined.
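For readers still using the 2012 'ner' content, here is a minimal sketch of holding those seven fields in Python; the tab separator and the NerToken/parse_ner_line names are assumptions of this example, not part of the corpus definition:

    from collections import namedtuple

    NerToken = namedtuple('NerToken', [
        'token_id', 'token', 'lemma', 'pos_tag',
        'entity_mention_type', 'start_offset', 'end_offset'])

    def parse_ner_line(line):
        # Assumes one tab-separated token record per line.
        fields = line.rstrip('\n').split('\t')
        return NerToken(int(fields[0]), fields[1], fields[2], fields[3],
                        fields[4], int(fields[5]), int(fields[6]))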
Thrift Serialization
The data inside each chunk file is serialized with thrift, which can generate client code in most programming languages. For KBA 2012, we used https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_1_0.thrift. For KBA 2013, we have extended this thrift definition; see https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift for more details.
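As a sketch of what reading a chunk looks like in Python, assuming each chunk is a back-to-back sequence of StreamItem structs in thrift's binary protocol and that thrift --gen py produced a module importable as streamcorpus.ttypes (both assumptions of this example):

    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol
    from streamcorpus.ttypes import StreamItem  # generated by `thrift --gen py`

    def iter_stream_items(chunk_bytes):
        # Wrap the decrypted, decompressed chunk bytes in an in-memory
        # transport and read StreamItem structs until the buffer is empty.
        transport = TTransport.TMemoryBuffer(chunk_bytes)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        while True:
            item = StreamItem()
            try:
                item.read(protocol)
            except EOFError:
                break
            yield item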
See the streamcorpus project on GitHub for tools to help you interact with streamcorpus chunks.
This old script uses the original v0_1_0 thrift definition:
kba_thrift_verify uses the stats.json file that is present in every directory of the corpus to confirm that all of the data is present in that directory (a minimal sketch in the same spirit follows the list below). Note that the kba_thrift_verify tool relies on four components that are not part of the Python standard library:
- The Python classes generated by thrift --gen (see above),
- The Python bindings to thrift,
- The XZ Utils command line tools, which it runs as a child process,
- GnuPG, for decrypting the corpus with the key provided to you by NIST, which it also runs as a child process.
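A minimal sketch of the per-chunk check, using the same gpg|xz child-process pipeline shown earlier and comparing against the md5 hexdigest embedded in the filename; the stats.json cross-check is omitted because its format is not documented on this page, and verify_chunk is an invention of this example:

    import hashlib
    import os
    import subprocess

    def verify_chunk(path):
        # The filename ends in <md5>.xz.gpg, so the hexdigest is the
        # third-from-last dot-separated field.
        expected_md5 = os.path.basename(path).split('.')[-3]
        gpg = subprocess.Popen(['gpg', '--decrypt', path],
                               stdout=subprocess.PIPE)
        xz = subprocess.Popen(['xz', '--decompress', '--stdout'],
                              stdin=gpg.stdout, stdout=subprocess.PIPE)
        gpg.stdout.close()
        data, _ = xz.communicate()
        gpg.wait()
        return hashlib.md5(data).hexdigest() == expected_md5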
Obtaining the Corpus
See the corpus access page at NIST: http://trec.nist.gov/data/kba.html