KBA Stream Corpus 2014
The KBA Stream Corpus 2014 was released in the third week of May 2014.
For full details on the KBA StreamCorpus, see the home page in S3: http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html
The TREC KBA Stream Corpus 2014 has three improvements over the kba-stream-corpus-2013:
- BBN's Serif NLP tools, including within-doc coref and dependency parses, are being run on all English-like documents.
- Better character encoding detection and conversion to UTF8, and more documents have clean_visible.
- epoch_ticks always agree with zulu_timestamp and the date_hour directory.
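The third point means the three time representations are interchangeable. Here is a minimal python sketch of that relationship (not part of the corpus tools; the epoch_ticks value and the exact zulu_timestamp string format are illustrative assumptions -- check streamcorpus.thrift for the authoritative definitions):

    from datetime import datetime

    epoch_ticks = 1328005800  # hypothetical value: seconds since the epoch, UTC

    utc = datetime.utcfromtimestamp(epoch_ticks)
    zulu_timestamp = utc.strftime("%Y-%m-%dT%H:%M:%SZ")  # ISO-8601 "zulu" style
    date_hour = utc.strftime("%Y-%m-%d-%H")              # directory name in the corpus

    print(zulu_timestamp, date_hour)   # 2012-01-31T10:30:00Z 2012-01-31-10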
See streamcorpus.org for more info on tools for processing the corpus.
Every document has a timestamp in the range from October 2011 through mid-February 2013.
We will post a complete list of paths in S3 when it is available.
Directory Structure
The data is stored in a shallow directory hierarchy:
/.../YYYY-MM-DD-HH/<substream source name>.<item count>.<md5>.xz.gpg
'YYYY-MM-DD-HH' denotes a year-month-day-hour in UTC.
'md5' represents the md5 hexdigest of that chunk's decrypted, uncompressed data. The number of StreamItem instances per file varies between a few dozen and five hundred. This directory structure enables a variety of processing approaches, including multi-step mapreduce jobs.
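For illustration, a minimal python sketch of how a chunk path can be decomposed (the example path, source name, and md5 are hypothetical; real values come from the corpus itself):

    from datetime import datetime

    # hypothetical path following the naming scheme above
    example = "2012-03-04-05/news.347.0f5c8e2c9a1b4d6e8f7a9b0c1d2e3f45.xz.gpg"

    date_hour, filename = example.split("/")
    hour_start = datetime.strptime(date_hour, "%Y-%m-%d-%H")   # UTC

    # <substream source name>.<item count>.<md5>.xz.gpg
    source, item_count, md5_hexdigest = filename.split(".")[:3]
    print(hour_start, source, int(item_count), md5_hexdigest)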
Compressed and Encrypted
After serializing the data with thrift, we compressed each chunk with XZ and encrypted it with GPG. The GPG keys are provided by NIST to organizations that sign the data use agreements.
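For reference, one way to undo the two layers by hand is sketched below in python, running gpg as a child process. This is only a sketch: it assumes python 3's lzma module, that the NIST-provided key is already in your gpg keyring, and a hypothetical chunk path.

    import hashlib
    import lzma        # xz support; on python 2, use backports.lzma
    import subprocess

    path = "2012-03-04-05/news.347.0f5c8e2c9a1b4d6e8f7a9b0c1d2e3f45.xz.gpg"  # hypothetical

    # gpg strips the outer encryption layer and writes plaintext to stdout
    xz_bytes = subprocess.check_output(["gpg", "--quiet", "--decrypt", path])

    # xz is the inner compression layer; what remains is thrift-serialized data
    thrift_bytes = lzma.decompress(xz_bytes)

    # the <md5> component of the file name is the hexdigest of exactly these bytes
    print(hashlib.md5(thrift_bytes).hexdigest(), len(thrift_bytes))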
NER, dependency parsing, in-doc coref chains
We are running BBN's Serif named entity recognizer, within-doc coreference resolver, and dependency parser on the corpus.
Thrift Serialization
The data inside each chunk file is serialized with thrift, which you can use to generate client code in most programming languages. For KBA 2014, we have extended the thrift definition; see https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift for details.
See the streamcorpus project in github for tools to help you interact with streamcorpus chunks. The python tools depend on:
- The python classes generated by thrift --gen (see above)
- The python bindings to thrift,
- XZ Utils command line tools, which the python tools run as a child process,
- GnuPG, for decrypting the corpus using the key provided to you by NIST, which is also run as a child process.
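Once those dependencies are installed, reading a chunk can look roughly like the sketch below. This is a hedged example: the Chunk constructor arguments and the file name are assumptions, so check the streamcorpus project's documentation for the exact API of the release you install.

    from streamcorpus import Chunk

    # hypothetical chunk file with the gpg layer already removed; recent
    # streamcorpus releases read xz-compressed chunks directly, but if yours
    # does not, decompress first as shown above
    chunk_path = "news.347.0f5c8e2c9a1b4d6e8f7a9b0c1d2e3f45.xz"

    for si in Chunk(path=chunk_path, mode="rb"):
        print(si.stream_id, si.stream_time.zulu_timestamp)
        if si.body is None:
            continue
        # clean_visible is the UTF8 visible text, when extraction succeeded
        if si.body.clean_visible:
            print(si.body.clean_visible[:200])
        # Serif sentences and tokens, when present (see the Serif section below)
        for sentence in (si.body.sentences or {}).get("serif", []):
            tokens = [tok.token for tok in sentence.tokens]  # do something with these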
Coping with the Big Data
While the corpus is large, each individual hour contains only about 10^5 docs. Teams have exploited this in several ways, including:
- Pre-indexing many hourly chunks in a search engine such as indri, solr, or elasticsearch, and then simulating an hourly stream by issuing queries restricted to each hour in sequential order. In implementing such an approach, one should avoid using future information by configuring ranking algorithms not to use statistics from future documents.
- Batch processing: you can iterate over the corpus as a sequence of ~4300 batches. This can be implemented using a MapReduce framework like Hadoop or even BashReduce.
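As a sketch of the batch approach, the loop below walks a local mirror of the corpus one date-hour directory at a time in chronological order; corpus_root and process_chunk are placeholders for your own mirror location and per-chunk logic:

    import os

    corpus_root = "/data/kba-streamcorpus-2014"   # hypothetical local mirror

    def process_chunk(chunk_path):
        # placeholder for your own map step: decrypt, parse, index, ...
        print(chunk_path)

    # YYYY-MM-DD-HH directory names sort chronologically as plain strings
    for date_hour in sorted(os.listdir(corpus_root)):
        hour_dir = os.path.join(corpus_root, date_hour)
        for chunk_name in sorted(os.listdir(hour_dir)):
            process_chunk(os.path.join(hour_dir, chunk_name))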
Serif
BBN's Serif tagger has been run to generate StreamItem.body.sentences['serif'] output on all English-like documents in the corpus. In addition to Token.entity_type, Token.mention_type, and Token.equiv_id (within-doc coref), this also provides Token.dependency_path and Token.parent_id.
Each dependency path shows the complete path through the constituent parse tree to get from a word to the word it depends on. This path consists of two parts: walking up the tree from the source word to the common ancestor node; and then walking down the tree to the word that the source word depends on. These are expressed as follows:
    PATH    := PATH_UP CTAG PATH_DN
    PATH_UP := "/" | "/" CTAG PATH_UP
    PATH_DN := "\" | "\" CTAG PATH_DN
    CTAG    := S | NP | PP | VP | etc...
Where PATH_UP is the path up from the source word to the common ancestor, the "CTAG" in the "PATH" expansion is the constituent tag for the common ancestor, and PATH_DN is the path back down to the word that the source word depends on. The "root" node (the one whose "parent_id" is -1) will always have an empty PATH_DN. Here's an example sentence:
    0  Comment  -1  /NPA/NP\
    1  on        0  /PP/NP\NPA\
    2  Third     3  /NPA\
    3  Baby      1  /NPA/PP\
    4  for       0  /PP/NP\NPA\
    5  Umngani   4  /NPA/PP\
    6  by        0  /PP/NP\NPA\
    7  ALLE      6  /NPP/PP\
And here's the corresponding parse tree (with constituent tags, but *not* part of speech tags):
(NP (NPA Comment) (PP on (NPA Third Baby)) (PP for (NPA Umngani)) (PP by (NPP ALLE)))
To take an example, if we start at the word "for" (word 4) and trace the path to the word it depends on (word 0), then we go up the tree "for->PP->NP", and then back down the tree "NP->NPA->Comment". Putting those paths together, we get "/PP/NP\NPA\", which is indeed the dependency label shown.
To reconstruct the parse tree that it came from, we can start with the root node ("Comment") -- based on its PATH_UP, we have:
(NP (NPA Comment))
Then we can add in the words that are dependent on it (words 1, 4, and 6), to get:
(NP (NPA Comment) (PP on) (PP for) (PP by))
(Note that the complete PATH_DN is actually more information than we really need to reconstruct the tree -- in particular, all we really need to know is how long PATH_DN is, i.e., "how high" to attach the dependent.) Moving on, we attach the words that are dependent on the prepositions to get:
(NP (NPA Comment) (PP on (NPA Baby)) (PP for (NPA Umngani)) (PP by (NPP ALLE)))
And finally we attach "Third" to "Baby" using the very short path "/NPA\" to get back to the original tree:
(NP (NPA Comment) (PP on (NPA Third Baby)) (PP for (NPA Umngani)) (PP by (NPP ALLE)))
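To make the label format concrete, here is a small python sketch (not part of the Serif or streamcorpus tools) that splits a dependency label into its PATH_UP, CTAG, and PATH_DN pieces using the grammar above:

    def split_dependency_path(path):
        # everything before the first backslash is PATH_UP plus the common
        # ancestor CTAG; the first backslash onward is PATH_DN
        up_part, sep, down_part = path.partition("\\")
        up_tags = [t for t in up_part.split("/") if t]
        down_tags = [t for t in (sep + down_part).split("\\") if t]
        ctag = up_tags[-1] if up_tags else None     # common ancestor
        return up_tags[:-1], ctag, down_tags

    # word 4 ("for") from the example sentence above
    print(split_dependency_path("/PP/NP\\NPA\\"))   # (['PP'], 'NP', ['NPA'])
    # the root word ("Comment") always has an empty PATH_DN
    print(split_dependency_path("/NPA/NP\\"))       # (['NPA'], 'NP', [])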
Obtaining the Corpus
See the corpus access page at NIST: http://trec.nist.gov/data/kba.html
The corpus is available in Amazon Web Services (AWS) S3:
- We recommend using Amazon's Elastic Compute Cloud (EC2) and Elastic Map Reduce (EMR) tools to process the corpus. While this will cost you compute charges, Amazon is hosting the corpus for free in this bucket:
s3://aws-publicdatasets/trec/kba/index.html
A useful tool for interacting with the corpus in S3 is http://s3tools.org/s3cmd.
- You can also retrieve the corpus using wget, like this:
    ## Fetch the list of directory names -- date-hour strings
    wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/(TBD...)

    ## Use GNU parallel to make multiple wget requests in parallel.
    ## The --continue flag makes this restartable.
    cat dir-names.txt | parallel -j 10 --eta 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http://s3.amazonaws.com/aws-publicdatasets/trec/kba/(TBD...)/{}/index.html'