TREC KBA 2014: Overview
Knowledge Base Acceleration (KBA) is an open evaluation in NIST's Text Retrieval Conference (TREC). KBA addresses this fundamental question:
filter a stream of documents to accelerate users filling in knowledge gaps.
KBA 2014 is completed! See papers and data!
Specifically, KBA poses this question in a stream filtering context and uses entities (people, organizations, buildings) as the subjects of interest. The dossiers to be kept current are profiles of these entities from a knowledge base (KB). As the stream of documents progresses, the entities change and evolve, so KBA systems must detect when vital, new information appears that would motivate an update to the KB.
KBA provides training data in the form of ground truth labels from the beginning of the stream. This simulates a human user training a filter with feedback (positive and negative examples) and then leaving the system to run open loop as the stream continues.
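The train-then-run-open-loop setup described above can be sketched as a split of the ground truth labels at a cutoff time. This is a minimal illustration only: the `Rating` record, its field names, and the `cutoff` value are hypothetical and are not the track's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One ground truth label (hypothetical layout, not the official format)."""
    doc_time: int   # epoch seconds of the rated document
    entity: str     # identifier of the target entity
    doc_id: str     # identifier of the rated document
    vital: bool     # True for a positive example, False for a negative one


def split_by_cutoff(ratings, cutoff):
    """Split labels into early-stream training feedback and held-out labels.

    Labels strictly before `cutoff` play the role of user feedback; labels
    at or after it are only used for scoring, so the system runs open loop.
    """
    train = [r for r in ratings if r.doc_time < cutoff]
    held_out = [r for r in ratings if r.doc_time >= cutoff]
    return train, held_out
```

A system would fit its filter on `train` only and never consult `held_out` while processing the rest of the stream.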
KBA provides a predefined list of entity profiles as query targets and a corpus organized in hourly batches of roughly 10^5 texts. Considering each hourly batch in chronological order, KBA systems perform three tasks:
- Vital Filtering: Identify "vital" documents containing timely, new information that should update the profile.
- Streaming Slot Filling: Identify short phrases that fill in basic attributes of the entity such as alternate names and birth date.
- Accelerate & Create (new open task!): Help the user rapidly assimilate new information into the KB.
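The hourly, chronological processing described above can be sketched as follows. This is an illustration under stated assumptions, not the track's reference pipeline: documents are plain dicts with hypothetical `id`, `time` (epoch seconds), and `text` keys, entity matching is a naive substring test on surface names, and `is_vital` stands in for whatever trained classifier a system uses.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_key(epoch_seconds):
    """Map a timestamp to a sortable UTC hour label such as '2012-01-04-05'."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d-%H")


def hourly_batches(docs):
    """Group documents by hour and yield the batches in chronological order."""
    batches = defaultdict(list)
    for doc in docs:
        batches[hour_key(doc["time"])].append(doc)
    # The zero-padded label sorts lexicographically in time order.
    for key in sorted(batches):
        yield key, batches[key]


def filter_stream(docs, entity_names, is_vital):
    """Return (hour, doc_id, entity) for every document judged vital.

    entity_names maps an entity id to its known surface names;
    is_vital(entity, doc) is a caller-supplied decision function.
    """
    decisions = []
    for hour, batch in hourly_batches(docs):
        for doc in batch:
            text = doc["text"].lower()
            for entity, names in entity_names.items():
                mentioned = any(name.lower() in text for name in names)
                if mentioned and is_vital(entity, doc):
                    decisions.append((hour, doc["id"], entity))
    return decisions
```

Processing one batch at a time keeps the simulation honest: a decision about a document can only depend on that batch and earlier ones.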
| When | Milestone |
|---|---|
| Now | NIST accepting KBA Data Access Agreements. |
| End of May | TREC registration closes. You must register to get KBA ground truth data and queries. |
| May | Updated 2014 corpus available for download. |
| End of June | Queries and all ground truth data released. |
| End of August | Run submissions due at NIST. |
| October | Submission deadline for TREC Notebook papers. |
| November | TREC Conference in Gaithersburg, MD. |
- The Streaming Slot Filling (SSF) task is wonderfully simplified. All participants should attempt SSF.
- The slot equivalence class concept from SSF 2013 has been deprecated and is not used in 2014.
- All target entities are selected from a single geographic region: the area between Seattle and Vancouver. This region was specially selected to prepare for cross-language Chinese-English KBA in the future.
- See streamcorpus.org for more info on tools for processing the corpus.
- For each entity, NIST assessors will populate a simple KB profile with basic slot values. Where possible, the NIST assessors will associate Wikipedia profiles from the enwiki-20120104-pages-articles.xml.xz snapshot. Many entities will not have any external profile, i.e. will be "nil" entities or "ghosts".
- When possible, NIST assessors will associate online profiles from other tools, such as Twitter or LinkedIn, with the KBA entity profiles. However, these tools generally do not support time filtering, so they cannot prevent a system from incorporating information that lies in the future from the perspective of the simulation. These other external profiles can therefore only be used in the Accelerate & Create task.
- The profiles created by NIST assessors will be hidden as evaluation data for the SSF task, so only those entities with Wikipedia profiles from the 2012-01-04 snapshot will have external profiles that can be used by CCR (Cumulative Citation Recommendation, the earlier name for Vital Filtering) systems. It is a crucial aspect of CCR that the only available data about an entity may be examples from the corpus. To that end, we will ensure that all entities have some amount of document rating data for training in the beginning of the stream.
- KBA has a third task called Accelerate & Create that can use all of the ground truth data and is open to any creative use of the track data aimed at the research question above.
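The restriction on external profiles in the list above (nothing from the simulation's future, and nothing that cannot be dated) can be sketched as a small guard. The `Fact` record and its fields are hypothetical names chosen for illustration; the track does not prescribe this representation.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """One piece of external profile data (hypothetical layout)."""
    entity: str
    slot: str
    value: str
    observed_at: Optional[int]  # epoch seconds, or None if the source is undatable


def visible_facts(facts, now):
    """Keep only facts known to exist at simulation time `now`.

    Facts with no timestamp are dropped, since there is no way to show
    that they do not leak information from later in the stream.
    """
    return [
        fact for fact in facts
        if fact.observed_at is not None and fact.observed_at <= now
    ]
```

Under this sketch, a dated Wikipedia snapshot passes the guard once the stream reaches its date, while an undated social media profile never does.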