TREC Knowledge Base Acceleration


Background on KBA

Knowledge Base Acceleration (KBA) is an open evaluation in NIST's Text Retrieval Conference (TREC). KBA addresses this fundamental question:

Given a rich dossier on a subject,
filter a stream of documents to
accelerate users filling in knowledge gaps.

Specifically, KBA poses this question in a stream-filtering context and uses entities (people, organizations, buildings) as the subjects of interest. The dossiers are profiles of the entities from a knowledge base (KB). As the stream of documents progresses, the entities change and evolve, so KBA systems must detect when vital new information appears that would motivate an update to the KB.

KBA provides training data in the form of ground truth labels from the beginning of the stream. This simulates a human user training a filter with feedback (positive and negative examples) and then leaving the system to run open loop as the stream continues.
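This train-then-run-open-loop protocol can be sketched in a few lines. The function name, data shapes, and the word-count scoring rule below are illustrative assumptions, not the actual KBA evaluation machinery: a scorer is fit on labeled documents from the start of the stream and then applied unchanged to later documents.

```python
# Toy sketch of training a filter on early-stream labels, then running it
# open loop (frozen, no further feedback) on the rest of the stream.
from collections import Counter

def train_scorer(labeled_docs):
    """labeled_docs: list of (text, label) pairs; label is True for positive."""
    pos, neg = Counter(), Counter()
    for text, label in labeled_docs:
        (pos if label else neg).update(text.lower().split())
    def score(text):
        # Net count of positive-associated minus negative-associated words.
        return sum(pos[w] - neg[w] for w in text.lower().split())
    return score

# Fit on (tiny, synthetic) labeled examples from the stream's beginning...
scorer = train_scorer([("board elects new ceo", True),
                       ("weather report sunny", False)])
# ...then apply the frozen scorer to later documents with no human in the loop.
```

A real system would use a proper classifier and richer features; the point is only the protocol: parameters are fixed once the labeled prefix of the stream is consumed.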

KBA provides a predefined list of entity profiles as query targets and a corpus organized in hourly batches of roughly 10^5 texts. Considering each hourly batch in chronological order, KBA systems perform various knowledge discovery tasks -- see the specific measurement goals for each year's tasks.
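The batch-by-batch setup above can be sketched as a loop over hourly batches in time order, matching documents against entity profiles. Everything here is a simplifying assumption (profiles reduced to alias sets, a crude substring-hit score), not the scoring actually used in KBA:

```python
# Minimal sketch of chronological stream filtering against entity profiles.
from collections import defaultdict

def filter_stream(hourly_batches, entity_aliases):
    """hourly_batches: list of (hour, [(doc_id, text), ...]).
    entity_aliases: dict mapping entity id -> set of surface-form strings.
    Returns, per entity, (hour, doc_id, score) triples in stream order."""
    results = defaultdict(list)
    for hour, docs in sorted(hourly_batches):      # chronological order
        for doc_id, text in docs:
            lowered = text.lower()
            for entity, aliases in entity_aliases.items():
                # Crude relevance score: fraction of aliases mentioned.
                hits = sum(1 for a in aliases if a.lower() in lowered)
                if hits:
                    results[entity].append((hour, doc_id, hits / len(aliases)))
    return dict(results)
```

Real KBA systems replace the alias match with trained relevance models, but the chronological, batch-at-a-time structure is the same.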

Why streams and entities?

What are knowledge bases and why accelerate them?

Here is an example of an entity from Wikipedia that illustrates several properties that make KBA research interesting:

KBA is related to several existing research activities in text analytics and text retrieval, including entity linking, relation extraction, knowledge base population, and topic detection & tracking. KBA combines elements of these lines of thinking by asking researchers to invent systems to participate in the human-driven process of assimilating information into a knowledge base (KB), like Wikipedia (WP) or Freebase (FB).

Incoming streams of new information are so large that even if a content routing engine perfectly connected each piece of inbound content with appropriate human curators and KB nodes, the humans would still fall behind. Thus, a routing system must actually run open loop without human feedback for extended periods of time, accumulating evidence about entities in the KB.

The KBA track is a forum for examining issues related to creating such systems, including:

To begin studying these issues, we are generating a time-stamped corpus of Web, news, and social media over a multi-month period. We are in the process of creating training & evaluation data by manually labeling this corpus with passage selections associated with KB nodes. In future years, we may consider other knowledge bases and streams, such as the stream of new articles in PubMed and KBs about proteins.

For the first year of KBA (TREC 2012), we conducted a simple filtering task and had 11 teams submit runs from 43 algorithmic approaches. TREC KBA 2013 had 14 teams and more than 140 submissions. KBA is continuing in 2014.

Systems in KBA 2012 & 2013 used a variety of approaches. Given the large training sets for some entities, machine learning approaches that used just words and phrases as features were often above the median in vital filtering. Given the rich link structure of WP, mining rich features from the KB also performed well. Several teams explored temporality within the cumulative citation recommendation (CCR) context.