TREC Knowledge Base Acceleration

TREC KBA 2014: Vital Filtering

Knowledge Base Acceleration (KBA) is an open evaluation in NIST's Text Retrieval Conference (TREC). KBA addresses this fundamental question:

Given a rich dossier on a subject,
filter a stream of documents to
accelerate users filling in knowledge gaps.

The Vital Filtering task is the foundation task in KBA. All systems must perform this task. It is also known as "Cumulative Citation Recommendation" or "CCR", and repeats the KBA 2012 and KBA 2013 tasks with improvements.

Basic Rules

The run submission format requires: task_id: kba-ccr-2014

Systems must iterate over the hourly directories of data in chronological order, hour by hour. Systems must not go back to promote documents using information from the future. There is no limit on the number of documents that may be included in a run submission.
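
The chronological requirement can be sketched as follows. This is a minimal illustration, assuming that each hourly directory is named by its date and hour in the form YYYY-MM-DD-HH; the function names and sample directory names are hypothetical.

```python
from datetime import datetime


def parse_date_hour(name):
    """Parse a directory name such as '2012-05-02-14' (assumed naming format)."""
    return datetime.strptime(name, "%Y-%m-%d-%H")


def chronological_hours(dir_names):
    """Return the hourly directory names sorted from earliest to latest."""
    return sorted(dir_names, key=parse_date_hour)


# Hypothetical directory listing, deliberately out of order.
hours = ["2012-05-02-14", "2011-10-07-23", "2012-05-02-03"]
ordered = chronological_hours(hours)

for date_hour in ordered:
    # A real system would score every document in this hour here and
    # commit its decisions before opening any later hour.
    print(date_hour)
```

Committing all decisions for one hour before reading the next is what prevents a system from going back to promote documents using later information.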

All of the filtering topics and annotations from the beginning of the stream will be available to all teams at the start of the evaluation.

All entities will be specified by a "target_id" URL in Wikipedia or a URL invented for the track. Since many entities will not have a Wikipedia profile, the definition of the entity will consist of the training documents with byte offsets to mentions of the entity.

The "no future info" rule is still in effect: for any given date hour, systems may only access information about the entity from the past. Systems that use Twitter or Wikipedia APIs to access information about the entity must filter the data from those APIs to only consider information that was available before the date_hour being processed.
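
One way to honor this rule when using external data is to timestamp every record and drop anything not strictly earlier than the hour being processed. The sketch below assumes each record is a dictionary with a timezone-aware "timestamp" field; the record layout and the function name are hypothetical and do not correspond to any particular API.

```python
from datetime import datetime, timezone


def visible_before(records, date_hour):
    """Keep only records whose timestamp is strictly earlier than date_hour."""
    return [r for r in records if r["timestamp"] < date_hour]


# The date_hour currently being processed (hypothetical value).
cutoff = datetime(2012, 5, 2, 14, tzinfo=timezone.utc)

# Hypothetical records fetched from an external source.
records = [
    {"text": "old revision",
     "timestamp": datetime(2012, 5, 1, 9, tzinfo=timezone.utc)},
    {"text": "future revision",
     "timestamp": datetime(2012, 5, 3, 0, tzinfo=timezone.utc)},
]

usable = visible_before(records, cutoff)
```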

While all of the ground truth data for KBA will be released with the queries to support the third task, Vital Filtering systems may only use the ground truth data up to the specific cutoff time.

It is possible for a document to be vital for multiple target entities.

Only those documents that have 'clean_visible' text are candidates for the task. The NIST Assessors are instructed to discard any documents that are not primarily English.

Entity Selection Process

Instead of having the organizers hand-pick entities as we did in previous years, we are setting up the assessors to hand-pick entities from within a geographic domain. This will help ensure that the entities have more uniform coverage in the stream, and will hopefully find more interrelated entities.

Assessor Guidelines

kba-ccr-2014 has no novelty requirement. If Justin Bieber were a target entity (he is not) and he released a new album, and two hundred StreamItems (documents) announced it within a very short time frame, e.g. one day, then in principle they would all be citation worthy -- they would all contain information that would update an already up-to-date profile.

The hard part of CCR is modeling the notion of citation worthiness and vitality: what would motivate a change and what does "already up-to-date" mean? The assessors are instructed to learn the background information about the entity, and then to adopt a subjective timeframe for how rapidly new information transitions to background information based on their own style and sense of the rate of change of the entity. Generally, the timeframe is less than a week and more than one hour. Since multiple reports about a change often provide a diversity of perspectives and nuance, there is a natural period of re-equilibration that accompanies each substantive change. The duration of this updating window is subjective and a key aspect of vital filtering, because this is a user-centric task.

Regarding what kind of information qualifies as "motivating" a change, the NIST assessors are instructed to approach each entity as though they are building a detailed profile or dossier appropriate to that specific entity. Some entities have more exciting/dramatic updates than others. Assessors must pick a subjective threshold for what to include. The threshold is generally above recording that the entity was mentioned in a particular newspaper (otherwise every mention would be inherently vital). The threshold is generally sensitive enough to include explicit meeting or place/time events involving the entity.

Since some entities are in Wikipedia, the assessors' mental model of a profile should look like a completed Wikipedia article. Other entities are less well known, and might not meet the notability requirements of Wikipedia -- in these cases, the NIST assessors are instructed to consider a profile appropriate for the entity, such as a Freebase article. The profile and its content should match the nature of the entity.

Relation to Streaming Slot Filling

Assessors treat the KBA corpus as the universe of available information for filling slots on profiles. By definition, any document that fills a slot is vital. Therefore, two forms of vital documents occur:

  1. Documents describing current events that affect the entity
  2. Documents that fill a previously empty slot on the profile

For entities with Wikipedia articles in the enwiki-20120104-pages-articles.xml.xz snapshot, some of the slots may already be filled at the start of the streamcorpus time range.

Pre-Hoc Judging

Both CCR and SSF are judged pre-hoc, and all the CCR annotations for an early portion of the stream are provided as training data for all the participants to use.

Documents available to the assessors will be selected from the billion-document stream corpus using high-recall (low-precision) matching on names for the target geographic region, website hostnames relevant to that region, and/or surface form names of the entities. Participants' systems will probably find some true positive results that were not available to the annotators. In 2012, we assessed the recall of the pre-hoc judging process and concluded that for most entities it was over 90%.
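
To give a sense of what high-recall, low-precision matching on surface forms can look like, here is a minimal sketch. It is not the selection procedure used by the organizers, which is not specified here; the function name and the example entity are hypothetical.

```python
def mentions_any(text, surface_forms):
    """Return True if any surface form occurs in text, ignoring case.

    Plain substring matching favors recall over precision: it also fires
    on other people or things that happen to share the same name.
    """
    lowered = text.casefold()
    return any(name.casefold() in lowered for name in surface_forms)


# Hypothetical entity with two surface forms.
forms = ["Jane Doe", "J. Doe"]

hit = mentions_any("Council member JANE DOE spoke on Tuesday.", forms)
miss = mentions_any("The council met on Tuesday.", forms)
```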

While some systems may discover interesting pockets of unjudged documents, the pre-hoc judging process provides a valid and efficient means of comparing systems' approaches without pooling results from systems and re-judging post-hoc.

Rating Levels

The 2012 and 2013 annotations required assessors to input "contains_mention" as well as a rating level. For 2014, we have simplified this: the two highest rating levels (vital and useful) imply contains_mention=True; the lowest rating level (garbage) implies contains_mention=False. If necessary for computing some statistic, rating=Neutral(0) also implies contains_mention=True; however, neutral documents are often best ignored, because they do not provide substantive positive or negative examples of mentions of the entity.

In 2013, KBA annotation had eight possible states from the cross-product of contains_mention=True|False and rating=-1,0,1,2. For 2014, we have reduced this to four possible states by eliminating ambiguous corner cases. In 2013, there was relatively low assessor agreement on rating=-1 versus rating=0, and contains_mention is always True for rating=1,2. The 2013 data can be mapped into this smaller set with these rules:

if not contains_mention:
    # all non-mentioning documents are now garbage
    rating_2014 = -1
elif rating_2013 == -1:
    # was garbage but contains a mention, so change to neutral
    rating_2014 = 0
else:
    # neutral, useful, and vital ratings carry over unchanged
    rating_2014 = rating_2013
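
As a quick check, the rules above can be wrapped in a function and applied to all eight 2013 states. The function name map_2013_to_2014 is illustrative only and is not part of any official tooling.

```python
from itertools import product


def map_2013_to_2014(contains_mention, rating_2013):
    """Apply the mapping rules above to a single 2013 annotation."""
    if not contains_mention:
        return -1  # all non-mentioning documents become garbage
    if rating_2013 == -1:
        return 0   # garbage that contains a mention becomes neutral
    return rating_2013


# Enumerate the full 2013 cross-product: 2 x 4 = 8 states.
mapping = {
    (cm, rating): map_2013_to_2014(cm, rating)
    for cm, rating in product([True, False], [-1, 0, 1, 2])
}

for state, rating_2014 in sorted(mapping.items()):
    print(state, "->", rating_2014)
```

The eight input states collapse onto the four 2014 ratings -1, 0, 1, and 2.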

Metrics

The primary metric for vital filtering (CCR) is maximum macro-averaged F_1 measure. F_1 is a function of confidence cutoff. By sweeping the cutoff, we obtain a range of precision (P) and recall (R) scores for each target entity. After averaging P and R across the set of target entities, we then compute F_1 = 2PR/(P+R) at each confidence threshold and take the maximum F_1 as the single score for the system. The SSF metric will be as similar as possible.
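
The computation described above can be sketched as follows. This is an illustration and not the official scoring tool. It assumes integer confidence scores, a cutoff sweep in steps of 100, precision of zero for an entity when nothing is retrieved at a cutoff, and at least one entity in the run; the function name, data layout, and entity names are hypothetical.

```python
def max_macro_f1(runs, num_vital, cutoffs):
    """Return the maximum macro-averaged F_1 over the given cutoffs.

    runs:      {entity: [(confidence, is_vital), ...]} for submitted documents
    num_vital: {entity: number of vital documents in the ground truth}
    """
    best = 0.0
    for cutoff in cutoffs:
        precisions, recalls = [], []
        for entity, items in runs.items():
            # Documents the system keeps at this confidence cutoff.
            retrieved = [vital for conf, vital in items if conf >= cutoff]
            tp = sum(retrieved)
            precisions.append(tp / len(retrieved) if retrieved else 0.0)
            total = num_vital[entity]
            recalls.append(tp / total if total else 0.0)
        # Macro-average P and R across entities, then compute F_1.
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, f1)
    return best


# Toy run for two hypothetical entities.
runs = {
    "entity_a": [(900, True), (400, False), (300, True)],
    "entity_b": [(800, False), (700, True)],
}
num_vital = {"entity_a": 2, "entity_b": 1}

score = max_macro_f1(runs, num_vital, range(0, 1001, 100))
```

Note that P and R are averaged across entities before F_1 is computed, which differs from averaging per-entity F_1 values.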

We are also interested in ranking measures and temporally oriented measures, and may add other secondary metrics.