Knowledge Base Acceleration (KBA) -- a track in NIST's TREC 2012

TREC KBA 2014: Vital Filtering

Knowledge Base Acceleration (KBA) is an open evaluation in NIST's Text Retrieval Conference (TREC). KBA addresses this fundamental question:

Given a rich dossier on a subject,
filter a stream of documents to
accelerate users filling in knowledge gaps.

The Vital Filtering task is the foundation task in KBA. All systems must perform this task. This task is also known as "Cumulative Citation Recommendation" or "CCR", and repeats the KBA 2012 and KBA 2013 task with improvements.

Basic Rules

The run submission format requires: task_id: kba-ccr-2014

Systems must iterate over the hourly directories of data in chronological order, hour by hour. Systems must not go back to promote documents using information from the future. There is no limit on the number of documents that may be included in a run submission.
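The chronological iteration described above can be sketched as follows. This is a minimal illustration, not the track's tooling: it assumes the hourly directories are named in a YYYY-MM-DD-HH pattern, and the names HOUR_FORMAT, sort_date_hours, and iter_hourly_dirs are hypothetical helpers introduced for this example.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, Iterator, List

# Assumption: each hourly directory is named like "2012-01-04-05".
HOUR_FORMAT = "%Y-%m-%d-%H"


def sort_date_hours(names: Iterable[str]) -> List[str]:
    """Return directory names in chronological order.

    Names that do not parse as a date_hour are skipped.
    """
    parsed = []
    for name in names:
        try:
            parsed.append((datetime.strptime(name, HOUR_FORMAT), name))
        except ValueError:
            continue  # not an hourly directory
    return [name for _, name in sorted(parsed)]


def iter_hourly_dirs(corpus_root: str) -> Iterator[Path]:
    """Yield hourly directories under corpus_root, oldest first."""
    root = Path(corpus_root)
    names = [p.name for p in root.iterdir() if p.is_dir()]
    for name in sort_date_hours(names):
        yield root / name
```

A system would process each yielded directory completely before moving to the next one, and never revisit an earlier hour with knowledge gained from a later one.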

All of the filtering topics and annotations from the beginning of the stream will be available to all teams at the start of the evaluation.

All entities will be specified by a "target_id" URL in Wikipedia or a URL invented for the track. Since many entities will not have a Wikipedia profile, the definition of the entity will consist of the training documents with byte offsets to mentions of the entity.

The "no future info" rule is still in effect: for any given date hour, systems may only access information about the entity from the past. Systems that use Twitter or Wikipedia APIs to access information about the entity must filter the data from those APIs to only consider information that was available before the date_hour being processed.
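One way to enforce the "no future info" rule on externally fetched data is to drop every record timestamped at or after the date_hour being processed. The sketch below assumes each record is a dict with a timezone-aware "timestamp" field; that record shape and the function name visible_at are hypothetical, not part of any Twitter or Wikipedia API.

```python
from datetime import datetime, timezone
from typing import Dict, List


def visible_at(records: List[Dict], date_hour: datetime) -> List[Dict]:
    """Keep only records that existed strictly before date_hour."""
    return [r for r in records if r["timestamp"] < date_hour]


# Hypothetical records, e.g. revisions or posts fetched from an API.
records = [
    {"timestamp": datetime(2012, 3, 1, 9, 30, tzinfo=timezone.utc), "text": "old"},
    {"timestamp": datetime(2012, 3, 1, 11, 0, tzinfo=timezone.utc), "text": "future"},
]
current_hour = datetime(2012, 3, 1, 10, tzinfo=timezone.utc)
usable = visible_at(records, current_hour)
```

Here only the 09:30 record remains usable while processing the 10:00 hour.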

While all of the ground truth data for KBA will be released with the queries to support the third task, Vital Filtering systems may only use the ground truth data up to the specific cutoff time.

It is possible for a document to be vital for multiple target entities.

Only those documents that have 'clean_visible' text are candidates for the task. The NIST assessors are instructed to discard any documents that are not primarily English.

Entity Selection Process

Instead of having the organizers hand-pick entities as we did in previous years, we are setting up the assessors to hand-pick entities from within a geographic domain. This will help ensure that the entities have more uniform coverage in the stream, and will hopefully find more interrelated entities.

Assessor Guidelines

kba-ccr-2014 has no novelty requirement, so if Justin Bieber were a target entity (he is not) and he happens to produce a new album, and two hundred StreamItems (documents) announce it within a very short time frame, e.g. one day, then in principle they are all citation worthy -- they all contain information that would update an already up-to-date profile.

The hard part of CCR is modeling the notion of citation worthiness and vitality: what would motivate a change and what does "already up-to-date" mean? The assessors are instructed to learn the background information about the entity, and then to adopt a subjective timeframe for how rapidly new information transitions to background information based on their own style and sense of the rate of change of the entity. Generally, the timeframe is less than a week and more than one hour. Since multiple reports about a change often provide a diversity of perspectives and nuance, there is a natural period of re-equilibration that accompanies each substantive change. The duration of this updating window is subjective and a key aspect of vital filtering, because this is a user-centric task.

Regarding what kind of information qualifies as "motivating" a change, the NIST assessors are instructed to approach each entity as though they are building a detailed profile or dossier appropriate to that specific entity. Some entities have more exciting/dramatic updates than others. Assessors must pick a subjective threshold for what to include. The threshold is generally above recording that the entity was mentioned in a particular newspaper (otherwise every mention would be inherently vital). The threshold is generally sensitive enough to include explicit meeting or place/time events involving the entity.

Since some entities are in Wikipedia, the assessors' mental model of a profile should look like a completed Wikipedia article. Other entities are less well known, and might not meet the notability requirements of Wikipedia -- in these cases, the NIST assessors are instructed to consider a profile appropriate for the entity, such as a Freebase article. The profile and its content should match the nature of the entity.

Relation to Streaming Slot Filling

Assessors treat the KBA corpus as the universe of available information for filling slots on profiles. By definition, any document that fills a slot is vital. Therefore, two forms of vital documents occur:

Documents describing current events that affect the entity
Documents that fill a previously empty slot on the profile

For entities with Wikipedia articles in the enwiki-20120104-pages-articles.xml.xz snapshot, some of the slots may already be filled at the start of the streamcorpus time range.

Pre-Hoc Judging

Both CCR and SSF are judged pre-hoc, and all the CCR annotations for an early portion of the stream are provided as training data for all the participants to use.

Documents available to the assessors will be selected from the billion-document stream corpus using high-recall (low-precision) name matching for the target geographic region, website hostnames relevant to that region, and/or surface form names of the entities. Participants' systems will probably find some true positive results that were not available to the annotators. In 2012, we assessed the recall of the pre-hoc judging process and concluded that for most entities it was over 90%.

While some systems may discover interesting pockets of unjudged documents, the pre-hoc judging process provides a valid and efficient means of comparing systems' approaches without pooling results from systems and re-judging post-hoc.

Rating Levels

The 2012 and 2013 annotations required assessors to input "contains_mention" as well as a rating level. For 2014, we have simplified this: the two highest rating levels (vital and useful) imply contains_mention=True; the lowest rating level (garbage) implies contains_mention=False. If necessary for computing some statistic, rating=Neutral(0) also implies contains_mention=True; however, neutral documents are often best ignored as not containing substantive positive or negative examples of mentions of the entity.
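The implication from rating level to contains_mention can be written down directly. The constant names and the function implied_contains_mention below are hypothetical, chosen only for this sketch.

```python
# Rating levels as described in this section.
VITAL, USEFUL, NEUTRAL, GARBAGE = 2, 1, 0, -1


def implied_contains_mention(rating: int) -> bool:
    """Return the contains_mention value implied by a 2014 rating.

    Only garbage (-1) implies that the document does not mention the entity.
    """
    if rating not in (VITAL, USEFUL, NEUTRAL, GARBAGE):
        raise ValueError(f"unknown rating: {rating}")
    return rating != GARBAGE
```

When neutral documents are excluded from a statistic, as suggested above, the caller would filter out rating 0 before applying this function.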

In 2013, KBA annotation had eight possible states from the cross-product of contains_mention=True|False and rating=-1,0,1,2. For 2014, we have reduced this to four possible states by eliminating ambiguous corner cases. In 2013, there was relatively low assessor agreement on rating=-1 versus rating=0, and contains_mention is always True for rating=1,2. The 2013 data can be mapped into this smaller set with these rules:

if contains_mention == False:
    ## all non-mentioning are now garbage
    rating_2014 = -1
elif rating_2013 == -1:
    ## was garbage and contains mentions, so change to neutral
    rating_2014 = 0
else:
    rating_2014 = rating_2013
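The rules above can be wrapped in a function and checked against all eight 2013 states. This is a sketch; the function name map_2013_to_2014 is an assumption introduced here.

```python
from itertools import product


def map_2013_to_2014(contains_mention: bool, rating_2013: int) -> int:
    """Map a 2013 (contains_mention, rating) pair to a 2014 rating."""
    if not contains_mention:
        # All non-mentioning documents are now garbage.
        return -1
    if rating_2013 == -1:
        # Was garbage but contains a mention, so it becomes neutral.
        return 0
    return rating_2013


# Enumerate the eight 2013 states and collect the 2014 ratings they map to.
states_2013 = list(product([True, False], [-1, 0, 1, 2]))
states_2014 = {map_2013_to_2014(cm, r) for cm, r in states_2013}
```

Running the enumeration confirms that the eight input states collapse onto the four 2014 ratings -1, 0, 1, and 2.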

Vital: rating=2 means that the document contains information that at the time it entered the stream would motivate an update to the entity's dossier with either a new slot value or timely, new info about the entity's current state, actions, or situation. The new info must motivate a change to an already up-to-date knowledge base article. Special cases include a deceased entity, in which case the entity's estate is the current embodiment of the entity and changes to the estate might trigger vital updates. The boundary between useful and vital is typified by documents in which the temporal context is vague or requires thought to ascertain, such as a job posting by an organization in which the assessor can deduce that the date on the job posting implies that the organization recently changed by opening the position. Assessors must judge whether such content is timeless (therefore Useful) or happening in time (therefore Vital).
(No change from 2013.)
Useful: rating=1 means possibly citable but not timely, e.g. background bio, primary or secondary source. After the basic slots are filled, this kind of text is merely useful, not vital. The boundary between neutral and useful is typified by documents that provide little detail about an entity, such as "Corduroy Mansions, by Alexander McCall Smith, is on the best seller list." Assessors must judge whether such content would be useful as a citable reference in the initial compilation of the profile for this particular entity.
(No change from 2013.)
Neutral: rating=0 means informative but not citable, e.g. a tertiary source, such as the Wikipedia article itself, is not relevant. The boundary between garbage and neutral is typified by mentioning documents that provide very little information about the entity, e.g. the entity name used in a product name or a passing reference like "this book's plot reminds me of Alexander McCall Smith." Assessors must make a subjective judgment based on the context and the particular entity whether such texts are neutral or garbage.
(No change from 2013.)
Garbage: (aka non-mentioning) rating=-1 means no information about the target entity, e.g. a surface form name appears and context confirms that it is a different entity. This also includes spam or junk data that clearly does not mention the entity. If the assessor is unsure whether the author intended to refer to the entity, then it is garbage.
Change from 2013: Garbage always means non-mentioning.

Metrics

The primary metric for vital filtering (CCR) is maximum macro-averaged F_1 measure. F_1 is a function of confidence cutoff. By sweeping the cutoff, we obtain a range of precision (P) & recall (R) scores for each target entity. After averaging P and R across the set of target queries, we then compute F_1 at each confidence threshold and take the maximum F_1 as the single score for the system. The SSF metric will be as similar as possible.
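The computation described above can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: a document counts as returned when its confidence is at or above the cutoff, precision is taken as 0 when nothing is returned, and the data shapes and function names (precision_recall, max_macro_f1) are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def precision_recall(
    scored: Dict[str, float], vital: Set[str], cutoff: float
) -> Tuple[float, float]:
    """Precision and recall for one entity at one confidence cutoff."""
    returned = {doc for doc, conf in scored.items() if conf >= cutoff}
    true_pos = len(returned & vital)
    precision = true_pos / len(returned) if returned else 0.0
    recall = true_pos / len(vital) if vital else 0.0
    return precision, recall


def max_macro_f1(
    run: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    cutoffs: Iterable[float],
) -> float:
    """Sweep cutoffs; average P and R over entities; return the best F_1."""
    if not truth:
        return 0.0
    best = 0.0
    for cutoff in cutoffs:
        pairs = [precision_recall(run.get(e, {}), truth[e], cutoff) for e in truth]
        p = sum(pr[0] for pr in pairs) / len(pairs)  # macro-averaged precision
        r = sum(pr[1] for pr in pairs) / len(pairs)  # macro-averaged recall
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        best = max(best, f1)
    return best


# Toy data: entity -> {doc_id: confidence}, and entity -> set of vital doc_ids.
run = {"A": {"d1": 900, "d2": 500, "d3": 100}, "B": {"d4": 800, "d5": 200}}
truth = {"A": {"d1", "d2"}, "B": {"d4"}}
score = max_macro_f1(run, truth, cutoffs=[0, 500, 1000])
```

Note that F_1 is computed from the averaged P and R at each cutoff, rather than by averaging per-entity F_1 values, which follows the order of operations given in the paragraph above.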

We are also interested in ranking measures and temporally oriented measures, and may add other secondary metrics.