TREC Knowledge Base Acceleration

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotation was performed automatically and is imperfect. For each entity recognized with high confidence, an annotation with a link to Freebase is provided (see the details below).

For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.

Data Description

The entity annotations are for the TREC KBA Stream Corpus 2014. These annotations are freely available. The annotation data for the corpus is provided as a collection of 2000 files (the partitioning is somewhat arbitrary) that total 196 GB, compressed (gz). Each file contains annotations for a batch of pages and the entities identified on each page. Here is an excerpt from an annotation file:

trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-11-01-18/social-230-796...b4.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f
Jan Schakowsky 82 96 1 9.5480746e-06 /m/0256nw
Schakowsky 416 426 1 9.5480746e-06 /m/0256nw
Congress 122 130 0.97179073 0.0019803634 /m/07t31
SNAP 343 347 0.89219677 0.00039565749 /m/030gj7
Barbara Lee 781 792 0.99981463 7.9634192e-05 /m/024sjy

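In this example, the first line identifies the document (its path in the corpus, followed by "#" and the stream id), and each subsequent line describes one entity mention: the mention text, the begin and end offsets of the mention in the document, two numeric scores, and the Freebase MID of the linked entity.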

Some documents are not included because they do not contain any mentions of Freebase entities recognized with high confidence.
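
The excerpt above can be read with a few lines of Python. The following is a minimal sketch, not an official reader. It assumes that the fields of a mention line are tab-separated, that a document header is any non-empty line that does not split into six fields, and that the files are UTF-8 text inside gzip. The names Mention, score_a, score_b, parse_annotations, and read_annotation_file are invented for illustration, and the meaning of the two scores is not assumed here.

```python
import gzip
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Mention:
    text: str       # mention text as it appears in the annotation line
    begin: int      # begin offset of the mention in the document
    end: int        # end offset of the mention in the document
    score_a: float  # first numeric score (meaning not assumed here)
    score_b: float  # second numeric score (meaning not assumed here)
    mid: str        # Freebase MID, e.g. "/m/07t31"


def parse_annotations(lines: Iterable[str]) -> Iterator[Tuple[str, List[Mention]]]:
    """Yield (document id, mentions) pairs from annotation lines.

    Assumption: mention lines have six tab-separated fields, and any
    other non-empty line starts a new document.
    """
    doc_id = None
    mentions: List[Mention] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == 6:
            text, begin, end, score_a, score_b, mid = fields
            mentions.append(
                Mention(text, int(begin), int(end), float(score_a), float(score_b), mid)
            )
        else:
            if doc_id is not None:
                yield doc_id, mentions
            doc_id, mentions = line, []
    if doc_id is not None:
        yield doc_id, mentions


def read_annotation_file(path: str) -> Iterator[Tuple[str, List[Mention]]]:
    """Stream one gzip-compressed annotation file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_annotations(handle)


# Small in-memory sample built from lines of the excerpt (tabs are assumed).
SAMPLE = (
    "trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-11-01-18/"
    "social-230-796...b4.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f\n"
    "Congress\t122\t130\t0.97179073\t0.0019803634\t/m/07t31\n"
    "Barbara Lee\t781\t792\t0.99981463\t7.9634192e-05\t/m/024sjy\n"
)

docs = list(parse_annotations(SAMPLE.splitlines()))
for doc_id, doc_mentions in docs:
    print(doc_id.rsplit("#", 1)[-1], [m.mid for m in doc_mentions])
```

Because the parser is a generator, a 196 GB collection can be processed one document at a time by calling read_annotation_file on each of the downloaded files in turn.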

There are 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase. On average, Stream Corpus documents have 19 annotated mentions per document. Additional statistics are available at fakba1_stats.txt.

Estimating the overall accuracy of the annotations is difficult given the size of the data. Based on a small random sample of annotated documents evaluated by trained linguists, it was found that approximately 9% of the mentions assigned a MID may be linked to an incorrect Freebase entity. Approximately 8% of the mentions that should be assigned MIDs are missed.

In addition, the query entities from the KBA 2014 TREC track were manually annotated with Freebase MIDs (or NIL). The annotated queries are available at trec-kba-2014-ccr-queries-annotated.tsv. The annotations are in a tab-separated file with four fields: the KBA entity id, KBA entity name, Freebase MID or "NIL" (manually assigned), and the KBA entity type. Approximately 45% of the target query entities occur in Freebase. For the 49 query entities in Freebase, we measured the document-level macro-averaged precision to be greater than 80% and the macro-averaged recall to be greater than 60% using the documents judged as "central" for the entity in the 2014 training and evaluation sets.
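
The four-field query file described above can be loaded as sketched below. This is only an illustration under stated assumptions: that the file has no header row and that rows with a field count other than four should be skipped. The names QueryEntity, read_queries, and freebase_coverage are invented here, and the two sample rows (including the example.org ids and the "PER" type label) are made up to show the shape of the data, not taken from the real file.

```python
import csv
from typing import Iterable, List, NamedTuple, Optional


class QueryEntity(NamedTuple):
    kba_id: str          # KBA entity id
    name: str            # KBA entity name
    mid: Optional[str]   # Freebase MID, or None when the file says "NIL"
    entity_type: str     # KBA entity type


def read_queries(lines: Iterable[str]) -> List[QueryEntity]:
    """Parse tab-separated query annotations (assumes no header row)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    queries = []
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed rows
        kba_id, name, mid, entity_type = row
        queries.append(
            QueryEntity(kba_id, name, None if mid == "NIL" else mid, entity_type)
        )
    return queries


def freebase_coverage(queries: List[QueryEntity]) -> float:
    """Fraction of query entities that were assigned a Freebase MID."""
    if not queries:
        return 0.0
    return sum(q.mid is not None for q in queries) / len(queries)


# Invented sample rows, for illustration only.
SAMPLE_ROWS = [
    "https://example.org/entity-a\tExample Person A\t/m/0256nw\tPER",
    "https://example.org/entity-b\tExample Person B\tNIL\tPER",
]

queries = read_queries(SAMPLE_ROWS)
print(f"{freebase_coverage(queries):.0%} of sample queries have a MID")
```

With the real file, the same function can be applied to an open file handle, for example open("trec-kba-2014-ccr-queries-annotated.tsv", encoding="utf-8", newline="").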

Citation

If you use this data in a publication, please cite it as:

Please also include in the citation the URL where the data is available: http://trec-kba.org/data/fakba1/.

Other Data Releases

If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.

Acknowledgments

This data set was prepared by Jeffrey Dalton, Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya at Google.

Thanks to John Giannandrea, Jesse Saba Kirchner, Amanda Morris, Dave Orr, and Fernando Pereira for making this release possible.

Thanks to John R. Frank (MIT) for providing the Stream Corpus and for his assistance throughout the annotation process.