Google Freebase Entity Annotations of the TREC Knowledge Base Acceleration Stream Corpus 2014 (FAKBA1)

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotations were produced automatically and are imperfect. An annotation with a link to Freebase is provided for each entity recognized with high confidence (see the details below).

For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.

Data Description

The entity annotations are for the TREC KBA Stream Corpus 2014 and are freely available. The annotation data is provided as a collection of 2000 gzip-compressed files (the partitioning is somewhat arbitrary) that total 196 GB compressed. Each file contains the annotations for a batch of pages, listing the entities identified on each page. Here is an excerpt from an annotation file:

trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-11-01-18/social-230-796...b4.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f
Jan Schakowsky	82	96	1	9.5480746e-06	/m/0256nw
Schakowsky	416	21306	1	9.5480746e-06	/m/0256nw
Congress	122	130	0.97179073	0.0019803634	/m/07t31
SNAP	343	347	0.89219677	0.00039565749	/m/030gj7
Barbara Lee	781	792	0.99981463	7.9634192e-05	/m/024sjy

In this example,

"trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-11-01-18/social-230-796....b4.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f" is the name and location of the source of the document. The last element (after the #) is the KBA stream-corpus identifier.
"Jan Schakowsky" is the entity mention string in the text.
82 and 96 are the beginning and end byte offsets of the entity mention in the raw content.
1 is the posterior of the entity given the mention string and the document context.
9.5480746e-06 is the posterior given just the document context (ignoring the mention itself).
/m/0256nw is the Freebase identifier for the entity. To see the entity in Freebase: http://www.freebase.com/m/0256nw
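
The record layout described above can be read with a few lines of Python. This is a minimal sketch, not an official reader: it assumes that fields are separated by tabs or spaces, that any non-empty line which does not parse as a six-field mention record is a document header, and that the files are UTF-8 text once decompressed. The names Mention, parse_mention, split_source, iter_documents, and iter_gzip_documents are hypothetical and are not part of the release.

```python
import gzip
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Mention:
    text: str         # entity mention string as it appears in the page
    begin: int        # beginning byte offset in the raw content
    end: int          # end byte offset in the raw content
    p_mention: float  # posterior given the mention string and document context
    p_context: float  # posterior given the document context only
    mid: str          # Freebase identifier, e.g. /m/0256nw


def parse_mention(line: str) -> Optional[Mention]:
    """Return a Mention, or None if the line is not a six-field mention record."""
    # Split from the right so that spaces inside the mention string survive.
    parts = line.strip().rsplit(None, 5)
    if len(parts) != 6 or not parts[5].startswith("/m/"):
        return None
    text, begin, end, p_mention, p_context, mid = parts
    try:
        return Mention(text, int(begin), int(end),
                       float(p_mention), float(p_context), mid)
    except ValueError:
        return None


def split_source(header: str) -> Tuple[str, str]:
    """Split a document header into (source path, stream-corpus identifier)."""
    path, sep, stream_id = header.strip().rpartition("#")
    if not sep:  # no '#' present: treat the whole line as the path
        return header.strip(), ""
    return path, stream_id


def iter_documents(lines: Iterable[str]) -> Iterator[Tuple[str, List[Mention]]]:
    """Yield (header, mentions) pairs from the lines of one annotation file."""
    header: Optional[str] = None
    mentions: List[Mention] = []
    for line in lines:
        if not line.strip():
            continue
        mention = parse_mention(line)
        if mention is not None:
            mentions.append(mention)
            continue
        # Assumption: anything that is not a mention record starts a new document.
        if header is not None:
            yield header, mentions
        header, mentions = line.strip(), []
    if header is not None:
        yield header, mentions


def iter_gzip_documents(path: str) -> Iterator[Tuple[str, List[Mention]]]:
    """Stream one .gz annotation file without unpacking it to disk."""
    # Assumption: the decompressed content is UTF-8 text.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_documents(handle)
```

In four of the five rows of the excerpt, the end offset minus the beginning offset equals the length of the mention string (for example, 96 - 82 = 14 for "Jan Schakowsky"), which suggests that the end offset is exclusive. The "Schakowsky 416 21306" row does not follow this pattern, so code that slices the raw content by these offsets may want to check the span length first.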

Some documents are not included because they do not contain any mentions of Freebase entities recognized with high confidence.

There are 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase. On average, Stream Corpus documents have 19 annotated mentions each. Additional statistics are available at fakba1_stats.txt.

Estimating the overall accuracy of the annotations is difficult given the size of the data. Based on a small random sample of annotated documents evaluated by trained linguists, approximately 9% of the mentions assigned a MID may be linked to an incorrect Freebase entity, and approximately 8% of the mentions that should be assigned MIDs are missed.

In addition, the query entities from the KBA 2014 TREC track were manually annotated with Freebase MIDs (or NIL). The annotated queries are available at trec-kba-2014-ccr-queries-annotated.tsv. The annotations are in a tab-separated file with four fields: the KBA entity id, KBA entity name, Freebase MID or "NIL" (manually assigned), and the KBA entity type. Approximately 45% of the target query entities occur in Freebase. For the 49 query entities in Freebase, we measured the document-level macro-averaged precision to be greater than 80% and the macro-averaged recall to be greater than 60% using the documents judged as "central" for the entity in the 2014 training and evaluation sets.
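
The four-field query file described above can be loaded as follows. This is a sketch under stated assumptions: the file is assumed to be UTF-8, to have no header row, and to use no quoting. The names Query, read_queries, and fraction_in_freebase are hypothetical and are not part of the release.

```python
import csv
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    kba_id: str    # KBA entity id
    name: str      # KBA entity name
    mid: str       # Freebase MID, or the literal string "NIL"
    kba_type: str  # KBA entity type


def read_queries(lines: Iterable[str]) -> List[Query]:
    """Parse tab-separated query annotations, skipping malformed rows."""
    # Assumption: no quoting is used, so quote characters are read literally.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Query(*row) for row in reader if len(row) == 4]


def fraction_in_freebase(queries: List[Query]) -> float:
    """Share of query entities that carry a Freebase MID rather than NIL."""
    if not queries:
        return 0.0
    linked = sum(1 for q in queries if q.mid != "NIL")
    return linked / len(queries)


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    with open("trec-kba-2014-ccr-queries-annotated.tsv",
              newline="", encoding="utf-8") as handle:
        queries = read_queries(handle)
    print(len(queries), fraction_in_freebase(queries))
```

Treating "NIL" as a plain string rather than converting it to None keeps the third field uniform, so a query either has a MID that starts with /m/ or the marker "NIL".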

Citation

If you use this data in a publication, please cite it as:

Jeffrey Dalton, John R. Frank, Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya, "FAKBA1: Freebase annotation of TREC KBA Stream Corpus, Version 1 (Release date 2015-01-26, Format version 1, Correction level 0)", January 2015.

Please also include in the citation the URL where the data is available: http://trec-kba.org/data/fakba1/.

Other Data Releases

If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.

Acknowledgments

This data set was prepared by Jeffrey Dalton, Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya at Google.

Thanks to John Giannandrea, Jesse Saba Kirchner, Amanda Morris, Dave Orr, and Fernando Pereira for making this release possible.

Thanks to John R. Frank (MIT) for providing the Stream Corpus and for his assistance throughout the annotation process.
