Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)
Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotations were produced automatically and are imperfect. For each entity recognized with high confidence, an annotation with a link to Freebase is provided (see the details below).
For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.
Data Description
The source documents are from the TREC KBA Stream Corpus 2014. The annotation data for the corpus is provided as a collection of 2000 files (the partitioning is somewhat arbitrary) that total 196 GB, compressed (gz). Each file contains annotations for a batch of pages and the entities identified on each page. These annotations are freely available. Here is an excerpt from an annotation file:
In this example,
- "trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-11-01-18/social-230-796....b4.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f" is the name and location of the source of the document. The last element (after the #) is the KBA stream-corpus identifier.
- "Jan Schakowsky" is the entity mention string in the text.
- 82 and 96 are the beginning and end byte offsets of the entity mention in the raw content.
- 1 is the posterior of the entity given the mention string and the document context.
- 9.5480746e-06 is the posterior given just the document context (ignoring the mention itself).
- /m/0256nw is the Freebase identifier for the entity. To see the entity in Freebase: http://www.freebase.com/m/0256nw
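The fields listed above can be read with a few lines of Python. The following is only a minimal sketch: it assumes that each mention record is a single line holding the six mention fields separated by tabs, in the order described above, which this page does not state explicitly. The names Mention, parse_mention, stream_id, and freebase_url are hypothetical helpers introduced for illustration, and the path prefix in the usage lines is made up.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    text: str                 # entity mention string in the text
    begin: int                # beginning byte offset in the raw content
    end: int                  # end byte offset in the raw content
    posterior: float          # posterior given mention string and document context
    context_posterior: float  # posterior given the document context only
    mid: str                  # Freebase identifier, e.g. /m/0256nw


def parse_mention(line: str) -> Mention:
    """Parse one mention record (assumption: six tab-separated fields)."""
    text, begin, end, posterior, context_posterior, mid = line.strip().split("\t")
    return Mention(text, int(begin), int(end),
                   float(posterior), float(context_posterior), mid)


def stream_id(source_name: str) -> str:
    """Return the KBA stream-corpus identifier: the part after the last '#'."""
    return source_name.rsplit("#", 1)[-1]


def freebase_url(mid: str) -> str:
    """Build the Freebase URL for a MID, following the pattern shown above."""
    return "http://www.freebase.com" + mid


# Values taken from the example above; the path prefix is hypothetical.
m = parse_mention("Jan Schakowsky\t82\t96\t1\t9.5480746e-06\t/m/0256nw")
sid = stream_id("hypothetical/path.sc.xz.gpg#1320170878-1ede332c6b9dda59398a647bcfb1408f")
print(m.text, m.end - m.begin, freebase_url(m.mid), sid)
```

In the example, 96 - 82 = 14, which is the byte length of "Jan Schakowsky", so the end offset appears to be exclusive; under that reading, raw_content[m.begin:m.end] on the raw bytes would recover the mention. Since the files are gz-compressed, they can be streamed line by line with Python's gzip.open(path, "rt") rather than decompressed to disk first.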
Some documents are not included because they do not contain any mentions of Freebase entities recognized with high confidence.
There are 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase. On average, Stream Corpus documents have 19 annotated mentions per document. Additional statistics are available at fakba1_stats.txt.
Estimating the overall accuracy of the annotations is difficult given the size of the data. Based on a small random sample of annotated documents evaluated by trained linguists, it was found that approximately 9% of the mentions assigned a MID may be linked to an incorrect Freebase entity. Approximately 8% of the mentions that should be assigned MIDs are missed.
In addition, the query entities from the KBA 2014 TREC track were manually annotated with Freebase MIDs (or NIL). The annotated queries are available at trec-kba-2014-ccr-queries-annotated.tsv. The annotations are in a tab-separated file with four fields: the KBA entity id, KBA entity name, Freebase MID or "NIL" (manually assigned), and the KBA entity type. Approximately 45% of the target query entities occur in Freebase. For the 49 query entities in Freebase, we measured the document-level macro-averaged precision to be greater than 80% and the macro-averaged recall to be greater than 60% using the documents judged as "central" for the entity in the 2014 training and evaluation sets.
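As a rough sketch of how the four-field query file might be loaded, the snippet below uses Python's csv module with a tab delimiter. The column names in FIELDS, the helper names, and the sample rows (including the entity type values) are made up for illustration; whether the real file has a header row is not stated here, so the sketch simply keeps rows with exactly four fields.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical column names for the four fields described above.
FIELDS = ["kba_id", "kba_name", "mid", "kba_type"]


def read_queries(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab-separated query rows, keeping only rows with four fields."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]


def in_freebase(queries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the query entities whose third field is a MID rather than "NIL"."""
    return [q for q in queries if q["mid"] != "NIL"]


# Made-up rows for illustration only; they are not taken from the real file.
sample = io.StringIO(
    "https://example.org/entity/A\tEntity A\t/m/0256nw\tPER\n"
    "https://example.org/entity/B\tEntity B\tNIL\tORG\n"
)
queries = read_queries(sample)
linked = in_freebase(queries)
print(len(linked), "of", len(queries), "queries have a Freebase MID")
```

To read the actual file, the same function can be given an open file handle, for example open("trec-kba-2014-ccr-queries-annotated.tsv", encoding="utf-8", newline=""), assuming the file is UTF-8 encoded.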
Citation
If you use this data in a publication, please cite it as:
- Jeffrey Dalton, John R. Frank, Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya, "FAKBA1: Freebase annotation of TREC KBA Stream Corpus, Version 1 (Release date 2015-01-26, Format version 1, Correction level 0)", January 2015.
Please also include in the citation the URL where the data is available: http://trec-kba.org/data/fakba1/.
Other Data Releases
If you would like to learn more about data releases from Google, you may wish to consider subscribing to this low-traffic mailing list: http://goo.gl/MJb3A.
This data set was prepared by Jeffrey Dalton, Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya at Google.
Thanks to John Giannandrea, Jesse Saba Kirchner, Amanda Morris, Dave Orr, and Fernando Pereira for making this release possible.
Thanks to John R. Frank (MIT) for providing the Stream Corpus and for his assistance throughout the annotation process.