At the simplest level, there are three classes:
- content-item.json, which holds 'raw' data and its transformed versions, such as 'cleansed' and 'ner'.
- http-metadata.json, which holds metadata from retrieving a document from a web server.
- corpus-item.json, which has a content-item called 'body' and a 'source_metadata' field that can be an instance of http-metadata.json.
The TREC KBA stream corpus consists of instances of stream-item.json, which extends corpus-item.json with stream_time and stream_id to give the corpus a temporal ordering.
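As a rough illustration of how these classes nest, the following Python sketch builds two stream items as plain dictionaries and sorts them by stream_time. Only the names 'raw', 'cleansed', 'ner', 'body', 'source_metadata', stream_time and stream_id come from the description above. The helper functions, the http-metadata fields ('url', 'status'), the use of a plain number for stream_time, and the stream_id values are assumptions made for the example, not the actual schema definitions.

```python
import json


def make_content_item(raw, cleansed=None, ner=None):
    """Hypothetical helper: a content-item holding raw data and its transforms."""
    return {"raw": raw, "cleansed": cleansed, "ner": ner}


def make_corpus_item(body, source_metadata):
    """Hypothetical helper: a corpus-item with a 'body' content-item and metadata."""
    return {"body": body, "source_metadata": source_metadata}


def make_stream_item(corpus_item, stream_time, stream_id):
    """Hypothetical helper: extend a corpus-item with stream_time and stream_id."""
    item = dict(corpus_item)  # shallow copy so the corpus-item is left unchanged
    item["stream_time"] = stream_time
    item["stream_id"] = stream_id
    return item


# Assumed http-metadata fields; the real schema may define different ones.
http_metadata = {"url": "http://example.com/a", "status": 200}

items = [
    make_stream_item(
        make_corpus_item(make_content_item("<p>later</p>", "later"), http_metadata),
        stream_time=200,  # assumption: a plain number stands in for the real time format
        stream_id="item-b",
    ),
    make_stream_item(
        make_corpus_item(make_content_item("<p>earlier</p>", "earlier"), http_metadata),
        stream_time=100,
        stream_id="item-a",
    ),
]

# stream_time gives the corpus its temporal ordering.
ordered = sorted(items, key=lambda it: it["stream_time"])
print(json.dumps([it["stream_id"] for it in ordered]))
```

Representing each class as a dictionary keeps the sketch close to the JSON documents themselves; validating against the actual .json schema files is left out here.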
The TREC KBA stream corpus contains three subcorpora, which have distinct 'source_metadata':