Version: 2.0

NLP Model Training

Introduction

NLP-model-training is a microservice responsible for building and training models in Python with TensorFlow.

Project Structure

NLP-model-training is designed as a simple command line interface (CLI) for training natural language processing (NLP) models. It provides the commands prep, train, eval, cleanup, update_model_metadata_registry, and save.

The train command contains a complete end-to-end NLP model training pipeline with the tasks build_model, preprocess, combine_datasets, cache, train, and model_policy.

CLI commands

prep

prep retrieves all the necessary dependencies in order to train a specified model and sets up the basic expected local training directory structure.

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| datasets | JSON object with dataset info to train upon | True | None | see Airflow Documentation | N/A |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| gg_version | GraphGrid version to use | True | None | 1.4.0, 2.0.0, latest (only with local Airflow development) | N/A |
| request_body | Accepts a raw GraphGrid Airflow request body | False | None | see here | N/A |
| log_level | Flag to determine the level of logging | False | INFO | DEBUG, INFO, WARN, ERROR | N/A |
| dataset_bucket | name of the bucket which contains the training and evaluation datasets | True | None | graphgrid-datasets | N/A |
| bert_bucket | name of the bucket which contains a BERT model | True | None | graphgrid-nlp-models | Removed in 2.0.0 |
| endpoint_url | s3 or MinIO endpoint for dataset retrieval and saving resources out to | True | None | https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development) | N/A |
| remote_bert_model_path | path to the BERT model on remote cloud environment | True | None | resources/1.4/bertModel/ | Removed in 2.0.0 |
| remote_vocab_path | path to the vocab.txt on remote cloud environment for the associated BERT model on cloud storage service | True | None | resources/1.4/vocab.txt | Removed in 2.0.0 |
| top_level_path | Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the prep task, otherwise it's expected to already exist. | True | None | N/A | N/A |
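
The rule attached to top_level_path (it must not exist before prep, and must exist before every other command that uses it) can be sketched as a small pre-flight check. This is an illustrative sketch only: check_top_level_path is a hypothetical helper, not part of the NLP-model-training CLI, and the exception types are an assumption.

```python
import os
import tempfile


def check_top_level_path(command: str, top_level_path: str) -> None:
    """Raise if top_level_path violates the rule documented for the CLI commands."""
    exists = os.path.exists(top_level_path)
    if command == "prep" and exists:
        raise FileExistsError(f"{top_level_path} must not exist before prep runs")
    if command != "prep" and not exists:
        raise FileNotFoundError(f"{top_level_path} must exist before {command} runs")


with tempfile.TemporaryDirectory() as tmp:
    fresh = os.path.join(tmp, "training")
    check_top_level_path("prep", fresh)   # passes: the path does not exist yet
    os.makedirs(fresh)                    # stand-in for the directory prep would create
    check_top_level_path("train", fresh)  # passes: the path now exists
```

Running such a check before each command would surface a stale or missing training directory early, rather than partway through a long training run.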

train

train trains an NLP model based on the specified parameters. This command runs a full end-to-end training pipeline, which first preprocesses text from the raw dataset/corpus format into a more standardized format (such as CoNLL). Following that transformation, the data is processed further in preparation for being embedded through BERT. This includes tokenization, truncation, and tracking of sentence metadata (word lengths, indices, etc.). Once the data is embedded through BERT, it undergoes further processing with post-BERT embedding handlers. Handlers vary heavily depending on the model; however, one common operation is averaging subword tokens (e.g. averaging the embeddings for the multi-token word "ham ##burger"). See below for a simplified example processing pipeline that a sample undergoes.

[Figure: simplified example of the processing pipeline that a sample undergoes]

Once preprocessing through handlers is complete, data is cached for use during training. This allows models to only load the data that they are currently using during training to avoid massive memory consumption. After the models are trained, they are saved and a model policy is created documenting various characteristics of the previous training session (see here for information on model policies).

In addition to the training resources that are generated during training, a metadata.json file is generated at top_level_path/model_save/metadata.json, which contains relevant information to propagate to the evaluation task. This file is used as a resource and reference for the model metadata registry (see here for more information regarding the model metadata registry).

Below is an example generated metadata.json file for a named_entity_recognition model.

{
  "modelVersion": {
    "modelName": "named_entity_recognition",
    "trainingDataset": [
      "conll"
    ],
    "trainingAccuracy": 0.989,
    "trainingLoss": 0.007,
    "evalAccuracy": null,
    "evalLoss": null,
    "properties": {
      "languages": [
        "en"
      ]
    }
  }
}

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| datasets | JSON object with dataset info to train upon | True | None | see Airflow Documentation | N/A |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| retraining | boolean for whether model is being retrained | False | false | true, false | N/A |
| request_body | Accepts a raw GraphGrid Airflow request body | False | None | see here | N/A |
| log_level | Flag to determine the level of logging | False | INFO | DEBUG, INFO, WARN, ERROR | N/A |
| top_level_path | Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the prep task, otherwise it's expected to already exist. | True | None | N/A | N/A |

eval

eval reloads a trained model into memory from top_level_path/model_save/ and evaluates the model's performance on the evaluation test set (if one exists for the given model). Once that is complete, the metadata.json file from top_level_path/model_save/metadata.json is updated with the evaluation results.
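
The metadata update that eval performs can be pictured with a short sketch. It assumes only the metadata.json layout shown in the train section; record_eval_results is a hypothetical helper name, the accuracy and loss values are made up, and the real eval command may update the file differently.

```python
import json
import os
import tempfile


def record_eval_results(metadata_path, eval_accuracy, eval_loss):
    """Write evaluation results into an existing metadata.json and return its contents."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    metadata["modelVersion"]["evalAccuracy"] = eval_accuracy
    metadata["modelVersion"]["evalLoss"] = eval_loss
    with open(metadata_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata


# Usage with a throwaway directory standing in for top_level_path.
with tempfile.TemporaryDirectory() as top_level_path:
    save_dir = os.path.join(top_level_path, "model_save")
    os.makedirs(save_dir)
    path = os.path.join(save_dir, "metadata.json")
    with open(path, "w") as f:
        json.dump({"modelVersion": {"evalAccuracy": None, "evalLoss": None}}, f)
    updated = record_eval_results(path, eval_accuracy=0.9, eval_loss=0.1)  # made-up numbers
    print(updated["modelVersion"])
```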

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| datasets | JSON object with dataset info to train upon | True | None | see Airflow Documentation | N/A |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| top_level_path | Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the prep task, otherwise it's expected to already exist. | True | None | N/A | N/A |

save

save saves the trained model, resources, and dependencies out to the cloud storage environment as specified by the endpoint_url parameter.

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| gg_version | GraphGrid version to use | True | None | 1.4.0, 2.0.0, latest (only with local Airflow development) | N/A |
| request_body | Accepts a raw GraphGrid Airflow request body | False | None | see here | N/A |
| endpoint_url | s3 or MinIO endpoint for dataset retrieval and saving resources out to | True | None | https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development) | N/A |
| save_bucket | The name of the bucket to save the model and resources to. | True | None | graphgrid-nlp-models | N/A |

update_model_metadata_registry

update_model_metadata_registry updates the model metadata registry based on the history of the trained model, and the generated metadata.json file. For more information about the model metadata registry, see here.

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| gg_version | GraphGrid version to use | True | None | 1.4.0, 2.0.0, latest (only with local Airflow development) | N/A |
| endpoint_url | s3 or MinIO endpoint for dataset retrieval and saving resources out to | True | None | https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development) | N/A |
| save_bucket | The name of the bucket to save the model and resources to. | True | None | graphgrid-nlp-models | N/A |
| top_level_path | Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the prep task, otherwise it's expected to already exist. | True | None | N/A | N/A |

cleanup

cleanup cleans up the local training directory and all the related resources.

Parameters

| parameter | usage | required (bool) | default | choices | deprecation notice |
| --- | --- | --- | --- | --- | --- |
| model | Name of model to train | True | None | named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model | N/A |
| top_level_path | Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the prep task, otherwise it's expected to already exist. | True | None | N/A | N/A |

Model Policies

Model policies are a means of documenting the basic characteristics of a trained GraphGrid NLP model. A model policy informs any downstream project/app/etc. how to interact with the model. This is mostly seen through the inputTensors, outputTensors, and requiredOpsFiles fields.

Here is an example model policy produced from a named_entity_recognition training session.

{
  "corpora": [
    "conll"
  ],
  "languages": [
    "en"
  ],
  "checksum": "R1FKcWbcfmTwYY3ILeMaEw==",
  "modelName": "20210223T221627-nerModel",
  "inputTensors": [
    {
      "name": "masking_input:0",
      "shape": [
        1,
        128,
        768
      ]
    }
  ],
  "task": "NAMED_ENTITY_RECOGNITION",
  "testSet": "",
  "outputTensors": [
    {
      "name": "dense_1/truediv:0",
      "shape": [
        1,
        128,
        9
      ],
      "outputValues": [
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC"
      ]
    }
  ],
  "modelAccuracy": "0.98931605",
  "trainingDate": "2021-02-23T22:15:58",
  "requiredOpsFiles": [
    "_lstm_ops.so"
  ]
}

Here is a table describing all the fields within a model policy.

| Field | Description | Type |
| --- | --- | --- |
| corpora | List of all the corpora/datasets the model was trained on | List of strings |
| languages | The languages the model was trained on | List of strings |
| checksum | A checksum that nlp-model-training generates for downstream projects to verify after retrieval | String |
| modelName | name of the model | String |
| inputTensors | List of nested JSON objects documenting the name and shape of the input layer | List of nested JSON objects |
| task | boot-nlp related task name for the specified model | String |
| testSet | Unused/empty string | String |
| outputTensors | List of nested JSON objects documenting the name and shape of the output layer | List of nested JSON objects |
| modelAccuracy | The model accuracy during training | String |
| trainingDate | The date when the model was trained | String |
| requiredOpsFiles | The required Ops files in order to run the model | List of strings |
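
As an illustration of how a downstream consumer might use the outputTensors field, the sketch below maps per-token class scores to label strings. It assumes that class index i corresponds to outputValues[i] and that the highest score wins; neither is stated in the policy itself. decode_labels is a hypothetical helper and the scores are made up.

```python
policy = {
    "outputTensors": [
        {
            "name": "dense_1/truediv:0",
            "shape": [1, 128, 9],
            "outputValues": ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                             "B-LOC", "I-LOC", "B-MISC", "I-MISC"],
        }
    ],
}


def decode_labels(policy, scores):
    """Map each token's highest-scoring class index to its label string."""
    labels = policy["outputTensors"][0]["outputValues"]
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]


# Made-up scores for two tokens over the 9 classes.
scores = [
    [0.1, 0.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0],
]
print(decode_labels(policy, scores))  # ['B-PER', 'O']
```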

Model Metadata Registry

The model metadata registry is a resource for tracking all the models trained via the NLP-model-training pipeline and Airflow. This file lives at the top level of the bucket used for saving models (default com-graphgrid-nlp). The registry determines which models are selected for packaging. Once training and evaluation are complete, the newly trained model is compared to all the other currently trained models to determine whether it should be labeled as the current default model for packaging.

Note: This file should never be altered manually.

Here is a truncated sample of the current GraphGrid model metadata registry which includes models for both the 1.4 and 2.0 platform versions.

{
  "org": {
    "default": {
      "package": {
        "1.4.0": {
          "models": {
            "part_of_speech_tagging": {
              "modelVersion": "20210107T172128-posModel",
              "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20210107T172128-posModel.tar.gz"
            },
            ...
            "sentiment_binary_lstm": {
              "modelVersion": "20201222T054419-sentimentBinaryLstmModel",
              "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20201222T054419-sentimentBinaryLstmModel.tar.gz"
            }
          }
        },
        "2.0.0": {
          "models": {
            "part_of_speech_tagging": {
              "modelVersion": "20210521T215028-posModel",
              "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210521T215028-posModel.tar.gz"
            },
            ...
            "sentiment_categorical_lstm": {
              "modelVersion": "20210325T110728-sentimentCategoricalLstmModel",
              "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210325T110728-sentimentCategoricalLstmModel.tar.gz"
            }
          }
        }
      },
      "trainedModels": {
        "part_of_speech_tagging": {
          "20201218T150016-posModel": {
            "modelName": "part_of_speech_tagging",
            "trainingDataset": "conll",
            "trainingAccuracy": 0.954,
            "trainingLoss": 0.154,
            "evalAccuracy": 0.931,
            "evalLoss": 0.246,
            "properties": {
              "languages": [
                "en"
              ]
            },
            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20201218T150016-posModel.tar.gz",
            "timestamp": "2020-12-18T15:00:16",
            "platformVersion": "1.4.0"
          },
          ...
          "20210719T203352-posModel": {
            "modelName": "part_of_speech_tagging",
            "trainingDataset": [
              "conll"
            ],
            "trainingAccuracy": 0.945,
            "trainingLoss": 0.035,
            "evalAccuracy": 0.918,
            "evalLoss": 0.056,
            "properties": {
              "languages": [
                "en"
              ]
            },
            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210719T203352-posModel.tar.gz",
            "timestamp": "2021-07-19T20:33:52",
            "platformVersion": "2.0.0"
          }
        },
        ...
        "sentiment_categorical_lstm": {
          "20210325T110728-sentimentCategoricalLstmModel": {
            "modelName": "sentiment_categorical_lstm",
            "trainingDataset": [
              "sst1"
            ],
            "trainingAccuracy": 0.749,
            "trainingLoss": 0.607,
            "evalAccuracy": 0.487,
            "evalLoss": 1.354,
            "properties": {
              "languages": [
                "en"
              ]
            },
            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210325T110728-sentimentCategoricalLstmModel.tar.gz",
            "timestamp": "2021-03-25T11:07:28",
            "platformVersion": "2.0.0"
          }
        }
      }
    }
  }
}
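
Because the registry must never be edited by hand, a consumer would typically only read it. The following is a minimal read-only sketch, assuming the nesting shown in the sample above (org, default, package, platform version, models); default_model is a hypothetical helper and not part of the pipeline.

```python
registry = {
    "org": {
        "default": {
            "package": {
                "2.0.0": {
                    "models": {
                        "part_of_speech_tagging": {
                            "modelVersion": "20210521T215028-posModel",
                            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210521T215028-posModel.tar.gz",
                        }
                    }
                }
            }
        }
    }
}


def default_model(registry, platform_version, model_name):
    """Return the packaged default entry for a model, or None if it is absent."""
    package = registry["org"]["default"]["package"]
    models = package.get(platform_version, {}).get("models", {})
    return models.get(model_name)


entry = default_model(registry, "2.0.0", "part_of_speech_tagging")
print(entry["modelVersion"])  # 20210521T215028-posModel
```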

Training pipeline tasks

As the name implies, the train CLI command is the heart of NLP-model-training as it is responsible for running the complete NLP model training pipeline.

Some of the pipeline tasks are shared by multiple different models (e.g. BERT embedding), while some other models require custom or unique steps (e.g. translation models). The pipeline_config module dictates which exact pipeline steps each model performs.

build_model

build_model builds and compiles a GraphGrid NLP model. The following are all the possible trainable/buildable models: named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model.

preprocess

preprocess processes a given corpus/dataset's raw samples and transforms them into a more standardized format.

There are currently 10 preprocessing modules, which follow a naming scheme of preprocess_MODEL_CORPUS, unless the corpus is shared across different models in which case the model field is dropped.

combine_datasets

combine_datasets loads all preprocessed datasets into memory, concatenates them, and then shuffles them. In training scenarios where only a single corpus is used for training, this step still occurs, but simply shuffles the dataset as there are not multiple datasets to combine.
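
The concatenate-then-shuffle behaviour can be sketched in a few lines. This is a simplified illustration that treats each preprocessed dataset as an in-memory list of samples; the function signature, the dict layout, and the seed argument are assumptions, not the module's real interface.

```python
import random


def combine_datasets(datasets, seed=None):
    """Concatenate all preprocessed datasets and shuffle the result."""
    combined = [sample for samples in datasets.values() for sample in samples]
    random.Random(seed).shuffle(combined)
    return combined


# Two toy corpora; with a single corpus the same call simply shuffles it.
datasets = {"conll": ["c1", "c2", "c3"], "wikigold": ["w1", "w2"]}
mixed = combine_datasets(datasets, seed=0)
print(len(mixed))  # 5
```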

cache

cache transforms the preprocessed dataset and caches it inside the local training directory. For all models other than translation, this includes embedding the samples through BERT.

train

train trains a given model with the cached training samples from the cache task. Once training is complete, the model is saved within the top_level_path/model_save directory.

model_policy

model_policy generates a model policy for the newly trained NLP model. For information regarding model policies see here.

Pipeline parameters

The NLP pipeline defines parameter dataclasses to ease interfaces across distinct model training pipeline steps. For example, the module build_model_parameters defines the dataclass BuildModelParameters which includes the fields model_name, retraining, and retraining_path. Each parameter dataclass allows for a single and simple interface between pipeline steps which places the responsibility of individual parameter retrieval on downstream pipeline tasks. For example, if two different models have distinct model building modules which require different parameters, we can embed both within BuildModelParameters and have the distinct modules retrieve only the parameters that they need. This then allows us to define our pipeline through lambda functions that take in the parameter dataclasses.
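
A minimal sketch of this pattern is shown below. The field names come from the BuildModelParameters description; the default values, the two builder functions, and the build_steps mapping are hypothetical and only illustrate how lambdas over a shared dataclass keep the interface uniform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class BuildModelParameters:
    model_name: str
    retraining: bool = False               # default is an assumption
    retraining_path: Optional[str] = None  # default is an assumption


def build_fresh(params: BuildModelParameters) -> str:
    # Hypothetical builder that only needs the model name.
    return f"new {params.model_name}"


def build_from_checkpoint(params: BuildModelParameters) -> str:
    # Hypothetical builder that also reads the retraining path.
    return f"{params.model_name} restored from {params.retraining_path}"


# Each pipeline step is a callable that takes the whole parameter dataclass
# and picks out only the fields it needs.
build_steps: Dict[bool, Callable[[BuildModelParameters], str]] = {
    False: lambda p: build_fresh(p),
    True: lambda p: build_from_checkpoint(p),
}

params = BuildModelParameters("named_entity_recognition", retraining=True, retraining_path="/tmp/ner")
print(build_steps[params.retraining](params))
```

Adding a new field to the dataclass does not change any step's signature, which is the main benefit of passing one object between steps.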

In the following sections, we'll go through each individual parameter dataclass and list out the associated parameters.

BuildModelParameters

| Field | Description | Type |
| --- | --- | --- |
| model_name | The name of the model | String |
| retraining | Flag to determine whether a pretrained model exists locally | Boolean |
| retraining_path | Path to local pretrained model | String |

CacheParameters

| Field | Description | Type |
| --- | --- | --- |
| model_name | The name of the model | String |
| cache_path | Path to cache training samples | String |
| outfile_path | Path to preprocessed samples file or directory | String |

CombineDatasetsParameters

| Field | Description | Type |
| --- | --- | --- |
| datasets | Dictionary with each dataset being an entry with relevant information | Dictionary |
| outfile_path | Path to preprocessed samples file or directory | String |

ModelPolicyParameters

| Field | Description | Type |
| --- | --- | --- |
| model | The trained model | [BaseModel, TranslationModel] |
| datasets | List of all the datasets the model was trained on | List |
| history | History object that contains training performance metrics | [tf.keras.callbacks.History, DummyHistory] |
| model_policy_path | Path to save the model policy | String |
| languages | List of all the languages the model was trained on | List |

PreprocessingParameters

| Field | Description | Type |
| --- | --- | --- |
| infile_path | Path to load raw dataset/corpus | String |
| outfile_path | Path to save preprocessed samples file or directory | String |
| dataset | Name of the dataset or corpus to preprocess | String |
| is_pos | Flag to mark part-of-speech models | Boolean |

TrainingParameters

| Field | Description | Type |
| --- | --- | --- |
| model | The trained model | [BaseModel, TranslationModel] |
| model_name | The name of the model | String |
| cache_path | Path to cache training samples | String |
| model_save_path | Path to save the model | String |

Data transformation specifications by model

All samples undergo different types of transformations before being fed into our models. This section breaks down each model's transformations in sequential order and documents what occurs at each step.

named_entity_recognition

| Training pipeline step | Input shape | Input shape type | Output shape | Output shape type | Link |
| --- | --- | --- | --- | --- | --- |
| preprocess_conll | (None,) | String | (None,) | List of strings | here |
| tokenization | (batch_size,) | Batch of strings | (batch_size, None, None) | tf.RaggedTensor | here |
| BERT | (batch_size, None, None) | tf.RaggedTensor | [(batch_size, 768), (batch_size, 128, 768)] | tf.Tensor | here |
| conll_handler | (batch_size, 128, 768) | tf.Tensor | (batch_size, 128, 768) | tf.Tensor | here |

part_of_speech_tagging

These transformation steps are identical to those of named_entity_recognition.

relation_extraction

| Training pipeline step | Input shape | Input shape type | Output shape | Output shape type | Link |
| --- | --- | --- | --- | --- | --- |
| preprocess_re_kbp37 | (None,) | String | (None,) | List of strings | here |
| tokenization | (batch_size,) | Batch of strings | (batch_size, None, None) | tf.RaggedTensor | here |
| BERT | (batch_size, None, None) | tf.RaggedTensor | [(batch_size, 768), (batch_size, 128, 768)] | tf.Tensor | here |
| handle_re_processing | (batch_size, 128, 768) | tf.Tensor | (batch_size, 128, 808) | tf.Tensor | here |

Relation extraction handler

The relation extraction handler takes BERT embeddings as its input, calculates the distance between each word and the mentions within the sample, and then embeds those positions. This yields a tensor of shape (batch_size, 128, 808). The handler iterates token-by-token through the BERT embedding to calculate the distance between the token and each mention. This yields two distances for every word (the distance is 0 when analyzing the mentions themselves). These distances are then passed through both sin and cos functions and expanded to a dimensionality of 20. As there are two mentions, each with a 20-dimensional position embedding, the final output shape is (batch_size, 128, 768 + 20 + 20).
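
The shape arithmetic can be made concrete with a rough sketch. The text only says that distances pass through sin and cos and are expanded to 20 dimensions; the frequency schedule below is borrowed from Transformer positional encodings and is an assumption, as are the helper names and the simplification that each mention is a single token index.

```python
import math

EMBED_DIM = 20  # position embedding size per mention, from the text


def position_embedding(distance, dim=EMBED_DIM):
    """Encode a signed token distance as interleaved sin/cos values."""
    values = []
    for i in range(dim // 2):
        # Assumed frequency schedule (Transformer-style); not confirmed by the source.
        angle = distance / (10000 ** (2 * i / dim))
        values.extend([math.sin(angle), math.cos(angle)])
    return values


def append_positions(token_embeddings, mention1_idx, mention2_idx):
    """Append one position embedding per mention to every token embedding."""
    out = []
    for idx, emb in enumerate(token_embeddings):
        out.append(list(emb)
                   + position_embedding(idx - mention1_idx)
                   + position_embedding(idx - mention2_idx))
    return out


tokens = [[0.0] * 768 for _ in range(128)]  # stand-in for one BERT sequence output
features = append_positions(tokens, mention1_idx=2, mention2_idx=5)
print(len(features), len(features[0]))  # 128 808
```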

coreference_resolution

| Training pipeline step | Input shape | Input shape type | Output shape | Output shape type | Link |
| --- | --- | --- | --- | --- | --- |
| preprocess_crr_gap, preprocess_crr_wikicoref | (None,) | String | (None,) | List of strings | here |
| tokenization | (batch_size,) | Batch of strings | (batch_size, None, None) | tf.RaggedTensor | here |
| BERT | (batch_size, None, None) | tf.RaggedTensor | [(batch_size, 768), (batch_size, 128, 768)] | tf.Tensor | here |
| handle_crr_processing | (batch_size, 128, 768) | tf.Tensor | (batch_size, 13854) | tf.Tensor | here |

Coreference resolution handler

The coreference resolution handler takes BERT embeddings as its input, calculates relative positional information about the sample, and then creates a flattened tensor of all the information with shape (batch_size, 13854). This can be broken down as ((768 * 9 + 1 + 9) * 2 + 1 + 9).

The initial 768 * 9 are the following values:

| Index | Description |
| --- | --- |
| 0 | The first corresponding token embedding for the mention |
| 1 | The last corresponding token embedding for the mention |
| 2 | The first token embedding before the mention (e.g. mention index - 1) |
| 3 | The second token embedding before the mention (e.g. mention index - 2) |
| 4 | The first token embedding after the mention (e.g. mention index + 1) |
| 5 | The second token embedding after the mention (e.g. mention index + 2) |
| 6 | The mean of the 5 tokens following the mention |
| 7 | The mean of the 5 tokens before the mention |
| 8 | The pooled output of the sample |

Each embedded token has a shape of 768 and there are 9 different values, which gives the 768 * 9 term.

The next singular appended value is the length of the mention within the sample.

The following 9 flat values are the "bucketized", i.e. one-hot encoded, sequence length of the tensor with depth 9. Finally, as coreference resolution is concerned with two different mentions, we repeat this same process for the second mention, which is how we arrive at a shape of (batch_size, 13854).
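
The documented breakdown can be checked with plain arithmetic. The constant names below are illustrative; the trailing 1 + 9 term is copied from the formula as written, since its purpose is not spelled out in the surrounding text.

```python
EMBED = 768          # size of one BERT token embedding
N_EMBEDDINGS = 9     # indices 0-8 in the table above
MENTION_LENGTH = 1   # single scalar per mention
BUCKETS = 9          # one-hot depth of the bucketized value

per_mention = EMBED * N_EMBEDDINGS + MENTION_LENGTH + BUCKETS
total = per_mention * 2 + 1 + 9  # trailing 1 + 9 taken from the documented formula
print(per_mention, total)  # 6922 13854
```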

keyphrase_extraction

| Training pipeline step | Input shape | Input shape type | Output shape | Output shape type | Link |
| --- | --- | --- | --- | --- | --- |
| preprocess_keyphrase_semeval2017 | (None,) | String | (None,) | List of strings | here |
| tokenization | (batch_size,) | Batch of strings | (batch_size, None, None) | tf.RaggedTensor | here |
| BERT | (batch_size, None, None) | tf.RaggedTensor | [(batch_size, 768), (batch_size, 128, 768)] | tf.Tensor | here |
| handle_keyphrase_processing | (batch_size, 128, 768) | tf.Tensor | (batch_size, 128, 768) | tf.Tensor | here |

Keyphrase extraction handler

The keyphrase extraction handler takes BERT embeddings as its input and does not alter them in any way. Therefore, its output is identical to the BERT embedding, i.e. a tensor of shape (batch_size, 128, 768).

sentiment_analysis_binary and sentiment_analysis_categorical

| Training pipeline step | Input shape | Input shape type | Output shape | Output shape type | Link |
| --- | --- | --- | --- | --- | --- |
| preprocess_sentiment_sst | | | | | here |
| tokenization | (batch_size,) | Batch of strings | (batch_size, None, None) | tf.RaggedTensor | here |
| BERT | (batch_size, None, None) | tf.RaggedTensor | [(batch_size, 768), (batch_size, 128, 768)] | tf.Tensor | here |
| handle_sentiment_lstm_processing, handle_sentiment_cat_lstm_processing | (batch_size, 128, 768) | tf.Tensor | (batch_size, 128, 768) | tf.Tensor | here |

Sentiment analysis handler

The sentiment analysis handler takes BERT embeddings as its input and does not alter them in any way. Therefore, its output is identical to the BERT embedding, i.e. a tensor of shape (batch_size, 128, 768).

Shared transformations

Various transformations within the pipeline are shared across different models. This section goes through each shared transformation.

Tokenization

Tokenization takes in preprocessed samples and transforms them into the expected BERT input shape. We first split sentences on whitespace (as labels for some tasks are assigned per word), tokenize them through tensorflow_text.BertTokenizer, extract relevant metadata about the sentence (word indices, lengths, etc.), and then reshape the ragged tensor into the format that the helper function/model bert_pack_inputs expects.

The exact input is individual sentences in a simple string format, while the exact output is a ragged tensor with two nested rows: row one defines the flat dimension of all the words in the sentence, and row two defines tensor sub-dimensions for sub-tokens (e.g. indices where single words are composed of multiple sub-tokens).
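
The ragged words-by-sub-tokens structure can be illustrated without TensorFlow. The sketch below is a toy greedy longest-match WordPiece splitter over a made-up vocabulary; it stands in for tensorflow_text.BertTokenizer only to show the nesting, and it returns strings rather than the token ids the real tokenizer produces.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matched at this position
        start = end
    return pieces


def tokenize(sentence, vocab):
    """Split on whitespace, then sub-tokenize; returns one list per word."""
    return [wordpiece(word, vocab) for word in sentence.split()]


vocab = {"i", "ate", "a", "ham", "##burger"}  # toy vocabulary
print(tokenize("i ate a hamburger", vocab))
# [['i'], ['ate'], ['a'], ['ham', '##burger']]
```

The outer list has one entry per whitespace-separated word and the inner lists vary in length, which is the same two-level raggedness described above.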

BERT embedding

BERT embedding takes tokenized sentences and processes them through BERT. This step is directly downstream from Tokenization, which means its input is the ragged tensor output from that step. The exact output of this step depends on whether the model expects pooled output (which can be thought of as a summary of the entire sentence) or sequence output (one embedding for every token in the sentence). BERT defines a maximum sequence length, and any sentence whose number of tokens exceeds that length is truncated to it. For all of our models, the sequence length is 128. Therefore, the exact output of this step is either a tensor of shape (batch_size, 768) or a tensor of shape (batch_size, 128, 768), where the former is for pooled output and the latter is for sequence output.

CoNLL handler

CoNLL handler is the post-BERT processing module shared by the part_of_speech_tagging and named_entity_recognition models. As a handler, this step is directly downstream from BERT embedding, meaning it takes the sequence output from that step as its input. This step takes the mean of sub-tokens across the input batch, which aligns the number of inputs with the number of labels and allows for better model performance. Rather than retaining a multi-token word such as ham ##burger, we take the mean of the BERT embeddings for the corresponding tokens ham and ##burger, as they belong to the same word and correspond to the same label. The exact output shape of this step is (batch_size, 128, 768); however, all sub-tensors within the sequence past the last word/token have been replaced with zero tensors.
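
The sub-token averaging can be sketched on plain lists. This simplified version handles a single sentence, uses toy 2-dimensional embeddings instead of 768, ignores special tokens such as [CLS] and [SEP], and takes the per-word sub-token counts as an input; all of these are assumptions made for illustration, and average_subtokens is a hypothetical helper.

```python
def average_subtokens(token_embeddings, word_lengths, seq_len):
    """Collapse sub-token embeddings into one mean embedding per word,
    then zero-pad the result back to seq_len rows."""
    dim = len(token_embeddings[0])
    words, start = [], 0
    for length in word_lengths:
        group = token_embeddings[start:start + length]
        words.append([sum(col) / length for col in zip(*group)])
        start += length
    words.extend([[0.0] * dim for _ in range(seq_len - len(words))])
    return words


# Three sub-tokens ("a", "ham", "##burger") with toy 2-dimensional embeddings.
embeddings = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0]]
print(average_subtokens(embeddings, word_lengths=[1, 2], seq_len=4))
# [[1.0, 1.0], [3.0, 6.0], [0.0, 0.0], [0.0, 0.0]]
```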

Dataset Formats per model

Most of the NLP models use distinct datasets with different formats. This section therefore outlines each model's datasets and their basic formats.

named_entity_recognition

Computational Natural Language Learning (CoNLL)

"CoNLL2003: each word has to be on a separate line, and there must be an empty line after each sentence. A line must contain at least 2 columns, the first one being the word itself, the last one being the named entity. It does not matter if there are extra columns that contain tags or chunks in between. Tags have to be given in the IOB format (it can be IOB1 or IOB2)."

General sample format

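| Word string | POS-tag IOB string | Syntactic chunk tag IOB string | NER-tag IOB string |
| --- | --- | --- | --- |
| Japan | NNP | I-NP | I-LOC |
| 's | POS | B-NP | O |
| Hashimoto | NNP | I-NP | I-PER |
| leaves | VBZ | I-VP | O |
| Brazil | NNP | I-NP | I-LOC |
| for | IN | I-PP | O |
| Peru | NNP | I-NP | I-LOC |
| . | . | O | O |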

Raw example sentence

Japan NNP I-NP I-LOC
's POS B-NP O
Hashimoto NNP I-NP I-PER
leaves VBZ I-VP O
Brazil NNP I-NP I-LOC
for IN I-PP O
Peru NNP I-NP I-LOC
. . O O
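
A reader for this layout can be sketched as follows, relying only on the rules quoted above: one word per line, a blank line between sentences, the word in the first column and the named entity tag in the last. parse_conll is a hypothetical helper; in the sample, a blank line has been placed mid-sentence purely to show how sentences are split.

```python
def parse_conll(text):
    """Return one list of (word, ner_tag) pairs per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        columns = line.split()
        if not columns:            # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))  # first column: word, last: NER tag
    if current:
        sentences.append(current)
    return sentences


raw = """Japan NNP I-NP I-LOC
's POS B-NP O
Hashimoto NNP I-NP I-PER

Peru NNP I-NP I-LOC
. . O O
"""
for sentence in parse_conll(raw):
    print(sentence)
```

For part-of-speech tagging, the same reader would pick the second column instead of the last one.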

Wikigold

"Wikigold uses the text of 149 articles from the May 22, 2008 dump of English Wikipedia. The articles were selected at random from all articles describing named entities, with a roughly equal proportion of article topics from each of the four CONLL-03 classes (LOC, MISC, ORG, PER)."

General sample format

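| Word | NER-tag |
| --- | --- |
| 010 | I-MISC |
| is | O |
| the | O |
| tenth | O |
| album | O |
| from | O |
| Japanese | I-MISC |
| Punk | O |
| Techno | O |
| band | O |
| The | I-ORG |
| Mad | I-ORG |
| Capsule | I-ORG |
| Markets | I-ORG |
| . | O |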

Raw example sentence

010 I-MISC
is O
the O
tenth O
album O
from O
Japanese I-MISC
Punk O
Techno O
band O
The I-ORG
Mad I-ORG
Capsule I-ORG
Markets I-ORG
. O

Emerging and Rare Entities

"This shared task (WNUT2017 Shared Task) focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet "so.. kktny in 30 mins?!" - even human experts find the entity kktny hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities."

General sample format

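| Word | NER-tag |
| --- | --- |
| @paulwalk | O |
| It | O |
| 's | O |
| the | O |
| view | O |
| from | O |
| where | O |
| I | O |
| 'm | O |
| living | O |
| for | O |
| two | O |
| weeks | O |
| . | O |
| Empire | B-location |
| State | I-location |
| Building | I-location |
| = | O |
| ESB | B-location |
| . | O |
| Pretty | O |
| bad | O |
| storm | O |
| here | O |
| last | O |
| evening | O |
| . | O |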

Raw example format

@paulwalk   O
It O
's O
the O
view O
from O
where O
I O
'm O
living O
for O
two O
weeks O
. O
Empire B-location
State I-location
Building I-location
= O
ESB B-location
. O
Pretty O
bad O
storm O
here O
last O
evening O
. O

part_of_speech_tagging

English CoNLL

Part_of_speech_tagging follows the same format as named_entity_recognition.

General sample format

This is the same as named_entity_recognition barring usage of the POS-tags column rather than NER-tags.

Raw example sentence

This is the same as named_entity_recognition.

Groningen Meaning Bank

"The Groningen Meaning Bank (GMB) consists of public domain English texts with corresponding syntactic and semantic representations. The GMB is developed at the University of Groningen. A multi-lingual version of the GMB is the Parallel Meaning Bank. A thorough description of the GMB can be found in the Handbook of Linguistic Annotation."

General sample format

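| Sentence # | Word | POS | Tag |
| --- | --- | --- | --- |
| Sentence: 1 | Thousands | NNS | O |
| | of | IN | O |
| | demonstrators | NNS | O |
| | have | VBP | O |
| | marched | VBN | O |
| | through | IN | O |
| | London | NNP | B-geo |
| | to | TO | O |
| | protest | VB | O |
| | the | DT | O |
| | war | NN | O |
| | in | IN | O |
| | Iraq | NNP | B-geo |
| | and | CC | O |
| | demand | VB | O |
| | the | DT | O |
| | withdrawal | NN | O |
| | of | IN | O |
| | British | JJ | B-gpe |
| | troops | NNS | O |
| | from | IN | O |
| | that | DT | O |
| | country | NN | O |
| | . | . | O |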

Raw example sentence

Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O
,the,DT,O
,war,NN,O
,in,IN,O
,Iraq,NNP,B-geo
,and,CC,O
,demand,VB,O
,the,DT,O
,withdrawal,NN,O
,of,IN,O
,British,JJ,B-gpe
,troops,NNS,O
,from,IN,O
,that,DT,O
,country,NN,O
,.,.,O
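
Since only the first row of each sentence carries a sentence marker, a reader has to group rows by that first column. The sketch below assumes the four-column comma-separated layout shown above with any header row already stripped; parse_gmb is a hypothetical helper and the "Sentence: 2" row is made up to show the grouping.

```python
import csv
import io


def parse_gmb(text):
    """Group (word, pos, tag) rows into sentences using the first column."""
    sentences = []
    for marker, word, pos, tag in csv.reader(io.StringIO(text)):
        if marker:                 # a non-empty first column starts a new sentence
            sentences.append([])
        sentences[-1].append((word, pos, tag))
    return sentences


raw = """Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
Sentence: 2,Iraq,NNP,B-geo
"""
print(parse_gmb(raw))
```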

relation_extraction

kbp37 (SemEval-2010 Task 8 and MIML-RE)

Within SemEval-2010 Task 8, "there are 9 directional relations and an additional 'other' relation, resulting in 19 relation classes in total. Given a sentence and two target nominals, a prediction is counted as correct only when both the relation and its direction are correct. The performance is evaluated in terms of the F1 score defined by SemEval-2010 Task 8 (Hendrickx et al., 2009)." kbp37 is a modified combination of the datasets SemEval-2010 Task 8 and MIML-RE. kbp37 "contains 18 directional relations and an additional 'no_relation' relation, resulting in 37 relation classes."

General sample format

First line of a single sample contains the sentence number and associated text.

| Sentence number | Text |
| --- | --- |
| 0 | " <e1> Thom Yorke </e1> of <e2> Radiohead </e2> has included the + for many of his signature distortion sounds using a variety of guitars to achieve various tonal options . " |

The second line of a sample contains the relation for the sentence.

Relation
per:employee_of(e1,e2)

Raw example sentence

0    " <e1> Thom Yorke </e1> of <e2> Radiohead </e2> has included the + for many of his signature distortion sounds using a variety of guitars to achieve various tonal options . "
per:employee_of(e1,e2)
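
The two-line sample layout lends itself to a small regular-expression reader. The sketch below assumes the entity markup and the relation(e1,e2) syntax shown in the raw example; parse_kbp37 and the returned dictionary keys are hypothetical, and a label without arguments such as no_relation yields a direction of (None, None).

```python
import re

ENTITY = re.compile(r"<(e[12])>\s*(.*?)\s*</\1>")
RELATION = re.compile(r"^(?P<label>[^()]+)(?:\((?P<first>e[12]),(?P<second>e[12])\))?$")


def parse_kbp37(sentence_line, relation_line):
    """Extract both entity mentions and the directed relation from one sample."""
    entities = dict(ENTITY.findall(sentence_line))
    match = RELATION.match(relation_line.strip())
    return {
        "e1": entities["e1"],
        "e2": entities["e2"],
        "relation": match.group("label"),
        "direction": (match.group("first"), match.group("second")),
    }


sentence = ('0\t" <e1> Thom Yorke </e1> of <e2> Radiohead </e2> has included the + for many '
            'of his signature distortion sounds using a variety of guitars to achieve '
            'various tonal options . "')
print(parse_kbp37(sentence, "per:employee_of(e1,e2)"))
```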

keyphrase_extraction

SemEval-2017 Task 10

"A corpus for the task is built from ScienceDirect open access publications and is available freely for participants, without the need to sign a copyright agreement. It consists of 500 journal articles evenly distributed among the domains Computer Science, Material Sciences and Physics. Three types of documents are provided: plain text documents, brat .ann standoff documents, and XML documents. Plain text documents contain sampled paragraphs, .ann files contain annotations for those paragraphs and .xml documents come with the original full article text. The training data part of the corpus consists of 350 documents, 50 are kept for development and 100 for testing."

General sample format

Text files consist of sentences with bracketed numbers added after keyphrases. Therefore, the formatting is simply:

Text
...which limits the allowed power density in fusion reactors [1] and is a severe technical challenge in itself [2]...

An annotation file is organized in the following manner:

| ID | Label | Start-offset | End-offset | Surface-form |
| --- | --- | --- | --- | --- |
| T1 | Process | 107 | 122 | fusion reactors |
| ... | ... | ... | ... | ... |
| T15 | Task | 267 | 293 | severe technical challenge |

Raw example sentence

Power and particle exhaust are crucial for the viability of any future fusion power plant concept. Heat in fusion reactors must be extracted through a wall and cannot be exhausted volumetrically, which limits the allowed power density in fusion reactors [1] and is a severe technical challenge in itself [2]. In addition, structural material changes resulting from neutron irradiation cause degradation in the heat exhaust capabilities of existing designs [3] and static surfaces can suffer severely from erosion due to impinging plasma particles [4,5]. It is concluded that conventional concepts and materials for plasma facing components (PFCs) reach their limits in terms of material lifetime and power exhaust at approximately 20MW/m2, which is presumably dramatically reduced to <10MW/m2 due to neutron damage in a D-T reactor [6] or even only half that value [7].

with corresponding annotation file:

T1    Process 107 122    fusion reactors
T2 Process 238 253 fusion reactors
T3 Process 464 479 static surfaces
T4 Material 0 26 Power and particle exhaust
T5 Process 365 384 neutron irradiation
T6 Task 615 639 plasma facing components
T7 Task 641 645 PFCs
* Synonym-of T7 T6
T8 Material 820 831 D-T reactor
T9 Process 322 349 structural material changes
T10 Material 322 341 structural material
T11 Material 520 546 impinging plasma particles
T12 Material 800 807 neutron
T13 Process 800 814 neutron damage
T14 Material 71 89 fusion power plant
T15 Task 267 293 severe technical challenge
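Because the offsets index characters in the plain text document, an annotation file can be checked against its text by slicing. The sketch below assumes that keyphrase lines start with `T`, that every span is contiguous, and that fields may be separated by tabs or spaces; `Keyphrase`, `parse_ann`, and `check_offsets` are hypothetical names.

```python
from typing import List, NamedTuple


class Keyphrase(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    surface: str


def parse_ann(ann_text: str) -> List[Keyphrase]:
    keyphrases = []
    for line in ann_text.splitlines():
        # Skip non-keyphrase lines such as "* Synonym-of T7 T6".
        if not line.startswith("T"):
            continue
        # The surface form is whatever remains after the first four fields.
        ann_id, label, start, end, surface = line.split(None, 4)
        keyphrases.append(Keyphrase(ann_id, label, int(start), int(end), surface))
    return keyphrases


def check_offsets(text: str, keyphrases: List[Keyphrase]) -> List[Keyphrase]:
    """Return the keyphrases whose offsets do not reproduce their surface form."""
    return [k for k in keyphrases if text[k.start:k.end] != k.surface]


text = ("Power and particle exhaust are crucial for the viability "
        "of any future fusion power plant concept.")
ann = ("T4 Material 0 26 Power and particle exhaust\n"
       "* Synonym-of T7 T6\n"
       "T14\tMaterial 71 89\tfusion power plant")

keyphrases = parse_ann(ann)
mismatches = check_offsets(text, keyphrases)
print(len(keyphrases), len(mismatches))  # 2 0
```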

OpenKP

"OpenKeyPhrase(OpenKP), [is] a large scale, open domain keyphrase extraction dataset. The dataset features 148,124 real world web documents along with a human annotation indicating the 1-3 most relevant keyphrases."

It consists of "~100,000 urls from the Bing Index to get a representative sample of true domain diversity. Additionally, we sampled ~40,000 urls from the MSMARCO QA corpus since it can be considered a representative sample of open domain web document search. Once the urls are selected they are provided to an expert judge who visits the website, explores its content and when they are done annotates 1-3 keyphrases in the document they believe to be most salient to the overall document."

General sample format

| url | text | VDOM | KeyPhrases |
| --- | --- | --- | --- |
| .../star-trek-discovery-season-1.html | Star Trek Discovery Season 1 Director NA Actors Jason Isaacs... | "[{\"Id\":0,\"text\":\"Star Trek Discovery Season 1\",\"feature\":[44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0,44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0],\"start_idx\":0,\"end_idx\":5},...,{\"Id\":0,\"text\":\"Play Movie\",\"feature\":[673.0,83.0,2085.0,31.0,0.0,0.0,0.0,0.0,12.0,0.0,667.0,95.0,2086.0,29.0,1.0,0.0,0.0,0.0,12.0,0.0],\"start_idx\":309,\"end_idx\":311}]" | [["Star", "Trek"], ["Jason", "Isaacs"], ["Doug", "Jones"]] |

Raw example sample

{"url": "http://.../star-trek-discovery-season-1.html", "text": "Star Trek Discovery Season 1 Director NA Actors Jason Isaacs Doug Jones Shazad Latif Sonequa MartinGreen Genres SciFi Country USA Release Year 2017 Duration NA Synopsis Ten years before Kirk Spock and the Enterprise the USS Discovery discovers new worlds and lifeforms as one Starfleet officer learns to understand all things alien YOU ARE WATCHING Star Trek Discovery Season 1 000 000 Loaded Progress The video keeps buffering Just pause it for 510 minutes then continue playing Share Star Trek Discovery Season 1 movie to your friends Share to support Putlocker 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Version 1 Server Mega Play Movie Version 2 Server TheVideo Link 1 Play Movie Version 3 Server TheVideo Link 2 Play Movie Version 4 Server TheVideo Link 3 Play Movie Version 5 Server TheVideo Link 4 Play Movie Version 6 Server NowVideo Play Movie Version 7 Server NovaMov Play Movie Version 8 Server VideoWeed Play Movie Version 9 Server MovShare Play Movie Version 10 Server CloudTime Play Movie Version 11 Server VShare Link 1 Play Movie Version 12 Server VShare Link 2 Play Movie Version 13 Server VShare Link 3 Play Movie Version 14 Server VShare Link 4 Play Movie Version 15 Other Link 1 Play Movie Version 16 Other Link 2 Play Movie Version 17 Other Link 3 Play Movie Version 18 Other Link 4 Play Movie Version 19 Other Link 5 Play Movie Version 20 Other Link 6 Play Movie Version 21 Other Link 7 Play Movie Version 22 Other Link 8 Play Movie Version 23 Other Link 9 Play Movie Version 24 Other Link 10 Play Movie Version 25 Other Link 11 Play Movie Version 26 Other Link 12 Play Movie Version 27 Other Link 13 Play Movie Version 28 Other Link 14 Play Movie Version 29 Other Link 15 Play Movie", "VDOM": "[{\"Id\":0,\"text\":\"Star Trek Discovery Season 1\",\"feature\":[44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0,44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0],\"start_idx\":0,\"end_idx\":5},...,{\"Id\":0,\"text\":\"Play Movie\",\"feature\":[673.0,83.0,2085.0,31.0,0.0,0.0,0.0,0.0,12.0,0.0,667.0,95.0,2086.0,29.0,1.0,0.0,0.0,0.0,12.0,0.0],\"start_idx\":309,\"end_idx\":311}]", "KeyPhrases": [["Star", "Trek"], ["Jason", "Isaacs"], ["Doug", "Jones"]]}
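Each OpenKP sample is one JSON object per line, and the `VDOM` field is itself a JSON document stored as a string, so it needs a second decoding pass. A minimal sketch follows, assuming the four keys shown in the sample above. `load_openkp_line` is a hypothetical helper, and the record is a shortened, made-up stand-in for a real line, since the sample above elides part of its `VDOM`.

```python
import json


def load_openkp_line(line: str) -> dict:
    record = json.loads(line)
    return {
        "url": record["url"],
        "text": record["text"],
        # VDOM is a JSON string nested inside the JSON record.
        "vdom": json.loads(record["VDOM"]),
        # Each keyphrase is a list of tokens; join them for readability.
        "keyphrases": [" ".join(tokens) for tokens in record["KeyPhrases"]],
    }


# Shortened illustrative record with the same keys as the raw sample.
line = json.dumps({
    "url": "http://example.invalid/page.html",
    "text": "Star Trek Discovery Season 1",
    "VDOM": json.dumps([
        {"Id": 0, "text": "Star Trek Discovery Season 1", "start_idx": 0, "end_idx": 5},
    ]),
    "KeyPhrases": [["Star", "Trek"], ["Jason", "Isaacs"]],
})

openkp_record = load_openkp_line(line)
print(openkp_record["keyphrases"])  # ['Star Trek', 'Jason Isaacs']
```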

translation

OPUS-100

"OPUS-100 is an English-centric, meaning that all training pairs include English on either the source or target side, multi-lingual corpus covering 100 languages. The OPUS collection is composed of multiple corpora, ranging from movie subtitles to GNOME documentation to the Bible. The dataset is not curated or balanced, instead opting for the simplest approach of downloading all corpora for each language pair and concatenating them."

General sample format

See OPUS Data Formats for an in-depth explanation of the formatting.

However, "files are untokenized (raw format) and they may contain multiple sentences per line. They are aligned together to their corresponding sentence(s) in the other language. Empty alignments are excluded from the plain text files." Therefore, the basic format is simply:

Text
...
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
...

with an aligning text document for the target language:

Text
...
Elle contribue également à la stabilité et à la détente dans notre secteur du monde.
Cette politique recueille une large adhésion populaire.
...

Raw example sentence

"Plain text files are provided for each bitext in OPUS. The name follows the typical name conventions used in Moses, i.e. using file extensions that correspond to the language ID. For example, for the RF corpus the two files for English and French are called:"

RF.en-fr.en
RF.en-fr.fr

The contents of the English file look like this:

Statement of Government Policy by the Prime Minister, Mr Ingvar Carlsson, at the Opening of the Swedish Parliament on Tuesday, 4 October, 1988.
Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.
Sweden's policy of neutrality is of decisive importance for our peace and independence.
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
It will be pursued with firmness and consistency.
...

And the corresponding French file looks like this:

Declaration de Politique Générale du Gouvernement présentée mardi 4 octobre 1988 devant le Riksdag par Monsieur Ingvar Carlsson, Premier Ministre.
Majestés, Altesses Royales, Monsieur le Président, Mesdames et Messieurs les députés!
La politique suédoise de neutralité revêt une importance capitale pour la paix et l' indépendance de notre pays.
Elle contribue également à la stabilité et à la détente dans notre secteur du monde.
Cette politique recueille une large adhésion populaire.
Elle sera poursuivie avec énergie et cohérence.
...
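Since line N of the source file aligns with line N of the target file, a bitext can be read by walking both files in lockstep. The sketch below is one way to do that and is not the project's own loader; `read_parallel` is a hypothetical helper, and the temporary files only stand in for a real `RF.en-fr.en` / `RF.en-fr.fr` pair.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def read_parallel(source_path, target_path):
    """Yield (source, target) pairs, failing if the files differ in length."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files stop being aligned at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    en_path = Path(tmp) / "RF.en-fr.en"
    fr_path = Path(tmp) / "RF.en-fr.fr"
    en_path.write_text(
        "There is wide popular support for this policy.\n", encoding="utf-8")
    fr_path.write_text(
        "Cette politique recueille une large adhésion populaire.\n", encoding="utf-8")
    pairs = list(read_parallel(en_path, fr_path))

print(pairs[0])
```

Raising on a length mismatch, rather than silently truncating as a plain `zip` would, surfaces misaligned files before they reach training.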

sentiment_analysis_binary and sentiment_analysis_categorical

SST-1 and SST-2 (extends MR)

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).

SST-1: Stanford Sentiment Treebank, an extension of MR with train/dev/test splits provided and fine-grained labels: very positive, positive, neutral, negative, and very negative. The data is provided at the phrase level, so the model is trained on both phrases and sentences but scored only on sentences at test time. The training set is therefore an order of magnitude larger than its sentence count.

SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

General sample format

Both datasets follow the same format but differ in the range of their sentiment labels.

| Sentiment | Text |
| --- | --- |
| 4 | victor rosa is leguizamo 's best movie work so far , a subtle and richly internalized performance |

Raw example sentence (SST-1)

4 victor rosa is leguizamo 's best movie work so far , a subtle and richly internalized performance 

Raw example sentence (SST-2)

0 plotless collection of moronic stunts is by far the worst movie of the year .
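Both raw formats reduce to an integer label, a space, and the tokenized sentence, so one parser can serve both datasets with a different label table. This is a sketch only: the label-to-name mappings below are assumptions inferred from the two samples above (4 on a very positive SST-1 review, 0 on a negative SST-2 review), and `parse_sst_line` is a hypothetical helper.

```python
# Assumed mappings, inferred from the samples rather than from the project.
SST1_LABELS = {0: "very negative", 1: "negative", 2: "neutral",
               3: "positive", 4: "very positive"}
SST2_LABELS = {0: "negative", 1: "positive"}


def parse_sst_line(line: str, label_names: dict):
    """Split '<label> <text>' and validate the label against the dataset."""
    raw_label, text = line.strip().split(" ", 1)
    label = int(raw_label)
    if label not in label_names:
        raise ValueError(f"label {label} is out of range for this dataset")
    return label_names[label], text


sst1 = parse_sst_line(
    "4 victor rosa is leguizamo 's best movie work so far , "
    "a subtle and richly internalized performance ", SST1_LABELS)
sst2 = parse_sst_line(
    "0 plotless collection of moronic stunts is by far "
    "the worst movie of the year .", SST2_LABELS)
print(sst1[0], "|", sst2[0])  # very positive | negative
```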

coreference_resolution (antecedentScoring)

GAP and WikiCoref

"GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name). It's sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications." WikiCoref is "an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia." WikiCoref maintains the following directory structure:

Documents
Evaluation
Annotation
Output

Documents contains all the plain text samples for use during training. Annotation is in MMAX Project format, and "the markables are stored within /docName/Markables/docName_coref_level.xml." Evaluation "contains Conll-2011 Format" files for evaluating model accuracy. Output contains three subdirectories: Key, Dcoref, and Scoref. Key contains "WikiCoref in Table,Text Format". Dcoref contains "the output of Dcoref in Table,Text,StandfordXML format". Scoref contains "the output of Scoref in Table,Text,StandfordXML Format".

For more details about WikiCoref, please refer to "WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles."

Formatted example sentence (GAP)

| ID | Text | Pronoun | Pronoun-offset | A | A-offset | A-coref | B | B-offset | B-coref | URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| development-111 | Alice Perrers is the protagonist of Emma Campion's novel, The King's Mistress. She appears in Anya Seton's novel, Katherine. | She | 79 | Alice Perrers | 0 | TRUE | Emma Campion | 36 | FALSE | http://en.wikipedia.org/wiki/Alice_Perrers |

Raw example sentence (GAP)

development-111    Alice Perrers is the protagonist of Emma Campion's novel, The King's Mistress. She appears in Anya Seton's novel, Katherine.    She    79    Alice Perrers    0    TRUE    Emma Campion    36    FALSE    http://en.wikipedia.org/wiki/Alice_Perrers
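The offsets in a GAP row are character positions into the Text column, which makes each row easy to validate while parsing. The sketch below assumes tab-separated input with no header row, as in the raw sample above (drop `fieldnames` if your file includes one); `parse_gap` is a hypothetical helper.

```python
import csv
import io

GAP_COLUMNS = ["ID", "Text", "Pronoun", "Pronoun-offset", "A", "A-offset",
               "A-coref", "B", "B-offset", "B-coref", "URL"]


def parse_gap(tsv_text: str) -> list:
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=GAP_COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for mention in ("Pronoun", "A", "B"):
            offset = int(row[f"{mention}-offset"])
            row[f"{mention}-offset"] = offset
            # Each offset must point at its mention inside the text.
            span = row["Text"][offset:offset + len(row[mention])]
            if span != row[mention]:
                raise ValueError(f"{row['ID']}: {mention} offset does not match")
        row["A-coref"] = row["A-coref"] == "TRUE"
        row["B-coref"] = row["B-coref"] == "TRUE"
        rows.append(row)
    return rows


gap_sample = "\t".join([
    "development-111",
    "Alice Perrers is the protagonist of Emma Campion's novel, "
    "The King's Mistress. She appears in Anya Seton's novel, Katherine.",
    "She", "79", "Alice Perrers", "0", "TRUE", "Emma Campion", "36", "FALSE",
    "http://en.wikipedia.org/wiki/Alice_Perrers",
])

gap_rows = parse_gap(gap_sample)
print(gap_rows[0]["Pronoun-offset"], gap_rows[0]["A-coref"], gap_rows[0]["B-coref"])
# 79 True False
```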

Formatted example sentence (WikiCoref)

Any of the training files under the Documents directory are simply plain text files containing sentences. Therefore, the format is:

Text
Anatole France was a French poet, journalist, and novelist. He was born in Paris, and died in Saint-Cyr-sur-Loire...

Raw example sentence (WikiCoref)

Per the training file Documents/Anatole France:

Anatole France was a French poet, journalist, and novelist. He was born in Paris, and died in Saint-Cyr-sur-Loire. He was a successful novelist, with several best-sellers. Ironic and skeptical, he was considered in his day the ideal French man of letters...