Version: 2.0

NLP Model Training

Introduction

NLP-model-training is a microservice responsible for building and training models within python via TensorFlow.

Project Structure

NLP-model-training is designed as a simple command line interface (CLI) in order to train natural language processing (NLP) models. As such it has the commands prep, train, eval, cleanup, update_model_metadata_registry, and save.

The train command contains a complete end-to-end NLP model training pipeline with the tasks build_model, preprocess, combine_datasets, cache, train, and model_policy.

CLI commands

prep

prep retrieves all the necessary dependencies in order to train a specified model and sets up the basic expected local training directory structure.

Parameters

parameter	usage	required bool	default	choices	deprecation notice
datasets	JSON object with dataset info to train upon	True	None	see Airflow Documentation	N/A
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
gg_version	GraphGrid version to use	True	None	1.4.0, 2.0.0, latest (only with local Airflow development)	N/A
request_body	Accepts a raw GraphGrid Airflow request body	False	None	see here	N/A
log_level	Flag to determine the level of logging	False	INFO	DEBUG, INFO, WARN, ERROR	N/A
dataset_bucket	name of the bucket which contains the training and evaluation datasets	True	None	graphgrid-datasets	N/A
bert_bucket	name of the bucket which contains a BERT model	True	None	graphgrid-nlp-models	Removed in 2.0.0
endpoint_url	s3 or MinIO endpoint for dataset retrieval and saving resources out to	True	None	https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development)	N/A
remote_bert_model_path	path to the BERT model on remote cloud environment	True	None	resources/1.4/bertModel/	Removed in 2.0.0
remote_vocab_path	path to the vocab.txt on remote cloud environment for the associated BERT model on cloud storage service	True	None	resources/1.4/vocab.txt	Removed in 2.0.0
top_level_path	Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the `prep` task, otherwise it's expected to already exist.	True	None	N/A	N/A

train

train a NLP model based on the specified parameters. This command includes a full end to end training pipeline which first preprocesses text from the raw dataset/corpus format into a more standardized format (such as CoNLL). Following that transformation, data is processed further in preparation of being embedded through BERT. This includes tokenization, truncation, and tracking of sentence metadata (word lengths, indices, etc.). Once data is embedded through BERT, it undergoes further processing with post BERT embedding handlers. handlers vary heavily depending on the model, however, one common operation is averaging subword tokens (e.g. averaging the embedding for the multi token word "ham ##burger"). See below for a simplified example processing pipeline that a sample undergoes.

Screenshot

Once preprocessing through handlers is complete, data is cached for use during training. This allows models to only load the data that they are currently using during training to avoid massive memory consumption. After the models are trained, they are saved and a model policy is created documenting various characteristics of the previous training session (see here for information on model policies).

In addition to the training resources that are generated during training, a metadata.json file is generated at top_level_path/model_save/metadata.json which contains relevant information to propagate to the evaluation task. This file is used as a resource and reference for the model-metadata-registry, see here for more information regarding the model metadata registry.

Below is an example generated metadata.json file for a named_entity_recognition model.

{
    "modelVersion": {
        "modelName": "named_entity_recognition",
        "trainingDataset": [
            "conll"
        ],
        "trainingAccuracy": 0.989,
        "trainingLoss": 0.007,
        "evalAccuracy": null,
        "evalLoss": null,
        "properties": {
            "languages": [
                "en"
            ]
        }
    }
}

Parameters

parameter	usage	required bool	default	choices	deprecation notice
datasets	JSON object with dataset info to train upon	True	None	see Airflow Documentation	N/A
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
retraining	boolean for whether model is being retrained	False	false	true, false	N/A
request_body	Accepts a raw GraphGrid Airflow request body	False	None	see here	N/A
log_level	Flag to determine the level of logging	False	INFO	DEBUG, INFO, WARN, ERROR	N/A
top_level_path	Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the `prep` task, otherwise it's expected to already exist.	True	None	N/A	N/A

eval

eval reloads a trained model into memory from top_level_path/model_save/ and evaluates the model's performance on the evaluation test set (if one exists for the given model). Once that is complete, the metadata.json file from top_level_path/model_save/metadata.json is updated with the evaluation results.

Parameters

parameter	usage	required bool	default	choices	deprecation notice
datasets	JSON object with dataset info to train upon	True	None	see Airflow Documentation	N/A
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
top_level_path	Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the `prep` task, otherwise it's expected to already exist.	True	None	N/A	N/A

save

save saves the trained model, resources, and dependencies out to the cloud storage environment as specified by the endpoint_url parameter.

Parameters

parameter	usage	required bool	default	choices	deprecation notice
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
gg_version	GraphGrid version to use	True	None	1.4.0, 2.0.0, latest (only with local Airflow development)	N/A
request_body	Accepts a raw GraphGrid Airflow request body	False	None	see here	N/A
endpoint_url	s3 or MinIO endpoint for dataset retrieval and saving resources out to	True	None	https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development)	N/A
save_bucket	The name of the bucket to save the model and resources to.	True	None	graphgrid-nlp-models	N/A

update_model_metadata_registry

update_model_metadata_registry updates the model metadata registry based on the history of the trained model, and the generated metadata.json file. For more information about the model metadata registry, see here.

Parameters

parameter	usage	required bool	default	choices	deprecation notice
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
gg_version	GraphGrid version to use	True	None	1.4.0, 2.0.0, latest (only with local Airflow development)	N/A
endpoint_url	s3 or MinIO endpoint for dataset retrieval and saving resources out to	True	None	https://s3.us-west-2.amazonaws.com, http://Airflow-minio:9000 (only with local Airflow development)	N/A
save_bucket	The name of the bucket to save the model and resources to.	True	None	graphgrid-nlp-models	N/A
top_level_path	Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the `prep` task, otherwise it's expected to already exist.	True	None	N/A	N/A

cleanup

cleanup cleans up the local training directory and all the related resources.

Parameters

parameter	usage	required bool	default	choices	deprecation notice
model	Name of model to train	True	None	named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model	N/A
top_level_path	Local path to cache training dependencies/resources/etc. NOTE: this path must not exist if running the `prep` task, otherwise it's expected to already exist.	True	None	N/A	N/A

Model Policies

Model policies are a means of documenting the basic characteristics of a trained GraphGrid NLP model. A model policy informs any downstream project/app/etc. how to interact with the model. This is mostly seen through the inputTensors, outputTensors, and requiredOpsFiles fields.

Here is an example model policy produced from a named_entity_recognition training session.

{
    "corpora": [
        "conll"
    ],
    "languages": [
        "en"
    ],
    "checksum": "R1FKcWbcfmTwYY3ILeMaEw==",
    "modelName": "20210223T221627-nerModel",
    "inputTensors": [
        {
            "name": "masking_input:0",
            "shape": [
                1,
                128,
                768
            ]
        }
    ],
    "task": "NAMED_ENTITY_RECOGNITION",
    "testSet": "",
    "outputTensors": [
        {
            "name": "dense_1/truediv:0",
            "shape": [
                1,
                128,
                9
            ],
            "outputValues": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC",
                "B-MISC",
                "I-MISC"
            ]
        }
    ],
    "modelAccuracy": "0.98931605",
    "trainingDate": "2021-02-23T22:15:58",
    "requiredOpsFiles": [
        "_lstm_ops.so"
    ]
}

Here is a table describing all the fields within a model policy.

Field	Description	Type
corpora	List of all the corpora/datasets the model was trained on	List of strings
languages	The languages the model was trained on	List of strings
checksum	A checksum that nlp-model-training generates for downstream projects to verify after retrieval	String
modelName	name of the model	String
inputTensors	List of nested JSON objects documenting the name and shape of the input layer	List of nested JSON objects
task	boot-nlp related task name for the specified model	String
testSet	Unused/empty string	String
outputTensors	List of nested JSON objects documenting the name and shape of the output layer	List of nested JSON objects
modelAccuracy	The model accuracy during training	String
trainingDate	The date when the model was trained	String
requiredOpsFiles	The required Ops files in order to run the model	List of strings

Model Metadata Registry

The model metadata registry is a resource geared towards tracking all the trained models via the NLP-model-training pipeline and Airflow. This file lives at the top level of the bucket used for saving models (default com-graphgrid-nlp). The registry is responsible for which models are selected for packaging. As such, the newly trained model is compared to all the other currently trained models to determine whether it should be labeled as the current default model for packaging once training and evaluation is complete.

note

This file should never be altered manually.

Here is a truncated sample of the current GraphGrid model metadata registry which includes models for both the 1.4 and 2.0 platform versions.

{
    "org": {
        "default": {
            "package": {
                "1.4.0": {
                    "models": {
                        "part_of_speech_tagging": {
                            "modelVersion": "20210107T172128-posModel",
                            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20210107T172128-posModel.tar.gz"
                        },
                        ...
                        "sentiment_binary_lstm": {
                            "modelVersion": "20201222T054419-sentimentBinaryLstmModel",
                            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20201222T054419-sentimentBinaryLstmModel.tar.gz"
                        }
                    }
                },
                "2.0.0": {
                    "models": {
                        "part_of_speech_tagging": {
                            "modelVersion": "20210521T215028-posModel",
                            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210521T215028-posModel.tar.gz"
                        },
                        ...
                        "sentiment_categorical_lstm": {
                            "modelVersion": "20210325T110728-sentimentCategoricalLstmModel",
                            "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210325T110728-sentimentCategoricalLstmModel.tar.gz"
                        }
                    }
                }
            },
            "trainedModels": {
                "part_of_speech_tagging": {
                    "20201218T150016-posModel": {
                        "modelName": "part_of_speech_tagging",
                        "trainingDataset": "conll",
                        "trainingAccuracy": 0.954,
                        "trainingLoss": 0.154,
                        "evalAccuracy": 0.931,
                        "evalLoss": 0.246,
                        "properties": {
                            "languages": [
                                "en"
                            ]
                        },
                        "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/1.4.0/20201218T150016-posModel.tar.gz",
                        "timestamp": "2020-12-18T15:00:16",
                        "platformVersion": "1.4.0"
                    },
                    ...
                    "20210719T203352-posModel": {
                        "modelName": "part_of_speech_tagging",
                        "trainingDataset": [
                            "conll"
                        ],
                        "trainingAccuracy": 0.945,
                        "trainingLoss": 0.035,
                        "evalAccuracy": 0.918,
                        "evalLoss": 0.056,
                        "properties": {
                            "languages": [
                                "en"
                            ]
                        },
                        "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210719T203352-posModel.tar.gz",
                        "timestamp": "2021-07-19T20:33:52",
                        "platformVersion": "2.0.0"
                    }
                },
                ...
                "sentiment_categorical_lstm": {
                    "20210325T110728-sentimentCategoricalLstmModel": {
                        "modelName": "sentiment_categorical_lstm",
                        "trainingDataset": [
                            "sst1"
                        ],
                        "trainingAccuracy": 0.749,
                        "trainingLoss": 0.607,
                        "evalAccuracy": 0.487,
                        "evalLoss": 1.354,
                        "properties": {
                            "languages": [
                                "en"
                            ]
                        },
                        "location": "https://s3.us-west-2.amazonaws.com/graphgrid-nlp-models/2.0.0/20210325T110728-sentimentCategoricalLstmModel.tar.gz",
                        "timestamp": "2021-03-25T11:07:28",
                        "platformVersion": "2.0.0"
                    }
                }
            }
        }
    }
}

Training pipeline tasks

As the name implies, the train CLI command is the heart of NLP-model-training as it is responsible for running the complete NLP model training pipeline.

Some of the pipeline tasks are shared by multiple different models (e.g. BERT embedding), while some other models require custom or unique steps (e.g. translation models). The pipeline_config module dictates which exact pipeline steps each model performs.

build_model

build_model builds and compiles a GraphGrid NLP model. The following are all the possible trainable/buildable models: named_entity_recognition, part_of_speech_tagging, relation_extraction, coreference_resolution, sentiment_analysis_categorical, sentiment_analysis_binary, translation, keyphrase_extraction, test_model.

preprocess

preprocess processes a given corpus/dataset's raw samples and transforms them into a more standardized format.

There are currently 10 preprocessing modules, which follow a naming scheme of preprocess_MODEL_CORPUS, unless the corpus is shared across different models in which case the model field is dropped.

combine_datasets

combine_datasets loads all preprocessed datasets into memory, concatenates them, and then shuffles them. In training scenarios where only a single corpus is used for training, this step still occurs, but simply shuffles the dataset as there are not multiple datasets to combine.

cache

cache transforms the preprocessed dataset and caches it inside the local training directory. For all models other than translation, this includes embedding the samples through BERT.

train

train trains a given model with the cached training samples from the cache task. Once training is complete, the model is saved within the top_level_path/model_save directory.

model_policy

model_policy generates a model policy for the newly trained NLP model. For information regarding model policies see here.

Pipeline parameters

The NLP pipeline defines parameter dataclasses to ease interfaces across distinct model training pipeline steps. For example, the module build_model_parameters defines the dataclass BuildModelParameters which includes the fields model_name, retraining, and retraining_path. Each parameter dataclass allows for a single and simple interface between pipeline steps which places the responsibility of individual parameter retrieval on downstream pipeline tasks. For example, if two different models have distinct model building modules which require different parameters, we can embed both within BuildModelParameters and have the distinct modules retrieve only the parameters that they need. This then allows us to define our pipeline through lambda functions that take in the parameter dataclasses.

In the following sections, we'll go through each individual parameter dataclass and list out the associated parameters.

BuildModelParameters

Field	Description	Type
model_name	The name of the model	String
retraining	Flag to determine whether a pretrained model exists locally	Boolean
retraining_path	Path to local pretrained model	String

CacheParameters

Field	Description	Type
model_name	The name of the model	String
cache_path	Path to cache training samples	String
outfile_path	Path to preprocessed samples file or directory	String

CombineDatasetsParameters

Field	Description	Type
datasets	Dictionary with each dataset being an entry with relevant information	Dictionary
outfile_path	Path to preprocessed samples file or directory	String

ModelPolicyParameters

Field	Description	Type
model	The trained model	[BaseModel, TranslationModel]
datasets	List of all the datasets the model was trained on	List
history	History object that contains training performance metrics	[tf.keras.callbacks.History, DummyHistory]
model_policy_path	Path to save the model policy	String
languages	List of all the languages the model was trained on	List

PreprocessingParameters

Field	Description	Type
infile_path	Path to load raw dataset/corpus	String
outfile_path	Path to save preprocessed samples file or directory	String
dataset	Name of the dataset or corpus to preprocess	String
is_pos	Flag to mark part-of-speech models	Boolean

TrainingParameters

Field	Description	Type
model	The trained model	[BaseModel, TranslationModel]
model_name	The name of the model	String
cache_path	Path to cache training samples	String
model_save_path	Path to save the model	String

Data transformation specifications by model

All samples and data undergoes different types of transformations before being fed directly into our models. This section breaks down each model's transformations in sequential order and documents what occurs at each step.

named_entity_recognition

Training pipeline step	Input shape	Input shape type	Output shape	Output shape type	Link
`preprocess_conll`	`(None,)`	String	`(None,)`	List of strings	here
tokenization	`(batch_size,)`	Batch of strings	`(batch_size, None, None)`	`tf.RaggedTensor`	here
BERT	`(batch_size, None, None)`	`tf.RaggedTensor`	[`(batch_size, 768)`, `(batch_size, 128, 768)`]	`tf.Tensor`	here
`conll_handler`	`(batch_size, 128, 768)`	`tf.Tensor`	`(batch_size, 128, 768)`	`tf.Tensor`	here

part_of_speech_tagging

These transformation steps are identical to those of named_entity_recognition.

relation_extraction

Training pipeline step	Input shape	Input shape type	Output shape	Output shape type	Link
`preprocess_re_kbp37`	`(None,)`	String	`(None,)`	List of strings	here
tokenization	`(batch_size,)`	Batch of strings	`(batch_size, None, None)`	`tf.RaggedTensor`	here
BERT	`(batch_size, None, None)`	`tf.RaggedTensor`	[`(batch_size, 768)`, `(batch_size, 128, 768)`]	`tf.Tensor`	here
`handle_re_processing`	`(batch_size, 128, 768)`	`tf.Tensor`	`(batch_size, 128, 808)`	`tf.Tensor`	here

Relation extraction handler

The relation extraction handler takes BERT embeddings as its input and calculates the distance between the mentions and relative words within the sample, then embeds their positions. This yields a tensor of shape (batch_size, 128, 808). Relation extraction iterates token-by-token through the BERT embedding to calculate the distance between the token and the mentions. This yields two distances for every word (the distance is 0 when analyzing the mentions themselves). These distances are then passed through both sin and cos functions and expanded to a dimensionality of 20. As there are two mentions and 20 corresponding position embeddings, this allows the final output shape to arrive at (batch_size, 128, 768 + 20 + 20).

coreference_resolution

Training pipeline step	Input shape	Input shape type	Output shape	Output shape type	Link
`preprocess_crr_gap`, `preprocess_crr_wikicoref`	`(None,)`	String	`(None,)`	List of strings	here
tokenization	`(batch_size,)`	Batch of strings	`(batch_size, None, None)`	`tf.RaggedTensor`	here
BERT	`(batch_size, None, None)`	`tf.RaggedTensor`	[`(batch_size, 768)`, `(batch_size, 128, 768)`]	`tf.Tensor`	here
`handle_crr_processing`	`(batch_size, 128, 768)`	`tf.Tensor`	`(batch_size, 13854)`	`tf.Tensor`	here

Coreference resolution handler

The coreference resolution handler takes BERT embeddings as its input and calculates relative positional information about the sample and then creates a flattened tensor of all the information of shape (batch_size, 13854). Which can be broken down in the following manner: ((768 * 9 + 1 + 9) * 2 + 1 + 9).

The initial 768 * 9 are the following values:

Index	Description
0	The first corresponding token embedding for the mention
1	The last corresponding token embedding for the mention
2	The first token embedding before the mention (e.g. mention index - 1)
3	The second token embedding before the mention (e.g. mention index - 2)
4	The first token embedding after the mention (e.g. mention index + 1)
5	The second token embedding after the mention (e.g. mention index + 2)
6	The mean of the 5 tokens following the mention
7	The mean of the 5 tokens before the mention
8	The pooled output of the sample

As each embedded token has a shape of 768 and there are 9 different values.

The next singular appended value is the length of the mention within the sample.

The following 9 flat values are the "bucketized", i.e. one-hot encoded, sequence length of the tensor with depth 9. Finally, as coreference resolution is concerned with two different mentions, we repeat this same process for the second mention. Which is how we arrive at a shape of (batch_size, 13854).

keyphrase_extraction

Training pipeline step	Input shape	Input shape type	Output shape	Output shape type	Link
`preprocess_keyphrase_semeval2017`	`(None,)`	String	`(None,)`	List of strings	here
tokenization	`(batch_size,)`	Batch of strings	`(batch_size, None, None)`	`tf.RaggedTensor`	here
BERT	`(batch_size, None, None)`	`tf.RaggedTensor`	[`(batch_size, 768)`, `(batch_size, 128, 768)`]	`tf.Tensor`	here
`handle_keyphrase_processing`	`(batch_size, 128, 768)`	`tf.Tensor`	`(batch_size, 128, 768)`	`tf.Tensor`	here

Keyphrase extraction handler

The keyphrase extraction handler takes BERT embeddings as its input and does not alter it in any way. Therefore, its exact output is the same as the BERT embedding, i.e. a tensor of shape (batch_size, 128, 768)

sentiment_analysis_binary and sentiment_analysis_categorical

Training pipeline step	Input shape	Input shape type	Output shape	Output shape type	Link
`preprocess_sentiment_sst`	here
tokenization	`(batch_size,)`	Batch of strings	`(batch_size, None, None)`	`tf.RaggedTensor`	here
BERT	`(batch_size, None, None)`	`tf.RaggedTensor`	[`(batch_size, 768)`, `(batch_size, 128, 768)`]	`tf.Tensor`	here
`handle_sentiment_lstm_processing`, `handle_sentiment_cat_lstm_processing`	`(batch_size, 128, 768)`	`tf.Tensor`	`(batch_size, 128, 768)`	`tf.Tensor`	here

Sentiment analysis handler

The sentiment analysis handler takes BERT embeddings as its input and does not alter it in any way. Therefore, its exact output is the same as the BERT embedding, i.e. a tensor of shape (batch_size, 128, 768)

Shared transformations

Various transformations within the pipeline are shared across different models. This section goes through each shared transformation.

Tokenization

Tokenization takes in preprocessed samples and transforms them into the expected BERT input shape. This means that we first split sentences by whitespaces (as labels for specific tasks correspond words to labels), tokenize them through tensorflow_text.BertTokenizer, extract relevant metadata about the sentence (word indices, lengths, etc.), and then reshape the ragged tensor into the format that the helper function/model bert_pack_inputs expects.

The exact input is individual sentences in a simple string format. While the exact output is a ragged tensor with two nested rows, where row one defines the flat dimension of all the words in the sentence, and row two defines tensor sub-dimensions for sub tokens (e.g. indices where single words are comprised of multiple sub-tokens).

BERT embedding

BERT embedding takes tokenized sentences and processes them through BERT. This step is directly downstream from Tokenization, which means its input is the ragged tensor output from that step. The exact output for this step depends on whether the model expects pooled outputs (which can be thought of the entire culmination of the sentence), or sequence output (every token from the sentence). BERT defines a maximum sequence length for sentences, where any sentences that exceed that length in their number of tokens is truncated to the sequence length. For all of our models, the sequence length is defined as 128. Therefore, the exact output for this step is a either a tensor of shape (batch_size, 768), or (batch_size, 128, 768), where the former shape is for pooled outputs and the latter shape is for sequence outputs.

CoNLL handler

CoNLL handler is the post-BERT processing module that is shared for part_of_speech_tagging and named_entity_recognition models. As a handler, this step is directly downstream from BERT embedding, meaning it takes the sequence output from that step as its input. This step takes the mean of sub-tokens across the input batch, which aligns the number of inputs to the number of labels and allows for better model performance. This means that rather than retaining a multi-token word such as ham ##burger, we can take the mean of the BERT embeddings for the corresponding tokens ham and ##burger as they're the same word corresponding to the same label. The exact output of this step is (batch_size, 128, 768), however, all sub tensors within the sequence past the last word/token have be replaced with zero tensors.

Dataset Formats per model

Most of the NLP models utilize distinct datasets which use different formats. Therefore this section outlines each model's dataset and their basic formats.

named_entity_recognition

Computation Natural Language Learning (CoNLL)

"CoNLL2003: each word has to be on a separate line, and there must be an empty line after each sentence. A line must contain at least 2 columns, the first one being the word itself, the last one being the named entity. It does not matter if there are extra columns that contain tags or chunks in between. Tags have to be given in the IOB format (it can be IOB1 or IOB2)."

General sample format

Word string	POS-tag IOB string	Syntactic chunk tag IOB string	NER-tag IOB string
Japan	NNP	I-NP	I-LOC
's	POS	B-NP	O
Hashimoto	NNP	I-NP	I-PER
leaves	VBZ	I-VP	O
Brazil	NNP	I-NP	I-LOC
for	IN	I-PP	O
Peru	NNP	I-NP	I-LOC
.	.	O	O

Raw example sentence

Japan NNP I-NP I-LOC
's POS B-NP O
Hashimoto NNP I-NP I-PER
leaves VBZ I-VP O
Brazil NNP I-NP I-LOC
for IN I-PP O
Peru NNP I-NP I-LOC
. . O O

Wikigold

"Wikigold uses the text of 149 articles from the May 22, 2008 dump of English Wikipedia. The articles were selected at random from all articles describing named entities, with a roughly equal proportion of article topics from each of the four CONLL-03 classes (LOC, MISC, ORG, PER)."

General sample format

Word	Pos-tag
010	I-MISC
is	O
the	O
tenth	O
album	O
from	O
Japanese	I-MISC
Punk	O
Techno	O
band	O
The	I-ORG
Mad	I-ORG
Capsule	I-ORG
Markets	I-ORG
.	O

Raw example sentence

010 I-MISC
is O
the O
tenth O
album O
from O
Japanese I-MISC
Punk O
Techno O
band O
The I-ORG
Mad I-ORG
Capsule I-ORG
Markets I-ORG
. O

Emerging and Rare Entities

"This shared task (WNUT2017 Shared Task) focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet "so.. kktny in 30 mins?!" - even human experts find the entity kktny hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities."

General sample format

Word	NER-tag
@paulwalk	O
It	O
's	O
the	O
view	O
from	O
where	O
I	O
'm	O
living	O
for	O
two	O
weeks	O
.	O
Empire	B-location
State	I-location
Building	I-location
=	O
ESB	B-location
.	O
Pretty	O
bad	O
storm	O
here	O
last	O
evening	O
.	O

Raw example format

@paulwalk   O
It  O
's  O
the O
view    O
from    O
where   O
I   O
'm  O
living  O
for O
two O
weeks   O
.   O
Empire  B-location
State   I-location
Building    I-location
=   O
ESB B-location
.   O
Pretty  O
bad O
storm   O
here    O
last    O
evening O
.   O

part_of_speech_tagging

English CoNLL

Part_of_speech_tagging follows the same format as named_entity_recognition.

General sample format

This is the same as named_entity_recognition barring usage of the POS-tags column rather than NER-tags.

Raw example sentence

This is the same as named_entity_recognition.

Groningen Meaning Bank

"The Groningen Meaning Bank (GMB) consists of public domain English texts with corresponding syntactic and semantic representations. The GMB is developed at the University of Groningen. A multi-lingual version of the GMB is the Parallel Meaning Bank. A thorough description of the GMB can be found in the Handbook of Linguistic Annotation."

General sample format

Sentence #	Word	POS	Tag
Sentence: 1	Thousands	NNS	O
	of	IN	O
	demonstrators	NNS	O
	have	VBP	O
	marched	VBN	O
	through	IN	O
	London	NNP	B-geo
	to	TO	O
	protest	VB	O
	the	DT	O
	war	NN	O
	in	IN	O
	Iraq	NNP	B-geo
	and	CC	O
	demand	VB	O
	the	DT	O
	withdrawal	NN	O
	of	IN	O
	British	JJ	B-gpe
	troops	NNS	O
	from	IN	O
	that	DT	O
	country	NN	O
	.	.	O

Raw example sentence

Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O
,the,DT,O
,war,NN,O
,in,IN,O
,Iraq,NNP,B-geo
,and,CC,O
,demand,VB,O
,the,DT,O
,withdrawal,NN,O
,of,IN,O
,British,JJ,B-gpe
,troops,NNS,O
,from,IN,O
,that,DT,O
,country,NN,O
,.,.,O

relation_extraction

kbp37 (SemEval-2010 Task 8 and MIML-RE)

Within SemEeval-2010 Task 8, "there are 9 directional relations and an additional 'other' relation, resulting in 19 relation classes in total. Given a sentence and two target nominals, a prediction is counted as correct only when both the relation and its direction are correct. The performance is evaluated in terms of the F1 score defined by SemEval-2010 Task 8 (Hendrickx et al., 2009)." kbp37 is a modified combination of the datasets SemEval-2010 Task 8 and MIML-RE. kbp37 "contains 18 directional relations and an additional 'no_relation' relation, resulting in 37 relation classes."

General sample format

First line of a single sample contains the sentence number and associated text.

Sentence number	Text
0	" \ Thom Yorke \ of \ Radiohead \ has included the + for many of his signature distortion sounds using a variety of guitars to achieve various tonal options . "

The second line of a sample contains the relation for the sentence.

Relation
per:employee_of(e1,e2)

Raw example sentence

0    " <e1> Thom Yorke </e1> of <e2> Radiohead </e2> has included the + for many of his signature distortion sounds using a variety of guitars to achieve various tonal options . "
per:employee_of(e1,e2)

keyphrase_extraction

SemEval-2017 Task 10

"A corpus for the task is built from ScienceDirect open access publications and is available freely for participants, without the need to sign a copyright agreement. It consists of 500 journal articles evenly distributed among the domains Computer Science, Material Sciences and Physics. Three types of documents are provided: plain text documents, brat .ann standoff documents, and XML documents. Plain text documents contain sampled paragraphs, .ann files contain annotations for those paragraphs and .xml documents come with the original full article text. The training data part of the corpus consists of 350 documents, 50 are kept for development and 100 for testing."

General sample format

Text files consist of sentences with bracketed numbers added after keyphrases. Therefore, the formatting is simply:

Text
...which limits the allowed power density in fusion reactors [1] and is a severe technical challenge in itself [2]...

An annotation file is organized in the following manner:

ID	Label	Start-offset	End-offset	Surface-form
T1	Process	107	122	fusion reactors
...	...	...	...	...
T15	Task	267	293	severe technical challenge

Raw example sentence

Power and particle exhaust are crucial for the viability of any future fusion power plant concept. Heat in fusion reactors must be extracted through a wall and cannot be exhausted volumetrically, which limits the allowed power density in fusion reactors [1] and is a severe technical challenge in itself [2]. In addition, structural material changes resulting from neutron irradiation cause degradation in the heat exhaust capabilities of existing designs [3] and static surfaces can suffer severely from erosion due to impinging plasma particles [4,5]. It is concluded that conventional concepts and materials for plasma facing components (PFCs) reach their limits in terms of material lifetime and power exhaust at approximately 20MW/m2, which is presumably dramatically reduced to <10MW/m2 due to neutron damage in a D-T reactor [6] or even only half that value [7].

with corresponding annotation file:

T1    Process 107 122    fusion reactors
T2    Process 238 253    fusion reactors
T3    Process 464 479    static surfaces
T4    Material 0 26    Power and particle exhaust
T5    Process 365 384    neutron irradiation
T6    Task 615 639    plasma facing components
T7    Task 641 645    PFCs
*    Synonym-of T7 T6
T8    Material 820 831    D-T reactor
T9    Process 322 349    structural material changes
T10    Material 322 341    structural material
T11    Material 520 546    impinging plasma particles
T12    Material 800 807    neutron
T13    Process 800 814    neutron damage
T14    Material 71 89    fusion power plant
T15    Task 267 293    severe technical challenge

OpenKP

"OpenKeyPhrase(OpenKP), [is] a large scale, open domain keyphrase extraction dataset. The dataset features 148,124 real world web documents along with a human annotation indicating the 1-3 most relevant keyphrases."

It consists of "~100,000 urls from the Bing Index to get a representative sample of true domain diversity. Additionally, we sampled ~40,000 urls from the MSMARCO QA corpus since it can be considered a representative sample of open domain web document search. Once the urls are selected they are provided to an expert judge who visits the website, explores its content and when they are done annotates 1-3 keyphrases in the document they believe to be most salient to the overall document."

General sample format

url	text	VDOM	Keyphrases
.../star-trek-discovery-season-1.html	Star Trek Discovery Season 1 Director NA Actors Jason Isaacs...	[{\"Id\":0,\"text\":\"Star Trek Discovery Season 1\",\"feature\":[44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0,44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0],\"start_idx\":0,\"end_idx\":5},...,{\"Id\":0,\"text\":\"Play Movie\",\"feature\":[673.0,83.0,2085.0,31.0,0.0,0.0,0.0,0.0,12.0,0.0,667.0,95.0,2086.0,29.0,1.0,0.0,0.0,0.0,12.0,0.0],\"start_idx\":309,\"end_idx\":311}]"	[["Star", "Trek"], ["Jason", "Isaacs"], ["Doug", "Jones"]]

Raw example sample

{"url": "http://.../star-trek-discovery-season-1.html", "text": "Star Trek Discovery Season 1 Director NA Actors Jason Isaacs Doug Jones Shazad Latif Sonequa MartinGreen Genres SciFi Country USA Release Year 2017 Duration NA Synopsis Ten years before Kirk Spock and the Enterprise the USS Discovery discovers new worlds and lifeforms as one Starfleet officer learns to understand all things alien YOU ARE WATCHING Star Trek Discovery Season 1 000 000 Loaded Progress The video keeps buffering Just pause it for 510 minutes then continue playing Share Star Trek Discovery Season 1 movie to your friends Share to support Putlocker 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Version 1 Server Mega Play Movie Version 2 Server TheVideo Link 1 Play Movie Version 3 Server TheVideo Link 2 Play Movie Version 4 Server TheVideo Link 3 Play Movie Version 5 Server TheVideo Link 4 Play Movie Version 6 Server NowVideo Play Movie Version 7 Server NovaMov Play Movie Version 8 Server VideoWeed Play Movie Version 9 Server MovShare Play Movie Version 10 Server CloudTime Play Movie Version 11 Server VShare Link 1 Play Movie Version 12 Server VShare Link 2 Play Movie Version 13 Server VShare Link 3 Play Movie Version 14 Server VShare Link 4 Play Movie Version 15 Other Link 1 Play Movie Version 16 Other Link 2 Play Movie Version 17 Other Link 3 Play Movie Version 18 Other Link 4 Play Movie Version 19 Other Link 5 Play Movie Version 20 Other Link 6 Play Movie Version 21 Other Link 7 Play Movie Version 22 Other Link 8 Play Movie Version 23 Other Link 9 Play Movie Version 24 Other Link 10 Play Movie Version 25 Other Link 11 Play Movie Version 26 Other Link 12 Play Movie Version 27 Other Link 13 Play Movie Version 28 Other Link 14 Play Movie Version 29 Other Link 15 Play Movie", "VDOM": "[{\"Id\":0,\"text\":\"Star Trek Discovery Season 1\",\"feature\":[44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0,44.0,728.0,78.0,45.0,1.0,0.0,1.0,0.0,20.0,0.0],\"start_idx\":0,\"end_idx\":5},...,{\"Id\":0,\"text\":\"Play Movie\",\"feature\":[673.0,83.0,2085.0,31.0,0.0,0.0,0.0,0.0,12.0,0.0,667.0,95.0,2086.0,29.0,1.0,0.0,0.0,0.0,12.0,0.0],\"start_idx\":309,\"end_idx\":311}]", "KeyPhrases": [["Star", "Trek"], ["Jason", "Isaacs"], ["Doug", "Jones"]]}

translation

OPUS-100

"OPUS-100 is an English-centric, meaning that all training pairs include English on either the source or target side, multi-lingual corpus covering 100 languages. The OPUS collection is composed of multiple corpora, ranging from movie subtitles to GNOME documentation to the Bible. The dataset is not curated or balanced, instead opting for the simplest approach of downloading all corpora for each language pair and concatenating them."

General sample format

See OPUS Data Formats for an in depth explanation of the formatting.

However, "files are untokenized (raw format) and they may contain multiple sentences per line. They are aligned together to their corresponding sentence(s) in the other language. Empty alignments are excluded from the plain text files." Therefore, the basic format is simply:

Text
...
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
...

with an aligning text document for the target language:

Text
...
Elle contribue également à la stabilité et à la détente dans notre secteur du monde.
Cette politique recueille une large adhésion populaire.
...

Raw example sentence

"Plain text files are provided for each bitext in OPUS. The name follows the typical name conventions used in Moses, i.e. using file extensions that correspond to the language ID. For example, for the RF corpus the two files for English and French are called:"

RF.en-fr.en
RF.en-fr.fr

The contents of the English file looks like this:

Statement of Government Policy by the Prime Minister, Mr Ingvar Carlsson, at the Opening of the Swedish Parliament on Tuesday, 4 October, 1988.
Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.
Sweden's policy of neutrality is of decisive importance for our peace and independence.
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
It will be pursued with firmness and consistency.
...

And the corresponding French file looks like this:

Declaration de Politique Générale du Gouvernement présentée mardi 4 octobre 1988 devant le Riksdag par Monsieur Ingvar Carlsson, Premier Ministre.
Majestés, Altesses Royales, Monsieur le Président, Mesdames et Messieurs les députés!
La politique suédoise de neutralité revêt une importance capitale pour la paix et l' indépendance de notre pays.
Elle contribue également à la stabilité et à la détente dans notre secteur du monde.
Cette politique recueille une large adhésion populaire.
Elle sera poursuivie avec énergie et cohérence.
...

sentiment_analysis_binary and sentiment_analysis_categorical

SST-1 and SST-2 (extends MR)

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005). SST-1: Stanford Sentiment Treebank - an extension of MR but with train/dev/test splits provided and fine-grained labels. The labels take the form of very positive, positive, neutral, negative, and very negative. Note that data is actually provided at the phrase-level and hence we train the model on both phrases and sentences but only score on sentences at test time. Thus the training set is an order of magnitude larger than listed in the above table. SST-2 Same as SST-1 but with neutral reviews removed and binary labels.

General sample format

Both datasets follow the same format but differ in the range of their sentiment labels.

Sentiment	Text
4	victor rosa is leguizamo 's best movie work so far , a subtle and richly internalized performance

Raw example sentence (SST-1)

4 victor rosa is leguizamo 's best movie work so far , a subtle and richly internalized performance

Raw example sentence (SST-2)

0 plotless collection of moronic stunts is by far the worst movie of the year .

coreference_resolution (antecedentScoring)

GAP and WikiCoref

"GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name). It's sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications." Wikicoref is "an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia." Wikicoref maintains the following directory structure:

Documents
Evaluation
Annotation
Output

Documents contains all the plain text samples for use during training. Annotation is in MMAX Project format, and "the markables are stored within /docName/Markables/docName_coref_level.xml." Evaluation "contains Conll-2011 Format" files for evaluation model accuracy. Output contains three subdirectories: Key, Dcoref, and Scoref. Key contains "WikiCoref in Table,Text Format". Dcoref contains "the output of Dcoref in Table,Text,StandfordXML format". Scoref contains "the output of Scoref in Table,Text,StandfordXML Format" respectively.

For more details about Wikicoref please refer to "WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles."

Formatted example sentence (GAP)

ID	Text	Pronoun	Pronoun-offset	A	A-offset	A-coref	B	B-offset	B-coref	URL
development-111	Alice Perrers is the protagonist of Emma Campion's novel, The King's Mistress. She appears in Anya Seton's novel, Katherine.	She	79	Alice Perrers	0	TRUE	Emma Campion	36	FALSE	http://en.wikipedia.org/wiki/Alice_Perrers

Raw example sentence (GAP)

development-111    Alice Perrers is the protagonist of Emma Campion's novel, The King's Mistress. She appears in Anya Seton's novel, Katherine.    She    79    Alice Perrers    0    TRUE    Emma Campion    36    FALSE    http://en.wikipedia.org/wiki/Alice_Perrers

Formatted example sentence (Wikicoref)

Any of the training files under the Documents directory are simply plain text files containing sentences. Therefore, the format is:

Text
Anatole France was a French poet, journalist, and novelist. He was born in Paris, and died in Saint-Cyr-sur-Loire...

Raw example sentence (Wikicoref)

Per the training file Documents/Anatole France:

Anatole France was a French poet, journalist, and novelist. He was born in Paris, and died in Saint-Cyr-sur-Loire. He was a successful novelist, with several best-sellers. Ironic and skeptical, he was considered in his day the ideal French man of letters...

Introduction​

Project Structure​

CLI commands​

prep​

Parameters​

train​

Parameters​

eval​

Parameters​

save​

Parameters​

update_model_metadata_registry​

Parameters​

cleanup​

Parameters​

Model Policies​

Model Metadata Registry​

Training pipeline tasks​

build_model​

preprocess​

combine_datasets​

cache​

train​

model_policy​

Pipeline parameters​

BuildModelParameters​

CacheParameters​

CombineDatasetsParameters​

ModelPolicyParameters​

PreprocessingParameters​

TrainingParameters​

Data transformation specifications by model​

named_entity_recognition​

part_of_speech_tagging​

relation_extraction​

Relation extraction handler​

coreference_resolution​

Coreference resolution handler​

keyphrase_extraction​

Keyphrase extraction handler​

sentiment_analysis_binary and sentiment_analysis_categorical​

Sentiment analysis handler​

Shared transformations​

Tokenization​

BERT embedding​

CoNLL handler​

Dataset Formats per model​

named_entity_recognition​

Computation Natural Language Learning (CoNLL)​

General sample format​

Raw example sentence​

Links​

Wikigold​

General sample format​

Raw example sentence​

Links​

Emerging and Rare Entities​

General sample format​

Raw example format​

Links​

part_of_speech_tagging​

English CoNLL​

General sample format​

Raw example sentence​

Links​

Groningen Meaning Bank​

General sample format​

Raw example sentence​

Links​

relation_extraction​

kbp37 (SemEval-2010 Task 8 and MIML-RE)​

General sample format​

Raw example sentence​

Links​

keyphrase_extraction​

SemEval-2017 Task 10​

General sample format​

Raw example sentence​

Links​

OpenKP​

Introduction

Project Structure

CLI commands

prep

Parameters

train

Parameters

eval

Parameters

save

Parameters

update_model_metadata_registry

Parameters

cleanup

Parameters

Model Policies

Model Metadata Registry

Training pipeline tasks

build_model

preprocess

combine_datasets

cache

train

model_policy

Pipeline parameters

BuildModelParameters

CacheParameters

CombineDatasetsParameters

ModelPolicyParameters

PreprocessingParameters

TrainingParameters

Data transformation specifications by model

named_entity_recognition

part_of_speech_tagging

relation_extraction

Relation extraction handler

coreference_resolution

Coreference resolution handler

keyphrase_extraction

Keyphrase extraction handler

sentiment_analysis_binary and sentiment_analysis_categorical

Sentiment analysis handler

Shared transformations

Tokenization

BERT embedding

CoNLL handler

Dataset Formats per model

named_entity_recognition

Computation Natural Language Learning (CoNLL)

General sample format

Raw example sentence

Links

Wikigold

General sample format

Raw example sentence

Links

Emerging and Rare Entities

General sample format

Raw example format

Links

part_of_speech_tagging

English CoNLL

General sample format

Raw example sentence

Links

Groningen Meaning Bank

General sample format

Raw example sentence

Links

relation_extraction

kbp37 (SemEval-2010 Task 8 and MIML-RE)

General sample format

Raw example sentence

Links

keyphrase_extraction

SemEval-2017 Task 10

General sample format

Raw example sentence

Links

OpenKP