Version: 2.0

SDK Dataset Structure

This section covers the details and usage of our SDK dataset structure.

The SDK dataset structure is a new format that can be used to train models for any of the services offered by GraphGrid NLP.

At its core, the GraphGrid SDK dataset structure is an aggregation of multiple task training datasets. It provides a dataset format that collects training data for all the distinct NLP tasks CDP supports into a single dataset. By merging the different task training formats, the structure offers a more efficient process than maintaining separate traditional datasets.

The dataset structure was designed to be simple. We have significantly cut down the metadata (character indices, text lengths, etc.) that individual training task formats require. This simple formatting provides a way to manually label data for training several NLP tasks. Since the format is json, samples can also be programmatically manipulated and generated from other data sources.

The SDK dataset structure uses a json lines (jsonl) file format to store the training data for multiple model types. A single line can contain training information for as many models as desired, whether that be all possible models or only a single model.

You are not required to use CDP's SDK dataset format for model training; traditional dataset formats can be used instead. To do so, upload them manually to MinIO and then specify their location in the TrainRequestBody when kicking off model training. For information about traditional task formats, please see the NLP model training page.

Dataset File

The dataset file itself is a json lines (jsonl) file. The jsonl format is also known as ndjson (newline delimited json).

Each line in the jsonl is its own training sample.

Below is an example jsonl of two training samples:

{ "sentence": "This is an example sentence",  "named_entity": {...}, "pos": {...}}
{ "sentence": "This is another example sentence", "named_entity": {...}, "pos": {...}}

Note that each sample is on its own line; this is required by the jsonl format.

Each sample is independent of other samples. Coreference data may share sentences with other samples, but that does not affect how training occurs for any individual sample.
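Since each line is just a json object, a dataset file can be generated with a few lines of code. Here is a minimal sketch using only Python's standard json module (the sample contents are illustrative):

import json

# Each dict is one training sample and becomes one line of the jsonl file.
samples = [
    {"sentence": "This is an example sentence", "keyphrases": ["example sentence"]},
    {"sentence": "This is another example sentence", "sentiment": {"binary": 1}},
]

# Write one json object per line (newline delimited).
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Reading it back: each line parses independently.
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["sentence"])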

Dataset line sample

In the examples below, we expand the json for easier reading, but keep in mind that each sample is stored on a single line within the jsonl.

Each training sample consists of the raw text ("sentence") and some labeled data. Let us look at a few examples.

Here is a sample that can be used to train a single translation model with multiple languages.

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"de": "Bill Gates und sein verstorbener Jugendfreund Paul Allen gründeten Microsoft am 4. April 1975.",
"el": "Ο Μπιλ Γκέιτς και ο αείμνηστος παιδικός του φίλος Πολ Άλεν ίδρυσαν τη Microsoft στις 4 Απριλίου 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。",
"ko": "1975년 4월 4일 Bill Gates와 그의 늦은 소꿉친구 Paul Allen은 Microsoft를 설립했습니다.",
"pt": "Bill Gates e seu amigo de infância Paul Allen fundaram a Microsoft em 4 de abril de 1975 .",
"ru": "Билл Гейтс и его покойный друг детства Пол Аллен основали Microsoft 4 апреля 1975 года.",
"tr": "Bill Gates ve geç çocukluk arkadaşı Paul Allen, Microsoft'u 4 Nisan 1975'te kurdu.",
"zh": "比尔盖茨和他已故的儿时好友保罗艾伦于 1975 年 4 月 4 日创立了微软。"
}
}

This is another sample, this time providing training data for named entity recognition and part of speech models.

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."]
}

These individual fields (translations, named_entity, pos) are for training different task models.

Sample Training Formats

This next portion breaks down the organization of the SDK dataset and the different training formats available.

Here is a quick reference table for which format trains which type of model:

Format Field    Model Type
named_entity    named_entity_recognition
pos             part_of_speech_tagging
keyphrases      keyphrase_extraction
relations       relation_extraction
coreference     coreference_resolution
translations    translation
sentiment       sentiment_analysis_binary and/or sentiment_analysis_categorical

Sentence

Each training sample always contains a sentence field. This is the sentence that all the other fields label for training.

For the best model accuracy, all punctuation should be separated by whitespace. The most basic example uses periods and commas:

Alice went to the store . Will she find her favorite meal ?

This rule also applies to apostrophes and quotes:

Alice 's friend opened the door , " Where 'd you go ? " she said .

Punctuation not separated by spaces will not break model training, but it may degrade your models' performance.

Here is an example of the sentence field in the json:

{ "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 ." }

Named Entity Recognition

The named_entity field contains data for training a named entity recognition model (named_entity_recognition).

The format is very similar to the CoNLL convention, except that the sentence is kept separate ("unzipped") from the named entity labels.

Let us look at an example first:

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"]
}

The labeling is based on whitespace in the sentence, so the number of labels equals the number of whitespace-separated tokens in the sentence. In our example, Bill Gates is labeled as a single person and Microsoft is labeled as an organization.

The data contains entities of four types: persons PER, organizations ORG, locations LOC, and miscellaneous MISC. The O label marks tokens that are outside any entity.

Lastly, there are I and B prefixes on the different entity types. The I prefix is used for words inside a named entity. The B prefix is used when two entities of the same type are right next to each other but do not represent the same entity.

This is why, in our example, Bill Gates is counted as a single entity: the two I-PER labels are right next to each other, rather than Bill and Gates being two separate entities.

Combining all these rules together, there are 9 valid entity labels.

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
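As a small sketch (illustrative, not part of the SDK), you can pair each token with its label and check that the counts match and the labels are valid:

VALID_NER_LABELS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"}

sample = {
    "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
    "named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER",
                     "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
}

tokens = sample["sentence"].split()
labels = sample["named_entity"]

# One label per whitespace-separated token, and every label must come from the valid set.
assert len(tokens) == len(labels)
assert all(label in VALID_NER_LABELS for label in labels)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")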

Part Of Speech

The pos field contains data for training part-of-speech tagging (part_of_speech_tagging) models, i.e. models that tag parts of speech.

The pos format is nearly identical to the named_entity format except it uses slightly different labels for training the part of speech task.

For example:

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."]
}

These are all the valid part-of-speech tagging labels available:

['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', '.', 'CD',
'DT', 'VBD', 'IN', 'PRP', 'NNS', 'VBP', 'MD', 'VBN',
'POS', 'JJR', '"', 'RB', ',', 'FW', 'CC', 'WDT', '(',
')', ':', 'PRP$', 'RBR', 'VBG', 'EX', 'WP', 'WRB',
'$', 'RP', 'NNPS', 'SYM', 'RBS', 'UH', 'PDT', "''",
'LS', 'JJS', 'WP$', 'NN|SYM']

For a detailed explanation of POS tagging, please see these Part of speech tagging guidelines.

Keyphrase Extraction

Training samples for keyphrases are simply the keyphrases found in the sentence, listed as text snippets. This field uses a simplified version of the original format for ease of manually labeling data.

Here is an example of how to define keyphrases for our sample.

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"keyphrases": [
"Bill Gates",
"Microsoft"
]
}

Note that the keyphrase text needs to match the text in the sentence exactly. This type of keyphrase model only supports training on text already contained in the sample sentence.

This field trains a keyphrase extraction (keyphrase_extraction) model.
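Because the match must be exact, a quick sanity check like this sketch (illustrative only) can catch labeling mistakes before training:

sample = {
    "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
    "keyphrases": ["Bill Gates", "Microsoft"],
}

# Every keyphrase must appear verbatim in the sentence text.
for phrase in sample["keyphrases"]:
    assert phrase in sample["sentence"], f"keyphrase not found in sentence: {phrase!r}"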

Relation Extraction

The relations field specifies a training format for models that extract relationships from text. Its format is based on the kbp37 dataset (SemEval-2010 Task 8 and MIML-RE) and has 19 possible relationship types:

["no_relation", "org:alternate_names",
"org:city_of_headquarters",
"org:country_of_headquarters", "org:founded",
"org:founded_by", "org:members",
"org:stateorprovince_of_headquarters",
"org:subsidiaries", "org:top_members/employees",
"per:alternate_names", "per:cities_of_residence",
"per:countries_of_residence", "per:country_of_birth",
"per:employee_of", "per:origin", "per:spouse",
"per:stateorprovinces_of_residence", "per:title"]

With that information in hand, let us look at an example:

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"relations": [
{
"obj": {
"entity": "Bill Gates",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Paul Allen",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Bill Gates",
"type": "PER"
},
"relation": "per:employee_of"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}
]
}

This sample provides data for teaching a relation extraction model how to extract two different relationships, founded_by and employee_of.

We can observe this by looking more closely at the last training entry:

    {
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}

Each relationship in the relations field has three main keys: obj, sub, and relation. The obj and sub are further expanded into the entity string (an exact match of text in the sentence) and the entity's type. The type comes from the NER labeling, so it is either PER, ORG, LOC, or MISC.

Here we see the object is Microsoft (an organization), the subject is Paul Allen (a person), and the relationship is per:employee_of. Our sample teaches the model that the original sentence expresses the relation that Paul Allen is an employee of Microsoft.

The relationships are directed and can be thought of as the subject pointing to the object via the relationship (Paul Allen is an employee of Microsoft).

This format specifically trains a relation extraction (relation_extraction) model.
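If you are generating relation data programmatically, each entry is just a small dictionary. The helper below is a hypothetical sketch (not part of the SDK) showing how one entry could be built:

def make_relation(sub_entity, sub_type, obj_entity, obj_type, relation):
    # Hypothetical helper: builds one entry for the "relations" list.
    # Entity strings must match the sentence text exactly; types come from
    # the NER label set (PER, ORG, LOC, MISC).
    return {
        "obj": {"entity": obj_entity, "type": obj_type},
        "sub": {"entity": sub_entity, "type": sub_type},
        "relation": relation,
    }

# "Paul Allen is an employee of Microsoft": subject Paul Allen, object Microsoft.
entry = make_relation("Paul Allen", "PER", "Microsoft", "ORG", "per:employee_of")
print(entry)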

Coreference Resolution

The coreference field is used to provide training data for picking out the same entities across multiple sentences.

The format uses a second sentence, sentence_2, which is compared against the first sentence. All entities from the first sentence are labeled for coreference against entities from the second sentence. Here is an example:

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"coreference": [
{
"sentence_2": "During his career at Microsoft , Gates held the positions of chairman , chief executive officer ( CEO ) , president and chief software architect , while also being the largest individual shareholder until May 2014 .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "his",
"coreferent": true
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "his",
"coreferent": false
}
]
},
{
"sentence_2": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "he",
"coreferent": true
},
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "its",
"coreferent": true
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "its",
"coreferent": false
}
]
}
]
}

In this sample we have two training examples for coreference. The first one tells the model that his in During his career at Microsoft... is actually the entity Bill Gates. The second example trains the model that he in Gates said he personally... is Bill Gates and that its in that the company produced in its first five years refers to Microsoft.

Note that the words are greedily matched and that the format is optimized for training on sentences that follow one another.

It is possible to have the second sentence be the same as the first, but the model performance may degrade.

This format specifically trains a coreference resolution (coreference_resolution) model.
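If you are generating coreference data programmatically, one approach (a hypothetical sketch, not part of the SDK) is to cross every entity from the first sentence with each candidate mention in the second sentence and flag only the true coreferences:

def build_references(sentence_1_entities, sentence_2_mentions, true_pairs):
    # Hypothetical helper: pair every first-sentence entity with every
    # second-sentence mention, marking only the pairs known to corefer.
    references = []
    for entity in sentence_1_entities:
        for mention in sentence_2_mentions:
            references.append({
                "sentence_1_entity": entity,
                "sentence_2_entity": mention,
                "coreferent": (entity, mention) in true_pairs,
            })
    return references

# Reproduces the second reference list from the example above.
refs = build_references(
    ["Bill Gates", "Paul Allen", "Microsoft", "April"],
    ["he", "its"],
    {("Bill Gates", "he"), ("Microsoft", "its")},
)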

Translation

The translations field contains data for training a single model to translate multiple languages into English. The format is simply a translations key that maps individual language keys (ar, es, etc.) to the translation of the sentence in that language.

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。"
}
}

Here we are providing training data for a model to learn four languages: Arabic, Spanish, French, and Japanese.

The model can learn most languages, including Arabic, Chinese, Dutch, French, German, Russian and Spanish.
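If you already have parallel text, translation samples can be assembled directly from it. Here is a minimal sketch (the parallel sentences are illustrative):

# Illustrative parallel data: English sentences paired with their Spanish translations.
english_sentences = ["This is an example sentence", "This is another example sentence"]
spanish_sentences = ["Esta es una oración de ejemplo", "Esta es otra oración de ejemplo"]

# Each pair becomes one training sample with a single language key.
samples = [
    {"sentence": en, "translations": {"es": es}}
    for en, es in zip(english_sentences, spanish_sentences)
]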

Sentiment Analysis

With the sentiment field you can train either a binary sentiment model or a categorical sentiment model. A binary model only labels text as positive or negative, while a categorical model provides five categories of sentiment.

The format is simply a sentiment key containing a binary key, a categorical key, or both, each with its corresponding sentiment value.

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"sentiment": {
"binary": 1,
"categorical": 3
}
}

For binary models there are only two labels: 1-positive, 0-negative.

For categorical models there are five labels: 5-very positive, 4-positive, 3-neutral, 2-negative, 1-very negative.

This example says that our original sentence is positive (1) in the binary model, but neutral (3) in the categorical model.
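A tiny sketch that restates these label meanings in code (the mapping dictionaries simply mirror the definitions above):

BINARY_LABELS = {1: "positive", 0: "negative"}
CATEGORICAL_LABELS = {5: "very positive", 4: "positive", 3: "neutral", 2: "negative", 1: "very negative"}

sample = {
    "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
    "sentiment": {"binary": 1, "categorical": 3},
}

print(BINARY_LABELS[sample["sentiment"]["binary"]])            # positive
print(CATEGORICAL_LABELS[sample["sentiment"]["categorical"]])  # neutral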

Dataset Example

Here is a dataset example to show all the different tasks annotated for a single sentence:

{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。"
},
"sentiment": {
"binary": 1,
"categorical": 2
},
"keyphrases": [
"Gates"
],
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."],
"relations": [
{
"obj": {
"entity": "Bill Gates",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Paul Allen",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Bill Gates",
"type": "PER"
},
"relation": "per:employee_of"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}
],
"coreference": [
{
"sentence_2": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "his",
"coreferent": true
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "his",
"coreferent": false
}
]
},
{
"sentence_2": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "he",
"coreferent": true
},
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "its",
"coreferent": true
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "its",
"coreferent": false
}
]
}
]
}

Combining multiple annotated sentences like this, we can form our jsonl training file:

{ "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",  "named_entity": {...}, "pos": {...}, ...}
{ "sentence": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .", "named_entity": {...}, "pos": {...}, ...}