SDK Dataset Structure
This section covers the details and usage of our SDK dataset structure.
The SDK dataset structure is a new format that can be used to train models for any of the services offered by GraphGrid NLP.
At its core, the GraphGrid SDK dataset structure is an aggregation of multiple task training datasets: it collects training data for all the distinct NLP tasks CDP supports into a single dataset. By merging the different task training formats, the structure provides a more efficient process than traditional dataset formats.
The dataset structure was designed to be simple. We have significantly cut down the metadata (character indices, text lengths, etc.) that individual training task formats require. This simple formatting makes it easy to manually label data for training several NLP tasks. Since the format is json, samples can also be programmatically manipulated and generated from other data sources.
The SDK dataset structure uses a json lines (jsonl) file format to store the training data for multiple model types. A single line can contain training information for as many models as desired, whether that be all possible models or only a single model.
You are not required to use the CDP's SDK dataset format for model training; traditional dataset formats can be used instead. To do so, upload them manually to MinIO and then specify their location in the TrainRequestBody when kicking off model training.
For information about traditional task formats please see the NLP model training page.
Dataset File
The dataset file itself is a json lines (jsonl) file. The jsonl format is also known as ndjson (newline delimited json).
Each line in the jsonl is its own training sample.
Below is an example jsonl of two training samples:
{ "sentence": "This is an example sentence", "named_entity": {...}, "pos": {...}}
{ "sentence": "This is another example sentence", "named_entity": {...}, "pos": {...}}
Note that each sample is on its own line; this is required by the jsonl format.
Each sample is independent of other samples. Coreference data may share sentences with other samples, but that does not affect how training occurs for any individual sample.
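Because each line is a standalone JSON object, a jsonl dataset can be parsed one line at a time. Here is a minimal Python sketch (the sample contents are made up for illustration):

```python
import json

# Two hypothetical lines of a jsonl dataset, each an independent sample.
jsonl_text = (
    '{"sentence": "Alice went to the store .", "keyphrases": ["store"]}\n'
    '{"sentence": "Will she find her favorite meal ?", "keyphrases": ["favorite meal"]}'
)

# Each non-empty line parses independently into one training sample.
samples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```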
Dataset line sample
In the examples below, we expand the json for an easier read, but keep in mind each sample is stored on a single line within the jsonl.
Each training sample consists of the raw text ("sentence") and some labeled data.
Let us look at a few examples.
Here is a sample that can be used to train a single translation model with multiple languages.
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"de": "Bill Gates und sein verstorbener Jugendfreund Paul Allen gründeten Microsoft am 4. April 1975.",
"el": "Ο Μπιλ Γκέιτς και ο αείμνηστος παιδικός του φίλος Πολ Άλεν ίδρυσαν τη Microsoft στις 4 Απριλίου 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。",
"ko": "1975년 4월 4일 Bill Gates와 그의 늦은 소꿉친구 Paul Allen은 Microsoft를 설립했습니다.",
"pt": "Bill Gates e seu amigo de infância Paul Allen fundaram a Microsoft em 4 de abril de 1975 .",
"ru": "Билл Гейтс и его покойный друг детства Пол Аллен основали Microsoft 4 апреля 1975 года.",
"tr": "Bill Gates ve geç çocukluk arkadaşı Paul Allen, Microsoft'u 4 Nisan 1975'te kurdu.",
"zh": "比尔盖茨和他已故的儿时好友保罗艾伦于 1975 年 4 月 4 日创立了微软。"
}
}
This is another sample, this time providing training data for named entity recognition and part of speech models.
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."]
}
These individual fields (translations, named_entity, pos) are for training different task models.
Sample Training Formats
This next portion breaks down the organization of the SDK dataset and the different training formats available.
Here is a quick reference table for which format trains which type of model:
Format Field | Model Type |
---|---|
named_entity | named_entity_recognition |
pos | part_of_speech_tagging |
keyphrases | keyphrase_extraction |
relations | relation_extraction |
coreference | coreference_resolution |
translations | translation |
sentiment | sentiment_analysis_binary and/or sentiment_analysis_categorical |
Sentence
Each training sample always contains a sentence field. This is the sentence that all the other fields label for training.
For the best model accuracy all punctuation should be separated by whitespace. The most basic example uses periods and commas:
Alice went to the store . Will she find her favorite meal ?
This rule also applies to apostrophes and quotes:
Alice 's friend opened the door , " Where 'd you go ? " she said .
Punctuation not separated by spaces will not break model training, but it may degrade your models' performance.
Here's an example of it in the json:
{ "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 ." }
Named Entity Recognition
The named_entity field contains data for training a named entity recognition model (named_entity_recognition).
The format closely follows the CoNLL convention, except that the sentence is unzipped from the named entity labels.
Let us look at an example first:
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"]
}
The labeling is based on the whitespace in the sentence, so the number of labels equals the number of whitespace-separated tokens in the sentence.
In our example Bill Gates is labeled as a single person and Microsoft is labeled as an organization.
The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC), and miscellaneous (MISC). The O label marks tokens outside of any entity.
Lastly, the entity types carry I and B prefixes. The I prefix is used for words inside of a named entity. The B prefix is used when two entities of the same type are right next to each other but do not represent the same entity. This is why, in our example, Bill Gates counts as a single entity: the two I-PER labels are adjacent, rather than Bill and Gates being two separate entities.
Combining all these rules together, there are 9 valid entity labels.
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
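These two rules, one label per token and only the nine valid labels, are easy to check programmatically. The helper below is a hypothetical validator, not part of the SDK:

```python
VALID_NER_LABELS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                    "B-LOC", "I-LOC", "B-MISC", "I-MISC"}

def validate_named_entity(sample):
    # Hypothetical check (not part of the SDK): one label per
    # whitespace-separated token, and every label must be valid.
    tokens = sample["sentence"].split()
    labels = sample["named_entity"]
    if len(labels) != len(tokens):
        raise ValueError("label count must match token count")
    bad = [label for label in labels if label not in VALID_NER_LABELS]
    if bad:
        raise ValueError(f"invalid labels: {bad}")

validate_named_entity({
    "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
    "named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER",
                     "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
})
```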
Part Of Speech
The pos field contains data for training part-of-speech tagging (part_of_speech_tagging) models. The pos format is nearly identical to the named_entity format, except it uses a different set of labels for the part-of-speech task. For example:
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."]
}
These are all the valid part-of-speech tagging labels available:
['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', '.', 'CD',
'DT', 'VBD', 'IN', 'PRP', 'NNS', 'VBP', 'MD', 'VBN',
'POS', 'JJR', '"', 'RB', ',', 'FW', 'CC', 'WDT', '(',
')', ':', 'PRP$', 'RBR', 'VBG', 'EX', 'WP', 'WRB',
'$', 'RP', 'NNPS', 'SYM', 'RBS', 'UH', 'PDT', "''",
'LS', 'JJS', 'WP$', 'NN|SYM']
For a detailed explanation of POS tagging please see these Part of speech tagging guidelines.
Keyphrase Extraction
Training samples for keyphrases are simply the keyphrases found in the sentence. This field uses a simplified version of the original format for ease of manually labeling data. The keyphrases are text snippets taken from the sentence.
Here is an example of how to define keyphrases for our sample.
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"keyphrases": [
"Bill Gates",
"Microsoft"
]
}
Note that the keyphrase text needs to match the text in the sentence exactly. This type of keyphrase model only supports training on text already contained in the sample sentence.
This field trains a keyphrase extraction (keyphrase_extraction) model.
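The exact-match requirement can be enforced before training with a small check. This helper is a hypothetical sketch, not part of the SDK:

```python
def validate_keyphrases(sample):
    # Hypothetical check (not part of the SDK): every keyphrase must
    # appear verbatim in the sentence, since the model can only learn
    # to extract text that is already present in the sample.
    for phrase in sample.get("keyphrases", []):
        if phrase not in sample["sentence"]:
            raise ValueError(f"keyphrase not found in sentence: {phrase!r}")

validate_keyphrases({
    "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
    "keyphrases": ["Bill Gates", "Microsoft"],
})
```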
Relation Extraction
The relations field specifies a training format for models that extract relationships from text. Its format is based on the kbp37 (SemEval-2010 Task 8 and MIML-RE) dataset and has 19 possible relationship types:
["no_relation", "org:alternate_names",
"org:city_of_headquarters",
"org:country_of_headquarters", "org:founded",
"org:founded_by", "org:members",
"org:stateorprovince_of_headquarters",
"org:subsidiaries", "org:top_members/employees",
"per:alternate_names", "per:cities_of_residence",
"per:countries_of_residence", "per:country_of_birth",
"per:employee_of", "per:origin", "per:spouse",
"per:stateorprovinces_of_residence", "per:title"]
With that information in hand let us look at an example:
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"relations": [
{
"obj": {
"entity": "Bill Gates",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Paul Allen",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Bill Gates",
"type": "PER"
},
"relation": "per:employee_of"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}
]
}
This sample provides data for teaching a relation extraction model how to extract two different relationships, founded_by and employee_of.
We can observe this by looking more closely at the last training entry:
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}
Each relationship in the relations field has three main keys: obj, sub, and relation. The obj and sub are further expanded into the entity string (an exact match against the sentence) and the type of that entity. The type comes from the NER labeling, so it is either PER, ORG, LOC, or MISC.
Here we see that the object is Microsoft (an organization), the subject is Paul Allen (a person), and the relationship is per:employee_of: our sample teaches the model that the original sentence expresses the relation that Paul Allen is an employee of Microsoft. The relationships are directed and can be thought of as the subject pointing to the object with the relationship (Paul Allen is an employee of Microsoft).
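Because every entry shares the same obj/sub/relation shape, entries can be built programmatically. The constructor below is a hypothetical helper, not part of the SDK:

```python
def make_relation(sub_entity, sub_type, obj_entity, obj_type, relation):
    # Hypothetical constructor (not part of the SDK) for one entry in
    # the relations list; direction reads subject -> relation -> object.
    return {
        "obj": {"entity": obj_entity, "type": obj_type},
        "sub": {"entity": sub_entity, "type": sub_type},
        "relation": relation,
    }

# Paul Allen is an employee of Microsoft.
entry = make_relation("Paul Allen", "PER", "Microsoft", "ORG", "per:employee_of")
```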
This format specifically trains a relation extraction (relation_extraction) model.
Coreference Resolution
The coreference field is used to provide training data for picking out the same entities across multiple sentences. The format uses a second sentence, sentence_2, which is compared against the first sentence. All entities from the first sentence are labeled for coreference against entities from the second sentence.
Here is an example:
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"coreference": [
{
"sentence_2": "During his career at Microsoft , Gates held the positions of chairman , chief executive officer ( CEO ) , president and chief software architect , while also being the largest individual shareholder until May 2014 .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "his",
"coreferent": true
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "his",
"coreferent": false
}
]
},
{
"sentence_2": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "he",
"coreferent": true
},
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "its",
"coreferent": true
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "its",
"coreferent": false
}
]
}
]
}
In this sample we have two training examples for coreference. The first tells the model that his in During his career at Microsoft... is actually the entity Bill Gates. The second trains the model that he in Gates said he personally... is Bill Gates, and that its in that the company produced in its first five years refers to Microsoft.
Note that the words are greedily matched and that the format is optimized to train on sentences that follow one another. It is possible to have the second sentence be the same as the first, but model performance may degrade.
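Since each reference pairs one word from the second sentence against every first-sentence entity, the full reference list can be generated from a single known antecedent. The helper below is a hypothetical sketch, not part of the SDK:

```python
def coreference_entry(sentence_2, sentence_1_entities, sentence_2_entity, antecedent):
    # Hypothetical helper (not part of the SDK): label one word from
    # sentence_2 against every entity in the first sentence, marking
    # only the true antecedent as coreferent.
    return {
        "sentence_2": sentence_2,
        "references": [
            {
                "sentence_1_entity": entity,
                "sentence_2_entity": sentence_2_entity,
                "coreferent": entity == antecedent,
            }
            for entity in sentence_1_entities
        ],
    }

entry = coreference_entry(
    "During his career at Microsoft , Gates held the positions of chairman .",
    ["Bill Gates", "Paul Allen", "Microsoft", "April"],
    "his",
    "Bill Gates",
)
```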
This format specifically trains a coreference resolution (coreference_resolution) model.
Translation
The translations field contains data for training a single model to translate multiple languages into English. The format is simply the translations key, which maps individual language keys (ar, es, etc.) to the translated sentence as the value.
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。"
}
}
Here we're providing training data for a model to learn the four languages Arabic, Spanish, French, and Japanese.
The model can learn most languages, including Arabic, Chinese, Dutch, French, German, Russian and Spanish.
Sentiment Analysis
With the sentiment field you can train either a binary sentiment model or a categorical sentiment model. A binary model only labels text as positive or negative, while a categorical model provides five total categories of sentiment. The format uses a sentiment key containing a binary key, a categorical key, or both, each mapped to its corresponding sentiment value.
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"sentiment": {
"binary": 1,
"categorical": 3
}
}
For binary models there are only two labels: 1 (positive) and 0 (negative). For categorical models there are five labels: 5 (very positive), 4 (positive), 3 (neutral), 2 (negative), and 1 (very negative).
This example says that our original sentence is positive (1) in the binary model, but neutral (3) in the categorical model.
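The label ranges can be checked before training with a small validator. This helper is a hypothetical sketch, not part of the SDK:

```python
def validate_sentiment(sample):
    # Hypothetical check (not part of the SDK) of the sentiment ranges:
    # binary uses 1 (positive) / 0 (negative); categorical uses 1-5.
    sentiment = sample["sentiment"]
    if "binary" in sentiment and sentiment["binary"] not in (0, 1):
        raise ValueError("binary sentiment must be 0 or 1")
    if "categorical" in sentiment and sentiment["categorical"] not in (1, 2, 3, 4, 5):
        raise ValueError("categorical sentiment must be 1 through 5")

validate_sentiment({"sentence": "It was fine .", "sentiment": {"binary": 1, "categorical": 3}})
```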
Dataset Example
Here is a dataset example to show all the different tasks annotated for a single sentence:
{
"sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"translations": {
"ar": "أسس بيل جيتس وصديق طفولته الراحل بول ألين شركة مايكروسوفت في 4 أبريل 1975.",
"es": "Bill Gates y su difunto amigo de la infancia Paul Allen fundaron Microsoft el 4 de abril de 1975 .",
"fr": "Bill Gates et son ami d' enfance Paul Allen ont fondé Microsoft le 4 avril 1975 .",
"ja": "ビルゲイツと彼の幼なじみのポールアレンは、1975年4月4日にマイクロソフトを設立しました。"
},
"sentiment": {
"binary": 1,
"categorical": 2
},
"keyphrases": [
"Gates"
],
"named_entity": ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-PER", "I-PER", "O", "I-ORG", "O", "O", "O", "O", "O", "O"],
"pos": ["NNP", "NNP", "CC", "PRP$", "JJ", "JJ", "NN", "NNP", "NNP", "VBD", "NNP", "IN", "NNP", "CD", ",", "CD", "."],
"relations": [
{
"obj": {
"entity": "Bill Gates",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Paul Allen",
"type": "PER"
},
"sub": {
"entity": "Microsoft",
"type": "ORG"
},
"relation": "org:founded_by"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Bill Gates",
"type": "PER"
},
"relation": "per:employee_of"
},
{
"obj": {
"entity": "Microsoft",
"type": "ORG"
},
"sub": {
"entity": "Paul Allen",
"type": "PER"
},
"relation": "per:employee_of"
}
],
"coreference": [
{
"sentence_2": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "his",
"coreferent": true
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "his",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "his",
"coreferent": false
}
]
},
{
"sentence_2": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .",
"references": [
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "he",
"coreferent": true
},
{
"sentence_1_entity": "Bill Gates",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Paul Allen",
"sentence_2_entity": "its",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "Microsoft",
"sentence_2_entity": "its",
"coreferent": true
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "he",
"coreferent": false
},
{
"sentence_1_entity": "April",
"sentence_2_entity": "its",
"coreferent": false
}
]
}
]
}
Combining multiple annotated sentences like this, we can form our jsonl training file!
{ "sentence": "Bill Gates and his late childhood friend Paul Allen founded Microsoft on April 4 , 1975 .", "named_entity": {...}, "pos": {...}, ...}
{ "sentence": "Gates said he personally reviewed and often rewrote every line of code that the company produced in its first five years .", "named_entity": {...}, "pos": {...}, ...}