Supported Tasks

Each task has an associated Dataset Format, Query Format and Prediction Format.

A task’s Dataset Format specifies the format of the dataset files. Datasets are prepared by Application Developers when they create Train Jobs and received by Model Developers when they define singa_auto.model.BaseModel.train and singa_auto.model.BaseModel.evaluate.

A task’s Query Format specifies the format of queries when they are passed to models. Queries are generated by Application Users when they send queries to Inference Jobs and received by Model Developers when they define singa_auto.model.BaseModel.predict.

A task’s Prediction Format specifies the format of predictions made by models. Predictions are generated by Model Developers when they define singa_auto.model.BaseModel.predict and received by Application Users as predictions to their queries sent to Inference Jobs.

IMAGE_SEGMENTATION

Dataset Format

dataset-type: SEGMENTATION_IMAGES

note

We use the same annotation format as the Pascal VOC segmentation dataset.

  • An image and its corresponding mask should have the same width and height, while the number of channels can differ. For example, an image can have three channels representing RGB values while its mask has only one grayscale channel.
  • In the mask image, each pixel’s grayscale value represents its label. A specific value may mark a pixel as meaningless (the same definition as ignore_label in some loss functions), such as paddings or borders.
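
As a sketch of these constraints (not part of SINGA-Auto's API), the following check models the mask as a 2-D list of grayscale values and assumes 255 is the "meaningless" ignore value; the function name and ignore value are illustrative choices:

```python
# Hypothetical helper: verify an image/mask pair for IMAGE_SEGMENTATION.
# 255 is assumed here to be the ignore value (cf. ignore_label in some losses).
IGNORE_VALUE = 255

def check_pair(image_size, mask, ignore_value=IGNORE_VALUE):
    """Return the set of labels in `mask`, after checking that the mask's
    width/height match the image's (channel counts may differ)."""
    mask_h, mask_w = len(mask), len(mask[0])
    if (mask_w, mask_h) != image_size:
        raise ValueError("image and mask must share width and height")
    return {v for row in mask for v in row if v != ignore_value}

# A 3x2 mask with labels 0 and 1, plus one ignored border column.
mask = [[0, 0, 255],
        [1, 1, 255]]
print(check_pair((3, 2), mask))  # {0, 1}
```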

Query Format

An image file in the following common formats: .jpg, .jpeg, .png, .gif, .bmp, or .tiff.

Prediction Format

A W x H single-channel mask image file with each pixel’s grayscale value representing its label.

IMAGE_CLASSIFICATION

Dataset Format

dataset-type: IMAGE_FILES

  • There is only 1 tag column of class, corresponding to the class of the image as an integer from 0 to k - 1, where k is the total no. of classes.
  • The train & validation datasets’ images should have the same dimensions W x H and the same total no. of classes.

An example:

path,class
image-0-of-class-0.png,0
image-1-of-class-0.png,0
...
image-0-of-class-1.png,1
...
image-99-of-class-9.png,9

note

You can refer to and run ./examples/datasets/image_files/load_folder_format.py (https://github.com/nusdbsystem/singa-auto/tree/master/examples/datasets/load_folder_format.py) for converting directories of images to SINGA-Auto’s IMAGE_CLASSIFICATION format.

Query Format

An image file in the following common formats: .jpg, .jpeg, .png, .gif, .bmp, or .tiff.

Prediction Format

A jsonified string representing the classification result. There are no strict requirements on the format of the output string, which is entirely determined by the model itself: it may directly output the label or class name of the classification result, a one-hot encoding, or the probability of each class.
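
For illustration, one possible (model-chosen, not mandated) output is a JSON string carrying a per-class probability list plus the argmax label; the field names and probabilities below are made up:

```python
import json

# A minimal sketch of one way a model might jsonify its classification
# result: probabilities for each class, plus the most likely label.
def to_prediction(probs):
    return json.dumps({"probabilities": probs,
                       "label": max(range(len(probs)), key=probs.__getitem__)})

print(to_prediction([0.1, 0.7, 0.2]))
# {"probabilities": [0.1, 0.7, 0.2], "label": 1}
```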

OBJECT_DETECTION

Dataset Format

dataset-type: DETECTION_DATASET

It is recommended to follow the YOLO dataset format.

  • For the folder hierarchy, two folders, ‘images’ and ‘labels’, should be prepared. The ‘images’ folder contains PIL-loadable images, and the corresponding .txt label files should be placed in the ‘labels’ folder, with the same basenames as the images.
  • The label file format is as follows, where object-id is the index of the object. The following four numbers should be normalized to the range 0 to 1 by dividing by the width and height of the image: center_x center_y are the central coordinates of the bounding box, and width height are its side lengths. An empty label file (a negative sample) is allowed, meaning there are no objects to detect in the image.
object-id center_x center_y width height
...
  • In addition, train.txt and valid.txt can be provided to indicate the images used for training/validation, containing only the paths of the image files. A class.names file contains the category names; their line numbers are the object-ids.
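
The normalization above can be undone when reading labels back. The following sketch (assuming whitespace-separated fields, one box per line) parses a label line and converts the normalized box to pixel corner coordinates:

```python
# Parse one YOLO-style label line and denormalize the box to pixel
# (x_min, y_min, x_max, y_max) corners, given the image's width/height.
def parse_label_line(line, img_w, img_h):
    object_id, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(cx) * img_w, float(cy) * img_h,
                    float(w) * img_w, float(h) * img_h)
    return int(object_id), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(parse_label_line("0 0.5 0.5 0.25 0.5", 640, 480))
# (0, (240.0, 120.0, 400.0, 360.0))
```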

Query Format

An image file in the following common formats: .jpg, .jpeg, .png, .gif, .bmp, or .tiff.

Prediction Format

A jsonified dict (string) indicating the bounding boxes and their corresponding classes. The keys and values are strictly required to be formatted as follows:

{'explanations':
    {'box_info': [{'coord': (224, 275, 281, 357),
                   'class_name': 'person'},
                  {'coord': (64, 263, 150, 368),
                   'class_name': 'person'}]
    }
}
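
A minimal sketch of assembling this dict from (coord, class_name) pairs and serializing it with json.dumps (note that JSON renders tuples as lists):

```python
import json

# Build the strictly-required prediction structure for OBJECT_DETECTION.
def make_prediction(boxes):
    return json.dumps({"explanations": {
        "box_info": [{"coord": coord, "class_name": name}
                     for coord, name in boxes]}})

pred = make_prediction([((224, 275, 281, 357), "person"),
                        ((64, 263, 150, 368), "person")])
print(pred)
```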

GENERAL_TASK

Dataset Format

dataset-type: GENERAL_FILES

  • For the general task, as its name states, a task (or model) from any domain can be included in this category, such as image processing, NLP, speech, or video.
  • There are no requirements on the form of the dataset, as long as it can be read into memory as a file. However, the model developer has to know in advance how to handle the read-in file.

Query Format

A file is required as the query. As long as the file corresponds to the input required by the model, it can be in any format.

Prediction Format

The same as the input query: the prediction returns the output file as set in the model.

POS_TAGGING

Dataset Format

dataset-type: CORPUS

  • Sentences are delimited by \n tokens.
  • There is only 1 tag column of tag, corresponding to the POS tag of the token as an integer from 0 to k - 1, where k is the total no. of tags.

An example:

token       tag
Two         3
leading     2
...
line-item   1
veto        5
.           4
\n          0
Professors  6
Philip      6
...
previous    1
presidents  8
.           4
\n          0

Query Format

An array of strings representing a sentence as a list of tokens in that sentence.

Prediction Format

An array of integers representing the predicted tags for the tokens, in sequence, for the sentence.

QUESTION_ANSWERING

COVID19 Task Dataset Format

dataset-type: QUESTION_ANSWERING_COVID19

The dataset can be used to fine-tune the SQuAD pre-trained BERT model.

  • The dataset is a zip of folders containing JSON files. JSON files under different levels of folders will automatically be read all together.

Dataset structure example:

/DATASET_NAME.zip
│
├──FOLDER_NAME_1                                              # first level folder
│  └──FOLDER_NAME_2                                           # second level folder, not necessarily included
│      └──FOLDER_NAME_3                                       # third level folder, not necessarily included
│           ├── 003d2e515e1aaf06f0052769953e8.json            # JSON file names are random combinations of letters and/or numbers
│           ├── 00a407540a8bdd.json
│           ...
│
├──FOLDER_NAME_4                                              # first level folder
│  ├── 0015023cc06b5362d332b3.json
│  ├── 001b4a31684c8fc6e2cfbb70304354978317c429.json
│  ...
...
│
└──metadata.csv                                               # if additional information is provided for above JSON files, user can add a metadata.csv
  • Each JSON file includes body_text, a list of paragraphs of the full body that can be used for question answering. body_text can contain multiple entries; only the “text” field of each entry will be read.
  1. For JSON files extracted from papers, there is one JSON file per paper. If additional information for the papers is given in metadata.csv, each JSON file is linked to its metadata.csv entry via their shared sha value.
  2. For datasets that carry their additional information as a paragraph, the body_text > text entry is in <question> + <\n> + <information paragraph> string format. In this circumstance, neither a sha value nor a metadata.csv file is needed.

Sample of JSON file:

# JSON file 1                           # for example, a JSON file extracted from one paper
{
    "sha": <str>,                       # 40-character sha1 of the PDF; only required for JSON extracted from papers. It will be read into the model as a string

    "body_text": [                      # list of paragraphs in full body; this is a must-have
        {
            "text": <str>,              # text body of the first entry, i.e. one paragraph of this paper; a must-have. It will be read into the model as a string
        }
        ...                             # other 'text' blocks, i.e. paragraph blocks the same as above; all 'text' strings will then be handled and processed into a pandas DataFrame
    ],
}

# ---------------------------------------------------------------------------------------------------------------------- #

# JSON file 2                           # for example, a JSON file extracted from SQuAD2.0
{
    "body_text": [                      # list of paragraphs in full body; this is a must-have
        {
            "text": 'What are the treatments for Age-related Macular Degeneration ?\n If You Have Advanced AMD Once dry AMD reaches the advanced stage, no form of treatment can prevent vision loss...',
                                        # text body of the first entry; this is a must-have

        },
        ...                             # other 'text' blocks, i.e. paragraph blocks that look the same as above
    ],
}
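
The reading rule above (only the "text" field of each body_text entry is kept) can be sketched with the standard library; the function name is illustrative, and in SINGA-Auto the collected strings would then be processed further, e.g. into a pandas DataFrame:

```python
import json

# Extract only the "text" field of each body_text entry from one JSON
# file; any other fields per entry are ignored.
def read_body_text(raw_json):
    doc = json.loads(raw_json)
    return [entry["text"] for entry in doc.get("body_text", [])]

raw = '{"sha": "abc", "body_text": [{"text": "paragraph one"}, {"text": "paragraph two"}]}'
print(read_body_text(raw))  # ['paragraph one', 'paragraph two']
```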
  • metadata.csv is not strictly required. Users can provide additional information with it, e.g. authors, title, journal and publish_time, mapped to each JSON file by its sha value. cord_uid serves as the unique identity of each entry. For time-sensitive entries, publish_time is advised to be in Date format; for other values, General format is recommended.

Sample of metadata.csv entry:

Column Name        Column Value
cord_uid zjufx4fo
sha b2897e1277f56641193a6db73825f707eed3e4c9
source_x PMC
title Sequence requirements for RNA strand transfer during nidovirus …
doi 10.1093/emboj/20.24.7220
pmcid PMC125340
pubmed_id 11742998
license unk
abstract Nidovirus subgenomic mRNAs contain a leader sequence derived …
publish_time 2001-12-17

Query Format

note

  • The pretrained model should first be fine-tuned with a dataset to adapt to particular question domains when necessary.
  • Otherwise, following the question, the input should contain relevant information (a context paragraph or candidate answers, or both), whether or not it addresses the question.
  • Optionally, when relevant information is provided in the query as an additional paragraph, the question always comes first, followed by the additional paragraph. We use a \n separator between the question and its paragraph in the input.

The query is in JSON format, with one or more questions in the questions field. The model will only read the questions field.

{
 'questions': ['Is individual's age considered a potential risk factor of COVID19? \n  People of all ages can be infected by the new coronavirus (2019-nCoV). Older people, and people with pre-existing medical conditions (such as asthma, diabetes, heart disease) appear to be more vulnerable to becoming severely ill with the virus. WHO advises people of all ages to take steps to protect themselves from the virus, for example by following good hand hygiene and good respiratory hygiene.',
               # query string can include optional context which follows the question with `\n` syntax
               'Is COVID-19 associated with cardiomyopathy and cardiac arrest?'],     # will be read as a list of strings by the model, and each question will be extracted as a string to go through the question answering stage recursively
               ...                                                                    # questions in string format
 ...                                                                                  # other fields; fields other than 'questions' won't be read into the model
}
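
A sketch of composing such a query in client code (the helper name is hypothetical): the optional context follows the question after a \n separator, and everything goes into the questions field.

```python
import json

# Build a QUESTION_ANSWERING query: question first, then an optional
# context paragraph after a '\n' separator.
def build_query(question, context=None):
    q = question if context is None else question + "\n" + context
    return json.dumps({"questions": [q]})

print(build_query("Is individual's age considered a potential risk factor of COVID19?",
                  "People of all ages can be infected by the new coronavirus..."))
```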

Prediction Format

The output is in JSON format.

['Given a higher mortality rate for older cases, in one study, li et al showed that more than 50% of early patients with covid-19 in wuhan were more than 60 years old',
 'cardiac involvement has been reported in patients with covid-19, which may be reflected by ecg changes.'
 ...
 ]   # the output is a list of strings

MedQuAD Task Dataset Format

dataset-type: QUESTION_ANSWERING_MEDQUAD

Dataset structure example:

/MedQuAD.zip
│
├──FOLDER_NAME_1                                              # first level folder
│  └──FOLDER_NAME_2                                           # second level folder, not necessarily included
│      └──FOLDER_NAME_3                                       # third level folder, not necessarily included
│           ├── 003d2e515e1aaf0052769953e8.xml                # xml file names are random combinations of letters and/or numbers
│           ├── 00a40758bdd.xml
│           ...
│
├──FOLDER_NAME_4                                              # first level folder
│  ├── 0015023cc06b5332b3.xml
│  ├── 001b4a31684c8fc6e2cfbb70304c429.xml
│  ...
...

note

  • For the following .xml sample, the model only takes the Question and Answer fields into the question answering processing.
  • Each .xml file contains multiple <QAPair> blocks. Each <QAPair> contains one question and its answer.

Sample .xml file:

<?xml version="1.0" encoding="UTF-8"?>
<Document>
...
<QAPairs>
 <QAPair pid="1">                                                           # pair #1
   <Question qid="000001-1"> A question here ... </Question>                # question #1, will be read as string by model
   <Answer> An answer here ... </Answer>                                    # answer of question #1, will be read as string by model
 </QAPair>
 ...                                                                        # multiple subsequent <QAPair> blocks; each Question and its Answer will be combined into one string by the model, and the QAPair strings are then processed into a pandas DataFrame
</QAPairs>
</Document>
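
Pulling the (Question, Answer) pairs out of such a document can be sketched with the standard library's xml.etree.ElementTree; the function name is illustrative:

```python
import xml.etree.ElementTree as ET

# Collect (Question, Answer) text pairs from a MedQuAD-style document.
def read_qa_pairs(xml_text):
    root = ET.fromstring(xml_text)
    pairs = []
    for qapair in root.iter("QAPair"):
        question = qapair.findtext("Question", default="").strip()
        answer = qapair.findtext("Answer", default="").strip()
        pairs.append((question, answer))
    return pairs

sample = """<Document><QAPairs>
  <QAPair pid="1">
    <Question qid="000001-1">A question here ...</Question>
    <Answer>An answer here ...</Answer>
  </QAPair>
</QAPairs></Document>"""
print(read_qa_pairs(sample))  # [('A question here ...', 'An answer here ...')]
```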

Query Format

note

  • The pretrained model should first be fine-tuned with a dataset to adapt to particular question domains when necessary.
  • Otherwise, following the question, the input should contain relevant information (a context paragraph or candidate answers, or both), whether or not it addresses the question.
  • Optionally, when relevant information is provided in the query as an additional paragraph, the question always comes first, followed by the additional paragraph. We use a \n separator between the question and its paragraph in the input.

The query is in JSON format, with one or more questions in the questions field. The model will only read the questions field.

{
 'questions': ['Who is at risk for Adult Acute Lymphoblastic Leukemia?',
              'What are the treatments for Adult Acute Lymphoblastic Leukemia ?'],     # will be read as a list of strings by the model, and each question will be extracted as a string to go through the question answering stage recursively
              ...                                                                      # questions in string format
 ...                                                                                   # other fields; fields other than 'questions' won't be read into the model
}

Prediction Format

The output is in JSON format.

{'answers':['Past treatment with chemotherapy or radiation therapy. Having certain genetic disorders.',    # the output 'answers' field is a list of strings
            'Chemotherapy. Radiation therapy. Chemotherapy with stem cell transplant. Targeted therapy.'
            ...
            ]}

SPEECH_RECOGNITION

Speech recognition for the English language.

Dataset Type

dataset-type: AUDIO_FILES

audios.csv should be a CSV file with 3 columns: wav_filename, wav_filesize and transcript.

For each row,

wav_filename should be a file path to a .wav audio file within the archive, relative to the root of the directory. Each audio file’s sample rate must be 16 kHz.

wav_filesize should be an integer representing the size of the .wav audio file, in number of bytes.

transcript should be a string of the true transcript for the audio file. Transcripts should only contain the following characters: the lowercase letters a to z, the space character, and the apostrophe (').

An example of audios.csv follows:

wav_filename,wav_filesize,transcript
6930-81414-0000.wav,412684,audio transcript one
6930-81414-0001.wav,559564,audio transcript two
...
672-122797-0005.wav,104364,audio transcript one thousand
...
1995-1837-0001.wav,279404,audio transcript three thousand
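
As a sketch of preparing such a manifest (the helper names are hypothetical, not SINGA-Auto tooling): transcripts are checked against the allowed character set, and wav_filesize is simply the file's size in bytes, which os.path.getsize would supply for real files.

```python
import csv
import io
import string

# Characters allowed in transcripts: a-z, space, apostrophe.
ALLOWED = set(string.ascii_lowercase) | {" ", "'"}

def valid_transcript(text):
    """True if the transcript uses only allowed characters."""
    return set(text) <= ALLOWED

def write_manifest(rows):
    """Render (wav_filename, wav_filesize, transcript) rows as audios.csv."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for filename, size, transcript in rows:
        assert valid_transcript(transcript), transcript
        writer.writerow([filename, size, transcript])
    return buf.getvalue()

print(write_manifest([("6930-81414-0000.wav", 412684, "audio transcript one")]))
```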

Query Format

A Base64-encoded string of the bytes of the audio as a 16 kHz .wav file.
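
A minimal sketch of forming such a query on the client side (the file path in the comment is hypothetical):

```python
import base64

# Base64-encode the raw bytes of a 16 kHz .wav file into a query string.
def encode_audio(wav_bytes):
    return base64.b64encode(wav_bytes).decode("ascii")

# with open("audio.wav", "rb") as f:
#     query = encode_audio(f.read())
```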

Prediction Format

A string, representing the predicted transcript for the audio.

TABULAR_CLASSIFICATION

Dataset Type

dataset-type: TABULAR

The following optional train arguments are supported:

Train Argument    Description
features          List of feature columns' names, as a list of strings (defaults to the first N-1 columns in the CSV file)
target            Target column name, as a string (defaults to the last column in the CSV file)

The train & validation datasets should have the same columns.

An example of the dataset follows:

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
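
The default column handling described above can be sketched as follows (the helper name is illustrative, and the standard library's csv module stands in for whatever loader a model actually uses): absent the features/target train arguments, the first N-1 columns are features and the last column is the target.

```python
import csv
import io

# Split a TABULAR CSV into feature rows X and target values y,
# applying the documented defaults when features/target are not given.
def split_features_target(csv_text, features=None, target=None):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    features = features or columns[:-1]   # default: first N-1 columns
    target = target or columns[-1]        # default: last column
    X = [[float(r[c]) for c in features] for r in rows]
    y = [float(r[target]) for r in rows]
    return X, y

X, y = split_features_target("age,sex,target\n48,0,1\n58,0,0\n")
print(X, y)  # [[48.0, 0.0], [58.0, 0.0]] [1.0, 0.0]
```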

Query Format

A size-(N-1) dictionary representing feature-value pairs.

E.g.

queries=[
{'age': 48,'sex': 1,'cp': 2,'trestbps': 130,'chol': 225,'fbs': 1,'restecg': 1,'thalach': 172,'exang': 1,'oldpeak': 1.7,'slope': 2,'ca': 0,'thal': 3},
{'age': 48,'sex': 0,'cp': 2,'trestbps': 130,'chol': 275,'fbs': 0,'restecg': 1,'thalach': 139,'exang': 0,'oldpeak': 0.2,'slope': 2,'ca': 0,'thal': 2},
]

Prediction Format

A size-k list of floats, representing the probabilities of each class from 0 to k-1 for the target column.

TABULAR_REGRESSION

Dataset Type

dataset-type: TABULAR

The following optional train arguments are supported:

Train Argument    Description
features          List of feature columns' names, as a list of strings (defaults to the first N-1 columns in the CSV file)
target            Target column name, as a string (defaults to the last column in the CSV file)

The train & validation datasets should have the same columns.

An example of the dataset follows:

density,bodyfat,age,weight,height,neck,chest,abdomen,hip,thigh,knee,ankle,biceps,forearm,wrist
1.0708,12.3,23,154.25,67.75,36.2,93.1,85.2,94.5,59,37.3,21.9,32,27.4,17.1
1.0853,6.1,22,173.25,72.25,38.5,93.6,83,98.7,58.7,37.3,23.4,30.5,28.9,18.2
1.0414,25.3,22,154,66.25,34,95.8,87.9,99.2,59.6,38.9,24,28.8,25.2,16.6
...

Query Format

A size-(N-1) dictionary representing feature-value pairs.

Prediction Format

A float, representing the value of the target column.