Dataset Types
Note: Refer to ./examples/datasets/ for examples of pre-processing common dataset formats to conform to SINGA-Auto’s own dataset formats.
CORPUS
The dataset file must be of the .zip archive format with a corpus.tsv at the root of the directory.
The corpus.tsv should be of a .TSV format with columns of token and N other variable column names (tag columns).
For each row:
- token should be a string, a token (e.g. word) in the corpus. Tokens should appear in the same order as in the text of the corpus. To delimit sentences, token can take the value of \n.
- The other N columns describe the corresponding token as part of the text of the corpus, depending on the task.
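The layout above can be sketched in Python. This is a minimal illustration, not SINGA-Auto's own tooling: the function name `build_corpus_zip` and the tag column names are hypothetical, and the sketch assumes the sentence delimiter is the literal two-character string `\n` with the tag columns left empty on that row.

```python
import csv
import io
import zipfile


def build_corpus_zip(sentences, tag_names):
    """Return the bytes of a .zip archive holding corpus.tsv at its root.

    sentences: list of sentences, each a list of (token, tag_1, ..., tag_N) tuples.
    tag_names: names of the N tag columns (hypothetical, task-dependent).
    """
    text = io.StringIO()
    writer = csv.writer(text, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", *tag_names])
    for sentence in sentences:
        for row in sentence:
            writer.writerow(row)
        # Assumption: the delimiter token is the literal string \n and the
        # tag columns are left empty on the delimiter row.
        writer.writerow(["\\n"] + [""] * len(tag_names))

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("corpus.tsv", text.getvalue())
    return archive.getvalue()
```

For example, `build_corpus_zip([[("The", "DT"), ("cat", "NN")]], ["tag"])` yields an archive whose corpus.tsv has a `token<TAB>tag` header, one row per token, and a delimiter row after the sentence.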
SEGMENTATION_IMAGES
Inside the uploaded .zip file, the training and validation sets should be placed in separate folders named strictly train and val.
For the train folder (and likewise the val folder), the images and annotated masks should also be placed in separate folders named strictly image and mask. The mask folder should contain only .png files, and each mask’s file name should match that of its corresponding image (e.g. for an image named 0001.jpg, the corresponding mask should be named 0001.png).
A JSON file named params.json must also be included in the .zip file to specify essential training parameters such as num_classes, for example: { "num_classes": 21 }
An example of the uploaded .zip file structure:
+ dataset.zip
    + train
        + image
            + 0001.jpg
            + 0002.jpg
            + ...
        + mask
            + 0001.png
            + 0002.png
            + ...
    + val
        + image
            + 0003.jpg
            + ...
        + mask
            + 0003.png
            + ...
    + params.json
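Before uploading, the structure can be checked with a short script. The following is a sketch under stated assumptions: `check_segmentation_zip` is a hypothetical helper (not part of SINGA-Auto), and it assumes that train, val and params.json sit directly at the archive root.

```python
import json
import zipfile
from pathlib import PurePosixPath


def check_segmentation_zip(zip_file):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    with zipfile.ZipFile(zip_file) as zf:
        # Skip explicit directory entries, which end with "/".
        names = [n for n in zf.namelist() if not n.endswith("/")]

        if "params.json" not in names:
            problems.append("missing params.json")
        elif "num_classes" not in json.loads(zf.read("params.json")):
            problems.append("params.json lacks num_classes")

        for split in ("train", "val"):
            images = {PurePosixPath(n).stem for n in names
                      if n.startswith(f"{split}/image/")}
            masks = [PurePosixPath(n) for n in names
                     if n.startswith(f"{split}/mask/")]
            if not images:
                problems.append(f"{split}/image is missing or empty")
            for mask in masks:
                if mask.suffix.lower() != ".png":
                    problems.append(f"{mask}: mask is not a .png file")
            mask_stems = {m.stem for m in masks}
            # Every image needs a mask with the same base name, and vice versa.
            for stem in sorted(images - mask_stems):
                problems.append(f"{split}: image {stem} has no matching mask")
            for stem in sorted(mask_stems - images):
                problems.append(f"{split}: mask {stem} has no matching image")
    return problems
```

The function accepts either a path or a file-like object, since `zipfile.ZipFile` supports both.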
IMAGE_FILES
The dataset file must be of the .zip archive format with an images.csv at the root of the directory.
The images.csv should be of a .CSV format with columns of path and N other variable column names (tag columns).
For each row:
- path should be a file path to a .png, .jpg or .jpeg image file within the archive, relative to the root of the directory.
- The other N columns describe the corresponding image, depending on the task.
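As an illustration, the contents of images.csv could be generated as follows. This is a minimal sketch: `make_images_csv` is a hypothetical helper, and the `class` tag column used in the usage note is an assumed example of a task-dependent tag column.

```python
import csv
import io

ALLOWED_EXTENSIONS = (".png", ".jpg", ".jpeg")


def make_images_csv(records, tag_columns):
    """Return the text of an images.csv file.

    records: iterable of (path, tag_1, ..., tag_N) tuples, with path relative
             to the archive root.
    tag_columns: names of the N tag columns (task-dependent).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["path", *tag_columns])
    for path, *tags in records:
        if not path.lower().endswith(ALLOWED_EXTENSIONS):
            raise ValueError(f"unsupported image type: {path}")
        if len(tags) != len(tag_columns):
            raise ValueError(f"expected {len(tag_columns)} tag values for {path}")
        writer.writerow([path, *tags])
    return out.getvalue()
```

For instance, `make_images_csv([("images/0.png", 3)], ["class"])` produces a header row `path,class` followed by `images/0.png,3`.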
DETECTION_DATASET
It is recommended to follow the YOLO dataset format.
- For the folder hierarchy, two folders, ‘images’ and ‘labels’, should be prepared. The ‘images’ folder contains PIL-loadable images, and the corresponding txt label files should be placed in the ‘labels’ folder, with the same basenames as the images.
- The label file format is as follows, where object-id is the index of the object and the following four numbers are normalized to the range 0 to 1 by dividing by the width and height of the image. center_x center_y are the coordinates of the center of the bounding box, and width height are its side lengths. An empty label file (a negative sample) is allowed and means there are no objects to detect in the image.
object-id center_x center_y width height
...
- In addition, train.txt and valid.txt can be provided to list the images used for training and validation; they contain only the paths of image files. A class.names file contains the category names, and their line numbers are the object-id values.
GENERAL_FILES
- For a general task, as its name states, a task (or model) from any domain can be included in this category, such as image processing, NLP, speech, or video.
- There are no requirements on the form of the dataset, as long as it can be read into memory in the form of a file. However, the model developer has to know in advance how to handle the read-in file.
QUESTION_ANSWERING_COVID19
The dataset file must be of the .zip archive format, containing JSON files. JSON files under different levels of folders are automatically read all together.
Each JSON file is extracted from one paper. The JSON structure contains a body_text field, which is a list of {“text”: <str>} blocks. Each text block is one paragraph of the corresponding paper.
A metadata.csv file at the root of the archive directory is optional. It provides the model with a publish_time column, where each entry is in Date format, e.g. 2001-12-17. In this case, each metadata entry is required to have a sha column in General format, and each JSON file is required to have a “sha”: <str> field, so that the two sha values link the metadata entry to its JSON file. When neither metadata.csv nor a publish_time Date value is provided, the model does not check the timeliness of the corresponding JSON body_text field.
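To make the linkage between the JSON files and metadata.csv concrete, here is a sketch of how such an archive could be read. It is not the model's actual loader: `load_papers` and the shape of its return value are hypothetical, and only the field and column names (body_text, text, sha, publish_time) come from the description above.

```python
import csv
import io
import json
import zipfile


def load_papers(zip_file):
    """Read every JSON file in the archive, at any folder depth."""
    papers = []
    with zipfile.ZipFile(zip_file) as zf:
        # Optional metadata.csv at the archive root: maps sha -> publish_time.
        publish_time = {}
        if "metadata.csv" in zf.namelist():
            with zf.open("metadata.csv") as raw:
                reader = csv.DictReader(
                    io.TextIOWrapper(raw, encoding="utf-8", newline=""))
                for row in reader:
                    publish_time[row["sha"]] = row["publish_time"]

        for name in zf.namelist():
            if not name.lower().endswith(".json"):
                continue
            doc = json.loads(zf.read(name))
            papers.append({
                "file": name,
                # One entry per paragraph of the paper.
                "paragraphs": [block["text"] for block in doc.get("body_text", [])],
                # None when there is no metadata entry for this paper's sha.
                "publish_time": publish_time.get(doc.get("sha")),
            })
    return papers
```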
QUESTION_ANSWERING_MEDQUAD
The dataset file must be of the .zip archive format, containing XML files. XML files under different levels of folders are automatically read all together.
The model only takes the <Document> <QAPairs> … </QAPairs> </Document> field, and this field contains multiple <QAPair> … </QAPair> elements. Each QAPair has one <Question> … </Question> and its <Answer> … </Answer> combination.
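The nesting described above can be read with Python's standard `xml.etree.ElementTree`. This is a minimal sketch rather than the model's own parser: `extract_qa_pairs` is a hypothetical helper, and it assumes `<Document>` is the root element of each XML file.

```python
import xml.etree.ElementTree as ET


def extract_qa_pairs(xml_text):
    """Return a list of (question, answer) strings from one XML document."""
    root = ET.fromstring(xml_text)  # assumed to be the <Document> element
    pairs = []
    # Only QAPair elements directly under Document/QAPairs are considered.
    for qa_pair in root.findall("./QAPairs/QAPair"):
        question = qa_pair.findtext("Question", default="") or ""
        answer = qa_pair.findtext("Answer", default="") or ""
        pairs.append((question.strip(), answer.strip()))
    return pairs
```

Any other elements inside `<Document>` are ignored by this sketch, which mirrors the statement that only the QAPairs field is used.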
TABULAR
The dataset file must be a tabular dataset of the .csv format with
N columns.
AUDIO_FILES
The dataset file must be of the .zip archive format with an audios.csv at the root of the directory.
The audios.csv should be of a .CSV format with 3 columns: wav_filename, wav_filesize and transcript.
For each row:
- wav_filename should be a file path to a .wav audio file within the archive, relative to the root of the directory. Each audio file’s sample rate must be 16kHz.
- wav_filesize should be an integer representing the size of the .wav audio file, in bytes.
- transcript should be a string of the true transcript for the audio file. Transcripts should only contain the following characters: a b c d e f g h i j k l m n o p q r s t u v w x y z '
An example of audios.csv follows:
wav_filename,wav_filesize,transcript
6930-81414-0000.wav,412684,audio transcript one
6930-81414-0001.wav,559564,audio transcript two
...
672-122797-0005.wav,104364,audio transcript one thousand
...
1995-1837-0001.wav,279404,audio transcript three thousand
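A row of audios.csv can be derived from the audio file itself. The sketch below uses only the Python standard library; `audio_row` is a hypothetical helper, and it assumes that spaces are permitted in transcripts in addition to the listed characters, as the example rows above suggest.

```python
import os
import re
import wave

# Lowercase letters and the apostrophe, plus spaces (assumed from the example rows).
ALLOWED_TRANSCRIPT = re.compile(r"[a-z' ]*")


def audio_row(root_dir, wav_filename, transcript):
    """Return [wav_filename, wav_filesize, transcript] for one audio file.

    wav_filename is the path relative to root_dir, the future archive root.
    """
    full_path = os.path.join(root_dir, wav_filename)
    with wave.open(full_path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"{wav_filename}: sample rate must be 16kHz")
    if not ALLOWED_TRANSCRIPT.fullmatch(transcript):
        raise ValueError(f"{wav_filename}: transcript has unsupported characters")
    return [wav_filename, os.path.getsize(full_path), transcript]
```

The returned lists can be passed to `csv.writer(...).writerow` after writing the header row `wav_filename,wav_filesize,transcript`.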
Query Format
A Base64-encoded string of the bytes of the audio as a 16kHz .wav file.
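Producing such a query string takes one call to the standard `base64` module. The helper name `encode_query` is hypothetical, and how the string is then sent to a deployed model is outside the scope of this sketch.

```python
import base64


def encode_query(wav_bytes):
    """Encode the raw bytes of a 16kHz .wav file as a Base64 string."""
    return base64.b64encode(wav_bytes).decode("ascii")
```

In practice `wav_bytes` would come from reading the file in binary mode, for example `open(path, "rb").read()`.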
Prediction Format
A string, representing the predicted transcript for the audio.