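Customize Datasets

Support new data format

To support a new data format, you can either convert it to an existing format (COCO or PASCAL VOC) or convert it directly to the middle format. You can also choose to convert offline (with a script, before training) or online (implement a new dataset class and convert at training time). In MMDetection, we recommend converting the data to COCO format offline, so that after conversion you only need to modify the annotation paths and the classes in the config.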

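Reorganize new data formats to existing format

The simplest way is to convert your dataset to an existing dataset format (COCO or PASCAL VOC).

The annotation JSON file in COCO format has the following necessary keys: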

'images': [
    {
        'file_name': 'COCO_val2014_000000001268.jpg',
        'height': 427,
        'width': 640,
        'id': 1268
    },
    ...
],

'annotations': [
    {
        'segmentation': [[192.81,
            247.09,
            ...
            219.03,
            249.06]],  # If you have mask labels, and it is in polygon XY point coordinate format, you need to ensure that at least 3 point coordinates are included. Otherwise, it is an invalid polygon.
        'area': 1035.749,
        'iscrowd': 0,
        'image_id': 1268,
        'bbox': [192.81, 224.8, 74.73, 33.43],
        'category_id': 16,
        'id': 42986
    },
    ...
],

'categories': [
    {'id': 0, 'name': 'car'},
 ]

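There are three necessary keys in the JSON file:

images: contains a list of images with their information, such as file_name, height, width, and id.
annotations: contains the list of instance annotations.
categories: contains the list of category names and their IDs.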
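As a rough illustration of an offline conversion, the sketch below assembles a dictionary with the three keys described above and serializes it as JSON. The input layout (`records`) and the helper name `build_coco_dict` are hypothetical and only serve the example; a real converter would read your own annotation source instead.

```python
import json


def build_coco_dict(records, class_names):
    """Convert simple per-image records into a COCO-style annotation dict.

    `records` is a hypothetical input layout: each item has `file_name`,
    `width`, `height` and `boxes`, a list of (x, y, w, h, class_name).
    """
    name_to_id = {name: idx for idx, name in enumerate(class_names)}
    coco = {
        'images': [],
        'annotations': [],
        'categories': [{'id': i, 'name': n} for n, i in name_to_id.items()],
    }
    ann_id = 0
    for img_id, rec in enumerate(records):
        coco['images'].append({
            'file_name': rec['file_name'],
            'height': rec['height'],
            'width': rec['width'],
            'id': img_id,
        })
        for x, y, w, h, name in rec['boxes']:
            coco['annotations'].append({
                'id': ann_id,
                'image_id': img_id,
                'category_id': name_to_id[name],
                'bbox': [x, y, w, h],  # COCO boxes are [x, y, width, height]
                'area': w * h,         # box area; use the mask area if you have masks
                'iscrowd': 0,
            })
            ann_id += 1
    return coco


records = [
    {'file_name': 'a.jpg', 'width': 640, 'height': 427,
     'boxes': [(192.81, 224.8, 74.73, 33.43, 'car')]},
]
coco = build_coco_dict(records, class_names=('car',))
coco_json = json.dumps(coco)  # write this string to the annotation file
```

The resulting string can be written to the file referenced by `ann_file` in the config.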

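After the data pre-processing, there are two steps for users to train a customized new dataset with an existing format (e.g. COCO format):

Modify the config file for using the customized dataset.
Check the annotations of the customized dataset.

Here we give an example to show the above two steps, which uses a customized dataset of 5 classes in COCO format to train an existing Cascade Mask R-CNN R50-FPN detector.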

1. Modify the config file for using the customized dataset

There are two aspects involved in the modification of the config file:

The data field. Specifically, you need to explicitly add the metainfo=dict(classes=classes) fields in train_dataloader.dataset, val_dataloader.dataset and test_dataloader.dataset and classes must be a tuple type.
The num_classes field in the model part. Explicitly over-write all the num_classes from default value (e.g. 80 in COCO) to your classes number.

In configs/my_custom_config.py

# the new config inherits the base configs to highlight the necessary modification
_base_ = './cascade_mask_rcnn_r50_fpn_1x_coco.py'

# 1. dataset settings
dataset_type = 'CocoDataset'
classes = ('a', 'b', 'c', 'd', 'e')
data_root='path/to/your/'

train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        # explicitly add your class names to the field `metainfo`
        metainfo=dict(classes=classes),
        data_root=data_root,
        ann_file='train/annotation_data',
        data_prefix=dict(img='train/image_data')
        )
    )

val_dataloader = dict(
    batch_size=1,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        test_mode=True,
        # explicitly add your class names to the field `metainfo`
        metainfo=dict(classes=classes),
        data_root=data_root,
        ann_file='val/annotation_data',
        data_prefix=dict(img='val/image_data')
        )
    )

test_dataloader = dict(
    batch_size=1,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        test_mode=True,
        # explicitly add your class names to the field `metainfo`
        metainfo=dict(classes=classes),
        data_root=data_root,
        ann_file='test/annotation_data',
        data_prefix=dict(img='test/image_data')
        )
    )

# 2. model settings

# explicitly over-write all the `num_classes` field from default 80 to 5.
model = dict(
    roi_head=dict(
        bbox_head=[
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5),
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5),
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5)],
        # explicitly over-write all the `num_classes` field from default 80 to 5.
        mask_head=dict(num_classes=5)))

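2. Check the annotations of the customized dataset

Assuming your customized dataset is in COCO format, make sure the annotations in it are correct:

The length of the categories field in the annotations should exactly equal the tuple length of the classes field in your config, meaning the number of classes (5 in this example).
The classes field in your config file should have exactly the same elements and the same order as the name values in categories of the annotations. MMDetection automatically maps the non-contiguous id values in categories to contiguous label indices, so the string order of name in the categories field affects the order of label indices. Meanwhile, the string order of classes in the config affects the label text when visualizing predicted bounding boxes.
The category_id in the annotations field should be valid, i.e., all values of category_id should belong to the id values in categories.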
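The three checks above can be run before training with a few lines of plain Python. The following is a minimal sketch, not part of MMDetection; the function name `check_coco_annotations` is hypothetical, and it also applies the polygon rule mentioned earlier (at least 3 points, i.e. at least 6 coordinates).

```python
def check_coco_annotations(coco, classes):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    categories = coco.get('categories', [])
    names = tuple(cat['name'] for cat in categories)

    # Checks 1 and 2: same number of classes, same names, same order.
    if len(names) != len(classes):
        problems.append(f'{len(names)} categories but {len(classes)} classes')
    elif names != tuple(classes):
        problems.append('category names differ from classes in content or order')

    # Check 3: every category_id must refer to an existing category id.
    valid_ids = {cat['id'] for cat in categories}
    for ann in coco.get('annotations', []):
        if ann['category_id'] not in valid_ids:
            problems.append(
                f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        segm = ann.get('segmentation', [])
        if isinstance(segm, list):  # polygon format; RLE dicts are skipped
            for poly in segm:
                if len(poly) < 6 or len(poly) % 2 != 0:
                    problems.append(f"annotation {ann['id']}: invalid polygon")
    return problems


classes = ('a', 'b', 'c', 'd', 'e')
good = {
    'categories': [{'id': 1, 'name': 'a'}, {'id': 3, 'name': 'b'},
                   {'id': 4, 'name': 'c'}, {'id': 16, 'name': 'd'},
                   {'id': 17, 'name': 'e'}],
    'annotations': [{'id': 42986, 'category_id': 16,
                     'segmentation': [[192.81, 247.09, 219.03, 249.06, 200.0, 230.0]]}],
}
bad = dict(good, annotations=[{'id': 1, 'category_id': 2,
                               'segmentation': [[0.0, 0.0, 1.0, 1.0]]}])

good_problems = check_coco_annotations(good, classes)
bad_problems = check_coco_annotations(bad, classes)
```

Here `good_problems` is empty, while `bad_problems` reports the unknown `category_id` and the two-point polygon.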

Here is a valid example of annotations:

'annotations': [
    {
        'segmentation': [[192.81,
            247.09,
            ...
            219.03,
            249.06]],  # if you have mask labels
        'area': 1035.749,
        'iscrowd': 0,
        'image_id': 1268,
        'bbox': [192.81, 224.8, 74.73, 33.43],
        'category_id': 16,
        'id': 42986
    },
    ...
],

# MMDetection automatically maps the non-contiguous `id` values to contiguous label indices.
'categories': [
    {'id': 1, 'name': 'a'}, {'id': 3, 'name': 'b'}, {'id': 4, 'name': 'c'}, {'id': 16, 'name': 'd'}, {'id': 17, 'name': 'e'},
 ]
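To make the id-to-label mapping concrete, the sketch below reproduces the described behavior for the categories above: ids are taken in the order they appear in `categories`, restricted to names listed in `classes`, and numbered from 0. It is an illustration of the idea under that assumption, not a copy of MMDetection's internal code, and `build_cat2label` is a hypothetical name.

```python
categories = [
    {'id': 1, 'name': 'a'}, {'id': 3, 'name': 'b'}, {'id': 4, 'name': 'c'},
    {'id': 16, 'name': 'd'}, {'id': 17, 'name': 'e'},
]
classes = ('a', 'b', 'c', 'd', 'e')


def build_cat2label(categories, classes):
    """Map each kept category id to a contiguous label index starting at 0."""
    kept_ids = [cat['id'] for cat in categories if cat['name'] in classes]
    return {cat_id: label for label, cat_id in enumerate(kept_ids)}


cat2label = build_cat2label(categories, classes)
# The annotation with category_id 16 therefore gets label 3.
label_of_example = cat2label[16]
```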

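We use this approach to support the CityScapes dataset. The script is in cityscapes.py and we also provide the finetuning configs.

Note

For instance segmentation datasets, MMDetection currently only supports evaluating mask AP for datasets in COCO format.
It is recommended to convert the data offline before training, so that you can still use CocoDataset and only need to modify the annotation paths and the training classes.

Reorganize new data format to middle format

It is also fine if you do not want to convert the annotation format to COCO or PASCAL format. In fact, we define a simple annotation format in MMEngine's BaseDataset, and all existing datasets are processed to be compatible with it, either online or offline.

The annotation file of the dataset must be in json, yaml/yml, or pickle/pkl format, and the dictionary stored in the annotation file must contain two fields, metainfo and data_list. metainfo is a dictionary that contains the metadata of the dataset, such as class information. data_list is a list in which each element is a dictionary defining the raw data of one image, and each raw data item contains one or several training/testing samples.

Here is an example.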

{
    'metainfo':
        {
            'classes': ('person', 'bicycle', 'car', 'motorcycle'),
            ...
        },
    'data_list':
        [
            {
                "img_path": "xxx/xxx_1.jpg",
                "height": 604,
                "width": 640,
                "instances":
                [
                  {
                    "bbox": [0, 0, 10, 20],
                    "bbox_label": 1,
                    "ignore_flag": 0
                  },
                  {
                    "bbox": [10, 10, 110, 120],
                    "bbox_label": 2,
                    "ignore_flag": 0
                  }
                ]
              },
            {
                "img_path": "xxx/xxx_2.jpg",
                "height": 320,
                "width": 460,
                "instances":
                [
                  {
                    "bbox": [10, 0, 20, 20],
                    "bbox_label": 3,
                    "ignore_flag": 1,
                  }
                ]
              },
            ...
        ]
}

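Some datasets may provide annotations such as crowd/difficult/ignored bboxes; we use ignore_flag to cover them.

After obtaining the above standard data annotation format, you can directly use MMDetection's BaseDetDataset in the config, without conversion.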
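If your data already exists in a COCO-like dictionary, producing the middle format offline can look roughly like the sketch below. Two points are assumptions made for the example and should be checked against your setup: boxes in the middle format are written as `[x1, y1, x2, y2]` (as the example above suggests), and `iscrowd` is mapped to `ignore_flag`. The function name `coco_to_middle_format` is hypothetical.

```python
import json


def coco_to_middle_format(coco, classes):
    """Build a dict with `metainfo` and `data_list` from a COCO-style dict."""
    kept_ids = [c['id'] for c in coco['categories'] if c['name'] in classes]
    cat2label = {cat_id: label for label, cat_id in enumerate(kept_ids)}

    instances_by_image = {}
    for ann in coco['annotations']:
        if ann['category_id'] not in cat2label:
            continue  # skip categories that are not trained on
        x, y, w, h = ann['bbox']
        instances_by_image.setdefault(ann['image_id'], []).append({
            'bbox': [x, y, x + w, y + h],  # assumption: xyxy in the middle format
            'bbox_label': cat2label[ann['category_id']],
            'ignore_flag': int(ann.get('iscrowd', 0)),  # assumption: crowd -> ignored
        })

    data_list = [{
        'img_path': img['file_name'],
        'height': img['height'],
        'width': img['width'],
        'instances': instances_by_image.get(img['id'], []),
    } for img in coco['images']]
    return {'metainfo': {'classes': tuple(classes)}, 'data_list': data_list}


coco_example = {
    'images': [{'id': 1, 'file_name': 'xxx/xxx_1.jpg', 'height': 604, 'width': 640}],
    'annotations': [
        {'id': 1, 'image_id': 1, 'category_id': 3, 'bbox': [10, 10, 100, 110], 'iscrowd': 0},
        {'id': 2, 'image_id': 1, 'category_id': 7, 'bbox': [0, 0, 10, 20], 'iscrowd': 1},
    ],
    'categories': [{'id': 3, 'name': 'person'}, {'id': 7, 'name': 'car'}],
}
middle = coco_to_middle_format(coco_example, classes=('person', 'car'))
middle_json = json.dumps(middle)  # tuples are serialized as JSON arrays
```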

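An example of customized dataset

Assume the annotations are in a new format stored in text files. The bounding box annotations are stored in the text file annotation.txt as follows: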

#
000001.jpg
1280 720
2
10 20 40 60 1
20 40 50 60 2
#
000002.jpg
1280 720
3
50 20 40 60 2
20 40 30 45 2
30 40 50 60 3

We can create a new dataset in mmdet/datasets/my_dataset.py to load the data.

import mmengine

from mmdet.datasets.base_det_dataset import BaseDetDataset
from mmdet.registry import DATASETS


@DATASETS.register_module()
class MyDataset(BaseDetDataset):

    METAINFO = {
        'classes': ('person', 'bicycle', 'car', 'motorcycle'),
        'palette': [(220, 20, 60), (119, 11, 32), (0, 0, 142), (0, 0, 230)]
    }

    def load_data_list(self):
        # `self.ann_file` is set by the base class from the `ann_file` config field.
        ann_list = mmengine.list_from_file(self.ann_file)

        data_infos = []
        for i, ann_line in enumerate(ann_list):
            # Each record starts with a line that contains only '#'.
            if ann_line != '#':
                continue

            img_shape = ann_list[i + 2].split(' ')
            width = int(img_shape[0])
            height = int(img_shape[1])
            bbox_number = int(ann_list[i + 3])

            instances = []
            for anns in ann_list[i + 4:i + 4 + bbox_number]:
                fields = anns.split(' ')
                instance = {}
                instance['bbox'] = [float(ann) for ann in fields[:4]]
                # The label is the fifth field of the line, not its fifth character.
                instance['bbox_label'] = int(fields[4])
                instances.append(instance)

            data_infos.append(
                dict(
                    img_path=ann_list[i + 1],
                    img_id=i,
                    width=width,
                    height=height,
                    instances=instances
                ))

        return data_infos
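Because the dataset class above depends on MMEngine and MMDetection, it can be convenient to check the parsing logic on its own first. The sketch below is a dependency-free version of the same loop, run on the sample annotation.txt content shown earlier; `parse_annotation_lines` is a hypothetical helper and not part of either library.

```python
def parse_annotation_lines(lines):
    """Parse the '#'-separated text format into a list of per-image dicts."""
    records = []
    for i, line in enumerate(lines):
        if line.strip() != '#':
            continue  # only lines with a single '#' start a new record
        width, height = (int(v) for v in lines[i + 2].split())
        num_boxes = int(lines[i + 3])
        instances = []
        for row in lines[i + 4:i + 4 + num_boxes]:
            fields = row.split()
            instances.append({
                'bbox': [float(v) for v in fields[:4]],
                'bbox_label': int(fields[4]),
            })
        records.append({
            'img_path': lines[i + 1].strip(),
            'width': width,
            'height': height,
            'instances': instances,
        })
    return records


sample = """#
000001.jpg
1280 720
2
10 20 40 60 1
20 40 50 60 2
#
000002.jpg
1280 720
3
50 20 40 60 2
20 40 30 45 2
30 40 50 60 3""".splitlines()

records = parse_annotation_lines(sample)
```

For the sample above, `records` holds two images with 2 and 3 instances respectively.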

Then in the config, to use MyDataset you can modify the config as follows:

dataset_A_train = dict(
    type='MyDataset',
    ann_file = 'image_list.txt',
    pipeline=train_pipeline
)

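Customize datasets by dataset wrappers

MMEngine also supports many dataset wrappers to mix datasets or modify the dataset distribution for training. Currently it supports the following three dataset wrappers:

RepeatDataset: simply repeat the whole dataset.
ClassBalancedDataset: repeat the dataset in a class-balanced manner.
ConcatDataset: concatenate datasets.

For detailed usage, see MMEngine Dataset Wrapper.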
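As a rough guide to what such configs look like, the fragment below wraps one base dataset config in each of the three wrappers. The paths and the values `times=3` and `oversample_thr=1e-3` are placeholders chosen for the example, not recommendations; consult the MMEngine documentation for the full argument list.

```python
# Placeholder base dataset config; replace the paths with your own.
base_dataset = dict(
    type='CocoDataset',
    data_root='path/to/your/',
    ann_file='train/annotation_data',
    data_prefix=dict(img='train/image_data'),
)

# Repeat the whole dataset 3 times per epoch.
repeated = dict(type='RepeatDataset', times=3, dataset=base_dataset)

# Oversample images that contain categories rarer than the threshold.
balanced = dict(
    type='ClassBalancedDataset', oversample_thr=1e-3, dataset=base_dataset)

# Concatenate several datasets into one.
concatenated = dict(type='ConcatDataset', datasets=[base_dataset, base_dataset])

# Any of the wrapped configs can be used as the dataloader's dataset.
train_dataloader = dict(dataset=repeated)
```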

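Modify dataset classes

With existing dataset types, we can modify their metainfo to train on a subset of the annotations. For example, if you want to train on only three classes of the current dataset, you can modify the classes of the dataset. The dataset will automatically filter out the ground-truth boxes of the other classes.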

classes = ('person', 'bicycle', 'car')
train_dataloader = dict(
    dataset=dict(
        metainfo=dict(classes=classes))
    )
val_dataloader = dict(
    dataset=dict(
        metainfo=dict(classes=classes))
    )
test_dataloader = dict(
    dataset=dict(
        metainfo=dict(classes=classes))
    )
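The effect of restricting `classes` can be pictured with the sketch below, which drops instances whose class is not kept and renumbers the remaining labels. It only illustrates the outcome described above on middle-format style instances; it is not MMDetection's implementation, and `filter_instances` is a hypothetical name.

```python
def filter_instances(instances, all_classes, kept_classes):
    """Drop instances outside `kept_classes` and relabel the rest from 0."""
    new_label = {all_classes.index(name): i for i, name in enumerate(kept_classes)}
    return [
        dict(inst, bbox_label=new_label[inst['bbox_label']])
        for inst in instances
        if inst['bbox_label'] in new_label
    ]


all_classes = ('person', 'bicycle', 'car', 'motorcycle')
kept_classes = ('person', 'bicycle', 'car')
instances = [
    {'bbox': [0, 0, 10, 20], 'bbox_label': 0},     # person, kept
    {'bbox': [10, 0, 20, 20], 'bbox_label': 3},    # motorcycle, removed
    {'bbox': [10, 10, 110, 120], 'bbox_label': 2}, # car, kept
]
filtered = filter_instances(instances, all_classes, kept_classes)
```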

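Note:

Before MMDetection v2.5.0, the dataset automatically filtered out images with empty GT whenever classes were set, and there was no way to disable this through the config. This was undesirable and confusing, because when classes were not set, the dataset only filtered images with empty GT if filter_empty_gt=True and test_mode=False. Since MMDetection v2.5.0, image filtering is decoupled from class modification: the dataset only filters images with empty GT when filter_cfg=dict(filter_empty_gt=True) and test_mode=False, regardless of whether classes are set. Setting classes therefore only affects the annotations of the classes used for training, and users can decide for themselves whether to filter images with empty GT.
When directly using BaseDataset in MMEngine or BaseDetDataset in MMDetection, users cannot filter images without GT by modifying the config, but this can be handled offline.
Remember to modify num_classes in the head when specifying classes in the dataset. Since v2.9.0 (after PR#4508), NumClassCheckHook checks whether the numbers are consistent.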
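A quick way to catch a forgotten `num_classes` before launching training is to walk the model config and compare every `num_classes` entry with `len(classes)`. The sketch below does this on plain dicts and lists; it is a standalone illustration, not the implementation of NumClassCheckHook, and `find_num_classes_mismatches` is a hypothetical name.

```python
def find_num_classes_mismatches(cfg, num_expected, path='model'):
    """Return (path, value) pairs for every `num_classes` != num_expected."""
    mismatches = []
    if isinstance(cfg, dict):
        for key, value in cfg.items():
            child = f'{path}.{key}'
            if key == 'num_classes' and value != num_expected:
                mismatches.append((child, value))
            else:
                mismatches.extend(
                    find_num_classes_mismatches(value, num_expected, child))
    elif isinstance(cfg, (list, tuple)):
        for idx, item in enumerate(cfg):
            mismatches.extend(
                find_num_classes_mismatches(item, num_expected, f'{path}[{idx}]'))
    return mismatches


classes = ('a', 'b', 'c', 'd', 'e')
model = dict(
    roi_head=dict(
        bbox_head=[
            dict(type='Shared2FCBBoxHead', num_classes=5),
            dict(type='Shared2FCBBoxHead', num_classes=80),  # forgotten override
            dict(type='Shared2FCBBoxHead', num_classes=5),
        ],
        mask_head=dict(num_classes=5)))

mismatches = find_num_classes_mismatches(model, len(classes))
```

In this example the second bbox head is reported because it still uses the COCO default of 80.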

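COCO Panoptic Dataset

We now support the COCO Panoptic dataset. The format of panoptic annotations differs from the COCO format: both foreground and background are present in the annotation file. The annotation JSON file in COCO Panoptic format has the following necessary keys: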

'images': [
    {
        'file_name': '000000001268.jpg',
        'height': 427,
        'width': 640,
        'id': 1268
    },
    ...
]

'annotations': [
    {
        'file_name': '000000001268.jpg',
        'image_id': 1268,
        'segments_info': [
            {
                'id':8345037,  # One-to-one correspondence with the id in the annotation map.
                'category_id': 51,
                'iscrowd': 0,
                'bbox': (x1, y1, w, h),  # The bbox of the background is the outer rectangle of its mask.
                'area': 24315
            },
            ...
        ]
    },
    ...
]

'categories': [  # including both foreground categories and background categories
    {'id': 0, 'name': 'person'},
    ...
 ]
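The `id` of each segment corresponds to a pixel color in the panoptic annotation map. In the COCO panoptic format, the id is encoded in the PNG as `R + 256 * G + 256 * 256 * B`. The sketch below shows the conversion in both directions for the id used in the example above; the function names are chosen for this illustration.

```python
def rgb_to_segment_id(r, g, b):
    """Encode an RGB pixel of the annotation map as a segment id."""
    return r + 256 * g + 256 * 256 * b


def segment_id_to_rgb(segment_id):
    """Decode a segment id back into the RGB color stored in the PNG."""
    r = segment_id % 256
    g = (segment_id // 256) % 256
    b = segment_id // (256 * 256)
    return r, g, b


example_rgb = segment_id_to_rgb(8345037)
example_id = rgb_to_segment_id(*example_rgb)
```

For the segment id 8345037, the decoded color is (205, 85, 127), and encoding that color returns the same id.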

In addition, seg must be set to the path of the panoptic annotation images.

dataset_type = 'CocoPanopticDataset'
data_root='path/to/your/'

train_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_prefix=dict(
            img='train/image_data/', seg='train/panoptic/image_annotation_data/')
    )
)
val_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_prefix=dict(
            img='val/image_data/', seg='val/panoptic/image_annotation_data/')
    )
)
test_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_prefix=dict(
            img='test/image_data/', seg='test/panoptic/image_annotation_data/')
    )
)