
Object-Oriented Datasets with Python and PyTorch - Part 3: Cats and Dogs


By Giovanni Guasti

Note: all content in this forum blog comes from Xilinx engineers. To repost, please credit the author, include a link to the Xilinx forums, and send an email to cncrc@xilinx.com. Commercial use is prohibited without the permission of Xilinx and the copyright holder.


This is the third post in the series on object-oriented datasets with Python and PyTorch. To read Part 1, see here.

To read Part 2, see here.

 

Part 3: repetita iuvant (*): cats and dogs

(*) a Latin phrase meaning, roughly, "repeating things helps": practice makes perfect

In this post we will repeat, on the "cats and dogs" database, the process already completed in Part 2, and we will add a few extras.

 

Usually, simple datasets are organized in folders: for example, cat and dog, with "train", "validation" and "test" folders for each class.

By organizing the dataset as a single object, we avoid the complexity of the folder tree. In this application, all the images are stored in one and the same folder.

All we need is a single label file stating which image is a dog and which is a cat. The code that creates the label file automatically follows below.


Even though each image name already contains its label, we deliberately create a dedicated labels.txt file, in which every line holds a file name and its label: cat = 0, dog = 1.

At the end of this example we will review two ways of splitting the dataset with PyTorch, and we will train a very simple model.

In [ ]:

 

import os

data_path = './raw_data/dogs_cats/all'
files = [f for f in os.listdir(data_path)]
#for f in files:
#    print(f)

# write one "<filename> <label>" line per image: cat = 0, dog = 1
# (mode "a" appends, so delete any existing labels.txt before re-running)
with open(data_path + '/' + "labels.txt", "a") as myfile:
    for f in files:
        if f.split('.')[0]=='cat':
            label = 0
        elif f.split('.')[0]=='dog':
            label = 1
        else:
            print("ERROR in recognizing the label of file " + f)
            continue  # skip files whose name matches neither class
        myfile.write(f + ' ' + str(label) + '\n')
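To verify the result, we can print the first few lines of the generated file (the exact output will depend on your folder contents):

with open(data_path + '/labels.txt') as f:
    for _ in range(3):
        print(f.readline().strip())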

 

In [106]:

 

from PIL import Image
import matplotlib.pyplot as plt

raw_data_path = './raw_data/dogs_cats/all'
im_example_cat = Image.open(raw_data_path + '/' + 'cat.1070.jpg')
im_example_dog = Image.open(raw_data_path + '/' + 'dog.1070.jpg')

fig, axs = plt.subplots(1, 2, figsize=(10, 3))

axs[0].set_title('should be a cat')
axs[0].imshow(im_example_cat)

axs[1].set_title('should be a dog')
axs[1].imshow(im_example_dog)
plt.show()

 

[Figure: the two example images side by side, a cat and a dog]

Make sure to refresh the sample list:

In [ ]:

 

# remove any stale sample list left over from a previous run
del sample_list

import csv
import functools

@functools.lru_cache(1)
def getSampleInfoList(raw_data_path):
    sample_list = []
    with open(str(raw_data_path) + '/labels.txt', mode = 'r') as f:
        reader = csv.reader(f, delimiter = ' ')
        for i, row in enumerate(reader):
            imgname = row[0]
            label = int(row[1])
            # DataInfoTuple and myFunc were defined in Part 2
            sample_list.append(DataInfoTuple(imgname, label))
    sample_list.sort(reverse=False, key=myFunc)
    # print("DataInfoTuple: samples list length = {}".format(len(sample_list)))
    return sample_list
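One detail worth noting: since getSampleInfoList is memoized with functools.lru_cache, deleting the sample_list variable does not invalidate the cached return value. If labels.txt changes between runs, the cache should be cleared explicitly (cache_clear is part of the standard lru_cache API):

# force the next call to re-read labels.txt instead of returning the cached list
getSampleInfoList.cache_clear()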

 

Creating the dataset object is very simple: a single line of code is enough:

In [114]:

 

mydataset = MyDataset(isValSet_bool = None, raw_data_path = raw_data_path, norm = False, resize = True, newsize = (64, 64))
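As a quick usage check (assuming, as in Part 2, that MyDataset returns an (image tensor, label) pair when indexed):

img_t, label = mydataset[0]
print(img_t.shape, label)   # expected: something like torch.Size([3, 64, 64]) 0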

 

If normalization is desired, the mean and the standard deviation should be computed, and the normalized dataset regenerated.

The code follows, for the sake of completeness.

In [ ]:

 

import torch
from torchvision import transforms

# stack all the images along a new last dimension: shape (3, H, W, N)
imgs = torch.stack([img_t for img_t, _ in mydataset], dim = 3)
# flatten each channel and compute the per-channel mean and std
im_mean = imgs.view(3, -1).mean(dim=1).tolist()
im_std = imgs.view(3, -1).std(dim=1).tolist()
del imgs
normalize = transforms.Normalize(mean=im_mean, std=im_std)
mydataset = MyDataset(isValSet_bool = None, raw_data_path = raw_data_path, norm = True, resize = True, newsize = (64, 64))
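As an optional sanity check (assuming MyDataset applies the normalization internally, as in Part 2), the per-channel statistics of the regenerated dataset should now be close to zero mean and unit standard deviation:

imgs_check = torch.stack([img_t for img_t, _ in mydataset], dim = 3)
print(imgs_check.view(3, -1).mean(dim=1))  # expected: values close to 0
print(imgs_check.view(3, -1).std(dim=1))   # expected: values close to 1
del imgs_check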

 

 

Splitting the database into training, validation and test sets

The next step is required for the training phase. Usually, the whole sample dataset is shuffled and then split into three sets: training, validation and test.

If you have organized the dataset as a data tensor plus a labels tensor, you can apply "sklearn.model_selection.train_test_split" twice:

first splitting into "train" and "test", then splitting "train" again into "validation" and "train".

It would look like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

 

But we want to keep the dataset as an object, and PyTorch makes this easy for us.

As an example, we create only the "training" and "validation" sets.

Method 1:

Here we shuffle the indices first, then create the datasets from them

In [ ]:

 

n_samples = len(mydataset)
# how many samples will end up in the validation set
n_val = int(0.2 * n_samples)
# important! shuffle the dataset - we start by shuffling the indices
shuffled_indices = torch.randperm(n_samples)
# the first step is to split the indices
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_indices, val_indices

 

In [ ]:

 

from torch.utils.data.sampler import SubsetRandomSampler
batch_size = 64

# each sampler draws (without replacement) only from its own subset of indices
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(mydataset, batch_size=batch_size, sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(mydataset, batch_size=batch_size, sampler=valid_sampler)

 

Method 2

Below, the dataset itself is shuffled directly. The coding style is more abstract:

In [116]:

 

train_size = int(0.9 * len(mydataset))
valid_size = int(0.1 * len(mydataset))
# note: the two lengths must sum exactly to len(mydataset)
train_dataset, valid_dataset = torch.utils.data.random_split(mydataset, [train_size, valid_size])

# uncomment if a "test" dataset is needed as well
#test_size = valid_size
#train_size = train_size - test_size
#train_dataset, test_dataset = torch.utils.data.random_split(train_dataset, [train_size, test_size])

len(mydataset), len(train_dataset), len(valid_dataset)

 

Out [116]:

 

(25000, 22500, 2500)
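A side note: random_split produces a different split on every run. In recent PyTorch versions it also accepts a generator argument, which makes the split reproducible:

# reproducible 90/10 split (the generator argument is available in recent PyTorch versions)
g = torch.Generator().manual_seed(42)
train_dataset, valid_dataset = torch.utils.data.random_split(mydataset, [train_size, valid_size], generator=g)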

 

 

Model definition

In [41]:

import torch.nn as nn
import torch.nn.functional as F
n_out = 2

In [ ]:

# a very small NN
# the expected accuracy is around 0.66

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding = 1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding = 1)
        # 64x64 input halved by two max-poolings: 8 channels * 16 * 16
        self.fc1 = nn.Linear(8*16*16, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)    # -> 16 x 32 x 32
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)  # -> 8 x 16 x 16
        #print(out.shape)
        out = out.view(-1,8*16*16)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

In [131]:

# a deeper model - but the training time on my CPU starts to become hard to bear

class ResBlock(nn.Module):
    def __init__(self, n_chans):
        super(ResBlock, self).__init__()
        self.conv = nn.Conv2d(n_chans, n_chans, kernel_size=3, padding=1)
        self.batch_norm = nn.BatchNorm2d(num_features=n_chans)
    def forward(self, x):
        out = self.conv(x)
        out = self.batch_norm(out)
        out = torch.relu(out)
        return out + x  # the residual (skip) connection

In [177]:

class Net(nn.Module):
    def __init__(self, n_chans1=32, n_blocks=10):
        super(Net, self).__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(n_chans1, n_chans1, kernel_size=3, padding=1)
        # note: this repeats the SAME ResBlock instance n_blocks times,
        # so the blocks share their weights (see the parameter count below)
        self.resblocks = nn.Sequential(* [ResBlock(n_chans=n_chans1)] * n_blocks)
        self.fc1 = nn.Linear(n_chans1 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, 2)
    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)    # 64x64 -> 32x32
        out = self.resblocks(out)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)  # -> 16x16
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)  # conv3 reused -> 8x8
        out = out.view(-1, self.n_chans1 * 8 * 8)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out
model = Net(n_chans1=32, n_blocks=5)
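One subtlety worth flagging: the expression [ResBlock(n_chans=n_chans1)] * n_blocks repeats one and the same module instance, so all the "blocks" share a single set of weights (the parameter count below confirms it: the ResBlock parameters appear only once). A small standalone demonstration of the difference, using nn.Linear as a stand-in:

shared = nn.Sequential(*[nn.Linear(4, 4)] * 3)                      # one Linear, reused 3 times
independent = nn.Sequential(*[nn.Linear(4, 4) for _ in range(3)])   # three distinct Linear layers
print(sum(p.numel() for p in shared.parameters()))       # 20: a single weight matrix plus bias
print(sum(p.numel() for p in independent.parameters()))  # 60: three of them

If independent blocks are wanted, the list-comprehension form is the one to use.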

Let's display the model size:

In [178]:

model = Net()
numel_list = [p.numel() for p in model.parameters() if p.requires_grad]
sum(numel_list), numel_list

Out [178]:

(85090, [864, 32, 9216, 32, 9216, 32, 32, 32, 65536, 32, 64, 2])

A simple and clever trick to catch image shape mismatches and errors: before training the model, run a forward pass as a check:

In [180]:

model(mydataset[0][0].unsqueeze(0))
# unsqueeze adds a batch dimension, simulating a batch of one image

Out [180]:

tensor([[0.7951, 0.6417]], grad_fn=<AddmmBackward>)

 

It works!

Model training

Although it is not the goal of this post, now that we have a model, why not try to train it, especially since PyTorch provides the DataLoader for free.

The DataLoader's job is to sample mini-batches from a dataset, with flexible sampling strategies; it can shuffle the dataset automatically before the mini-batches are loaded. For reference, see https://pytorch.org/docs/stable/data.html

In [181]:

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print("Training on device {}.".format(device))
 
Training on device cpu.

In [182]:

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=64, shuffle=False) # note: no shuffling needed here

In [183]:

import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            outputs = model(imgs)            # forward pass
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()            # reset the accumulated gradients
            loss.backward()                  # backward pass
            optimizer.step()                 # update the weights
            loss_train += loss.item()
        if epoch == 1 or epoch % 5 == 0:
            print('{} Epoch {}, Training loss {}'.format(
                datetime.datetime.now(), epoch, float(loss_train)))
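Note that this loop keeps the model and the batches on the CPU: the device computed earlier is not actually used. Were a GPU available, a minimal sketch of the changes would be to move the model once, before creating the optimizer, and each mini-batch inside the loop:

model = Net().to(device)
for imgs, labels in train_loader:
    imgs, labels = imgs.to(device), labels.to(device)
    # ... forward / backward / step as above ...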

In [184]:

model = Net()
# to resume from a previously saved model instead:
# models_data_path = './raw_data/models'
# model.load_state_dict(torch.load(models_data_path + '/cats_dogs.pt'))

In [185]:

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

training_loop(
    n_epochs = 20,
    optimizer = optimizer,
    model = model,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

2020-09-15 19:33:03.105620 Epoch 1, Training loss 224.0338312983513
2020-09-15 20:01:35.993491 Epoch 5, Training loss 153.11289536952972
2020-09-15 20:36:51.486071 Epoch 10, Training loss 113.09166505932808
2020-09-15 21:11:37.375586 Epoch 15, Training loss 85.17814277857542
2020-09-15 21:46:05.792975 Epoch 20, Training loss 59.60428727790713

In [189]:

for loader in [train_loader, valid_loader]:
    correct = 0
    total = 0
    with torch.no_grad():
        for imgs, labels in loader:
            outputs = model(imgs)
            _, predicted = torch.max(outputs, dim=1)  # index of the largest logit = predicted class
            total += labels.shape[0]
            correct += int((predicted == labels).sum())
    print("Accuracy: %f" % (correct / total))
Accuracy: 0.956756
Accuracy: 0.830800
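One caveat: the deeper variant of Net above contains BatchNorm layers, which behave differently during training and evaluation. Before measuring accuracy it is good practice to switch the model to evaluation mode, and back again before any further training:

model.eval()    # BatchNorm uses its running statistics instead of per-batch statistics
# ... run the accuracy loop above ...
model.train()   # restore training mode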

The performance is modest, but the point was only to test that a dataset organized as a Python object works, and that we can train an ordinary model on it.

Note also that, in order to speed up training on my CPU, all the images were downsampled to 64x64.

In [187]:

models_data_path = './raw_data/models'
torch.save(model.state_dict(), models_data_path + '/cats_dogs.pt')

In [ ]:

# to load the previously saved model
model = Net()
model.load_state_dict(torch.load(models_data_path + '/cats_dogs.pt'))

 

Appendix

Understanding DIM

The way to understand "dim" in PyTorch's sum or mean is that the specified dimension gets collapsed. So when dimension 0 (the rows) is collapsed, only one row remains: the operation runs along each whole column.

In [ ]:

a = torch.randn(2, 3)
a

In [ ]:

torch.mean(a)

In [ ]:

torch.mean(a, dim=0) # now collapsing rows, only one row will result

In [ ]:

torch.mean(a, dim=1) # now collapsing columns, only one column will remain
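A concrete example with fixed numbers may make this clearer:

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])  # shape (2, 3)
torch.mean(a, dim=0)  # tensor([2.5000, 3.5000, 4.5000]): rows collapsed, one mean per column
torch.mean(a, dim=1)  # tensor([2., 5.]): columns collapsed, one mean per row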

 
