Neural Networks - Derivation of the Backpropagation Algorithm


Algorithm Derivation

Notation

  • $C$ denotes the cost function
  • $w^l_{j,i}$ denotes the weight from the $i$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$
  • $b^l_i$ denotes the bias of the $i$-th neuron in layer $l$
  • $a^l_k$ denotes the output (activation) of the $k$-th neuron in layer $l$
  • $P_{l}$ denotes the number of neurons in layer $l$
  • $f$ denotes the activation function
  • $n$ denotes the index of the output layer (the last layer of the network)
  • $z^l_k$ denotes the weighted input of the $k$-th neuron in layer $l$, i.e.

$$ z^l_k = \sum^{P_{l-1}}_{k_{l-1}=1} w^l_{k,k_{l-1}} a^{l-1}_{k_{l-1}} + b^{l}_{k} \tag{1} $$
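Written for the whole layer at once, equation (1) is simply a matrix-vector product plus a bias vector. As a small illustration (a sketch of mine, not part of the original derivation; the names `W`, `b`, and `aPrev` are assumed), this is a one-liner with Eigen:

#include <Eigen/Dense>

// Hypothetical helper (names assumed): compute z^l = w^l * a^{l-1} + b^l for a
// whole layer, i.e. equation (1) evaluated for every neuron k of layer l at once.
// W is P_l x P_{l-1}; aPrev has P_{l-1} entries; b and the result have P_l entries.
Eigen::VectorXd layerWeightedInput(const Eigen::MatrixXd &W,
                                   const Eigen::VectorXd &b,
                                   const Eigen::VectorXd &aPrev)
{
    return W * aPrev + b;
}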

Two Assumptions

(1) Assume that $C$ can be written as a sum of per-neuron loss terms over the output layer, i.e. $$ C = \sum^{P_n}_{k=1} C^n_k \tag{2} $$
(2) Assume that the loss term of each output-layer neuron is a function of that neuron's output (activation) only, i.e. $$ C^n_k = C^n_k(a^n_k) \tag{3} $$
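As a concrete example (my choice; the derivation itself does not depend on it), the quadratic cost satisfies both assumptions:

$$ C = \frac{1}{2} \sum^{P_n}_{k=1} (a^n_k - y_k)^2, \qquad C^n_k = \frac{1}{2} (a^n_k - y_k)^2 $$

where $y_k$ is the desired output of the $k$-th output neuron, so each $C^n_k$ depends only on $a^n_k$.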

Derivation of the Formulas

Our goal is to compute the gradient of the cost with respect to every weight and bias in the network, i.e.
$$ \frac{\partial C}{\partial w^{l}_{j,i}} \ \text{and} \ \frac{\partial C}{\partial b^l_j} \tag{4} $$
where $l$ ranges over all layers of the network except the input layer, and $i$, $j$ index the neurons within a layer.

We begin by deriving $ \dfrac{\partial C}{\partial w^{l}_{j,i}} $:


\begin{align*}
& \frac{\partial{C}}{\partial{w_{j,i}^{l}}} \\ &= \sum^{P_n}_{k_{n}=1} \frac{\partial C^{n}_{k_n}}{\partial w^{l}_{j,i}} \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial w^{l}_{j,i}} \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \frac{\partial z^{n}_{k_n}}{\partial w^{l}_{j,i}} \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \cdot \frac{\partial }{\partial w^{l}_{j,i}} (\sum_{k_{n-1} = 1}^{P_{n-1}}w^{n}_{k_n,k_{n-1}} a^{n-1}_{k_{n-1}} + b^{n}_{k_n}) \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \cdot \sum_{k_{n-1} = 1}^{P_{n-1}} w^{n}_{k_n,k_{n-1}} \frac{\partial a^{n-1}_{k_{n-1}}}{\partial w^{l}_{j,i}} \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \cdot \sum_{k_{n-1} = 1}^{P_{n-1}} w^{n}_{k_n,k_{n-1}} \frac{\partial a^{n-1}_{k_{n-1}}}{\partial z^{n-1}_{k_{n-1}}} \cdot \frac{\partial z^{n-1}_{k_{n-1}}}{\partial w^{l}_{j,i}} \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \cdot \sum_{k_{n-1} = 1}^{P_{n-1}} w^{n}_{k_n,k_{n-1}} \frac{\partial a^{n-1}_{k_{n-1}}}{\partial z^{n-1}_{k_{n-1}}} \cdot \frac{\partial}{\partial w^{l}_{j,i}} (\sum_{k_{n-2} = 1}^{P_{n-2}} w^{n-1}_{k_{n-1},k_{n-2}} a^{n-2}_{k_{n-2}} + b^{n-1}_{k_{n-1}}) \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}} \cdot \sum_{k_{n-1} = 1}^{P_{n-1}} w^{n}_{k_n,k_{n-1}} \frac{\partial a^{n-1}_{k_{n-1}}}{\partial z^{n-1}_{k_{n-1}}} \cdot \sum_{k_{n-2} = 1}^{P_{n-2}} w^{n-1}_{k_{n-1},k_{n-2}} \frac{\partial a^{n-2}_{k_{n-2}}}{\partial w^{l}_{j,i}} \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\sum_{k_{n-1} = 1}^{P_{n-1}} w^{n}_{k_n,k_{n-1}} \frac{\partial a^{n-1}_{k_{n-1}}}{\partial z^{n-1}_{k_{n-1}}}}_{Layer\ n-1} \cdot \underbrace{\sum_{k_{n-2} = 1}^{P_{n-2}} w^{n-1}_{k_{n-1},k_{n-2}} \frac{\partial a^{n-2}_{k_{n-2}}}{\partial z^{n-2}_{k_{n-2}}}}_{Layer\ n-2} \cdot \frac{\partial z^{n-2}_{k_{n-2}}}{\partial w^{l}_{j,i}} \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+1}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+1} \cdot \underbrace{\sum_{k_{l} = 1}^{P_{l}} w^{l+1}_{k_{l+1},k_{l}} \frac{\partial a^{l}_{k_{l}}}{\partial z^{l}_{k_{l}}}}_{Layer\ l} \cdot \frac{\partial z^{l}_{k_{l}}}{\partial w^{l}_{j,i}} \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+1}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+1} \cdot \underbrace{\sum_{k_{l} = 1}^{P_{l}} w^{l+1}_{k_{l+1},k_{l}} \frac{\partial a^{l}_{k_{l}}}{\partial z^{l}_{k_{l}}} \cdot \frac{\partial}{\partial w^{l}_{j,i}} (\sum_{k_{l-1} = 1}^{P_{l-1}} w^{l}_{k_{l},k_{l-1}}a^{l-1}_{k_{l-1}} + b^{l}_{k_{l}}) }_{\text{this term is nonzero only when $k_{l-1} = i$ and $k_l = j$}} \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+1}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+1} \cdot \underbrace{w^{l+1}_{k_{l+1},j}}_{=\frac{\partial z^{l+1}_{k_{l+1}}}{\partial a^{l}_{j}}} \frac{\partial a^l_j}{\partial z^l_j} \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+1}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+1} \cdot \frac{\partial z^{l+1}_{k_{l+1}}}{\partial a^{l}_{j}} \frac{\partial a^l_j}{\partial z^l_j} \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+1}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+1} \cdot \frac{\partial z^{l+1}_{k_{l+1}}}{\partial z^l_j} \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+2}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+2} \cdot \underbrace{\sum_{k_{l+1} = 1}^{p_{l+1}} w^{l+2}_{k_{l+2},k_{l+1}} \frac{\partial a^{l+1}_{k_{l+1}}}{\partial z^{l+1}_{k_{l+1}}}}_{Layer\ l+1} \cdot \frac{\partial z^{l+1}_{k_{l+1}}}{\partial z^l_j} \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+2}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+2} \cdot \frac{\partial}{\partial z^{l}_{j}} (\sum^{P_{l+1}}_{k_{l+1} = 1}w^{l+2}_{k_{l+2},k_{l+1}}a^{l+1}_{k_{l+1}} + \text{any constant}) \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+2}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+2} \cdot \frac{\partial}{\partial z^{l}_{j}} (\sum^{P_{l+1}}_{k_{l+1} = 1}w^{l+2}_{k_{l+2},k_{l+1}}a^{l+1}_{k_{l+1}} + b^{l+2}_{k_{l+2}}) \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \underbrace{\prod^{l+2}_{m=n-1} \sum_{k_m = 1}^{p_m} w^{m+1}_{k_{m+1},k_{m}} \frac{\partial a^{m}_{k_{m}}}{\partial z^{m}_{k_{m}}}}_{Layer\ n-1\ \backsim\ l+2} \cdot \frac{\partial z^{l+2}_{k_{l+2}}}{\partial z^{l}_{j}} \cdot a^{l-1}_i \\
&= \underbrace{\sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^{n}_{k_n}}}_{Layer\ n} \cdot \frac{\partial z^{n}_{k_{n}}}{\partial z^{l}_{j}} \cdot a^{l-1}_i \\
&= \sum_{k_n=1}^{P_n} \frac{\partial C^{n}_{k_n}}{\partial z^{l}_{j}} \cdot a^{l-1}_i \\
&= \frac{\partial C}{\partial z^{l}_{j}} \cdot a^{l-1}_i \tag{5}\\
\end{align*}

Thus, we obtain:


$$ \frac{\partial{C}}{\partial{w_{j,i}^{l}}} = \frac{\partial C}{\partial z^{l}_{j}} \cdot a^{l-1}_i \tag{6} $$

By an analogous derivation (the only change is that $\frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = 1$, whereas $\frac{\partial z^{l}_{j}}{\partial w^{l}_{j,i}} = a^{l-1}_{i}$), we obtain:


$$ \frac{\partial{C}}{\partial{b_{j}^{l}}} = \frac{\partial C}{\partial z^{l}_{j}} \tag{7} $$

Writing $\nabla_C$ for the gradient of the cost and abbreviating $\frac{\partial C}{\partial z}$ as $\delta$, we obtain the gradient formulas:


$$ \nabla_{C}{w^{l}_{j,i}} = \delta^{l}_{j} \cdot a^{l-1}_{i} \tag{8} $$

$$ \nabla_{C}{b^{l}_{j}} = \delta^{l}_{j} \tag{9} $$

Next, we find $\delta^{l}_{j}$.

We first compute $\delta^n$ for the output layer (layer $n$):


\begin{align*} \delta^{n}_{j} = \frac{\partial C}{\partial z^{n}_{j}} &= \sum^{P_{n}}_{k_{n}=1} \frac{\partial C^{n}_{k_{n}}}{\partial a^{n}_{k_n}} \frac{\partial a^{n}_{k_n}}{\partial z^n_j} \\ &= \frac{\partial C^n_j}{\partial a^n_j} \frac{\partial a^n_j}{\partial z^n_j} \\ &= C'(a^n_j) \cdot f'(z^n_j) \tag{10} \end{align*}
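For instance (my example; any cost satisfying the two assumptions works), with the quadratic cost $C^n_j = \frac{1}{2}(a^n_j - y_j)^2$ equation (10) becomes

$$ \delta^n_j = (a^n_j - y_j) \cdot f'(z^n_j), $$

which is exactly the output-layer delta computed by the C++ program at the end of this post.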

How do we find $\delta^l$ for an arbitrary layer? We can already compute $\delta^n$ for the output layer, so the question is whether $\delta^l$ of a layer can be obtained from $\delta^{l+1}$ of the layer after it. If so, then starting from $\delta^n$ and repeating this step gives $\delta^l$ for every layer. With this idea in mind, we compute $\delta^{l}$ from $\delta^{l+1}$:


\begin{align*}
\delta^{l}_{j} = \frac{\partial C}{\partial z^{l}_{j}} &= \sum^{P_{l+1}}_{k_{l+1}=1} \frac{\partial C}{\partial z^{l+1}_{k_{l+1}}} \frac{\partial z^{l+1}_{k_{l+1}}}{\partial z^{l}_{j}} \\
&= \sum^{P_{l+1}}_{k_{l+1}=1} \frac{\partial C}{\partial z^{l+1}_{k_{l+1}}} \frac{\partial}{\partial z^{l}_{j}} (\sum^{P_{l}}_{k_{l}=1} w^{l+1}_{k_{l+1},k_{l}}a^{l}_{k_{l}} + b^{l+1}_{k_{l+1}}) \\
&= \sum^{P_{l+1}}_{k_{l+1}=1} \delta^{l+1}_{k_{l+1}} w^{l+1}_{k_{l+1},j} \frac{\partial a^{l}_{j}}{\partial z^{l}_{j}} \\
&= \sum^{P_{l+1}}_{k_{l+1}=1} \delta^{l+1}_{k_{l+1}} w^{l+1}_{k_{l+1},j} f'(z^{l}_{j}) \tag{11}\\
\end{align*}

Through the above computation we have successfully expressed $\delta^{l}$ in terms of $\delta^{l+1}$, which is exactly what we wanted.

Thus, we obtain the backpropagation formulas:


$$ \delta^{n}_{j} = C'(a^n_j) \cdot f'(z^n_j) \tag{12} $$
$$ \delta^{l}_{j} = \sum^{P_{l+1}}_{k_{l+1}=1} \delta^{l+1}_{k_{l+1}} w^{l+1}_{k_{l+1},j} f'(z^{l}_{j}) \tag{13} $$
$$ \nabla_{C}{w^{l}_{j,i}} = \delta^{l}_{j} \cdot a^{l-1}_{i} \tag{14} $$
$$ \nabla_{C}{b^{l}_{j}} = \delta^{l}_{j} \tag{15} $$
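Formulas (12)-(15) already constitute the whole algorithm: run a forward pass to obtain all $z^l$ and $a^l$, compute $\delta^n$ at the output layer with (12), propagate it backwards one layer at a time with (13), and read off every weight and bias gradient with (14) and (15).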

Many linear algebra libraries, such as Eigen for C++ and NumPy for Python, make matrix computation fast and convenient. To use them for backpropagation, it is worth rewriting the formulas above in matrix and vector form.
We use $\delta^l$ to denote the column vector whose elements are the $\delta^l_k$, i.e.:


$$ \delta^l = (\delta^l_1, \delta^l_2, \dots, \delta^l_{P_l})^T \tag{16} $$

Analogously to $\delta^l$, we introduce the column vectors $a^l$, $z^l$, and $b^l$.
In addition, we introduce the matrix $w^l$, whose $(i,j)$ entry is the weight connecting the $j$-th neuron in layer $l-1$ to the $i$-th neuron in layer $l$.

With this notation, the formulas above can be written in matrix form ($\odot$ denotes the elementwise, or Hadamard, product):


$$ \delta^n = C'(a^n) \odot f'(z^n) \tag{17} $$
$$ \delta^l = \left( (w^{l+1})^{T} \cdot \delta^{l+1} \right) \odot f'(z^l) \tag{18} $$
$$ \nabla_{C}{w^{l}} = \delta^{l} \cdot (a^{l-1})^{T} \tag{19} $$
$$ \nabla_{C}{b^{l}} = \delta^{l} \tag{20} $$
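To make the correspondence to code concrete, here is a minimal Eigen sketch (mine, with assumed names and container layout; the full program below does the same thing inside its training loop) of formulas (17)-(20) for a single sample, with the quadratic cost and sigmoid activation:

#include <cmath>
#include <vector>

#include <Eigen/Dense>

// Sketch only: weights[l], zs[l], as[l] hold w^l, z^l, a^l (index 0 of
// weights/zs is unused, as in the full program); y is the desired output.
void backpropOneSample(const std::vector<Eigen::MatrixXd> &weights,
                       const std::vector<Eigen::VectorXd> &zs,
                       const std::vector<Eigen::VectorXd> &as,
                       const Eigen::VectorXd &y,
                       std::vector<Eigen::MatrixXd> &nablaW,
                       std::vector<Eigen::VectorXd> &nablaB)
{
    // f'(z) for the sigmoid activation: f'(z) = f(z) * (1 - f(z))
    auto fprime = [](double z) { double s = 1.0 / (1.0 + std::exp(-z)); return s * (1.0 - s); };

    const int n = static_cast<int>(weights.size()) - 1;  // index of the output layer

    // (17): delta^n = C'(a^n) ⊙ f'(z^n); for the quadratic cost, C'(a^n) = a^n - y
    Eigen::VectorXd delta = (as[n] - y).cwiseProduct(zs[n].unaryExpr(fprime));
    nablaW[n] = delta * as[n - 1].transpose();  // (19)
    nablaB[n] = delta;                          // (20)

    // (18): propagate delta backwards through the hidden layers
    for (int layer = n - 1; layer >= 1; --layer)
    {
        delta = (weights[layer + 1].transpose() * delta).cwiseProduct(zs[layer].unaryExpr(fprime));
        nablaW[layer] = delta * as[layer - 1].transpose();  // (19)
        nablaB[layer] = delta;                              // (20)
    }
}

The resulting gradients would then be averaged over a mini-batch and used for a gradient descent step, which is what the full program below does.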

Bonus: An ANN Implementation in C++

The following program implements an ANN in C++ and trains it on the MNIST dataset, using Eigen as the matrix library. To run it, configure the include path for the Eigen headers and change the MNIST dataset paths passed to the `dataLoader.load` calls in `main()` to match your local files.

#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <functional>
#include <iostream>
#include <memory>
#include <random>
#include <sstream>
#include <string>
#include <vector>

#include <Eigen/Dense>

#include <Windows.h>

static inline bool isBigEnd()
{
    int16_t value = 0x1234;
    return *reinterpret_cast<int8_t*>(&value) == 0x12;
}

static inline int32_t littleEnd2BigEnd(int32_t value)
{
    return ((0xFF000000 & value) >> 24) | ((0x00FF0000 & value) >> 8)
        | ((0x0000FF00 & value) << 8) | ((0x000000FF & value) << 24);
}
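
// Note: the MNIST IDX files store their 32-bit header integers in big-endian
// order, so on a little-endian machine the bytes must be swapped before use.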

class Sample
{
    // Constructor & Destructor
public:
    Sample(int id, const Eigen::VectorXd &input, const Eigen::VectorXd &output)
        : m_id{ id }, m_input{ input }, m_output{ output } 
    {
    }

    // Accessor & Mutator
public:
    int id() const { return m_id; }
    const Eigen::VectorXd input() const { return m_input; }
    const Eigen::VectorXd output() const { return m_output; }

private:
    int m_id = { -1 };
    Eigen::VectorXd m_input;
    Eigen::VectorXd m_output;
};

class DataLoader
{
public:
    std::vector<Sample> load(const std::string &inputFilePath, const std::string &outputFilePath)
    {
        /*
         * Read image data from the MNIST file
         */

        std::cout << "Loading image data from MNIST file ..." << std::endl;
    
        // Open the file
        std::ifstream finImage(inputFilePath.c_str(), std::ifstream::in | std::ifstream::binary);
        if (!finImage)
        {
            throw std::runtime_error("Image file can not be opened!");
        }
        std::shared_ptr<std::ifstream> finImageCloser(&finImage, [](std::ifstream *fin) { fin->close(); });
    
        // Read the magic number
        int32_t magicNumber = 0;
        finImage.read(reinterpret_cast<char*>(&magicNumber), 4);
        if (!isBigEnd())
        {
            magicNumber = littleEnd2BigEnd(magicNumber);
        }
        assert(magicNumber == 0x00000803);
    
        // Read the number of images
        int32_t images = 0;
        finImage.read(reinterpret_cast<char*>(&images), 4);
        if (!isBigEnd())
        {
            images = littleEnd2BigEnd(images);
        }
        assert(images > 0);  // 60000 for the training set, 10000 for the test set
        std::cout << "Number of images: " << images << std::endl;
    
        // Read the image dimensions
        int32_t rows = 0;
        finImage.read(reinterpret_cast<char*>(&rows), 4);
        if (!isBigEnd())
        {
            rows = littleEnd2BigEnd(rows);
        }
        assert(rows == 28);
        std::cout << "Number of rows: " << rows << std::endl;
        int32_t cols = 0;
        finImage.read(reinterpret_cast<char*>(&cols), 4);
        if (!isBigEnd())
        {
            cols = littleEnd2BigEnd(cols);
        }
        assert(cols == 28);
        std::cout << "Number of columns: " << cols << std::endl;
    
        // Load all image data at once
        const uint32_t allImageDataSize = rows * cols * images;
        std::unique_ptr<uint8_t[]> allImageData{ new uint8_t[allImageDataSize]() };
        finImage.read(reinterpret_cast<char*>(allImageData.get()), allImageDataSize);
    
        // Close the file
        finImageCloser.reset();
    
        std::cout << "Loading image data from MNIST file done!" << std::endl;
    
        /*
         * Read the labels corresponding to the images from the MNIST file
         */
    
        std::cout << "Loading label data from MNIST file ..." << std::endl;
    
        // Open the file
        std::ifstream finLabel(outputFilePath.c_str(), std::ifstream::in | std::ifstream::binary);
        if (!finLabel)
        {
            throw std::runtime_error("Label file can not be opened!");
        }
        std::shared_ptr<std::ifstream> finLabelCloser(&finLabel, [](std::ifstream *finLabel) { finLabel->close(); });
    
        // Read the magic number
        magicNumber = 0;
        finLabel.read(reinterpret_cast<char*>(&magicNumber), 4);
        if (!isBigEnd())
        {
            magicNumber = littleEnd2BigEnd(magicNumber);
        }
        assert(magicNumber == 0x00000801);
    
        // Read the number of labels
        int32_t labels = 0;
        finLabel.read(reinterpret_cast<char*>(&labels), 4);
        if (!isBigEnd())
        {
            labels = littleEnd2BigEnd(labels);
        }
        assert(labels == images);  // the number of labels should equal the number of images
        std::cout << "Number of labels: " << labels << std::endl;
    
        // Load all label data at once
        std::unique_ptr<uint8_t[]> allLabelData{ new uint8_t[labels]() };
        finLabel.read(reinterpret_cast<char*>(allLabelData.get()), labels);
    
        std::cout << "Loading label data from MNIST file done!" << std::endl;
    
        /*
         * Organize the loaded data: store each image and its label in a Sample object
         */
    
        std::vector<Sample> samples;  // holds all samples
    
        std::cout << "Preparing data ...";
        for (int image = 0; image != images; ++image)
        {
            std::cout << "\rPreparing data " << image + 1 << "/" << images << " "
                << static_cast<int>(static_cast<double>(image + 1) * 100 / images)
                << "%";
    
            // read input neuron value from image data
            Eigen::VectorXd input(rows * cols);
            for (int pixel = 0; pixel < rows * cols; ++pixel)
            {
                // note that we use gray scaled pixel value here
                input(pixel) = allImageData[image * rows * cols + pixel] / 255.0;
            }
    
            // read desired output neuron value from label data
            Eigen::VectorXd output = Eigen::VectorXd::Zero(10);
            output(allLabelData[image]) = 1.0;
    
            // put sample to vector
            samples.push_back(Sample(image, input, output));
        }
        std::cout << std::endl;
    
        std::cout << "Preparing data done!" << std::endl;
    
        return samples;
    }
};

double sigmoid(double z)
{
    return 1.0 / (1.0 + std::exp(-z));
}
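
// Note: the sigmoid derivative can be expressed through the sigmoid itself,
// f'(z) = f(z) * (1 - f(z)); sigmoidDerivative below uses exactly this identity.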

double sigmoidDerivative(double z)
{
    return sigmoid(z) * (1 - sigmoid(z));
}

int main()
{
    /*
     * Hyperparameter settings
     */

    const int NUMBER_OF_LAYER = 3;
    constexpr int NEURON_OF_LAYER[NUMBER_OF_LAYER] = { 784, 60, 10 };
    const int EPOCH_COUNT = 400;
    const int MINIBATCH_SIZE = 10;
    const double ETA = 0.1;  // learning rate
    const double LAMBDA = 5.0;  // regularization parameter
    
    /*
     * Load the training and test data
     */
    
    DataLoader dataLoader;
    
    std::cout << "Loading training data ..." << std::endl;
    
    std::vector<Sample> trainingData = dataLoader.load(
        "F:\\deeplearning\\resources\\MNIST\\train-images-idx3-ubyte\\train-images.idx3-ubyte",
        "F:\\deeplearning\\resources\\MNIST\\train-labels-idx1-ubyte\\train-labels.idx1-ubyte");
    
    std::cout << "Loading training data done!" << std::endl;
    
    std::cout << "Loading test data ..." << std::endl;
    
    std::vector<Sample> testData = dataLoader.load(
        "F:\\deeplearning\\resources\\MNIST\\t10k-images-idx3-ubyte\\t10k-images.idx3-ubyte",
        "F:\\deeplearning\\resources\\MNIST\\t10k-labels-idx1-ubyte\\t10k-labels.idx1-ubyte");
    
    std::cout << "Loading test data done!" << std::endl;
    
    /*
     * Create the network parameters: weights and biases
     */
    
    std::vector<Eigen::MatrixXd> weights(NUMBER_OF_LAYER);
    std::vector<Eigen::VectorXd> biases(NUMBER_OF_LAYER);
    for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
    {
        weights[layer] = Eigen::MatrixXd::Random(NEURON_OF_LAYER[layer], NEURON_OF_LAYER[layer - 1]);
        biases[layer] = Eigen::VectorXd::Random(NEURON_OF_LAYER[layer]);
    }
    
    /*
     * Start training
     */
    
    const int MINIBATCH_COUNT = static_cast<int>(trainingData.size() / MINIBATCH_SIZE);
    
    for (int epoch = 0; epoch < EPOCH_COUNT; ++epoch)
    {
        std::cout << "Training epoch " << epoch + 1 << " ..." << std::endl;
    
        // Shuffle the training set
        std::random_shuffle(trainingData.begin(), trainingData.end());
    
        for (int miniBatch = 0; miniBatch < MINIBATCH_COUNT; ++miniBatch)
        {
            std::cout << "\rTraining mini batch: " << miniBatch + 1;
    
            // Create gradient accumulators for this mini-batch
            std::vector<Eigen::MatrixXd> nablaWeights(NUMBER_OF_LAYER);
            std::vector<Eigen::VectorXd> nablaBiases(NUMBER_OF_LAYER);
            for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
            {
                nablaWeights[layer] = Eigen::MatrixXd::Zero(NEURON_OF_LAYER[layer], NEURON_OF_LAYER[layer - 1]);
                nablaBiases[layer] = Eigen::VectorXd::Zero(NEURON_OF_LAYER[layer]);
            }
    
            for (int sample = 0; sample < MINIBATCH_SIZE; ++sample)
            {
                // Feedforward
                std::vector<Eigen::VectorXd> weightedInputs(NUMBER_OF_LAYER);
                std::vector<Eigen::VectorXd> activations(NUMBER_OF_LAYER);
                Eigen::VectorXd activation = trainingData[miniBatch * MINIBATCH_SIZE + sample].input();
                activations[0] = activation;
                for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
                {
                    Eigen::VectorXd weightedInput = weights[layer] * activation + biases[layer];
                    weightedInputs[layer] = weightedInput;
                    activation = weightedInput.unaryExpr(std::ptr_fun(sigmoid));
                    activations[layer] = activation;
                }
    
                // Backpropagation
                // Compute the delta of the output layer
                Eigen::VectorXd output = trainingData[miniBatch * MINIBATCH_SIZE + sample].output();
                Eigen::VectorXd delta = (activations[NUMBER_OF_LAYER - 1] - output).cwiseProduct(
                    weightedInputs[NUMBER_OF_LAYER - 1].unaryExpr(std::ptr_fun(sigmoidDerivative)));
                // Accumulate the gradients
                nablaWeights[NUMBER_OF_LAYER - 1] += delta * activations[NUMBER_OF_LAYER - 2].transpose();
                nablaBiases[NUMBER_OF_LAYER - 1] += delta;
                for (int layer = NUMBER_OF_LAYER - 2; layer > 0; --layer)
                {
                    // Compute the current layer's delta from the next layer's delta
                    delta = (weights[layer + 1].transpose() * delta).cwiseProduct(
                        weightedInputs[layer].unaryExpr(std::ptr_fun(sigmoidDerivative)));
    
                    // Accumulate the gradients
                    nablaWeights[layer] += delta * activations[layer - 1].transpose();
                    nablaBiases[layer] += delta;
                }
            }
    
            // Gradient descent
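            // The factor (1 - ETA * LAMBDA / n) below implements L2 regularization
            // (weight decay); the subsequent updates subtract the mini-batch-averaged
            // gradients scaled by the learning rate ETA.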
            for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
            {
                weights[layer] *= (1.0 - ETA * LAMBDA /trainingData.size());
                weights[layer] -= nablaWeights[layer].unaryExpr([ETA, MINIBATCH_SIZE](double weight) {
                    return ETA / static_cast<double>(MINIBATCH_SIZE) * weight;
                });
                biases[layer] -= nablaBiases[layer].unaryExpr([ETA, MINIBATCH_SIZE](double bias) { 
                    return ETA / static_cast<double>(MINIBATCH_SIZE) * bias;
                });
            }
        }
        std::cout << std::endl;
    
        // Evaluate on the test set
        std::cout << "Testing ..." << std::endl;
        int correctCount = 0;
        for (int test = 0; test != testData.size(); ++test)
        {
            // Feedforward
            Eigen::VectorXd activation = testData[test].input();
            for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
            {
                activation = (weights[layer] * activation + biases[layer]).unaryExpr(std::ptr_fun(sigmoid));
            }
    
            int trainNumber = -1;
            activation.maxCoeff(&trainNumber);
    
            int correctNumber = -1;
            testData[test].output().maxCoeff(&correctNumber);
    
            if (trainNumber == correctNumber)
            {
                ++correctCount;
            }
        }
        std::cout << "Testing done!" << std::endl;
    
        std::cout << "Writing weights and biases to file ..." << std::endl;
    
        char modulePath[MAX_PATH] = { 0 };
        ::GetModuleFileNameA(NULL, modulePath, MAX_PATH);
        *(strrchr(modulePath, '\\') + 1) = '\0';
        std::ostringstream oss;
        oss << modulePath << "epoch" << epoch + 1 << ".dat" << std::ends;
        std::ofstream fout(oss.str().c_str(), std::ofstream::out | std::ofstream::binary);
        std::shared_ptr<std::ofstream> foutCloser(&fout, [](std::ofstream *fout) { fout->close(); });
        fout.write(reinterpret_cast<const char*>(&NUMBER_OF_LAYER), sizeof (int));
        for (int layer = 0; layer != NUMBER_OF_LAYER; ++layer)
        {
            fout.write(reinterpret_cast<const char*>(&NEURON_OF_LAYER[layer]), sizeof (int));
        }
        for (int layer = 1; layer != NUMBER_OF_LAYER; ++layer)
        {
            fout.write(reinterpret_cast<const char*>(&weights[layer](0)), sizeof(double) * weights[layer].size());
            fout.write(reinterpret_cast<const char*>(&biases[layer](0)), sizeof(double) * biases[layer].size());
        }
        foutCloser.reset();
    
        std::cout << "Writing weights and biases to file done!" << std::endl;
    
        std::cout << "Training epoch " << epoch + 1 << " Done! "
            << "Recognition rate: " << correctCount << "/" << testData.size()
            << std::endl;
    }
    
    std::cin.get();
    
    return 0;
}

Copyright notice: this is an original article; please credit the source when reposting. http://cynhard.com/2018/06/02/NN-Derivation-of-Back-Propagation-Algorithm/
