The most important part of deep learning, training the neural network, often requires processing large amounts of data and can take days to complete. Data parallelism is widely used for training deep neural networks on multiple GPUs in a single machine thanks to its simplicity. However, its scalability is bounded by the number of data transfers, mainly for exchanging and accumulating gradients among the GPUs. In this paper, we present a novel approach to data parallel training called CPU-GPU data parallel (CGDP) training that utilizes free CPU time on the host to speed up the training on the GPUs. We also present a cost model for analyzing and comparing the performance of both typical data parallel training and CGDP training. Using the cost model, we formally show why our approach outperforms the typical one and clarify the remaining issues. Finally, we explain how we optimized CGDP training by introducing chunks of layers and present a runtime algorithm that automatically finds a good configuration for the training. The algorithm is effective for very deep neural networks, which are the current trend in deep learning. Experimental results showed that we achieved speedups of $1.21\times$, $1.04\times$, $1.21\times$, and $1.07\times$ for four state-of-the-art neural networks: AlexNet, GoogLeNet-v1, VGGNet-16, and ResNet-152, respectively. Weak scaling efficiency greater than $90\%$ was achieved for all networks across four GPUs.
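To make the core idea concrete, the following is a minimal sketch of the overlap that CGDP training exploits: because the backward pass produces gradients layer by layer, a host-side CPU thread can start accumulating the gradients of already-finished layers while the GPUs are still computing the backward pass for the remaining ones. This is not the paper's implementation; the helper names (`cpu_accumulator`, the simulated per-layer gradients) are hypothetical, GPUs are stood in for by NumPy arrays, and a real system would coordinate with CUDA streams and events rather than a Python queue.

\begin{verbatim}
import threading
import queue
import numpy as np

# Hypothetical per-layer gradients from two simulated GPU workers.
NUM_LAYERS = 4
rng = np.random.default_rng(0)
grads = [[rng.standard_normal((256, 256)) for _ in range(NUM_LAYERS)]
         for _ in range(2)]

done = queue.Queue()               # layers whose backward pass has finished
accumulated = [None] * NUM_LAYERS  # host-side accumulated gradients

def cpu_accumulator():
    """Host thread: sums a layer's gradients as soon as its backward
    pass completes, instead of waiting for the whole pass to finish."""
    for _ in range(NUM_LAYERS):
        layer = done.get()
        accumulated[layer] = grads[0][layer] + grads[1][layer]

acc = threading.Thread(target=cpu_accumulator)
acc.start()

# The backward pass proceeds from the last layer to the first; each
# finished layer is handed to the CPU thread immediately, so gradient
# accumulation overlaps with the remaining backward computation.
for layer in reversed(range(NUM_LAYERS)):
    done.put(layer)

acc.join()
print("accumulated layers:", [g is not None for g in accumulated])
\end{verbatim}

The design point this illustrates is that per-layer granularity exposes the overlap; grouping layers into chunks, as the paper proposes, trades finer overlap for fewer, larger host-side transfers.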