An early warning mechanism is crucial for maintaining the security and reliability of a power grid system. Early warning remains a difficult task in smart grid systems because of the complex environments encountered in practice. In this paper, to address the lack of vision-based datasets and models for early warning classification, we construct a large-scale image dataset, EWSPG1.0, which contains 12,113 images annotated with five levels of early warning. In addition, 104,448 object instances covering ten categories of high-risk objects and power grid infrastructure are annotated with labels, bounding boxes, and polygon masks. We also propose a local-to-global perception framework for early warning classification, EWNet. Specifically, a local patch responsor is trained on image patches extracted from the training set according to the labeled bounding boxes of objects. Its capability to recognize high-risk objects and power grid infrastructure is then transferred to the framework by loading the trained local patch responsor with frozen weights. The resulting features are fed into a feature integration module and a global classification module, which classify the early warning level of the entire image. To evaluate the proposed framework, we benchmark it on our dataset against 11 state-of-the-art classification models based on deep convolutional neural networks (CNNs). Experimental results demonstrate the effectiveness of our method in terms of Top-1 classification accuracy. They also indicate that vision-based early warning classification under power grid surveillance remains challenging and needs further study.
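As a rough illustration of the patch-extraction step described above, the following minimal sketch crops one patch per annotated bounding box. It is not the authors' implementation: the function name `extract_patches`, the `(x_min, y_min, x_max, y_max)` box convention with exclusive maxima, the nested-list image representation, and the category names `crane` and `tower` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values (assumed format)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), maxima exclusive (assumed)


def extract_patches(
    image: Image, annotations: Sequence[Tuple[str, Box]]
) -> List[Tuple[str, Image]]:
    """Crop one labeled patch per bounding box, clamping boxes to the image."""
    height = len(image)
    width = len(image[0]) if height else 0
    patches = []
    for label, (x0, y0, x1, y1) in annotations:
        # Clamp the box so that annotations touching the border stay valid.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue  # box lies entirely outside the image
        patches.append((label, [row[x0:x1] for row in image[y0:y1]]))
    return patches


# Hypothetical 4x4 image and annotations, for illustration only.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
annotations = [
    ("crane", (1, 1, 3, 3)),
    ("tower", (2, 0, 10, 2)),  # extends past the right edge, so it is clamped
    ("crane", (5, 5, 6, 6)),   # entirely outside the image, so it is skipped
]
patches = extract_patches(image, annotations)
```

Patches of this kind, paired with their object labels, are what a local patch classifier would be trained on before its weights are frozen and reused for whole-image classification.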