Action Recognition - Summary of the Two-Stream CNN paper

About the Paper

Key Contributions

Action recognition in videos

Recognition of human actions in videos is a challenging task, as it involves the temporal component of videos in addition to spatial recognition. In this paper, the authors aim to extend deep CNNs to action recognition. Previously, this task was addressed by feeding stacked video frames into a CNN, but the results were considerably worse than those obtained with hand-crafted features.

Therefore, the authors investigate a different architecture based on two separate recognition streams, spatial and temporal, which are combined by late fusion.

[Figure: two-stream architecture for video classification]

The idea is that videos can be decomposed into spatial and temporal components. The spatial part, carried by individual frames, holds information about the scenes and objects depicted in the video. The temporal part, in the form of motion across frames, conveys the movement of the observer (the camera) and of the objects. The two parts are implemented as separate deep CNNs whose softmax scores are combined by late fusion.

The authors consider two fusion methods: averaging the softmax scores, and training a multi-class linear SVM on stacked \(L_{2}\)-normalised softmax scores as features.
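As an illustration, here is a minimal sketch of the two fusion schemes, assuming each stream has already produced per-class softmax scores (the function names, shapes, and the 101-class example are assumptions made for this example, not code from the paper):

```python
import numpy as np

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax scores of the two streams.

    Both inputs are assumed to be 1-D arrays of length num_classes that
    already sum to one (i.e. softmax outputs).
    """
    return (spatial_scores + temporal_scores) / 2.0

def fuse_for_svm(spatial_scores, temporal_scores):
    """Build the feature vector for the SVM-based fusion: L2-normalise each
    stream's softmax scores and stack them. A multi-class linear SVM would
    then be trained on these stacked features.
    """
    s = spatial_scores / np.linalg.norm(spatial_scores)
    t = temporal_scores / np.linalg.norm(temporal_scores)
    return np.concatenate([s, t])

# Example with random scores over 101 classes (e.g. UCF-101)
rng = np.random.default_rng(0)
spatial = rng.random(101); spatial /= spatial.sum()
temporal = rng.random(101); temporal /= temporal.sum()
print(fuse_by_averaging(spatial, temporal).argmax())   # fused prediction
print(fuse_for_svm(spatial, temporal).shape)           # (202,) feature vector
```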

The spatial convolutional network is essentially an image classification architecture, which can be pre-trained on a large image classification dataset such as ImageNet. The temporal convolutional network takes as input a stack of optical flow displacement fields between several consecutive video frames. A brief description of the optical flow CNN is given below.
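To make the two streams concrete, here is a minimal PyTorch sketch. It is only an illustration under stated assumptions: a ResNet-18 backbone stands in for the paper's own CNN architecture, the class count of 101 corresponds to UCF-101, and the stack length L = 10 is the value used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

L = 10  # number of consecutive flow fields stacked (value used in the paper)

# Spatial stream: an ordinary image-classification CNN, pre-trained on ImageNet.
# ResNet-18 is used here only as a stand-in for the paper's architecture.
spatial_net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
spatial_net.fc = nn.Linear(spatial_net.fc.in_features, 101)

# Temporal stream: same backbone, but the first convolution takes 2L channels
# (horizontal + vertical flow for each of the L frame pairs).
temporal_net = models.resnet18(weights=None)
temporal_net.conv1 = nn.Conv2d(2 * L, 64, kernel_size=7, stride=2,
                               padding=3, bias=False)
temporal_net.fc = nn.Linear(temporal_net.fc.in_features, 101)

rgb_frame = torch.randn(1, 3, 224, 224)       # a single RGB frame
flow_stack = torch.randn(1, 2 * L, 224, 224)  # stacked optical flow fields

spatial_scores = torch.softmax(spatial_net(rgb_frame), dim=1)
temporal_scores = torch.softmax(temporal_net(flow_stack), dim=1)
```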

Optical flow CNNs

The optical flow displacement fields explicitly describe the motion between video frames, which makes action recognition easier, as the network does not need to estimate motion implicitly.
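For illustration, a dense displacement field between two frames can be computed with OpenCV. The helper below is a sketch: Farneback's method is used only as a stand-in, not as the flow algorithm used in the paper.

```python
import cv2

def dense_flow(prev_frame, next_frame):
    """Compute a dense displacement field between two consecutive frames.

    Returns an (H, W, 2) array whose last axis holds the horizontal and
    vertical displacement of each pixel. Farneback's method is used here
    purely for illustration.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow
```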

[Figure: optical flow between consecutive frames]

The above set of images illustrates optical flow in detail.

The input to the temporal CNN contains multiple stacked flow fields. The two variations of the optical flow-based input are described below.

[Figure: optical flow stacking]

The above figure illustrates the first variation, optical flow stacking. The input can be seen as a set of displacement vector fields \(d_{t}\) between pairs of consecutive frames \(t\) and \(t + 1\). This method samples the displacement vectors \(d\) at the same location \((u, v)\) in multiple frames and stacks them as input to the temporal CNN.
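A minimal sketch of optical flow stacking, assuming the per-pair flow fields have already been computed (for example with a helper like the hypothetical dense_flow above):

```python
import numpy as np

def stack_flows(flows):
    """Optical flow stacking: concatenate the horizontal and vertical
    displacement channels of L consecutive flow fields into a single
    2L-channel input volume for the temporal CNN.

    flows: list of L arrays of shape (H, W, 2), where flows[k] is the
    displacement field between frames t+k and t+k+1. Every channel
    samples the displacement at the same location (u, v).
    """
    channels = []
    for flow in flows:
        channels.append(flow[..., 0])  # horizontal component
        channels.append(flow[..., 1])  # vertical component
    return np.stack(channels, axis=0)  # shape (2L, H, W)
```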

[Figure: trajectory stacking]

The above figure shows the second variation for motion representation, trajectory stacking. Here the method samples the displacement vectors at points \(p_{k}\) along the motion trajectory starting at \((u, v)\), and this stack is given as input to the model.
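A sketch of trajectory stacking for a single starting location under the same assumptions; the nearest-pixel rounding is a simplification introduced for this example.

```python
import numpy as np

def trajectory_stack(flows, u, v):
    """Trajectory stacking at a single starting location (u, v).

    Instead of sampling every flow field at the same pixel, each successive
    field is sampled at the point p_k reached by following the flow, so the
    stacked vectors describe motion along a trajectory.
    Returns an array of shape (2L,) for this location.
    """
    h, w = flows[0].shape[:2]
    x, y = float(u), float(v)
    samples = []
    for flow in flows:
        xi = int(round(min(max(x, 0), w - 1)))
        yi = int(round(min(max(y, 0), h - 1)))
        dx, dy = flow[yi, xi]      # displacement at the current point p_k
        samples.extend([dx, dy])
        x, y = x + dx, y + dy      # move to the next point along the trajectory
    return np.array(samples)
```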

Experiments

The above architectures were trained and evaluated on the standard video action recognition benchmarks UCF-101 and HMDB-51.

The following observations were drawn from the experimental results.

Conclusion

The two recognition streams are complementary. The combined deep architecture significantly outperforms previous attempts at using deep nets for video classification and is competitive with state-of-the-art shallow representations, despite being trained on relatively small datasets.