The Media Lab at the City College of New York was founded in 2008, with Dr. YingLi Tian appointed as lab director. Our research group is dedicated to both fundamental and applied research in Computer Vision, Image/Video Understanding, Multimedia, Artificial Intelligence and Machine Learning, and Assistive Technology.

Our research topics include:

  • Text Spotting & Understanding:


Unambiguous Text Localization and Retrieval for Cluttered Scenes

Xuejian Rong, Chucai Yi, Yingli Tian

CVPR, 2017 (Spotlight)

To utilize text instances for understanding natural scenes, we have proposed a framework that combines image-based text localization with language-based context description for text instances.
Specifically, we explore the task of unambiguous text localization and retrieval, to accurately localize a specific targeted text instance in a cluttered image given a natural language description that refers to it.


Towards Accurate Instance-Level Text Spotting With Guided Attention

Haiyan Wang, Xuejian Rong, Yingli Tian

ICME, 2019

We tackle the text detection problem from the instance-aware segmentation perspective, in which text bounding boxes are extracted directly from segmentation results without location regression. Specifically, a text-specific attention model and a global enhancement block (GEB) are introduced to enrich the semantics of text detection features. The attention model is trained with a weak segmentation supervision signal and forces the detector to focus on text regions while suppressing the influence of neighboring background clutter. In conjunction with the attention model, the GEB is adapted to reason about the relationships among different channels through channel-wise weight calibration. Our method achieves performance comparable to recent state-of-the-art methods on the ICDAR2013, ICDAR2015, and ICDAR2017-MLT benchmark datasets.
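As a rough illustration of extracting boxes from a segmentation result without regressing their coordinates, the sketch below takes a binary mask and returns one axis-aligned box per connected foreground region. It is only a minimal sketch: the function name `boxes_from_mask`, the 4-connectivity rule, and the toy mask are assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def boxes_from_mask(mask):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected foreground region.

    `mask` is a list of rows holding 0 (background) or 1 (text). This is a
    hypothetical stand-in for the post-processing step, not the paper's code.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill the region that starts here and track its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy mask with two separate text regions.
example_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
example_boxes = boxes_from_mask(example_mask)
```

In a real detector the mask would come from thresholding the network's per-pixel text scores, and rotated or polygonal boxes would likely be needed for oriented text.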


Recognizing Elevator Buttons and Labels for Blind Navigation

Jingya Liu, Yingli Tian

IEEE Int. Conf. on CYBER Technology in Automation, Control, and Intelligent Systems (IEEE-CYBER), 2017.

We propose a cascade framework to detect elevator buttons and recognize their labels from images for blind navigation. First, a pixel-level mask of elevator buttons is segmented using deep neural networks. Then a fast scene text detector is applied to recognize the text labels in the image and to extract their spatial vectors. Finally, the detected buttons and their associated labels are paired by combining the button mask with the spatial vectors of the labels based on their location distribution. The cascade framework is well suited to multitask processing, but accuracy may decrease from task to task. To avoid being limited by the intermediate tasks, we further introduce a new schema that pairs buttons with their labels and treats the region of a button and its label as a whole: the regions of button-label pairs are detected first, and then the label of each pair is recognized. To evaluate the proposed method, we collect an elevator button detection dataset of 1,000 images containing buttons, captured both inside and outside elevators and annotated with button locations and labels, plus 500 images captured in elevators that contain no buttons, which serve as negative images in the experiments. Preliminary results demonstrate the robustness and effectiveness of the proposed method for elevator button detection and associated label recognition.
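The final pairing step of the cascade can be pictured with a small sketch. The abstract does not spell out the pairing rule, so the code below substitutes a simple assumption: each button is greedily matched to the nearest unused label by the distance between their center points. The function name `pair_buttons_with_labels` and the coordinates are hypothetical.

```python
import math


def pair_buttons_with_labels(button_centers, label_centers):
    """Greedily pair each button with the closest label that is still unused.

    Returns a dict mapping button index -> label index. Nearest-center
    matching is an illustrative assumption, not the paper's pairing rule.
    """
    # Every (distance, button, label) candidate, closest first.
    candidates = sorted(
        (math.dist(button, label), b_idx, l_idx)
        for b_idx, button in enumerate(button_centers)
        for l_idx, label in enumerate(label_centers)
    )
    pairs = {}
    used_labels = set()
    for _, b_idx, l_idx in candidates:
        if b_idx in pairs or l_idx in used_labels:
            continue
        pairs[b_idx] = l_idx
        used_labels.add(l_idx)
    return pairs


# Three buttons in a column, with their labels to the right in shuffled order.
buttons = [(10.0, 10.0), (10.0, 50.0), (10.0, 90.0)]
labels = [(30.0, 52.0), (30.0, 88.0), (30.0, 11.0)]
pairs = pair_buttons_with_labels(buttons, labels)
```

Greedy matching is easy to follow but can be suboptimal on dense panels; an optimal assignment method would be a natural replacement if mismatches matter.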

  • Vehicle Tracking and Re-identification:



Multi-camera Vehicle Tracking and Re-identification on AI City Challenge 2019

Yucheng Chen, Longlong Jing, Elahe Vahdani, Ling Zhang, Mingyi He, Yingli Tian

CVPRW 2019 (under review)

In this work, we present our solutions to the image-based vehicle re-identification (ReID) track and the multi-camera vehicle tracking (MVT) track of the AI City Challenge 2019 (AIC2019). For the ReID track, we propose an enhanced multi-granularity network with multiple branches to extract visual features of vehicles at different levels of granularity. With the help of these multi-grained features, the proposed framework outperforms the current state-of-the-art vehicle ReID methods by 16.3% on the VeRi dataset. For the MVT track, we first generate tracklets using the Kernighan-Lin graph partitioning algorithm with feature and motion correlation, then combine the tracklets into trajectories with the proposed progressive connection strategy, and finally match trajectories across camera views based on the annotated road boundaries. Our MVT and ReID algorithms ranked 10th and 23rd in the MVT and ReID tracks, respectively, at the NVIDIA AI City Challenge 2019.

  • Video Action Recognition:


Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction

Longlong Jing, Xiaodong Yang, Jingen Liu, Yingli Tian

(under review)

The success of deep neural networks generally requires a vast amount of labeled training data, which is expensive and infeasible at scale, especially for video collections. To alleviate this problem, we propose 3DRotNet, a fully self-supervised approach to learning spatiotemporal features from unlabeled videos. A set of rotations is applied to all videos, and a pretext task is defined as predicting these rotations. In accomplishing this task, 3DRotNet is effectively trained to understand the semantic concepts and motions in videos. In other words, it learns a spatiotemporal video representation that can be transferred to improve video understanding tasks on small datasets. Our extensive experiments demonstrate the effectiveness of the proposed framework on action recognition, with significant improvements over state-of-the-art self-supervised methods. With 3DRotNet pre-trained in a self-supervised manner on large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51 compared to models trained from scratch.
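The rotation pretext task can be sketched as a data-preparation step: every unlabeled clip yields several rotated copies, and the rotation index serves as a free training label. The sketch assumes a rotation set of 0, 90, 180, and 270 degrees and a clip layout of (frames, height, width, channels); the abstract fixes neither, and `make_rotation_samples` is a hypothetical helper.

```python
import numpy as np

# Assumed rotation set; the abstract only says "a set of rotations".
ROTATIONS = (0, 90, 180, 270)


def make_rotation_samples(clip):
    """Return a list of (rotated_clip, label) pairs for one unlabeled clip.

    `clip` has shape (frames, height, width, channels). Label k means every
    frame was rotated by k * 90 degrees in the (height, width) plane.
    """
    samples = []
    for label in range(len(ROTATIONS)):
        rotated = np.rot90(clip, k=label, axes=(1, 2))
        samples.append((rotated, label))
    return samples


# Toy clip: 2 frames of 4x4 RGB pixels.
clip = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)
samples = make_rotation_samples(clip)
```

A classifier trained to predict the label from the rotated clip needs no human annotation, which is what makes the task self-supervised.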


Recognizing American Sign Language Manual Signs from RGB-D Videos

Longlong Jing*, Elahe Vahdani*, Yingli Tian, Matt Huenerfauth (* equal contribution)

(under review)

In this paper, we propose a 3D Convolutional Neural Network (3DCNN) based multi-stream framework to recognize American Sign Language (ASL) manual signs (consisting of movements of the hands, as well as non-manual face movements in some cases) in real time from RGB-D videos, by fusing multimodal features including hand gestures, facial expressions, and body poses from multiple channels (RGB, depth, motion, and skeleton joints). To learn the overall temporal dynamics in a video, we generate a proxy video by selecting a subset of frames from each video, which is then used to train the proposed 3DCNN model. We collect a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos, each of 100 ASL manual signs, including RGB channel, depth maps, skeleton joints, face features, and HDface. The dataset is fully annotated for each semantic region (i.e., the time duration of each word that the human signer performs). Our proposed method achieves 75.9% accuracy from the RGB channel alone and 80.3% from the fusion of multiple channels for recognizing 100 ASL words, which demonstrates the effectiveness of recognizing ASL signs from RGB-D videos.


Video you only look once: Overall temporal convolutions for action recognition

Longlong Jing, Xiaodong Yang, Yingli Tian

JVCIR, 2018

In this paper, we propose an efficient and straightforward approach, video you only look once (VideoYOLO), to capture the overall temporal dynamics from an entire video in a single process for action recognition. How to deal with the temporal dimension in videos remains an open question for action recognition. Existing methods subdivide a whole video into either individual frames or short clips and consequently have to process these fractions multiple times. A post-processing step is then used to aggregate the partial dynamic cues and implicitly infer the whole temporal information. In contrast, VideoYOLO first generates a proxy video by selecting a subset of frames that roughly preserves the overall temporal dynamics of the original video. A 3D convolutional neural network (3D-CNN) is then employed to learn the overall temporal characteristics from the proxy video and predict the action category in a single process. Our proposed method is extremely fast: VideoYOLO-32 can process 36 videos per second, which is 10 times and 7 times faster than prior 2D-CNN (Two-stream) and 3D-CNN (C3D) based models, respectively, while still achieving superior or comparable classification accuracy on the benchmark datasets UCF101 and HMDB51.
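One way to picture proxy-video generation is as a choice of frame indices spread across the whole video. The sketch below splits the video into equal segments and takes the middle frame of each; this particular sampling rule, the function name `proxy_frame_indices`, and the 320-frame example are assumptions, since the abstract only states that a subset of frames is selected.

```python
def proxy_frame_indices(num_frames, proxy_length):
    """Pick `proxy_length` frame indices spread evenly over a video.

    The video is split into `proxy_length` equal segments and the midpoint of
    each segment is chosen. Midpoint sampling is an illustrative assumption.
    """
    if num_frames <= 0 or proxy_length <= 0:
        raise ValueError("num_frames and proxy_length must be positive")
    segment = num_frames / proxy_length
    # Clamp to the last valid index; short videos simply repeat frames.
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(proxy_length)]


# A 320-frame video reduced to a 32-frame proxy.
indices = proxy_frame_indices(num_frames=320, proxy_length=32)
```

Because the proxy always has the same length, a single forward pass of a 3D-CNN covers the whole video regardless of its original duration.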


3D convolutional neural network with multi-model framework for action recognition

Longlong Jing, Yuancheng Ye, Xiaodong Yang, Yingli Tian

ICIP, 2017

In this paper, we propose an efficient and effective action recognition framework that combines multiple feature models, from dynamic images, optical flow, and raw frames, with a 3D convolutional neural network (CNN). Dynamic images preserve long-term temporal information, optical flow captures short-term temporal information, and raw frames represent appearance information. Experiments demonstrate that dynamic images provide information complementary to the raw frame and optical flow features. Furthermore, with approximate rank pooling, computing dynamic images is about 360 times faster than computing optical flow, and dynamic images require far less memory than optical flow and raw frames.
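To give a feel for why approximate rank pooling is cheap, the sketch below collapses a clip into a single dynamic image with one weighted sum over frames. It assumes the simplified coefficients alpha_t = 2t - T - 1 that appear in the dynamic-image literature; the abstract does not say which variant the framework uses, so both the coefficients and the function name `dynamic_image` should be read as assumptions.

```python
import numpy as np


def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W, C) into one (H, W, C) dynamic image.

    Uses the simplified approximate rank pooling weights alpha_t = 2t - T - 1
    (t = 1..T), so later frames get positive weight and earlier frames
    negative weight. The exact variant is an assumption.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    t = np.arange(1, num_frames + 1, dtype=np.float64)
    alpha = 2.0 * t - num_frames - 1.0
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))


# Three constant frames with intensities 1, 2, and 4.
toy_clip = np.stack([np.full((2, 2, 1), value) for value in (1.0, 2.0, 4.0)])
toy_dynamic = dynamic_image(toy_clip)
```

The weights sum to zero, so a perfectly static clip maps to an all-zero image and only temporal change survives; the whole computation is a single pass over the frames, with no per-pixel motion estimation.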

  • Image Segmentation:


Coarse-to-fine Semantic Segmentation from Image-level Labels

Longlong Jing*, Yucheng Chen*, Yingli Tian (*equal contribution)

(under review)

Deep neural network-based semantic segmentation generally requires large-scale, costly annotations for training to achieve good performance. To avoid the pixel-wise segmentation annotations needed by most methods, some researchers have recently attempted to use object-level labels (e.g., bounding boxes) or image-level labels (e.g., image categories). In this paper, we propose a novel recursive coarse-to-fine semantic segmentation framework based only on image-level category labels. For each image, an initial coarse mask is first generated by a convolutional neural network-based unsupervised foreground segmentation model and then enhanced by a graph model. The enhanced coarse mask is fed to a fully convolutional neural network to be recursively refined. Unlike existing image-level label-based semantic segmentation methods, which require labels for all categories in images that contain multiple types of objects, our framework needs only one label per image and can handle images that contain objects of multiple categories. Trained only on ImageNet, our framework achieves performance on the PASCAL VOC dataset comparable to other state-of-the-art image-level label-based semantic segmentation methods. Furthermore, our framework can be easily extended to the foreground object segmentation task and achieves performance comparable to state-of-the-art supervised methods on the Internet Object dataset.


LGAN: Lung Segmentation in CT scans using Generative Adversarial Network

Jiaxing Tan*, Longlong Jing*, Yingli Tian, Oguz Akin, Yumei Huo (*equal contribution)

(under review)

Lung segmentation in computerized tomography (CT) images is an important procedure in the diagnosis of various lung diseases. Most current lung segmentation approaches are performed through a series of procedures with manual, empirical parameter adjustments at each step. Pursuing an automatic segmentation method with fewer steps, we propose a novel deep learning Generative Adversarial Network (GAN) based lung segmentation schema, which we denote as LGAN. The proposed schema can be generalized to different kinds of neural networks for lung segmentation in CT images and is evaluated on a dataset containing 220 individual CT scans with two metrics: segmentation quality and shape similarity. We also compare our work with current state-of-the-art methods. The results of this study demonstrate that the proposed LGAN schema is a promising tool for automatic lung segmentation due to its simplified procedure as well as its good performance.

  • Intelligent Video Activity Analysis:

Intelligent video surveillance produces large amounts of event and activity data. This research will explore composite event detection, association mining, pattern discovery, and unusual pattern detection using data mining.

  • Moving Object Detection and Tracking in Challenging Environments:

There is a large body of research on moving object detection and tracking, but it remains hard to achieve satisfactory results in challenging environments such as crowds, lighting changes, or bad weather. Our research will focus on proposing more robust and efficient algorithms for video understanding.

  • Facial Expression Analysis in Naturalistic Environments:

Research on facial expression analysis in naturalistic environments will have a significant impact across a range of theoretical and applied topics. Real-life facial expression analysis must handle head motion (both in-plane and out-of-plane), occlusion, lighting changes, low-intensity expressions, low-resolution input images, the absence of a neutral face for comparison, and facial actions due to speech.