CATALIST: CAmera TrAnsformations for multi-LIngual Scene Text recognition
Automatic text recognition in videos is challenging because of problems like motion blur, variations in text size and fonts, and the use of multiple languages. Movement of the capturing camera and the resulting orientation of text make the recognition task even more difficult. Attention-based methods have delivered excellent results for scene-text OCR in images. However, they suffer from the problem of attention masks becoming unstable and wandering across the scene.
To alleviate this issue, we offer a dataset of scene-text videos along with the corresponding camera movements. These videos mainly contain sign boards and number plates. We also provide word-level masks for each video frame. The videos are shot in both indoor and outdoor environments.
More details can be found in the paper here.
The complete dataset can be downloaded from here.
We present a total of 2322 (Train: 1597, Val: 199, Test: 526) scene-text videos. The videos contain a combination of three different languages, namely, English, Hindi, and Marathi.
For each such scene-text, we create around 12 videos using 12 different types of camera transformations, broadly categorized into 5 groups:
We record all videos with a camera mounted on a tripod stand for uniform control. Note that there are four types of translations, whereas every other transformation has only two types. All videos are captured at 25 fps with a resolution of 1920 × 1080. Table 1 shows the distribution of videos by transformation.
S.No. | Transformation | Number of Videos |
---|---|---|
1. | Translation | 736 |
2. | Roll | 357 |
3. | Tilt | 387 |
4. | Pan | 427 |
5. | Zoom | 402 |
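Because all videos share a fixed capture rate of 25 fps and a 1920 × 1080 resolution, frame indices map directly to timestamps. A minimal sketch of that mapping (the constant and function names below are ours, not part of any dataset tooling):

```python
# Capture settings as stated for the CATALIST videos.
FPS = 25
RESOLUTION = (1920, 1080)  # (width, height) in pixels

def frame_to_seconds(frame_index, fps=FPS):
    """Convert a 0-based frame index to a timestamp in seconds."""
    return frame_index / fps

def seconds_to_frame(t, fps=FPS):
    """Convert a timestamp in seconds to the nearest frame index."""
    return round(t * fps)
```

For example, frame 50 of any video corresponds to the 2-second mark.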
train_ann.txt and val_ann.txt are tab-separated files. Each line contains the following data:
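As a sketch of loading these annotation files, the helper below (a hypothetical name, not shipped with the dataset) simply splits each non-empty line on tabs; map the resulting tuples onto the fields listed above:

```python
from pathlib import Path

def load_annotations(path):
    """Parse a tab-separated annotation file (e.g. train_ann.txt)
    into a list of field tuples, one per non-empty line.
    The column layout is whatever the dataset documents; this
    helper makes no assumption about field names or count."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        records.append(tuple(line.split("\t")))
    return records
```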
We also provide word-level and paragraph-level masks for all video frames, generated using the Cloud Vision API. These are provided in case they are needed for weak supervision.
The train_bbox and val_bbox folders contain one folder per video. Each video folder contains a JSON file for each frame of the video, with the masks as given by the Vision API. The first object in the JSON file is the mask covering all the text in the frame, alternatively called the paragraph-level mask. The following objects are masks for each individual word.
We recommend using the word-level masks along with the text labels in the train_ann.txt and val_ann.txt files. Note that the text annotations and masks (from the Cloud Vision API) have not been manually verified by us.
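Following the per-frame JSON layout described above (first object is the paragraph-level mask, remaining objects are word-level masks), a small parsing sketch; the function name is ours, and no field names inside the individual objects are assumed:

```python
import json
from pathlib import Path

def load_frame_masks(json_path):
    """Split a per-frame JSON file (from train_bbox/<video>/) into
    the paragraph-level mask (first object) and the word-level
    masks (all remaining objects)."""
    objects = json.loads(Path(json_path).read_text(encoding="utf-8"))
    paragraph_mask, word_masks = objects[0], objects[1:]
    return paragraph_mask, word_masks
```

Typical usage would iterate over a video folder, pairing each frame's word-level masks with the corresponding labels from the annotation files.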
@inproceedings{catalist2021,
title={CATALIST: {CA}mera {T}r{A}nsformations for multi-{LI}ngual {S}cene {T}ext recognition},
author={Shivam Sood and Rohit Saluja and Ganesh Ramakrishnan and Parag Chaudhuri},
booktitle={2021 International Conference on Document Analysis and Recognition Workshops (ICDARW)},
year={2021},
organization={IEEE}
}