Welcome to CATALIST

CATALIST: CAmera TrAnsformations for multi-LIngual Scene Text recognition



Automatic text recognition in videos is challenging because of problems like motion blur, variations in text size and fonts, and the use of multiple languages. Movement of the capturing camera, and the resulting orientation of the text, makes the recognition task even more difficult. Attention-based methods have delivered excellent results for scene-text OCR in images. However, they suffer from the problem of attention masks becoming unstable and wandering in the scene.

To alleviate this issue, we offer a dataset of scene-text videos along with the camera movements used to capture them. These videos mainly contain sign boards and number plates. We also provide word-level masks for each video frame. The videos are shot in both indoor and outdoor environments.

More details can be found in the paper here.

Complete dataset can be downloaded from here.


We present a total of 2322 (Train: 1597, Val: 199, Test: 526) scene-text videos. The videos contain a combination of three different languages, namely, English, Hindi, and Marathi.

For each such scene text, we create around 12 videos using 12 different types of camera transformations, broadly categorized into 5 groups:

  1. four types of translation, that could be left, right, up and down,
  2. two types of roll, including clockwise and anti-clockwise,
  3. two types of tilt which could be up-down or down-up motion,
  4. two types of pan, that is left-right and right-left, and
  5. two types of zoom which could be in or out.
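The grouping above can be summarized as a small lookup table. The short codes (TR, R, T, P, Z) are the ones used in the annotation files described below; the per-direction labels inside each group are our own shorthand for the motions listed above.

```python
# Sketch of the 12 camera transformations grouped into 5 categories.
# Codes TR/R/T/P/Z match the annotation files; direction labels are
# informal shorthand, not identifiers from the dataset itself.
TRANSFORMATIONS = {
    "TR": ["left", "right", "up", "down"],   # translation (4 types)
    "R":  ["clockwise", "anti-clockwise"],   # roll
    "T":  ["up-down", "down-up"],            # tilt
    "P":  ["left-right", "right-left"],      # pan
    "Z":  ["in", "out"],                     # zoom
}
```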

We record all these videos with a camera mounted on a tripod stand for uniform control. Note that translation has four types, whereas every other transformation has only two. All videos are captured at 25 fps with a resolution of 1920 × 1080. Table 1 shows the distribution of videos by transformation.

Table 1: Transformation Distribution of Videos
S.No. Transformation Number of Videos
1. Translation 736
2. Roll 357
3. Tilt 387
4. Pan 427
5. Zoom 402


train_ann.txt and val_ann.txt are tab-separated files. Each line contains the following data:

  • Video name
  • Manually annotated and verified text label for the video
  • Type of transformation - TR, R, T, P, Z (namely Translation, Roll, Tilt, Pan and Zoom)
  • Time at which the transformation starts (default: start of the video)
  • Time at which the transformation ends (default: end of the video)
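A minimal sketch of parsing these annotation files, assuming the tab-separated field order listed above; the field names in the returned dictionaries are our own, not identifiers from the dataset.

```python
import csv

def load_annotations(path):
    """Parse a tab-separated annotation file (e.g. train_ann.txt).

    Each line holds: video name, text label, transformation code
    (TR, R, T, P, Z), and optional start/end times. When the times are
    absent, the transformation is assumed to span the whole video.
    """
    annotations = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            annotations.append({
                "video": row[0],
                "label": row[1],
                "transform": row[2],
                "start": row[3] if len(row) > 3 else None,  # default: video start
                "end": row[4] if len(row) > 4 else None,    # default: video end
            })
    return annotations
```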

We also provide word-level and paragraph-level masks for all video frames, generated using the Cloud Vision API. These are provided in case they are needed for weak supervision.

train_bbox and val_bbox folders contain a folder for each video. Each video folder contains a JSON file for each frame of the video, holding the masks as given by the Vision API. The first object in the JSON file is the mask covering all the text in the frame, alternatively called the paragraph-level mask. The following objects in the JSON file are the masks for each individual word.

We recommend using the word-level masks along with the text labels in the train_ann.txt and val_ann.txt files. Note that the text annotations and masks produced by the Cloud Vision API are not manually verified by us.
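A minimal sketch of separating the two mask levels for one frame, assuming each per-frame JSON file holds a list whose first element is the paragraph-level mask and whose remaining elements are word-level masks; the exact schema follows the Cloud Vision API output and may differ.

```python
import json

def split_masks(frame_json_path):
    """Split one frame's mask file into paragraph- and word-level masks.

    Assumes the top-level JSON value is a list: element 0 is the mask for
    all text in the frame (paragraph-level), the rest are per-word masks.
    """
    with open(frame_json_path, encoding="utf-8") as f:
        objects = json.load(f)
    paragraph_mask = objects[0]   # mask covering all text in the frame
    word_masks = objects[1:]      # one mask per individual word
    return paragraph_mask, word_masks
```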



Citation

If you use the CATALIST dataset, please cite (BibTeX entry key below is illustrative):

@inproceedings{sood2021catalist,
  title={CATALIST: {CA}mera {T}r{A}nsformations for multi-{LI}ngual {S}cene {T}ext recognition},
  author={Shivam Sood and Rohit Saluja and Ganesh Ramakrishnan and Parag Chaudhuri},
  booktitle={2021 International Conference on Document Analysis and Recognition Workshops (ICDARW)},
  year={2021},
}

Contact us

  • Team [catalist2021 at gmail dot com]
  • Shivam Sood [ssood at cse dot iitb dot ac dot in]