ICDAR 2019 Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection
- Harold Mouchère <harold.mouchere(at)univ-nantes.fr>
Downloads
| Name | Size | Type | Mirrors | Description |
|---|---|---|---|---|
| TC11_package_CROHME2019.zip | 364 MB | other | 1 | Zip file with data, tools and papers |
Dataset Information
This package provides training and test data from the competitions CROHME 2011, 2012, 2013, 2014, 2016 and 2019.
Ground Truth
Math expressions for online and offline handwriting
The ground truth is available in InkML format (with a LaTeX string and MathML structure), as Stroke Label Graphs (SLG, associating strokes with ground-truth labels), and as Object Layout Graphs (symbol layout trees independent of the strokes). This ground truth supports training and evaluation for both online and offline recognition tasks.
Typeset Formula Detection
The ground truth, taken from the GTDB datasets, gives the locations of math expressions in a set of scientific documents.
Research Tasks
Online Handwritten Formula Recognition
For the traditional task in CROHME, participants must convert a list of handwritten strokes, captured as a list of polylines from a tablet or similar device, to a Symbol Layout Tree (SLT). This SLT captures the segmentation of strokes into symbols, symbol classification, and the spatial relationships between symbols. SLTs are represented using labeled directed graphs, so that all segmentation, classification, and relationship (parsing) errors can be automatically identified and compiled using tools developed for CROHME (CROHMELib and LgEval).
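As a rough illustration of the input to this task, the sketch below reads strokes from an InkML document using Python's standard library. It assumes each `<trace>` element holds comma-separated samples whose first two values are x and y, and that the file uses the W3C InkML namespace. The function name `parse_traces` and the sample ink are hypothetical and are not part of CROHMELib or LgEval.

```python
import xml.etree.ElementTree as ET

# W3C InkML namespace, written in ElementTree's "{uri}tag" form.
INKML_NS = "{http://www.w3.org/2003/InkML}"


def parse_traces(inkml_text):
    """Return {trace_id: [(x, y), ...]} from an InkML document string."""
    root = ET.fromstring(inkml_text)
    strokes = {}
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        # Samples are separated by commas, channel values by whitespace.
        for sample in (trace.text or "").strip().split(","):
            values = sample.split()
            if len(values) >= 2:
                # Assumption: the first two channels are X and Y.
                points.append((float(values[0]), float(values[1])))
        strokes[trace.get("id")] = points
    return strokes


# Hypothetical two-stroke sample, not taken from the dataset.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">10 20, 12 24, 15 30</trace>
  <trace id="1">40 20, 40 35</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
print(strokes["0"])
```

Each stroke comes back as one polyline, which matches the description of the input above. Grouping strokes into symbols is the recognition task itself and is not attempted here.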
Offline Handwritten Formula Recognition
For offline recognition of handwritten inputs, we will render images from the (x,y) points in the CROHME InkML files. As in the previous task, for a given test image, participating systems must produce one .lg file. Please note that, since primitive-level information (connected components) is not provided, we evaluate the systems based on the correct symbols and the correct relations between the symbols (symbolic evaluation). Systems can produce a LaTeX string or Presentation MathML tree as output. LaTeX and MathML should be converted to symbolic LG for evaluation using the provided tools (tex2symlg and mml2symlg). There are also tools to convert .lg files to symbolic label graphs (lg2symlg) for interested participants (although they will be defining their own ‘stroke’ primitives in that case).
For evaluation, we will use the same evaluation tools as in the online recognition task; with the new symLG format, only connected-component (CC) segmentation and the correspondence of CCs to symbols are ignored.
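The rendering step mentioned for this task can be pictured with the minimal sketch below, which rasterizes strokes into a small binary grid by interpolating between consecutive points. This is only an assumption about how such rendering might be done, not the organizers' rendering code. The function name `render_strokes` and its `size` and `pad` parameters are hypothetical.

```python
def render_strokes(strokes, size=32, pad=2):
    """Rasterize strokes (lists of (x, y) points) into a size x size 0/1 grid."""
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    # Use one scale for both axes so the aspect ratio is preserved.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 1 - 2 * pad) / span

    def to_pixel(point):
        return (pad + round((point[0] - min_x) * scale),
                pad + round((point[1] - min_y) * scale))

    grid = [[0] * size for _ in range(size)]
    for stroke in strokes:
        pixels = [to_pixel(p) for p in stroke]
        # A single-point stroke is drawn as a dot (segment from p to p).
        for (x0, y0), (x1, y1) in zip(pixels, pixels[1:] or pixels):
            steps = max(abs(x1 - x0), abs(y1 - y0))
            for i in range(steps + 1):
                t = i / steps if steps else 0.0
                col = round(x0 + (x1 - x0) * t)
                row = round(y0 + (y1 - y0) * t)
                grid[row][col] = 1
    return grid


# Hypothetical input: one horizontal stroke.
grid = render_strokes([[(0, 0), (10, 0)]], size=8, pad=1)
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

A real renderer would typically also choose a stroke width and image resolution; those choices are left out here because the description above does not specify them.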
Detection of Formulas in Document Pages
In this task, for a given document page, participating systems identify the location of formulas using bounding boxes. Evaluation will be done by calculating the intersection over union (IoU) with the ground-truth annotations. We will use thresholds of 50% and 75% to observe coarse and fine detection of formula regions. Participants will also have the option to use character-level information, but they must still submit the final math regions for IoU calculation, where regions are defined by the characters that they detected as math characters. This reflects how detection of math regions in born-digital documents (e.g., PDFs generated using a word processor) would be performed when characters are available.
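The IoU measure described above can be sketched as follows for axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the use of `>=` when comparing against the thresholds are assumptions made for illustration; the official evaluation tools may differ in these details.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def enclosing_box(char_boxes):
    """Smallest box covering all character boxes detected as math."""
    return (min(b[0] for b in char_boxes), min(b[1] for b in char_boxes),
            max(b[2] for b in char_boxes), max(b[3] for b in char_boxes))


# Hypothetical ground-truth region and detected math characters.
truth = (0, 0, 10, 10)
region = enclosing_box([(0, 0, 4, 5), (5, 0, 10, 5)])
iou = box_iou(truth, region)

# Assumption: a detection counts as correct when IoU >= threshold.
coarse_match = iou >= 0.50
fine_match = iou >= 0.75
print(region, iou, coarse_match, fine_match)
```

In this example the region built from the characters covers only the top half of the ground-truth box, so it would pass the coarse threshold under the `>=` assumption but not the fine one.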