Title Ideas

  • Using synthetic training data to identify glyphs in small and unbalanced datasets.

Background

The DARPA CMA competition challenged teams to take the legends from maps and extract a mask for each feature in the map. There were three kinds of map features to extract: polygons, lines, and glyphs. The competition's scoring placed much more weight on polygon extraction than on the other two feature types, so that is where most of our focus went during the competition. However, from our experience and the scores of the other teams, the hardest features to identify were actually the glyphs. This was due to the nature of the dataset: the distribution of unique glyph classes was extremely unbalanced, with some glyphs having thousands of occurrences and others only a single-digit count. The other problem was the high variability among individual occurrences of some glyph shapes, either because they contained numerals or letters or because they were meant to represent varying angles.

To attempt to solve these issues we tried creating our own training images (synthetic images). Using this synthetic dataset we were able to train the model to identify some of the features that had only a couple of examples in the original dataset; however, owing to time constraints, we were not able to address the variability in glyphs, which kept the approach from being fully successful.

Project Description

This proposal continues the work done for the DARPA CMA competition by building a tool that generates synthetic datasets for 2D glyphs. The ultimate goal of the project is to create a dataset and train a model from a single example image. The project would primarily use the dataset provided by the DARPA CMA competition, but other usable datasets should be identified as well.

Project Goals

Questions the project should answer:

  • What is an acceptable recall?
  • How many synthetic samples are needed to get to that acceptable recall?
  • Can we measure how well a synthetic pattern mimics the original pattern? (One candidate metric is sketched after this list.)
  • How do the precision and recall of the model on the synthetic training set relate to its generalization back to the original dataset?
  • Can the full process of creating a synthetic dataset be automated or does it require some human review?
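
One way to approach the pattern-similarity question is a perceptual metric such as SSIM over aligned grayscale crops. Below is a minimal sketch under that assumption; the file names are placeholders, and SSIM is only one candidate (pixel IoU or a learned embedding distance would be alternatives), not a committed design.

```python
# Minimal sketch: score how closely a synthetic glyph crop mimics a real one.
# Assumes both crops show the glyph at roughly the same scale; SSIM returns
# 1.0 for identical images and lower values as they diverge.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def pattern_similarity(real_path: str, synthetic_path: str) -> float:
    real = np.asarray(Image.open(real_path).convert("L"))
    synth_img = Image.open(synthetic_path).convert("L")
    if synth_img.size != (real.shape[1], real.shape[0]):
        # Resize the synthetic crop to match the real one before comparing.
        synth_img = synth_img.resize((real.shape[1], real.shape[0]))
    synth = np.asarray(synth_img)
    return ssim(real, synth, data_range=255)

# Hypothetical file names for illustration only.
print(pattern_similarity("real_glyph.png", "synthetic_glyph.png"))
```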

The main goal will be to build a synthetic data creation tool that can take the legend label of a glyph, build a pattern from it, and combine that pattern with background templates to create a synthetic training set. The secondary goal will be to measure the performance of various types of models on this problem and try to create a better-performing custom model.
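
To make the compositing step concrete, here is a minimal sketch of generating one synthetic sample, assuming an already-extracted glyph pattern (an RGBA crop with a transparent background) and a clean background template. The file names, the uniform random rotation and placement, and the bounding-box label format are all assumptions, not the final design.

```python
# Minimal sketch: paste a glyph pattern onto a background template at a random
# position and rotation, and record the bounding box as a training label.
import random
from PIL import Image

def make_synthetic_sample(pattern_path: str, background_path: str):
    pattern = Image.open(pattern_path).convert("RGBA")
    background = Image.open(background_path).convert("RGBA")

    # Random rotation; expand=True keeps the whole rotated glyph visible.
    rotated = pattern.rotate(random.uniform(0, 360), expand=True)

    # Random placement that keeps the glyph fully inside the background.
    x = random.randint(0, background.width - rotated.width)
    y = random.randint(0, background.height - rotated.height)

    # The RGBA pattern doubles as its own paste mask (alpha channel),
    # so only the glyph pixels are drawn onto the background.
    background.paste(rotated, (x, y), rotated)

    return background.convert("RGB"), (x, y, x + rotated.width, y + rotated.height)

# Hypothetical file names; repeat with varied patterns and backgrounds
# to build up a full training set.
image, bbox = make_synthetic_sample("pattern.png", "background.png")
image.save("synthetic_000.png")
print("label bbox:", bbox)
```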

Planned deliverables:

  • Research paper describing the synthetic data creation tool
  • Research paper comparing the performance of various models on this problem, including the custom model
  • Tool to create synthetic data
  • ML model to train against these glyphs
  • Short talks on the research

Project Tasks (11 Weeks estimated; budget 4 months to be comfortable)

  • Pre-Project Research (2-3 Days)
    • Research of prior work on the topic.
    • Research of supporting papers to give a foundation to the work.
  • Programming Tool (4-5 Weeks)
    • Programming the automatic detection of patterns from the base image
      • Programming the identification of fixed patterns
      • Programming the identification of the color of patterns (one approach is sketched after the task list)
      • Programming the identification of changing elements within patterns (e.g., numerals and letters)
    • Programming the user interface
      • Programming the user adjustment of created patterns (changing bounding boxes or vertices)
    • Programming the creation of "clean" background data
    • Programming the creation of the synthetic dataset (see the compositing sketch under Project Goals)
      • Programming the random rotations of patterns
      • Programming the random changes in numerals
    • Putting the entire pipeline together into a single workflow, and creating a non-GUI version that can generate synthetic data from pre-made template files or produce downloadable outputs
    • Documentation of the Tool
  • Model training and testing (3-4 Weeks)
    • Identify and build the models that will be used for testing.
    • Build a script to automate the testing of various models (an evaluation-loop sketch appears after the task list)
    • Build a baseline of traditional training on the dataset with no synthetic data
    • Measure the performance of the models on the synthetic dataset
    • Build off of the most promising model and try to improve it.
    • Run a few iterations of improvement, time permitting.
    • Generate final test results for paper 
    • Stretch goal: integrate model training into the tool.
  • Writing of Research Products (3 Weeks, tentative)
    • Identify candidate journals for submission
    • First draft of the synthetic data tool paper
    • Short talk / presentation on the synthetic data tool
    • Final draft of the synthetic data tool paper
    • First draft of the model research paper
    • Short talk / presentation on the model research paper
    • Final draft of the model research paper
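
For the color-identification task in the tool section above, one simple approach is to treat the most common non-background pixel color in the legend crop as the glyph's color. The sketch below assumes a light legend background; the brightness threshold and file name are placeholders, not a committed design.

```python
# Minimal sketch: estimate a glyph's color as the most common non-background
# pixel color in its legend crop. Assumes a light/white legend background;
# the threshold of 230 is an arbitrary placeholder.
from collections import Counter

import numpy as np
from PIL import Image

def dominant_glyph_color(legend_crop_path: str, bg_threshold: int = 230):
    pixels = np.asarray(Image.open(legend_crop_path).convert("RGB")).reshape(-1, 3)
    # Keep pixels where at least one channel is darker than the threshold,
    # i.e. pixels unlikely to belong to the white legend background.
    ink = pixels[(pixels < bg_threshold).any(axis=1)]
    if len(ink) == 0:
        return None  # Crop looks empty; flag for human review.
    counts = Counter(map(tuple, ink))
    return counts.most_common(1)[0][0]  # (R, G, B) of the dominant ink color

print(dominant_glyph_color("legend_crop.png"))  # hypothetical file name
```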
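
For the model-testing automation task, this is the kind of evaluation loop the script could run to produce the per-class precision and recall numbers the project questions ask about. The `model.predict(image)` interface and the sample format are placeholders for whatever the real test harness provides.

```python
# Minimal sketch: per-class precision/recall for one model on a held-out set,
# so baseline (no synthetic data) and synthetic-trained runs can be compared.
from collections import defaultdict

def evaluate(model, samples):
    """samples: iterable of (image, true_label) pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for image, true_label in samples:
        pred = model.predict(image)  # placeholder interface
        if pred == true_label:
            tp[true_label] += 1
        else:
            fp[pred] += 1
            fn[true_label] += 1
    classes = set(tp) | set(fp) | set(fn)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in classes
    }

# Usage sketch: run the same held-out samples through each candidate model.
# for name, model in candidate_models.items():
#     print(name, evaluate(model, held_out_samples))
```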