Title Ideas

  • Using synthetic training data to identify glyphs in small and unbalanced datasets.

Background

The DARPA CMA competition challenged teams to take the legend items from a map and extract a mask of each corresponding feature in the map. There were three kinds of map features to extract: polygons, lines, and glyphs. The competition's scoring placed much more weight on polygon extraction than on the other two feature types, so that is where most of our focus went during the competition. However, judging from our experience and the scores of the other teams, the hardest features to identify were actually the glyphs. This was due to the nature of the dataset: the distribution of unique glyph classes was extremely unbalanced, with some glyphs having thousands of occurrences and others only a single-digit count. The other problem was the high variability among individual occurrences of some glyph features, either because they contained numerals or letters or because they were meant to represent varying angles.

To address these issues we tried creating our own training images (synthetic images). Using this synthetic dataset we were able to train the model to identify some of the features that had only a couple of examples in the original dataset; however, owing to time constraints, we were not able to address the variability within glyphs, which kept the approach from being truly successful.

Project Description

This proposal continues the work done for the DARPA CMA competition by building a tool that generates a synthetic training dataset for 2D legend items. The ultimate goal of the project is to create a dataset and train a model from a single image of a map. The project would primarily use the dataset provided by the DARPA CMA competition for the research, but other usable datasets should be identified as well.

Project Goals

The main goal is to build a synthetic data creation tool that can take the glyph from a legend label, derive a pattern from it, and combine that pattern with background templates to create a synthetic training set. The secondary goal is to measure the performance of various types of models on this problem and try to create a better-performing custom model.
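As a rough sketch of the core idea, the example below composites a glyph cutout onto a clean background template at a random position and rotation and produces the matching mask. The file names, canvas size, and library choice (Pillow) are illustrative assumptions, not settled design decisions.

```python
import random
from PIL import Image

def make_synthetic_sample(glyph_path, background_path, canvas_size=(512, 512)):
    """Paste a single glyph cutout onto a clean background at a random
    position and rotation, returning the image and its mask."""
    background = Image.open(background_path).convert("RGB").resize(canvas_size)
    glyph = Image.open(glyph_path).convert("RGBA")

    # Random rotation; expand=True keeps the rotated glyph fully visible.
    angle = random.uniform(0, 360)
    glyph = glyph.rotate(angle, expand=True)

    # Random placement; assumes the rotated glyph fits inside the canvas.
    x = random.randint(0, canvas_size[0] - glyph.width)
    y = random.randint(0, canvas_size[1] - glyph.height)

    # Paste using the glyph's alpha channel, and record the same footprint
    # in a single-channel mask for training a segmentation model.
    background.paste(glyph, (x, y), glyph)
    mask = Image.new("L", canvas_size, 0)
    mask.paste(glyph.split()[-1], (x, y))
    return background, mask

# Example usage with placeholder file names.
image, mask = make_synthetic_sample("glyph_cutout.png", "clean_background.png")
image.save("synthetic_sample.png")
mask.save("synthetic_mask.png")
```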

Milestone | Quality Metric | Date
Show that a synthetic training set can be used to train the base model as well as or better than the original training set. | F-score of the model trained on synthetic data is >= the F-score from training on the original dataset | 3-4 weeks after start
Show proof that the initial concept holds weight. The scores from the competition were mostly in the <10% range, with the highest being above 30%. | F-score is >50% | 2 months after start
 | F-score is >95% | Project conclusion
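The milestones above are all stated in terms of F-scores. Below is a minimal sketch of how that comparison could be computed with scikit-learn, assuming per-glyph class predictions from the two trained models; the labels are made-up placeholders, not real results.

```python
from sklearn.metrics import f1_score

# Hypothetical per-glyph predictions from a model trained on the original
# data and a model trained on synthetic data, evaluated on the same test set.
y_true = ["mine", "mine", "quarry", "fault", "fault", "fault"]
y_pred_original = ["mine", "quarry", "quarry", "fault", "mine", "fault"]
y_pred_synthetic = ["mine", "mine", "quarry", "fault", "fault", "mine"]

# Macro-averaged F1 weights every glyph class equally, so rare classes
# count as much as common ones -- important for an unbalanced dataset.
f1_original = f1_score(y_true, y_pred_original, average="macro")
f1_synthetic = f1_score(y_true, y_pred_synthetic, average="macro")

print(f"original training F1: {f1_original:.2f}")
print(f"synthetic training F1: {f1_synthetic:.2f}")
print("milestone met" if f1_synthetic >= f1_original else "milestone not met")
```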

Project Products

  • Research paper describing the synthetic data creation tool
  • Research paper comparing the performance of various models on this problem with my custom model
  • Tool to create synthetic data
  • ML model trained to identify these glyphs
  • Short talks on the research

Questions the project should answer

  • What is an acceptable recall?
  • How many synthetic samples are needed to get to that acceptable recall?
  • Can we measure how well a synthetic pattern mimics the original pattern? (See the similarity sketch after this list.)
  • How does the precision and recall of the model on the training set relate to its generalization back to the original dataset?
  • Can the full process of creating a synthetic dataset be automated or does it require some human review?
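For the pattern-similarity question, one candidate (but unvalidated) measure is the structural similarity index between a synthetic pattern and a crop of the real glyph. The sketch below uses scikit-image and placeholder file names.

```python
from skimage import io, transform
from skimage.metrics import structural_similarity as ssim

def pattern_similarity(synthetic_path, original_path, size=(64, 64)):
    """Return SSIM between a synthetic glyph pattern and a crop of the
    original glyph, after converting to grayscale and a common size."""
    synthetic = transform.resize(io.imread(synthetic_path, as_gray=True), size)
    original = transform.resize(io.imread(original_path, as_gray=True), size)
    # data_range=1.0 because as_gray images are floats in [0, 1].
    return ssim(synthetic, original, data_range=1.0)

# Placeholder file names; scores near 1.0 mean a close visual match.
print(pattern_similarity("synthetic_glyph.png", "original_glyph_crop.png"))
```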

Project Tasks (11 Weeks Estimated?)

  • Pre-Project Research (2-3 Days)
    • Research of prior work on the topic.
    • Research of supporting papers to give a foundation to the work.
  • Programming Tool (4-5 Weeks)
    • Programming the automatic detection of patterns from the base image (see the detection sketch after this task list)
      • Programming the identification of fixed patterns
      • Programming the identification of the color of patterns
      • Programming the identification of changing elements within patterns (e.g., numerals and letters)
    • Programming the UI
      • Programming the user adjustment of created patterns (Changing bounding boxes or vertices)
    • Programming the creation of "clean" background data
    • Programming the creation of the synthetic dataset
      • Programming the random rotations of patterns
      • Programming the random changes in numerals
    • Putting the entire pipeline together in one workflow and creating a non-GUI version that can generate synthetic data from pre-made template files, or that produces downloadable records of the created patterns
    • Documentation of the Tool
  • Model training and testing (3-4 Weeks)
    • Identify and build the models that will be used for testing.

    • Build a script to automate the testing of various models
    • Build a baseline of traditional training on the dataset with no synthetic data
    • Measure the performance of the models on the synthetic dataset
    • Build off of the most promising model and try to improve it.
    • Hopefully a few iterations of improvements
    • Generate final test results for paper 
    • Stretch goal to integrate model training into tool.
  • Writing of Research Products (3 Weeks?) 
    • Identify candidate journals for submission
    • First draft of synthetic data tool paper
    • Short talk / presentation on the synthetic data tool
    • Final draft of the synthetic data tool paper
    • First draft of model research paper
    • Short talk / presentation on the model research paper
    • Final draft of model research paper
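For the automatic detection of patterns from the base image, referenced in the Programming Tool tasks above, one possible starting point is OpenCV template matching. The sketch below is an assumption about the approach, with placeholder file names and an arbitrary match threshold; it does not handle rotation or scale, which a real implementation would need.

```python
import cv2
import numpy as np

def find_pattern_occurrences(map_path, glyph_path, threshold=0.8):
    """Locate candidate occurrences of a legend glyph in a map image using
    normalized cross-correlation template matching."""
    map_gray = cv2.imread(map_path, cv2.IMREAD_GRAYSCALE)
    glyph_gray = cv2.imread(glyph_path, cv2.IMREAD_GRAYSCALE)
    h, w = glyph_gray.shape

    scores = cv2.matchTemplate(map_gray, glyph_gray, cv2.TM_CCOEFF_NORMED)
    ys, xs = np.where(scores >= threshold)

    # Return bounding boxes (x, y, w, h) for every location above threshold;
    # a real tool would also need non-maximum suppression over the matches.
    return [(int(x), int(y), w, h) for x, y in zip(xs, ys)]

# Placeholder file names for the base map and one extracted legend glyph.
boxes = find_pattern_occurrences("base_map.png", "legend_glyph.png")
print(f"found {len(boxes)} candidate occurrences")
```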

Benefits for NFI

The primary benefit for NFI in this project is gaining the skills, experience, and tools needed to tackle harder problems with synthetic training imagery. Both the NGA and the DOD have shown an interest in synthetic satellite data, giving development grants to Orbital Insight and L3Harris respectively. Working with satellite data presents a more difficult challenge than the DARPA project, and if we want to take on these more difficult challenges, having a foundation of knowledge and tools that we can adapt would be greatly beneficial.

Synthetic training data provides a way to quickly build new AI models for problems where there may not be enough data to train traditionally. Even if the NGA and DOD do not end up being partners, understanding synthetic data workflows and their pitfalls would increase our value as ML consultants on other projects, as it is my view that synthetic training data will become more common in the future.

References:

“Subject matter experts to identify objects in scenarios are still very important,” Mark Munsell, National Geospatial-Intelligence Agency deputy director of data and digital information, told SpaceNews. “We anticipate they will be supplemented by synthetic moving forward.” https://spacenews.com/synthetic-data-geoint/

Rendered.AI has a one-and-a-half-year, $1 million SBIR Phase II grant with the NGA to produce synthetic image generation products. https://www.sbir.gov/node/2174291 and https://orbitalinsight.com/news-and-events/press-releases/orbital-insight-unveils-multiclass-object-detection-algorithms-for-ships-aircraft-and-vehicles-2-2?utm_campaign=Rendered.AI%20Awareness&utm_source=Medium&utm_medium=social%20&utm_content=GEOINT%20blog%20support%20link%20OI%20PR

L3Harris demonstrated their synthetic data capabilities at GEOINT22

The Army has two awards for synthetic data https://sam.gov/opp/9a24f0dcf26f46f7a2fd80b5493d8a8f/view and https://sam.gov/opp/66986a7ba15f893ad6466c0516b8e13e/view

Planet Labs is beta testing hyperspectral synthetic data https://www.planet.com/pulse/agile-aerospace-innovation-leveraging-synthetic-data-in-satellite-data-product-development/

The following may be useful later if we pursue synthetic satellite data projects.

It seems a lot of the synthetic satellite tools are built off of the backbone of the Digital Imaging and Remote Sensing Image Generation (DIRSIG) program, which is a free tool developed by the Rochester Institute of Technology over the last 30 years. http://dirsig.cis.rit.edu/

The RarePlanes dataset is a pre-made synthetic satellite dataset that we can test models on. https://paperswithcode.com/dataset/rareplanes-dataset

