(originally for 2016 Faculty Fellow proposal with J. Schneider)

Requirements Gathering

XSEDE musings

What is the data size on disk?  

Does it expand more in RAM when running?

Does the program need to load all the data at once, or can it either stream the data or page it?

How much data gets generated?

Do you have a place to move the data after the allocation ends?

Can the applications be built from source?

What language is the source in and does it require a particular compiler version?

What are its dependencies?

Does it work on Linux?

Is it something which might benefit from running in a virtual machine?

Is it multi-threaded or parallelized?

Can it be run completely in batch or does it require a GUI to start/stop?

Would a GUI have to go over X11 or is there a client/server option?

Is the application scriptable?

How long does any processing currently take, say, on a standard-issue laptop?

ECSS/ESRT support can be justified in a lot of ways:

  • Porting code

  • Parallelizing code

  • Benchmarking/profiling code

  • Optimizing code

  • Comparing application performance

  • Improving workflows

  • Visualization prototyping

Interactivity

This is in the context of running any visualization locally; I would not generally recommend trying to do interactive visualization remotely from the supercomputer. Interaction could be purely visual, like zooming, or analytic, like clicking on a node and highlighting similar nodes, etc.

Anecdotally, the things which slow down interactivity are

  • Large number of edges, say 33% of a complete graph

  • Graph analytics

Node count on the order of millions is not that big a deal, but edge count on the order of millions is. For scale, a complete graph on just 10,000 nodes already has roughly 50 million edges, so even a fraction of a complete graph gets expensive very quickly.

Examples of Visuals

  • Labels appearing on mouseover

  • Dragging elements around

  • Zooming in and out

  • Selecting particular elements

Examples of Analytics

  • Shortest paths between selected nodes

  • Simple stats like median degree

  • Finding subgraphs based on certain properties

  • Complex stats like clustering
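
To make the analytics listed above concrete, here is a minimal sketch using Python and NetworkX. The library choice, the example graph, and the degree threshold are my own illustrative assumptions, not anything prescribed by these notes.

```python
# Minimal sketch of the analytics listed above, using NetworkX (assumed library).
# The example graph and the degree threshold are hypothetical placeholders.
import statistics
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the real network

# Shortest path between two selected nodes
path = nx.shortest_path(G, source=0, target=33)

# Simple stats like median degree
median_degree = statistics.median(d for _, d in G.degree())

# Finding subgraphs based on certain properties (here: node degree >= 5)
dense_nodes = [n for n, d in G.degree() if d >= 5]
H = G.subgraph(dense_nodes)

# Complex stats like clustering
avg_clustering = nx.average_clustering(G)

print(path, median_degree, H.number_of_nodes(), avg_clustering)
```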

Analytics/Statistics vs Visualization

Metrics like PageRank, clustering, etc.: the more of these that can be computed as a preprocessing step, the better off we are, and likewise if the discovery task maps directly onto one of these known algorithms. If the discovery task is "I don't know what I'm looking for," then visualization may just transfer the needle-in-a-haystack problem from a numerical/statistical problem to a visual one. A needle can be made to stand out if and only if we know what constitutes a needle! Otherwise, we're relying on serendipity, which does happen occasionally.
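
A minimal sketch of the "precompute, then visualize" idea, assuming NetworkX for the analytics and GraphML as the interchange format (both Gephi and Cytoscape read GraphML); the graph and file names are hypothetical.

```python
# Sketch: compute PageRank and clustering as a preprocessing step, attach the
# results as node attributes, and hand the annotated graph to an off-the-shelf
# visualization tool via GraphML. Library choice and file names are assumptions.
import networkx as nx

G = nx.gnm_random_graph(1000, 5000, seed=0)  # stand-in for the real network

pagerank = nx.pagerank(G)
clustering = nx.clustering(G)

nx.set_node_attributes(G, pagerank, "pagerank")
nx.set_node_attributes(G, clustering, "clustering")

# Gephi and Cytoscape can both map these attributes to node size/color on load.
nx.write_graphml(G, "network_annotated.graphml")
```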

Resources

XSEDE Hardware

Bridges @ PSC - Large memory machine, works well with Java apps which want to load the entire dataset at once.

Wrangler @ TACC/IU - Data analytics machine, whatever that means. This is a really new approach for XSEDE so I’m not sure what fits it well.

Aesthetics/Survey

http://www.visualcomplexity.com/vc/ - there is a book of the same name; book links are on his site:

http://www.visualcomplexity.com/vc/books.cfm

http://yifanhu.net/GALLERY/GRAPHS/, THE super-large graph site. Many of these are polygonal meshes, rendered in batch with GraphViz.

http://vcg.informatik.uni-rostock.de/~hs162/treeposter/poster.html#Yang2002

HCI for discoverability is not something I know much about.

The standard textbook on infovis is Information Visualization: Perception for Design by Colin Ware. Has an amazing bibliography.

Software

Off the Shelf

For sustainability, this is probably the way to go.

  • Gephi - Was becoming well known, but the development has stalled. Java-based, slow for dense graphs.

  • Cytoscape - Big in the systems biology community. Can be made to do non-bio networks; also Java-based, also slow for dense graphs.

  • ParaView - Scivis package, would be able to draw any geometry, but wouldn’t supply graph analytics.

  • VisIt - Scivis package with capabilities similar to ParaView. These two would be the only realistic options for parallel, scalable vis.


Commercial

Not sure how much Tableau, Mathematica, or Watson Analytics can do with graphs, or at what scale. Tableau and WA can do dashboards relatively easily, though, so they might be worth investigating for the UI.

Custom

Hesitate to recommend because of sustainability issues.

  • VTK - C++/Python. Has some infovis support for drawing graphs. Can handle larger graphs than Gephi, Cytoscape.

  • Processing - Java dialect; unfortunately very slow, usable at maybe the ~100-node, 500-edge level

  • D3.js - JavaScript; great for web interactivity, but limited by what the browser can handle.

  • GraphViz - Main option for batch rendering of large graphs.
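
For the GraphViz batch path specifically, a minimal sketch of what that workflow might look like: write a DOT file and render it with sfdp, GraphViz's large-graph layout engine. The edge list and file names are hypothetical, and this assumes GraphViz is installed and on the PATH.

```python
# Sketch of batch-rendering a graph with GraphViz's sfdp layout engine.
# The edge list and file names are illustrative assumptions.
import subprocess

edges = [(1, 2), (2, 3), (3, 1), (3, 4)]  # placeholder edge list

with open("graph.dot", "w") as f:
    f.write("graph G {\n")
    for a, b in edges:
        f.write(f"  {a} -- {b};\n")
    f.write("}\n")

# sfdp is GraphViz's scalable force-directed layout. No GUI involved,
# so this can run entirely in batch on a compute node.
subprocess.run(["sfdp", "-Tpng", "-o", "graph.png", "graph.dot"], check=True)
```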

If I had to guess, the MEDLINE app is all custom Java… I'm not a Java person, but it's possible they had to roll their own graph-drawing algorithms. I would guess that it doesn't scale well.


Plotting and Charting

Some people see this as visualization. I see it as distinct because there's a ton of really fantastic software available for 2D charts and plots: line graphs, histograms, etc. I would hesitate very strongly to ever custom-make one of these standard charts, beyond tweaking the colors, fonts, etc.
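
To make the point concrete, a standard chart is only a few lines in any mainstream plotting library. A minimal matplotlib sketch, with library choice, data, and labels all my own placeholders:

```python
# Sketch: a standard histogram in matplotlib, to illustrate that off-the-shelf
# charting needs only a few lines. Data and labels are placeholders.
import matplotlib.pyplot as plt
import numpy as np

degrees = np.random.default_rng(0).poisson(lam=4, size=1000)  # fake degree data

plt.hist(degrees, bins=range(0, 15))
plt.xlabel("Node degree")
plt.ylabel("Count")
plt.title("Degree distribution (placeholder data)")
plt.savefig("degree_hist.png", dpi=150)
```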

HPC and Scalability

For graph analytics, there is the Graph 500. It's not clear that the machines on that list are much different from the usual suspects on the Top 500, especially toward the top. There are some unique single-node machines that score relatively well, though. The benchmark doesn't include any visualization.

To the best of my knowledge, no one on Blue Waters is doing any large graphs. I can’t speak for all of XSEDE, but I’m not aware of any. I certainly didn’t see any come through ESRT in the last year, so if there are any, they’re self-sufficient.

One of the main issues with large graphs is that the node density can easily approach the pixel density of the display (a 4K screen has only about 8.3 million pixels), in which case it's not clear that visualization helps. I would consider contacting TACC or the EVL at UIC for access to huge-pixel-count display walls.

Case Study

Ruby Mendenhall, et al: Rescued History

Topic modeling done with Mallet, a Java app from UMass. Training is multithreaded but not parallelized across nodes. It wants to load all the data at once, so it was limited by how much RAM was on a node of the UI campus cluster. I believe the largest training set done there was ~1000 documents (but don't quote me on that). This was when XSEDE came into play: startup and then research allocations on PSC large memory machines. Trained on ~22K documents. It was an iterative process because, even with Harriet Green's guidance, there were a lot of irrelevant documents.
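
For reference, a sketch of what driving a Mallet training run might look like. The install path, corpus directory, output names, and parameter values here are hypothetical; check the flags against the installed Mallet version.

```python
# Sketch of a Mallet topic-modeling run driven from Python. The Mallet install
# path, corpus directory, output names, and parameter values are assumptions.
import subprocess

MALLET = "/path/to/mallet/bin/mallet"  # hypothetical install location

# Import the document collection into Mallet's binary format.
subprocess.run([MALLET, "import-dir",
                "--input", "documents/",
                "--output", "corpus.mallet",
                "--keep-sequence",
                "--remove-stopwords"], check=True)

# Train the topic model; --num-threads uses multiple cores on one node,
# but the job is not distributed across nodes (hence the RAM constraint).
subprocess.run([MALLET, "train-topics",
                "--input", "corpus.mallet",
                "--num-topics", "100",
                "--num-threads", "8",
                "--output-topic-keys", "topic_keys.txt",
                "--output-doc-topics", "doc_topics.txt"], check=True)
```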

Generated:

Topic word lists - words ranked by how likely they are to appear in a topic. Originally visualized these with word clouds and bar charts, but then settled on single-layer treemaps. One of these is visible at the NSF link. These were used to help interpret the topics.

Topic vectors - per-document weights for how much each document contributed to each of the 100 topics. From these, I made a topic-topic correlation matrix and drew a ring graph. This is the other visualization at the NSF link; the real version is interactive. Not sure if this was actively used; it was more a proof of concept.
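
A minimal sketch of the correlation step, assuming the topic vectors have already been loaded into a documents-by-topics array; the array, the threshold, and the edge representation are illustrative assumptions.

```python
# Sketch: build a topic-topic correlation matrix from a documents x topics
# matrix of topic weights, then keep strong correlations as graph edges.
# The input array, threshold, and edge format are illustrative assumptions.
import numpy as np

# Placeholder for the real ~22K x 100 document-topic matrix.
doc_topics = np.random.default_rng(1).dirichlet(np.ones(100), size=22000)

# Correlate topics across documents: corrcoef expects variables in rows,
# so transpose to get a 100 x 100 topic-topic correlation matrix.
corr = np.corrcoef(doc_topics.T)

# Keep only strong positive correlations as edges for a ring graph.
threshold = 0.3
edges = [(i, j, corr[i, j])
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if corr[i, j] > threshold]
```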

I have many other visualizations from this project which were done as experiments/one-offs for various purposes.

Towards the end of the project, about the same time Ismini came up with the suggestion of using stats to show document relations, I had the opportunity to meet with Danyel Fisher from Microsoft Research. He shared something with me that really solidified my understanding of the role of infovis, which can be seen on slide 10 here:

http://helper.ipam.ucla.edu/publications/caws2/caws2_13111.pdf

Currently I'm working with Nicole Brown, who started as a grad student with Ruby and now has a post-doc at NCSA. She successfully got an XSEDE startup allocation. She's also doing topic modeling on JSTOR documents, but we're taking a completely different approach to visualization, not network graphs. Instead, we're doing a type of dimension reduction to plot the documents directly in 3D, hoping that clusters will pop out naturally.
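
A minimal sketch of that kind of reduction, using PCA as a stand-in (these notes don't name the actual technique we're using), with a fake feature matrix in place of the real document vectors:

```python
# Sketch: reduce per-document feature vectors to 3D coordinates so the
# documents can be plotted directly, hoping clusters separate visually.
# PCA is a stand-in for whatever reduction is actually used; data is fake.
import numpy as np
from sklearn.decomposition import PCA

doc_features = np.random.default_rng(2).random((5000, 100))  # placeholder

coords_3d = PCA(n_components=3).fit_transform(doc_features)  # shape (5000, 3)
```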

