Open discussions on specific topics chosen by the Software Working Group from the list of SWG Topics For Discussion.

Wednesday, August 25, 2021 - GridFS & other options to store large files in databases, moderated by Mikolaj Kowalik

Description: 

  • This round table will focus on options for storing large files in databases.  How do you deal with an ever-growing number of files in your project? What tools can help you manage them? Is using a database for that a good idea?

Recording:

Attendees:

Luigi Marini

Elizabeth Yanello

Mikolaj Kowalik

Peter Groves

Galen Arnold

Chen Wang

Roland Haas

Dena Strong

John Maloney

Christopher Navarro

Charles Blatti

Vara Veera Gowtham Naraharisetty

Jong Lee

Matt Berry

Michael Bobak

Sandeep Puthanveetil Satheesan

Stephen Pietrowicz

Todd Nicholson

Santiago Nunez-Corrales

Yong Wook Kim

Rob Kooper

Jeff Terstriep

Michal Ondrejcek

Nathan Tolbert

Timothy Andrew Manning


Discussion:

Is it a good idea to put large data sets in a database?

  • Pros: Relationships and data integrity, automatic backups, better security
  • Cons: Slower to write, backups and restoration are slower, memory inefficiency, lots of layers to go through, files can't easily be shared with third parties
  • Sharing files out of a database can become a burden, although with S3 you can download much faster.  There can be issues with firewalls.
  • Amazon S3 is a fast cloud object storage service.
  • Dena Strong notes that she was in a consult literally just last week where people had been storing about 100 GB of individual PDF files in an ArcGIS database, and they're now in the process of working out how to upload their files to S3 and replace the files with pointers in ArcGIS because the performance with all the files in there was getting so bad. So the "keep the pointer in the DB and have the files in a file system" approach is getting a +1 here (sketched below).
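As a rough illustration of that "pointer in the DB, bytes in a file system or object store" pattern, here is a minimal sketch using SQLite from the Python standard library. The table, column names, and storage root are hypothetical, not anything used by the projects discussed.

```python
# Minimal sketch: store only a pointer (path) in the database,
# and keep the actual bytes on a file system or object store.
import sqlite3
from pathlib import Path

STORAGE_ROOT = Path("/data/files")        # assumed file-system location
STORAGE_ROOT.mkdir(parents=True, exist_ok=True)

conn = sqlite3.connect("metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        storage_path TEXT NOT NULL         -- pointer to the real bytes
    )
""")

def save_document(name: str, payload: bytes) -> int:
    """Write the bytes to the file system; record only the path in the DB."""
    target = STORAGE_ROOT / name
    target.write_bytes(payload)
    cur = conn.execute(
        "INSERT INTO documents (name, storage_path) VALUES (?, ?)",
        (name, str(target)),
    )
    conn.commit()
    return cur.lastrowid

def load_document(doc_id: int) -> bytes:
    """Look up the pointer, then read the bytes from the file system."""
    (path,) = conn.execute(
        "SELECT storage_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return Path(path).read_bytes()
```

The database stays small and fast to back up, while the large files can be served, moved, or shared directly from wherever the pointer says they live.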


But I do want to keep files in a database

  • MongoDB (GridFS)
    • Divides the file into chunks, each stored as a separate document (a pymongo sketch follows after this list)
    • Grants access to arbitrary sections of files
    • Allows storing files larger than the 16 MB BSON document limit
    • Can provide redundancy
    • Does not support multi-document transactions
  • RDBMS
    • Unstructured data can be stored as blobs (a PostgreSQL example also follows after this list)
    • FILESTREAM (SQL Server)
    • TOAST (PostgreSQL)
  • Clowder uses huge datasets.
    • It takes a long, long time to transfer data.
    • MongoDB keeps cached files in memory, slowing things down to a crawl
    • You can have user data stored in different shards
    • Clowder uses MongoDB 3.x
    • We also get JSON documents larger than 16 GB (should they be stored as blobs, or should they be made searchable?)
    • Rob Kooper notes that the maximum BSON document size is 16 megabytes.  This is true for Mongo in general
  • If you put a bunch of files in Postgres, has it ever failed?  The general consensus is to keep the files separate, for fear of data loss.  When the data is too large, it is very difficult to retrieve it quickly, so storing it as an object rather than a blob, depending on the infrastructure, makes sense.
  • Dena Strong adds that a huge benefit of some of these that often goes overlooked is long-term path compatibility. The reason we rejected Box/Google/OneDrive for the PDF project is because all their URLs are keyboardsmash, and S3 lets us say "this is the directory structure we're uploading, so here is the human-readable path to the file." So if ten years from now we need to move that data to another location, we can script changing from an S3 path to the next server's path. There's no way to get from a Box keyboardsmash URL to a Google/OneDrive keyboardsmash for the same file.

    Keyboardsmash is a hash of some type that points to a location in that particular database, but a human being can't look at something like 23n4d0xe5gf129dt20 and see "obviously that is the file stored at /home/subdir/myfile.png". And no two systems will make the same keyboardsmash for the same file.
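For the GridFS option above, here is a minimal pymongo sketch. The connection string, database name, and file names are placeholders, not anything from the discussion; GridFS itself handles splitting the file into chunk documents (255 kB each by default) plus one metadata document per file.

```python
# Hedged sketch of storing and retrieving a large file with GridFS via pymongo.
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB
fs = gridfs.GridFS(client["mydb"])                  # hypothetical database name

# Store: GridFS splits the file into chunks in fs.chunks
# and writes one metadata document to fs.files.
with open("large_dataset.bin", "rb") as f:
    file_id = fs.put(f, filename="large_dataset.bin")

# Retrieve the whole file...
data = fs.get(file_id).read()

# ...or seek to an arbitrary section without reading everything before it.
grid_out = fs.get(file_id)
grid_out.seek(1024 * 1024)        # jump 1 MiB into the file
chunk = grid_out.read(4096)       # read 4 KiB from that offset
```

The seek/read at the end is the "access to arbitrary sections of files" feature noted above, since only the chunks covering that byte range need to be fetched.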
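For the RDBMS option, PostgreSQL transparently moves large bytea values into TOAST storage, so application code just writes and reads an ordinary column. A hedged psycopg2 sketch, with a hypothetical table and connection string:

```python
# Hedged sketch of storing file contents in a PostgreSQL bytea column;
# Postgres TOASTs (compresses / stores out-of-line) large values automatically.
import psycopg2

conn = psycopg2.connect("dbname=filestore user=postgres")   # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blobs (
            id   serial PRIMARY KEY,
            name text NOT NULL,
            data bytea NOT NULL
        )
    """)
    with open("report.pdf", "rb") as f:
        cur.execute(
            "INSERT INTO blobs (name, data) VALUES (%s, %s) RETURNING id",
            ("report.pdf", psycopg2.Binary(f.read())),
        )
        blob_id = cur.fetchone()[0]

    # Reading it back pulls the whole value into memory at once,
    # which is one of the "cons" raised in the discussion above.
    cur.execute("SELECT data FROM blobs WHERE id = %s", (blob_id,))
    payload = bytes(cur.fetchone()[0])
```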



Object storage solutions

  • Key features
    • Objects are accessed via a RESTful API
    • Scalable
    • Distributed access
    • Very useful for handling static data, but objects cannot be modified in place
  • Examples: Ceph, MinIO, OpenIO, OpenStack Swift, Amazon S3, Google Cloud Storage, Rucio, Open Storage Network (OSN)
  • JD Maloney says MinIO is very useful and he is very happy with it (a boto3 sketch against an S3-compatible endpoint follows below).
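Because most of the stores listed above expose an S3-compatible RESTful endpoint, a single client can target Amazon S3, MinIO, Ceph RGW, and others. A minimal boto3 sketch, with placeholder endpoint, credentials, and bucket:

```python
# Hedged sketch of putting and getting an object over an S3-compatible REST API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",     # e.g. a local MinIO server
    aws_access_key_id="minioadmin",           # assumed credentials
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="project-data")

# Upload: objects are written whole; existing objects are replaced, not modified.
s3.upload_file("results.csv", "project-data", "2021/08/results.csv")

# Download by the same human-readable key, no database lookup required.
s3.download_file("project-data", "2021/08/results.csv", "results_copy.csv")

# Hand a time-limited URL to a third party instead of shipping the file.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "project-data", "Key": "2021/08/results.csv"},
    ExpiresIn=3600,
)
print(url)
```

The presigned URL at the end is one way to address the "files can't easily be shared with third parties" concern from the pros/cons list.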

It's incredibly easy to set up MinIO!

Let's discuss the cost of using these storage solutions.  The nice thing about MinIO is that it's affordable.

JD is looking to combine MinIO and Lustre.  We hope to use S3, but we're not there yet... we'd like to put this on RADIANT (Taiga)! Ceph is also an option, but it can't handle the size of Delta.

OSN runs Ceph

SDSC is switching from Swift to Ceph

Nightingale has many security requirements due to HIPAA.



Links Shared During the Talk:

https://docs.google.com/presentation/d/1-jsrWwQ-ngdfPXHFsDJOZ37cGTk33Mpuu5K3aLWV3qI/edit?usp=sharing






If you are interested in contributing to a Round Table, please see these links:

Round Table Discussions

SWG Topics For Discussion



