When copying a large number of files over fast disk and network IO, I have often found it more efficient to run the copy as multiple parallel streams. Copying 4-8 sets of files at the same time can better saturate IO and usually yields a 4x or better improvement in transfer speed.

rsync is often the easiest choice for efficiently copying over lots of files, but unfortunately it has no built-in option for parallel transfers. So, here's a rather simple way to get them using find, xargs, and rsync.

Parallel Rsync (bash)
#!/bin/bash
 
# SETUP OPTIONS
export SRCDIR="/folder/path"
export DESTDIR="/folder2/path"
export THREADS="8"

# RSYNC DIRECTORY STRUCTURE
rsync -zr -f"+ */" -f"- *" $SRCDIR/ $DESTDIR/ \
# FOLLOWING MAYBE FASTER BUT NOT AS FLEXIBLE
# cd $SRCDIR; find . -type d -print0 | cpio -0pdm $DESTDIR/
# FIND ALL FILES AND PASS THEM TO MULTIPLE RSYNC PROCESSES
cd "$SRCDIR"  &&  find . ! -type d -print0 | xargs -0 -n1 -P"$THREADS" -I% rsync -az "%" "$DESTDIR/%"

 
# IF YOU WANT TO LIMIT THE IO PRIORITY,
# PREPEND THE FOLLOWING TO THE rsync & cd/find COMMANDS ABOVE:
#   ionice -c2
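
For example, a minimal sketch of the same commands with best-effort IO priority (ionice class 2) applied to each process:

ionice -c2 rsync -zr -f"+ */" -f"- *" "$SRCDIR"/ "$DESTDIR"/
cd "$SRCDIR"  &&  find . ! -type d -print0 | \
    xargs -0 -n1 -P"$THREADS" -I% ionice -c2 rsync -az "%" "$DESTDIR/%"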
The rsyncs above can be extended to work over ssh as well. When using rsync over ssh, I've found that setting the ssh cipher to arcfour is a critical option for speed. (Arcfour is a weak cipher and has been removed from modern OpenSSH; see the variant below.)

rsync over ssh
rsync -zr -f"+ */" -f"- *" -e 'ssh -c arcfour' $SRCDIR/ remotehost:/$DESTDIR/ \
  && \
cd $SRCDIR  &&  find . ! -type d -print0 | xargs -0 -n1 -P$THREADS -I% rsync -az -e 'ssh -c arcfour' % remotehost:/$DESTDIR/% 
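
Since arcfour is no longer available in current OpenSSH, here is a hedged equivalent, assuming OpenSSH 6.2 or later where the fast aes128-gcm@openssh.com cipher exists:

export RSYNC_RSH='ssh -c aes128-gcm@openssh.com -o Compression=no'
rsync -zr -f"+ */" -f"- *" "$SRCDIR"/ remotehost:"$DESTDIR"/ \
  && \
cd "$SRCDIR"  &&  find . ! -type d -print0 | \
    xargs -0 -n1 -P"$THREADS" -I% rsync -az "%" remotehost:"$DESTDIR/%"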

13 Comments

  1. Anonymous

    This version recurses all directories:

    #!/bin/bash

    # SETUP OPTIONS
    export SRCDIR="/Storage/data1"
    export DESTDIR="/Storage/data2"
    export THREADS="32"

    # FIND ALL FILES AND PASS THEM TO MULTIPLE RSYNC PROCESSES
    cd $SRCDIR
    if [[ $? -eq 0 ]]; then
            find . -type d | xargs -I% mkdir -p /$DESTDIR/%
            find . -type f | xargs -n1 -P$THREADS -I% rsync -a % /$DESTDIR/%
    fi

    1. Anonymous

      Hi,
      What if the "SRCDIR" is an online rsync repository,
      like rsync://dir.foo.com/abc,
      and "DESTDIR" is empty?

  2. Anonymous

    William,

    This is a slick idea. I have been doing multiple rsyncs on a shell-based engine, but using xargs -P… is much easier to set up.

    Suggestion: You can avoid the preparatory step and multiple mentions of "%" by using rsync --relative … and feed it the output of find -mindepth n -maxdepth n  to transfer source items at a given hierarchy depth n.

    Initiate the transfers on the source side:

    cd $SRCDIR
    export RSYNC_RSH="ssh -c arcfour -o Compression=no"
    ## uncomment and adjust as needed:
    #rsync_more_opts=" -Siv --delete"
    find dir1 dir2 dir3 ...  -mindepth 1 -maxdepth 1 | \
        xargs -n1 -P$THREADS -I% \
            rsync -a $rsync_more_opts --relative "%" dest_machine:$DESTDIR

    Shell-escaping the "%" will be useful to handle "funny" file names.

     

    -- Michael Sternberg, Argonne National Laboratory.

  3. Unknown User (jazoff)

    One should use null-terminated strings to avoid shell-escaping issues:

    find ... -print0 | xargs -0 ...

    1. Anonymous

      Yes. I agree completely with this. One should always use -print0 and -0 when using find and xargs together (unless there is a very good reason not to).

  4. Anonymous

    Be aware that this is quite inefficient when transferring lots of small files, as it spawns a separate rsync process for every file.
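
    One possible mitigation is to batch many files into each rsync invocation via its --files-from option, which reads a null-separated list when --from0 is given. A minimal sketch, assuming the exported SRCDIR, DESTDIR, and THREADS variables from the script above:

    cd "$SRCDIR" && find . ! -type d -print0 | \
        xargs -0 -n 100 -P "$THREADS" sh -c \
            'printf "%s\0" "$@" | rsync -a --from0 --files-from=- . "$DESTDIR"/' sh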

  5. Anonymous

    A while back I made a program called mtsync which is similar to rsync but uses multiple threads in a single process.  mtsync can be found at https://github.com/ScottDuckworth/mtpt.

    Caveats: it only works on locally mounted filesystems (not over ssh), and ACLs and extended attributes are not currently supported. But it is much faster than rsync for very large directories.

  6. Anonymous

    This is a really cool idea. I'm not great with this stuff yet and can't get it running; can anyone take a look?

    This is what I'm trying to enter:

    find . -print0 | \
        xargs -0 -P$THREADS -I% \
            rsync -avP --relative "%" /mnt/orabak2ELP/ELPbackup2/PRD1HR/

    Nothing happens when I enter the command. 

    1. Anonymous

      I just realized how insane my question was. Here's another version:

      find /sourcedir -print0 | xargs -P8 rsync -avP /destinationdir
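
      A working form of that, sketched here following the article's pattern with the same example paths, would be something like:

      cd /sourcedir && find . ! -type d -print0 | \
          xargs -0 -n1 -P8 -I% rsync -avP "%" /destinationdir/"%"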

  7. Anonymous

    Thanks for this script.

    But since not only plain files may be targeted, I've changed the "find . -type f" to "find . ! -type d".

    Also, for those interested, on Linux you can easily default $THREADS to the number of CPUs with something like:

    THREADS=$(grep ^processor /proc/cpuinfo | wc -l)
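
    (On Linux systems with GNU coreutils, THREADS=$(nproc) does the same.)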

    (wink)

    dlb

  8. Anonymous

    I'm not sure if I can readily adapt this so that the source is remote, but I'm going to try. Are there any pitfalls I should look out for in doing so?

    gc
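
    One hedged sketch for pulling from a remote source, assuming SRCDIR names the remote directory and DESTDIR/THREADS are set locally: run the find over ssh, then pull each file (paths containing spaces would also need rsync's -s/--protect-args or extra quoting on the remote side).

    ssh remotehost "cd $SRCDIR && find . ! -type d -print0" | \
        xargs -0 -n1 -P"$THREADS" -I% rsync -az remotehost:"$SRCDIR/%" "$DESTDIR/%"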

  9. Anonymous

    This works great. My desktop computer and NAS have a full-duplex gigabit ethernet connection, but the various file transfer utilities copy one file at a time, often hitting only 25% of the link's potential.

    With this script, copying files saturates the connection, peaking at 100 MB/sec.

    Thank you for sharing this, William.

  10. Anonymous

    Too bad if your destination server is an NFS4 server. Trond Myklebust's readdirplus patch to NFS4 will make your rsync remote listener take forever to produce a basic list of files on the destination server, and it gets even worse over high-latency networks.

    To avoid that buggy code, you have to avoid apps/tools that list files on your destination NFS4 server.

    You could be better off just using a simple cp -rp command. Funny how you can transfer a file a few hundred megabytes in size in just seconds to and from an NFS4 server, but listing a folder containing 50,000 files - forget it.