When copying a large number of files on fast disk and network IO, I have often found it more efficient to run the copy as multiple parallel streams. Copying 4-8 sets of files at the same time can saturate IO much better and usually yields a 4x or greater improvement in transfer speed.
rsync is often the easiest choice for efficiently copying over lots of files, but unfortunately it has no built-in option for parallel transfers. So, here's a rather simple way to do this using find, xargs, and rsync.
Parallel Rsync (bash)
#!/bin/bash

# SETUP OPTIONS
export SRCDIR="/folder/path"
export DESTDIR="/folder2/path"
export THREADS="8"

# RSYNC DIRECTORY STRUCTURE
rsync -zr -f"+ */" -f"- *" $SRCDIR/ $DESTDIR/

# FOLLOWING MAY BE FASTER BUT NOT AS FLEXIBLE
# cd $SRCDIR; find . -type d -print0 | cpio -0pdm $DESTDIR/

# FIND ALL FILES AND PASS THEM TO MULTIPLE RSYNC PROCESSES
cd $SRCDIR && find . ! -type d -print0 | \
  xargs -0 -n1 -P$THREADS -I% rsync -az % $DESTDIR/%

# IF YOU WANT TO LIMIT THE IO PRIORITY,
# PREPEND THE FOLLOWING TO THE rsync & cd/find COMMANDS ABOVE:
# ionice -c2
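For example, the IO-limited transfer step might look like the following (a sketch only; ionice is Linux-specific, and the best-effort priority -n7 is an illustrative choice):

# Sketch: the same transfer step, throttled to best-effort IO priority.
# Assumes SRCDIR, DESTDIR, and THREADS are set as in the script above.
ionice -c2 -n7 rsync -zr -f"+ */" -f"- *" $SRCDIR/ $DESTDIR/
cd $SRCDIR && ionice -c2 -n7 find . ! -type d -print0 | \
  xargs -0 -n1 -P$THREADS -I% ionice -c2 -n7 rsync -az % $DESTDIR/%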
The rsyncs above can be extended to work through ssh as well. When using rsync over ssh, I've found that setting the ssh encryption type to arcfour is a critical option for speed.
rsync over ssh
rsync -zr -f"+ */" -f"- *" -e 'ssh -c arcfour' $SRCDIR/ remotehost:/$DESTDIR/ \
&& \
cd $SRCDIR && find . ! -type d -print0 | \
  xargs -0 -n1 -P$THREADS -I% rsync -az -e 'ssh -c arcfour' % remotehost:/$DESTDIR/%
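Note that OpenSSH 7.6 and later removed arcfour support, so on newer systems a fast AEAD cipher is the closest substitute. A sketch, assuming your OpenSSH build offers aes128-gcm@openssh.com:

# Sketch: modern-OpenSSH variant of the parallel transfer step above.
cd $SRCDIR && find . ! -type d -print0 | \
  xargs -0 -n1 -P$THREADS -I% \
  rsync -az -e 'ssh -c aes128-gcm@openssh.com' % remotehost:/$DESTDIR/%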
13 Comments
Anonymous
This version recurses all directories:
#!/bin/bash
# SETUP OPTIONS
export SRCDIR="/Storage/data1"
export DESTDIR="/Storage/data2"
export THREADS="32"
# FIND ALL FILES AND PASS THEM TO MULTIPLE RSYNC PROCESSES
cd $SRCDIR
if [[ $? -eq 0 ]]; then
find . -type d | xargs -I% mkdir -p /$DESTDIR/%
find . -type f | xargs -n1 -P$THREADS -I% rsync -a % /$DESTDIR/%
fi
Anonymous
Hi,
What if the "SRCDIR" is an online rsync repository, like rsync://dir.foo.com/abc, and the "DESTDIR" is empty?
Anonymous
William,
This is a slick idea. I have been doing multiple rsyncs on a shell-based engine, but using xargs -P… is much easier to set up.
Suggestion: You can avoid the preparatory step and the multiple mentions of "%" by using rsync --relative … and feeding it the output of find -mindepth n -maxdepth n to transfer source items at a given hierarchy depth n. Initiate the transfers on the source side.
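A minimal sketch of this suggestion, reusing the SRCDIR/DESTDIR/THREADS variables from the script above; the depth of 2 and remotehost are illustrative assumptions:

# Sketch only: push whole subtrees at depth 2, one rsync per subtree.
# --relative recreates each ./path/to/item under $DESTDIR on the remote side.
cd $SRCDIR && find . -mindepth 2 -maxdepth 2 | \
  xargs -n1 -P$THREADS -I% rsync -az --relative "%" remotehost:$DESTDIR/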
Shell-escaping the "%" will be useful to handle "funny" file names.
-- Michael Sternberg, Argonne National Laboratory.
Justin Azoff
One should use null-terminated strings to avoid shell escaping issues.
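For instance, the previous sketch with null-terminated handling (same illustrative variables):

cd $SRCDIR && find . -mindepth 2 -maxdepth 2 -print0 | \
  xargs -0 -n1 -P$THREADS -I% rsync -az --relative % remotehost:$DESTDIR/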
Anonymous
Yes. I agree completely with this. One should always use -print0 and -0 when using find and xargs together (unless there is a very good reason not to).
Anonymous
Be aware that this is quite inefficient if you're transferring lots of small files, as it spawns a separate rsync process for each file.
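One way to cut the per-file process overhead is to split the file list into one chunk per thread and hand each chunk to a single rsync via --files-from. A sketch, assuming GNU split and file names without embedded newlines:

# Sketch only: one rsync per chunk instead of one per file.
cd $SRCDIR && find . ! -type d > /tmp/filelist
split -n l/$THREADS /tmp/filelist /tmp/chunk.   # GNU split: N line-based chunks
for chunk in /tmp/chunk.*; do
  rsync -az --files-from="$chunk" $SRCDIR/ $DESTDIR/ &
done
wait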
Anonymous
A while back I made a program called mtsync, which is similar to rsync but uses multiple threads in a single process. mtsync can be found at https://github.com/ScottDuckworth/mtpt.
Caveats: it only works on locally mounted filesystems (not over ssh), and ACLs and extended attributes are not currently supported. But it is much faster than rsync for very large directories.
Anonymous
This is a really cool idea. I'm not great with this stuff yet and can't get it running, can anyone take a look?
This is what I'm trying to enter:
find . -print0 | \
  xargs -0 -P$THREADS -I% \
  rsync -avP --relative "%" /mnt/orabak2ELP/ELPbackup2/PRD1HR/
Nothing happens when I enter the command.
Anonymous
I just realized how insane my question was. Here's another version:
find /sourcedir -print0 | xargs -0 -n1 -P8 -I% rsync -avP % /destinationdir/
Anonymous
Thanks for this script.
But since not only plain files may be targeted, I've replaced the "find . -type f" with "find . ! -type d".
Also, for those interested, on Linux you can easily default $THREADS to the number of CPUs with something like:
THREADS=$(grep ^processor /proc/cpuinfo | wc -l)
2¢
dlb
Anonymous
I'm not sure if I can readily adapt this so that the source is remote, but I'm going to try. Are there any pitfalls I should look out for in doing so?
gc
Anonymous
This works great. My desktop computer and NAS have a full-duplex gigabit ethernet connection, but the various file transfer utilities copy one file at a time, often hitting only 25% of the potential throughput.
With this script, copying files saturates the connection, peaking at 100 MB/sec.
Thank you for sharing this, William.
Anonymous
Too bad if your destination server is an NFS4 server. Trond Myklebust's rdirplus patch to NFS4 will make your rsync remote listener take forever to produce a basic list of files on the destination server, and it gets even worse over high-latency networks.
To avoid that buggy code path, you have to avoid apps/tools that list files on your destination NFS4 server.
You may be better off just using a simple cp -rp command. Funny how you can transfer a file a few hundred megabytes in size to or from an NFS4 server in just seconds, yet listing a folder containing 50,000 files - forget it.