Trebuchet Library

Trebuchet is a multi-scheme file-transfer API and client library written in Java.

Purpose

The principal aim of this library is to provide an abstraction layer over the standard file-related operations (such as directory listing, directory creation, file transfer and file deletion) which allows switching between protocols without alteration of either source code or scripting. The library is also designed to be extensible so that new protocol support can be added in a reasonably clean manner.

Currently, there are two ways in which Trebuchet can be utilized:

As normal Java package imports (i.e., programmatic calls to the library by other Java code);
Through the available Ogrescript tasks: see the Ogrescript Trebuchet plugin.

Certain capabilities, such as restarting operations and inspecting the binary "cache" files, are also available from the command line. There are plans to provide, in the near future, an Eclipse-based RCP Trebuchet client for easy management of file transfers across multiple hosts.

Features

The following are some of the more salient features offered by the Trebuchet library/tasks:

Full support for UNIX-style operations (ls, touch, mkdir, cp, mv, rm) locally and via the SSH/SCP protocols.
Support for all of these operations except touch via GRIDFTP and WEBDAV.
GSI/certificate-based authentication/authorization (SSH and GRIDFTP).
Automatic one-hop handling of third-party transfers over mixed protocols (e.g., SCP on host A to GRIDFTP on host B).
Two ways of achieving file transfer or deletion:
- By specifying exact locations/paths;
- By scanning or listing.
Fully recursive pattern-based scanning (using the '*' and '**' wildcard characters; see UriPattern).
All operations can be customized (using the available settings appropriate to the given protocol) via a general-purpose configuration object.
All GRIDFTP options available in the jglobus library are exposed for configurability; in particular, optimization settings such as:
- TCP buffer size;
- setting active mode on the target.
Automated support for both LIST and MLST/MLSD (GRIDFTP); options for forcing existence checking through the LIST command.
Automated staging of files from UNITREE tape archive using GRIDFTP (= MSSFTP).
Full access (i.e., by non-Trebuchet-related code), if so desired, to source and target paths during and after operations.
Thread-pooled parallel copy operations.
Automated use of multiple GRIDFTP connections for a given endpoint (as specified by the SPAS command), when available, for non-striped operations.
Fail-over and retry capabilities on a file-by-file basis.
Flexibility in the kind and number of events the user can opt to receive.
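
To illustrate the pattern-based scanning feature listed above, the following is a minimal sketch of wildcard matching. It is not Trebuchet's UriPattern implementation: the class GlobSketch and its methods are hypothetical, and the semantics are an assumption (that '*' matches within a single path segment and '**' matches across segments). Consult UriPattern for the actual rules.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of '*' / '**' matching; not the library's UriPattern. */
final class GlobSketch {

    /** Translates a wildcard pattern into a regular expression (assumed semantics). */
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            if (c == '*') {
                if (i + 1 < glob.length() && glob.charAt(i + 1) == '*') {
                    sb.append(".*");      // '**': any characters, including '/'
                    i += 2;
                } else {
                    sb.append("[^/]*");   // '*': any characters except '/'
                    i++;
                }
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));  // literal character
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns true if the whole path matches the pattern. */
    static boolean matches(String glob, String path) {
        return toRegex(glob).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/data/*.txt", "/data/a.txt"));
        System.out.println(matches("/data/*.txt", "/data/sub/a.txt"));
        System.out.println(matches("/data/**/*.txt", "/data/sub/deep/a.txt"));
    }
}
```

Under these assumed semantics, a single '*' does not descend into subdirectories, whereas '**' does, which is what makes a scan recursive.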

Design overview

Scheme-to-protocol mapping

The basis for Trebuchet's multi-protocol functionality lies in mapping (via the Eclipse-RCP extension-point mechanism) URI schemes to a set of implementations.

As an example, let us consider the ssh protocol; in order to support the available operations (in this case, all of them), the following classes needed to be implemented:

Function	Abstract Class	Concrete Class
exists, is file, is dir	`ncsa.tools.trebuchet.core.clients.VerifyClient`	`ncsa.tools.trebuchet.ssh.clients.SSHVerifyClient`
ls	`ncsa.tools.trebuchet.core.clients.ListClient`	`ncsa.tools.trebuchet.ssh.clients.SSHListClient`
touch	`ncsa.tools.trebuchet.core.clients.TouchClient`	`ncsa.tools.trebuchet.ssh.clients.SSHTouchClient`
mkdir	`ncsa.tools.trebuchet.core.clients.MkdirClient`	`ncsa.tools.trebuchet.ssh.clients.SSHMkDirClient`
rm	`ncsa.tools.trebuchet.core.clients.DeleteClient`	`ncsa.tools.trebuchet.ssh.clients.SSHDeleteClient`
cp, mv	`ncsa.tools.trebuchet.core.clients.CopyClient`	`ncsa.tools.trebuchet.ssh.clients.SSHCopyClient`

Then a scheme-to-client mapping needed to be provided via extensions to the ncsa.tools.trebuchet.core.clientTypes extension point:

Operation	Source Scheme	Target Scheme	Client
verify		`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHVerifyClient`
list		`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHListClient`
touch		`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHTouchClient`
mkdir		`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHMkDirClient`
delete		`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHDeleteClient`
copy	`file`	`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHCopyClient`
copy	`ssh`, `scp`, `gsissh`, `gsiscp`	`file`	`ncsa.tools.trebuchet.ssh.clients.SSHCopyClient`
copy	`ssh`, `scp`, `gsissh`, `gsiscp`	`ssh`, `scp`, `gsissh`, `gsiscp`	`ncsa.tools.trebuchet.ssh.clients.SSHCopyClient`

This mapping is referred to when Trebuchet processes a URI or URIs for a given operation, so that the URI schemes indicate which client to use for the operation.
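
The lookup described above can be pictured as a table keyed by operation and URI schemes. The following is a minimal sketch, not Trebuchet's actual code: the class ClientRegistrySketch and its methods are hypothetical, and in the library itself the mapping is contributed through the extension point rather than registered programmatically. Only the scheme names and client class names are taken from the tables above.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: maps (operation, source scheme, target scheme) to a client class name. */
final class ClientRegistrySketch {
    private static final List<String> SSH_SCHEMES = Arrays.asList("ssh", "scp", "gsissh", "gsiscp");
    private static final List<String> NONE = Arrays.asList("");  // single-URI operations have no source scheme
    private static final String PKG = "ncsa.tools.trebuchet.ssh.clients.";

    private final Map<String, String> clients = new HashMap<>();

    private static String key(String operation, String source, String target) {
        return operation + "|" + source + "|" + target;
    }

    /** Registers one table row: every source/target combination maps to the same client. */
    void register(String operation, List<String> sources, List<String> targets, String clientClass) {
        for (String s : sources) {
            for (String t : targets) {
                clients.put(key(operation, s, t), clientClass);
            }
        }
    }

    /** Resolves the client from the URI schemes; pass null as source for single-URI operations. */
    Optional<String> lookup(String operation, URI source, URI target) {
        String s = (source == null) ? "" : source.getScheme();
        return Optional.ofNullable(clients.get(key(operation, s, target.getScheme())));
    }

    /** Builds the SSH rows shown in the mapping table. */
    static ClientRegistrySketch sshDefaults() {
        ClientRegistrySketch r = new ClientRegistrySketch();
        r.register("verify", NONE, SSH_SCHEMES, PKG + "SSHVerifyClient");
        r.register("list", NONE, SSH_SCHEMES, PKG + "SSHListClient");
        r.register("touch", NONE, SSH_SCHEMES, PKG + "SSHTouchClient");
        r.register("mkdir", NONE, SSH_SCHEMES, PKG + "SSHMkDirClient");
        r.register("delete", NONE, SSH_SCHEMES, PKG + "SSHDeleteClient");
        List<String> file = Arrays.asList("file");
        r.register("copy", file, SSH_SCHEMES, PKG + "SSHCopyClient");
        r.register("copy", SSH_SCHEMES, file, PKG + "SSHCopyClient");
        r.register("copy", SSH_SCHEMES, SSH_SCHEMES, PKG + "SSHCopyClient");
        return r;
    }

    public static void main(String[] args) {
        ClientRegistrySketch r = sshDefaults();
        System.out.println(r.lookup("copy", URI.create("file:///tmp/a.txt"), URI.create("gsiscp://hostA/tmp/a.txt")));
        System.out.println(r.lookup("copy", URI.create("file:///tmp/a.txt"), URI.create("file:///tmp/b.txt")));
    }
}
```

In this sketch the second lookup comes back empty, because the SSH mapping contains no row for a `file`-to-`file` copy; another protocol's mapping would have to supply it.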

Two other classes, PooledClientGenerator and ListToCopyConverter, also need to be mapped for each protocol, but the default implementations of these classes are usually sufficient. Depending on the file system, a special parser may also be necessary for interpreting directory-listing lines, though in most cases the core parsers will work. Finally, for each scheme associated with the protocol, a small definition class implementing ncsa.tools.trebuchet.schemes.IScheme needs to be created; for Trebuchet's internal use, this class identifies the underlying protocol and the operations which that protocol can support.

The schemes which have been implemented in the current version of Trebuchet are listed here.

Operation Caches

It is not necessary here to describe all the layers which constitute Trebuchet's architecture, but some notion of the bottom-most layer is useful for an understanding of how Trebuchet works. This layer consists of a binary file for the operation, accessed using Java's NIO library, and abstracted out as a Trebuchet Cache object. This is admittedly something of a misnomer, since no entries are actually being cached in memory, and therefore no fixed size is maintained by evicting entries from it; but it is cache-like in that it provides an access point through which all aspects of an operation pass and is usually transient – i.e., it is deleted at the end of the operation. The cache can, however, be retained after the operation and used to restart or retry the same operation without having to generate the listings from scratch or redo the successful transfers.

There are two standard caches, one for listing or scanning operations, and one for copy or transfer operations. Moreover, when a copy operation relies on scanning or listing to provide it with the source locations, there is a conversion procedure (supplied by the ListToCopyConverter mentioned above) for creating the copy cache entries from the associated list cache entries. Scanned operations for touch, delete and copy by default do the conversion asynchronously using a listener API: as a list entry is added to the list cache, the listener passes it to the converter to be added to the copy cache, with another listener responsible for passing off the copy entries to the appropriate client as they become available. There is an option to override this behavior such that the entire listing or scanning is done first, but in most cases the parallelized list-convert-copy is to be preferred.
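
The asynchronous list-convert-copy flow described above can be sketched with plain listeners and a thread pool. This is an illustrative sketch only: the class names, the in-memory NotifyingCache (Trebuchet's caches are disk-backed), and the string-based entries are all hypothetical stand-ins, and the "copy" step merely records what it would transfer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of the listener-driven list -> convert -> copy pipeline. */
final class ListConvertCopySketch {

    /** In-memory stand-in for a cache that notifies listeners as each entry is added. */
    static final class NotifyingCache<E> {
        private final List<E> entries = new ArrayList<>();
        private final List<Consumer<E>> listeners = new CopyOnWriteArrayList<>();

        void addListener(Consumer<E> listener) {
            listeners.add(listener);
        }

        synchronized void add(E entry) {
            entries.add(entry);
            for (Consumer<E> listener : listeners) {
                listener.accept(entry);
            }
        }
    }

    /** Feeds listed paths through the pipeline and returns the simulated transfers. */
    static List<String> run(List<String> listedPaths, String targetPrefix) {
        NotifyingCache<String> listCache = new NotifyingCache<>();
        NotifyingCache<String[]> copyCache = new NotifyingCache<>();
        List<String> copied = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Listener 1: convert each new list entry into a (source, target) copy entry.
        listCache.addListener(path -> copyCache.add(new String[] {path, targetPrefix + path}));
        // Listener 2: hand each new copy entry to a pooled "client" as soon as it is available.
        copyCache.addListener(pair -> pool.execute(() -> {
            copied.add(pair[0] + " -> " + pair[1]);
        }));

        // The scan: entries arrive one at a time, and copying starts before listing ends.
        for (String path : listedPaths) {
            listCache.add(path);
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for copies", e);
        }
        return new ArrayList<>(copied);
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        paths.add("a.txt");
        paths.add("dir/b.txt");
        System.out.println(run(paths, "/dest/"));
    }
}
```

Because the copies run on pool threads, the order of the returned list is not guaranteed; the point of the sketch is only that conversion and copying begin while the listing is still in progress.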

The reasons for making all operations rest on a disk-I/O layer are primarily two:

Greater scalability: large or deeply recursive directory copies, for instance, can be handled without risk of running out of memory;
Greater reliability: because the cache serves as a full operation log, the failed parts of the operation can be retried simply by pointing to the original cache; moreover, the cache will be there should the JVM in which the operation was running crash.

As stated above, the underlying cache file is written in binary. The layout below describes the byte structure of an entry; as can be seen, it is organized similarly to a network packet, with a fixed-length head followed by variable-length fields.

LIST CACHE ENTRY

Fixed-length "head" = 65 bytes. The indexed properties (property name i, property value i, for 0 <= i < n) are specific to the metadata returned by a given file system.

CONTENTS	TYPE	BYTE POSITION
status	`byte`	0
entry id	`long`	1
previous id	`long`	9
type	`byte`	17
symlinked parent	`byte`	18
mode	`char`	19
links	`int`	21
size	`long`	25
modified	`long`	33
user length	`int`	41
group length	`int`	45
relative dir length	`int`	49
name length	`int`	53
symlink length	`int`	57
n = num properties	`int`	61
property name i length	`int`	65 + 8i
property value i length	`int`	69 + 8i
user	`bytes`	65 + 8n
group	`bytes`	65 + 8n + user length
relative dir	`bytes`	65 + 8n + user length + group length
name	`bytes`	65 + 8n + user length + group length + relative dir length
symlink	`bytes`	65 + 8n + user length + group length + relative dir length + name length
property name i	`bytes`	65 + 8n + user length + group length + relative dir length + name length + symlink length + Σ[0 <= k < i] property name k length
property value i	`bytes`	65 + 8n + user length + group length + relative dir length + name length + symlink length + Σ[0 <= k < n] property name k length + Σ[0 <= k < i] property value k length
(end)		65 + 8n + user length + group length + relative dir length + name length + symlink length + Σ[0 <= k < n] property name k length + Σ[0 <= k < n] property value k length
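
As a worked illustration of the offsets in the table above, the following sketch decodes one list cache entry from a `java.nio.ByteBuffer`. It is not Trebuchet's own reader: the class ListEntrySketch, its field names, and the sample data are hypothetical. It also assumes big-endian byte order (the `ByteBuffer` default), UTF-8 text, and that the variable-length fields follow one another contiguously in table order; none of these is confirmed by the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader for one list cache entry, following the offsets in the table. */
final class ListEntrySketch {
    static final int HEAD_LENGTH = 65;  // fixed-length head, in bytes

    byte status, type, symlinkedParent;
    long entryId, previousId, size, modified;
    char mode;
    int links;
    String user, group, relativeDir, name, symlink;
    String[] propertyNames, propertyValues;

    /** Decodes the entry starting at absolute position base (byte order and charset are assumptions). */
    static ListEntrySketch read(ByteBuffer buf, int base) {
        ListEntrySketch e = new ListEntrySketch();
        e.status = buf.get(base);
        e.entryId = buf.getLong(base + 1);
        e.previousId = buf.getLong(base + 9);
        e.type = buf.get(base + 17);
        e.symlinkedParent = buf.get(base + 18);
        e.mode = buf.getChar(base + 19);
        e.links = buf.getInt(base + 21);
        e.size = buf.getLong(base + 25);
        e.modified = buf.getLong(base + 33);
        int userLen = buf.getInt(base + 41);
        int groupLen = buf.getInt(base + 45);
        int dirLen = buf.getInt(base + 49);
        int nameLen = buf.getInt(base + 53);
        int symlinkLen = buf.getInt(base + 57);
        int n = buf.getInt(base + 61);

        // Per-property length pairs sit directly after the head, 8 bytes per property.
        int[] propNameLen = new int[n];
        int[] propValueLen = new int[n];
        for (int i = 0; i < n; i++) {
            propNameLen[i] = buf.getInt(base + 65 + 8 * i);
            propValueLen[i] = buf.getInt(base + 69 + 8 * i);
        }

        // Variable-length fields start at 65 + 8n and are read back to back.
        int[] pos = {base + HEAD_LENGTH + 8 * n};
        e.user = text(buf, pos, userLen);
        e.group = text(buf, pos, groupLen);
        e.relativeDir = text(buf, pos, dirLen);
        e.name = text(buf, pos, nameLen);
        e.symlink = text(buf, pos, symlinkLen);
        e.propertyNames = new String[n];
        e.propertyValues = new String[n];
        for (int i = 0; i < n; i++) e.propertyNames[i] = text(buf, pos, propNameLen[i]);
        for (int i = 0; i < n; i++) e.propertyValues[i] = text(buf, pos, propValueLen[i]);
        return e;
    }

    /** Reads len bytes at pos[0] as text and advances pos[0] past them. */
    private static String text(ByteBuffer buf, int[] pos, int len) {
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) bytes[i] = buf.get(pos[0] + i);
        pos[0] += len;
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Builds a made-up 99-byte entry (65-byte head + 8 + 26 bytes of text) for demonstration. */
    static ByteBuffer sample() {
        ByteBuffer b = ByteBuffer.allocate(99);
        b.put((byte) 1).putLong(42L).putLong(41L).put((byte) 0).put((byte) 0)
         .putChar((char) 0644).putInt(1).putLong(1024L).putLong(99L)
         .putInt(3).putInt(5).putInt(4).putInt(5).putInt(0)  // user, group, dir, name, symlink lengths
         .putInt(1)                                          // n = 1 property
         .putInt(5).putInt(4);                               // property 0 name and value lengths
        b.put("annstaffsrc/a.txtpermsrw-r".getBytes(StandardCharsets.US_ASCII));
        return b;
    }

    public static void main(String[] args) {
        ListEntrySketch e = read(sample(), 0);
        System.out.println(e.entryId + " " + e.user + ":" + e.group + " " + e.relativeDir + e.name);
    }
}
```

Because every length is stored in the head, a reader can compute the total size of an entry (the "(end)" offset) before touching any of the variable-length data, which is what allows entries to be skipped or located without parsing their contents.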
