- What is it?
- Sphere: Parallel Data Processing Framework like MapReduce
- Sector: Userspace Distributed File System
- Optimized for data intensive applications
- Apache v2.0 License
- Architecture
- Sphere
- Applies user defined functions to all elements like records, blocks, files or directories
- Output can be written to Sector files or sent back to client
- Sector
- Is a meta file-system i.e. it stores files on the native/local file system of each slave node.
- Sector does not split files into blocks
- Master Server
- Maintains status of slave nodes and other master nodes
- Maintains the file system meta-data
- Responds to user requests
- Security server maintains user accounts and IP access control for masters, slaves, and clients.
- Can use LDAP/System Accounts as account source
- Uses Certificates to authenticate masters and slaves
- Slave Servers
- Store sector files
- Process data
- Sphere
- Reliability
- Sector uses software-level replication for better reliability and availability
- Replicas can be made either at write time (instantly) or periodically
- User can configure replication factor, distance and location on a per-file basis
- Redundant Topology - Multiple master servers can be setup for high availability
- Hot Plugging - Master/Slave nodes can join and leave at runtime
- Hot swapping - Master nodes can restart slave nodes in case of failures
- Sector does not require any permanent meta-data. File system can be rebuilt from data.
- Sector uses software-level replication for better reliability and availability
- Performance 2
- Direct Data transfer channel between slaves and clients for better performance.
- UDT is used for high speed data transfer
- UDT is a high performance UDP-based data transfer protocol.
- Much faster than TCP over wide area
- Benchmark Study 2
- TeraSort Benchmark (Sorting 1 TB data)
- Rack Configuration
- 1 head node
- Dual-core Xeon @ 3.0GHz
- 16 GB RAM
- 30 slave nodes
- Xeon E5410 @ 2.4 GHz
- 16 GB RAM
- 4 x 1 TB RAIDO Disk
- 1 Gb/s NIC
- Racks are located at JHU (Baltimore), StarLight (Chicago), UIC (Chicago) and Calit2 (San Diego). Inter-rack bandwidth is 20 GE.
- 1 head node
- Results
Note(s):
Racks
Sector/Sphere a
Hadoop b
1
28m 25s
85m 49s
2
15m 20s
37m 0s
3
10m 19s
25m 14s
4
7m 56s
17m 45s
a Sector/Sphere do not require tuning
b Hadoop was tuned for better performance.
- Rack Configuration
- MalStone Benchmark (by Open Cloud Consortium)
- Configuration
- 20 nodes
- Xeon E5410 @ 2.4 GHz
- 16 GB RAM
- 4 x 1 TB RAIDO Disk
- 1 Gb/s NIC
- 20 nodes
- Results
MalStone A
MalStone B
Hadoop
454m 13s
840m 50s
Hadoop Streaming/Python
87m 29s
142m 32s
Sector/Sphere
33m 40s
43m 44s
- Configuration
- TeraSort Benchmark (Sorting 1 TB data)
- Multiple active master servers support load balancing
- Spatial Locality - Slave nodes process data stored locally or on nearest storage node
- Security
- Authentication/Access Control is managed by a Security Server
- Encryption of control messages and data transfers is supported
- Interoperability
- Client applications
- Specify Input, Output and user defined functions
- Input/Output are Sector directories or files
- May process data iteratively/combinatively
- User Defined functions are C++ functions following the Sphere specification (parameters and return value) and compiled into a dynamic library (*.so)
- FUSE plugin to mount Sector File System locally
- Client applications
Source(s):
1 http://sector.sourceforge.net/pub/Sector-cloudcom-tutorial.pdf
2 http://sector.sourceforge.net/benchmark.html