Thursday, July 30, 2009

Algorithm changes and the next release

Hello again,

There is a new release of PHPDFS available here:

http://code.google.com/p/phpdfs/downloads/list

There are test results below for the new version.

I said in my last post that I was going to talk about some algorithm changes, so let's do that. There were two main drawbacks to the RUSHp algorithm implemented in the previous release of PHPDFS:
  • Reorganization

    As can be seen in the test results from the prior blog post, data reorganization was optimal only when a new sub-cluster was added. All other configuration changes caused a significant amount of data to be moved, in some cases more than 90% of it, which is obviously undesirable. Say an installation contained 100 terabytes of data: it is not good to have to move 90+ terabytes just to accommodate removing the first sub-cluster of old machines. This could seriously degrade service quality if the internal network the system runs on becomes saturated simply by moving data between machines. Also, when that much data starts to move around, the probability of failure increases, which is also undesirable.

  • Replication Policies

    A replication policy is the number of replicas that a given object has in the system. The previous version of PHPDFS, using RUSHp, had a drawback here: the replication factor could not be greater than the number of nodes in the first sub-cluster. This means that if you started with 2 nodes in your first sub-cluster (which many folks would probably do), then you could not increase the replication factor as you scaled and added servers unless you also changed the number of nodes in the first sub-cluster, which would result in most of the data being moved, which as we pointed out above is undesirable.
Well, the current release of PHPDFS addresses these problems by using the RUSHr variant of the RUSH algorithms. RUSHr delivers near-optimal reorganization when a sub-cluster is removed or re-weighted. Additionally, the replication policy is adjustable with only one constraint: you cannot have a replication policy that exceeds the number of servers in the system, which would not make sense anyway. So don't do that. :P The primary reason for the improvements is that RUSHr uses the hypergeometric distribution to decide how many replicas each sub-cluster gets and where they belong. I will get into more detail about this in a future blog post.
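
To give a rough idea of what that means, here is a toy sketch in PHP of how a hypergeometric draw can decide how many of an object's remaining replicas land in a given sub-cluster. This is not the PHPDFS source, and the function names are made up for illustration; it only shows the core idea of sampling without replacement, seeded from the object id so that look-ups repeat the same draws.

<?php
// Sample from Hypergeometric(population = $total, group = $inGroup,
// draws = $draws) by drawing sequentially without replacement.
// Assumes $draws <= $total.
function drawHypergeometric($total, $inGroup, $draws)
{
    $hits = 0;
    for ($i = 0; $i < $draws; $i++) {
        // probability the next draw lands in the group of interest
        if (mt_rand(0, $total - 1) < $inGroup) {
            $hits++;
            $inGroup--;
        }
        $total--;
    }
    return $hits;
}

// How many of the $replicasLeft replicas of $objectId fall into a sub-cluster
// that contributes $disksInSubCluster of the $disksRemaining disks still
// under consideration. Seeding from the object id keeps the result
// deterministic, so a later look-up repeats the same decision.
function replicasForSubCluster($objectId, $subClusterId, $replicasLeft,
                               $disksInSubCluster, $disksRemaining)
{
    mt_srand(crc32($objectId . ':' . $subClusterId));
    return drawHypergeometric($disksRemaining, $disksInSubCluster, $replicasLeft);
}

The real algorithm also has to account for sub-cluster weights and for walking the sub-clusters in order, which the sketch ignores; it is only meant to show the hypergeometric step itself.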

OK, below are the test results for the latest release of PHPDFS. We use the same tests as in the last blog post. I am quite pleased with these results and probably will not be making many changes to the core algorithms going forward; I might switch to SPRNG to increase the look-up speed, but that will be all. In the future, blog posts will focus on using PHPDFS in the real world.

On to the results:

I used the paper here as a reference for the test results (section 3 specifically):

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

All graphs were generated with JpGraph:

http://www.aditus.nu/jpgraph/

All test code was executed on a MacBook with a 1.83 GHz Intel Core Duo and 2 GB of 667 MHz DDR2 SDRAM.

We cover four areas:
  • object distribution
  • failure resilience (replica distribution)
  • reorganization (adding and removing servers)
  • look-up performance
All tests except for the look-up performance test were done with a configuration of three sub-clusters, each containing 5 nodes. 10,000 objects were created and placed with three replicas each. The first sub-cluster has a weight of 1 and each sub-cluster after it is weighted twice as much as the cluster to its left. Meaning:

sub-cluster 1 has a weight of 1
sub-cluster 2 has a weight of 2
sub-cluster 3 has a weight of 4
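
For concreteness, the test layout can be pictured as something like the following. The array shape and node names here are purely illustrative, not necessarily the exact PHPDFS configuration format.

<?php
// Illustrative layout of the test configuration: three sub-clusters of five
// nodes each, with weights doubling from left to right. Node names are made up.
$clusters = array(
    array('weight' => 1, 'nodes' => array('data01', 'data02', 'data03', 'data04', 'data05')),
    array('weight' => 2, 'nodes' => array('data06', 'data07', 'data08', 'data09', 'data10')),
    array('weight' => 4, 'nodes' => array('data11', 'data12', 'data13', 'data14', 'data15')),
);

// With weights of 1 : 2 : 4 we expect roughly 1/7, 2/7 and 4/7 of the
// 30,000 replicas (10,000 objects x 3 replicas) to land in each sub-cluster.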

===============================

Object Distribution:

We can see here that the object distribution is similar to what we saw in the previous tests.

================================

Failure Resilience:

Failure resilience deals with how PHPDFS will distribute load when a disk fails. The load distribution will be determined by the location of the replicas for the failed disk. We can see from the following three graphs that PHPDFS does an excellent job of distributing the load when a disk fails.

We can see that replica distribution follows the weighting pattern and that the ensuing load due to the failure will be spread amongst all other nodes. This is exactly what we want for dealing with failed servers. Below we include the replica distribution for disks 1 and 15 to illustrate that all nodes express the same characteristics for replica distribution.

[graph: replica distribution for disk 15]
[graph: replica distribution for disk 1]
================================

Data Reorganization:

For the reorganization tests I started with the configuration as described above and:
  • added a sub-cluster
  • removed the first sub-cluster
  • re-weighted the second sub-cluster
  • removed the first disk from the first sub-cluster
What we are concerning ourselves with here is the number of objects that are moved during a reorganization. Reorganizing typically means we are adding new machines or de-allocating old resources. We compare our results to what we consider optimal, meaning that we move the fewest objects needed to accommodate the new configuration. When we add servers, we expect the optimal number of moved objects to be an equal amount taken from each server according to its sub-cluster weight. When we de-allocate resources, we expect the optimal number to be the number of objects held by the de-allocated server(s). We see in the results below that PHPDFS does an optimal job when adding new sub-clusters and functions at a near-optimal level for the other reorganizations.

The graph to the left indicates what happens when we add a new sub-cluster of five disks. We see that an optimal number of objects is moved to accommodate the new resources. This is good, as adding new servers will probably happen more frequently than de-allocating old ones.

The graph to the left shows the objects that move when the first sub-cluster is removed. Removing a sub-cluster means setting its weight to 0 and waiting for the objects to be moved, after which point the servers can be taken off-line.

The graph to the left expresses near-optimal behavior. We say near-optimal because optimal would mean that only the objects in the first sub-cluster are moved, and we can see that in addition to those, a small number of objects in the second sub-cluster are also moved. Specifically, we would expect 4247 objects to be moved, and we see that 5581 objects actually are. This is still a huge improvement over the last release of PHPDFS.
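
In terms of a configuration like the illustrative one shown earlier, decommissioning would amount to something like this (again, not the literal PHPDFS config syntax):

<?php
// Illustrative only: retiring the first sub-cluster means dropping its weight
// to 0 so its objects drain off to the remaining sub-clusters, after which
// its nodes can be taken off-line.
$clusters = array(
    array('weight' => 0, 'nodes' => array('data01', 'data02', 'data03', 'data04', 'data05')), // being retired
    array('weight' => 2, 'nodes' => array('data06', 'data07', 'data08', 'data09', 'data10')),
    array('weight' => 4, 'nodes' => array('data11', 'data12', 'data13', 'data14', 'data15')),
);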

The graph to the left shows what happens when we increase the weight on the second sub-cluster from a value of 2 to 4. This might happen when replacing a sub-cluster's machines with new ones that are as powerful as the servers in the more recently added sub-clusters.

We can see from the graph on the left that no objects are moved out of the second sub-cluster at all and that all moved objects come from the first and third sub-clusters. This is exactly what we want to happen. The next question is whether an optimal or near-optimal amount of data is moved. Optimally, we would expect 4761 objects to be moved to accommodate the new weighting; we had 5368 objects moved in our tests, which is near-optimal and again a huge improvement over the last release of PHPDFS.

This next graph to the left shows how many objects get moved when we remove a disk or server from the first sub-cluster. This does not seem like something that would happen very often in a real deployment, if at all, but we include it here for completeness.

We see that the data movement is not quite optimal. We expect that only the objects from a single disk (in this case disk 1) should be moved, but we can see that this is not the case and that objects from the other disks in both sub-clusters 1 and 2 are moved as well. However, not many are moved, so removing a single disk will not cause an extreme disruption, as was the case with the prior implementation of PHPDFS, where more than 50% of the objects in the system were moved.

Specifically, we would expect that 877 objects would be moved off the first disk and nothing more; however, our tests showed that 2762 objects in total were moved. Again, this is not an extreme amount and is a huge improvement over the previous implementation of PHPDFS.
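
As a sanity check on the word "optimal" in this section, the expected counts can be roughly reproduced with a back-of-the-envelope calculation that assumes the ideal number of moved objects is simply proportional to the weight that changed. The figures quoted above are measured placements, so they differ slightly from these ideal proportions; the snippet below is just an illustration of the arithmetic, not PHPDFS code.

<?php
$replicas = 10000 * 3;            // 10,000 objects x 3 replicas
$weights  = array(1, 2, 4);       // sub-cluster weights before the change
$total    = array_sum($weights);  // 7

// Removing the first sub-cluster: ideally only its share of the replicas moves.
$removeFirst = $replicas * $weights[0] / $total;

// Re-weighting the second sub-cluster from 2 to 4: ideally only the objects
// needed to bring it up to its new share move into it.
$newTotal = $total - 2 + 4;       // 9
$reweight = $replicas * (4 / $newTotal - 2 / $total);

// Removing one of the five disks in the first sub-cluster: ideally only that
// disk's share of the first sub-cluster moves.
$removeDisk = ($replicas * $weights[0] / $total) / 5;

// Adding a sub-cluster of weight $w would ideally move
// $replicas * $w / ($total + $w) objects onto it; the post does not say what
// weight the added sub-cluster used, so no number is given here.

printf("remove sub-cluster 1: ~%d objects moved\n", round($removeFirst)); // ~4286
printf("re-weight sub-cluster 2: ~%d objects moved\n", round($reweight)); // ~4762
printf("remove disk 1: ~%d objects moved\n", round($removeDisk));         // ~857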

================================

Look-up Performance:

For the look-up performance tests, I created 1 million objects, started with a single sub-cluster of five disks, ran the look-ups for all 1 million objects, and took the average look-up time across all look-ups. This was repeated out to 100 sub-clusters of five disks each: every time a sub-cluster was added (with an exponential weighting increase of 1.1), the look-ups were run again and the average time per look-up was recorded. I then repeated the experiment with even weighting so we could see how weighting affects look-up performance.

The graph to the left shows lookup performance and scaling characteristics. The biggest factor affecting scaling in PHPDFS is the number of sub-clusters that must be investigated when performing the lookup.

With evenly weighted disks we expect to see super-linear scaling: roughly O(nr) time, where n is the number of sub-clusters that must be investigated and r is the time taken to generate the random numbers for investigating a cluster.

But with weighting we expect sub-linear scaling, because more objects will be located in the more heavily weighted disks. The more heavily weighted disks will be "rightmost" in the list of sub-clusters, meaning that they will be investigated first, so we will probably find the object we are looking for within the first few sub-clusters.

We can see from the graph that our expectations are met. When the servers are all weighted evenly we have slightly super-linear scaling. This is because of the number of times we have to call the random number generator when investigating clusters. We could bring an evenly weighted configuration much closer to linear scaling by using a better random number generator that provides some sort of jump-ahead functionality, so that we would not have to call the rand() function a number of times equal to the id of the sub-cluster we are investigating.
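
To make that concrete, here is a toy illustration, not PHPDFS code, of why a plain generator makes investigating the j-th sub-cluster cost on the order of j random draws when what we need is the j-th value of a stream seeded from the object id. A generator with jump-ahead, such as SPRNG, could land on that value directly at constant cost.

<?php
// Toy illustration: reaching draw number $j of a seeded stream with a plain
// generator means burning through $j calls first, so look-ups that reach deep
// sub-clusters pay for every skipped draw.
function nthDrawWithoutJumpahead($objectId, $j)
{
    mt_srand(crc32($objectId));
    $value = 0;
    for ($i = 0; $i <= $j; $i++) { // $j + 1 calls just to reach draw $j
        $value = mt_rand();
    }
    return $value;
}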

In all likelihood, deployments of PHPDFS will have sub-clusters with uneven weights, where the most recently added sub-clusters are weighted more heavily than older ones, so we should get sub-linear scaling in real deployments of PHPDFS.

OK, so the test results above show a marked improvement in PHPDFS, which is very exciting. I believe that we are getting ready to do a beta release. Before that happens, though, I want to improve the unit tests and the client library. So be on the lookout for the beta release.
