Thursday, July 30, 2009

Algorithm changes and the next release

Hello again,

There is a new release of PHPDFS available here:

http://code.google.com/p/phpdfs/downloads/list

There are test results below for the new version.

I said in my last post that I was going to talk about some algorithm changes, so let's do that. There were two main drawbacks to the RUSHp algorithm implemented in the previous release of PHPDFS:
  • Reorganization

    As can be seen in the test results in the previous blog post, data reorganization was optimal only when a new sub-cluster was added. Every other configuration change caused a significant amount of data to be moved; in some cases more than 90% of the data was reorganized. This is clearly undesirable. Say an installation holds 100 terabytes of data: it is not acceptable to have to move 90+ terabytes just to retire the first sub-cluster of old machines. Moving that much data could seriously degrade service quality if it saturates the internal network the system runs on, and the more data that is in flight, the higher the probability that something fails along the way, which is also undesirable.

  • Replication Policies

    A replication policy is the number of replicas that a given object has in the system. The previous version of PHPDFS, using RUSHp, had some drawbacks here: the replication factor could not be greater than the number of nodes in the first sub-cluster. So if you started with 2 nodes in your first sub-cluster (which many folks would probably do), you could not increase the replication factor as you scaled and added servers unless you also changed the number of nodes in the first sub-cluster, and that change would move most of the data in the system, which, as pointed out above, is undesirable.
Well, the current release of PHPDFS addresses these problems by using the RUSHr variant of the RUSH algorithms. RUSHr delivers near-optimal reorganization when a sub-cluster is removed or re-weighted. Additionally, the replication policy is adjustable with almost no constraints; the only one is that the replication factor cannot exceed the number of servers in the system, which would not make sense anyway. So don't do that. :P The primary reason for the improvement is that RUSHr uses the hypergeometric distribution to decide how many replicas of an object land in each sub-cluster. I will get into more detail about this in a future blog post.
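
To make the hypergeometric idea a bit more concrete, here is a minimal sketch of drawing a replica count for a single sub-cluster by simulating draws without replacement. This is purely illustrative and is not the PHPDFS/RUSHr code; the function name and the way the disk population is split are assumptions for the example.

    <?php
    // Illustrative sketch only, not the PHPDFS implementation.
    // Sample how many of $draws replicas land in a sub-cluster that holds
    // $inCluster of the $population candidate disks. Simulating draws
    // without replacement yields exactly a hypergeometric sample.
    function hypergeometricSample($population, $inCluster, $draws)
    {
        $hits = 0;
        for ($i = 0; $i < $draws; $i++) {
            // chance the next draw hits this sub-cluster, given what was drawn so far
            if (mt_rand(1, $population) <= $inCluster) {
                $hits++;
                $inCluster--;
            }
            $population--;
        }
        return $hits;
    }

    // e.g. 3 replicas, 15 disks in total, 5 of them in the sub-cluster being examined
    $replicasHere = hypergeometricSample(15, 5, 3);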

OK, below are the test results for the latest release of PHPDFS. We use the same tests as we did in the last blog post. I am quite pleased with these results and probably will not be making many changes to the core algorithms going forward. I might switch to SPRNG to increase look-up speed, but that will be all. In the future, blog posts will focus on using PHPDFS in the real world.

On to the results:

I used the paper here as a reference for the test results (section 3 specifically):

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

All graphs were generated with JpGraph:

http://www.aditus.nu/jpgraph/

All test code was executed on a MacBook with a 1.83 GHz Intel Core Duo and 2 GB of 667 MHz DDR2 SDRAM.

We cover four areas:
  • object distribution
  • failure resilience (replica distribution)
  • reorganization (adding and removing servers)
  • look-up performance
All tests except for the look-up performance were done with a configuration of three sub-clusters each containing 5 nodes. 10000 objects were created and placed with three replicas each. I started with a weight of 1 and each sub-cluster was weighted twice as much as the cluster to its left. Meaning:

cluster 1 has a weight of 1
cluster 2 has a weight of 2
cluster 3 has a weight of 4
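
For reference, the test layout above could be written down as a plain PHP array like the one below. This is just a sketch of the test setup, not the actual PHPDFS configuration format.

    <?php
    // Hypothetical description of the test layout, not the real PHPDFS config format.
    $testLayout = array(
        'replicationDegree' => 3,       // three replicas per object
        'objectCount'       => 10000,   // objects placed in each test run
        'subClusters'       => array(
            array('weight' => 1, 'nodes' => 5),  // sub-cluster 1
            array('weight' => 2, 'nodes' => 5),  // sub-cluster 2
            array('weight' => 4, 'nodes' => 5),  // sub-cluster 3
        ),
    );

With weights of 1:2:4, sub-cluster 3 should end up holding roughly 4/7 of the 30000 replica placements (about 17000 of them).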

===============================

Object Distribution:

We can see here that the object distribution is similar to what we saw in the previous tests.

================================

Failure Resilience:

Failure resilience deals with how PHPDFS will distribute load when a disk fails. The load distribution will be determined by the location of the replicas for the failed disk. We can see from the following three graphs that PHPDFS does an excellent job of distributing the load when a disk fails.

We can see that the replica distribution follows the weighting pattern and that the load resulting from the failure will be spread amongst all other nodes. This is exactly what we want for dealing with failed servers. Below we include the replica distribution for disks 1 and 15 to illustrate that all nodes exhibit the same replica-distribution characteristics.
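
For the curious, the failure-resilience numbers can be collected with a loop like the one sketched below: for every object that has a replica on the failed disk, tally where its other replicas live. locateReplicas() is a hypothetical stand-in for whatever lookup call the installation exposes, not a real PHPDFS function.

    <?php
    // Sketch of the failure-resilience measurement. locateReplicas() is a
    // hypothetical stand-in assumed to return the disk ids holding an
    // object's replicas; it is not part of the real PHPDFS API.
    function replicaLoadAfterFailure(array $objectIds, $failedDisk)
    {
        $load = array();
        foreach ($objectIds as $objId) {
            $disks = locateReplicas($objId);
            if (!in_array($failedDisk, $disks)) {
                continue; // this object is unaffected by the failure
            }
            foreach ($disks as $disk) {
                if ($disk === $failedDisk) {
                    continue;
                }
                // a surviving replica absorbs the reads that disk would have served
                $load[$disk] = isset($load[$disk]) ? $load[$disk] + 1 : 1;
            }
        }
        return $load; // diskId => number of affected objects it now covers
    }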

replica distribution for disk 15

replica distribution for disk 1

================================

Data Reorganization:

For the reorganization tests I started with the configuration as described above and:
  • added a sub-cluster
  • removed the first sub-cluster
  • re-weighted the second sub-cluster
  • removed the first disk from the first sub-cluster
What we are concerned with here is the number of objects that are moved during a reorganization. Reorganizing typically means adding new machines or de-allocating old resources. We compare our results to what we consider optimal, meaning that we move the least number of objects needed to accommodate the new configuration. When we add servers, we expect the optimal number of moved objects to be taken from each server in proportion to its sub-cluster weight. When we de-allocate resources, we expect the optimal number to be the number of objects held by the de-allocated server. We see in the results below that PHPDFS does an optimal job when adding new sub-clusters and functions at a near-optimal level for the other reorganization scenarios.
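
In other words, the optimal count for an added sub-cluster is just its weight-proportional share of all replica placements, and the optimal count for a removed server is whatever that server currently holds. A tiny sketch of the first calculation (the weight given to the added sub-cluster is an assumption for the example, not a number from the tests):

    <?php
    // Illustrative arithmetic: the optimal number of replica placements that
    // must move when a new sub-cluster is added is its share of the total weight.
    function optimalMovesForAdd($totalPlacements, array $existingWeights, $newWeight)
    {
        $totalWeight = array_sum($existingWeights) + $newWeight;
        return (int) round($totalPlacements * $newWeight / $totalWeight);
    }

    // 10000 objects x 3 replicas = 30000 placements; existing weights 1, 2, 4.
    // The new sub-cluster's weight of 8 is only an assumption for illustration.
    echo optimalMovesForAdd(30000, array(1, 2, 4), 8);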

The graph to the left indicates what happens when we add a new sub-cluster of five disks. We see that an optimal number of objects are moved to accommodate the new resources. This is good, as adding new servers will probably happen more frequently than de-allocating old ones.

The graph to the left shows the objects that move when the first sub-cluster is removed. Removing a sub-cluster means setting its weight to 0 and waiting for the objects to be moved, after which point the servers can be taken off-line.

The graph to the left expresses near-optimal behavior. We say near-optimal because optimal would mean that only the objects in the first sub-cluster are moved, and we can see that in addition to those, a small number of objects in the second sub-cluster are moved as well. Specifically, we expect 4247 objects to be moved and we see that 5581 objects are actually moved. This is still a huge improvement over the last release of PHPDFS.

The graph to the left shows what happens when we increase the weight of the second sub-cluster from 2 to 4. This might happen when replacing a sub-cluster's machines with new ones that are as powerful as the servers in the more recently added sub-clusters.

We can see from the graph on the left that no objects are moved from the second sub-cluster at all and that all moved objects come from the first and third sub-clusters. This is exactly what we want to happen. The next question is whether an optimal or near-optimal amount of data is moved. Optimally, we would expect 4761 objects to be moved to accommodate the new weighting; our tests moved 5368 objects, which is near optimal and again a huge improvement over the last release of PHPDFS.

This next graph to the left shows how many objects get moved when we remove a disk or server from the first sub-cluster. This does not seem like something that would happen very often in a real deployment, if at all, but we include it here for completeness.

We see that the data movement is not quite optimal. We expect that only the objects from a single disk (in this case disk 1) should be moved, but objects from the other disks in both sub-clusters 1 and 2 are moved as well. However, not many are moved, so removing a single disk will not cause an extreme disruption, as was the case with the prior implementation of PHPDFS, where more than 50% of the objects in the system were moved.

Specifically, we would expect that only the 877 objects on the first disk would be moved; our tests showed that 2762 objects in total were moved. Again, this is not an extreme amount and is a huge improvement over the previous implementation of PHPDFS.
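
To put those counts in perspective against the 30000 replica placements in the test (10000 objects times 3 replicas), a quick back-of-the-envelope check:

    <?php
    // Quick sanity math on the reorganization results above.
    $total = 10000 * 3; // replica placements in the test
    printf("remove first sub-cluster: %.1f%%\n", 100 * 5581 / $total); // ~18.6%
    printf("re-weight second cluster: %.1f%%\n", 100 * 5368 / $total); // ~17.9%
    printf("remove one disk:          %.1f%%\n", 100 * 2762 / $total); // ~9.2%

Compare that to the 90%+ and 50%+ movement we saw from the RUSHp-based release.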

================================

Lookup performance:

For the look-up performance tests, I created 1 million objects, started with a single sub-cluster of five disks, ran the look-ups for all 1 million objects, and took the average look-up time across all look-ups. This was repeated up to 100 sub-clusters of five disks each, where each added sub-cluster carried an exponential weighting increase of 1.1, and the average look-up time was recorded for each configuration. I then repeated the experiment with even weighting so we could see how weighting affects look-up performance.
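
The timing harness itself is nothing fancy. A sketch of the kind of loop used is below; lookupObject() is a hypothetical stand-in for the PHPDFS locate call, not its real API.

    <?php
    // Sketch of the look-up benchmark: average locate time for one cluster
    // configuration. lookupObject() is a hypothetical stand-in, not the real API.
    function averageLookupTime($objectCount)
    {
        $start = microtime(true);
        for ($i = 0; $i < $objectCount; $i++) {
            lookupObject("object-$i");
        }
        return (microtime(true) - $start) / $objectCount;
    }

    // repeated for each configuration: 1 sub-cluster, then 2, and so on up to 100
    $avgSeconds = averageLookupTime(1000000);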

The graph to the left shows lookup performance and scaling characteristics. The biggest factor affecting scaling in PHPDFS is the number of sub-clusters that must be investigated when performing the lookup.

With evenly weighted disks we expect to see slightly super-linear scaling: roughly O(nr) time, where n is the number of sub-clusters and r is the time taken to generate the random numbers needed to investigate each sub-cluster.

But with weighting we expect sub-linear scaling: more objects will be located in the more heavily weighted disks, and those disks will be "rightmost" in the list of sub-clusters, meaning they are investigated first, so we will probably find the object we are looking for within the first few sub-clusters.

We can see from the graph that our expectations are met. When the servers are all weighted evenly we get slightly super-linear scaling, because of the number of times we have to call the random number generator while investigating sub-clusters. We could bring an evenly weighted configuration much closer to linear scaling by using a random number generator with some sort of jump-ahead functionality, so that we would not have to call rand() a number of times equal to the id of the sub-cluster we are investigating.
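
To make that last point concrete, here is a toy illustration of the cost being described. It is not the actual RUSH look-up, just a demonstration that without jump-ahead support the PRNG stream has to be advanced draw by draw to reach the value that corresponds to a given sub-cluster.

    <?php
    // Toy illustration only, not the RUSH computation: reaching the draw that
    // belongs to sub-cluster $clusterId costs $clusterId throwaway calls when
    // the generator cannot jump ahead.
    function drawForCluster($objectKey, $clusterId)
    {
        mt_srand(crc32($objectKey));          // deterministic per-object seed
        for ($i = 0; $i < $clusterId; $i++) {
            mt_rand();                        // burn draws for earlier sub-clusters
        }
        return mt_rand();                     // the draw used for this sub-cluster
    }

A generator with jump-ahead (SPRNG, for example) could skip straight to that position in the stream.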

In all likelihood, deployments of PHPDFS will have sub-clusters with uneven weights, where the most recently added clusters are more heavily weighted than older ones, so we should get sub-linear scaling in real deployments of PHPDFS.

OK, so the test results above show a marked improvement in PHPDFS, which is very exciting. I believe we are getting ready for a beta release. Before that happens, though, I want to improve the unit tests and the client library. So be on the lookout for the beta release.

Friday, July 17, 2009

First round of test results

Below are the test results for PHPDFS in its current implementation.

In summary, the current implementation of PHPDFS performs as expected according to the RUSH variant described in this paper:

http://users.soe.ucsc.edu/~elm/Papers/ipdps03.pdf

That is the main point of these tests: to show that the current implementation of PHPDFS is correct and does indeed do what it is supposed to do. However, after testing I am not sure that the current implementation will fit the needs of a cloud-like store for a web application. There are three main variants of the RUSH algorithms, known as RUSHp, RUSHr, and RUSHt. PHPDFS currently has a RUSHp implementation at its core. I am looking at RUSHr as the best choice because it seems to have the most flexible replication policy for an object, which maps better to the needs of a web app where the popularity of an asset can change suddenly (say, due to a post on slashdot, digg, etc.). I will get into my reasoning more in depth in the next blog post.

On to the results...

I used the paper here as a reference for the test results (section 3 specifically):

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

All graphs were generated with JpGraph:

http://www.aditus.nu/jpgraph/

All test code was executed on a MacBook with a 1.83 GHz Intel Core Duo and 2 GB of 667 MHz DDR2 SDRAM.

We cover four areas:
  • object distribution
  • failure resilience (replica distribution)
  • reorganization (adding and removing servers)
  • look-up performance
All tests except for the look-up performance were done with a configuration of three sub-clusters each containing 5 nodes. 10000 objects were created and placed with three replicas each. I started with a weight of 1 and each sub-cluster was weighted twice as much as the cluster to its left. Meaning:

cluster 1 has a weight of 1
cluster 2 has a weight of 2
cluster 3 has a weight of 4

===============================

Object Distribution:

We can see here that PHPDFS is very close to optimal distribution. All of the RUSH variants exhibit this same behavior.

=================================

Failure Resilience:

This graph shows the replica distribution of objects that were located on disk 7 which theoretically failed.

What we can see here is that the more heavily weighted servers would take most of the request load from a failed disk, which is what we want, assuming that the more heavily weighted servers are the more powerful ones.

Immediately below are two more graphs that show the replica distribution for disks 1 and 15. You will see from the graphs that the replica distribution tends to favor the more heavily weighted disks, which again is what we want, assuming the more heavily weighted disks are the more powerful servers.


disk 15 replica distribution

disk 1 replica distribution

=================================

Reorganization:

For the reorganization tests I started with the configuration as described above and:
  • added a sub-cluster
  • removed the last disk
  • removed the first sub-cluster
What we are concerned with here is the number of objects that are moved during a reorganization. Reorganizing typically means adding new machines or de-allocating old resources. We compare our results to what we consider optimal, meaning that we move the least number of objects needed to accommodate the new configuration. When we add servers, we expect the optimal number of moved objects to be taken from each server in proportion to its sub-cluster weight. When we de-allocate resources, we expect the optimal number to be the number of objects held by the de-allocated server. We see in the results below that PHPDFS does an optimal job when adding new sub-clusters, but performs in sub-optimal fashion when de-allocating resources. This is expected according to the authors of the core RUSH algorithm.

The graph to the left indicates what happens when we add a new sub-cluster of five disks. We see that an optimal number of objects are moved to accommodate the new resources. This is good, as adding new servers will probably happen more frequently than de-allocating old ones.

This next graph shows what happens to the servers when the first sub-cluster is removed. This is something that might happen when a whole cluster of machines has been determined to be too slow to serve any further purpose, or when those machines could be put to use elsewhere.

In our example the first sub-cluster contains approximately 4265 objects, so we expect the optimal number of objects moved to be 4265. However, we can see from the graph that MANY MANY more than that were moved. In fact, the total actually moved was 27943!!! Ugh. That means 27943 / 30000, or roughly 93%, of the object replicas were moved. That is almost a complete reorganization! As I stated above, we are going to use a different RUSH variant that will not cause such huge movement when we de-allocate resources. RUSHr does allow for sub-cluster removal and has near-optimal reorganization characteristics.
This next graph shows what happens when a disk is removed from the last sub-cluster.

Again we see a sub-optimal number of objects moved.

We expected around 3480 objects to move, but we actually moved a total of 16567, which means more than 50% of the object replicas were moved to accommodate removing a single disk. This confirms the sub-optimal de-allocation behavior we saw above and reinforces the idea that we should use a different variant of the RUSH algorithm.




=================================

Lookup Performance:

For the look-up performance tests, I created 1 million objects and started with a single sub-cluster of five disks and ran the look-ups for the 1 million objects and took the avg look-up time across all look-ups. This was repeated for 100 sub-clusters of five disks each where a sub-cluster was added with an exponential weighting increase of 1.1 and the look-ups repeated and the avg time for each look-up was recorded. I then repeated this experiment with even weighting so we could get a look at how weighting affects look-up performance.

The graph to the left shows lookup performance and scaling characteristics. The biggest factor affecting scaling in PHPDFS is the number of sub-clusters that must be investigated when performing the lookup.

With evenly weighted disks we expect to see super-linear scaling. We expect O(nr) time where r is the time taken to generate the random numbers for investigating clusters.

But with weighting we expect sub-linear scaling because more objects will be located in the more heavily weighted disks, the more weighted disks will be "rightmost" in the sub-clusters, meaning that they will be investigated first, meaning that we will probably find the object we are looking for within the first few sub-clusters.

We can see from the graph that our expectations are met. When the servers are all weighted evenly we have slightly super-linear scaling. This is because of the number of times we have to call the random number generator when investigating clusters. We could bring an even weighted configuration much closer to linear scaling by using a better random number generator that provides some sort of jumpahead functionality so we will not have to call the rand() function equal to the id of the sub-cluster we are investigating.

In all likelihood, deployments of phpdfs will have sub clusters with uneven weights where the most recently added clusters are more heavily weighted than older clusters so we should get sub-linear scaling in real deployments of phpdfs.

So that's it for now. As you can see, the current implementation of PHPDFS meets most of the stated claims; the problem area is reorganization, and that will be addressed in upcoming releases. You can download the latest version of the code here:

http://code.google.com/p/phpdfs/downloads/list

My next blog post will be about what algorithm changes we want to make and why, based on an analysis of the needs of a cloud-like system that stores many small images for a web application.

peace,

-Shane

Wednesday, July 8, 2009

scaling improvements

howdy folks,

I have been speaking a bit with the authors of the core algorithm that I implemented in PHPDFS, and they directed me to some follow-up research and papers that contain information on some scaling improvements.

If you are interested, you can read the papers here:

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

These papers are great. I am just now getting into collecting empirical data about PHPDFS to prove that the implementation does, in fact, do what it is supposed to do, and the first paper above has a section on "RUSH in Practice" (section 3) which can serve as a reference when I do my own tests. Sweet.

Since I am able to easily swap out the locating algorithm in PHPDFS, I think what I will do before working on the improved implementation of RUSH is test what I currently have and develop a solid test suite that can be used going forward. I know that I should have had more fully fleshed-out tests before implementation, but I was just too damn excited about doing this. Sue me. :P

peace, look for test results soon...

Monday, July 6, 2009

pre-pre-alpha

Here it is: the first release of PHPDFS.

http://code.google.com/p/phpdfs/downloads/list

This is the first release and I would not call it ready for production, but it certainly is ready for some pounding, which is exactly what I intend to do.

Now that this first release is out, I am going to be focusing on:
  • improving the tests. Currently there are tests for the core algorithm but not much else. I want to add tests not only for the main PHPDFS class but also to prove that the system and the code actually do what they say they do. This will include empirical tests showing that the distribution is correct and that an optimal (minimal) amount of data is moved when new resources (new servers) are added; see the sketch after this list.

  • improving and generating documentation.
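
As a taste of what the empirical distribution test mentioned above could look like, here is a rough sketch: place a batch of objects, count how many land on each disk, and check that the counts track the disk weights within a tolerance. placeObject() and the weight map are hypothetical stand-ins, not the real PHPDFS API.

    <?php
    // Rough sketch of an empirical distribution test. placeObject() is a
    // hypothetical stand-in assumed to return the id of the disk chosen
    // for the object; it is not part of the real PHPDFS API.
    function assertWeightedDistribution(array $diskWeights, $objectCount, $tolerance = 0.05)
    {
        $counts = array_fill_keys(array_keys($diskWeights), 0);
        for ($i = 0; $i < $objectCount; $i++) {
            $disk = placeObject("object-$i");
            $counts[$disk]++;
        }
        $totalWeight = array_sum($diskWeights);
        foreach ($diskWeights as $disk => $weight) {
            $expected = $objectCount * $weight / $totalWeight;
            if (abs($counts[$disk] - $expected) > $tolerance * $expected) {
                throw new Exception("disk $disk is off: got {$counts[$disk]}, expected ~$expected");
            }
        }
    }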

As things go forward, I will be putting together scaling, maintenance, and performance guides and posting them here.

If anybody happens to read this blog and downloads PHPDFS and tries it out, please give me any feedback you might have.

peace mi amigos y amigas,

-Shane