Tue, 06/24/2008 - 11:45
I've been working a bit on WAN acceleration for CTDB, actually with two different approaches for two different purposes.
WAN-accelerator #1 (general purpose)
The first approach was to add new "capabilities" to the CTDB daemon so that you can have a cluster of CTDB nodes where some nodes are located at a very remote site, across a high-latency WAN link. This was tricky to solve: even though the remote nodes participate synchronously in the cluster, you do not want the high WAN-link latency to affect performance on the nodes in the main datacentre.
Initial tests seem to indicate that it works quite well. Surprisingly well.
But this is not really a WAN-accelerator. A classic WAN-accelerator is more a device that performs a man-in-the-middle attack on the CIFS/NFS protocols and performs some (sometimes unsafe) caching.
In the CTDB approach above there is no man-in-the-middle attack, so it is not really a WAN-accelerator in that sense.
It is conceptually more like one single multihomed CIFS server where one of the NICs (the remote site) happens to be a few hundred ms away. Thus we don't have to play any tricks, nor do any questionable caching. We are still a single CIFS server, with 100% correct CIFS semantics; it's just that this CIFS server is spread out across multiple sites.
I.e. clients on the remote site talk to the genuine, real CIFS server, not a man-in-the-middle imposter that may or may not provide correct semantics.
WAN-accelerator #2 (NFS)
A different solution was based on FUSE and providing very aggressive caching of data and metadata for NFS. This one also seems to perform really well but is obviously less cool than "a single multihomed cifs server spanning multiple sites".
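To make the caching idea a bit more concrete, here is a minimal sketch in Python of a time-based attribute (metadata) cache. This is not the FUSE implementation mentioned above: the `AttrCache` class, its `ttl_seconds` parameter and the use of `os.stat` as the backing lookup are all assumptions made purely for illustration.

```python
import os
import time


class AttrCache:
    """Hypothetical TTL cache for file attributes (illustration only)."""

    def __init__(self, ttl_seconds, fetch=os.stat, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.fetch = fetch      # the slow lookup, e.g. a stat over the WAN
        self.clock = clock      # injectable so the cache can be tested
        self._entries = {}      # path -> (expiry time, attributes)

    def getattr(self, path):
        now = self.clock()
        hit = self._entries.get(path)
        if hit is not None and hit[0] > now:
            return hit[1]       # still fresh: no round trip needed
        attrs = self.fetch(path)
        self._entries[path] = (now + self.ttl, attrs)
        return attrs

    def invalidate(self, path):
        # Drop the entry, e.g. after a local write to the file.
        self._entries.pop(path, None)
```

The longer the TTL, the fewer round trips across the WAN, but also the longer a client may see stale attributes. That trade-off is what makes this kind of caching "aggressive".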
Tue, 05/06/2008 - 14:53
CTDB has had event scripts for managing an iSCSI target service for a while.
These event scripts are designed for use with the STGT iSCSI target.
Why pick STGT when there are so many different iSCSI targets available for Linux?
Well, STGT is the one that comes by default with RHEL5, and it is also what I use with Ubuntu.
It also comes with decent emulation of SBC (SCSI Block Commands, to emulate a hard disk), MMC (Multi-Media Commands, to emulate a DVD drive) and SMC (SCSI Media Changer commands, to emulate a robot/jukebox).
While which iSCSI solution is "best" is a never-ending source of controversy on lkml, I picked STGT.
To me STGT is attractive since it does all the SCSI processing in userspace and is very simple and easy to enhance. I personally don't like it when network services run inside kernel space, and have many times had loud opinions along the lines of "why does the $%**@! NFS lock manager run inside the kernel, making it so difficult to fix NLM bugs?" when it could (should) run much better in userspace, as all other platforms do it, and would be so much more serviceable to me as a user.
Why would someone want iSCSI and hard disk emulation with CTDB? Isn't CTDB just something to build a (VERY fast and VERY resilient) NAS server using Samba?
iSCSI is block I/O, so why use block I/O services on a NAS device?
Many people who use a CIFS NAS service, and in particular an expensive high-end CIFS NAS server (such as what CTDB/Samba is), have a large number of Windows clients that they want to connect to CTDB/Samba.
But if you have a large number of Windows clients, you probably also use Exchange, and while you can't really put Exchange databases on a NAS share, you can put these databases on an iSCSI LUN.
Since the CTDB/Samba NAS server is likely to be the fastest and most expensive storage device you have inside this hypothetical datacentre (and also one of the more fault-resilient ones), it would be very attractive to also store a critical application like Exchange's databases on this device.
It is a great value-add to the very expensive NAS box you just installed if you could also store the data for the critical Exchange application on it.
Do try STGT out. It is quite cool and works really well.
Thu, 03/06/2008 - 09:36
Setting up a high-end cluster, that should be easy.
When we first started developing CTDB and clustered Samba we thought that, well, if we just get CTDB and Samba to work then everything else should just be a breeze.
Boy were we wrong.
Getting all components of Linux to work reliably, and figuring out HOW to configure Linux and its subsystems so that it works reliably, is one of the most difficult tasks, and one we spend a lot of time on in the SOFS (IBM Scale Out File Services) team.
This is important since our customers want to know that they use a configuration that works and that is qualified. It is even more important since a naive implementation using stock default Linux configurations will likely have "issues".
NEVER assume that anything is mature or works. In particular, DON'T if your data depends on it.
You must TEST TEST TEST TEST and finally TEST some more that everything works and that all components can handle a high load on your highly performing system.
Don't even try just slapping something together if you intend to store any business-critical data on it.
Make sure that ALL components and ALL configurations you use are tested and qualified for your use pattern!
(If you use SOFS you can sleep better at night because we have already done all these tests and qualifications for you.)
What have we experienced?
HBA drivers. While an HBA driver may look mature and may look solid, do you know whether the HBA driver developer is testing and qualifying the driver for use with YOUR cluster filesystem?
In our case we found that you must be VERY careful and change the default config for the HBA and SCSI subsystem to match the use patterns of your cluster filesystem. Bad things happen if you use the defaults.
(You don't want to learn about these problems when you are in production. At that stage it is too late.)
Linux kernel and real-time signals/async I/O. I don't really know how well tested the stock kernels are with respect to high-stress testing of these features. I DO know that, using the default config settings in stock kernels, it is reasonably easy to bring the entire real-time-signal layer in the distro kernels down in such a way that you need a full-blown system reboot to recover.
Not fun.
Cluster filesystems and coherent file locking.
Most cluster filesystems out there seem never to have really been tested for high lock contention, where many, many processes do byte-range locking of the same file at the same time.
We use GPFS, and we use a customized configuration for GPFS that is qualified for SOFS.
Don't use the defaults! Bad things will happen.
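As a starting point for that kind of testing, here is a minimal sketch in Python of a lock-contention test: several processes repeatedly take an exclusive byte-range lock on the same 8 bytes of one file and increment a counter stored there. This is not the SOFS qualification procedure. The function names, worker count and iteration count are made up for illustration, and it only uses plain POSIX `fcntl` locks.

```python
import fcntl
import os
import struct


def worker(path, iterations):
    """Lock bytes 0..7, increment the 64-bit counter there, unlock. Repeat."""
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(iterations):
            # Blocks until no other process holds a lock on this range.
            fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0, os.SEEK_SET)
            value = struct.unpack("<Q", os.pread(fd, 8, 0))[0]
            os.pwrite(fd, struct.pack("<Q", value + 1), 0)
            fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0, os.SEEK_SET)
    finally:
        os.close(fd)


def run_contention_test(path, workers=8, iterations=200):
    """Fork `workers` processes that all fight over the same byte range."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then exit without returning to the caller.
            status = 1
            try:
                worker(path, iterations)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    with open(path, "rb") as f:
        return struct.unpack("<Q", f.read(8))[0]
```

If locking is coherent, the final counter equals `workers * iterations`; any smaller value means an update was lost. On a cluster filesystem you would point `path` at a file on the shared mount and run the test from several nodes at once, with far more workers and iterations than shown here.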
Kernel oplocks and leases.
Another area where one needs to be very careful and configure things exactly right.
Kernel modifications and patches.
A lot of the subsystems used by a high-end NAS application will exercise parts of the kernel that have only had light load and testing applied before. A number of kernel modifications are required that are not yet in the distros. For example, a stock Linux distro and kernel probably has no hope at all of integrating with HSM in meaningful ways. It may look like it works, but sooner or later you will discover the parts that break.
Don't assume that just throwing some components and applications together will create a "solution".
It won't. Trust me, it will not work. Unless you know exactly how to configure all the components so that they are fully compatible with each other's use patterns, I can guarantee that a stock Linux distro using stock default configs will have nasty surprises waiting for you.
Don't play games with your data. Make sure that ALL components in your solution are qualified to work together.