	This is revision number 5 of the clustering diffs.

=======================================================================
	A lot of people are unsure about how to use the new bdflush
daemon, and whether both bdflush and update should be running even
though update is now a symbolic link to bdflush.  The answer is that
both processes should be run from the rc scripts for things to work
correctly.
=======================================================================

New features in version 0.5:

	1) Bug fixed where you would get kernel panics if you had
more than 16Mb and a host adapter that used either EISA or VLB (i.e.
no ISA DMA restrictions).

	2) The hash_table is now dynamically sized (but just a bit).
The size is now either 997 entries, or 4093 entries for systems with
>= 4Mb of memory.
I have no idea if this is really the most appropriate breakpoint, but
it is easy to tweak.
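The sizing rule above can be sketched in a few lines of C.  This is an
illustrative sketch, not the actual kernel code; the function and macro
names are made up here.  Both table sizes are prime, which helps a simple
modulo hash spread entries evenly across buckets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the two-step hash sizing described above:
 * small machines get a 997-entry table, machines with 4Mb or more
 * of memory get 4093 entries.  Both sizes are prime. */

#define HASH_SMALL 997
#define HASH_LARGE 4093

static size_t pick_hash_size(unsigned long mem_bytes)
{
	return (mem_bytes >= 4UL * 1024 * 1024) ? HASH_LARGE : HASH_SMALL;
}
```

Changing the breakpoint is then a matter of editing one comparison,
which is what makes it easy to tweak.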

	3) The breada function has been replaced with something quite
a bit more useful.  The iso filesystem now uses this to read-ahead
directories and this vastly improves access times for cdrom directory
reads.

	4) Things have been rearranged a little bit - dirty buffers
now carry a timestamp that indicates when they should be written back
(previously it was the time the buffer was dirtied).  The general idea
is that some buffers (bitmaps, inodes, etc) can be given times that
are closer to the present so as to help ensure filesystem integrity in
the case of a system crash.  There is a function in buffer.c that
can be used to set this field properly.
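The idea in item 4 can be illustrated with a small sketch.  This is not
the actual buffer.c code; the struct, field, and constant names are
hypothetical.  The point is only that metadata buffers get a deadline
closer to the present than ordinary data buffers.

```c
#include <assert.h>

/* Illustrative sketch: stamp a dirty buffer with the time it should
 * be written back.  Bitmaps, inodes, and other metadata get a shorter
 * deadline so they reach the disk sooner, which helps filesystem
 * integrity after a crash.  Ages are in seconds and are guesses. */

#define AGE_DATA 30	/* how long a data buffer may stay dirty */
#define AGE_META 5	/* shorter deadline for bitmaps, inodes, etc. */

struct buf {
	long flushtime;	/* when the buffer should be written back */
	int  is_meta;
};

static void set_writetime(struct buf *bh, long now)
{
	bh->flushtime = now + (bh->is_meta ? AGE_META : AGE_DATA);
}
```

The daemon then writes back any dirty buffer whose flushtime has passed,
rather than comparing against the time the buffer was dirtied.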

	5) Patches updated for pl14.

	6) The bdflush daemon no longer screws up the load average.

	7) The update daemon now works with kernels that do not
support the bdflush syscall.

	8) Bug in scsi code fixed whereby the scsi system would hang
if you attempted to do regular disk I/O while something was using the
scsi_ioctl interface for another request.

TODO: 	Still need something for ext2 to ask for clusters.
==================================================================

New features in version 0.4:

	1) A real, honest to god free list is now present.
The buffers in it are guaranteed to be clean, unlocked and not
shared with any other process.  There is a refill function
that keeps it supplied with buffers - right now it is written
to supply 64 buffers any time it is called.  Also, there is a separate
free list for each different size of buffer, but the LRU list is
common for all of the different sizes.
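The free-list layout described above can be sketched roughly as follows.
This is a hypothetical illustration, not the real buffer cache
structures; the size table and names are invented here.  The key point
is one free-list head per buffer size, with a single shared LRU list.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of per-size free lists.  Each supported buffer size gets its
 * own free-list head; the LRU list (not shown) is common to all sizes. */

#define NR_SIZES 4
static const int bufsize[NR_SIZES] = { 512, 1024, 2048, 4096 };

struct buffer_head {
	struct buffer_head *next_free;	/* link within one size's free list */
	int size;
};

static struct buffer_head *free_list[NR_SIZES];	/* one head per size */

/* Map a buffer size to its free-list index; -1 if unsupported. */
static int size_index(int size)
{
	for (int i = 0; i < NR_SIZES; i++)
		if (bufsize[i] == size)
			return i;
	return -1;
}
```

getblk would then consult free_list[size_index(size)] first, falling
back to the refill function when that list runs dry.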

	2) A bdflush process is now present, and runs in the
background when we need to write back some dirty buffers.  Currently
this only scans at most 1/4 of the buffer cache, and will write back
at most 500 buffers, whichever comes first.  These numbers are
wild-assed guesses as to what would be appropriate, and tuning would
probably help.  An interactive method of altering parameters might also
be good.  Note: you currently need to run the process in rc.  It may
eventually be possible to get bdflush started automatically without
having to run a process, but there are a lot of tricky and subtle
issues at hand here.  The source code for bdflush is at the end of
this message.


	3) iozone on a naked partition now consistently yields numbers
like 1.1-1.4Mb/sec.  I believe that further tuning would be good in
order to improve performance.  In particular, if there is a big wad of
dirty buffers coming through the LRU list, we do not detect this until
it gets to the top.  At this point we wake up bdflush(), but until
bdflush finishes, we have to crawl past this wad each time the refill
function is called.  Even then, the refill function supplies 64
buffers so the penalty is nowhere near as bad as it once was.  Some
further adjustment of the amount of data that bdflush writes back
would certainly be good, I guess.

********************************************************************

	There is code in buffer.c to generate clusters, and it is now
used by the block device code.  I am finding that it is not terribly
efficient to search for a page that we can reclaim, so it is best to
limit the search to only a fraction of the buffer cache.  Currently
this is set to 25%, I may back this off a little bit more.  This is
a tuning parameter that can be modified at run time via the bdflush()
syscall interface.

	The only thing left to do is to modify the filesystems to
request clustered buffers.  In the block devices, I basically do
something like:

	if((block % 4) == 0) generate_cluster(dev, block, blocksize);

which as I look at it now is incorrect because it assumes a 1024 byte
blocksize.  Nonetheless, once this is fixed, it could be added
directly to getblk so that we always request clustered buffers.  It
would be good if the filesystems were to try and align things on
cluster boundaries, but as I understand it, ext2 tends to keep files
contiguous so it probably should not matter that much.
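A blocksize-independent form of that check might look like the sketch
below.  This is my illustration of the fix, not code from the patch; the
function name and the 4096-byte cluster size are assumptions.  The idea
is that the number of blocks per cluster is derived from the block size
rather than hard-coded, so the original "block % 4" falls out as the
special case of a 1024-byte blocksize.

```c
#include <assert.h>

/* One cluster is assumed to be 4096 bytes (one page) here.  A block
 * starts a cluster when its number is a multiple of the number of
 * blocks that fit in a cluster; with 1024-byte blocks this reduces
 * to the original (block % 4) == 0 test. */

#define CLUSTER_SIZE 4096

static int is_cluster_start(int block, int blocksize)
{
	int blocks_per_cluster = CLUSTER_SIZE / blocksize;
	return (block % blocks_per_cluster) == 0;
}
```

With this in getblk, clustered buffers would be requested whenever the
block lands on a cluster boundary, whatever the blocksize.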

	One concern that I have with this is the overhead of searching
for a page that can be reclaimed to be used for a new cluster.  I am
toying with the idea of discouraging the buffer cache from breaking
apart clusters so that things are always done on a page basis.  In
fact, the buffer cache would be reorganized so that things are
generally done by handling pages.  This would speed up a number of
parts of the buffer cache, but the filesystems are still expecting
buffer headers.  Linus was also thinking along these lines, and as I
look at it now, it is beginning to make more and more sense to me.
There are still some things that need to be thought out before I can
go ahead, but I suspect that on the whole it will lead to better
performance.


-Eric

