	This is a release of revision number 8 of the clustering
diffs.  The diffs are against a pl15A kernel.  With anything else you
may have difficulty patching the kernel.

=======================================================================
	********************* NEW FOR VERSION 0.8 **********************

	The bdflush program has been modified so that it automatically
forks both daemons.  Thus you no longer have to start both programs in
your /etc/rc script; just start /etc/update as you did prior to
clustering and you will be OK.

	Also, in pl15 there is a stub for bdflush, so the syscall
number should not change any more from now on.  You should, however,
compile and install the bdflush that is in this directory.

=======================================================================

New features in version 0.8:

	There is now code to handle reclaiming buffers of different
sizes from each other.  The technique is sort of crude - we keep a
load average of the usage of buffers of various sizes, and use this
to determine when to start reclaiming buffers of a particular size.
The show memory hotkey (Shift-scroll lock) now shows some detailed
numbers that at a glance show you how many buffers you have of
different sizes, and which lists these buffers are on.  On the
surface it seems to work, but this is new code, and there may be
bugs.  If you are using only buffers of one size, then this should not
affect you at all.  I have tested it with my cdrom where I am able to
mount with a blocksize of 2048, and the cannibalization seems to work
correctly.
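
	The per-size "load average" idea above can be illustrated with a
minimal sketch.  Everything here is an assumption for illustration - the
decay rate, the threshold, and the function names (note_buffer_use,
decay_usage, should_reclaim) are all hypothetical, not the actual code in
the diffs:

```c
#include <assert.h>

/* Hypothetical sketch: each buffer size keeps a decaying usage
 * counter; a size whose average falls well below the busiest size
 * becomes a candidate for reclaiming.  Fixed-point, scaled by 100. */
#define NSIZES 5                 /* e.g. 512, 1024, 2048, 4096, 8192 */
static int usage_avg[NSIZES];

/* Record one use of a buffer of the given size index. */
void note_buffer_use(int size_idx)
{
    usage_avg[size_idx] += 100;
}

/* Periodic decay, as a clock tick might apply it (~25% per tick). */
void decay_usage(void)
{
    for (int i = 0; i < NSIZES; i++)
        usage_avg[i] -= usage_avg[i] / 4;
}

/* Reclaim from a size whose average is a small fraction of the
 * busiest size's average.  The factor of 8 is illustrative. */
int should_reclaim(int size_idx)
{
    int max = 0;
    for (int i = 0; i < NSIZES; i++)
        if (usage_avg[i] > max)
            max = usage_avg[i];
    return usage_avg[size_idx] * 8 < max;
}
```

With only one buffer size in use, all other sizes decay to zero and the
busiest size is never reclaimed, which matches the "should not affect you
at all" remark above.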

New features in version 0.7:

	I added a number of features to monitor the size of the I/O
requests being passed down to the low-level scsi drivers, and the
results of this information indicated that there were some problems.
Specifically, it appeared as if requests were being broken in strange
ways, which was reducing performance.  Initially I had guessed that
the clustering code was unable to find a page of buffers that could be
reused.  I added a few one-line changes in there to increment counters
so that I could keep track of how many times we try to generate a
cluster, and how many times it succeeds.  The results of this indicated
that clustering succeeded > 99% of the time - this led me to believe
that the buffer cache itself was probably OK.

	The upshot of this was that I discovered a stupid bug in the
clustering code - after we decide to reassign a page of buffers to new
block addresses, we were not reassigning the new b_blocknr to all of
the buffers in the page.  Naturally this effectively made it appear as
if clustering was broken.  There was a one-liner to fix this one.

	Once the previous bug was fixed, I started experiencing system
wedging.  This was also easy to fix - shrink_buffers() was only
searching the clean list for pages to reclaim, and after heavy
writing, the clean list could be empty.  Fixing the previous bug
seemed to aggravate this one.

	Finally, there was a strange alignment problem in block_dev.c.
We start by reading block 0 all by itself, and this sets the f_reada
flag.  The next time through, we try and read 64 buffers, numbers
1-65.  The clustering code in block_dev.c would try and align things
so that blocks 0-3 would be on one page, 4-7 would be on the next, and
so forth, and by asking for blocks 1-65, we effectively required
the system to read 17 clusters even though the 1542 can only read 16
at a time.  I made a few minor changes so that this was no longer
a problem.
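
	The arithmetic behind this alignment problem can be shown with a
small sketch (assuming 1024-byte blocks, four to a page, as in the
description above; clusters_touched is a hypothetical helper, not code
from the diffs):

```c
#include <assert.h>

/* With four blocks per page, a cluster holds blocks [4n, 4n+3].
 * A request for blocks first..last touches every cluster whose
 * range overlaps it. */
#define BLOCKS_PER_CLUSTER 4

int clusters_touched(int first, int last)
{
    return last / BLOCKS_PER_CLUSTER - first / BLOCKS_PER_CLUSTER + 1;
}
```

A misaligned request for blocks 1-65 touches 17 clusters, while an
aligned request of the same size (blocks 4-67) touches only 16 - the
most the 1542 can handle in one command.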

	Once these three problems were solved, the iozone write
performance jumped up into the 1.75 - 1.8 Mb/sec range (this is
identical to the srawread numbers for my disk).  Thus we are truly disk
limited here.  In a sense I was expecting this to happen once
everything got ironed out - when writing all of the blocks are already
lined up ahead of time into neat little packages on the request queue,
and it is a simple matter to dump them to disk one by one.  The actual
dirtying of buffers takes place at the same time in the user process,
so we are making optimal use of the system resources.

	The read speeds went up a little but not as much.  These are
now hovering around 1.3Mb/sec. (The bug in the clustering code was the
reason that the iozone numbers fluctuated so much from one run to the
next - they are much more stable now).  In part I think the reason
that we are not getting close to the iozone numbers is that here the
actual reading of the buffers (transferring the data back to
user-space) does not take place at the same time as disk I/O.
Therefore while we are reading data from the disk, we are not
transferring data back to user-space, and while we are transferring
data back to user-space, we are not reading the disk.  In theory we
could alter the read-ahead code such that the read-ahead is taking
place in the background while we are copying other data back to user
space.  Even though this might sound like a useful thing to do, this
would only buy you anything if you were reading extremely large
amounts of data from one file - thus I am not sure how much this would
help you in the long run.  

	One thing still perplexes me, however, and this really should
be explained before we consider trying to have a background pre-fetch
running at the same time that we are transferring data to user space.
The srawread program is also copying data back to user space, albeit
in larger chunks at a time.  For some reason the overhead here is
quite small so that it is still possible for me to get 1.8Mb/sec back
from the disk to user-space.  For some reason the data copying in
block_dev.c is much slower even though it is using the same algorithm
as the srawread program (or even as a user-mode test program).  This
point has been discussed in the past with no firm resolution - it
would seem as if the memcpy_tofs requires that the processor caches be
set up a certain way, and there is overhead in doing this.  Something
about the code that is generated for block_dev.c makes this a
relatively slow process, while a naked series of memcpy_tofs calls
executes much faster.  If someone really wants to dig their teeth in,
I am still interested in a definitive answer.  I would be reluctant to
add a processor dependent hack into the kernel to boost I/O speed, but
if the solution is generic to all processors I would not mind so much.
Someone recommended that follow-up questions be made to comp.arch as
they seem to really get into these sorts of questions.  Who knows -
perhaps there is something that gcc is doing which is screwing us in a
big way.

	Finally, I am too lazy to remove some of my debugging hacks
from the diffs that I am uploading.  They really do not hurt
performance, and if other people want to fool with it, it is all in
there and ready to go.  When we get ready to incorporate this into the
distribution kernel, all this crud should be cleaned out.  I have
enclosed my patched version of iozone that harvests the statistical
data from the scsi code and from the buffer cache and prints it after
both the write and the read phases; you can use this to verify that
we are in fact pushing clusters through at an optimal rate.

New features in version 0.6:

	All of the bh->b_dirt = 1; statements in all of the
filesystems have been replaced by some code to actually move the
buffer to another list.  This way we do not have to search multiple
lists to find all of the dirty buffers.
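
	The replacement described above can be sketched as follows.  The
list handling is deliberately simplified to singly linked lists, and the
field layout is illustrative - it mirrors the idea, not the actual
structures in the diffs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of replacing "bh->b_dirt = 1;" with a refile:
 * the buffer is both flagged and moved from the clean list to the
 * dirty list, so finding dirty buffers never requires scanning the
 * other lists. */
struct buffer_head {
    int b_dirt;
    struct buffer_head *b_next;
};

static struct buffer_head *clean_list, *dirty_list;

/* Remove bh from the given list if it is present. */
static void unlink_from(struct buffer_head **list, struct buffer_head *bh)
{
    while (*list && *list != bh)
        list = &(*list)->b_next;
    if (*list)
        *list = bh->b_next;
}

void mark_buffer_dirty(struct buffer_head *bh)
{
    if (bh->b_dirt)
        return;                     /* already on the dirty list */
    unlink_from(&clean_list, bh);
    bh->b_dirt = 1;
    bh->b_next = dirty_list;        /* push onto the dirty list */
    dirty_list = bh;
}
```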

	SCSI host adapters with very large supported scatter-gather
lists have clustering disabled.  It was just a performance drag and
you do not gain anything.

	The ext2 filesystem now requests clusters for file read
operations.  A few optimizations were added to buffer.c so that it
would work better with ext2.

	The read-ahead was fixed everywhere so that it represents an
optimal number of sectors to transfer, not the additional number of
sectors to read.

New features in version pre-0.6:

	A new set of performance enhancements has been added, and
they do indeed seem to function as planned.  I now see iozone
benchmarks that are consistently in the 1.3Mb/sec range, with
occasional numbers that are much higher.  The fastest write time that
I have seen is 1.6Mb/sec, and the fastest read time is 1.79Mb/sec.  My
srawread numbers for this disk are about 1.76Mb/sec.  The thing that
really bugs me is that these large numbers only come up every so often
- most of the time I get numbers in the 1.3 Mb/sec range.  There is
evidently something else going on that I do not yet understand.
Hopefully there is something simple that I am missing, and once I
understand it the 1.7Mb/sec numbers will be the norm.  Note that
people with really fast disks will tend to notice a much more dramatic
improvement in performance.

	I have a theory that in the write phase, the kernel is getting
starved for free pages and in the process dumps some of the pages for
iozone itself.  Ultimately this means that iozone itself will start
page faulting, and slow down.  I have no evidence to support this, so this
is just an idea that needs to be investigated.

	With these patches, the I/O performance should not depend upon
memory usage so much.  Previously if you had a lot of shared text pages,
the I/O performance would inevitably suffer.  I believe that this situation
is now quite improved.

	1) The buffer cache is divided up into a number of different
lists.  These include a shared list (for buffers whose pages are also
shared, e.g. with text pages), a dirty list (for dirty buffers that
need to be written back to disk), and a clean list (for buffers that
can be reclaimed).  Note that the free list is still separate - this
contains buffers that have already been reclaimed, but have not yet
been reassigned to new block/device numbers.

	2) A new function refile_buffer was added to buffer.c.  This
is called in brelse and whenever we find a buffer that needs to be
moved.  

	3) ll_rw_block.c was patched so that dirty buffers that have been
scheduled for writing are moved to a new unlocked list.

	4) free_page checks to see if the usage count has dropped to
1 and whether the page contains buffers in the buffer cache.  This
indicates an image exit or the freeing of text pages that were shared
with the buffer cache.  We basically call refile_buffer if we think
the buffers need to be moved - they are moved to an unshared list.

	5) The refill_freelist function picks things off of the
various lists, choosing the oldest buffers each time.  Note that
we do not search the list of shared buffers or the list of dirty buffers.
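
	The selection rule in (5) can be sketched like this.  The arrays
are a toy stand-in for the real linked lists, the ages count ticks since
last use, and the function name is hypothetical:

```c
#include <assert.h>

/* Candidates for refill_freelist come only from the reclaimable
 * lists (never the shared or dirty lists), oldest buffer first.
 * A zero entry means an empty slot. */
#define NBUF 4
static int clean_age[NBUF]    = {5, 9, 0, 3};
static int unshared_age[NBUF] = {7, 2, 0, 0};

/* Return the age of the oldest reclaimable buffer, or 0 if none. */
int oldest_reclaimable(void)
{
    int oldest = 0;
    for (int i = 0; i < NBUF; i++) {
        if (clean_age[i] > oldest)
            oldest = clean_age[i];
        if (unshared_age[i] > oldest)
            oldest = unshared_age[i];
    }
    return oldest;
}
```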

New features in version 0.5:

	1) Bug fixed where you would get kernel panics if you had
more than 16Mb and a host adapter that used either EISA or VLB (i.e.
no ISA DMA restrictions).

	2) The hash_table is now dynamically sized (but just a bit).
The size is now either 997 or 4093 entries (for >= 4Mb systems).
I have no idea if this is really the most appropriate breakpoint, but
it is easy to tweak.

	3) The breada function has been replaced with something quite
a bit more useful.  The iso filesystem now uses this to read-ahead
directories and this vastly improves access times for cdrom directory
reads.

	4) Things have been rearranged a little bit - dirty buffers
now carry a timestamp that indicates when they should be written back
(previously it was the time the buffer was dirtied).  The general idea
is that some buffers (bitmaps, inodes, etc) can be given times that
are closer to the present so as to help ensure filesystem integrity in
the case of a system crash.  There is a function in buffer.c that
can be used to set this field properly.

	5) Patches updated for pl14.

	6) The bdflush daemon no longer screws up the load average.

	7) The update daemon now works with kernels that do not
support the bdflush syscall.

	8) Bug in scsi code fixed whereby the scsi system would hang
if you attempted to do regular disk I/O while something is using the
scsi_ioctl interface for another request.
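
	The sizing rule in (2) can be sketched as follows; the function
names and the modulo hash are illustrative assumptions (997 and 4093 are
the primes from the text, and prime table sizes help a simple modulo hash
spread entries evenly):

```c
#include <assert.h>

/* Pick the hash table size from total memory: systems with 4Mb or
 * more get the larger prime. */
int hash_table_size(unsigned long mem_bytes)
{
    return mem_bytes >= 4 * 1024 * 1024UL ? 4093 : 997;
}

/* A modulo hash of the kind such a table would use (illustrative). */
int bh_hash(int dev, int block, int size)
{
    return (unsigned)(dev ^ block) % size;
}
```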

TODO: 	Still need something for ext2 to ask for clusters.
==================================================================

New features in version 0.4:

	1) A real, honest to god free list is now present.
The buffers in it are guaranteed to be clean, unlocked and not
shared with any other process.  There is a refill function
that keeps it supplied with buffers - right now it is written
to supply 64 buffers any time it is called.  Also, there is a separate
free list for each different size of buffer, but the LRU list is
common for all of the different sizes.

	2) A bdflush process is now present, and runs in the
background when we need to write back some dirty buffers.  Currently
this only scans at most 1/4 of the buffer cache, and will write back
at most 500 buffers, whichever comes first.  These numbers are
wild-assed guesses as to what would be appropriate, and tuning would
probably help.  An interactive method of altering parameters might also
be good.  Note: you currently need to run the process in rc.  It may
eventually be possible to get bdflush started automatically without
having to run a process, but there are a lot of tricky and subtle
issues at hand here.  The source code for bdflush is at the end of
this message.


	3) iozone on a naked partition consistently now yields numbers
like 1.1-1.4Mb/sec.  I believe that further tuning would be good in
order to improve performance.  In particular, if there is a big wad of
dirty buffers coming through the LRU list, we do not detect this until
it gets to the top.  At this point we wake up bdflush(), but until
bdflush finishes, we have to crawl past this wad each time the refill
function is called.  Even then, the refill function supplies 64
buffers so the penalty is nowhere near as bad as it once was.  Some
further adjustment of the amount of data that bdflush writes back
would certainly be good, I guess.
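
	The scan limits described in (2) can be sketched as below.  The
dirty array is a toy stand-in for the real buffer lists, and the cache
size is an arbitrary example; the 1/4 and 500 limits are the ones from
the text:

```c
#include <assert.h>

/* One bdflush pass: examine at most a quarter of the cache and
 * write back at most 500 dirty buffers, whichever comes first. */
#define NR_BUFFERS 4000
#define MAX_WRITES 500

/* dirty[i] nonzero means buffer i needs writing back.
 * Returns the number of buffers written in this pass. */
int bdflush_pass(const char *dirty)
{
    int scan_limit = NR_BUFFERS / 4;
    int written = 0;
    for (int i = 0; i < scan_limit && written < MAX_WRITES; i++)
        if (dirty[i])
            written++;               /* the real code would start I/O */
    return written;
}
```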

********************************************************************

	There is code in buffer.c to generate clusters, and it is now
used by the block device code.  I am finding that it is not terribly
efficient to search for a page that we can reclaim, so it is best to
limit the search to only a fraction of the buffer cache.  Currently
this is set to 25%; I may back this off a little bit more.  This is
a tuning parameter that can be modified at run time via the bdflush()
syscall interface.

	The only thing left to do is to modify the filesystems to
request clustered buffers.  In the block devices, I basically do
something like:

	if((block % 4) == 0) generate_cluster(dev, block, blocksize);

which as I look at it now is incorrect because it assumes a 1024 byte
blocksize.  Nonetheless, once this is fixed, it could be added
directly to getblk so that we always request clustered buffers.  It
would be good if the filesystems were to try and align things on
cluster boundaries, but as I understand it, ext2 tends to keep files
contiguous so it probably should not matter that much.
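
	A blocksize-independent version of the same check might look like
the sketch below.  The PAGE_SIZE value and the helper name are
assumptions for illustration, not the fix itself:

```c
#include <assert.h>

/* The "% 4" test hard-codes 4 buffers per page, i.e. a 1024 byte
 * blocksize with 4096 byte pages.  Dividing the page size by the
 * actual blocksize gives the right boundary for any size. */
#define PAGE_SIZE 4096

int at_cluster_boundary(int block, int blocksize)
{
    return block % (PAGE_SIZE / blocksize) == 0;
}
```

With 1024-byte blocks this reduces to the `% 4` test above, while
2048-byte blocks (as on the cdrom mentioned earlier) get a `% 2`
boundary instead.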

	One concern that I have with this is the overhead of searching
for a page that can be reclaimed to be used for a new cluster.  I am
toying with the idea of discouraging the buffer cache from breaking
apart clusters so that things are always done on a page basis.  In
fact, the buffer cache would be reorganized so that things are
generally done by handling pages.  This would speed up a number of
parts of the buffer cache, but the filesystems are still expecting
buffer headers.  Linus was also thinking along these lines, and as I
look at it now, it is beginning to make more and more sense to me.
There are still some things that need to be thought out before I can
go ahead, but I suspect that on the whole it will lead to better
performance.


-Eric

