Work notes
Interface: FTP or WWW?
As already mentioned, the idea behind File Juggler is rather simple:
use CGI scripts as processing tools and a WWW browser as the user
interface to display the structure and content of a typical anonymous
FTP server in a convenient, easily navigated form.
The advantages of File Juggler's approach are:
- the archive user needs only a WWW browser (and nowadays almost every
Internet user has one); when the system is properly implemented,
even users with outdated and primitive browsers can work with the
archive;
- all processing (such as searching) is done on a relatively powerful
server machine;
- CGI scripts can provide sophisticated functions, such as searching
through the entire database of file names/descriptions and displaying
the contents of compressed archives (*.tar.gz, *.zip, etc.), which are
impossible to perform on the user side without massive transfers.
The drawbacks are also quite obvious:
- indexing software has to be installed on the server; even software as
simple as File Juggler requires some amount of administrator work;
- the workload of the ftp server is significantly increased; while a
classic anonymous ftp service can run on a rather underpowered machine,
CGI scripts with lots of processing need fast hardware (a fast CPU and
large amounts of random access memory);
- some ftp user agents (like my humble NFTP) have quite advanced
capabilities which cannot be superseded even with very sophisticated
CGI processing.
Sizes and times
This section describes performance problems in File Juggler 1.0. In this
version, file information is stored in a proprietary binary format in
a disk file. Each time a CGI script is invoked, the database is read into
memory with a single read() call.
Currently, each entry takes 22 bytes -- not counting the file name itself
and the description. On my server, about 32 bytes per file are used for
name/desc. That gives 54 bytes per file; let it be 50 for a round number.
My P5-133 spends about 0.025 ms on one file
when doing a linear search with a simple criterion (690 ms for 27,000
files, searching for a substring among file names/descriptions, with
70 hits). Let's assume the hardware is more powerful: 0.01 ms per entry
as an approximation (in theory, searches might be more complicated, such
as regular expressions, but very few people will use them). This is with
the entire database loaded into memory. Of course, there can be unknown
factors.
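To make the load-then-scan approach concrete, here is a minimal sketch in Python. The record layout is purely hypothetical: the real File Juggler format is not described here, so the seven header fields below were chosen only so that they add up to the 22 bytes mentioned above. The function names are likewise illustrative.

```python
import struct

# HYPOTHETICAL 22-byte header, not the real File Juggler layout:
# size, mtime, dir_id, flags (uint32 each), mode, name_len, desc_len (uint16 each)
HEADER = struct.Struct("<IIIIHHH")  # 4*4 + 3*2 = 22 bytes, no padding


def pack_entry(name, desc, size=0, mtime=0, dir_id=0, flags=0, mode=0):
    """Serialize one entry: fixed header followed by name and description."""
    n, d = name.encode(), desc.encode()
    return HEADER.pack(size, mtime, dir_id, flags, mode, len(n), len(d)) + n + d


def load_database(path):
    """Read the whole database file into memory with one read() call."""
    with open(path, "rb") as f:
        return f.read()


def search(blob, needle):
    """Linear scan: return (name, desc) pairs where needle is a substring of either."""
    needle = needle.encode()
    hits = []
    pos = 0
    while pos < len(blob):
        *_, name_len, desc_len = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        name = blob[pos:pos + name_len]
        desc = blob[pos + name_len:pos + name_len + desc_len]
        pos += name_len + desc_len
        if needle in name or needle in desc:
            hits.append((name.decode(), desc.decode()))
    return hits
```

A CGI script following this scheme would call load_database() and then search() on every request, which is why the cost of both steps matters.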
In case you wonder why it takes 32 bytes to store 8+3 filenames:
on my server, where statistics are gathered, there are a lot of looong
filenames in the lyrics archive, and mirrors of hobbes and LEO
have descriptions which also add to the count. Note that symlinks
are also represented as file descriptions. On a more typical ftp server
such as ftp.sai.msu.su, it's 29 bytes per file -- apparently, long Unix
filenames are long! On the ftpsearch site, the average filename takes
14 bytes, but the method of counting is different and we have to add
directory names (in File Juggler, they are in a separate name space but
still take memory and disk space).
- huge project (like ftpsearch): 100,000,000 entries
- Database size: 5GB. Search time: 1000 sec, i.e. about
17 min. This is unacceptable in terms of both memory and search time,
therefore other methods should be used for such projects (like ftpsearch
itself) :-}
Conclusion: File Juggler is not suited for the job.
- large-scale project: 10,000,000 entries
- suppose we have 100 sites, each containing 100,000 files
(look at the ftpsearch statistics for a better guess).
Then we have 10,000,000 entries in the database.
Therefore, the total size of the database is 500MB. This is quite
a strain when the database is loaded into memory, and apparently
some methods should be employed to lessen the memory requirements.
Of course, reading 500MB from disk each time the search engine
is run is completely inappropriate; therefore we need some kind
of server process sitting and doing the work when asked; the database
must be kept loaded in memory by that server process.
A linear search will take 100 sec, which is rather poor but close
to acceptable values. So File Juggler 1.0 has to be rewritten
significantly to handle this project.
Conclusion: File Juggler needs severe optimizations.
- average archive (e.g., network of a large university): 1,000,000 entries
- Let's now see whether we can handle 1,000,000 files. The database will
take 50MB, a linear search -- 10 sec. This is acceptable. If the processing
machine has enough RAM to cache a 50MB file, there will be no significant
problems with File Juggler.
Conclusion: File Juggler will struggle, but do the job.
- small archive (one medium-sized ftp server): 100,000 entries
- Database size is 5MB, search time -- 1 sec. Even underpowered machines
can do it. E.g., File Juggler was initially installed on a poor 486DX4/100
with 16MB of RAM and an IDE drive, and response times were very good with
~ 20,000 files.
Conclusion: there will be no performance problems.
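All four estimates above come from the same two constants: 50 bytes and 0.01 ms per entry. A small Python sketch of that arithmetic follows; the names are illustrative, and the time constant is expressed as 10 microseconds so that the calculation stays exact.

```python
BYTES_PER_ENTRY = 50  # rounded down from 22 + 32 = 54 bytes
US_PER_ENTRY = 10     # 0.01 ms per entry, the assumed figure for faster hardware
                      # (measured on the P5-133: 690 ms / 27,000 files, about 0.025 ms)


def estimate(entries):
    """Return (database size in MB, linear search time in seconds)."""
    size_mb = entries * BYTES_PER_ENTRY / 1_000_000
    seconds = entries * US_PER_ENTRY / 1_000_000
    return size_mb, seconds


if __name__ == "__main__":
    for n in (100_000, 1_000_000, 10_000_000, 100_000_000):
        size_mb, seconds = estimate(n)
        print(f"{n:>11,} entries: {size_mb:>6,.0f} MB, {seconds:>5,.0f} s")
```

Both quantities grow linearly with the number of entries, so every tenfold increase in archive size costs a tenfold increase in memory and in search time.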
General conclusion: the concept of reading the database each time
the CGI interface is invoked is poor and is one of the main performance
bottlenecks.
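One way around this bottleneck is the resident server process mentioned earlier: it loads the database once and answers queries from the CGI gateway. The sketch below is only an illustration in Python. The one-line-in, lines-out protocol, the TCP transport, and all names are assumptions, and a short in-memory list stands in for the real database.

```python
import socket
import socketserver
import threading

# Stand-in for the database, loaded once when the server process starts.
# A real server would read the File Juggler database file here instead.
ENTRIES = ["nftp.zip FTP client", "readme.txt", "lyrics.tar.gz song lyrics"]


class QueryHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed protocol: one query line in, matching entries out (one per
        # line); the server closing the connection marks the end of the reply.
        needle = self.rfile.readline().decode().strip()
        for entry in ENTRIES:
            if needle in entry:
                self.wfile.write(entry.encode() + b"\n")


def start_server(host="127.0.0.1", port=0):
    """Start the search server in a background thread (port 0 = any free port)."""
    server = socketserver.ThreadingTCPServer((host, port), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(address, needle):
    """What the CGI gateway would do on each request: connect, ask, read the reply."""
    with socket.create_connection(address) as conn:
        conn.sendall(needle.encode() + b"\n")
        with conn.makefile("rb") as reply:
            return [line.decode().rstrip("\n") for line in reply]
```

The background thread is only there to keep the example self-contained; in a real installation the server would be a separate long-running process, and each CGI invocation would pay only for a connection and the search itself.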
Other possible methods of improving performance are:
- use of general-purpose full-scale database management software
to store information about files
- this adds a huge level of complexity in installation and support;
the performance advantages are questionable
- use of (g)dbm files as database management software
- while it is simple in terms of installation and maintenance,
performance improvements are very unlikely. However, this method
allows third-party software to access the database much more easily
- use of memory-mapped files instead of plain disk files read into
memory
- the performance advantage is unknown, but this approach certainly
makes sense
- use of a separate server process which deals with the database
- this should give a huge performance boost. The main problem is
the communication between this server and the WWW gateway. In theory,
they could even be run on separate computers.
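For the memory-mapped option in the list above, a minimal Python sketch is given below. It maps the database file instead of reading it, so the operating system pages data in on demand and can share cached pages between successive CGI invocations. The function name is illustrative, and the scan ignores record boundaries: it simply counts raw substring occurrences in the file.

```python
import mmap


def count_hits(path, needle):
    """Count occurrences of the bytes needle in a memory-mapped file.

    Note: mmap cannot map an empty file; that case raises ValueError.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            hits = 0
            pos = view.find(needle)
            while pos != -1:
                hits += 1
                pos = view.find(needle, pos + 1)
            return hits
```

Whether this is actually faster than a single read() depends on the operating system and on how much of the file is touched per query, which matches the "unknown" assessment above.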
Further reading:
ftpsearch homepage.