Work notes

Interface: FTP or WWW?

As already mentioned, the idea behind File Juggler is rather simple: using CGI scripts as processing tools and a WWW browser as the user interface, display the structure and content of a typical anonymous FTP server in a convenient, easily navigated form. The advantages of File Juggler's approach are:

the archive user only needs a WWW browser (and nowadays almost every Internet user has one); when the system is properly implemented, even users with outdated and primitive browsers can work with the archive;
all processing (such as searching) is done on a relatively powerful server machine;
CGI scripts can provide sophisticated functions such as searching through the entire database of file names/descriptions and displaying the contents of compressed archives (*.tar.gz, *.zip, etc.), which are impossible to perform on the user side without massive transfers.

The drawbacks are also quite obvious:

indexing software has to be installed on the server; even software as simple as File Juggler requires some administrator work;
the workload of the ftp server is significantly increased; while a classic anonymous ftp service can run on a rather underpowered machine, CGI scripts with lots of processing need fast hardware (a fast CPU and large amounts of random access memory);
some ftp user agents (like my humble NFTP) have quite advanced capabilities which cannot be superseded even by very sophisticated CGI processing.

Sizes and times

This section describes performance problems in File Juggler 1.0. In this version, file information is stored in a proprietary binary format in a disk file. Each time the CGI script is invoked, the database is read into memory with a single read() call.
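To make the "whole database in one read" idea concrete, here is a minimal sketch in Python. The real File Juggler format is proprietary and not described in these notes, so the record layout below (file size, modification time, name length, then the name itself) and the names HEADER, save_database and load_database are purely hypothetical.

```python
import struct

# Hypothetical record header, NOT the real File Juggler layout:
# file size (uint32), modification time (uint32), name length (uint16).
HEADER = struct.Struct("<IIH")


def save_database(path, entries):
    """Write (name, size, mtime) tuples as header + raw name bytes."""
    with open(path, "wb") as f:
        for name, size, mtime in entries:
            raw = name.encode("latin-1")
            f.write(HEADER.pack(size, mtime, len(raw)))
            f.write(raw)


def load_database(path):
    """Read the whole file with one read() call, then unpack it in memory."""
    with open(path, "rb") as f:
        blob = f.read()  # the entire database lands in memory here

    entries = []
    offset = 0
    while offset < len(blob):
        size, mtime, name_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        name = blob[offset:offset + name_len].decode("latin-1")
        offset += name_len
        entries.append((name, size, mtime))
    return entries
```

The point of the sketch is the cost model: every invocation pays for reading and unpacking the full file before the first entry can be examined.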

Currently, each entry takes 22 bytes -- not counting the file name itself and the description. On my server, about 32 bytes per file are used for name/desc. That gives 54 bytes per file; let it be 50 for a round number. My P5-133 spends about 0.025 ms on one file when doing a linear search with a simple criterion (690 ms for 27,000 files, searching for a substring among file names/descriptions, with 70 hits). Let's assume the hardware is more powerful: 0.01 ms per entry as an approximation (in theory, the searches might be more complicated, such as regular expressions, but very few people will use them). This is with the entire database loaded into memory. Of course, there can be unknown factors.
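The per-entry figure is simply the measured time divided by the number of files: 690 ms / 27,000 is roughly 0.0256 ms. A small sketch of that arithmetic, together with the kind of linear substring scan being timed, is given here. The function names are made up for illustration, and the (name, description) tuple is an assumed in-memory form, not the actual File Juggler structure.

```python
def per_entry_ms(total_ms, n_entries):
    """Average search cost per entry, in milliseconds."""
    return total_ms / n_entries


def linear_search(entries, needle):
    """Scan every (name, description) pair and keep those containing needle.

    Plain case-sensitive substring match; the real search criterion
    may differ.
    """
    return [(name, desc) for name, desc in entries
            if needle in name or needle in desc]


if __name__ == "__main__":
    # Measurement quoted in the notes: 690 ms for 27,000 files.
    print(f"{per_entry_ms(690, 27_000):.4f} ms per entry")
```

Every entry is visited on every query, so search time grows in direct proportion to the number of files.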

In case you wonder why it takes 32 bytes to store 8+3 filenames: on my server, where the statistics were gathered, there are a lot of looong filenames in the lyrics archive, and the mirrors from hobbes and LEO have descriptions which also add to the count. Note that symlinks are also represented as file descriptions. On a more typical ftp server such as ftp.sai.msu.su, it's 29 bytes per file -- apparently, long Unix filenames are long! On the ftpsearch site, the average filename takes 14 bytes, but the method of counting is different and we have to add directory names (in File Juggler, they are in a separate name space but still take memory and disk space).

huge project (like ftpsearch) 100,000,000 entries: Database size: 5GB. Search time: 1000 sec, i.e. about 17 min. This is unacceptable both in terms of memory and search time, therefore other methods should be used for such projects (like ftpsearch itself :-})
Conclusion: File Juggler is not suited for the job.
large-scale project 10,000,000 entries: suppose we have 100 sites, each containing 100,000 files (look at the ftpsearch statistics for a better guess). Then we have 10,000,000 entries in the database, and the total size of the database is 500MB. Holding this much in memory is quite a strain, and some method should be employed to lessen the memory requirements. Of course, reading 500MB from disk each time the search engine is run is completely inappropriate; therefore we need some kind of server process that keeps the database loaded in memory and does the work when asked. A linear search will take 100 sec, which is rather poor but close to acceptable. File Juggler 1.0 would therefore have to be rewritten significantly to handle such a project.
Conclusion: File Juggler needs severe optimizations.
average archive (e.g., network of a large university) 1,000,000 entries: Let's now see whether we can handle 1,000,000 files. The database will take 50MB, a linear search -- 10 sec. This is workable. If the processing machine has enough RAM to cache a 50MB file, there will be no significant problems with File Juggler.
Conclusion: File Juggler will struggle, but do the job.
small archive (one medium-sized ftp server) 100,000 entries: Database size is 5MB, search time -- 1 sec. Even underpowered machines can do it. E.g., initially File Juggler was installed on a poor 486DX4/100 with 16MB of RAM and an IDE drive, and response times were very good with ~20,000 files.
Conclusion: there will be no performance problems.
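The four estimates above all come from the same two multiplications, using the rounded figures from earlier: 50 bytes and 0.01 ms per entry. A short sketch that reproduces them is given here; the constant and function names are only illustrative, and MB is taken as 1,000,000 bytes, which is what the round numbers in the text imply.

```python
BYTES_PER_ENTRY = 50   # rounded down from 22 + 32 = 54 bytes
MS_PER_ENTRY = 0.01    # assumed cost of examining one entry in memory

SCENARIOS = [
    ("huge project", 100_000_000),
    ("large-scale project", 10_000_000),
    ("average archive", 1_000_000),
    ("small archive", 100_000),
]


def estimate(n_entries):
    """Return (database size in MB, linear search time in seconds)."""
    size_mb = n_entries * BYTES_PER_ENTRY / 1_000_000
    search_s = n_entries * MS_PER_ENTRY / 1000
    return size_mb, search_s


if __name__ == "__main__":
    for label, n in SCENARIOS:
        size_mb, search_s = estimate(n)
        print(f"{label:<20} {n:>12,} entries {size_mb:>8,.0f} MB {search_s:>7,.0f} s")
```

Both quantities scale linearly with the number of entries, so each tenfold step in archive size costs a tenfold step in memory and in search time.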

General conclusion: the concept of reading the database each time the CGI interface is invoked is poor and is one of the main performance bottlenecks.

Other possible methods of improving performance are:

use of general-purpose full-scale database management software to store information about files: this adds a huge level of complexity in installation and support, and the performance advantages are questionable.
use of (g)dbm files as database management software: while this is simple in terms of installation and maintenance, performance improvements are very unlikely. However, this method allows third-party software to access the database much more easily.
use of memory-mapped files instead of plain disk files read into memory: the performance advantage is unknown, but this approach certainly makes sense.
use of a separate server process which deals with the database: this should give a huge performance boost. The main problem is the communication between this server and the WWW gateway. In theory, they could even run on separate computers.
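As an illustration of the memory-mapped option, here is a minimal sketch using Python's standard mmap module. It treats the database as an opaque byte stream and counts raw occurrences of a substring; a real implementation would walk the records instead. The function name count_hits is made up for this example.

```python
import mmap


def count_hits(path, needle):
    """Count non-overlapping occurrences of the bytes `needle` in a file.

    The file is mapped rather than read: the operating system pages
    data in on demand and can share cached pages between invocations.
    Note that mmap refuses to map an empty file.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hits = 0
            pos = mm.find(needle)
            while pos != -1:
                hits += 1
                pos = mm.find(needle, pos + len(needle))
            return hits
```

Whether this actually beats a single read() depends on how well the operating system keeps the file cached between CGI invocations, which is exactly the unknown noted in the list.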