

Re: Megabytes

From: uli@zoodle.robin.de (Ulrich Grepel)
Date: Wed, 9 Jun 93 23:46 MET DST
Subject: Re: Megabytes
To: love-hounds@uunet.UU.NET

>     If you don't have a CD-ROM mastering unit you can get use of somebody
> else's at about £250 a go. Thus at today's costs we could get this thing made
> at...

Well - if we need enough time to get to the production stage: I know a guy who
is going to buy such a CD-ROM recorder as soon as he finds the right one. The
Kodak/Philips machine available now needs a computer and a hard disk to work,
but Kodak is going to present a new device at the Berlin Funkausstellung that
has an optical digital input. That will be his device of choice, since he is
after music, not CD-ROMs. Assuming that this device still has a SCSI port, it
would not be a problem to find enough hard disk space, at least for a short
time (DAT backup of a large disk, etc.). The only remaining problem might be
the software to produce the master via SCSI. And I don't think he would want
money from me for using it - except for the master disc itself, of course.
(Blank Kodak Photo-CDs are about 20 DM - cheaper than a normal audio CD.)

> - £250 for use of machine,

Maybe - I said maybe - we can even get rid of this one.

> - £50 to buy one master,

This might be much cheaper (20 DM is roughly 12 US-$).

> - £5/copy for each CD-ROM.

This completely depends on the number of CDs we want to have. The first one is
by far the most expensive, since what you (Graham) call the 'master' CD isn't
really THE master. That one is made out of glass in a clean room and is used
to press the little pits into the actual CDs. You'll need such a thing unless
you want to produce hundreds of Kodak CDs one by one (that would take A LOT of
TIME, since you need about half an hour per disc, and that's already on a
double-speed recorder).

>     ... thus if at least 300 people wanted a copy it would cost us £6 per unit!
> Cheap!!

It's pretty unimportant whether it's $6 or $3 or $10, since postage might
easily double these costs :-(... (at least for those who don't live in the
country where the CD gets produced).

>     As I suggested, we should try and cater for EVERYBODY. That would mean we
> need to find out just how many different PC platforms are used by those who
> read gaffa and would buy our CD-ROM.

OK, suggestions again in a separate post.

>     Huh?? Amigas and Ataris are cool BUT can you even get a C64 kitted out with
> CD-ROM!!?? BTW RFT is OK but it's like the poor man's idea of compatibility. We
> got space to burn here guys, use it.

Attention, Graham: RFT != RTF. RFT (Revisable-Form Text) is a format developed
by IBM to exchange documents. It is ugly, undecipherable for humans, and
moderately difficult to generate and analyse. Actually RFT is part of DCA-RFT
(or was it the other way round?), and DCA means 'Document Content
Architecture'. DCA-RFT is the text format that can still be edited afterwards,
as opposed to the final-form variant (whose name I don't recall right now)
that is designed to drive printers (like PostScript, it is not very
revisable). It contains things like page headers that are marked as such, but
unlike (La)TeX the page/line breaks are already fully computed, yet remain
revisable. BTW: RFT is an EBCDIC-based format that can be read by MS-Word and
IBM Text n, n in {3, 4, ...}, and probably more.

RTF (Rich Text Format) is a format developed by Microsoft and used by
MS-Windows, NeXTSTEP and (maybe? at least in MS-Word) on the Macintosh
platform. Maybe others. RTF documents are a bit like LaTeX documents in that
they contain headers and commands in human-readable form (normally hidden),
are quite easy to generate and (if you understand how to use YACC and LEX,
which I do) quite easy to parse. RTF documents can contain pictures (at least
in the NeXT variant) and as many fonts as you have (of course we shouldn't use
more than one, maybe two...). They are - like DCA-RFT - editable, searchable,
printable etc. Some software firm has even modified the GNU C preprocessor to
read RTF-coded C sources, just to make programming easier.
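
Just to illustrate the 'human-readable commands' point, here is what a
minimal, hand-written RTF file could look like (an illustrative sketch, not
taken from any real document):

{\rtf1\ansi\deff0
{\fonttbl{\f0 Times New Roman;}}
\f0 Plain text, {\b bold text} and {\i italic text}.\par
A new paragraph starts after the \par control word.\par
}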

>     I don't want to add or subtract ANYTHING from the archives. Just supply a
> more convenient method of accessing and using them.

Not subtracting anything is OK, but why not add things? Unless, of course, you want
Bill to include everything we add in the archives ;-).

> [about data formats for the archives:]
> Data
> ====
>     Store the data in a manageable meaningful form.
> 

> /archive/1984/jan84
> /archive/1984/feb84
> etc...

Well - that's almost the way it is now. You just have to translate file
numbers (ranging from 1-8 in some years up to 1-49 in others) into (parts of)
months, since the archive at ftp.uu.net is organized as

/archive/1985/Mail01
...
/archive/1985/Mail08
...
/archive/1992/Mail32

with each file containing about 100 messages. The file format of these files
is exactly the format most (all?) email readers use for their mailboxes, so
you can read that stuff with your favourite mail reader. Not that this is an
easy way to search for anything in that amount of data, but without manual
indexing work there is no alternative to full-text search or searching for
specific header line contents. Both are crap.
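
Just to show how standard that format is: something like the following would
already walk the whole archive (a sketch in Python - the local directory
layout and the use of the mailbox module are my assumptions, not anything
that exists yet):

# Sketch: list Date/From/Subject for every message in the ftp.uu.net-style
# archive, assuming the files are plain Unix mbox files under ./archive.
import glob
import mailbox

for path in sorted(glob.glob("archive/*/Mail*")):
    for msg in mailbox.mbox(path):
        print(msg.get("Date", "?"), "|",
              msg.get("From", "?"), "|",
              msg.get("Subject", "(no subject)"))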

>     Each of the data files is textual and contains gaffa for that month in the
> form...
> 


> Subject:
> From:
> Date:
> Text:

There are three differences between your format and the one actually used: you
strip all unnecessary header lines (good idea, especially if you look at your
own headers ;-)), you change the empty line separating the body from the
headers into 'Text:', and you skip the line that begins every message in
mailbox files, namely a line like

>From uli@zoodle.robin.de Wed Jun 09 22:35:00 1993

(The '>' in front of the 'From' is actually inserted by most mailing software
to avoid confusion on the receiver's side.) So all we need to generate the
basis for your format is a utility that strips the unnecessary header lines -
and then you've even got a standard format!
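
Such a stripping utility is almost trivial; roughly along these lines (again
only a sketch - which headers to keep, and the file names, are my
assumptions):

# Sketch: copy an mbox file, keeping only the From:, Date: and Subject:
# headers of each message.
import mailbox
import os

KEEP = ("From", "Date", "Subject")

os.makedirs("stripped", exist_ok=True)
src = mailbox.mbox("archive/1985/Mail01")
dst = mailbox.mbox("stripped/Mail01")
for msg in src:
    for name in list(msg.keys()):
        if name not in KEEP:
            del msg[name]          # removes every occurrence of that header
    dst.add(msg)
dst.flush()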

> [about pre-indexing:]

Well - there actually are full-text indexing systems. Take Digital Librarian
from NeXT as an example. I would like to add the indices for DL to the stuff;
they come out at about 10-15% of the size of the real data. Again: why not use
a standard (well...) if there is one? And if there's no _*STANDARD*_, then use
whatever is appropriate, usable for some people, and fits on the distribution
media. Not that I am against your indexing idea - it's OK, but it shouldn't be
the only one.
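
For platforms without something like Digital Librarian, even a naive inverted
index over the message bodies would help; conceptually it is no more than this
(a rough sketch, not any existing tool):

# Sketch: build a naive inverted index (word -> message numbers) over one
# archive file; a real indexer would add stemming, stop words, etc.
import mailbox
import re
from collections import defaultdict

index = defaultdict(set)
for num, msg in enumerate(mailbox.mbox("archive/1985/Mail01")):
    body = msg.get_payload()
    if not isinstance(body, str):      # skip odd multipart messages here
        continue
    for word in re.findall(r"[a-z']+", body.lower()):
        index[word].add(num)

print(sorted(index["kate"]))           # message numbers mentioning "kate"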

> Archive Readers
> ===============
>     So long as we all agree on what platforms the readers need to be written we
> could supply a folder on the CD-ROM for every PC platform with its own archive
> reader. What do we need? Amiga, Macintosh, PC-DOS, PC-Windows, NeXT, Atari,
> Pro-DOS... any others? I got the Amiga, Macintosh, PC-DOS, PC-Windows and
> Pro-DOS covered. Who could do others?

I could cover NeXT, and I don't think we absolutely need a reader, but it
would be GREAT to have one. On NeXTs it might be an interface between the
Indexing Kit (that is the base for Digital Librarian) and NextMail. The main
problems with every conceivable automatism are the following:

- We have far more than 20,000 messages.
- If you don't group, you'll get about 1,000 hits for any keyword in your
  search.
- If you group them by date you'll get even more stuff to check, since you
  will get about 50 hits, each covering about 100 messages (unless you
  restrict your search to some time range).
- Everything else means WORK. Like grouping by content. Adding specific
  search keywords to the articles. Throwing out articles (NO!!!). Assume you
  had an easy-to-use keyword-assignment tool: scan the article, click with
  mouse or keyboard on some keywords, continue with the next article. I think
  you'll need about 10 seconds absolute minimum (after a lot of training) to
  categorize a short article. That makes more than 200,000 seconds - well
  over two full 24-hour days of nothing but categorizing. That's a no-no,
  since 10 seconds is by far too low an estimate anyway.

As you see, I would LOVE to get suggestions about this one. BTW: subject
ordering is not that helpful. I can do this with my mailboxes, which contain
fewer than 4,000 messages for the last 9 months of Ecto, and it is already
impossible to find anything that way.

>     All we need do is agree on what the data would look like and what the
> applications should do. I've detailed my idea for the data above. I believe the
> readers need to perform the following....

The data format should be the UNIX (and others?) mailbox format, since it is
almost what you had in mind, it can do the same things, and the data is
already in that format anyway. I don't think much discussion is needed here.

> 1\ List View:
> A presentation of a list of the currently found data set. Basically a big list
> of [Subject...Date.... From]. From this view one should be able to...
>     a\ Sort the data in what ever key and subkey required. The pre-defined
>        indexes make this very fast.

Your indices add speed to what I already have. And even if I take the speed of
my computer out of the equation, my own speed gets in the way: you never know
exactly what you are searching for. I don't even think you'll need non-date
sub-indices, since no thread is THAT long (only want to see Jorn's part of the
recent flamewar? ;-) (no offence intended, Jorn)), and nobody posts in
hundreds of threads at the same time. The date criterion is already present in
the natural order of the messages in the database, so it needs no time at all
(and no index).

>     b\ Search on each field. This could be (Subject is "RE: Stuff", From
>        is not "graham@bhp.com", Date is "19/03/87-22/12/91", Text contains
>        "Kate") obviously there are a lot of things to think about on this
>        and I don't want to detail them all here. One could find/omit with
>        successive searches on the archive to home in on the sub-list
>        required.

Yes, that would be fine. But for a case like your example (where actually
hundreds of messages belonging to hundreds of different threads match), you
only need to sort by subject and then scroll to the relevant part (that is
PRETTY fast on my machine). The non-Graham messages are easily picked out of
the not-too-long list.
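
In mail-reader-less terms, that kind of find/omit query is about this much
work (a sketch; the file name and the exact match rules are just
placeholders):

# Sketch: 'Subject contains "Stuff", From is not graham@bhp.com',
# then present the hits sorted by subject.
import mailbox

hits = []
for msg in mailbox.mbox("archive/1991/Mail20"):
    subject = msg.get("Subject", "")
    sender = msg.get("From", "")
    if "Stuff" in subject and "graham@bhp.com" not in sender:
        hits.append((subject, msg.get("Date", ""), sender))

for subject, date, sender in sorted(hits):
    print(subject, "|", date, "|", sender)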

>     c\ Print List/Articles
>     d\ Save List/Articles
>     e\ Read currently selected article.

of course...

> 2\ Article View:
> A presentation of the currently selected article. Should be just like reading
> the news, with a header section (that can be hidden) and the text section. From
> this view one should be able to...
>     a\ Print article
>     b\ Save article
>     c\ Read next
>     d\ Read previous

again, of course.

> 


>     That's it. If you can think of anything else that might be needed then let
> me know!

Another function the software needs (if you write one) is the ability to
forward or follow up on an article.

So what you've added to my mail program are two things: indexed search, to
make all this faster (important), and full-text search in the message bodies
(important too). Everything else should be possible with your vanilla mail
reader.

Unfortunately the archives don't contain enough information to use a threaded
newsreader, unless you rely on the Subject: lines alone. For those who don't
know what a threaded newsreader does: it folds each thread into a single
pseudo-message that expands into the list of actual messages when selected.
Another advantage of threaded newsreaders is that they present every article
only once, even if it was crossposted to other newsgroups. But we only have
one newsgroup, and there's not enough information in the archives (Message-IDs
have generally been removed).
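
Threading on nothing but the Subject: lines would look roughly like this (a
sketch; treating everything with the same 'Re:'-stripped subject as one
thread is my own simplification):

# Sketch: fake threading by folding messages with the same normalized
# subject ("Re: Re: Foo" -> "foo") into one group.
import mailbox
import re
from collections import defaultdict

threads = defaultdict(list)
for msg in mailbox.mbox("archive/1991/Mail20"):
    subject = msg.get("Subject", "(no subject)")
    key = re.sub(r"^(\s*re:\s*)+", "", subject.strip().lower())
    threads[key].append(msg)

for key, msgs in sorted(threads.items()):
    print(len(msgs), key)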

Another thing that a procedure to improve the archive headers should do is
remove the bogus headers from news systems that catch posted messages and
resend them to love-hounds@uunet.uu.net. These articles generally have no
subject and no proper 'From:' line. Only a certain number of sites actually
'destroy' articles like that.

>     What else... Well how about a pictures section with subsections for things
> like
> 

> 1\ A scan of every album front and back. Includes all singles and maybe boots?
> 2\ A scan of EXACTLY WHERE each of the hidden KT's are!!
> 3\ A "Family Album" section with Kate and friends through the years
> 4\ A "Scrapbook" section with lots of different KT shots.
> 5\ A KateCon section!

Yes, organized pictures would be great. Many of the above pictures are
available on ftp.uu.net, but again in a quite disorganized state. Maybe we
could at least make some GOOD scans of all the albums, in the same format for
each album? Who has a good scanner?

>     Can anybody think of anything else here? I feel the best format to use with
> pictures is GIF since it's SO universal. Every platform I can think of has GIF
> viewers.

Well - GIF is quite standard, but it's about the worst format I can imagine.
Yes, it is compressed, but badly. Yes, it's color, but only 256 colors (gee -
I only have a 4-grayscale machine!). What about TIFF? Better compression,
better color options. How universal is TIFF? Or JPEG? (Almost too good
compression...) Or, again, more than one format? This is one point where we
might have to decide against the multi-format idea, since it would be easy to
produce MUCH more picture data than a CD can hold.
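
And if we do ship more than one picture format, batch conversion is the cheap
part anyway; roughly like this (a sketch using the Pillow imaging library for
Python - the file locations are made up):

# Sketch: produce JPEG copies of all GIF scans in ./pictures.
import glob
from PIL import Image

for path in glob.glob("pictures/*.gif"):
    img = Image.open(path).convert("RGB")   # JPEG can't store a palette
    img.save(path[:-4] + ".jpg", quality=90)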

P.S.: I would LIKE to have pictures within the discography. Like the 'Illustrated
Collector's Guide...', but please: COLOR (again said by a user of a b/w screen).

>     How about a Sound section? Various interesting Kate sound clips. (Like her
> talking, the Sensual World "laugh" and "yeah". Maybe clips showing hidden
> messages/sounds from various songs. I'm not sure how to store these. Amiga
> could use 8SVX, Mac 'snd ' and PC a .WAV, but what about the rest? Maybe we
> could store them all as one type of sound and include some public domain
> convertors for all other platforms. This would save on space.

My NeXT can play just about anything digital up to 2-channel 16-bit 44.1 kHz,
at least if you can convert it somehow. Minimum sound quality should be 8-bit
8 kHz mono, probably MULAW-encoded to improve quality (MULAW is used by Suns
and NeXTs and ...?).
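
Producing MULAW from linear samples is not much work either; something like
this (a sketch with the audioop module that ships with older Python versions;
it assumes 16-bit mono input and does no sample-rate conversion):

# Sketch: convert 16-bit linear PCM from a WAV file to raw 8-bit mu-law.
import audioop
import wave

wav = wave.open("clip.wav", "rb")        # assumed: 16-bit mono, 8 kHz
frames = wav.readframes(wav.getnframes())
wav.close()

ulaw = audioop.lin2ulaw(frames, 2)       # 2 = bytes per input sample
with open("clip.ul", "wb") as out:       # raw mu-law data, no header
    out.write(ulaw)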

>     What else? We could have the up-to-the-minute Kate FAQ's (the very detailed
> version). Maybe the complete Kate Discography. Maybe an introduction to the net
> and how to find rec.music.gaffa or love-hounds-request. Address for other
> fanzines and KBC... I don't know, I'm running out of ideas. So what do you
> think guys? Do we give it a go?? I imagine a nice white CD case with a BIG KT
> logo on it. Maybe written on the spine 'Gaffa: 10 years of
> Love-Hounds-Request.'

The net-intro is really a GOOD idea. That one should even go into the booklet
of the CD, just to improve the chance that anyone can find us. Since the KT
logo is as round as a circle, I always imagine the CD itself covered by the KT
logo. But isn't it copyrighted? (Well - we might try to get it approved... any
chance of success?)

I will post another message to summarize the ideas that have appeared by now.

Bye,

Uli