Date: Mon, 6 Jan 2003 23:22:00 -0500 (EST)
From: "Keith F. Lynch" <kfl at KeithLynch.net>
To: WSFAList at KeithLynch.net
Subject: [WSFA] Re: baseballs, priests, and newsgroups
Reply-To: WSFA members <WSFAlist at keithlynch.net>

ronkean at juno.com wrote:
> Can anyone calculate terminal velocity?  I suppose the question is:
> how fast a wind would it take to exert 1.42 newtons of force on a 43
> cm^2 cross section?

Not enough information.  The surface makes a big difference.
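
For reference, the usual starting point is the drag equation,
F = 1/2 * rho * v^2 * Cd * A, where the drag coefficient Cd is the part
that depends on the surface.  Here is a rough sketch, assuming sea-level
air density of about 1.2 kg/m^3.  The Cd values are illustrative
guesses, not measurements, and the function name is made up for this
example:

```python
import math

AIR_DENSITY = 1.2  # kg/m^3, assumed sea-level value (not from the original post)


def wind_speed_for_force(force_n, area_m2, drag_coeff, rho=AIR_DENSITY):
    """Solve F = 0.5 * rho * v**2 * Cd * A for v, in m/s."""
    return math.sqrt(2.0 * force_n / (rho * drag_coeff * area_m2))


if __name__ == "__main__":
    force = 1.42   # newtons, from the question
    area = 43e-4   # 43 cm^2 expressed in m^2
    # Hypothetical drag coefficients, chosen only to show the spread.
    for cd in (0.3, 0.4, 0.5):
        print(cd, round(wind_speed_for_force(force, area, cd), 1), "m/s")
```

The result shifts noticeably as Cd changes, which is why the force and
cross section alone do not settle the question.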

> Apparently, molestation within the Catholic Church has been going on
> for many years, but it has only been in the past year or so that the
> resulting scandal has come to the forefront of public attention.

Yes, it was the public attention I was interested in.  I had thought
it had had that attention for years, certainly long before Joe's
death, but apparently there was a big increase in that attention about
a year ago.

> A big drawback, though, is that the results will be highly dependent
> on how the search is formulated, and properly formulating the search
> terms might require much trial and error and reading of sampled
> messages to verify the relevancy of chosen search words.

Very true.  For instance there was the journalist who reported that
there were many thousands of satanic cults on the Internet.  He had
done a search for all web pages containing either "satanic" or "cult,"
and assumed that all such pages were home pages for distinct groups
of Satan worshippers.

One thing I didn't mention about the 107 messages containing both
"Disclave" and "flood" is that the first such message, with a quite
plausible subject of "Re: ASB in Fandom (was Re: Report: Disclave)"
(ASB refers to the group which caused the flood) was dated 1992, five
years before the flood!

> Another drawback is that while that research tool may show to what
> extent people are discussing a given topic (and especially how that
> has changed over time), it does not show what they are saying about
> the topic.  You would have to sample the messages to characterize
> what is being said.

Right.  Fortunately, anyone can do so.  Want to see what those 107
messages said?  Go look!

> Perhaps Keith can readily answer these questions:  About how many
> distinct newsgroups are there which are reasonably active (at
> least a hundred messages per month), how far back do the newsgroup
> archives go, and has the use of newsgroups generally, as measured
> by the number of messages posted, been increasing, stagnating, or
> falling off, in recent years?

Those are very difficult questions.  There are tens of thousands of
newsgroups, but it's not clear how many are "reasonably active" by
your definition.  I could have answered this a few years ago, but ISPs
all now have their newsfeed on one machine and their shell on another,
meaning I can't use standard Unix tools to get an answer.  I would
have to write and install a variant newsreader program.

Also, many newsgroup messages are spams.  Many others are spam
cancels.  Users normally see neither, as they cancel each other out.
The newsgroup spam problem, unlike the email spam problem, has been
largely solved thanks to these cancels.  (It's a problem for ISPs,
who have to bear all the excess traffic, but not for Google, which
does not archive them, nor for users, who usually never see them.)

(Aside:  New York fan Seth Breidbart is the person who formally defined
newsgroup spam for purposes of determining what should be canceled.
Look up "Breidbart Index".)
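
As it is usually stated, the Breidbart Index of a set of substantively
identical postings is the sum, over each copy, of the square root of the
number of newsgroups that copy was posted to.  A minimal sketch follows.
The function names are made up for this example, and the threshold of 20
is the commonly quoted figure, taken here as an assumption:

```python
import math

# Commonly quoted cancellation threshold; treated as an assumption here.
ASSUMED_THRESHOLD = 20


def breidbart_index(groups_per_copy):
    """Sum of sqrt(n) over copies, where n is how many newsgroups
    each copy of the message was posted to."""
    return sum(math.sqrt(n) for n in groups_per_copy)


def looks_cancelable(groups_per_copy, threshold=ASSUMED_THRESHOLD):
    """True if the index meets or exceeds the assumed threshold."""
    return breidbart_index(groups_per_copy) >= threshold


if __name__ == "__main__":
    # One message crossposted to 9 groups, versus 4 separate copies
    # each crossposted to 4 groups.
    print(breidbart_index([9]))
    print(breidbart_index([4, 4, 4, 4]))
```

The square root means one copy crossposted to many groups counts for
less than the same number of groups hit with separate copies.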

Many newsgroup messages are crossposted, i.e. exist in two or more
newsgroups at once.  Only one copy of the message is stored by
each ISP or by Google, and only one copy of the message is seen
by any user, even if he reads all messages in all the groups it's
crossposted to.

Google says they archive (and fully index) more than 700 million
newsgroup postings.

DejaNews started archiving in early 1995.  In early 2001, DejaNews (by
then Deja.com) went out of business, and sold their archive to Google.
In late 2001 Google extended the archive back to mid-1981 thanks to
Toronto fan Henry Spencer, who had archived much of the early material
on tape, and also thanks to Usenet feeds which were available via
CD-ROM for a while in the late 80s and early 90s.

The message numbers on Panix, one of the first ISPs, are probably a
good estimate of how many postings each newsgroup has had since it
began.  Rec.arts.sf.fandom has had over 600,000 (of which Google has
archived about 375,000), and rec.arts.sf.written (of which Google has
archived 567,000) should hit a million in a few months.  For comparison,
the old SF-LOVERS email list had about 500,000 from 1979 through the
end of 2000, and this list has had about 3700 in just under 11 months.

> It might be interesting to compare the annual message volume of all
> newsgroups with the corresponding data for some of the large email
> list services, such as yahoo and topica.

The S.M. Stirling Yahoo list has had 68,000 postings, of which about
3000 were from Stirling himself.  I haven't checked the other Yahoo
groups.  The Lois Bujold list (not a Yahoo or Topica list) doesn't
have message numbers, but seems to currently average about 3000
messages per month.

In addition to the more than 700 million newsgroup postings, Google
also indexes over 3 billion web pages, something it's much better
known for.

And they do it *quickly*, too.  They've already indexed the November
1992 WSFA Journal, which I just put up yesterday, and didn't tell
anyone about.  (Googling for "drunken badgers" will find it.)  (They
don't, however, index the archives of this list, since I carefully
make sure there are no links they can follow.)

If I had more time, it might be interesting to compare the numbers of
web page hits for various stfnal terms to the numbers of newsgroup
posting hits.  Would the cross-plot fall on a straight line on log-log
paper?  What would be the significance of terms which diverged strongly
from that line, i.e. which were mentioned a lot in newsgroups but
hardly at all on the web, or vice versa?

Sometimes I wish I had a copy of all these archives, immortality,
and a time machine.  I'd go back to one of the warmer and less
carnivore-infested periods of earth's pre-history, and spend a few
eons just reading it all.

On second thought, immortality may be all I need.  Even if I were to
only read one message per year, and even if the volume continues to
increase exponentially forever, I will eventually catch up.  Every
message, after all, can be given a unique message number N, and I'll
read message N in year N.  So by the year infinity, I'll be fully
caught up, despite having fallen further and further behind up until
then.  That's math for you.
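
The argument can be checked with a toy model.  Assume, purely for
illustration, that the cumulative number of messages posted through
year t is 2**t.  The function names are made up for this example.  The
backlog grows without bound, yet every message N is read at the finite
time N, and has always been posted by then:

```python
def year_posted(n):
    """First year t with 2**t >= n, under the assumed doubling model."""
    return (n - 1).bit_length()


def year_read(n):
    """Reading one message per year in order: message n in year n."""
    return n


def backlog(year):
    """Messages posted but not yet read at the end of the given year."""
    return 2 ** year - year


if __name__ == "__main__":
    for year in (1, 10, 20, 30):
        print(year, backlog(year))
    # Every message is available before its scheduled reading year.
    print(all(year_posted(n) <= year_read(n) for n in range(1, 10000)))
```

No message is left out of the schedule, even though at no finite year is
the reader anywhere close to caught up.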
--
Keith F. Lynch - kfl at keithlynch.net - http://keithlynch.net/
I always welcome replies to my e-mail, postings, and web pages, but
unsolicited bulk e-mail (spam) is not acceptable.  Please do not send me
HTML, "rich text," or attachments, as all such email is discarded unread.