
Re: [Watchers of the Web] The evolving form of information on


On Sun, May 07, 2006 at 10:12:32AM -0400, Costello, Roger L. wrote:
[...]
> I would like to know:
> 
> Of all the information being exchanged on the Web:
>  
> what percentage of the information is in the form of the HTML content
> type, what percentage of the information is in the form of the XML
> content type, [...]

It depends on what you mean by "information" here.  If you mean
information in Shannon's sense, for example, the second pass of an
interlaced GIF carries less information than the first, in most cases,
and the same is true for JPEG images.
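
To make the Shannon reading a little more concrete, here is a minimal
sketch that estimates the zero-order entropy of a byte string, in bits
per byte. The function name is my own invention, and a byte-frequency
count is only a crude proxy: it ignores all structure between bytes, so
it says nothing reliable about image passes specifically.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Empirical zero-order Shannon entropy of data, in bits per byte.

    Hypothetical helper for illustration: it only looks at how often
    each byte value occurs, not at any ordering or repetition.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = sum over byte values of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Every byte value exactly once: the maximum for this measure.
uniform = shannon_entropy_bits_per_byte(bytes(range(256)))

# One byte value repeated: no surprise at all under this measure.
repetitive = shannon_entropy_bits_per_byte(b"a" * 256)
```

Under this measure a file of one repeated byte scores zero however
large it is, which is the sense in which volume and information differ.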

If you're concerned about total volume of data and bandwidth,
my guess right now would be closer to:

music and video files: 70%
  (much more if you count non-Web transfer methods)
Other binary files: 20%
  (e.g. pirated copies of Photoshop, stolen fonts, as well as
  legitimate installation files, Windows update etc.,
  accessed via a URI-based mechanism)

Of the remaining 10%, I'd guess 95% by size is image content.
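
Spelling out the arithmetic behind those guesses, as a small sketch
(the variable names are mine, and the inputs are only the guessed
figures above, not measurements):

```python
from fractions import Fraction

# Guessed shares of total volume, taken from the figures above.
media = Fraction(70, 100)         # music and video files
other_binary = Fraction(20, 100)  # other binary files

# What is left once those two categories are set aside.
remaining = 1 - media - other_binary

# 95% of the remainder is guessed to be image content.
images = remaining * Fraction(95, 100)

# Everything else, including HTML and XML of all kinds.
everything_else = remaining - images

print(f"images: {float(images * 100)}% of total")
print(f"everything else: {float(everything_else * 100)}% of total")
```

On these guesses, images come to 9.5% of the total, and HTML, XML and
all other textual formats together share the remaining 0.5%.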

The entire King James Bible weighs in at around 5 Megabytes as
plain text.  A Megabyte isn't all that huge for an image from
a digital camera these days, _vide_ flickr.

I'd also guess that RSS (XML-based) is significant in traffic.

> In addition, I am interested in seeing how the percentage is changing
> over time - I am interested in seeing the evolving form of information
> on the Web.
> 

You might find that some of the search engine people have some sort
of metric based on number of documents, or numbers of URIs and
corresponding MIME types.  Ian Hickson has done some investigation
of this sort, I think.

Anyone running a large HTTP proxy, e.g. for a school, college,
corporation or ISP, will have figures on bandwidth.
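
As a rough sketch of how such proxy figures might be turned into
per-type percentages: the code below assumes a made-up log format of
one tab-separated record per line (URI, Content-Type header value,
bytes transferred). Real proxy logs differ, so the format, the function
names and the sample records are all assumptions for illustration.

```python
from collections import defaultdict


def bytes_by_mime_type(log_lines):
    """Sum bytes transferred per MIME type.

    Assumes a hypothetical log format: URI <TAB> content type <TAB> bytes.
    """
    totals = defaultdict(int)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        _uri, mime, size = line.split("\t")
        # Drop parameters such as "; charset=utf-8" and normalise case.
        mime = mime.split(";")[0].strip().lower()
        totals[mime] += int(size)
    return dict(totals)


def percentages(totals):
    """Convert byte totals into percentages of the grand total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {mime: 100.0 * n / grand_total for mime, n in totals.items()}


# Invented sample records, not real traffic data.
sample = [
    "http://example.org/a.html\ttext/html; charset=utf-8\t2000",
    "http://example.org/feed.xml\tapplication/rss+xml\t1000",
    "http://example.org/photo.jpg\timage/jpeg\t7000",
]

totals = bytes_by_mime_type(sample)
shares = percentages(totals)
```

Running the same tally over logs from different months would give the
change over time, by volume at least.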

Actually analysing images and text for information content is a much
harder thing. Do the random art criticism texts generated by a program
I wrote [1] contain information? Or the random sonnets from Rich
Salz's program [2]? What about lists of things? Or, worse, lists of
randomly-generated things such as fake fantasy names [3]?

Sometimes it's better to tackle the easier problem and get data that
is useful than to tackle the more interesting but probably intractable
one.  The debate about whether illustrations carry information can be
illustrated at [4]. :-)

Liam


[1] random art criticism with random artwork:
    http://www.holoweb.net/~liam/sol/

[2] randomly generated "poetry"
    http://www.holoweb.net/~liam/sonnet/

[3] randomly generated fantasy gaming names
    http://www.valinor.sorcery.net/names/names.cgi?which=default

[4] http://www.fromoldbooks.org/Blades-Pentateuch/pages/discourse-into-the-night/

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/  * http://www.fromoldbooks.org/
