Re: [Watchers of the Web] The evolving form of information on
On Sun, May 07, 2006 at 10:12:32AM -0400, Costello, Roger L. wrote:
[...]
> I would like to know:
>
> Of all the information being exchanged on the Web:
>
> what percentage of the information is in the form of the HTML content
> type, what percentage of the information is in the form of the XML
> content type,
[...]

It depends on what you mean by "information" here. If you are
referring to Shannon, for example, the second pass of an interlaced
GIF carries less information than the first, in most cases, and the
same is true for JPEG images.

If you're concerned about total volume of data and bandwidth, my
guess right now would be closer to:

music and video files: 70%
    (much more if you count non-Web transfer methods)

Other binary files: 20%
    (e.g. pirated copies of PhotoShop, stolen fonts, as well as
    legitimate installation files, Windows update etc., accessed
    via a URI-based mechanism)

Of the remaining 10%, I'd guess 95% by size is image content.

The entire King James Bible weighs in at around 5 Megabytes as plain
text. A Megabyte isn't all that huge for an image from a digital
camera these days, _vide_ flickr.

I'd also guess that RSS (XML-based) is significant in traffic.

> In addition, I am interested in seeing how the percentage is changing
> over time - I am interested in seeing the evolving form of information
> on the Web.
>

You might find that some of the search engine people have some sort
of metric based on number of documents, or numbers of URIs and
corresponding MIME types. Ian Hickson has done some investigation
of this sort I think.

Anyone running a large HTTP proxy, e.g. for a school, college,
corporation or ISP, will have figures on bandwidth.

Actually analysing images and text for information content is a much
harder thing. Do the random art criticism texts generated by a
program I wrote [1] contain information? Or the random sonnets from
Rich Salz's program [2]? What about lists of things? Or, worse,
lists of randomly-generated things such as fake fantasy names [3]?

Sometimes it's better to tackle the easier problem and get data
that is useful than to tackle the more interesting but probably
intractable one.

The debate about whether illustrations carry information can be
illustrated at [4]. :-)

Liam

[1] random art criticism with random artwork:
    http://www.holoweb.net/~liam/sol/

[2] randomly generated "poetry"
    http://www.holoweb.net/~liam/sonnet/

[3] randomly generated fantasy gaming names
    http://www.valinor.sorcery.net/names/names.cgi?which=default

[4] http://www.fromoldbooks.org/Blades-Pentateuch/pages/discourse-into-the-night/

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/ * http://www.fromoldbooks.org/
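
For anyone who wants the guessed bandwidth split above spelled out, here is a minimal sketch of the arithmetic in Python. The percentages are the guesses from the message, not measurements; the function name `traffic_shares` and the dictionary keys are made up for illustration.

```python
from fractions import Fraction


def traffic_shares():
    """Return the guessed shares of Web traffic by volume, in percent.

    All inputs are rough guesses quoted from the message, not measured data.
    Fractions are used so the subtraction stays exact.
    """
    media = Fraction(70)          # music and video files
    other_binary = Fraction(20)   # installers, fonts, updates, etc.
    remainder = 100 - media - other_binary      # the "remaining 10%"
    images = remainder * Fraction(95, 100)      # 95% of the remainder, by size
    everything_else = remainder - images        # HTML, XML, RSS, plain text, ...
    return {
        "music and video": media,
        "other binary": other_binary,
        "images": images,
        "everything else": everything_else,
    }


if __name__ == "__main__":
    for name, share in traffic_shares().items():
        print(f"{name}: {float(share):.1f}%")
```

On these guesses, images come to 9.5% of the total and everything else, HTML and XML together included, to 0.5%.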