

  • To: "XML Developers List" <xml-dev@l...>
  • Subject: [Watchers of the Web] [Research Initiative] Measure the Evolving Form of Information on the Web
  • From: "Costello, Roger L." <costello@m...>
  • Date: Fri, 12 May 2006 09:08:48 -0400
  • Thread-index: AcZ1xTS0H+zAPaGNQ3OWm46+VxFkHQ==
  • Thread-topic: [Watchers of the Web] [Research Initiative] Measure the Evolving Form of Information on the Web


Hi Folks,

I would like us (the xml-dev community) to collaborate on a worldwide research initiative.  Below is a description of the research.  I have two requests:

1. I have tried to be both clear and complete in the research description.  But if you find any part of the description unclear or incomplete, please let me know (and, ideally, suggest how to improve it).

2. I need your participation (see below). 

/Roger

Research Question

What is the relative usage of the various content (MIME) types on the Web, and how is that usage evolving over time?

Background

There are over 350 different content (MIME) types.  Some common content types include HTML, XML, GIF, JPEG, MP3, MPEG, RSS, and SVG.  Information exchanged on the Web takes the form of one of these content types.

What is the state of the Web today with respect to the use of the different content types?  For example, 15 years ago HTML was clearly the dominant content type.  Is that still true today?  Has a shift occurred?  Have other content types nudged HTML out of the top ranking?  The purpose of this research is to answer these questions.

Methodology

Collect data from the web caches and logs of one or more large retail Internet Service Providers (ISPs) on each of the following continents:

Europe
North America
South America
Asia
Australia
Africa

The data to be collected is a numerical count of accesses to resources, per content type.  That is, look at the Internet Service Provider's log file and count the number of requests users made for each content type: so many requests for HTML documents, so many for RSS documents, so many for XML documents, and so forth.
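To make the counting step concrete, here is a minimal sketch in Python.  It assumes a Squid-style proxy access log in which the response's MIME type is the last whitespace-separated field (as in Squid's native access.log format); the file name and the field position are assumptions that would need adjusting for any particular ISP's log format.

    from collections import Counter

    def tally_content_types(log_path):
        """Count requests per content (MIME) type in a proxy access log."""
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                fields = line.split()
                if not fields:
                    continue
                # Assumption: Squid's native format puts the response
                # MIME type in the last field; adjust for other formats.
                counts[fields[-1]] += 1
        return counts

    for mime_type, count in tally_content_types("access.log").most_common():
        print(count, mime_type)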

Here is an example to demonstrate the methodology:

Today (May 8, 2006) cnn.com is running a news story about using quantum science to determine the best way to score a goal in soccer. CNN allows you to consume the news story in any of these forms:

- HTML
- audio (MP3)
- video (MPEG)
- RSS

Let's suppose that CNN uses an ISP, and the ISP log file contains all the requests for that news story.  At the end of the day we open the log file and tally up all the requests for the news story.  And here are the numbers:

- 50 clients consumed the news story in HTML form
- 20 clients consumed the news story in audio (MP3) form
- 10 clients consumed the news story in video (MPEG) form
- 20 clients consumed the news story in RSS form

If these numbers represented a statistically representative sample of the Web, then we could state:

"On May 8, 2006 the information on the Web took this form:"

Content Type    Percentage
---------------------------
HTML            50%
MP3             20%
MPEG            10%
RSS             20%

Obviously, examining just the log file for one story on CNN is not a representative sample.  We need to collect data from the log file of a large ISP for all requests that occurred, and we need to take the measurement in different geographies.
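As a sanity check on the arithmetic, here is a small Python sketch that turns raw tallies into the kind of percentage table shown above.  The counts are simply the example numbers from the CNN story, not real measurements.

    example_counts = {"HTML": 50, "MP3": 20, "MPEG": 10, "RSS": 20}

    total = sum(example_counts.values())
    print("Content Type    Percentage")
    print("---------------------------")
    for content_type, count in example_counts.items():
        # Each type's share of all requests, as a whole percentage.
        print(f"{content_type:<15} {100 * count / total:.0f}%")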

Note: the data to be counted is the "main content type", not "dependent content types".  Let me explain what I mean.  Suppose that the HTML form of the above news story contains two embedded GIF images.  The HTML document is the "main content type"; the two GIF images are the "dependent content types".  Only the HTML document is counted, i.e., the HTML count is incremented by one.
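One crude way to implement this distinction is to treat certain MIME types (images, stylesheets, scripts) as inherently "dependent" and exclude them from the tally, as in the Python sketch below.  The exclusion list is an assumption that would need tuning against real logs, and the heuristic undercounts resources such as images that users request directly rather than as part of a page.

    from collections import Counter

    # Assumption: these MIME types almost always arrive as resources
    # embedded in a page rather than as content requested directly.
    DEPENDENT_PREFIXES = ("image/",)
    DEPENDENT_TYPES = {"text/css", "application/javascript", "text/javascript"}

    def is_dependent(mime_type):
        return (mime_type.startswith(DEPENDENT_PREFIXES)
                or mime_type in DEPENDENT_TYPES)

    def tally_main_content_types(mime_types):
        """Tally only the 'main' content types, skipping dependent ones."""
        return Counter(t for t in mime_types if not is_dependent(t))

    # Example: an HTML page plus its two embedded GIFs counts as one HTML hit.
    print(tally_main_content_types(["text/html", "image/gif", "image/gif"]))
    # -> Counter({'text/html': 1})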

Period of Data Collection

24 hours (day and time to be determined)

Request for Participation

Do you have access to the log file of a large ISP?  Would you be willing to sift through their data?  If so, please contact me.

Acknowledgement

I wish to gratefully acknowledge the valuable contributions the following people have made to the formulation of this research initiative:

Len Bullard
Joe Chiusano
Jay Crossler
Ian Graham
Chris Gray
Greg Hunt
Bob Irving
Michael Kay
Tim Kehoe
Frank Manola
Rick Marshall
Marc Nobile
Joe Nyangon
Dave Pawson
Martin Probst
Liam Quin
Bryan Rasmussen
Sterling Stouden
Nathan Vuong

