
news and thoughts on and around the development of the iCite net
by Jay Fienberg

Web pages as "feeds"

posted: Jan 26, 2004 3:16:41 PM

I always think it is funny that we refer to RSS / Atom files as "feeds", but there is something to the term that I hope to post more about later. In any case, I have recently been experimenting with treating web pages as "feeds", in the sense that they can, and often do, contain all of the same information found in RSS / Atom files.

Les Orchard links to two resources I find useful in my current experiments: DOM-based Content Extraction of HTML Documents and Jon Udell's The forest and the trees. The main issues I am dealing with are how to get a clean parse of an HTML web page (in particular, using XQuery), and how to track which parts of the page are useful to extract.

I recently came across the CyberNeko HTML Parser, which looks like it might help with the clean parsing. (Of interest: James Strachan recently posted about his use of NekoHTML in converting HTML documents to wiki markup.)
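To make the "clean parse" problem concrete, here is a minimal sketch in Python. It is not NekoHTML (a Java library) and it does not use XQuery; it only uses the standard library's html.parser. The class name RegionExtractor, the function extract_regions, and the idea of keying regions by their id attribute are all assumptions made for illustration. The one repair it attempts is closing any still-open inner tags when an outer end tag arrives, which is enough to cope with unclosed p and li elements.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are kept off the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class RegionExtractor(HTMLParser):
    """Collect the text found inside every element that carries an id."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.regions = {}   # element id -> list of text fragments
        self._stack = []    # open elements as (tag, element id or None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        region_id = dict(attrs).get("id") or None
        if region_id is not None:
            self.regions.setdefault(region_id, [])
        self._stack.append((tag, region_id))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag and everything opened after it.
        # A stray end tag with no matching open tag is ignored.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                return

    def handle_data(self, data):
        if self._stack and self._stack[-1][0] in ("script", "style"):
            return  # skip code and stylesheet text
        text = " ".join(data.split())
        if not text:
            return
        # Text belongs to every enclosing element that has an id.
        for _, region_id in self._stack:
            if region_id is not None:
                self.regions[region_id].append(text)


def extract_regions(html):
    """Return {element id: text} for one HTML document given as a string."""
    parser = RegionExtractor()
    parser.feed(html)
    parser.close()
    return {rid: " ".join(parts) for rid, parts in parser.regions.items()}


# Hypothetical tag soup: none of the p or li elements are closed.
SAMPLE = ('<div id="posts"><p>First post<p>Second post</div>'
          '<div id="links"><ul><li>One<li>Two</ul></div><p>outside')

if __name__ == "__main__":
    print(extract_regions(SAMPLE))
```

Keying on id is only one possible choice; it sidesteps, rather than answers, the question of how to decide which parts of a page are worth extracting.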

The other thing that I am thinking about is how one indicates on (or in relation to) a web page which parts of the page have what meaning. Blog information architecture, and RSS in general, treat one part of a web page (or one feature of a website) as the thing that has meaning worth parsing. If you want more than one, you need multiple feeds.

But web pages more generally have multiple areas that change, and it might be interesting to feed on more than one of these. (So, this is where the feed concept is sort of confusing: it is perhaps easier to just think about the general task of monitoring and replicating changes within a larger information structure.)

I like to think about web pages as being mini databases. Each page is actually a complete database with a bunch of tables, albeit with only one or a few rows in each table. (And, in this vein, I think of a website as essentially a distributed database.)

With this view, I think the metadata concept is also somewhat confusing: we are really talking about plain old data, though it is perhaps not visually displayed. The metadata in this context is really in the domain of things like the HTML DTD or the HTTP URI specification.

So, altogether, for each web page, we have a database and, expressed in database terms, we might want to monitor changes within specific tables, create new views of specific tables, cache or snapshot replicas of specific tables, etc.
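As a rough sketch of what "monitor changes within specific tables" and "snapshot replicas" could look like, the Python below treats a page as a dictionary of named tables, each a list of rows, and compares fingerprints of two snapshots. The table names, the row fields, and the functions fingerprint, snapshot, and diff_snapshots are all hypothetical, and how the tables get extracted from the HTML in the first place is left out entirely.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 hash for one table's rows."""
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def snapshot(page):
    """Map each table name on the page to the fingerprint of its rows."""
    return {name: fingerprint(rows) for name, rows in page.items()}


def diff_snapshots(old, new):
    """Report which tables were added, removed, or changed between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(name for name in old.keys() & new.keys()
                          if old[name] != new[name]),
    }


# Hypothetical page contents on two different days.
PAGE_BEFORE = {
    "posts": [{"title": "Ancient cluetrains of the future"}],
    "blogroll": [{"name": "Sam Ruby"}, {"name": "Tim Bray"}],
}
PAGE_AFTER = {
    "posts": [{"title": "Web pages as feeds"},
              {"title": "Ancient cluetrains of the future"}],
    "blogroll": [{"name": "Sam Ruby"}, {"name": "Tim Bray"}],
    "trackbacks": [{"title": "My Yahoo RSS as feed reader for WAP phones?"}],
}

if __name__ == "__main__":
    print(diff_snapshots(snapshot(PAGE_BEFORE), snapshot(PAGE_AFTER)))
```

Fingerprinting whole tables only says that something changed, not what; a row-level comparison would be needed to replicate the change itself.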

With the HTML web page, the boundaries of these database structures can often be discerned, but how can they be made explicit? And, likewise for the whole website and its relationships between pages?

These questions are interesting to me :-)


Comments and Trackbacks

trackback from: the iCite net development blog
posted: Jan 26, 2004 4:44:06 PM
title: My Yahoo RSS as feed reader for WAP phones?

My Yahoo to RSS uses code described in Mikel's paper, Screen Scraping with Finite State Machines [pdf], which coincidentally looks relevant to what I was talking about.


