news and thoughts on and around the development
of the iCite net
by Jay Fienberg
posted: Jan 26, 2004 3:16:41 PM
I always think it is funny that we refer to RSS / Atom files as "feeds", but there is something to it, which I hope to post more about later. In any case, I have recently been experimenting with treating web pages as "feeds", in the sense that they can and often do contain all of the same information found in RSS / Atom files.
Les Orchard links to two resources I find useful in my current experiments: DOM-based Content Extraction of HTML Documents and Jon Udell's The forest and the trees. The main issues I am dealing with are how to get a clean parse of an HTML web page (in particular, one that can be queried with XQuery), and how to track which parts of the page are useful to extract.
I recently came across the CyberNeko HTML Parser, which looks like it might help with the clean parsing. (Of interest: James Strachan recently posted about his use of NekoHTML in converting HTML documents to wiki markup.)
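To make the "web page as feed" idea a bit more concrete, here is a minimal sketch in Python. It does not use the CyberNeko parser (which is a Java library); it swaps in the standard library's `html.parser`, which is similarly tolerant of sloppy markup. The class name `LinkItemParser`, the function `extract_items`, the sample page `MESSY`, and the choice to treat every link as a feed-like item are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkItemParser(HTMLParser):
    """Collect a {"link", "title"} item for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.items.append({"link": self._href, "title": title})
            self._href = None


def extract_items(html):
    """Parse possibly messy HTML and return its links as feed-like items."""
    parser = LinkItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment: unclosed <p> and <li>, one unquoted attribute.
MESSY = """<html><body>
<p>Recent posts
<ul>
<li><a href="/2004/01/feeds">Web pages as feeds</a>
<li><a href=/2004/01/wap>My Yahoo RSS for WAP phones?</a>
</ul>
</body>"""

items = extract_items(MESSY)
for item in items:
    print(item["link"], "|", item["title"])
```

This only shows the tolerant-parsing half of the problem. Deciding which links (or other parts of the page) are actually worth extracting is the harder question.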
The other thing I am thinking about is how one indicates on (or in relation to) a web page which parts of the page have which meaning. Blog information architecture, and RSS in general, treat one part of a web page (or one feature of a website) as the thing that has meaning worth parsing. If you want more than one, you need multiple feeds.
But, web pages more generally have multiple areas that change, and it might be interesting to feed on more than one of these. (So, this is where the feed concept is sort-of confusing: it is perhaps easier to just think about the general task of monitoring and replicating changes within a larger information structure.)
I like to think about web pages as being mini databases. Each page is actually a complete database with a bunch of tables, albeit with only one or a few rows in each table. (And, in this vein, I think of a website as essentially a distributed database.)
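As a rough illustration of the "page as mini database" view, the sketch below loads a page's regions into an in-memory SQLite database, one table per region. The region names (`post`, `trackbacks`, `blogroll`), the two-column table layout, and the helper `page_to_db` are all hypothetical; the point is only that each table ends up with one or a few rows.

```python
import sqlite3


def page_to_db(regions):
    """Load {table_name: [row_text, ...]} into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    for table, rows in regions.items():
        # Table names cannot be bound as SQL parameters, so only accept
        # plain identifiers before interpolating them into the statement.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        db.execute(
            f'CREATE TABLE "{table}" (position INTEGER PRIMARY KEY, content TEXT)'
        )
        db.executemany(
            f'INSERT INTO "{table}" (position, content) VALUES (?, ?)',
            list(enumerate(rows, start=1)),
        )
    return db


# Hypothetical regions of a single blog page, each with only a few rows.
PAGE = {
    "post": ["Web pages as feeds"],
    "trackbacks": ["My Yahoo RSS as feed reader for WAP phones?"],
    "blogroll": ["Sam Ruby", "Tim Bray"],
}

db = page_to_db(PAGE)
tables = [
    name
    for (name,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
blogroll = db.execute('SELECT content FROM "blogroll" ORDER BY position').fetchall()
```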
With this view, I think the metadata concept is also somewhat confusing: we are really talking about plain old data, though it is perhaps not visually displayed. The metadata in this context is really in the domain of things like the HTML DTD or the HTTP URI specification.
So, altogether, for each web page, we have a database and, expressed in database terms, we might want to monitor changes within specific tables, create new views of specific tables, cache or snapshot replicas of specific tables, etc.
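Monitoring changes within specific "tables" could be sketched roughly like this: take a snapshot (here, a hash) of the text inside each region of interest, then compare snapshots between visits. This assumes that regions are marked by `id` attributes and that the markup inside them is well nested, which real pages often are not. The names `RegionParser`, `snapshot`, and `changed_regions`, and the sample pages, are invented for the example.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class RegionParser(HTMLParser):
    """Collect the text inside each element whose id is in `wanted`."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.regions = {}      # region id -> list of text fragments
        self._current = None   # id of the region we are inside, if any
        self._depth = 0        # open-element depth within that region

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._current is not None:
            self._depth += 1
            return
        region_id = dict(attrs).get("id")
        if region_id in self.wanted:
            self._current = region_id
            self._depth = 1
            self.regions[region_id] = []

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.regions[self._current].append(data)


def snapshot(html, wanted):
    """Return {region id: SHA-256 of its whitespace-normalized text}."""
    parser = RegionParser(wanted)
    parser.feed(html)
    parser.close()
    result = {}
    for region_id, parts in parser.regions.items():
        text = " ".join("".join(parts).split())
        result[region_id] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return result


def changed_regions(old, new):
    """List the region ids whose snapshot differs between two visits."""
    return sorted(r for r in old.keys() | new.keys() if old.get(r) != new.get(r))


# Two hypothetical visits to the same page: only the posts region changes.
PAGE_V1 = ('<div id="posts"><p>First post</p></div>'
           '<div id="blogroll"><ul><li>Sam Ruby</li></ul></div>')
PAGE_V2 = ('<div id="posts"><p>First post</p><p>Second post</p></div>'
           '<div id="blogroll"><ul><li>Sam   Ruby</li></ul></div>')

before = snapshot(PAGE_V1, ["posts", "blogroll"])
after = snapshot(PAGE_V2, ["posts", "blogroll"])
print(changed_regions(before, after))
```

Hashing the normalized text rather than the raw markup means purely cosmetic whitespace changes are ignored. Whether that is the right notion of "change" depends on what one wants to replicate.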
With the HTML web page, the boundaries of these database structures can often be discerned, but how can they be made explicit? And, likewise for the whole website and its relationships between pages?
These questions are interesting to me :-)
trackback from: the iCite net development blog
posted: Jan 26, 2004 4:44:06 PM
title: My Yahoo RSS as feed reader for WAP phones?
My Yahoo to RSS uses code described in Mikel's paper, Screen Scraping with Finite State Machines [pdf], which coincidentally looks relevant to what I was talking about