trenchant.org

by adam mathes

On Atom

[Note: I wrote this piece a while ago, but shelved it because I thought it had no real audience. The audience that might understand what I’m talking about doesn’t read this site, and wouldn’t agree with me anyway. And most people do not (and should not) care about this. But I’m publishing it anyway, because it’s been sitting on my desktop staring at me and I figured I’d post it before trashing it. The draft specification I refer to now specifically states “DO NOT implement it or ship products conforming to it. This work has migrated to the ATOMPUB Working Group in the IETF.” Of course, that wasn’t there when I wrote this piece, and it certainly hasn’t stopped people from publishing Atom feeds.]

The Atom Syndication Format is listed as “0.3 (PRE-DRAFT)” but is already being implemented and recommended, notably by Google-owned Blogger and Google’s new Groups.

I am not, in this piece, attempting to write about the “syndication wars” between RSS and Atom. This is a critique that consists of my thoughts on the current Atom syndication format pre-draft. It is the result of some experiences designing and implementing content management systems, as well as work on personal web sites, weblogs and webzines over the years. It is also heavily influenced by my recent studies at GSLIS in a Document Modeling course.

While I will go into some specific details of problems I have with the current “pre-draft” specification that is being widely implemented, I want to note that being “good enough to criticize” is actually high praise.

The most important issue I have with Atom is that it has large, broad design goals that are essentially contradictory. As the roadmap states, Atom is not simply a more fully documented, rigorous syndication format. (In comparison to the widely deployed RSS 2.0, almost anything would be more rigorous.) The Atom roadmap is:

  • Decide on the conceptual model of a log entry. Primer, ConceptualModel (Active)
  • Decide on a syntax for this model. Syntax, SyntaxConsiderations (Preliminary)
  • Build a syndication format using this syntax.
  • Build an archiving format using this syntax.
  • Build a weblog editing protocol using this syntax (the Atom API).

While I believe in the importance of conceptual modeling (although I think “log entry” is a poor word choice, reflecting the bias of the developers toward weblogs rather than the larger world of electronic publishing), the unfortunate reality is that, as far as I can tell, the project has done very little actual modeling of web content. Instead, at least based on their first deliverable of the syndication format, they have focused on creating a standard XML dialect for metadata of web content. As Ian Davis pointed out in The Nucleus of Atom:

Nearly every element in Atom is already contained in one of the DC specs. I took the time to compare the Atom elements with their DC equivalents and found something quite interesting: when you remove the overlap with Dublin Core what's left is pure syndication.

The only elements that don't have obvious counterparts in DC are those that deal with the syndication aspect of Atom, and to be honest there aren't many of them: atom:feed, atom:info, atom:entry, atom:link, atom:content.

This is not to say that this work isn’t valuable, but that there’s very little that’s new here. And, when you reinvent the wheel, you’re likely to end up with something round that looks a lot like other wheels.

I am not, however, advocating that Atom be pared down to its core syndication elements and Dublin Core be used. I wouldn’t be unhappy with that, but the basic logical equivalence between the two makes it basically a non-issue for me.

My point is that there has been lots of work on generalized metadata, and this just seems to duplicate it, rather than focusing on standards for actual content. For example, one way to do this would be to define a set of classes for XHTML elements to facilitate a standard way of organizing content in weblog or journal entries that would be machine readable and facilitate interchange.

While I think working on a standard for better content exchange might be interesting, let us focus on what is being worked on. Specifically, the draft specification of a syndication format.

The problem with having a project with many broad, possibly conflicting goals is that design decisions become difficult. I think there is a compelling case to be made that a syndication format does not have to match the conceptual model of a post. More specifically, the needs for information representation about a post for archiving, exporting, importing, or creation through an API are fundamentally different than those for syndication. Based on my preliminary experiences and analysis of the Atom syndication format, I think it is overly strict and reflects the divergent interests of an archival format, rather than one designed for syndication.

The design goals for an archival format - and by archival I’m broadly referring to a format designed to facilitate archiving, creating, importing, and exporting - likely call for a maximal format with a large required base. In contrast, with syndication - and by syndication again I’m broadly referring to a class of activities that includes things like including headlines on other sites and reading summaries in newsreaders - I believe the goal should be to encourage as many divergent sources as possible to agree on a common format. This argues for a looser, rather than stricter, format, with a small required base and many optional elements.

This distinction between archival activities and syndication activities may itself be too simplistic. Developing SGML DTDs: From Text to Model to Markup - the definitive (and, as far as I know, only) substantive text on document modeling - discusses not just reference DTD development but also the development of derivative DTDs for interchange, authoring, conversion, and presentation.

Some of my specific issues with the Atom Pre-Draft syndication format:

Author elements are required:

4.5 The "atom:author" element is a Person construct that indicates the default author of the feed. atom:feed elements MUST contain exactly one atom:author element, UNLESS all of the atom:feed element's child atom:entry elements contain an atom:author element. atom:feed elements MUST NOT contain more than one atom:author element.
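The rule quoted above can be sketched as a small check. This is only an illustration, assuming the Atom 0.3 namespace `http://purl.org/atom/ns#`; the function name `author_rule_ok` and the sample feeds are hypothetical and not part of any Atom tooling.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace used by Atom 0.3 documents.
ATOM_NS = "{http://purl.org/atom/ns#}"


def author_rule_ok(feed_xml: str) -> bool:
    """Check the section 4.5 author rule on an Atom 0.3 feed (hypothetical helper)."""
    feed = ET.fromstring(feed_xml)
    feed_authors = feed.findall(ATOM_NS + "author")
    if len(feed_authors) > 1:
        return False  # MUST NOT contain more than one atom:author
    if len(feed_authors) == 1:
        return True   # the feed-level default author covers every entry
    # No feed-level author: every entry must carry its own.
    # (A feed with no entries passes vacuously under this reading.)
    entries = feed.findall(ATOM_NS + "entry")
    return all(e.find(ATOM_NS + "author") is not None for e in entries)


# Made-up sample: a publication that does not sign one of its pieces.
UNSIGNED_FEED = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <title>Example Publication</title>
  <entry><title>Unsigned editorial</title></entry>
</feed>"""

# Made-up sample: the same feed with a corporate feed-level author.
SIGNED_FEED = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <title>Example Publication</title>
  <author><name>Example Publication</name></author>
  <entry><title>Unsigned editorial</title></entry>
</feed>"""
```

Under this reading, `author_rule_ok(UNSIGNED_FEED)` is false and `author_rule_ok(SIGNED_FEED)` is true: a single unsigned entry is enough to fail the whole feed unless the publisher supplies a feed-level author.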

I do not believe that publishers should be forced to indicate an author for everything they publish. Granted, a Person construct’s name “MAY be the name of a corporation or other entity no individual authors can be named.” [sic] First, the language here should be changed - it reads to me as though the only time a corporate author is justified is if the authors cannot be named, when in fact corporate authors should be allowed whenever the publisher does not wish to name the authors. But I’m splitting hairs here: I simply think that the requirement is too stringent. Since any item in a feed is associated with a web site, that is enough of an “authorship” constraint for me. Additionally, this requirement makes it impossible to programmatically transform a valid RSS 0.91 feed to valid Atom without lying about the author element. More importantly, some publications simply do not specify an author for everything they publish.

Author elements are single:

Instead of defining multiple authors, Atom requires a single author and additional people to be listed as contributors:

4.6 "atom:contributor" Element The "atom:contributor" element is a Person construct that indicates a person or other entity who contributes to the feed. atom:feed elements MAY contain one or more atom:contributor elements.

The problem with this is that, again, it reflects (I think) the biases of internal representations of a few popular personal content management systems. That is, a single author, and possibly multiple contributors. It does not accurately reflect the reality of large publications that currently use syndication technology. As I write this, the top story on the New York Times, a current user of RSS 2.0, is a story with two authors. While it may be tempting to just declare one of them the primary author and the other a contributor, this is a bad idea. First, it does not accurately reflect the document we are modeling, which means there is something wrong with our model. Second, issues of authorship and credit are important to writers. Slighting them because of the requirements of a syndication format is wrong. As syndication formats are currently one of the only publicly available, usable forms of XML data, it is not unthinkable to imagine an eventual system like Citeseer that measures the “impact” of authors and articles. (Citation analysis is important; those kids at Google realized it and repurposed it into PageRank.) Forcing exactly one author per feed item will inevitably confuse these systems.

There is precedent for single author and added contributors in the cataloging world (it’s really much more complicated than that, don’t make me break out AACR2) but I personally have no intention of repeating the mistakes of 1950s cataloging systems designers.

Modified element required:

4.12 The "atom:modified" element is a Date construct that indicates the time when the state of the feed was last modified, including any changes to entries therein. atom:feed elements MUST contain exactly one atom:modified element.

This is not something that my content management system keeps track of in the format necessary for Atom feeds. (I may get around to adding proper support for it, but it’s not a priority.) As such, I cannot output a valid Atom feed right now without lying about the modification date. Since these are internet syndication formats, and are meant to be served over HTTP, which has facilities for determining a document’s last modification date, this requirement seems redundant. My assumption may be wrong here; perhaps Atom is being designed to be useful even if not served over HTTP. I can certainly see how this would be useful. If you simply have a set of Atom files, you could determine their last modification dates by examining their content. But any system that is gathering Atom files should, in my view, keep track of this internally, rather than force every syndicator to do so. My suggestion is that a top level modified tag be strongly encouraged, but not required.
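A rough sketch of what that could look like on the aggregator side, assuming the aggregator has the HTTP response headers and its own fetch time on hand. The function name `feed_last_modified` and its parameters are hypothetical; it prefers the feed's own modified value when one is present, falls back to the HTTP Last-Modified header, and finally to the time the aggregator fetched the file.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def feed_last_modified(atom_modified: Optional[str],
                       http_last_modified: Optional[str],
                       fetched_at: datetime) -> datetime:
    """Pick a last-modified time for a feed (hypothetical aggregator helper)."""
    if atom_modified:
        # Atom dates look like 2004-03-01T12:00:00Z; older versions of
        # fromisoformat() do not accept the trailing "Z", so normalize it.
        return datetime.fromisoformat(atom_modified.replace("Z", "+00:00"))
    if http_last_modified:
        # HTTP dates look like "Mon, 01 Mar 2004 12:00:00 GMT".
        return parsedate_to_datetime(http_last_modified)
    # Nothing supplied by the publisher: use the aggregator's own bookkeeping.
    return fetched_at


# Made-up fetch time for illustration.
FETCHED = datetime(2004, 3, 2, 8, 0, tzinfo=timezone.utc)
from_http = feed_last_modified(None, "Mon, 01 Mar 2004 12:00:00 GMT", FETCHED)
from_clock = feed_last_modified(None, None, FETCHED)
```

With an optional element the publisher who tracks modification times can still supply them, and everyone else is not forced to invent one.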

At the individual item level, the part I find needlessly strict and complex is the dates. Sections 4.13.6, 4.13.7 and 4.13.8 specify three distinct dates associated with an item - modified, issued, and created.

Modified and issued are required. Modified requires a timezone, which should be UTC; issued does not require a timezone. I’m not quite sure I understand the logic in either case. While reporting modification times is a good idea, I think modified should be optional. Aggregators already do a decent job with RSS feeds, many of which have no dates at all. One required date may already be too much; two certainly is.

The created element has no business being in a syndication format - there’s no reason anyone should know whether I write my daily entries at 4pm or 4am unless I want them to, and I don’t think the New York Times or any other publication is interested in telling you they’re just now publishing a story filed a week ago. The created element is optional, but “If atom:created is not present, its content MUST considered to be the same as that of atom:modified.” This is, well, just wrong. I don’t know why an aggregator should be forced to make an assumption which will be wrong in such a large number of cases.
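To make the effect of these rules concrete, here is a minimal sketch of the three entry dates as described above. The `EntryDates` class and both helper functions are hypothetical names for illustration, and dates are kept as plain strings to avoid guessing at parsing details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntryDates:
    """The three dates the pre-draft attaches to an entry (hypothetical model)."""
    modified: Optional[str] = None  # required by the pre-draft
    issued: Optional[str] = None    # required by the pre-draft
    created: Optional[str] = None   # optional


def missing_required(dates: EntryDates) -> List[str]:
    """List the required dates an entry fails to supply."""
    return [name for name in ("modified", "issued") if getattr(dates, name) is None]


def effective_created(dates: EntryDates) -> Optional[str]:
    """Apply the mandated fallback: an absent created is treated as modified."""
    return dates.created if dates.created is not None else dates.modified


# Made-up example: a story published, then corrected a week later.
story = EntryDates(modified="2004-03-08T09:00:00Z", issued="2004-03-01T09:00:00")
```

Here `effective_created(story)` yields the correction time, `2004-03-08T09:00:00Z`, even though the story was clearly written before it was issued a week earlier, which is the kind of wrong assumption the fallback forces on an aggregator.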
