
When I was working as a web editor at a metro daily paper a few years ago, I proposed something similar: an XML-like syntax that would allow for metadata to be included in drafts of news stories, some (but not all) of which could be made use of in online versions of the story (such as a link to a map when you're referencing a location).

A lot of wire copy already includes metadata, but it's generally just in a header that accompanies the story.

What I was envisioning was something more like what is being proposed for the semantic web:

<name id="1394">John Smith</name> was elected president of the <organization id="2315">New Castle County Council</organization> on <date value="2014-12-10">Wednesday</date> at the <place lat="39.685881" long="-75.613047">county headquarters</place>.<source id="23" name="Mila Jones" title="New Castle County public relations officer"></source>
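A minimal sketch of how software downstream might read that sentence, assuming the fragment is wrapped in a single root element (the `story` wrapper here is hypothetical, not part of the proposal). It recovers the print-ready plain text and lists the annotations an online edition could act on, such as the coordinates for a map link:

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in a hypothetical <story> root so it parses
# as a single well-formed XML document.
MARKED_UP = (
    '<story><name id="1394">John Smith</name> was elected president of the '
    '<organization id="2315">New Castle County Council</organization> on '
    '<date value="2014-12-10">Wednesday</date> at the '
    '<place lat="39.685881" long="-75.613047">county headquarters</place>.'
    '<source id="23" name="Mila Jones" '
    'title="New Castle County public relations officer"></source></story>'
)


def plain_text(root):
    """Return the story text with every tag stripped (the print version)."""
    return "".join(root.itertext())


def annotations(root):
    """Return (tag, marked-up text, attributes) for each top-level annotation."""
    return [(el.tag, el.text or "", dict(el.attrib)) for el in root]


root = ET.fromstring(MARKED_UP)
print(plain_text(root))
for tag, text, attrs in annotations(root):
    print(tag, repr(text), attrs)
```

The print edition would use only `plain_text`; the web edition could pick out the `place` annotation and build a map link from `lat` and `long`.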

I also wanted to use the metadata to help copy editors trim wire stories:

<priority value="1">This amounts to de facto resegregation. <priority value="4">(And we all know how well segregation worked out the first time.)</priority> If the school district still values integrated schools, it must act swiftly to correct this effect.</priority>
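Roughly how a trimming tool could use those tags, as a sketch only. It assumes the inner tag is written as well-formed XML (`value="4"`), that a lower number means more important, and that a missing `value` counts as priority 1; the function names are made up for illustration:

```python
import xml.etree.ElementTree as ET

PARAGRAPH = (
    '<priority value="1">This amounts to de facto resegregation. '
    '<priority value="4">(And we all know how well segregation worked out '
    'the first time.)</priority> If the school district still values '
    'integrated schools, it must act swiftly to correct this effect.</priority>'
)


def trim(element, max_priority):
    """Keep text whose priority number is <= max_priority, dropping the rest."""
    if int(element.get("value", "1")) > max_priority:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(trim(child, max_priority))
        # Text after a child's closing tag belongs to the parent, so it is
        # kept even when the child itself is cut.
        parts.append(child.tail or "")
    return "".join(parts)


def tidy(text):
    """Collapse the doubled spaces that a cut leaves behind."""
    return " ".join(text.split())


root = ET.fromstring(PARAGRAPH)
print(tidy(trim(root, 1)))  # the short version, aside removed
print(tidy(trim(root, 4)))  # the full paragraph
```

A copy editor short on column inches would lower `max_priority` until the story fits.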

It turns out, though, even when you create a UI that lets reporters and editors easily plug in this metadata without having to understand XML, they are not apt to fill it in, because they are just so overworked as it is.

Plus, in order for this to work on a larger scale, you'd have to get an incredible amount of buy-in. You'd have to get reporters and editors to agree that it's worth their time. You'd have to build software to support it. You'd have to get all of the different media companies out there to agree on standards.

It's just ... not what the media industry is (or should be) focused on right now. They've got bigger things to think about, like how to find a viable business model.



Absolutely true that journalists are too busy to bother with markdown. What would be great, however, is to have copy-editing tools that recognize people's names, Google them and spell-check them; that recognize people's titles, then Google, confirm and spell-check them; and that run style passes making simple grammatical edits against a site's style guide: call it a style filter.

I have to say, for me as a journalist in tech, the one thing that ends up taking the most time in my stories is looking up name spellings and people's titles, and most importantly, trying to figure out if your freaking company is spelled TheCompany, The Company, or Thee Cmpany or some crazy variant. You startups and your mid-word-capital-letters. The bane of copyeditors everywhere.
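A crude sketch of the company-name half of such a style filter, assuming the style guide is just a list of canonical spellings (the list below is hypothetical). It rewrites variants that differ only in capitalization or spacing and reports what it changed:

```python
import re

# Hypothetical style-guide entries: the one approved spelling of each name.
STYLE_GUIDE = ["TheCompany", "BigWidget"]


def variant_pattern(canonical):
    """Match the name in any letter case, with optional spaces between letters."""
    letters = [re.escape(ch) for ch in canonical if not ch.isspace()]
    return re.compile(r"\b" + r"\s?".join(letters) + r"\b", re.IGNORECASE)


def apply_style(text, names=STYLE_GUIDE):
    """Return (corrected text, list of (found, replacement) changes)."""
    changes = []
    for canonical in names:
        def fix(match, canonical=canonical):
            if match.group(0) != canonical:
                changes.append((match.group(0), canonical))
            return canonical
        text = variant_pattern(canonical).sub(fix, text)
    return text, changes


fixed, changes = apply_style(
    "Founders of The Company and thecompany met TheCompany staff."
)
print(fixed)
print(changes)
```

This will happily rewrite a generic phrase like "the company" too, and it won't catch real misspellings such as "Thee Cmpany"; those would need fuzzy matching plus a human to confirm, so treat its output as suggestions rather than edits.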


You can do quite a lot of that in Microsoft Word, including the use of research tools, but it's a lot of work to set up.

Agree about the problems coping with company and product titles, etc. The problem could be reduced by refusing to play that game, e.g., by using registered names rather than marketing styles or logos.


That only works until marketing calls your sales people and says "you spelled our name wrong!"


So you point to the company registration documents and tell them you are happy to display their logos in the ads they can buy to correct it ;-)


  <priority value="1">This amounts to de facto resegregation. <priority value="4">(And we all know how well segregation worked out the first time.)</priority> If the school district still values integrated schools, it must act swiftly to correct this effect.</priority>
Forget about editors, use this kind of mark-up for your readers. Imagine changing an in-depth article into a truncated, 200-word summary with the click of a button. Activating a different tag would include the reporter's subjective commentary (or perhaps have multiple editorials based on the same "scaffolding")?


In print journalism, you're supposed to include information in descending order of importance. As a reader, you're expected to stop reading once you get to details you don't really care about, confident that you won't miss something more important buried farther down.


That's largely true for hard news stories. But what about news features or op-eds or sports profiles, where the traditional inverted pyramid structure isn't employed?


Exactly. Feature stories often use a delayed lead to set the scene or catch the reader's attention.


While I can understand the arguments for this, in practice the approach frustrates me to no end.

Worse are authors whose writing has no excerptable lede. Gina Kolata and mumber Morgenstern (health / Well articles in the NY Times) especially do this.

I find some old-school journos -- Dan Gillmor particularly comes to mind, I've called out a few others in G+ posts -- still practice strong ledes and heads. Many newer ones start with "My latest at <some website somewhere> read more".

Which ... tells me fucking nothing.

Lede with your lede. Trail with your link or call-to-action.


The so-called "inverted pyramid". I'm finding it applied less and less frequently.

I've also noted that it's become pretty much standard practice for news bureaus to write single-sentence paragraphs. Literally, every sentence of a story is its own paragraph. I don't know exactly when that became standard practice, but it was sometime in the late 1990s or 2000s, particularly as stories moved online.


Sadly, hardly anyone writing now outside big traditional media orgs has done any sort of formal journalism training.


That's ... not all bad.

I think there's too much insularity within much of the journo community. But many of the newcomers are also subject to outside influences which call their credibility strongly into question. Lack of uniform copyediting, for better or worse, means a wide range of writing quality.

Though I'm seeing that even in long-standing brands -- NY Times, Forbes, and elsewhere.


Zinsser mentions this one-sentence-paragraph thing in "On Writing Well", citing an AP article from 1993. (In the "Paragraphs" section of Chapter 10.)


Thanks. And yes, it makes sense that it would be AP style or similar.


This seems to be a lost art, unfortunately.


>Forget about editors, use this kind of mark-up for your readers. Imagine changing an in-depth article into a truncated, 200-word summary with the click of a button.

I imagine it, and most users would still not care. They skim articles anyway.


Actually, this is an extremely simplified example, and once you get into the nitty gritty, it becomes a lot more difficult to add/remove various elements. For instance, you might need to capitalize a word differently depending on whether something has been excised immediately before it, or you may need to adjust punctuation in ways that you can't do simply using an XML-like format. Really, you need what amounts to a Natural Language Generation library to implement a robust system.
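To make the cleanup problem concrete, here is a sketch of the kind of repair pass you end up bolting on after excising tagged spans. It is a handful of regex heuristics (the function name and rules are mine, purely for illustration), nowhere near a real NLG library:

```python
import re


def repair(text):
    """Patch up the obvious damage left behind after text has been cut out."""
    text = " ".join(text.split())                  # collapse leftover gaps
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)   # no space before punctuation
    text = re.sub(r",{2,}", ",", text)             # ",," left by a removed clause
    text = re.sub(r",\.", ".", text)               # ",." at the end of a sentence
    # Capitalize the first letter of the text and of each following sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


# "However, " was excised from the front of the first sentence:
print(repair("the council voted  on Wednesday ."))
# A parenthetical clause was excised from between the commas:
print(repair("The mayor, , spoke ."))
```

The second call comes back as "The mayor, spoke.", which is tidier but still wrong: knowing that the remaining comma has to go too takes an understanding of the sentence, not more regexes. The capitalization rule also misfires after abbreviations like "e.g.".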


I implemented that for my blog somewhere around 2001 or so. It's quite tedious to write, and I haven't done it since. Even two versions of a story is a lot.


I did something like this in college, and I was thinking semantic annotations would be an editorial pass like copyediting. These days you'd probably let some ML system take a crack at it before using human effort to bring the quality up to your publication's standard. In any case, it doesn't have to get in the way of writing a good story.


You can do this with lisp-style syntax (which also (despite opinions to the contrary) is quite natural).


>> It turns out, though, even when you create a UI that lets reporters and editors easily plug in this metadata without having to understand XML, they are not apt to fill it in, because they are just so overworked as it is.

Nailed it. I'm a developer in a newsroom, and I'm dealing with flak just asking them to write a non-automated teaser text for their blog posts. They can't be bothered.


> They've got bigger things to think about, like how to find a viable business model.

This is actually potentially a big part of that. People are reading more than they ever have—it's just not necessarily newspapers that they're reading.


>> "People are reading more than they ever have—it's just not necessarily newspapers that they're reading."

Any stats on that? I agree people are reading more than ever but disagree that they're not reading newspapers (online). I think they are, they just aren't paying for it anymore.


If you look at this NiemanLab post, table 1 shows that even as online time has increased, time on online newspapers has definitely decreased.

http://www.niemanlab.org/2014/06/are-online-ads-more-valuabl...


Interesting, thanks.


stats requested, stats delivered, hackernews


That said, you could end up applying this to a "news article IDE" automatically, with less human intervention required -- or at least, provide automatic suggestions. I couldn't find any clear links on the topic, but here are a few that can be followed with a bit of research:

http://www.nltk.org/book/ch07.html

http://stanbol.apache.org/docs/trunk/components/enhancer/nlp...

This last one was actually trained on WSJ content: http://nlp.lsi.upc.edu/freeling/index.php?option=com_content...

At this point, though, I'm thinking it'd be the equivalent to spelling and grammar suggestions in Word, appreciated somewhat but ultimately considered useless the first time it screws up. But still better than nothing, right? ;-)


Stanbol could definitely be used as part of a system like this. In fact, that sort of thing is a big part of how we're using it at Fogbeam, although aimed at assorted knowledge workers in an enterprise setting, and not at journalists specifically.


There are automated services that let you do this; I've used Open Calais and MetaCarta in the past with great results.

To be honest I'm surprised services like those aren't automatically used on new content within every major media outlet as a standard.


Yeah, just that: they're too busy. Most news articles are valid for just one edition of a newspaper, so about half a day or even less if the newspaper is published more than once a day. It's write it and move on, in a lot of cases. Investigative journalism probably has more use for a system like this, but even then I doubt they'd want to fill in XML forms when they could just write sentences. Besides, usually you can trust an editor to read an article and remove the bits that aren't relevant based on their own judgment, instead of a 'priority' hint by the original author.


A lot of this information can be extracted directly from the text. It is not like "New Castle County Council" is particularly ambiguous. The trouble for a lot of reporting is that the stories simply lack depth and good links to external content.
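For unambiguous names, the simplest version of that extraction is a lookup table. A sketch, assuming the newsroom keeps a gazetteer mapping known names to a tag and ID (the table below is hypothetical and reuses the IDs from the example at the top of the thread):

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical newsroom gazetteer: known name -> (tag, database id).
GAZETTEER = {
    "John Smith": ("name", "1394"),
    "New Castle County Council": ("organization", "2315"),
}


def annotate(text, gazetteer=GAZETTEER):
    """Wrap every known name found in plain text in its metadata tag."""
    # Try longer names first so a short name never pre-empts a longer one.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    out, last = [], 0
    for match in pattern.finditer(text):
        tag, ident = gazetteer[match.group(0)]
        out.append(escape(text[last:match.start()]))
        out.append(f"<{tag} id={quoteattr(ident)}>{escape(match.group(0))}</{tag}>")
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)


print(annotate("John Smith was elected president of the New Castle County Council."))
```

Exact string matching only gets you the easy cases; two different John Smiths, or a name the table has never seen, is where the NLP tools mentioned elsewhere in the thread would have to take over.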


I agree that this can be generated at publish time. The question is, how much value does it provide over something generated at read time by a browser plugin? The answer is - at publish time presumably someone at the publisher outfit will take a cursory look at it. That's it. May as well have readers mark up the articles!



