Data And Metadata In News Gathering and Management by Hacks/Hackers NYC

Sep 30, 2010


(A big thanks to Daniel Bachhuber for initiating notes on TypeWithMe.com, with help from Greg Linch and Chrys Wu. This is version 5 of the notes, lightly edited, with more context added.)

Since time immemorial, two major knowledge management questions have bedeviled news organizations. First, when faced with a giant pile of primary source material, how does a reporter intelligently and efficiently discover the newsworthy bits? Second, how should the organization index and expose the latest news and archival material to both consumers and reporters?

To answer these questions, Lotico, the New York Semantic Web Meetup, and Hacks/Hackers teamed up on Sept. 30 to have speakers from Thomson Reuters, The Associated Press, The New York Times, The Wall Street Journal and Aol News present on their work in computer-assisted reporting. Turnout was tremendous, with most registrants showing up even though the event was free.

The lineup:

Ken Ellis, Proposition Leader at Thomson Reuters;

Stuart Myles, Deputy Director of Schema Standards at The Associated Press;

Tom Torok of  The New York Times;

Maurice “Mo” Tamman of The Wall Street Journal;

Justin Cleary, Senior Product Manager at Aol News.

Moderator: Evan Sandhaus of Lotico, the New York Semantic Web Meetup, who reminded the audience about the Semantic Web Summit in Boston on November 16th and 17th (a confluence of technology and business topics). The Semantic Web meetup group now has almost 1,500 members, dedicated to taming large piles of data.

http://twitter.com/greglinch/status/26021697444

Ken Ellis, Thomson Reuters

– His boss, Tom Tague, bailed on the talk because of a bug, leaving Ellis about five hours of prep time; he had been counting on Tague, the scheduled speaker, to prepare the presentation.

– Tague was scheduled to show us how we can use the Reuters OpenCalais platform. He is developing a new group within Reuters Media.

– Mandate of the new group is to create new packages to be launched over the next year or two.

“Besides the content business, Reuters is planning to offer solutions to help publishers make better products, additional revenue, etc.”

Stuart Myles, The Associated Press

– Metadata is at the center of information management. The key is how to keep the content straight. Goal is to constantly improve the metadata.

– The news industry has a new standards board, and AP (Stuart) is on it.

– Information management is both a team and a process within the Associated Press.  The mission:  to “establish and maintain metadata standards for all AP and member content”

– A single platform for all content allows more precise searching

– AP uses a Windows “Teragram” tool that looks like a taxonomy tree and helps determine which terms to apply to a given piece of content

– Sister system tracks whether the rules are being applied appropriately

– Classification of metadata:

– Subjects (4,200 terms in 16 categories)

– Entities (people, companies, organizations: 70,000 people, 34,000 companies, 500 organizations)

– Places (2,000 regions, countries, states, provinces)

– Metadata allows AP to “slice and dice” content into several different products. It does so better than simple search.

http://twitter.com/HacksHackersNYC/status/26023163810

– News organizations commonly want just news about a particular location; metadata also allows them to classify content by media type.

– AP uses metadata in content search.

– Myles has worked on hNews, which allows you to put semantic data in HTML
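
For context, hNews builds on the hAtom microformat: an article is marked up with class names such as hentry, entry-title, source-org and dateline so that machines can read who reported a story and for whom. Below is a minimal, hypothetical sketch of pulling those fields out of a page with Python's standard library; the sample markup and the choice of fields are illustrative, not AP's production feed.

```python
# Hypothetical sketch: read a few hNews/hAtom fields out of an article page.
# Class names follow the public hAtom/hNews drafts; the markup is invented.

from html.parser import HTMLParser

SAMPLE = """
<article class="hentry">
  <h1 class="entry-title">City council approves budget</h1>
  <span class="source-org vcard"><span class="org">Example Wire Service</span></span>
  <span class="dateline">SPRINGFIELD</span>
  <div class="entry-content">The council voted 7-2 on Tuesday.</div>
</article>
"""

WANTED = {"entry-title", "source-org", "dateline"}

class HNewsExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        hit = classes & WANTED
        if hit and self._current is None:
            self._current = hit.pop()

    def handle_data(self, data):
        # Record the first non-empty text inside a wanted element.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = HNewsExtractor()
parser.feed(SAMPLE)
print(parser.fields)
# -> {'entry-title': 'City council approves budget',
#     'source-org': 'Example Wire Service', 'dateline': 'SPRINGFIELD'}
```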

– Q: Does Teragram support RDF input?

– Nope.

– Q: Is all of your data hand-crafted, or machine-generated?

– Company information is purchased, info on notable people is handcrafted, other is mixed.

http://twitter.com/greglinch/status/26023876927

– Teragram allows AP to manually define the rules, and apply those rules automatically to content.
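
To make that rule-driven workflow concrete, here is a toy sketch of hand-written tagging rules applied automatically to copy. The tags and regular expressions are invented for illustration; they are not AP's taxonomy and this is not the Teragram product.

```python
# Toy, hypothetical sketch of rule-based subject tagging, in the spirit of the
# "editors define rules, the system applies them" workflow described above.

import re

# Invented example rules: each subject tag has a few hand-written patterns.
RULES = {
    "Economy": [r"\binflation\b", r"\binterest rates?\b", r"\bunemployment\b"],
    "Elections": [r"\bballots?\b", r"\bpolling\b", r"\bcandidates?\b"],
}

def suggest_tags(text):
    """Return the subject tags whose hand-written rules match the text."""
    tags = []
    for tag, patterns in RULES.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            tags.append(tag)
    return tags

print(suggest_tags("Candidates sparred over interest rates and unemployment."))
# -> ['Economy', 'Elections']
```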

Tom Torok, New York Times

– Q: How many people work in the “news collection” business?

— Small number of hands are raised

– Q: How many practice “calculator-assisted reporting”? “Notepad-assisted reporting”? We don’t use those labels because those tools come with the job, so there really shouldn’t be a special category for “computer-assisted reporting” either.

http://twitter.com/greglinch/status/26025066057

– Computer-assisted reporting was a grassroots movement, it didn’t come from management.

– When Tom got an Osborne computer, it also included an offer to buy a 10 MB external hard drive for… $10,000.

– The movement took off in 1989 when Bill Dedman won a Pulitzer Prize for “The Color of Money,” a series exposing redlining, the systematic avoidance/denial of mortgage lending based on race.

– In 1994, an outfit called Investigative Reporters and Editors, in conjunction with the Missouri School of Journalism, started  NICAR (National Institute for Computer-Assisted Reporting).  Listserv offered quick fixes to journo-data questions; answers in 15 minutes or less.

– Tom Torok’s only formal computer training was Fortran. Everything else was on his own from manuals, message boards, etc. http://twitter.com/greglinch/status/26025428210

http://twitter.com/greglinch/status/26025534030

– CAR reporters don’t think of themselves as such. They often have to scrape and assemble complete data sets because the data is never released in whole, only in parts chosen for presentation or to convey a specific message. They also often have to negotiate with governments or private parties for complete data sets.

– The Times has a “stable” of data accessible only to the newsroom.

http://twitter.com/MacDivaONA/status/26026221799

“A lot of people think CAR reporters only deal with structured data. But in reality they often have to deal with unstructured data.”

– Example of a news project: how often do Ohio Supreme Court justices recuse themselves from cases involving their campaign contributors? (http://www.nytimes.com/2006/10/01/us/01judges.html). After crunching the data, The Times found that Ohio justices “routinely sat on cases after receiving campaign contributions from the parties involved or from groups that filed supporting briefs. On average, they voted in favor of contributors 70 percent of the time. Justice O’Donnell voted for his contributors 91 percent of the time, the highest rate of any justice on the court.” It took six weeks of cleaning up the data by hand, mostly straightening out the contributors’ names.

– Torok shows us Jo Craven McGinty’s report on hate crime in New York – Very powerful analysis of a tricky topic — one that would have been difficult to report by calling people up.

technews.nytimes.com — looks like an internal tool

– Torok demos a tool called fastESP, which lets you search on a keyword and returns the entities it computationally identifies (size, companies, locations, people and keywords), with facets on the side for easy filtering of top results

– Torok can infer Hillary Clinton’s foreign experience based on whether other heads of state show up as entities in any given country search; this indicates whether her name appears in the same documents as theirs.

– Among the several tools the NYT used was the OmniPage Capture SDK for OCR

– fastESP also extracts phone numbers. The program can identify the phone lines of heads of state by doing a multi-value search: first for phone numbers, then a keyword search for the person’s name
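
That multi-value search idea (find documents that contain both a phone-number-shaped string and a given name) can be approximated with a crude pattern match. The following is a hypothetical sketch over in-memory strings, not fastESP itself.

```python
# Rough, hypothetical sketch of a "multi-value search": documents that mention
# a given name AND contain something that looks like a phone number.

import re

PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # deliberately loose pattern

def phone_hits(documents, name):
    """Return (doc_id, phone numbers) for documents mentioning `name`."""
    hits = []
    for doc_id, text in documents.items():
        if name.lower() in text.lower():
            numbers = PHONE.findall(text)
            if numbers:
                hits.append((doc_id, numbers))
    return hits

docs = {
    "memo-1": "Call the press office at +44 20 7946 0000 to reach the minister.",
    "memo-2": "No contact details here.",
}
print(phone_hits(docs, "minister"))  # -> [('memo-1', ['+44 20 7946 0000'])]
```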

– NYT has a private arrangement with IBM to get a service called PrivateEyes, which includes a text tree

http://twitter.com/Hoenikker/status/26027150665

– Q: How tricky is the interpretation of the judge contribution data? Do you have a PhD in statistics?

— Torok: Journalism is presenting the data and allowing the reader to decide for themselves.

Maurice “Mo” Tamman, The Wall Street Journal

– Self-described second-generation CAR guy; he got into it in the 1990s

– Frustrated that most journalists build stories off tips or anecdotes, which are sometimes pure rubbish; data-analysis skills let you test them.

– Example of election data. There’s a data trail with electronic voting machines, which is an advantage over the older paper voting schemes

– Tamman showed raw voting data from the Sarasota County general election in 2006, which showed a significant under-vote in one race. His team re-created every ballot cast in the election and cleaned up the data: county, ballot, precinct, voter (ID number?), race (political office), candidate, party. The goal was to find where the 18,000 uncounted votes went, and to see whether there were trends by race, sex, age, or type of voting machine. The under-vote was skewed toward precincts with mostly older voters. Tamman: “We did all of the responsible things and got jack shit for it.”

– So they went back to do the reporting after looking at the data. They scored each ballot: Republican votes got +1 and Democratic votes -1, so a straight-ticket Republican ballot scored +6 and a straight-ticket Democratic ballot scored -6. The expectation was that straight-ticket voters would have the highest turnout.
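
A rough sketch of that scoring scheme, for illustration only; the ballot layout, sample data and function names are invented, not the Herald-Tribune's actual code.

```python
# Hypothetical sketch of the ballot-scoring idea described above.

PARTY_SCORE = {"REP": 1, "DEM": -1}  # +1 per Republican vote, -1 per Democratic vote

def score_ballot(votes):
    """votes: list of (race, party) tuples for one ballot.
    A straight-ticket Republican ballot over six partisan races scores +6,
    a straight-ticket Democratic ballot scores -6."""
    return sum(PARTY_SCORE.get(party, 0) for _, party in votes)

def undervote_rate_by_score(ballots, target_race):
    """Group ballots by partisan score and compute how often each group
    skipped (under-voted) the target race."""
    groups = {}
    for votes in ballots:
        s = score_ballot(votes)
        skipped = all(race != target_race for race, _ in votes)
        total, misses = groups.get(s, (0, 0))
        groups[s] = (total + 1, misses + (1 if skipped else 0))
    return {s: misses / total for s, (total, misses) in groups.items()}

# Example: two ballots, one of which skipped the congressional race.
ballots = [
    [("GOV", "REP"), ("CD13", "REP"), ("AG", "REP")],
    [("GOV", "DEM"), ("AG", "DEM")],  # no CD13 vote recorded
]
print(undervote_rate_by_score(ballots, "CD13"))  # -> {3: 0.0, -2: 1.0}
```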

– Result from data collection in this photo:  http://www.flickr.com/photos/danielbachhuber/5039924079/

– Identified a problem with how the ballots were stacked: http://www.flickr.com/photos/danielbachhuber/5039928533/

– Mo worked on the story while at the Sarasota Herald-Tribune:

“Vote analysis points to bad ballot design”

“About the Herald-Tribune analysis”

“Audit to review computer code”

“Bad design unlikely to get revote”

Mo’s presentation, originally delivered at the Knight Digital Media Center (PowerPoint)

– The analysis identified incompetent ballot design, exactly the opposite of what everyone said was going on; it wasn’t cheating, it was just pure incompetence

– Mo has been working in data journalism for about 20 years. All of his stories are built on an “empirical spine.”

“We live in a world where no one trusts journalists. If you build an empirical spine around a story, it inoculates you to some degree against charges that you’re not fair or accurate.”

– He wants to do projects that people will read and take seriously

“Empirical journalism gives you that.”

– Another project looked at server logs from a system recently installed in a local school district. What Mo was looking for: teachers looking for porn. “And I found them.”

– Another project looked at the projected results from bank stress tests. The worst case hasn’t happened, but the baseline case has over the last couple of years. CAR provided the understanding of the data needed to reach this conclusion.

– Feedback:

http://twitter.com/jonathanstray/status/26028847761

http://twitter.com/jonathanstray/status/26029144152

Justin Cleary, Aol – “Data and Editorial Workflow on Aol News”

– Justin works on the product team at Aol News

– Compared to the other presenters’ organizations, Aol has a small staff, focused on working as efficiently and effectively as possible. Still, Aol News is the fourth-largest general news site on the web, averaging 30 million unique visitors.

http://twitter.com/Hoenikker/status/26029453532

– Content of Aol News is original reporting, news blogging and wire services

– Increased focus on a new type of breaking news blogging: Surge Desk story assignments originate from trending and search data and focus on delivering news coverage to users as they demand it.

– News Production Lifecycle:

1. Choose a Topic – what’s happening today? (traditional news-gathering techniques), what’s trending? (Google trends, Twitter trends, other web apps; web pub and Twitter analysis driven by their Relegence tool), what are consumers looking for? what’s left over?

— Google Trends is used by everyone and is not that useful.

— Relegence, an internal product, helps see what will be interesting, not just what has been interesting (what have people looked at, what are they looking at, what will they be looking at?)

— “The magic of journalism happens” and the story is written.

— Gives the number of stories, all the links, when they were published, and source within 6 minutes of being published

2. Create and Classify: what are the top-level categories? what tags apply? what Qs does this content answer?

— Top-level categories are editorially selected

— Tags are suggested from a semantic analysis of the content, ranked by relevance; suggested tags are opt-out (see the sketch after this list)

— Questions it answers: suggest questions from a subset of Aol search data; apply relevant Qs as tags

3. Track performance

— Watch basic traffic metrics in real-time (page views, uniques, and internal/external referral links)

— Pay attention to which keywords your search visitors are actually arriving on, not just which keywords your content is ranking on
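
One crude way to picture the "tags suggested from semantic analysis, ranked by relevance" step above is to score each candidate tag by how much of its associated vocabulary appears in the article. The tag vocabularies and sample text below are invented; this is a hypothetical sketch, not Aol's tagger.

```python
# Hypothetical sketch of relevance-ranked tag suggestion: score each candidate
# tag by how many of its associated terms appear in the article, then return
# tags in descending order for an editor to opt out of.

from collections import Counter
import re

TAG_VOCAB = {
    "midterm elections": {"ballot", "senate", "turnout", "candidate"},
    "hurricane season":  {"storm", "landfall", "evacuation", "forecast"},
}

def rank_tags(article_text, top_n=3):
    words = Counter(re.findall(r"[a-z]+", article_text.lower()))
    scores = {tag: sum(words[t] for t in terms) for tag, terms in TAG_VOCAB.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(tag, score) for tag, score in ranked if score > 0][:top_n]

print(rank_tags("Turnout surged as the Senate candidate made a final ballot push."))
# -> [('midterm elections', 4)]
```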

– Does It Work?

— Volume of content produced up approx 20%

— Created thousands of optimized long-tail pages

— Natural search referrals up significantly, impressed both Cleary and his manager

– Q: Why not ask the audience directly what they want to know?

— A: If we could vet 350 million responses a month (???), we’d do that. Right now we don’t have the technology to handle interactivity on that scale.

http://twitter.com/HacksHackersNYC/status/26030528534

– Q: Do you ever cover the statistically “unpopular” stories?

— We look for the stories already being covered in three ways, and what’s left over (not being covered) is the opportunity for original reporting.

http://twitter.com/jonathanstray/status/26030672447

Questions for the end:

– AP: How much has the Associated Press invested in semantic technologies (man hours, etc.), and what’s the return on investment?

– AP: Why microformat over RDFa for hNews?

– AP: How has hNews been adopted? What number of companies are using it, and how is it being used?

– AP: What are the applications of the news registry and how is it being used?

– Aol: Could we see a demo of the Surge Desk?

– Aol: Do you have any machine learning in the Surge Desk? If so, what is it doing?

— The basic system that returns entities is dynamic: it crawls 30,000 sources and identifies new entities within 6 minutes. They want to expand the machine learning to pay attention to how editors opt in and out of tags.

— Relegence, a company Aol acquired several years ago, provides the machine-learning technology and Aol has worked to incorporate it in-house

Q: About open government data – how useful is it?

Torok gives the example of British Parliament members’ expenses reporting:

The (London) Telegraph MP expenses

Guardian MP expenses //broken link//

– Torok: The Telegraph owned that story. The value of the data lies in the interpretation and the context around it. The Guardian offered up the data for people to poke around in.

Q: What about RDFa?

– Myles: We are looking at RDFa in the IPTC standards body, and how it might be incorporated. Call coming up next week.

@todo Better research the relationship between hNews (microformats) and RDFa

Q: What tools do you wish you had every day that you didn’t?

– Tom Torok, New York Times: ClearForest, the paid version of OpenCalais, is phenomenal. It understands questions like “who builds bombs” and “who gave money to whom,” and presents the data. The semantic connections are phenomenal, and so is the price.

– Mo Tamman, Wall Street Journal: 10 terabytes

– Justin Cleary, Aol: A better tool to combine learning systems with strong editorial judgment

– Stuart Myles, AP: Algorithms to find good, meaningful original content. We’ve been working on this for five years and we still don’t have this.

– Ken Ellis, Reuters: algorithms don’t help that much with news, yet

http://twitter.com/kraykray/status/26031117344

Categories: Meetups, News