One of the most striking developments in online news in the past year or so has been the rapid proliferation of interesting database applications. Gannett Co. newspapers have been leaders in this area, driven by the company's "information center" initiative, which is yielding new organizational structures and approaches to information gathering and presentation. The "data desk" is one of the seven pillars of the company's new approach to news.
As Gannett realizes, data should be a driving force in online journalism, for at least the following reasons:
- Data is "evergreen" content. Its value to users does not end after 24 hours.
- Data can be personal. What's more relevant to someone than, say, reported crimes in their neighborhood, or nearby property assessments?
- Data can best be delivered in a medium without space constraints. The most valuable databases (say, property assessments or state employee salaries) contain too much information to publish in print. And even when print publishing is practical (say, listing real estate transactions in zoned editions), the data will be much more valuable if they are accessible and searchable at the user's convenience.
- Data takes advantage of the way people actually use the Web. It's a medium for active behavior -- for instance, research and interaction -- not passive activity like reading or viewing.
- Data, once gathered, can be excerpted in print. Once you've done the work of acquiring, formatting and enabling online access to data, it is easy to pull information from the database for traditional publications.
The Indianapolis Star has been one of the most successful Gannett papers in implementing online databases. I'm particularly impressed with the diversity of data-driven projects the paper has deployed on its online Data Central site. Thanks to the Star, online users can see police and fire emergency calls in real time, look up their property tax assessments, review school test scores, check out CEO salaries, look up crime statistics in their neighborhood, and -- proving that databases can be used for much more than hard news -- find out how Indianapolis Colts quarterback Peyton Manning performs in different game situations. Between March and October, the Star's online databases generated 7.2 million page views - roughly 3 percent of the paper's online traffic.
I recently had the opportunity to talk with some of the key Star staff members involved in developing these and other databases: Mark Nichols, computer assisted reporting coordinator; Mike Jesse, manager of research and data resources; and Bob Jonason, digital operations director. They also emphasized the contributions of Chris Johnson, a graphic artist at the Star whose background in software development has proven extremely valuable.
Behind each database lies a story - where the idea came from, how the database was developed, what skills were needed to make the data available and usable. For this post, I'm going to focus on three of the databases and then discuss some lessons other news organizations can draw from the Star's experience. I'll also share some thoughts about the range of possibilities for database journalism online.
Police fire and emergency calls
For years, the Marion County police/fire communications agency had made an electronic "feed" of police and emergency services calls available to the Star and other media organizations. The information was very limited -- just date, time, address and an abbreviation describing the nature of the call (CRIM *VANDLISM, or THEFT *WALLET). The feed came in, via a dial-up modem connection, to the paper's photo desk. Photo and news editors would scan the feed for emergency calls that might be significant enough to send a photographer or reporter to the scene.
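A record with only a date, a time, an address and a call-type abbreviation is simple to turn into structured data. The following is a minimal sketch of that idea, not the Star's actual code. The pipe-delimited line layout, the `CALL_TYPES` lookup table and its expansions, and the function names are all assumptions made for illustration; the real feed's format is not described here.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical lookup table. The expansions below are assumed for
# illustration; they are not taken from the agency's abbreviation list.
CALL_TYPES = {
    "CRIM *VANDLISM": "Criminal vandalism",
    "THEFT *WALLET": "Theft of a wallet",
}


@dataclass
class EmergencyCall:
    date: str
    time: str
    address: str
    code: str
    description: str


def parse_feed_line(line: str) -> EmergencyCall:
    """Parse one feed line, assumed to look like 'date|time|address|code'."""
    date, time, address, code = [part.strip() for part in line.split("|", 3)]
    # Fall back to the raw abbreviation when no expansion is known.
    description = CALL_TYPES.get(code, code)
    return EmergencyCall(date, time, address, code, description)


def feed_to_json(lines) -> str:
    """Convert non-blank feed lines into a JSON array a Web page could load."""
    calls = [asdict(parse_feed_line(line)) for line in lines if line.strip()]
    return json.dumps(calls, indent=2)
```

Keeping the raw abbreviation alongside the readable description means a call still appears even when its code is missing from the lookup table.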
As the paper ramped up its database efforts, Star editor Dennis Ryerson "kept saying over and over again, 'Here's what I'd love to have. I'd love to have something where if I hear a siren in my neighborhood, I would like to go on IndyStar.com and find out why,' " Jesse said. "I said to myself, 'Yeah, right,' but then I remembered that 911 feed."
Johnson, the graphic artist with a software background, was able to convert the feed to a format that could be published on the Web. He also developed a slick map-based interface using Yahoo! Maps. Even though the information is still cryptic (a PDF file of abbreviations is provided to make sense out of it), it's extremely popular. When I visited the site around noon one day this week, there were 47 people using the map.
Property tax assessments
Property tax assessments, a hot topic in most communities, have been particularly controversial in Indiana as the state moves to a system where all properties are supposed to be assessed at their true market value. A reassessment in Marion County, which includes Indianapolis, was made public in June, generating complaints and outrage in the city and creating political turmoil statewide. The paper acquired the entire database and published it online. Then, when the governor announced a proposal to cap property tax bills at 1 percent of assessed value, the paper published the tax bill databases as well.
The paper was working on a story about the new assessments and the fact that they would be much higher than people were used to. "I was able to talk with the Marion County assessor and get them as a database," Nichols recalled. "We were able to post them online before the county did."
The paper was able to publish the databases quickly, in part, because it had licensed Caspio Bridge, a technology toolkit that makes online data publishing possible without advanced database development skills. In Nichols and Jesse, the paper had staff with the key necessary skills.
Nichols has more than a decade of experience in data analysis for journalistic projects. "I'm seeing this as a new wave, and I'm enjoying it and learning from it," Nichols said.
Jesse's background was in the Star's library, where he learned HTML and data skills in publishing Star archival content to the Web. "Mark really understands the data much better than I do, and he helps me get the data together so I can work with it and design a page that puts it online," Jesse said.
The "Manning-meter"
Assistant sports editor Ted Green, whose responsibilities include developing projects for the Star's Web site, saw an opportunity to provide deep information about the Colts' star quarterback, Peyton Manning. "Ted came to us with a 'what if' scenario -- what if we could determine what Manning's passing percentage would be in a variety of situations," Nichols said. "We started looking around for available data resources, and there really wasn't anything out there we could just acquire that would have all the information we needed. I finally told Ted that the only way we would be able to do this is to build our own database."
Though Manning has attempted more than 5,000 passes, the Star decided to go ahead and put the data together. "Our sportswriters were able to get play-by-play sheets for every game that Manning has played in as a Colt," Nichols said. Several Star staff members typed the data into a spreadsheet, and Johnson once again designed an attractive interface. Sports fans can now see how Manning has performed in game situations: which stadium, which opponent, what down, on turf or grass, in different weather conditions, and much more.
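Once every pass attempt is a row in a spreadsheet, a passing percentage for any situation is just a matter of grouping and counting. Here is a minimal sketch of that calculation. The column names (`down`, `surface`, `completed`) and the four sample rows are invented for illustration and are not real Manning statistics; the Star's actual spreadsheet layout is not described here.

```python
from collections import defaultdict

# Made-up sample rows, one per pass attempt. Real data would come from
# the hand-entered play-by-play spreadsheet.
plays = [
    {"down": 1, "surface": "turf", "completed": True},
    {"down": 1, "surface": "turf", "completed": False},
    {"down": 3, "surface": "grass", "completed": True},
    {"down": 3, "surface": "turf", "completed": True},
]


def completion_pct(plays, field):
    """Return completion percentage for each distinct value of `field`."""
    attempts = defaultdict(int)
    completions = defaultdict(int)
    for play in plays:
        key = play[field]
        attempts[key] += 1
        if play["completed"]:
            completions[key] += 1
    return {
        key: round(100.0 * completions[key] / attempts[key], 1)
        for key in attempts
    }


print(completion_pct(plays, "down"))     # {1: 50.0, 3: 100.0}
print(completion_pct(plays, "surface"))  # {'turf': 66.7, 'grass': 100.0}
```

The same function works for any column in the sheet, which is why one well-structured spreadsheet can answer questions about stadium, opponent, down, surface or weather.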
"The wonderful thing about this data is that it's not available anywhere else," Jonason said. "If you're covering a team now and you're competing with NFL.com, ESPN and Sportsline, you need something different."
Lessons from the Star's experience
The Star's experience suggests some lessons for other news organizations that want to develop data-driven online applications:
Have a plan. Before the Star began building Data Central, the key staff members created a list of subjects where data might be available and useful to its online users, Jonason said. The list included public safety, business, sports, real estate and community information. The process generated a list of data ideas that the Star's data team has been using to build out Data Central.
Involve computer-assisted reporting experts. The reporters and editors with experience in computer-assisted reporting have the deepest knowledge of what data is available, how to negotiate to obtain it, how it can be used to generate enterprise journalism -- and, most importantly, what the pitfalls and limitations of databases are. CAR specialists have learned, sometimes painfully, that every database has incomplete, misleading or downright inaccurate information. Even though the data may have come from another source, such as a government agency, the news organization can easily end up being blamed for inaccuracies. Nichols recalls a teacher salaries database where one teacher was listed with an annual salary of $700,000. Someone apparently entered an extra zero in that data column -- illustrating the simple fact that while computers store and output data quite reliably, the information originally had to be typed in by a human who can make human mistakes.
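An error like that extra zero is the kind of thing a simple sanity check can catch before a database goes online. Below is a minimal sketch, assuming each record is a dictionary with a `salary` field. The record layout, the sample names and figures, and the five-times-the-median threshold are all assumptions chosen for illustration, not a method described by the Star.

```python
from statistics import median


def flag_suspect_salaries(records, ratio=5.0):
    """Return records whose salary is far above or below the median.

    `ratio` is an arbitrary illustrative threshold: a salary more than
    `ratio` times the median, or less than the median divided by `ratio`,
    is flagged for a human to check against the source.
    """
    if not records:
        return []
    mid = median(r["salary"] for r in records)
    return [
        r for r in records
        if r["salary"] > ratio * mid or r["salary"] < mid / ratio
    ]


# Made-up sample data; the last entry mimics an extra-zero typing error.
teachers = [
    {"name": "Teacher A", "salary": 42000},
    {"name": "Teacher B", "salary": 48000},
    {"name": "Teacher C", "salary": 51000},
    {"name": "Teacher D", "salary": 55000},
    {"name": "Teacher E", "salary": 700000},
]

for record in flag_suspect_salaries(teachers):
    print(record["name"], record["salary"])  # Teacher E 700000
```

A check like this only produces a list of records to verify; deciding whether a flagged value is a typo or a genuine outlier still takes a call to the agency that supplied the data.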
"Back when we would go out and get a huge database for an investigative project, we would present readers with a summary," Nichols said. "Now all of a sudden you are posting hundreds of thousands of records about people, and if it's not right, we're going to be getting a call. It changes the whole landscape of how you deal with things as a journalist."
Put together a team. Effective publication of online databases requires a variety of skills unlikely to be held by a single staff member. The team needs the journalistic and analytical skills of CAR specialists, the research understanding of news librarians, programming and HTML skills that are typically the province of technology developers, design talent that might come from a graphics department and, in at least some cases, the ability to create a Flash interface like the one used for the "Manning-meter."
Raise the profile of the news library. Most news organizations have significantly cut back their investment in what newspapers used to call the "morgue." Staff are no longer needed to clip and store articles in little yellow envelopes. Reporters can use online databases to do their own research instead of needing to ask a news researcher to do it. But as Gannett has shown, the job of the "data desk" is quite consistent with the mission of the news library, and the people who have made careers in news research, like Jesse, can play an enormously important role in database publishing.
Find tools that make database publishing easy. Caspio's software makes data publishing possible for newspapers that can't deploy a database developer of their own to these kinds of projects. Caspio also hosts the data on its servers, which means no investment in hardware or bandwidth is needed. On the other hand, Caspio doesn't allow for the kind of rich customization that your own developer can provide. An interesting discussion about Caspio has erupted among the small fraternity of journalists who do data analysis and database development. (Check out the blogs of Derek Willis of the Washington Post and Jacob Kaplan-Moss of the Lawrence Journal-World to delve more deeply. You'll want to be sure to read the comments thread on Willis' blog post, where Caspio's David Milliron, a veteran computer assisted reporting specialist, responds to Willis' criticisms of Caspio.)
The debate over Caspio obscures what might be the most important point, which is that the news industry needs tools that open up online data development to people other than professional technology developers. The Django technology framework, pioneered at the Lawrence Journal-World and popularized by people like Kaplan-Moss and journalist/programmer Adrian Holovaty, has a steeper learning curve than Caspio. But sites like the Journal-World, the Washington Post and the St. Petersburg Times (check out Politifact, a Django-powered site developed by Matt Waite) are showing that developing a competency in Django can make it possible to develop sophisticated database-driven Web applications quite rapidly.
Apply news judgment in deciding what and when to publish. The most popular and valuable online databases are those that are provided in a clear journalistic context. The Star put the property assessment database online when those assessments were a hot news topic. The Asbury Park Press published public employee payroll databases along with an investigative project in which the paper revealed that some government workers were drawing multiple "full-time" salaries from different agencies. By contrast, the Lansing State Journal came under heavy criticism when it published a database of state employee salaries without providing enough journalistic context about why it was doing so. State employees were outraged, the publisher felt compelled to apologize, and government officials who don't want to provide data to news organizations got more ammunition for their battles to resist disclosure of public databases.
A hierarchy of database journalism
Back in the mid-1990s, quite a few of the people given responsibility for launching online news sites (including me, as the first new media director at The Miami Herald) came from computer-assisted reporting backgrounds. We saw the potential for online databases but found that a variety of factors - including complex, expensive technology and corporate control of access to Web servers - pushed these kinds of projects to the back burner. Now that the technology has gotten more accessible, it's exciting to see so many news organizations get involved in database publishing. (For more on this topic, you might want to check out this case study of the Asbury Park Press' DataUniverse site and this excellent roundup of online database applications from Steve Buttry at Newspaper Next.)
But it's also becoming clear that there's a wide range of possibilities for database publishing. And that some projects are clearly both more complex and potentially more rewarding - for news organizations and their online audience - than others. It might be useful to think about a hierarchy of database publishing. At the low end are the simplest kinds of projects in which the news organization doesn't do much beyond making the data available. At the high end are the most ambitious applications, in which the news organization adds value through smart interface development, journalistic analysis, creativity in presentation or connections to storytelling.
Level 1: Data delivery. Here a news organization obtains data and makes it available in a browsable form. There's no additional reporting and little functionality for the online user. The Star's CEO salaries database is an example.
Level 2: Data search. This is by far the most common way data is made available. Users are expected to find relevant information by entering text into a search box. An example: The Cincinnati Enquirer's database of home sales prices.
Level 3: Data exploration. Compare the search results page for a typical searchable database like Cincinnati home sales prices to the browse options on Adrian Holovaty's chicagocrime.org. There's a search box on the page, but the site allows easy exploration of the data in a way most online databases do not. Click on any of the browse options and you are presented with additional links that you can click on to explore the information more thoroughly. I recently heard Adrian talk about his approach to developing database applications. He talks about applying "The Treatment" to online data, by which he means, "Present it in ways that make it fun and serendipitous." His motto is: "Everything that can be linked should be linked." His work shows that searchability is just the beginning.
Level 4: Data visualization. Rows and columns usually aren't the most effective way to present data. For many databases, the most valuable thing a news organization can do is provide a way for people to visualize what the data show. The most obvious approaches involve mapping, at least for databases that have a geographic element. Thanks to Google and Yahoo!, it is relatively easy to add maps to any database that includes addresses. But the possibilities for data visualization go way beyond mapping. A site that is doing some very interesting things with data visualization is Digg.com, a tech-oriented site where content is prioritized based on user voting. Check out Digg Labs for some creative ways the Digg team is finding to prioritize news using visual interfaces.
Level 5: Data experiences and storytelling. When a news organization can effectively marry traditional reporting and storytelling with database development capabilities, truly new forms of journalism can emerge. Here are a few examples of what I'm talking about:
- The Los Angeles Times' homicide map. What makes this project interesting is that behind the map is a page (actually a blog post) about every individual murder in Los Angeles this year. And for each murder, the Times allows comments (after staff review), which often take the form of tributes to the homicide victim. These comments are often poignant and compelling - transforming dry statistics into human stories.
- Politifact, a joint project of the St. Petersburg Times and Congressional Quarterly. This is a data-driven application designed "to help you find the truth in the presidential campaign."
- Digital Trails, a story produced by the News21 reporting project, a foundation-funded initiative involving graduate students from Northwestern University, the University of Southern California, Columbia University and the University of California-Berkeley. (Disclosure: I was an adviser to the students and helped them work with Flash developers From Scratch Design Studio to present the story.) "Digital Trails" is a story about how information about people is captured and stored as they go about their daily activities. Student journalists Phil Stuart and Meredith Mazzotta followed a young woman around the Washington area and identified every instance in which she left a "digital trail," then found out where that information was stored and how it might be shared with companies or the government. Underlying the multimedia reporting and Flash interface is a database in which every trail is a data element.
Last words
While changes in audience behavior and the business of media are creating tough times for news organizations these days, online databases offer new ways of engaging audiences and delivering quality journalism that people need and value. Ryerson, the Indianapolis Star's editor, said he sees continuing database development as a critical aspect of his newspaper's future.
"If we can find enough of these things that intersect with the lives of our readers," he said, "I think we will be all right."
By Rich Gordon (firstname.lastname@example.org)
Rich Gordon is Associate Professor and Director of Digital Technology in Education at the Medill School of Journalism at Northwestern University.