Applying the 5 Star Open Data model to your high value public data

Introduction

The World Wide Web Consortium (W3C) has developed a five star model to describe different characteristics of open data, and its usefulness for people wishing to reuse it. It is being used globally as a model for assessing data readiness for re-use.

Applying this five star data model along with metadata standards will result in well understood and “mashable” datasets (datasets easily joined together to create a new dataset).

The three star level is considered the minimum standard for release of government’s public data for re-use: non-proprietary, machine-readable, accessible via the web, and licensed for reuse in accordance with NZGOAL.

5 Star Open Data model

The five levels are:

  1. Data is visible, licensed for reuse, but requires considerable effort to reuse.
  2. Data is visible, licensed, and easy to reuse, but not necessarily by all.
  3. Data is visible and easy to reuse by all (not restricted to using specific software).
  4. Data is visible, easy to use and described in a standard fashion.
  5. Data is visible, easy to use, described in a standard fashion and meaning is clarified by being linked to a common definition.
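
One way to picture how the levels build on each other is as a checklist that is read in order. The sketch below is only an illustration: the Dataset class, its field names and the star_level function are hypothetical and are not part of the five star model itself.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist of the characteristic each star asks for."""
    on_web_with_open_licence: bool  # star 1
    machine_readable: bool          # star 2
    non_proprietary_format: bool    # star 3
    described_with_rdf: bool        # star 4
    linked_to_definitions: bool     # star 5


def star_level(ds: Dataset) -> int:
    """Count stars in order, stopping at the first criterion that is not met."""
    checks = [
        ds.on_web_with_open_licence,
        ds.machine_readable,
        ds.non_proprietary_format,
        ds.described_with_rdf,
        ds.linked_to_definitions,
    ]
    stars = 0
    for met in checks:
        if not met:
            break
        stars += 1
    return stars


# Three illustrative cases: a PDF report, an Excel workbook and a CSV file,
# each published on the web under an open licence.
pdf_report = Dataset(True, False, False, False, False)
excel_file = Dataset(True, True, False, False, False)
csv_file = Dataset(True, True, True, False, False)

print(star_level(pdf_report), star_level(excel_file), star_level(csv_file))
```

Because the count stops at the first unmet criterion, a dataset cannot skip a level: RDF data that is not openly licensed on the web would still score zero in this sketch.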

Detailed descriptions

Below are more technical descriptions of each star:

On the web with an open license

Making data available on the web with an appropriate license to re-use is the first step. However, if it is only presented as a PDF, it is very difficult and labour intensive to actually re-use.

Data in tables on a webpage are also difficult to re-use. It takes a lot of effort to strip away the HTML code surrounding the actual data. Data in PDF documents is even harder to reuse. It is difficult for a software programmer to identify where the data starts and ends within such documents, so that they can extract it.

Machine-readable data

Machine-readable data are structured and predictable, with well-established ways to query and consume them using software code. They are in a standard format, where standard rules apply to the structure in which the data are presented. This means a software program can readily retrieve them across the web, and developers can consume the data from within their own program or application. Note that being machine-readable does not equate to being easily read by humans.

A very basic machine-readable format is a Comma Separated Values (CSV) file. Each row in a CSV file has the same number of columns (attributes or characteristics), with the values separated by commas. The first row is usually the column headings (the name of each attribute). This format is simple and predictable when it comes to coding a program to read it and use it.

For example, a CSV file of motor vehicle data could look like this:

Make,Model,Colour,Engine (headings)
Toyota,Corolla,Fuchsia,1600 (data following the same pattern as the headings)
Toyota,Camry,Blue,2200
Mitsubishi,Galant,Green,3000
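
To show why this predictability matters to a programmer, here is a minimal sketch that reads the sample data with Python's standard csv module. The explanatory notes in brackets are left out, because a real CSV file would contain only the values. The variable names are illustrative.

```python
import csv
import io

# The sample data from the text, without the explanatory notes in brackets.
CSV_TEXT = """Make,Model,Colour,Engine
Toyota,Corolla,Fuchsia,1600
Toyota,Camry,Blue,2200
Mitsubishi,Galant,Green,3000
"""

# DictReader treats the first row as the column headings and returns
# every later row as a dictionary keyed by those headings.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

for row in rows:
    # Every value arrives as text, so the engine size is converted explicitly.
    print(row["Make"], row["Model"], row["Colour"], int(row["Engine"]))
```

The program needs no knowledge of where the data "starts and ends": the format itself supplies that.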

 

Other standard machine-readable formats allow for more sophisticated re-use of the data, for example XML and JSON. Atom (an XML-based format) and JSON are commonly used to deliver data through an Application Programming Interface (API). Geospatial data have a number of specialised formats.
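
As a rough illustration of how one record from the motor vehicle example might travel through an API as JSON, the sketch below encodes it to text and decodes it again using Python's standard json module. The field names simply reuse the CSV headings; a real API would define its own.

```python
import json

# One record from the motor vehicle example, held as a Python dictionary.
record = {"Make": "Toyota", "Model": "Corolla", "Colour": "Fuchsia", "Engine": 1600}

encoded = json.dumps(record)   # the text form that would be sent over the web
decoded = json.loads(encoded)  # parsed back into a dictionary by the receiver

print(encoded)
print(decoded["Model"], decoded["Engine"])
```

Unlike CSV, JSON keeps the distinction between text and numbers, so the engine size comes back as a number without any conversion.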

Non-proprietary formats

Non-proprietary formats can be accessed by any software. The data is in a format that does not require specific software or systems to access it. For example, a CSV file can be opened in any spreadsheet software, whether it is Microsoft Excel, Open Office, an iPad app etc. It can also be readily imported into any database software, or any newly developed software. By contrast, a Microsoft Excel spreadsheet is only accessible using Microsoft Excel software, or a few other applications that have been written to be compatible.

RDF Standards

RDF stands for Resource Description Framework, a framework for describing resources on the web. RDF breaks down data into a series of facts. A fact is expressed as a “triple” of the form: Subject, Predicate, Object.

For example, take the first record in the CSV file above: Toyota,Corolla,Fuchsia,1600.

In RDF this could be expressed as:

(Subject) (Predicate) (Object)
:Car :isMake :Toyota
:Car :isModel :Corolla
:Car :isColour :Fuchsia
:Car :hasEngine :1600
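
The same breakdown can be sketched in code. The example below uses plain Python tuples rather than an RDF library, and the PREDICATES mapping and row_to_triples function are hypothetical names chosen for illustration. The predicate names are taken from the example above.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Maps each CSV heading to the predicate used in the example above.
PREDICATES: Dict[str, str] = {
    "Make": ":isMake",
    "Model": ":isModel",
    "Colour": ":isColour",
    "Engine": ":hasEngine",
}


def row_to_triples(subject: str, row: Dict[str, str]) -> List[Triple]:
    """Turn one record into a list of (subject, predicate, object) facts."""
    return [(subject, PREDICATES[column], f":{value}") for column, value in row.items()]


row = {"Make": "Toyota", "Model": "Corolla", "Colour": "Fuchsia", "Engine": "1600"}
triples = row_to_triples(":Car", row)

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Each column of the record becomes one separate fact about the same subject, which is what lets facts from different sources be combined later.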

 

Linked RDF

Linked RDF goes a step further and describes meaning by making reference to “a source of meaning” on the web. The result is linked data that is comparable.

It does this by providing a link to a source on the internet that defines what, for example, ‘fuchsia’ means. An example may be a colour palette at Resene. Any other RDF data anywhere on the internet that makes the same reference to the same source of meaning can then be used with confidence that the different data sets are describing the same colour (after all, is fuchsia red or pink? Or reddish-pink? Or pinky-red?).

So if someone combined RDF data about car production with RDF data about car accidents, and both datasets referenced the Resene colour palette (on the web) for fuchsia, we could be certain they are talking about the same colour of car.

The link in Linked RDF is via a Uniform Resource Identifier (URI), a form of hyperlink. Sometimes it’s an address to a static webpage location that contains the information, or sometimes it’s actually an address plus a query to the database at that location.
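
The car production and car accident scenario can be sketched as follows. The URIs are placeholders on example.org standing in for a published colour definition; they are not real Resene addresses. The predicate names and the subjects_with helper are likewise hypothetical.

```python
# Placeholder URIs standing in for a shared, published colour definition.
FUCHSIA = "https://example.org/colours/fuchsia"
BLUE = "https://example.org/colours/blue"

# Two independently produced datasets of (subject, predicate, object) triples.
production = [
    ("car:1", "hasColour", FUCHSIA),
    ("car:2", "hasColour", BLUE),
]
accidents = [
    ("accident:17", "involvedCarColour", FUCHSIA),
    ("accident:18", "involvedCarColour", BLUE),
    ("accident:19", "involvedCarColour", FUCHSIA),
]


def subjects_with(triples, obj):
    """Return the subjects of all triples whose object is the given URI."""
    return [s for s, _, o in triples if o == obj]


# Both datasets point at the same URI, so the match is exact rather than
# a guess about whether two colour names mean the same thing.
produced = subjects_with(production, FUCHSIA)
crashed = subjects_with(accidents, FUCHSIA)

print(produced, crashed)
```

If one dataset had recorded "pinky-red" and the other "fuchsia" as free text, no such exact comparison would be possible.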

Re-use of this content is licensed under a Creative Commons Attribution 4.0 International License.

Page last updated: 03/05/2016