NZGOAL Guidance Note 2: File formats, updated August 2015

This Guidance Note provides practical advice for agencies when selecting the formats for releasing public information and data for re-use in accordance with the NZ Government Open Access and Licensing framework (NZGOAL), as required by the 2011 Declaration on Open and Transparent Government. This version replaces the January 2015 version.

Application of this advice will assist agencies to assess data readiness for re-use in line with the 5 Star Open data measure.

Format Principles

Cabinet has approved the following format principles:

NZGOAL Open Format Principle (PDF, 970kB)

“When licensing copyright works and releasing non-copyright material for re-use, agencies should:

    1. consider the formats in which they ought to be released, taking into account, where relevant, the wishes of those who will or are likely to re-use the works or material;
    2. release them in the formats they know or believe are best suited for interoperability and re-use and are searchable and indexable by search engines; and
    3. in the case of datasets, add their details into data.govt.nz.

When releasing works or material in proprietary formats, agencies should also release the works or material in open, non-proprietary formats”.

Re-usable Principle in NZ Data and Information Management Principles (PDF, 180kB)

“Data and information released can be discovered, shared, used and re-used over time and through technology change. Copyright works are licensed for re-use and open access to and re-use of non-copyright materials is enabled, in accordance with the New Zealand Government Open Access and Licensing framework.

Data and information are released:

  • at source, with the highest possible level of granularity
  • in re-usable, machine-readable format
  • with appropriate metadata; and
  • in aggregate or modified forms if they cannot be released in their original state.

Data and information released in proprietary formats are also released in open, non-proprietary formats. Digital rights technologies are not imposed on materials made available for re-use”.

Guidance

  1. Agencies must plan for and move to the ideal state of creating, storing and releasing data and information in open, non-proprietary and machine-readable formats. This will evolve as agencies update their information systems, move to open standards, introduce web services and/or amend their publication processes. This will ensure that information released by agencies is also reusable.
  2. Agencies releasing data and information in proprietary formats must move to also releasing the data in open, non-proprietary and machine-readable formats. Initially this may involve releasing in an additional proprietary format: for example, tables in a report released in Portable Document Format (PDF) could also be released as separate MS Excel files. The next step would be to release the tables as open Comma Separated Values (CSV) files as well, and then to progress to the standard open formats listed below.
  3. Agencies must work with users of their data and information to understand the formats they prefer to enable them to re-use this material.
  4. Agencies will then progress along the 5 Star Open data measure.

Formats for Open Data

This table lays out common formats for releasing data for re-use. If you are considering releasing data for re-use in formats not listed below, you should consult the Open Government Information and Data Programme. Note that document formats such as PDF and Word are not suitable formats for providing data for re-use.

Recommended formats for the release of open data
Format | Machine-readable for purposes of data re-use | Open standard | Best used for
JSON | Yes | RFC 7159, ECMA-404 | General data interchange; commonly used as part of a RESTful API service.
Comma Separated Values (CSV) | Yes | RFC 4180 | Tabular and statistical data
Spreadsheets (XLSX, ODS) | Yes if laid out in CSV-like format (may be as a supplementary worksheet). Spreadsheets laid out for visual understanding require manipulation to be made machine readable. | ISO 29500 (XLSX); ISO 26300 (ODS) | Tabular and statistical data
Spreadsheets (XLS) | Yes if laid out in CSV-like format (may be as a supplementary worksheet). Spreadsheets laid out for visual understanding require manipulation to be made machine readable. | Proprietary | Tabular and statistical data
Hypertext Markup Language (HTML) | Yes, but additional formats should be provided for data. | W3C Recommendation | Web documents
Extensible Markup Language (XML) | Yes | W3C Recommendation | Documents / data structures conforming to published schemas
Resource Description Framework (RDF) and Linked RDF | Yes | Suite of W3C Recommendations | Any data
iCal | Yes | Proprietary (maintained by Apple Inc.), but widely supported | Sharing events and calendar-based information
Open Geospatial Consortium standards (WFS, WCS, WPS, WMS, WMTS) | Yes | OGC Standard | All geospatial data
Keyhole Markup Language (KML) and Geography Markup Language (GML) | Yes | OGC Standard | Geospatial data, but has limitations compared with other OGC standards; may be convenient for non-geospatial specialists.
GeoPackage | Yes | OGC Standard | Sharing geospatial data; modern alternative to Shapefile.
GeoJSON | Yes | Publicly developed, freely available specification. | Geospatial data, but has limitations compared with OGC standards; may be convenient for non-geospatial specialists.
Shape Files (SHP) | Yes | Proprietary, but specification published and maintained by ESRI. | Geospatial data, but has limitations compared with OGC standards; may be convenient for non-geospatial specialists.
Sensor Observation Service (SOS) | Yes | OGC Standard | Sensor data, generally associated with a geospatial location.
CityGML | Yes | OGC Standard | Storage and exchange of virtual 3D city models.

A note on data formats

Always provide alternatives. Re-users have a range of needs, capabilities and tools at their disposal. Providing data in alternative formats or layouts facilitates broader opportunities for re-use.

Consider industry- or sector-specific formats. Many industries and verticals have specialised formats for data representation and interchange, often in XML or JSON format. It is recommended that these be explored with industry or sector groups before releasing specialised data and used where possible. Some examples of industry specific formats are:

Tabular and statistical data

Providing data in the form of spreadsheets laid out to aid human comprehension is useful for many people, but generally requires laborious manipulation to be made machine readable and usable by software programmes and visualisation or analytical tools.

Human-friendly spreadsheets should always be accompanied by raw data in CSV format, or at the very least a worksheet containing all the raw data that underpins the spreadsheet, laid out CSV-style (one row of headings, complete rows of data cells and no visual formatting). A good example can be found in this spreadsheet from Treasury (.xls, 348kB) - go to the Raw data worksheet.
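As an illustration only, the CSV-style layout described above (one row of headings, complete rows of data cells) could be checked before release with a short script along the following lines. This is a minimal sketch using Python's standard csv module. The function name check_csv_style and the sample data are hypothetical, and treating an empty cell as a problem is one interpretation of "complete rows" rather than a requirement stated in this note.

```python
import csv
import io


def check_csv_style(text):
    """Return a list of layout problems found in CSV text (empty list = none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    if any(not name.strip() for name in header):
        problems.append("heading row has a blank column name")
    if len(set(header)) != len(header):
        problems.append("heading row has duplicate column names")
    # Every record after the heading row should have one cell per heading.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {number} has {len(row)} cells, expected {len(header)}"
            )
        elif any(not cell.strip() for cell in row):
            # Assumption: blank cells are flagged for review, not forbidden.
            problems.append(f"record {number} has an empty cell")
    return problems


# Hypothetical sample data, for illustration only.
good = "region,year,count\nNorth,2014,10\nSouth,2014,12\n"
bad = "Survey results 2014\nregion,year,count\nNorth,2014\n"

print(check_csv_style(good))
print(check_csv_style(bad))
```

In the second sample, a title line sits above the headings, which is typical of a spreadsheet laid out for visual understanding; the check reports the records that follow as not matching the heading row.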

If agencies need to release data in tab, tilde (~) or other delimited formats, the delimiter used should be noted in descriptive text accompanying the release, and on data.govt.nz.
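Where a comma-separated copy is also wanted, a file using another delimiter can usually be converted mechanically. The sketch below, again using Python's standard csv module, re-writes tilde-delimited text as comma-separated text; the function name convert_delimited_to_csv and the sample data are hypothetical, and real files may need extra care with character encodings and quoting.

```python
import csv
import io


def convert_delimited_to_csv(text, delimiter):
    """Re-write delimiter-separated text as comma-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    output = io.StringIO()
    # RFC 4180 describes CRLF line breaks; cells containing commas are quoted.
    writer = csv.writer(output, lineterminator="\r\n")
    writer.writerows(reader)
    return output.getvalue()


# Hypothetical tilde-delimited sample; note the comma inside the first data cell.
tilde_text = "name~total\nSmith, J~42\n"
print(convert_delimited_to_csv(tilde_text, "~"))
```

The writer quotes the cell that contains a comma, so the value is not split into two columns when the converted file is read back.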

Agencies should also consider providing readily-available query methods (such as JSON APIs) for commonly accessed data, to allow advanced users to search and retrieve a subset of the raw data in machine readable form as and when needed. APIs should be accompanied by thorough documentation and example implementations to facilitate their use.
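To make the idea of a query method concrete, the following sketch shows the kind of filtering a JSON API might perform behind the scenes: it selects the records that match the caller's criteria and returns them as JSON. It is illustrative only and deliberately omits the web service itself. The function name query_records, the response layout (count and results) and the sample records are all assumptions, not part of any agency's actual API.

```python
import json


def query_records(records, **filters):
    """Return, as a JSON string, the records whose fields match every filter."""
    matches = [
        record
        for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]
    return json.dumps({"count": len(matches), "results": matches}, indent=2)


# Hypothetical dataset, for illustration only.
records = [
    {"region": "North", "year": 2014, "count": 10},
    {"region": "South", "year": 2014, "count": 12},
    {"region": "North", "year": 2015, "count": 11},
]

# Retrieve only the subset a re-user asks for, in machine-readable form.
print(query_records(records, region="North", year=2015))
```

A published API would normally document each available filter and the structure of the response, in keeping with the advice above on documentation and example implementations.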

Geospatial data

Some users have rigorous requirements of geospatial data in order to ensure high degrees of accuracy over time, and need data in the form of OGC web services and ISO 19115 metadata. These formats support the development of robust spatial data infrastructures, local and national physical infrastructure, surveying and geographical services etc.

Others, however, can benefit from simpler mechanisms such as KML and the Google Maps APIs, or from converting KML or Shapefiles for use on OpenStreetMap. These mechanisms are useful for people who may not be geospatial professionals but are using spatially-aware tools to develop services or products such as visualisations, simple mapping or real-time plotting services.

Where possible geospatial data should be provided in alternative formats - via web services and download - to support a range of uses.
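As one example of a simpler download format, point data can be expressed as GeoJSON with very little code. The sketch below builds a GeoJSON FeatureCollection from a list of named points using only Python's standard library. The function name points_to_geojson is hypothetical and the coordinates are approximate, for illustration only. Note that GeoJSON lists longitude before latitude.

```python
import json


def points_to_geojson(points):
    """Build a GeoJSON FeatureCollection from (name, longitude, latitude) tuples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        }
        for name, lon, lat in points
    ]
    return {"type": "FeatureCollection", "features": features}


# Approximate coordinates, for illustration only.
sample_points = [("Wellington", 174.78, -41.29), ("Auckland", 174.76, -36.85)]
collection = points_to_geojson(sample_points)
print(json.dumps(collection, indent=2))
```

A file like this suits simple mapping tools, but it does not replace OGC web services and full metadata for users with more rigorous requirements.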

Granularity

Datasets should be listed on data.govt.nz at the most granular level possible. For example, agencies publishing survey data as a collection of spreadsheets or CSV files should provide a description of each spreadsheet or file and list them individually on data.govt.nz.

For tabular and statistical data presented as spreadsheets of multiple worksheets but also containing a worksheet of raw data, the spreadsheet can be considered sufficiently granular to list on data.govt.nz.

Agencies publishing formatted spreadsheets without the accompanying raw data should include the raw data – in a CSV-like layout – underpinning all worksheets in the spreadsheet as an additional worksheet.

In all cases the metadata description for a record on data.govt.nz should be sufficiently detailed that users can understand what type of data they will find in the dataset, and have confidence that the data to be downloaded is the data they want.

Geospatial data is more easily discoverable when listed on data.govt.nz as individual layers, as the LINZ Data Service and others do, rather than as aggregated collections of data. Individual layers accessible in a range of formats and comprehensively described in metadata should be listed as individual entries on data.govt.nz.

Definitions

Open

In the simplest terms, an open format is a format that has an open standard associated with it. An open standard is one that is developed through a transparent, collaborative process, is fairly accessible at zero or low cost, and is mature and supported by the market.

Non-proprietary

Proprietary formats are formats designed to work only in the proprietary programmes that created them. When releasing high-value public data and information for re-use, it should be released in open and non-proprietary formats. However, if a proprietary format is commonly used the data may be released in a proprietary format, as well as a non-proprietary format.

Machine readable

Machine readable data is data that is designed to be consumed directly by computer programs (applications) without a human middleman.

Page last updated: 07/08/2015