In Depth Guide

This guide explains how the internals of CfGRIB.jl work. It is aimed at advanced users and anyone who wants to work with the internals of the code.

If you want a quick guide on how to use the package, see the Quick Start Guide.

Internals

The package internals are covered in greater detail in the library section of the documentation. However, it is useful to have a rough sense of what happens when you load a dataset.

First, we load the package, and for convenience create a string pointing to our file path:

julia> using CfGRIB

julia> sample_data_dir = abspath(joinpath(dirname(pathof(CfGRIB)), "..", "test", "sample-data"))
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data"

julia> demo_file_path = joinpath(sample_data_dir, "era5-levels-members.grib")
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib"

Whenever you load a GRIB file, the first thing that happens is that the file index is read. The file index contains metadata describing which messages in the file contain what information. We can explore the index by manually creating a FileIndex object.

First, we can look at the docstring for the FileIndex constructor. Type ? at the REPL to enter help mode, then type CfGRIB.FileIndex and press enter to get the docstring:

help?> CfGRIB.FileIndex
  Summary
  ≡≡≡≡≡≡≡≡≡

  mutable struct FileIndex

  A mutable store for indices of a GRIB file

  TODO: Should probably change this to a immutable struct

  Fields
  ≡≡≡≡≡≡≡≡

    •    allowed_protocol_version::VersionNumber

        Version number used when saving/hashing index files, should change if
        the indexing structure changes breaking backwards-compatibility

    •    grib_path::String

        Path to the file the index belongs to

    •    index_path::String

        Path to the index cache file

    •    index_keys::Array{String,1}

        Array containing all of the index keys

    •    offsets::Array{Pair{NamedTuple,Int64},1}

        Array containing pairs of offsets[HeaderTuple(header_values)] => offset_field

    •    message_lengths::Array{Int64,1}

        Array containing the length of each message in the GRIB file

    •    header_values::OrderedCollections.OrderedDict{String,Array}

        Dictionary of all of the loaded header values in the GRIB file

    •    filter_by_keys::Dict

        Filters used when creating the file index

  Constructors
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  FileIndex()

  defined at dev/CfGRIB/src/indexing.jl:34
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L34).

  FileIndex(grib_path, index_keys)

  defined at dev/CfGRIB/src/indexing.jl:38
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L38).

The docstring is quite long. It explains the fields contained in the object and lists the constructors which can be used to create an instance of it.

We'll use the second constructor, which takes in a path to the file and a list of keys. First, we pick which keys we want to use. In this case we'll just use the ALL_KEYS constant:

julia> println(CfGRIB.ALL_KEYS)
["DxInMetres", "DyInMetres", "J", "K", "LaDInDegrees", "Latin1InDegrees", "Latin2InDegrees", "LoVInDegrees", "M", "N", "NV", "Nx", "Ny", "angleOfRotationInDegrees", "centre", "centreDescription", "cfName", "cfVarName", "dataDate", "dataTime", "dataType", "directionNumber", "edition", "endStep", "forecastMonth", "frequencyNumber", "gridDefinitionDescription", "gridType", "iDirectionIncrementInDegrees", "iScansNegatively", "indexing_time", "jDirectionIncrementInDegrees", "jPointsAreConsecutive", "jScansPositively", "latitudeOfFirstGridPointInDegrees", "latitudeOfLastGridPointInDegrees", "latitudeOfSouthernPoleInDegrees", "level", "longitudeOfFirstGridPointInDegrees", "longitudeOfLastGridPointInDegrees", "longitudeOfSouthernPoleInDegrees", "missingValue", "name", "number", "numberOfDirections", "numberOfFrequencies", "numberOfPoints", "paramId", "pl", "shortName", "step", "stepType", "stepUnits", "subCentre", "time", "totalNumber", "typeOfLevel", "units", "valid_time", "verifying_time"]

julia> index = CfGRIB.FileIndex(
                  demo_file_path,
                  CfGRIB.ALL_KEYS
              );

From here you can explore fields contained in this object. Typically you will never interact with the FileIndex directly, as it's just used in the background to load the data.
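
To make the shape of the offsets field more concrete, here is a minimal, self-contained sketch that does not use CfGRIB at all. It mimics the `Array{Pair{NamedTuple,Int64},1}` layout from the docstring with invented header values, and the helpers `matches` and `matching_offsets` are hypothetical names introduced only for illustration. It is not how the package itself filters messages.

```julia
# Toy stand-in for the `offsets` field: each entry pairs the header values of
# one message (a NamedTuple) with the byte offset of that message in the file.
# The header values below are invented for illustration.
toy_offsets = Pair{NamedTuple,Int}[
    (shortName = "z", number = 0, level = 500) => 0,
    (shortName = "z", number = 1, level = 500) => 14760,
    (shortName = "t", number = 0, level = 500) => 29520,
    (shortName = "t", number = 1, level = 500) => 44280,
]

# Hypothetical helper: true if the header agrees with every key/value filter.
matches(header, filters) = all(getfield(header, k) == v for (k, v) in filters)

# Hypothetical helper: byte offsets of all messages that pass the filters.
matching_offsets(offsets; filters...) =
    [offset for (header, offset) in offsets if matches(header, filters)]

z_offsets = matching_offsets(toy_offsets; shortName = "z")
```

With an index of this shape, locating the messages for one variable is a scan over header values, and no message data has to be decoded.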

`DataSet`

Once the FileIndex has been created, the next step is to use it to create a DataSet object. The DataSet is what you use to access the stored data. The docstring says:

help?> CfGRIB.DataSet
  Summary
  ≡≡≡≡≡≡≡≡≡

  struct DataSet

  Map a GRIB file to the NetCDF Common Data Model with CF Conventions.

  Fields
  ≡≡≡≡≡≡≡≡

    •    dimensions::OrderedCollections.OrderedDict{String,Int64}

        OrderedDict{String,Int} of $DIMENSION_NAME => $DIMENSION_LENGTH.

    •    variables::OrderedCollections.OrderedDict{String,CfGRIB.Variable}

        OrderedDict{String,CfGRIB.Variable} of $DIMENSION_NAME => $DIMENSION_VARIABLE, where the the variable is a CfGRIB.jl Variable.

    •    attributes::OrderedCollections.OrderedDict{String,Any}

        OrderedDict{String,Any} containing some metadata extracted from the file.

    •    encoding::Dict{String,Any}

        Dict{String,Any} containing metadata related to CfGRIB.jl, e.g. filter_by_keys

  Constructors
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  DataSet(dimensions, variables, attributes, encoding)

  defined at dev/CfGRIB/src/dataset.jl:127
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).

  DataSet(path; read_keys, kwargs...)

  defined at dev/CfGRIB/src/dataset.jl:140
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).

Here we see references to Variable, so we'll briefly explain those.

`Variable`

A Variable is a basic struct in CfGRIB.jl which contains information for a variable read from a GRIB file:

help?> CfGRIB.Variable
  Summary
  ≡≡≡≡≡≡≡≡≡

  struct Variable

  Struct describing a cfgrib variable

  Fields
  ≡≡≡≡≡≡≡≡

    •    dimensions::Tuple{Vararg{String,N} where N}

        Name of the dimension(s) contained in this variable

    •    data::Union{CfGRIB.OnDiskArray, Number, Array}

        Data contained in the variable, can point ot in-memory data or to a CfGRIB
        OnDiskArray

    •    attributes::Dict{String,Any}

        Dictionary containing metadata for the variable, typically the units, the long name,
        and the standard name

  Constructors
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Variable(dimensions, data, attributes)

  defined at dev/CfGRIB/src/dataset.jl:108
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L108).

`OnDiskArray`

As explained above, a Variable contains a data field. This data can either be held in memory (Array, Number) or be an OnDiskArray. On-disk arrays are, as the name hints, a way to represent data stored on disk before that data is loaded.

This makes it easier to deal with large datasets, as the data is only loaded lazily when the user attempts to read it, and even then only the requested data is stored in memory.

help?> CfGRIB.OnDiskArray
  Summary
  ≡≡≡≡≡≡≡≡≡

  struct OnDiskArray

  Struct that contains metadata for an array, used to lazy-load the array from disk only when
  requested

  Fields
  ≡≡≡≡≡≡≡≡

    •    grib_path::String

    •    size::Tuple

    •    offsets::OrderedCollections.OrderedDict

    •    message_lengths::Array{Int64,1}

    •    missing_value::Any

    •    geo_ndim::Int64

    •    dtype::Type

  Constructors
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  OnDiskArray(grib_path, size, offsets, message_lengths, missing_value, geo_ndim, dtype)

  defined at dev/CfGRIB/src/dataset.jl:27
  (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L27).

The OnDiskArray object contains enough information to fully describe the data stored on disk, and to allow for easy indexing into this data. A custom getindex method dispatches on this type, opens the GRIB file at grib_path, and reads only the relevant messages.

For example, if a three-dimensional array is described by an OnDiskArray, and the user requests the data at index [1, :, :], then only the messages within that index are loaded from the GRIB file.
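
The lazy-loading idea can be sketched with a small toy type. This is only an illustration under simplifying assumptions: the "file" is a vector of in-memory matrices, the first index must be a single integer, and `ToyOnDiskArray` and `read_message` are hypothetical names that do not exist in CfGRIB.jl.

```julia
# Toy model of lazy loading. Each "message" holds one 2-D field, and the
# leading index selects which message to read.
struct ToyOnDiskArray
    size::Tuple{Int,Int,Int}            # (n_messages, n_rows, n_cols)
    messages::Vector{Matrix{Float32}}   # stands in for the file on disk
    reads::Vector{Int}                  # log of which messages were "read"
end

# Pretend to read one message from disk, recording the access.
function read_message(a::ToyOnDiskArray, i::Int)
    push!(a.reads, i)
    return a.messages[i]
end

# Custom getindex: only the message selected by the first index is read.
# Simplification: the first index must be a single integer.
function Base.getindex(a::ToyOnDiskArray, i::Int, j, k)
    return read_message(a, i)[j, k]
end

messages = [fill(Float32(m), 2, 3) for m in 1:4]
lazy = ToyOnDiskArray((4, 2, 3), messages, Int[])

slice = lazy[1, :, :]   # touches message 1 only
```

After the indexing call, `lazy.reads` records a single access, which is the behaviour the paragraph above describes: the other three messages are never touched.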

`DataSet` Constructors

Now that the groundwork is laid, let's look at how files are read and used. The most basic option is calling DataSet with a path string. This uses the constructor defined at dev/CfGRIB/src/dataset.jl:140 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).

As the source at that location shows, this creates a FileIndex and then returns:

DataSet(build_dataset_components(
    index;
    errors=errors,
    encode_cf=encode_cf,
    squeeze=squeeze,
    read_keys=read_keys,
    time_dims=time_dims,
)...)

The call to build_dataset_components returns the dimensions, variables, attributes, and encoding read from a file. These four values are then splatted into the other constructor, defined at dev/CfGRIB/src/dataset.jl:127 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).

The constructor then returns a DataSet object.
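
The pattern of returning several components and splatting them into a positional constructor can be shown in isolation. The sketch below is self-contained and uses made-up names (`ToyDataSet`, `toy_components`) and made-up values. It only illustrates the `...` splat, not what build_dataset_components actually computes.

```julia
# Toy struct with the same four field names as DataSet, using simplified types.
struct ToyDataSet
    dimensions::Dict{String,Int}
    variables::Dict{String,Vector{Float64}}
    attributes::Dict{String,Any}
    encoding::Dict{String,Any}
end

# Hypothetical builder: returns the four components as a tuple, in field order.
function toy_components()
    dimensions = Dict("time" => 2)
    variables = Dict("t" => [280.0, 281.5])
    attributes = Dict{String,Any}("Conventions" => "CF-1.7")
    encoding = Dict{String,Any}("filter_by_keys" => Dict())
    return dimensions, variables, attributes, encoding
end

# The trailing `...` unpacks the tuple into four positional arguments.
toy = ToyDataSet(toy_components()...)
```

Keeping the building logic in one function and the struct constructor purely positional means the struct can also be created directly from components obtained some other way.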

Getting Data from a `DataSet`

Once you have a DataSet object, you probably want to access its data.

Direct Access

The most basic way to do this is to just access the variables directly. For example:

julia> dataset = CfGRIB.DataSet(demo_file_path);
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327

julia> dataset.dimensions
OrderedCollections.OrderedDict{Any,Any} with 5 entries:
  "number"        => 10
  "time"          => 4
  "isobaricInhPa" => 2
  "longitude"     => 120
  "latitude"      => 61

julia> dataset.variables
OrderedCollections.OrderedDict{Any,Any} with 9 entries:
  "number"        => Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict…
  "time"          => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
  "step"          => Variable((), 0, Dict{String,Any}("units"=>"hours","long_na…
  "isobaricInhPa" => Variable(("isobaricInhPa",), [850, 500], Dict{String,Any}(…
  "latitude"      => Variable(("latitude",), [90.0, 87.0, 84.0, 81.0, 78.0, 75.…
  "longitude"     => Variable(("longitude",), [0.0, 3.0, 6.0, 9.0, 12.0, 15.0, …
  "valid_time"    => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
  "z"             => Variable(("number", "time", "isobaricInhPa", "longitude", …
  "t"             => Variable(("number", "time", "isobaricInhPa", "longitude", …

From here you can check the Variable documentation to see what is stored in these. So, if we want to get the data for z:

julia> dataset.variables["z"]
CfGRIB.Variable(("number", "time", "isobaricInhPa", "longitude", "latitude"), CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752  …  14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32), Dict{String,Any}("GRIB_units" => "m**2 s**-2","long_name" => "Geopotential","GRIB_dataType" => "an","GRIB_totalNumber" => 10,"GRIB_jScansPositively" => 0,"GRIB_name" => "Geopotential","GRIB_gridType" => "regular_ll","GRIB_Ny" => 61,"GRIB_longitudeOfLastGridPointInDegrees" => 357.0,"GRIB_stepUnits" => 1…))

julia> dataset.variables["z"].data
CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752  …  14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32)

julia> convert(Array, dataset.variables["z"].data)[:, :, 1, 1, 1]
10×4 Array{Union{Missing, Float32},2}:
 14201.8  14016.0  13708.7  13255.5
 14209.2  14010.4  13702.5  13259.5
 14212.2  14023.9  13696.9  13254.6
 14205.7  14034.3  13710.4  13246.6
 14203.8  14023.8  13711.3  13270.9
 14205.9  14009.2  13691.9  13261.1
 14197.3  14011.9  13711.4  13246.1
 14198.7  13994.9  13702.4  13254.3
 14215.0  14001.4  13710.5  13245.8
 14202.1  14008.7  13709.8  13251.7

Since it's an OnDiskArray, it has to be converted into an Array (which in this case just reads the data from disk). Once that's done, it can be indexed like any standard array.

For a normal variable stored in memory this is a bit easier as the reading step does not have to be performed:

julia> dataset.variables["number"]
CfGRIB.Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict{String,Any}("units" => "1","long_name" => "ensemble member numerical id","standard_name" => "realization"))

julia> dataset.variables["number"].data
10-element Array{Int64,1}:
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
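
The time and valid_time values shown earlier are plain integers. Assuming they are seconds since the Unix epoch (the output above does not state the units, so treat this as an assumption), they can be turned into DateTime values with the Dates standard library. The values are copied by hand here so the snippet runs on its own.

```julia
using Dates

# Integers as printed for the "time" variable; assumed to be Unix seconds.
time_values = [1483228800, 1483272000, 1483315200, 1483358400]

# Broadcast the stdlib conversion over the vector.
timestamps = unix2datetime.(time_values)

# Under that assumption the first entry is 2017-01-01T00:00:00 and
# consecutive entries are 12 hours apart.
```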

Accessing all of the data this way would be extremely awkward, so we provide a number of multidimensional named-axis backends which make data access far easier.

Using Named Dimensional Backends

The recommended way to use CfGRIB.jl is to use an array backend. More information about backends can be found on the Backends documentation page.

If one of the backend dependencies is available you can convert to that backend data type with the convert function:

julia> using AxisArrays

julia> dimensional_dataset = convert(AxisArray, dataset)
AxisArrayWrapper with 2 dataset(s)
OrderedCollections.OrderedDict{Any,Any} with 7 entries:
  "GRIB_edition"           => 1
  "GRIB_centre"            => "ecmf"
  "GRIB_centreDescription" => "European Centre for Medium-Range Weather Forecas…
  "GRIB_subCentre"         => 0
  "Conventions"            => "CF-1.7"
  "institution"            => "European Centre for Medium-Range Weather Forecas…
  "history"                => "2020-11-29T13:19:12.208 GRIB to CDM+CF via cfgri…

This conversion to a backend will create an object for that specific backend, preserving all of the data that was present in our DataSet objects (e.g. the metadata will all be propagated through).

Current backend implementations have two limitations:

No 'dataset' like support
No metadata support

These limitations mean that we have to create a wrapper struct which can hold the multidimensional array type from the backend, as well as some additional attributes.

In the Python xarray package, there are two basic types: a DataArray and a Dataset. The DataArray is a multidimensional array of a single variable, which contains information for that variable as well as information about the dimensions which enables useful indexing capabilities.

The Dataset is a set of multiple DataArrays with common dimensions. This lets you have a DataArray containing pressure information with dimensions of, for example, time, latitude, longitude, and height; if you have another set of data with the same dimensions but for temperature, then you can store both in a single Dataset.

The backends we currently use do not have this functionality, so instead we just wrap the two variables and allow for easy access to both.

Additionally, our DataSet contains some more metadata (such as the attributes and encoding information), which also cannot be stored in the array backends, so we store that in the wrapper as well.
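
A wrapper of this kind can be sketched in a few lines. The following is a minimal illustration with hypothetical names (`ToyWrapper`) and plain matrices in place of a named-axis backend. It shows one way to hold several arrays plus metadata and expose the arrays as properties, and it should not be read as the package's actual wrapper implementation.

```julia
# Toy wrapper: several named arrays plus the metadata the backend cannot hold.
struct ToyWrapper
    datasets::Dict{Symbol,Matrix{Float32}}
    attributes::Dict{String,Any}
    encoding::Dict{String,Any}
end

# Let `wrapper.z` return the array stored under :z, while the real field
# names (datasets, attributes, encoding) keep working as usual.
function Base.getproperty(w::ToyWrapper, name::Symbol)
    datasets = getfield(w, :datasets)
    haskey(datasets, name) && return datasets[name]
    return getfield(w, name)
end

wrapper = ToyWrapper(
    Dict(:z => zeros(Float32, 2, 2), :t => ones(Float32, 2, 2)),
    Dict{String,Any}("Conventions" => "CF-1.7"),
    Dict{String,Any}(),
)

z = wrapper.z                      # one of the wrapped arrays
conventions = wrapper.attributes   # metadata stored alongside the arrays
```

Overloading getproperty keeps access short, at the cost that a variable named like one of the struct fields would shadow that field. A real implementation would need to decide how to handle such a clash.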

To access the data you first access a single specific dataset and then index into it as per the docs for your chosen backend. For example, above we use AxisArrays as the backend, so:

julia> using AxisArrays

julia> z = dimensional_dataset.z;  # Looking at the `z` variable

julia> z[number=atvalue(0), isobaricInhPa=700..900, longitude=40..44]
4-dimensional AxisArray{Union{Missing, Float32},4,...} with axes:
    :time, [1483228800, 1483272000, 1483315200, 1483358400]
    :isobaricInhPa, [850, 500]
    :longitude, [42.0]
    :latitude, [90.0, 87.0, 84.0, 81.0, 78.0, 75.0, 72.0, 69.0, 66.0, 63.0  …  -63.0, -66.0, -69.0, -72.0, -75.0, -78.0, -81.0, -84.0, -87.0, -90.0]
And data, a 4×2×1×61 Array{Union{Missing, Float32},4}:
[:, :, 1, 1] =
 14201.8  51169.7
 14016.0  51021.0
 13708.7  50662.6
 13255.5  50020.9

[:, :, 1, 2] =
 14461.5  51376.7
 14347.1  51350.8
 14031.6  51023.8
 13555.4  50434.4

[:, :, 1, 3] =
 14355.8  51298.5
 14449.1  51440.5
 14255.8  51261.3
 13881.7  50837.9

...

[:, :, 1, 59] =
 12971.8  51226.2
 12980.1  51227.8
 13027.4  51272.8
 13053.1  51351.6

[:, :, 1, 60] =
 12869.4  51110.7
 12939.5  51117.5
 13028.1  51166.1
 13029.5  51173.4

[:, :, 1, 61] =
 12660.4  50866.5
 12797.1  50978.3
 12963.9  51146.1
 12987.2  51174.9