In Depth Guide
This is a start guide for how the internals of CfGRIB.jl work, targeted towards advanced users or those who want to work with the internals of the code.
If you want a quick guide on how to use the package then check the Quick Start Guide
Internals
The package internals are covered in the library section of the documentation in greater detail, however it is useful to have a vague sense of what is happening when you load a dataset.
First, we load the package, and for convenience create a string pointing to our file path:
julia> using CfGRIB
julia> sample_data_dir = abspath(joinpath(dirname(pathof(CfGRIB)), "..", "test", "sample-data"))
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data"
julia> demo_file_path = joinpath(sample_data_dir, "era5-levels-members.grib")
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib"FileIndex
Whenever you load a grib file, the first thing that happens is that the file index is read. The file index contains metadata which describes which messages contain what information inside the file. We can explore the index by manually creating a FileIndex object.
First, we can look at the docstring for the FileIndex constructor by typing in ? at the REPL to enter help mode, then type in CfGRIB.FileIndex, press enter, and we get the docstring:
help?> CfGRIB.FileIndex
Summary
≡≡≡≡≡≡≡≡≡
mutable struct FileIndex
A mutable store for indices of a GRIB file
TODO: Should probably change this to a immutable struct
Fields
≡≡≡≡≡≡≡≡
• allowed_protocol_version::VersionNumber
Version number used when saving/hashing index files, should change if
the indexing structure changes breaking backwards-compatibility
• grib_path::String
Path to the file the index belongs to
• index_path::String
Path to the index cache file
• index_keys::Array{String,1}
Array containing all of the index keys
• offsets::Array{Pair{NamedTuple,Int64},1}
Array containing pairs of offsets[HeaderTuple(header_values)] => offset_field
• message_lengths::Array{Int64,1}
Array containing the length of each message in the GRIB file
• header_values::OrderedCollections.OrderedDict{String,Array}
Dictionary of all of the loaded header values in the GRIB file
• filter_by_keys::Dict
Filters used when creating the file index
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
FileIndex()
defined at dev/CfGRIB/src/indexing.jl:34
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L34).
FileIndex(grib_path, index_keys)
defined at dev/CfGRIB/src/indexing.jl:38
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L38).The docstring is quite long, it explains the fields contained in the object, as well as giving a list of the constructors which can be used to create an instance of the object.
We'll use the second constructor, which takes in a path to the file and a list of keys. First, we pick which keys we want to use. In this case we'll just use the ALL_KEYS constant:
julia> println(CfGRIB.ALL_KEYS)
["DxInMetres", "DyInMetres", "J", "K", "LaDInDegrees", "Latin1InDegrees", "Latin2InDegrees", "LoVInDegrees", "M", "N", "NV", "Nx", "Ny", "angleOfRotationInDegrees", "centre", "centreDescription", "cfName", "cfVarName", "dataDate", "dataTime", "dataType", "directionNumber", "edition", "endStep", "forecastMonth", "frequencyNumber", "gridDefinitionDescription", "gridType", "iDirectionIncrementInDegrees", "iScansNegatively", "indexing_time", "jDirectionIncrementInDegrees", "jPointsAreConsecutive", "jScansPositively", "latitudeOfFirstGridPointInDegrees", "latitudeOfLastGridPointInDegrees", "latitudeOfSouthernPoleInDegrees", "level", "longitudeOfFirstGridPointInDegrees", "longitudeOfLastGridPointInDegrees", "longitudeOfSouthernPoleInDegrees", "missingValue", "name", "number", "numberOfDirections", "numberOfFrequencies", "numberOfPoints", "paramId", "pl", "shortName", "step", "stepType", "stepUnits", "subCentre", "time", "totalNumber", "typeOfLevel", "units", "valid_time", "verifying_time"]
julia> index = CfGRIB.FileIndex(
demo_file_path,
CfGRIB.ALL_KEYS
);From here you can explore fields contained in this object. Typically you will never interact with the FileIndex directly, as it's just used in the background to load the data.
DataSet
Once the FileIndex has been created, the next step is to use it to create a DataSet object. The DataSet is what what you use to access the stored data. The docstring says:
help?> CfGRIB.DataSet
Summary
≡≡≡≡≡≡≡≡≡
struct DataSet
Map a GRIB file to the NetCDF Common Data Model with CF Conventions.
Fields
≡≡≡≡≡≡≡≡
• dimensions::OrderedCollections.OrderedDict{String,Int64}
OrderedDict{String,Int} of $DIMENSION_NAME => $DIMENSION_LENGTH.
• variables::OrderedCollections.OrderedDict{String,CfGRIB.Variable}
OrderedDict{String,CfGRIB.Variable} of $DIMENSION_NAME => $DIMENSION_VARIABLE, where the the variable is a CfGRIB.jl Variable.
• attributes::OrderedCollections.OrderedDict{String,Any}
OrderedDict{String,Any} containing some metadata extracted from the file.
• encoding::Dict{String,Any}
Dict{String,Any} containing metadata related to CfGRIB.jl, e.g. filter_by_keys
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
DataSet(dimensions, variables, attributes, encoding)
defined at dev/CfGRIB/src/dataset.jl:127
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).
DataSet(path; read_keys, kwargs...)
defined at dev/CfGRIB/src/dataset.jl:140
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).Here we see references to Variable, so we'll briefly explain those.
Variable
A Variable is a basic struct in CfGRIB.jl which contains information for a variable read from a GRIB file:
help?> CfGRIB.Variable
Summary
≡≡≡≡≡≡≡≡≡
struct Variable
Struct describing a cfgrib variable
Fields
≡≡≡≡≡≡≡≡
• dimensions::Tuple{Vararg{String,N} where N}
Name of the dimension(s) contained in this variable
• data::Union{CfGRIB.OnDiskArray, Number, Array}
Data contained in the variable, can point ot in-memory data or to a CfGRIB
OnDiskArray
• attributes::Dict{String,Any}
Dictionary containing metadata for the variable, typically the units, the long name,
and the standard name
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Variable(dimensions, data, attributes)
defined at dev/CfGRIB/src/dataset.jl:108
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L108).OnDiskArray
As explained above, Variables contain a data field, this data can either be in-memory data (Array, Number), or it could be an OnDiskArray. On disk arrays are, as the name hints, a way to represent data stored on the disk before that data is loaded.
This is done do make it a bit easier to deal with large datasets, as the data is only lazily loaded in when the user attempts to read it. And then, only the requested data is stored in memory.
help?> CfGRIB.OnDiskArray
Summary
≡≡≡≡≡≡≡≡≡
struct OnDiskArray
Struct that contains metadata for an array, used to lazy-load the array from disk only when
requested
Fields
≡≡≡≡≡≡≡≡
• grib_path::String
• size::Tuple
• offsets::OrderedCollections.OrderedDict
• message_lengths::Array{Int64,1}
• missing_value::Any
• geo_ndim::Int64
• dtype::Type
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
OnDiskArray(grib_path, size, offsets, message_lengths, missing_value, geo_ndim, dtype)
defined at dev/CfGRIB/src/dataset.jl:27
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L27).The OnDiskArray object contains enough information to fully describe the data stored on disk, and to allow for easy indexing into this data. A custom getindex method dispatches off of this type which opens the grib file at grib_path and reads only the relevant messages.
For example, if a 3 dimensional array is described by OnDiskArray, and the user requests information with index [1, :, :], then only messages within that index are loaded from the grib file.
DataSet Constructors
Now that the groundwork is laid down, lets look into how files are read and used in the end. The most basic option is calling DataSet with a string as a path, this will use the constructor defined at dev/CfGRIB/src/dataset.jl:140 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).
As you can see this creates a FileIndex, and then returns:
DataSet(build_dataset_components(
index;
errors=errors,
encode_cf=encode_cf,
squeeze=squeeze,
read_keys=read_keys,
time_dims=time_dims,
)...)The call to build_dataset_components returns the dimensions, variables, attributes, and encoding read from a file. These four variables are then passed to the other relevant constructor defined at dev/CfGRIB/src/dataset.jl:127 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).
The constructor then returns a DataSet object.
Getting Data from a DataSet
Onc you have a DataSet object, you probably want to access its data.
Direct Access
The most basic way to do this is to just access the variables directly. For example:
julia> dataset = CfGRIB.DataSet(demo_file_path);
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
julia> dataset.dimensions
OrderedCollections.OrderedDict{Any,Any} with 5 entries:
"number" => 10
"time" => 4
"isobaricInhPa" => 2
"longitude" => 120
"latitude" => 61
julia> dataset.variables
OrderedCollections.OrderedDict{Any,Any} with 9 entries:
"number" => Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict…
"time" => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
"step" => Variable((), 0, Dict{String,Any}("units"=>"hours","long_na…
"isobaricInhPa" => Variable(("isobaricInhPa",), [850, 500], Dict{String,Any}(…
"latitude" => Variable(("latitude",), [90.0, 87.0, 84.0, 81.0, 78.0, 75.…
"longitude" => Variable(("longitude",), [0.0, 3.0, 6.0, 9.0, 12.0, 15.0, …
"valid_time" => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
"z" => Variable(("number", "time", "isobaricInhPa", "longitude", …
"t" => Variable(("number", "time", "isobaricInhPa", "longitude", …From here you can check the Variable documentation to see what is stored in these. So, if we want to get the data for z:
julia> dataset.variables["z"]
CfGRIB.Variable(("number", "time", "isobaricInhPa", "longitude", "latitude"), CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752 … 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32), Dict{String,Any}("GRIB_units" => "m**2 s**-2","long_name" => "Geopotential","GRIB_dataType" => "an","GRIB_totalNumber" => 10,"GRIB_jScansPositively" => 0,"GRIB_name" => "Geopotential","GRIB_gridType" => "regular_ll","GRIB_Ny" => 61,"GRIB_longitudeOfLastGridPointInDegrees" => 357.0,"GRIB_stepUnits" => 1…))
julia> dataset.variables["z"].data
CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752 … 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32)
julia> convert(Array, dataset.variables["z"].data)[:, :, 1, 1, 1]
10×4 Array{Union{Missing, Float32},2}:
14201.8 14016.0 13708.7 13255.5
14209.2 14010.4 13702.5 13259.5
14212.2 14023.9 13696.9 13254.6
14205.7 14034.3 13710.4 13246.6
14203.8 14023.8 13711.3 13270.9
14205.9 14009.2 13691.9 13261.1
14197.3 14011.9 13711.4 13246.1
14198.7 13994.9 13702.4 13254.3
14215.0 14001.4 13710.5 13245.8
14202.1 14008.7 13709.8 13251.7Since it's an OnDiskArray it has to be converted (which in this case just reads the data from disk) into an Array. Once that's done, it's just a standard array type which can be accessed.
For a normal variable stored in memory this is a bit easier as the reading step does not have to be performed:
julia> dataset.variables["number"]
CfGRIB.Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict{String,Any}("units" => "1","long_name" => "ensemble member numerical id","standard_name" => "realization"))
julia> dataset.variables["number"].data
10-element Array{Int64,1}:
0
1
2
3
4
5
6
7
8
9Accessing all of the data this way would be extremely awkward, so we provide a number of multidimensional named-axis backends which make data access far easier.
Using Named Dimensional Backends
The recommended way to use CfGRIB.jl is to use an array backend. More information about backends can be found on the Backends documentation page.
If one of the backend dependencies is available you can convert to that backend data type with the convert function:
julia> using AxisArrays
julia> dimensional_dataset = convert(AxisArray, dataset)
AxisArrayWrapper with 2 dataset(s)
OrderedCollections.OrderedDict{Any,Any} with 7 entries:
"GRIB_edition" => 1
"GRIB_centre" => "ecmf"
"GRIB_centreDescription" => "European Centre for Medium-Range Weather Forecas…
"GRIB_subCentre" => 0
"Conventions" => "CF-1.7"
"institution" => "European Centre for Medium-Range Weather Forecas…
"history" => "2020-11-29T13:19:12.208 GRIB to CDM+CF via cfgri…This conversion to a backend will create an object for that specific backend, preserving all of the data that was present in our DataSet objects (e.g. the metadata will all be propagated through).
Current backend implementations have two limitations:
- No 'dataset' like support
- No metadata support
These limitations mean that we have to create a wrapper struct which can hold the multidimensional array type from the backend, as well as some additional attributes.
In the python xarray package, there are two basic types: a DataArray and a DataSet. The DataArray is a multidimensional array of a single variable, which contains information for that variable as well as information about the dimensions which enables useful indexing capabilities.
The DataSet is a set of multiple DataArrays with common dimensions. This lets you have a DataArray containing pressure information with dimensions of, for example, time, latitude, longitude, and height; if you have another set of data with the same dimensions but for temperature then you can store both in a singe DataSet.
The backends we currently use do not have this functionality, so instead we just wrap the two variables and allow for easy access to both.
Additionally, our DataSet contains some more metadata (such as the attributes and encoding information), which also cannot be stored in the array backends, so we store that in the wrapper as well.
To access the data you first access a single specific dataset and then index into it as per the docs for your chosen backend. For example, above we use AxisArrays as the backend, so:
julia> using AxisArrays
julia> z = dimensional_dataset.z; # Looking at the `z` variable
julia> z[number=atvalue(0), isobaricInhPa=700..900, longitude=40..44]
4-dimensional AxisArray{Union{Missing, Float32},4,...} with axes:
:time, [1483228800, 1483272000, 1483315200, 1483358400]
:isobaricInhPa, [850, 500]
:longitude, [42.0]
:latitude, [90.0, 87.0, 84.0, 81.0, 78.0, 75.0, 72.0, 69.0, 66.0, 63.0 … -63.0, -66.0, -69.0, -72.0, -75.0, -78.0, -81.0, -84.0, -87.0, -90.0]
And data, a 4×2×1×61 Array{Union{Missing, Float32},4}:
[:, :, 1, 1] =
14201.8 51169.7
14016.0 51021.0
13708.7 50662.6
13255.5 50020.9
[:, :, 1, 2] =
14461.5 51376.7
14347.1 51350.8
14031.6 51023.8
13555.4 50434.4
[:, :, 1, 3] =
14355.8 51298.5
14449.1 51440.5
14255.8 51261.3
13881.7 50837.9
...
[:, :, 1, 59] =
12971.8 51226.2
12980.1 51227.8
13027.4 51272.8
13053.1 51351.6
[:, :, 1, 60] =
12869.4 51110.7
12939.5 51117.5
13028.1 51166.1
13029.5 51173.4
[:, :, 1, 61] =
12660.4 50866.5
12797.1 50978.3
12963.9 51146.1
12987.2 51174.9