In Depth Guide
This is a start guide for how the internals of CfGRIB.jl work, targeted towards advanced users or those who want to work with the internals of the code.
If you want a quick guide on how to use the package then check the Quick Start Guide
Internals
The package internals are covered in the library section of the documentation in greater detail, however it is useful to have a vague sense of what is happening when you load a dataset.
First, we load the package, and for convenience create a string pointing to our file path:
julia> using CfGRIB
julia> sample_data_dir = abspath(joinpath(dirname(pathof(CfGRIB)), "..", "test", "sample-data"))
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data"
julia> demo_file_path = joinpath(sample_data_dir, "era5-levels-members.grib")
"/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib"
FileIndex
Whenever you load a grib
file, the first thing that happens is that the file index is read. The file index contains metadata which describes which messages contain what information inside the file. We can explore the index by manually creating a FileIndex
object.
First, we can look at the docstring for the FileIndex
constructor by typing in ?
at the REPL to enter help
mode, then type in CfGRIB.FileIndex
, press enter, and we get the docstring:
help?> CfGRIB.FileIndex
Summary
≡≡≡≡≡≡≡≡≡
mutable struct FileIndex
A mutable store for indices of a GRIB file
TODO: Should probably change this to a immutable struct
Fields
≡≡≡≡≡≡≡≡
• allowed_protocol_version::VersionNumber
Version number used when saving/hashing index files, should change if
the indexing structure changes breaking backwards-compatibility
• grib_path::String
Path to the file the index belongs to
• index_path::String
Path to the index cache file
• index_keys::Array{String,1}
Array containing all of the index keys
• offsets::Array{Pair{NamedTuple,Int64},1}
Array containing pairs of offsets[HeaderTuple(header_values)] => offset_field
• message_lengths::Array{Int64,1}
Array containing the length of each message in the GRIB file
• header_values::OrderedCollections.OrderedDict{String,Array}
Dictionary of all of the loaded header values in the GRIB file
• filter_by_keys::Dict
Filters used when creating the file index
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
FileIndex()
defined at dev/CfGRIB/src/indexing.jl:34
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L34).
FileIndex(grib_path, index_keys)
defined at dev/CfGRIB/src/indexing.jl:38
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/indexing.jl#L38).
The docstring is quite long, it explains the fields contained in the object, as well as giving a list of the constructors which can be used to create an instance of the object.
We'll use the second constructor, which takes in a path to the file and a list of keys. First, we pick which keys we want to use. In this case we'll just use the ALL_KEYS
constant:
julia> println(CfGRIB.ALL_KEYS)
["DxInMetres", "DyInMetres", "J", "K", "LaDInDegrees", "Latin1InDegrees", "Latin2InDegrees", "LoVInDegrees", "M", "N", "NV", "Nx", "Ny", "angleOfRotationInDegrees", "centre", "centreDescription", "cfName", "cfVarName", "dataDate", "dataTime", "dataType", "directionNumber", "edition", "endStep", "forecastMonth", "frequencyNumber", "gridDefinitionDescription", "gridType", "iDirectionIncrementInDegrees", "iScansNegatively", "indexing_time", "jDirectionIncrementInDegrees", "jPointsAreConsecutive", "jScansPositively", "latitudeOfFirstGridPointInDegrees", "latitudeOfLastGridPointInDegrees", "latitudeOfSouthernPoleInDegrees", "level", "longitudeOfFirstGridPointInDegrees", "longitudeOfLastGridPointInDegrees", "longitudeOfSouthernPoleInDegrees", "missingValue", "name", "number", "numberOfDirections", "numberOfFrequencies", "numberOfPoints", "paramId", "pl", "shortName", "step", "stepType", "stepUnits", "subCentre", "time", "totalNumber", "typeOfLevel", "units", "valid_time", "verifying_time"]
julia> index = CfGRIB.FileIndex(
demo_file_path,
CfGRIB.ALL_KEYS
);
From here you can explore fields contained in this object. Typically you will never interact with the FileIndex
directly, as it's just used in the background to load the data.
DataSet
Once the FileIndex
has been created, the next step is to use it to create a DataSet
object. The DataSet
is what what you use to access the stored data. The docstring says:
help?> CfGRIB.DataSet
Summary
≡≡≡≡≡≡≡≡≡
struct DataSet
Map a GRIB file to the NetCDF Common Data Model with CF Conventions.
Fields
≡≡≡≡≡≡≡≡
• dimensions::OrderedCollections.OrderedDict{String,Int64}
OrderedDict{String,Int} of $DIMENSION_NAME => $DIMENSION_LENGTH.
• variables::OrderedCollections.OrderedDict{String,CfGRIB.Variable}
OrderedDict{String,CfGRIB.Variable} of $DIMENSION_NAME => $DIMENSION_VARIABLE, where the the variable is a CfGRIB.jl Variable.
• attributes::OrderedCollections.OrderedDict{String,Any}
OrderedDict{String,Any} containing some metadata extracted from the file.
• encoding::Dict{String,Any}
Dict{String,Any} containing metadata related to CfGRIB.jl, e.g. filter_by_keys
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
DataSet(dimensions, variables, attributes, encoding)
defined at dev/CfGRIB/src/dataset.jl:127
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).
DataSet(path; read_keys, kwargs...)
defined at dev/CfGRIB/src/dataset.jl:140
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).
Here we see references to Variable
, so we'll briefly explain those.
Variable
A Variable
is a basic struct in CfGRIB.jl which contains information for a variable read from a GRIB file:
help?> CfGRIB.Variable
Summary
≡≡≡≡≡≡≡≡≡
struct Variable
Struct describing a cfgrib variable
Fields
≡≡≡≡≡≡≡≡
• dimensions::Tuple{Vararg{String,N} where N}
Name of the dimension(s) contained in this variable
• data::Union{CfGRIB.OnDiskArray, Number, Array}
Data contained in the variable, can point ot in-memory data or to a CfGRIB
OnDiskArray
• attributes::Dict{String,Any}
Dictionary containing metadata for the variable, typically the units, the long name,
and the standard name
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Variable(dimensions, data, attributes)
defined at dev/CfGRIB/src/dataset.jl:108
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L108).
OnDiskArray
As explained above, Variable
s contain a data
field, this data can either be in-memory data (Array
, Number
), or it could be an OnDiskArray
. On disk arrays are, as the name hints, a way to represent data stored on the disk before that data is loaded.
This is done do make it a bit easier to deal with large datasets, as the data is only lazily loaded in when the user attempts to read it. And then, only the requested data is stored in memory.
help?> CfGRIB.OnDiskArray
Summary
≡≡≡≡≡≡≡≡≡
struct OnDiskArray
Struct that contains metadata for an array, used to lazy-load the array from disk only when
requested
Fields
≡≡≡≡≡≡≡≡
• grib_path::String
• size::Tuple
• offsets::OrderedCollections.OrderedDict
• message_lengths::Array{Int64,1}
• missing_value::Any
• geo_ndim::Int64
• dtype::Type
Constructors
≡≡≡≡≡≡≡≡≡≡≡≡≡≡
OnDiskArray(grib_path, size, offsets, message_lengths, missing_value, geo_ndim, dtype)
defined at dev/CfGRIB/src/dataset.jl:27
(https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L27).
The OnDiskArray
object contains enough information to fully describe the data stored on disk, and to allow for easy indexing into this data. A custom getindex
method dispatches off of this type which opens the grib file at grib_path
and reads only the relevant messages.
For example, if a 3 dimensional array is described by OnDiskArray
, and the user requests information with index [1, :, :]
, then only messages within that index are loaded from the grib file.
DataSet
Constructors
Now that the groundwork is laid down, lets look into how files are read and used in the end. The most basic option is calling DataSet
with a string as a path, this will use the constructor defined at dev/CfGRIB/src/dataset.jl:140 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L140).
As you can see this creates a FileIndex
, and then returns:
DataSet(build_dataset_components(
index;
errors=errors,
encode_cf=encode_cf,
squeeze=squeeze,
read_keys=read_keys,
time_dims=time_dims,
)...)
The call to build_dataset_components
returns the dimensions, variables, attributes, and encoding read from a file. These four variables are then passed to the other relevant constructor defined at dev/CfGRIB/src/dataset.jl:127 (https://github.com/ecmwf/cfgrib.jl/tree/5ced129d540ed9a1ff57da48c9b4f047b17d936d//src/dataset.jl#L127).
The constructor then returns a DataSet
object.
Getting Data from a DataSet
Onc you have a DataSet
object, you probably want to access its data.
Direct Access
The most basic way to do this is to just access the variables
directly. For example:
julia> dataset = CfGRIB.DataSet(demo_file_path);
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream directionNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
┌ Warning: Missing from GRIB Stream frequencyNumber
└ @ CfGRIB ~/.julia/packages/CfGRIB/t9LHA/src/dataset.jl:327
julia> dataset.dimensions
OrderedCollections.OrderedDict{Any,Any} with 5 entries:
"number" => 10
"time" => 4
"isobaricInhPa" => 2
"longitude" => 120
"latitude" => 61
julia> dataset.variables
OrderedCollections.OrderedDict{Any,Any} with 9 entries:
"number" => Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict…
"time" => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
"step" => Variable((), 0, Dict{String,Any}("units"=>"hours","long_na…
"isobaricInhPa" => Variable(("isobaricInhPa",), [850, 500], Dict{String,Any}(…
"latitude" => Variable(("latitude",), [90.0, 87.0, 84.0, 81.0, 78.0, 75.…
"longitude" => Variable(("longitude",), [0.0, 3.0, 6.0, 9.0, 12.0, 15.0, …
"valid_time" => Variable(("time",), [1483228800, 1483272000, 1483315200, 1…
"z" => Variable(("number", "time", "isobaricInhPa", "longitude", …
"t" => Variable(("number", "time", "isobaricInhPa", "longitude", …
From here you can check the Variable
documentation to see what is stored in these. So, if we want to get the data for z
:
julia> dataset.variables["z"]
CfGRIB.Variable(("number", "time", "isobaricInhPa", "longitude", "latitude"), CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752 … 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32), Dict{String,Any}("GRIB_units" => "m**2 s**-2","long_name" => "Geopotential","GRIB_dataType" => "an","GRIB_totalNumber" => 10,"GRIB_jScansPositively" => 0,"GRIB_name" => "Geopotential","GRIB_gridType" => "regular_ll","GRIB_Ny" => 61,"GRIB_longitudeOfLastGridPointInDegrees" => 357.0,"GRIB_stepUnits" => 1…))
julia> dataset.variables["z"].data
CfGRIB.OnDiskArray("/home/runner/.julia/packages/CfGRIB/t9LHA/test/sample-data/era5-levels-members.grib", (10, 4, 2, 120, 61), OrderedCollections.OrderedDict((1, 1, 2) => 0,(2, 1, 2) => 14760,(3, 1, 2) => 29520,(4, 1, 2) => 44280,(5, 1, 2) => 59040,(6, 1, 2) => 73800,(7, 1, 2) => 88560,(8, 1, 2) => 103320,(9, 1, 2) => 118080,(10, 1, 2) => 132840…), [14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752 … 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752, 14752], 9999, 2, Float32)
julia> convert(Array, dataset.variables["z"].data)[:, :, 1, 1, 1]
10×4 Array{Union{Missing, Float32},2}:
14201.8 14016.0 13708.7 13255.5
14209.2 14010.4 13702.5 13259.5
14212.2 14023.9 13696.9 13254.6
14205.7 14034.3 13710.4 13246.6
14203.8 14023.8 13711.3 13270.9
14205.9 14009.2 13691.9 13261.1
14197.3 14011.9 13711.4 13246.1
14198.7 13994.9 13702.4 13254.3
14215.0 14001.4 13710.5 13245.8
14202.1 14008.7 13709.8 13251.7
Since it's an OnDiskArray
it has to be converted
(which in this case just reads the data from disk) into an Array. Once that's done, it's just a standard array type which can be accessed.
For a normal variable stored in memory this is a bit easier as the reading step does not have to be performed:
julia> dataset.variables["number"]
CfGRIB.Variable(("number",), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Dict{String,Any}("units" => "1","long_name" => "ensemble member numerical id","standard_name" => "realization"))
julia> dataset.variables["number"].data
10-element Array{Int64,1}:
0
1
2
3
4
5
6
7
8
9
Accessing all of the data this way would be extremely awkward, so we provide a number of multidimensional named-axis backends which make data access far easier.
Using Named Dimensional Backends
The recommended way to use CfGRIB.jl is to use an array backend. More information about backends can be found on the Backends documentation page.
If one of the backend dependencies is available you can convert to that backend data type with the convert
function:
julia> using AxisArrays
julia> dimensional_dataset = convert(AxisArray, dataset)
AxisArrayWrapper with 2 dataset(s)
OrderedCollections.OrderedDict{Any,Any} with 7 entries:
"GRIB_edition" => 1
"GRIB_centre" => "ecmf"
"GRIB_centreDescription" => "European Centre for Medium-Range Weather Forecas…
"GRIB_subCentre" => 0
"Conventions" => "CF-1.7"
"institution" => "European Centre for Medium-Range Weather Forecas…
"history" => "2020-11-29T13:19:12.208 GRIB to CDM+CF via cfgri…
This conversion to a backend will create an object for that specific backend, preserving all of the data that was present in our DataSet
objects (e.g. the metadata will all be propagated through).
Current backend implementations have two limitations:
- No 'dataset' like support
- No metadata support
These limitations mean that we have to create a wrapper struct which can hold the multidimensional array type from the backend, as well as some additional attributes.
In the python xarray
package, there are two basic types: a DataArray
and a DataSet
. The DataArray
is a multidimensional array of a single variable, which contains information for that variable as well as information about the dimensions which enables useful indexing capabilities.
The DataSet
is a set of multiple DataArray
s with common dimensions. This lets you have a DataArray
containing pressure information with dimensions of, for example, time, latitude, longitude, and height; if you have another set of data with the same dimensions but for temperature then you can store both in a singe DataSet
.
The backends we currently use do not have this functionality, so instead we just wrap the two variables and allow for easy access to both.
Additionally, our DataSet
contains some more metadata (such as the attributes and encoding information), which also cannot be stored in the array backends, so we store that in the wrapper as well.
To access the data you first access a single specific dataset and then index into it as per the docs for your chosen backend. For example, above we use AxisArrays
as the backend, so:
julia> using AxisArrays
julia> z = dimensional_dataset.z; # Looking at the `z` variable
julia> z[number=atvalue(0), isobaricInhPa=700..900, longitude=40..44]
4-dimensional AxisArray{Union{Missing, Float32},4,...} with axes:
:time, [1483228800, 1483272000, 1483315200, 1483358400]
:isobaricInhPa, [850, 500]
:longitude, [42.0]
:latitude, [90.0, 87.0, 84.0, 81.0, 78.0, 75.0, 72.0, 69.0, 66.0, 63.0 … -63.0, -66.0, -69.0, -72.0, -75.0, -78.0, -81.0, -84.0, -87.0, -90.0]
And data, a 4×2×1×61 Array{Union{Missing, Float32},4}:
[:, :, 1, 1] =
14201.8 51169.7
14016.0 51021.0
13708.7 50662.6
13255.5 50020.9
[:, :, 1, 2] =
14461.5 51376.7
14347.1 51350.8
14031.6 51023.8
13555.4 50434.4
[:, :, 1, 3] =
14355.8 51298.5
14449.1 51440.5
14255.8 51261.3
13881.7 50837.9
...
[:, :, 1, 59] =
12971.8 51226.2
12980.1 51227.8
13027.4 51272.8
13053.1 51351.6
[:, :, 1, 60] =
12869.4 51110.7
12939.5 51117.5
13028.1 51166.1
13029.5 51173.4
[:, :, 1, 61] =
12660.4 50866.5
12797.1 50978.3
12963.9 51146.1
12987.2 51174.9