Encoding ODB-2 Data
Trivial Example
Given a pandas DataFrame, encode it by passing it to the encode_odb() function:
[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc
df = pd.read_csv('data-1.csv')
odc.encode_odb(df, 'example-1.odb')
File Type Object
Encoding of ODB-2 data works with file-like objects as well as with file names:
[3]:
with open('example-1.odb', 'wb') as f:
    odc.encode_odb(df, f)
Configuring Encoded Columns
By default, pyodc encodes ODB-2 data losslessly. In particular, most values are encoded as 8-byte DOUBLE values.
Normally the encoder selects a data type and a corresponding encoder automatically. To override the data type, supply a types dictionary, for example to encode a column as a 4-byte REAL value:
[4]:
odc.encode_odb(df, 'example-3.odb', types={'obsvalue@body': odc.REAL})
Inspecting the frame headers shows that the data type has changed:
[5]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r3 = odc.Reader('example-3.odb', aggregated=False)
print('original:', r1.frames[0].column_dict['obsvalue@body'].dtype)
print('updated: ', r3.frames[0].column_dict['obsvalue@body'].dtype)
original: DataType.DOUBLE
updated: DataType.REAL
The decoded data also confirms that the precision has been reduced accordingly:
[6]:
df_decoded = odc.read_odb('example-3.odb', single=True)
print(df_decoded)
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.000000
1 1 20210420 stat01 0-12345-0-67891 12.345600
2 1 20210420 stat02 0-12345-0-67892 24.691200
3 1 20210420 stat03 0-12345-0-67893 37.036800
4 1 20210420 stat04 0-12345-0-67894 49.382401
5 1 20210420 stat05 0-12345-0-67895 61.728001
6 1 20210420 stat06 0-12345-0-67896 74.073601
7 1 20210420 stat07 0-12345-0-67897 86.419197
8 1 20210420 stat08 0-12345-0-67898 98.764801
9 1 20210420 stat09 0-12345-0-67899 111.110397
integer_missing double_missing
0 1234.0 12.34
1 4321.0 43.21
2 NaN NaN
3 1234.0 12.34
4 4321.0 43.21
5 NaN NaN
6 1234.0 12.34
7 4321.0 43.21
8 NaN NaN
9 1234.0 12.34
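The small discrepancies in the obsvalue@body column above (for example 49.382401 where the input was 49.3824) are what one would expect when a value is stored in 4 bytes. The following sketch does not use pyodc at all. It assumes that REAL corresponds to an IEEE 754 single-precision float, and it uses the standard struct module to round-trip a value through 4 bytes. The helper name to_single_precision is made up for this illustration.

```python
import struct


def to_single_precision(value):
    """Round-trip a Python float (8 bytes) through a 4-byte IEEE 754 float."""
    # Assumption: REAL is stored as a standard single-precision float.
    return struct.unpack('<f', struct.pack('<f', value))[0]


original = 49.3824
reduced = to_single_precision(original)

print(f'{original:.6f}')    # value as held in an 8-byte float
print(f'{reduced:.6f}')     # value after passing through 4 bytes
print(reduced == original)  # whether the round trip preserved the value
```

If a difference of this size matters for a column, leaving that column at its default type avoids the loss.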
Configuring Frame Structure
ODB-2 data is divided into frames. By default, at most 10 000 rows are encoded into each frame. If more than 10 000 rows are supplied, the data is split into a sequence of frames, each containing at most 10 000 rows.
To change this threshold, pass the rows_per_frame argument:
[7]:
odc.encode_odb(df, 'example-4.odb', rows_per_frame=3)
Examining the frame structure shows that the data now contains multiple frames:
[8]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r4 = odc.Reader('example-4.odb', aggregated=False)
print('original frames:', r1.frames)
print('updated frames:', r4.frames)
print('original row counts:', [f.nrows for f in r1.frames])
print('updated row counts:', [f.nrows for f in r4.frames])
original frames: [<pyodc.frame.Frame object at 0x7f85a83d99d0>]
updated frames: [<pyodc.frame.Frame object at 0x7f85b8818df0>, <pyodc.frame.Frame object at 0x7f85a83b6c70>, <pyodc.frame.Frame object at 0x7f85a8570940>, <pyodc.frame.Frame object at 0x7f85a8570be0>]
original row counts: [10]
updated row counts: [3, 3, 3, 1]
Despite these differences, the decoded data is the same:
[9]:
df_decoded = odc.read_odb('example-4.odb', single=True)
print(df_decoded)
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
3 1 20210420 stat03 0-12345-0-67893 37.0368
4 1 20210420 stat04 0-12345-0-67894 49.3824
5 1 20210420 stat05 0-12345-0-67895 61.7280
6 1 20210420 stat06 0-12345-0-67896 74.0736
7 1 20210420 stat07 0-12345-0-67897 86.4192
8 1 20210420 stat08 0-12345-0-67898 98.7648
9 1 20210420 stat09 0-12345-0-67899 111.1104
integer_missing double_missing
0 1234.0 12.34
1 4321.0 43.21
2 NaN NaN
3 1234.0 12.34
4 4321.0 43.21
5 NaN NaN
6 1234.0 12.34
7 4321.0 43.21
8 NaN NaN
9 1234.0 12.34
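The row counts [3, 3, 3, 1] reported earlier in this section follow from simple chunking. The sketch below is not pyodc's implementation. It is a plain-Python illustration, using the hypothetical helper frame_row_counts, of how a total row count would be divided if each frame is filled up to rows_per_frame rows and the last frame takes the remainder.

```python
def frame_row_counts(total_rows, rows_per_frame=10000):
    """Return the number of rows each frame would hold under simple chunking."""
    if rows_per_frame <= 0:
        raise ValueError('rows_per_frame must be positive')
    counts = []
    remaining = total_rows
    while remaining > 0:
        # Every frame is filled to the limit except possibly the last one.
        counts.append(min(rows_per_frame, remaining))
        remaining -= counts[-1]
    return counts


print(frame_row_counts(10))                    # default limit
print(frame_row_counts(10, rows_per_frame=3))  # limit used for example-4.odb
```

A smaller rows_per_frame therefore produces more frames for the same data, without changing the rows themselves.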
Additional Properties
To encode additional properties as part of a frame's data, pass a dictionary of the properties to include as the properties parameter of the encode_odb() function:
[10]:
metadata = {
    'encoded_by': 'ECMWF',
    'data_source': 'pyodc_docs',
}
odc.encode_odb(df, 'example-1.odb', properties=metadata)
Encoded properties are accessible via the properties attribute of the frame object:
[11]:
r1 = odc.Reader('example-1.odb')
print([f.properties for f in r1.frames])
[{}]