These notes form the basis for an as-yet-unwritten "String Datatypes" UG section.
H5Pset_char_encoding
(objects),
H5Tset_cset
(datatypes)
H5Tvlen_create
creates something quite different
Creating variable-length string datatypes
A heavily revised version of this section
(see _topic/create_vlen_strings.htm
)
is included via PHP on the H5T RM page.
As the term implies,
variable-length strings are strings of varying lengths.
Real variable-length strings can be arbitrarily long,
anywhere from 1 character to thousands of characters long.
These are what HDF5 calls variable-length strings
and, for the sake of discussion, we'll call them
unconstrained variable-length strings in this article.
But there is also a subclass of variable-length strings that vary within a well-defined range. For example, a set of strings might be known to always be between 5 and 20 characters long. In this article, we will call this subclass constrained variable-length strings. From HDF5’s point of view, these are actually just fixed-length strings that may happen to be shorter in length than the assigned datatype. Think of them as faux variable-length strings; we'll discuss them in more detail shortly.
Before we start creating strings, let’s look at string and character datatypes for a minute. HDF5 provides the following predefined datatypes that are relevant to this discussion, one string datatype and three character datatypes:
H5T_C_S1
H5T_NATIVE_CHAR
H5T_NATIVE_SCHAR
H5T_NATIVE_UCHAR
The character datatypes,
H5T_NATIVE_CHAR
,
H5T_NATIVE_SCHAR
, and
H5T_NATIVE_UCHAR
,
are single-character datatypes;
a data element of one of these datatypes always contains one character.
They are unsuitable for creating a string datatype.
The string datatype,
H5T_C_S1
for C and
H5T_FORTRAN_S1
for Fortran,
defaults to one character in size, but a copy of it can be resized to any length.
These types are therefore the base type for any fixed-length
or variable-length string datatype.
Creating unconstrained
(or real) variable-length string datatypes:
The following HDF5 call creates a variable-length string datatype,
vls_type_id
:
vls_type_id = H5Tcreate(H5T_STRING, H5T_VARIABLE) (call 1)
Strings of type
vls_type_id
can be of arbitrary length.
In a C environment, these strings will always be NULL-terminated, so the buffer to hold such a string in memory must be one byte larger than the string itself to accommodate the NULL terminator.
Under the covers, variable-length strings are stored in a heap, which can present challenges for efficient storage and read/write access.
The next section discusses a different approach which may be useful in situations where it is known that the string length in a dataset will vary within known bounds.
Creating datatypes for constrained
(or faux) variable-length strings:
To avoid the storage and I/O overhead associated with heaps,
it will sometimes be useful to take a different approach when
it is known that the string length in a dataset
will always fall within known bounds.
Consider the example of a dataset containing one million strings that you know will range from 5 to 20 bytes in length. The following HDF5 call creates a string datatype for strings up to 20 bytes.
to20B_type_id = H5Tcreate(H5T_STRING, 20) (call 2)
If a particular data element is just a 5-byte string, simply write it to the dataset as a 5-byte string plus a NULL terminator (6 bytes total). When HDF5 reads the data back or otherwise works with it in a C environment, it will interpret the NULL-terminated string transparently and correctly.
Note that variable-length strings stored in this manner must always be NULL-terminated unless they exactly fill the full datatype space (exactly 20 bytes in this case). Failure to include the NULL terminator will result in either misinterpreted data or undefined values.
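The termination rule above can be illustrated in plain C without any HDF5 calls. This is only a sketch, assuming each element occupies a full 20-byte slot in the application's buffer. The names SLOT_SIZE, pack_slot, and slot_strlen are hypothetical and are not part of HDF5.

```c
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 20  /* hypothetical: matches the 20-byte datatype above */

/* Copy s into a fixed-size slot. A string shorter than the slot is
 * followed by NULL bytes; a string that exactly fills the slot gets
 * no terminator. Returns 0 on success, -1 if s does not fit. */
int pack_slot(char slot[SLOT_SIZE], const char *s)
{
    size_t len = strlen(s);
    if (len > SLOT_SIZE)
        return -1;
    memset(slot, 0, SLOT_SIZE);   /* zero-fill so short strings are terminated */
    memcpy(slot, s, len);
    return 0;
}

/* Length of the string held in a slot: stops at the first NULL byte,
 * or at SLOT_SIZE when the string fills the slot completely. */
size_t slot_strlen(const char slot[SLOT_SIZE])
{
    const char *end = memchr(slot, '\0', SLOT_SIZE);
    return end ? (size_t)(end - slot) : SLOT_SIZE;
}
```

Zero-filling the whole slot is a design choice of this sketch: it guarantees a terminator after a short string and avoids writing leftover bytes from a previous element.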
Strings in this dataset can be of any length up to 20 bytes, giving you essentially a constrained variable-length string. But since everything is handled within a fixed-length datatype, you receive all the benefits of HDF5’s highly efficient sequential I/O without the overhead of extracting data from a heap.
If this datatype were defined as in call 1 and the million-element dataset were fully populated, reading the entire dataset would require HDF5, under the covers, to issue up to 2 million seeks and reads to pluck the data elements 1-by-1 from the heap. Using this faux variable-length datatype, HDF5 can read the entire dataset with a couple of seeks and reads.
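Note that this dataset can also be chunked so that the string data itself is stored in the chunks. With unconstrained variable-length strings, the string data resides in a heap, so chunking applies only to the references stored in the dataset and not to the strings themselves.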
Creating fixed-length string datatypes:
Relative to any form of variable-length string datatype,
fixed-length string datatypes are straightforward.
The following HDF5 call creates a fixed-length, 30-byte
string datatype:
str30B_type_id = H5Tcreate(H5T_STRING, 30)
This datatype can be used for 30-character ASCII strings without any need for NULL terminators or any other special handling.
[ Consider a note regarding the accommodations necessary to handle fixed-length UTF-8 strings. ]
H5Tvlen_create
does not create variable-length strings
Despite the similarity in name,
H5Tvlen_create
is not the way to create variable-length strings;
that function actually creates a fundamentally different datatype object.
H5Tvlen_create
creates a datatype that is a
variable-size, one-dimensional array datatype with array elements of the base datatype.
Consider the following examples:
vl_char_type_id = H5Tvlen_create(H5T_NATIVE_CHAR)
This call creates a datatype that holds a variable-size, one-dimensional array of data elements; each element is of the
H5T_NATIVE_CHAR
base datatype.
str12B_type_id = H5Tcreate(H5T_STRING, 12)
vl_str12B_type_id = H5Tvlen_create(str12B_type_id)
This pair of calls creates a datatype that holds a variable-size, one-dimensional array of 12-byte strings.
vl_f32be_type_id = H5Tvlen_create(H5T_IEEE_F32BE)
The above call creates a datatype that holds a variable-size, one-dimensional array of IEEE big-endian 32-bit floats.