This document describes the Noda Time-specific binary file format that is
produced by Noda Time's NodaTime.TzdbCompiler
tool. This tool reads the
text files from the tz database along with
some additional input from CLDR; it produces output in one of two different
formats:
a .NET resource file containing a number of different resources
(including, for example, one for each time zone), serialized as either
an XML (.resx
) or compiled (.resources
) file. This is the only
format accepted by Noda Time 1.0.x. The intention is to remove support
for this format in Noda Time 2.0.
a binary file format (the "NodaZoneData" format) that does not depend upon a .NET resource file acting as a container. This is the default format in Noda Time 1.1 and later, and is the only format supported by the Portable Class Library versions of Noda Time.
The formats have some similarities, but can be treated independently. In particular, the resource format supported by Noda Time 1.0.x is effectively frozen, while the NodaZoneData format will evolve as more data is required.
Most users will just use the version of the database embedded in the Noda
Time assembly (which is typically produced with the same version of
NodaTime.TzdbCompiler
as the Noda Time assembly itself), and so do not need to
worry about these details, but as we provide the ability to use a custom
compiled time zone database, we document here the compatibility rules
between the output produced by different versions of ZoneInfoCompiler
/NodaTime.TzdbCompiler
and
the Noda Time assembly.
In general, these are probably as you would expect: any version of Noda Time must be able to read a compiled time zone database consumed by an earlier version of Noda Time with the same major version number; the reverse, however, is not true.
For example, Noda Time 1.1.x should be able to read the compiled output used by Noda Time 1.0.x and 1.1.x, but would not necessarily be able to read the output used by version 1.2.x. Noda Time 2.x does not necessarily need to be able to read anything consumed by Noda Time 1.x.
(The same is true for the PCL version of Noda Time 1.x, though it is considered a separate development line for the purposes of compatibility, and so does not need to support the same features as the earlier non-PCL versions.)
Note that even though Noda Time must read earlier versions of the compiled database, it does not necessarily need to provide the same level of functionality when doing so as it might when reading a current version. In particular, only calls to methods supported by that earlier version of Noda Time are guaranteed to work; any calls to methods introduced in a later version may require a database compiled by at least that version.
The former is true because semantic versioning requires that we be able to drop a later version of the Noda Time assembly into a running environment without recompiling (or without replacing the compiled time zone database, which might e.g. be embedded in the main executable); the latter statement is true because it is unreasonable to expect that new functionality should work without the new data it may require.
As a consequence, it is reasonable to act as if any functionality introduced in a given version of Noda Time comes with an implicit caveat that it requires a database compiled against at least that version to work.
Compiling the input data using the (default) NodaZoneData
option produces
a compiled database in the "NodaZoneData" format, with an extension of ".nzd".
This is intended as the long-term format, and is designed with more extensibility
in mind than the resource file format.
The first four bytes of the file contain the format version number. At the time of this writing, the only supported version number is 0. Not all changes in the future will require a change in version number (as even within a version the format is extensible) but this provides flexibility in the future, where a single binary may be able to read a range of versions. At that point, this document will be extended to give details for each version separately. Each version may be completely independent of previous versions, so may redefine the meaning of particular field IDs, or even contain a different field structure.
After that, the file is simply a sequence of fields. Each field consists of:
count
(see details below)Fields are always stored in ascending order of ID (so later fields may depend on earlier ones). The expected data within each field is governed by its ID.
The "NodaZoneData" format makes use of the following serialization primitives:
byte
, a general 8-bit unsigned integer, simply serialized as a single byte.fixed16
, fixed32
, and fixed64
, general 16-bit, 32-bit, and 64-bit signed integers,
simply serialized as 2-, 4-, or 8-byte twos-complement values of the appropriate length,
in MSB-first order. For example, 0x1234
would be serialized as the bytes [0x12 0x34]
using
the fixed16
serialization, while -1 would be serialized as [0xff 0xff]
count
, a non-negative (typically small) int32
value, serialized as a variably-sized integer:
the value is written 7 bits at a time (least significant first), with the highest bit of each
byte set if this is not the final byte of the value. For example, the value 0x16a
would be
serialized as the bytes [0xea 0x02]
.signed count
, a potentially-negative (typically small in magnitude) int32
value, stored
in a "ZigZag" encoding: first mapping the signed number to an unsigned number, alternating
between signed and unsigned values: 0 maps to 0, -1 maps to 1, 1 maps 2, -2 maps to 3, 2 maps
to 4 etc. The result is then written in the same 7-bit encoding as count
.string
, a UTF-8 encoded text string or entry into a string pool. Each field either has a
string pool associated with it or it doesn't. In a field with a string pool, strings are simply
stored using the count
encoding to specify which entry in the string pool is used. In a field
without a string pool, the UTF-8-encoded size of the string is written using the count
encoding,
then the UTF-8 encoded bytes themselves are written.offset
, an int32
value representing a millisecond offset from UTC, with maximum and minimum
values of +/-24 hours (exclusive on both sides). The offset is first made positive by simply adding
24 hours, and then the resulting integer is represented in one of four ways, based on how neatly
the offset divides into a whole number of larger units. In each case where the representation is
more than one byte, the data is written with the most significant bits first. The representations are:
zoneintervaltransition
, an encoding for an instant possibly relative to an earlier or equal one. This
is used when writing transitions between zone intervals, where the additional context of a previous value
helps to reduce the data required. The format is (in order of preference):
Instant.MinValue
, a single byte 0
.Instant.MaxValue
, a single byte 1
.count
.count
.2
followed by a fixed64
value for the number of ticks since the Unix epoch.dictionary
, a string-to-string dictionary where no keys or values can be null. This is simply a
count
number of entries followed by each key/value pair as a string
key and string
value.The string pool field is built without reference to a string pool (as otherwise it would be pointless).
It is simply a count
followed by that many strings. The ordering of the strings is optimized (by running through
a lot of the serialization twice) so that the most commonly used strings come earliest. This allows those strings
to be represented very efficiently by later fields which use the string pool.
There must be exactly one string pool field.
The time zone field provides full details of one of the "canonical" time zones in TZDB. (Time zone aliases always eventually refer to a canonical time zone for their data.)
The format depends on the type of time zone. At the time of this writing, there are only two options, described below. In both cases, the format starts with the zone ID (as a string) and then a 1-byte flag to indicate the type.
Fixed time zones (which have a single ZoneInterval
covering
the whole of time). The flag value for a fixed zone is 1
. This is followed by the offset, and as of time zone data
generated under Noda Time 2.x, also the name for the zone interval. Noda Time 1.x code will ignore the name part,
and always use the time zone ID as the name for the interval. Noda Time 2.x code will read the name if it exists,
but default to the ID if the name data isn't present.
Everything else is represented as a PrecalculatedDateTimeZone
- a number of abutting ZoneIntervals
from the start of
time until either the end of time, or some stable period where a pair of rules governing when daylight saving time starts
and stops continues until the end of time. This final stable period is the "tail zone" - and is optional, as a zone which
ends in a fixed offset can simply represent that as a final zone interval to the end of time.
The flag value for a precalculated zone is 2
, and the remaining data is:
count
zoneintervaltransition
from the start of the previous interval (except for the first interval, which has no "previous")
to the start of the current one.string
name of the intervaloffset
(i.e. standard + daylight savings)offset
zoneintervaltransition
from the start of the final zone interval above to
the end of the final interval (which is the start of the tail zone, if there
is one, or Instant.MaxValue
otherwise).byte
to indicate the presence (1) or absence (0) of a tail zoneoffset
string
name used for intervals in "standard time"string
name used for intervals in "daylight saving time"offset
(to be added to the standard offset)A recurrence rule provides data about when a transition occurs. For example, "The first Sunday in October at 1am local time." The format is:
count
(single byte) for the month-of-year in which the rule startssigned count
(single byte) for the day-of-month in which the rule starts, where a negative
value indicates the number of days counted from the end of a month. (-1 is the last day of the
month, for example.)offset
used (somewhat unconventionally) to indicate the time of day at which the recurrence starts.This field may occur multiple times, and it always uses the string pool.
This is simply a single string
representing the version of TZDB being serialized, such as "2013a".
This field must occur exactly once, and it does not use the string pool.
This is a map from alias to canonical ID. So for example, "Europe/Guernsey" maps to "Europe/London". The
representation is a single dictionary
.
This field must occur exactly once, and it uses the string pool.
This field contains the data from the windowsZones.xml file in CLDR. The format is:
string
for the mapping version number (currently a string representation of the revision number)string
for the TZDB version the mapping was generated againststring
for the Windows time zone version the mapping was generated againstcount
for the number of "map zones" to followstring
for the Windows system time zone IDstring
for the territory IDcount
for the number of TZDB IDs in this mappingstring
This field must occur exactly once, and it uses the string pool.
This field is only present for the sake of the PCL (1.x) versions, and is not loaded in 2.x.
It contains a mapping from Windows TimeZoneInfo
standard name (in the en-US
locale) to TZDB ID for those time zones
where the two differ. (There are only a few.) This is required as the PCL doesn't expose the
TimeZoneInfo.Id
property.
The field data is a single dictionary
. It must occur exactly once, and it uses the string pool.
This field provides the information in the iso3166.tab
and zone.tab
files within TZDB, which
contain geographical information about time zones. Applications can use this to offer users a choice
of time zones in a friendlier way than using the TZDB ID... although the information is only provided
in English.
The field data consists of a count
of elements, then the elements themselves. Each element consists of:
signed count
for the latitude of the sample location, as an integer number of seconds.signed count
for the longitude of the sample location, as an integer number of seconds.string
for the name of the country.string
for the ISO-3166 two-letter country code.string
for the comment associated with the location (usually used to disambiguate between
locations in the same country).This field provides the information in the iso3166.tab
and zone1970.tab
files within TZDB.
The field data consists of a count
of elements, then the elements themselves. Each element consists of:
signed count
for the latitude of the sample location, as an integer number of seconds.signed count
for the longitude of the sample location, as an integer number of seconds.signed count
for the number of countries.string
for the name of the country.string
for the ISO-3166 two-letter country code.string
for the zone ID.string
for the comment associated with the location (usually used to disambiguate between
locations in the same country).Compiling the input data using the Resource
or ResX
options produces a
compiled database in the resource file format native to Noda Time 1.0.x.
This is represented as a .NET resource file containing many resources: one
resource for each time zone, and four additional meta-information resources.
The four meta-information resources, and their resource keys, are:
--meta-VersionId
, a String
resource containing the tzdb version ID
(e.g. "2012i")--meta-IdMap
, a byte[]
resource containing a serialized
dictionary
(see below for the definition of these types), mapping all
known time zone IDs (aliases and non-aliases) to a non-alias time zone
ID--meta-WindowsToPosix
, a byte[]
resource containing a serialized
dictionary
, mapping Windows time zone names to the corresponding tzdb
time zone ID--meta-WindowsToPosixVersion
, a String
resource containing the
revision number of the Windows mapping data (e.g. "7825")Each non-alias time zone is represented by a byte[]
resource containing a
serialized timezone
. The name of the resource is derived from the time zone
ID, with (theoretically) some substitutions for invalid characters. (The
substitutions are not documented here, since they do not actually appear to be
used in practice; for example, the name of the Europe/London resource is simply
"Europe/London", and the same appears to be true for all other time zones as of
the 2012i tzdb).
All of these resources may (theoretically) contain extra appended data not mentioned below; readers do not confirm that the end of the resource was reached when all the data they know how to read has been consumed.
The resource file format uses the following serialization primitives:
byte
, a general 8-bit unsigned integerfixed16
, fixed32
, and fixed64
, general 16-bit, 32-bit, and 64-bit
signed integerscount
, a non-negative (typically small) int32
valueoffset
, an int32
value representing a millisecond offset from UTCstring
, a UTF-8 encoded text stringdictionary
, a dictionary with string keys and valuestimezone
, a serialized time zone (with several variants and subtypes,
documented below)Of these, only dictionary
and timezone
are used directly; the others are
used as part of the definitions of these types, below.
A byte
is simply serialized as a single byte representing a value in the
range 0..255.
fixed16
, fixed32
, and fixed64
are simply serialized as 2-, 4-, or
8-byte twos-complement values of the appropriate length, in MSB-first order.
For example, 0x1234
would be serialized as the bytes [0x12 0x34]
using
the fixed16
serialization, while -1 would be serialized as [0xff 0xff]
.
A count
encodes a non-negative (typically small) int32
value; it is
serialized as a variable number of bytes according to the following
scheme:
0x00
..0x8e
are serialized as a single byte; values to 0x0e
are serialized as 0xf0
..0xfe
, while 0x0f
..0x8e
is serialized as
0x00
..0x7f
0x8f
..0x408e
is serialized as two bytes, by subtracting 0x8f
and
then encoding the resulting (0
–0x3fff
) value using the byte
0x80
..0xbf
for the MSB, followed by the LSB0x408f
..0x20408e
is serialized as three bytes, by subtracting
0x408f
and then encoding the resulting (0
–0x1fffff
) value
as the byte 0xc0
..0xdf
for the MSB, followed by the next two bytes
in MSB-first order (i.e. as a fixed16
)0x20408f
..0x1020408e
is serialized as four bytes, by subtracting
0x20408f
(giving 0
–0x0fffffff
), adding 0xe0000000
,
and then encoding the resulting (0xe0000000
–0xefffffff
) value
as a fixed32
0xff
followed by the value as a fixed32
An offset
encodes an int32
value representing an offset in milliseconds
from UTC. While the encoding below can in theory represent any int32
value, the use in Noda Time is solely for values in the range (-24 hours,
+24 hours); commonly for 'round' numbers of minutes, etc.
An offset
is serialized as a variable-length field as follows:
0x10
..0x6e
counting the number of 30 minute periods, with a
bias such that 0x10
represents an offset of -23:30, 0x11
represents
-23:00, 0x3f
represents a zero offset, and 0x6e
represents the
offset +23:300x82
..0x85
, by adding
0x83ffff
to the number of seconds, and writing three bytes in
MSB-first order, so that [0x82 0xae 0x80]
represents -23:59:59,
[0x84 0x00 0x00]
represents +00:00:01, and [0x85 0x51 0x7e]
represents +23:59:590xfd
followed by the number
of milliseconds as a fixed32
(The serialization method used additionally defines an encoding for 'out-of-range' offsets, but these cannot occur in practice, as the underlying C# data type restricts the range of the offset to that described above.)
A string
is serialized as a count
of the number of UTF-8 bytes in the
string followed by those bytes.
A dictionary
is serialized as a count
number of entries in the
dictionary, followed by that number of pairs of string
key and string
value.
A timezone
encodes a complete definition for a single time zone, excluding
the ID; it is represented in one of a few different forms depending on the
type of the time zone. The serialized format may be any of the following:
0x00
followed by another timezone
, indicating that the following
time zone (in practice, always one of type 0x04
) should have
information about zone transitions cached at runtime if possible0x01
followed by a daylightsavingstimezone
, representing a time
zone with a standard and DST offset, and simple recurrence rules for
each transition0x02
followed by an offset
, representing a time zone with a fixed
offset (and no DST); this is used natively when encoding time zones
with no DST (such as "Etc/UTC") and as part of the composite
"pre-calculated" time zone0x03
, representing 'no time zone' (only used when encoding an absent
'tail zone' for a pre-calculated zone)0x04
followed by a precalculatedtimezone
, representing a zone that
expands all known transitions explicitly, up to an optional 'tail zone'
representing a final (possibly DST-using) stable stateA daylightsavingstimezone
is serialized as an offset
containing the
standard offset followed by the DST zonerecurrence
and the standard time
(non-DST) zonerecurrence
. The DST recurrence contains an offset that is
the additional delta from the standard offset for daylight savings time
(usually positive, but sometimes zero); the standard time recurrence is
guaranteed to have a zero offset.
A zonerecurrence
is serialized as the string
name of the recurrence
(e.g. "PST"), the offset
containing the offset of this recurrence, a
description of when the recurrence occurs (the 'year offset'), and finally
two count
values indicating the (inclusive) years the recurrence covers,
where a zero start year indicates a recurrence that starts at the beginning
of time.
(Note that 1.0 versions of NodaTime serialized the 'beginning of time' as
the year int32.MinValue
instead of zero. The two should be treated
equally.)
The 'year offset' is:
count
indicating whether transitions are relative to UTC (0), the
'wall' (daylight savings) offset (1), or the standard offset (2)count
of the month-of-year in which this recurrence starts, with a
bias such that 13 represents January and 24 represents Decembercount
of the day-of-month in which this recurrence starts; values
from 0..30 indicate a number of days [-31..-1] counted from the end of
a (31 day) month (in practice, only 30 is used, to represent 'last
[day-of-week] of the month'), while values from 32..62 represent the
number of days [1..31] counted from the start of the monthcount
of the day of week in which this recurrence start, with a
bias such that 7 represents 'no day', and 8..13 represent
Monday..Sundaybyte
indicating whether the day of month represents an upper bound
(0) or lower bound (1); this is only relevant if a day of week is also
specified, where the rule might be described as e.g. "first Tuesday on
or after the 15th"offset
representing the (non-negative) time of day that the
recurrence startsbyte
indicating whether an additional day should be added to the
calculation (1) or not (0), to cope with transitions documented as
occurring at 24:00 - these may potentially spill into a following monthA precalculatedtimezone
is serialized as a string pool, a sequence of
zoneinterval
periods representing an expansion of all known transitions
before the rules stabilize, and an optional 'tail zone' representing a final stable
state. There is no tail zone if the final state is fixed (if the overall zone has
stopped observing DST); otherwise, it's a daylightsavingstimezone
with a pair of
rules which extend to the end of time.
Specifically:
count
of the number of entries
in the pool, followed by that number of string
valuescount
of the number of precalculated periods, followed by that
number of zoneinterval
valueszoneintervaltransition
for the tail zone (as the end of time if
there is no tail zone)timezone
representing the tail zone, in practice always either of
type 0x03
(no tail zone) or 0x01
(daylightsavingstimezone
).A zoneinterval
is serialized as the zoneintervaltransition
representing
the start of the transition (the beginning of time for the first one), an
index into the zone's interval name string pool, as either byte
or
fixed32
depending on the total number of entries in the pool, an offset
representing the wall-time offset from UTC (including any daylight savings
time) and a second offset
representing the daylight savings contribution
(or zero if the period is not a daylight savings period).
The periods serialized as part of a precalculatedtimezone
cover the whole of
time, so each zoneinterval
finishes as the next one starts (or as the tail
zone starts).
A zoneintervaltransition
represents an instant at which a transition
begins. It is serialized as a variable number of bytes according to the
following scheme, based on the (signed) number of ticks from the Unix epoch:
0xfe
and
0xff
respectively.0x00
..0x3e
counting the number of 30 minute
periods, with a bias such that 0x00
represents -15:30, 0x01
represents -15:00, 0x1f
represents a zero offset, and 0x3e
represents the +15:30. Values outside this range are serialized
as one of the forms below.fixed32
representing a 30 bit value with the top two bits of the fixed32
set to 01
.
The 30 bit value is the number of minutes, with a bias such that 0x00000000
represents -0x1fffff
minutes. (This is about 4 years. Note that this is shorter
than the intended span of about 1000 years, due to a bug; the bug cannot
now be fixed due to compatibility issues.) Values outside the +/-4 year range are
serialized as one of the forms below.byte
followed by a fixed32
, combined to represent a 40 bit value with the
top two bits of the first byte set to 10
. The 40 bit value is the number of
seconds, with a bias such that 0x0000000000
represents 0x1fffffffffL
seconds.
This is about 4355 years. Values outside the +/-4355 year range are
serialized as one of the forms below.0xc0
followed by a fixed64
with
the number of ticks.