Data.ByteString.Builder.Prim
Copyright | (c) 2010-2011 Simon Meier (c) 2010 Jasper van der Jeugt |
---|---|
License | BSD3-style (see LICENSE) |
Maintainer | Simon Meier <[email protected]> |
Portability | GHC |
Safe Haskell | Trustworthy |
Language | Haskell98 |
Description
This module provides Builder
primitives, which are lower level building blocks for constructing Builder
s. You don't need to go down to this level but it can be slightly faster.
Morally, builder primitives are like functions a -> Builder
, that is they take a value and encode it as a sequence of bytes, represented as a Builder
. Of course their implementation is a bit more specialised.
Builder primitives come in two forms: fixed-size and bounded-size.
-
Fixed(-size) primitives are builder primitives that always result in a sequence of bytes of a fixed length. That is, the length is independent of the value that is encoded. An example of a fixed size primitive is the big-endian encoding of a
Word64
, which always results in exactly 8 bytes. -
Bounded(-size) primitives are builder primitives that always result in a sequence of bytes that is no larger than a predetermined bound. That is, the bound is independent of the value that is encoded but the actual length will depend on the value. An example for a bounded primitive is the UTF-8 encoding of a
Char
, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.
Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.
Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.
The purpose of using builder primitives is to improve the performance of Builder
s. These improvements stem from making the two most common steps performed by a Builder
more efficient. We explain these two steps in turn.
The first most common step is the concatenation of two Builder
s. Internally, concatenation corresponds to function composition. (Note that Builder
s can be seen as difference-lists of buffer-filling functions; cf. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. ) Function composition is a fast O(1) operation. However, we can use bounded primitives to remove some of these function compositions altogether, which is more efficient.
The second most common step performed by a Builder
is to fill a buffer using a bounded primitives, which works as follows. The Builder
checks whether there is enough space left to execute the bounded primitive. If there is, then the Builder
executes the bounded primitive and calls the next Builder
with the updated buffer. Otherwise, the Builder
signals its driver that it requires a new buffer. This buffer must be at least as large as the bound of the primitive. We can use bounded primitives to reduce the number of buffer-free checks by fusing the buffer-free checks of consecutive Builder
s. We can also use bounded primitives to simplify the control flow for signalling that a buffer is full by ensuring that we check first that there is enough space left and only then decide on how to encode a given value.
Let us illustrate these improvements on the CSV-table rendering example from Data.ByteString.Builder. Its "hot code" is the rendering of a table's cells, which we implement as follows using only the functions from the Builder
API.
import Data.ByteString.Builder as B renderCell :: Cell -> Builder renderCell (StringC cs) = renderString cs renderCell (IntC i) = B.intDec i renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"' where escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\' escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"' escape c = B.charUtf8 c
Efficient encoding of Int
s as decimal numbers is performed by intDec
. Optimization potential exists for the escaping of String
s. The above implementation has two optimization opportunities. First, the buffer-free checks of the Builder
s for escaping double quotes and backslashes can be fused. Second, the concatenations performed by foldMap
can be eliminated. The following implementation exploits these optimizations.
import qualified Data.ByteString.Builder.Prim as P import Data.ByteString.Builder.Prim (condB
,liftFixedToBounded
, (>*<
), (>$<
) ) renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> E.encodeListWithB
escape cs <> B.charUtf8 '"' where escape :: E.BoundedPrim
Char escape =condB
(== '\\') (fixed2 ('\\', '\\')) $condB
(== '\"') (fixed2 ('\\', '\"')) $ E.charUtf8
{-# INLINE fixed2 #-} fixed2 x =liftFixedToBounded
$ const x>$<
E.char7
>*<
E.char7
The code should be mostly self-explanatory. The slightly awkward syntax is because the combinators are written such that the size-bound of the resulting BoundedPrim
can be computed at compile time. We also explicitly inline the fixed2
primitive, which encodes a fixed tuple of characters, to ensure that the bound computation happens at compile time. When encoding the following list of String
s, the optimized implementation of renderString
is two times faster.
maxiStrings :: [String] maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]
Most of the performance gain stems from using primMapListBounded
, which encodes a list of values from left-to-right with a BoundedPrim
. It exploits the Builder
internals to avoid unnecessary function compositions (i.e., concatenations). In the future, we might expect the compiler to perform the optimizations implemented in primMapListBounded
. However, it seems that the code is currently to complicated for the compiler to see through. Therefore, we provide the BoundedPrim
escape hatch, which allows data structures to provide very efficient encoding traversals, like primMapListBounded
for lists.
Note that BoundedPrim
s are a bit verbose, but quite versatile. Here is an example of a BoundedPrim
for combined HTML escaping and UTF-8 encoding. It exploits that the escaped character with the maximal Unicode codepoint is '>'.
{-# INLINE charUtf8HtmlEscaped #-} charUtf8HtmlEscaped :: E.BoundedPrim Char charUtf8HtmlEscaped =condB
(> '>' ) E.charUtf8
$condB
(== '<' ) (fixed4 ('&',('l',('t',';')))) $ -- <condB
(== '>' ) (fixed4 ('&',('g',('t',';')))) $ -- >condB
(== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $ -- &condB
(== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $ -- "condB
(== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $ -- ' (liftFixedToBounded
E.char7
) -- fallback forChar
s smaller than '>' where {-# INLINE fixed4 #-} fixed4 x =liftFixedToBounded
$ const x>$<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7 {-# INLINE fixed5 #-} fixed5 x =liftFixedToBounded
$ const x>$<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7>*<
E.char7
This module currently does not expose functions that require the special properties of fixed-size primitives. They are useful for prefixing Builder
s with their size or for implementing chunked encodings. We will expose the corresponding functions in future releases of this library.
Bounded-size primitives
data BoundedPrim a Source
A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.
Combinators
The combinators for BoundedPrim
s are implemented such that the size of the resulting BoundedPrim
can be computed at compile time.
emptyB :: BoundedPrim a Source
The BoundedPrim
that always results in the zero-length sequence.
(>*<) :: Monoidal f => f a -> f b -> f (a, b) infixr 5 Source
A pairing/concatenation operator for builder primitives, both bounded and fixed size.
For example,
toLazyByteString (primFixed (char7 >*< char7) ('x','y')) = "xy"
We can combine multiple primitives using >*<
multiple times.
toLazyByteString (primFixed (char7 >*< char7 >*< char7) ('x',('y','z'))) = "xyz"
(>$<) :: Contravariant f => (b -> a) -> f a -> f b infixl 4 Source
A fmap-like operator for builder primitives, both bounded and fixed size.
Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)
We can use it for example to prepend and/or append fixed values to an primitive.
showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'" where fixed3 = char7 >*< char7 >*< char7
Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time.
eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b) Source
Encode an Either
value using the first BoundedPrim
for Left
values and the second BoundedPrim
for Right
values.
Note that the functions eitherB
, pairB
, and contramapB
(written below using >$<
) suffice to construct BoundedPrim
s for all non-recursive algebraic datatypes. For example,
maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a) maybeB nothing just =maybe
(Left ()) Right>$<
eitherB nothing just
condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a Source
Conditionally select a BoundedPrim
. For example, we can implement the ASCII primitive that drops characters with Unicode codepoints above 127 as follows.
charASCIIDrop =condB
(< '\128') (fromF
char7
)emptyB
Builder construction
primBounded :: BoundedPrim a -> a -> Builder Source
Create a Builder
that encodes values with the given BoundedPrim
.
We rewrite consecutive uses of primBounded
such that the bound-checks are fused. For example,
primBounded (word32 c1) `mappend` primBounded (word32 c2)
is rewritten such that the resulting Builder
checks only once, if ther are at 8 free bytes, instead of checking twice, if there are 4 free bytes. This optimization is not observationally equivalent in a strict sense, as it influences the boundaries of the generated chunks. However, for a user of this library it is observationally equivalent, as chunk boundaries of a lazy ByteString
can only be observed through the internal interface. Morevoer, we expect that all primitives write much fewer than 4kb (the default short buffer size). Hence, it is safe to ignore the additional memory spilled due to the more agressive buffer wrapping introduced by this optimization.
primMapListBounded :: BoundedPrim a -> [a] -> Builder Source
Create a Builder
that encodes a list of values consecutively using a BoundedPrim
for each element. This function is more efficient than the canonical
filter p = B.toLazyByteString . E.encodeLazyByteStringWithF (E.ifF p E.word8) E.emptyF)
mconcat . map (primBounded w)
or
foldMap (primBounded w)
because it moves several variables out of the inner loop.
primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source
Create a Builder
that encodes a sequence generated from a seed value using a BoundedPrim
for each sequence element.
primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source
Create a Builder
that encodes each Word8
of a strict ByteString
using a BoundedPrim
. For example, we can write a Builder
that filters a strict ByteString
as follows.
import Data.ByteString.Builder.Primas P (word8, condB, emptyB)
filterBS p = P.condB p P.word8 P.emptyB
primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source
Chunk-wise application of primMapByteStringBounded
.
Fixed-size primitives
A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size.
Combinators
The combinators for FixedPrim
s are implemented such that the size
of the resulting FixedPrim
is computed at compile time.
The '(>*and '($<)' pairing and mapping operators can be used with FixedPrim
.
The FixedPrim
that always results in the zero-length sequence.
liftFixedToBounded :: FixedPrim a -> BoundedPrim a Source
Lift a FixedPrim
to a BoundedPrim
.
Builder construction
In terms of expressivity, the function fixedPrim
would be sufficient for constructing Builder
s from FixedPrim
s. The fused variants of this function are provided because they allow for more efficient implementations. Our compilers are just not smart enough yet; and for some of the employed optimizations (see the code of encodeByteStringWithF
) they will very likely never be.
Note that functions marked with "Heavy inlining." are forced to be inlined because they must be specialized for concrete encodings, but are rather heavy in terms of code size. We recommend to define a top-level function for every concrete instantiation of such a function in order to share its code. A typical example is the function byteStringHex
from Data.ByteString.Builder.ASCII, which is implemented as follows.
byteStringHex :: S.ByteString -> Builder byteStringHex =encodeByteStringWithF
word8HexFixed
primFixed :: FixedPrim a -> a -> Builder Source
Encode a value with a FixedPrim
.
primMapListFixed :: FixedPrim a -> [a] -> Builder Source
Encode a list of values from left-to-right with a FixedPrim
.
primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source
Encode a list of values represented as an unfoldr
with a FixedPrim
.
primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source
Heavy inlining. Encode all bytes of a strict ByteString
from left-to-right with a FixedPrim
. This function is quite versatile. For example, we can use it to construct a Builder
that maps every byte before copying it to the buffer to be filled.
mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder mapToBuilder f = encodeByteStringWithF (contramapF f word8)
We can also use it to hex-encode a strict ByteString
as shown by the byteStringHex
example above.
primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source
Heavy inlining. Encode all bytes of a lazy ByteString
from left-to-right with a FixedPrim
.
Standard encodings of Haskell values
Binary encodings
Encoding single signed bytes as-is.
word8 :: FixedPrim Word8 Source
Encoding single unsigned bytes as-is.
Big-endian
int16BE :: FixedPrim Int16 Source
Encoding Int16
s in big endian format.
int32BE :: FixedPrim Int32 Source
Encoding Int32
s in big endian format.
int64BE :: FixedPrim Int64 Source
Encoding Int64
s in big endian format.
word16BE :: FixedPrim Word16 Source
Encoding Word16
s in big endian format.
word32BE :: FixedPrim Word32 Source
Encoding Word32
s in big endian format.
word64BE :: FixedPrim Word64 Source
Encoding Word64
s in big endian format.
floatBE :: FixedPrim Float Source
Encode a Float
in big endian format.
doubleBE :: FixedPrim Double Source
Encode a Double
in big endian format.
Little-endian
int16LE :: FixedPrim Int16 Source
Encoding Int16
s in little endian format.
int32LE :: FixedPrim Int32 Source
Encoding Int32
s in little endian format.
int64LE :: FixedPrim Int64 Source
Encoding Int64
s in little endian format.
word16LE :: FixedPrim Word16 Source
Encoding Word16
s in little endian format.
word32LE :: FixedPrim Word32 Source
Encoding Word32
s in little endian format.
word64LE :: FixedPrim Word64 Source
Encoding Word64
s in little endian format.
floatLE :: FixedPrim Float Source
Encode a Float
in little endian format.
doubleLE :: FixedPrim Double Source
Encode a Double
in little endian format.
Non-portable, host-dependent
intHost :: FixedPrim Int Source
Encode a single native machine Int
. The Int
s is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Int
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or integer sized machines, without conversion.
int16Host :: FixedPrim Int16 Source
Encoding Int16
s in native host order and host endianness.
int32Host :: FixedPrim Int32 Source
Encoding Int32
s in native host order and host endianness.
int64Host :: FixedPrim Int64 Source
Encoding Int64
s in native host order and host endianness.
wordHost :: FixedPrim Word Source
Encode a single native machine Word
. The Word
s is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Word
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or word sized machines, without conversion.
word16Host :: FixedPrim Word16 Source
Encoding Word16
s in native host order and host endianness.
word32Host :: FixedPrim Word32 Source
Encoding Word32
s in native host order and host endianness.
word64Host :: FixedPrim Word64 Source
Encoding Word64
s in native host order and host endianness.
floatHost :: FixedPrim Float Source
Encode a Float
in native host order and host endianness. Values written this way are not portable to different endian machines, without conversion.
doubleHost :: FixedPrim Double Source
Encode a Double
in native host order and host endianness.
Character encodings
ASCII
char7 :: FixedPrim Char Source
Encode the least 7-bits of a Char
using the ASCII encoding.
Decimal numbers
Decimal encoding of numbers using ASCII encoded characters.
int8Dec :: BoundedPrim Int8 Source
Decimal encoding of an Int8
.
int16Dec :: BoundedPrim Int16 Source
Decimal encoding of an Int16
.
int32Dec :: BoundedPrim Int32 Source
Decimal encoding of an Int32
.
int64Dec :: BoundedPrim Int64 Source
Decimal encoding of an Int64
.
intDec :: BoundedPrim Int Source
Decimal encoding of an Int
.
word8Dec :: BoundedPrim Word8 Source
Decimal encoding of a Word8
.
word16Dec :: BoundedPrim Word16 Source
Decimal encoding of a Word16
.
word32Dec :: BoundedPrim Word32 Source
Decimal encoding of a Word32
.
word64Dec :: BoundedPrim Word64 Source
Decimal encoding of a Word64
.
wordDec :: BoundedPrim Word Source
Decimal encoding of a Word
.
Hexadecimal numbers
Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,
toLazyByteString (primBounded word16Hex 0x0a10) = "a10"
Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.
word8Hex :: BoundedPrim Word8 Source
Hexadecimal encoding of a Word8
.
word16Hex :: BoundedPrim Word16 Source
Hexadecimal encoding of a Word16
.
word32Hex :: BoundedPrim Word32 Source
Hexadecimal encoding of a Word32
.
word64Hex :: BoundedPrim Word64 Source
Hexadecimal encoding of a Word64
.
wordHex :: BoundedPrim Word Source
Hexadecimal encoding of a Word
.
Fixed-width hexadecimal numbers
Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,
toLazyByteString (primFixed word16HexFixed 0x0a10) = "0a10"
int8HexFixed :: FixedPrim Int8 Source
Encode a Int8
using 2 nibbles (hexadecimal digits).
int16HexFixed :: FixedPrim Int16 Source
Encode a Int16
using 4 nibbles.
int32HexFixed :: FixedPrim Int32 Source
Encode a Int32
using 8 nibbles.
int64HexFixed :: FixedPrim Int64 Source
Encode a Int64
using 16 nibbles.
word8HexFixed :: FixedPrim Word8 Source
Encode a Word8
using 2 nibbles (hexadecimal digits).
word16HexFixed :: FixedPrim Word16 Source
Encode a Word16
using 4 nibbles.
word32HexFixed :: FixedPrim Word32 Source
Encode a Word32
using 8 nibbles.
word64HexFixed :: FixedPrim Word64 Source
Encode a Word64
using 16 nibbles.
floatHexFixed :: FixedPrim Float Source
Encode an IEEE Float
using 8 nibbles.
doubleHexFixed :: FixedPrim Double Source
Encode an IEEE Double
using 16 nibbles.
ISO/IEC 8859-1 (Char8)
The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8-bits and encoding them as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the Char8 encoding is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec
can also be used for encoding Int
s as a decimal number with Char8 encoded characters.
char8 :: FixedPrim Char Source
Char8 encode a Char
.
UTF-8
The UTF-8 encoding can encode all Unicode codepoints. It is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec
can also be used for encoding Int
s as a decimal number with UTF-8 encoded characters.
charUtf8 :: BoundedPrim Char Source
UTF-8 encode a Char
.
© The University of Glasgow and others
Licensed under a BSD-style license (see top of the page).
https://downloads.haskell.org/~ghc/7.10.3/docs/html/libraries/bytestring-0.10.6.0/Data-ByteString-Builder-Prim.html