One of the current proposals for changing C++ is D1040R4, which proposes adding a function called
std::embed, which can load binary assets during compilation. My initial reaction to the proposal was that the authors must have missed the existing solutions to this problem. That was hubris (sorry). The authors have indeed considered existing solutions, and the proposal is well written, but the motivation for this feature needs more thought. It is not clear that the existing solutions are insufficient, or that moving that functionality into the compiler would be a benefit.
Although a few different approaches are discussed in the proposal, I don’t think that considering the “manual work” approach is informative. Depending on scope and experience, someone might hand-wrap data into a literal. Either way, it does not inform the best tool once someone decides that they need a tool to help. Similarly, the comments on MongoDB’s bespoke Python are not informative either. The current proposal would mandate replacing their custom Python code with custom constexpr code in C++ to get the same effect. The primary question is: What is the best approach to embed a binary asset into a library or program?
In the past, my approach has been to use a tool to convert the binary file into C or C++. For mostly textual data, a tool like
file2c1 can be used, or, as suggested by D1040R4, use
xxd -i. The build pipeline is very simple2:
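A minimal sketch of that pipeline, assuming GNU make and a hypothetical asset named asset.bin:

```make
# Pattern rule: generate a C source from any binary asset via xxd.
# The built-in %.o: %.c rule then compiles it, and make deletes the
# intermediate .c file once the object file has been built.
%.c: %.bin
	xxd -i $< > $@
```

Running `make asset.o` chains the two rules: xxd generates asset.c, the compiler produces asset.o, and make removes asset.c automatically because files produced by chained implicit rules are treated as intermediate.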
And that would create output similar to the following:
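For a hypothetical five-byte file asset.bin containing the text “hello”, `xxd -i asset.bin` emits a C array and a length variable named after the file:

```c
unsigned char asset_bin[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f
};
unsigned int asset_bin_len = 5;
```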
The supposed weaknesses of this approach are described in the proposal.
This is problematic because it turns binary data in[to] C++ source. In many cases, this results in a larger file due to having to restructure the data to fit grammar requirements. It also results in needing an extra build step, which throws any programmer immediately at the mercy of build tools and project management.
None of which I have found to be true. Yes, there is an intermediate source file in C or C++, but when was the last time you ran out of disk space? For an ephemeral file that needs to exist only long enough to be fed to the compiler? Elsewhere it is suggested that this will impact compile times, but my memories of painfully long compile times have never involved embedding assets, which should be about the easiest code a compiler could hope for3. These impacts really should be quantified if they are to support the use of std::embed.
In particular, if compile times are an issue, do not follow the proposal’s examples and couple the compilation of the asset with the rest of the code. Instead of converting the asset into a header file, convert it to a source file, which can be compiled into a separate object file for linking. Even if the build time is significant, at least it only needs to be paid when the binary asset changes, not on every code change.
The third argument is about putting the programmer at the mercy of build tools, but this argument is backwards. Who works on a significant code base and is not already using a build tool? Instead, using
std::embed would hide a new type of dependency in the source files. Given the grief that the modules proposal has run into because of an unclear interface between build tools and the compiler, introducing additional landmines for build tools may not be a positive step.
The need to embed binary assets does occur. The existing tooling could be improved. In particular, a tool that could compile a binary asset directly to an object file, suitable for linking, should be much faster4. However, it is unclear why this functionality needs to be moved into the compiler.
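In fact, GNU binutils already comes close: ld can consume a raw binary file as an input object, skipping the C detour entirely. A sketch, with a hypothetical asset; the `_binary_*` symbol names are a GNU ld convention:

```shell
# Create a small stand-in asset.
printf 'hello' > asset.bin

# Wrap the raw bytes directly into a relocatable object file.
ld -r -b binary -o asset.o asset.bin

# The object exposes _binary_asset_bin_start/_end/_size symbols,
# which C or C++ code can declare as extern and link against.
nm asset.o
```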
Two other approaches are mentioned in the proposal. These approaches are more complex, so I don’t feel that they are required unless the approach mentioned above is not usable, which has not yet been shown. That said, I believe the comments about
incbin are incorrect. There is no indication that any files need to be shipped with the final binary.
There is one aspect of std::embed that cannot be matched by the other approaches: the binary asset can be used in a
constexpr context. In the proposal D1040R4,
constexpr is used to verify the prefix of the embedded files at compile time. I’d be surprised if that check must be done at compile time, but perhaps there are other applications.
I measured compile times for using
xxd -i and compiling the static array into an object file. This was on a Linux VM with only 2GB of memory5. User time for
xxd was only a few percent of the time required to compile the ephemeral source file. The largest binary asset that could be compiled was 4MB, and took ~7s6. Somewhere between 4MB and 8MB, the system started swapping. On a dedicated machine with 8GB of memory, 16MB binary assets could be compiled, but the compile time was well over a minute. Compile speed is not an issue at smaller file sizes, but clearly there is a lot of memory overhead.
Even if the memory were available, compiling very large files (on the order of gigabytes) would still become problematic. Assuming that the linear relation can be extrapolated that far, compile time would reach 30 minutes for a 1GB binary asset (256 times the ~7s measured at 4MB). At least some of the use-cases suggested by D1040R4 could run into this limit. The other approaches discussed in the proposal, such as
incbin, would probably perform better. On the dedicated system, I compiled object files for 1GB binary assets in 30s.
The main arguments in support of
std::embed do not really show a need. Although several use-cases are presented, there is no data to indicate that the existing approaches are insufficient. In particular, there is no explanation of why the functionality needs to be in the compiler, or of how it will interact with build systems.
Self-plug, but that code is old. ↩︎
Rocking the old-school make. Note that make will remove the intermediate C source file once it has built the object file. ↩︎
Although admittedly, large binary assets can make up in size what they lack in compile complexity. ↩︎
Such a tool should be relatively easy to build. Using LLVM, it would only take a few hundred lines of code. The library has optimizations for constant data arrays, but I don’t know if it has been tested with arrays that large. ↩︎
The host OS is Windows, which is only reporting a little over 200MB in use. It appears that the Linux OS may have very little physical memory available, so the size limits for compiling sources may be significantly larger than reported here. ↩︎
However, compile times are surprisingly linear from 1kB up to 4MB (R²=0.999). ↩︎