The standard approach to reading files is some equivalent of calling the functions open, read (repeatedly), and close. This pattern repeats across languages[1]. However, for simply accessing a sequence of bytes, there is another approach that provides a much nicer API for reading files: memory-mapped files. Instead of managing read buffers, one can use the operating system’s virtual memory management (VMM) to page in data as required. Here, I would like to show an implementation for C++.
To start, I would like to show a minimal class. There are only 4 member functions[2]. The constructor and destructor handle RAII for the memory mapping, and two short getters provide the bounds of the array of bytes[3]. It is important to note that while the constructor does create the memory mapping, it does not necessarily copy any of the file contents into memory.
There are only two member functions that still need definitions. The implementation below uses POSIX, but similar implementations are possible on all modern operating systems. The constructor is the only one that is even marginally complicated.
Before the file can be mapped to memory, some preparation is required. In POSIX, we need a file handle to create the memory mapping, and we also need to know the length of the file. I’ve left out comments, but the code has the following blocks:
The other member function to define is the destructor. In POSIX, the mapping can be released with a single call to munmap.
The above code works quite well on POSIX systems, but is not a complete solution. An implementation can be built for WIN32[5], but requires a few more steps. Additionally, you might want to refer to the implementation of the MemoryBuffer class by LLVM, which can delegate to a memory-mapped buffer for files.
There are two advantages to this approach. First, accessing the contents of the file is very easy. The file appears to the program as if it were a read-only array of bytes in memory. There is no need for calls to seek, no worries if records span multiple buffers for the read, etc. Wrap the data in a std::string_view, and go.
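To illustrate the "wrap and go" step, here is a small sketch that turns a pair of byte pointers into a `std::string_view` and counts newline characters in it. The helper names `as_view` and `count_lines` are made up for this example, and the code assumes C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Wrap the half-open byte range [begin, end) in a view; nothing is copied.
inline std::string_view as_view(const char* begin, const char* end) {
    return std::string_view(begin, static_cast<std::size_t>(end - begin));
}

// Count newline characters anywhere in the view. There are no buffer
// boundaries to consider because the whole file is one contiguous range.
inline std::size_t count_lines(std::string_view bytes) {
    return static_cast<std::size_t>(
        std::count(bytes.begin(), bytes.end(), '\n'));
}
```

With the getters described earlier, the two pointers would come from the mapped file object, and the usual `std::string_view` operations such as `find` and `substr` then work on the file contents directly.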
Second, the operating system becomes responsible for paging data into memory as necessary. This means that there is likely less copying, likely fewer system calls, and likely better memory management by the operating system. However, memory-mapped files are not faster in all cases[6].
Since we are using memory mapped files, the input needs to be an actual file. File descriptors also work with pipes, FIFOs, IO devices, sockets, plus other streams that I’m forgetting. If your program needs to work cleanly with all of these types of streams, then a memory mapped file may not be the correct approach.
There are also likely differences in the presence of concurrency, especially if the file is not local. Unfortunately, it is not specified whether writes to the file that occur after the mapping has been created are visible to the process or not. The same issue applies to accessing data with seek and read through file descriptors, but programmers are likely to see an implicit promise that their array of const bytes is immutable.
The implementation above is minimal, but memory mapped files essentially provide a std::string_view for reading files. With more effort, a writable version of the file could be provided with an interface matching a std::vector. When the capacity needs to be modified, the file length would be adjusted with ftruncate, and the mapping adjusted.
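The capacity change could be sketched roughly as follows. The function name `remap_resized` and its signature are hypothetical, the mapping is assumed to be `MAP_SHARED` so that stores reach the file, and `new_size` is assumed to be greater than zero because `mmap` rejects empty mappings. A full vector-like class would need considerably more than this.

```cpp
#include <sys/mman.h>    // mmap, munmap
#include <sys/types.h>   // off_t
#include <unistd.h>      // ftruncate
#include <cerrno>
#include <cstddef>
#include <system_error>

// Resize the file behind `fd` to `new_size` bytes and return a fresh
// read-write mapping of the whole file. `old_addr` and `old_size`
// describe the current mapping; pass nullptr and 0 if there is none.
char* remap_resized(int fd, char* old_addr, std::size_t old_size,
                    std::size_t new_size) {
    // Adjust the file length first so the new mapping is fully backed.
    if (::ftruncate(fd, static_cast<off_t>(new_size)) != 0) {
        throw std::system_error(errno, std::generic_category(), "ftruncate");
    }
    // Drop the old mapping. With MAP_SHARED, earlier stores already
    // belong to the file, so no data is lost here.
    if (old_addr != nullptr && ::munmap(old_addr, old_size) != 0) {
        throw std::system_error(errno, std::generic_category(), "munmap");
    }
    // Map the file again at its new length.
    void* addr = ::mmap(nullptr, new_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        throw std::system_error(errno, std::generic_category(), "mmap");
    }
    return static_cast<char*>(addr);
}
```

As with growing a std::vector, pointers into the old mapping are invalid after the call, since the new mapping may be placed at a different address.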
Memory-mapped files are not a common tool for reading files, but they provide a nice, simple API for reading. A minimal solution is also quite short. The main drawback compared to file descriptors is an inability to read data from the full range of sources supported. But, if your data is definitely on disk, your application code might be simpler.
[1] In C, this is fopen, fread, and fclose. In C++, this is construction of std::fstream, the read member function, and the destructor, although there are methods to open and close the file independent of object lifetime. In Go, this is os.Open, the method Read, and the method Close.
[2] There is a fifth declaration to explicitly delete the copy constructor, just to make sure there are no surprises.
[4] Since we are not writing, close shouldn’t fail with EIO because it is unable to flush the stream. Since there is no possibility of a bug…, EBADF is not possible. However, we technically should be checking if the close was interrupted (EINTR) and retrying the close.