Easy file reading

Mon Jan 1, 2018
~1200 Words
Tags: programming, C++

The standard approach to reading files is some equivalent of calling the functions open, read (repeatedly), and close. This pattern repeats across languages [1]. However, for simply accessing a sequence of bytes, there is another approach that provides a much nicer API for reading files – memory mapped files. Instead of managing read buffers, one can use the operating system’s virtual memory management (VMM) to page in data as required. Here, I would like to show an implementation for C++.

To start, I would like to show a minimal class. There are only 4 member functions [2]. The constructor and destructor handle RAII for the memory mapping, and two short getters provide the bounds of the array of bytes [3]. It is important to note that the constructor, while it does create the memory mapping, does not necessarily copy any of the file contents into memory.

#include <cstddef>
#include <cstdint>

class mmfile
{
public:
	explicit mmfile( char const* filename );
	mmfile( mmfile const& ) = delete;
	~mmfile();

	uint8_t const* begin() const noexcept { return _data; }
	uint8_t const* end() const noexcept { return _data + _size; }

private:
	uint8_t* _data;
	size_t _size;
};

There are only two member functions that still need definitions. The implementation below uses POSIX, but similar implementations are possible on all modern operating systems. The constructor is the only one that is even marginally complicated.

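#include <cassert>
#include <cerrno>
#include <system_error>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

mmfile::mmfile( char const* name )
{
	assert( name && *name );

	auto fd = open( name, O_RDONLY );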
	if (fd == -1) {
		auto ec = std::error_code( errno, std::system_category() );
		throw std::system_error( ec );
	}
    
	auto len = lseek( fd, 0, SEEK_END );
	if ( len == -1 ) {
		close_and_throw( fd );
	}
	_size = len;
	assert( _size == len /* in case of truncation */ );

	auto map = mmap( nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		close_and_throw( fd );
	}
	_data = reinterpret_cast<uint8_t*>(map);

	close( fd );
}

Before the file can be mapped to memory, some preparation is required. In POSIX, we need a file descriptor to create the memory mapping, and we also need to know the length of the file. I’ve left out comments, but the code has the following blocks:

  1. Check preconditions. It might be better to throw a std::invalid_argument instead of using an assert.
  2. Open the file. For error handling, opinions on exceptions vary. In this case, options are limited in a constructor, so we convert any error into a std::system_error and then throw.
  3. Seek to the end of the file, which conveniently returns the new offset, and therefore the file size. For error handling, we need to close the open file descriptor. The call to close might change the value of errno [4], so we use a small helper function that saves the value of errno before closing the file.
  4. Map the file to memory. Note that while a range of memory addresses is reserved for the file, the operating system has not necessarily read anything into physical memory.
  5. Close the file descriptor. The memory mapping keeps its own reference to the underlying file, so we don’t need to hold on to our descriptor.
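
The helper used in the error paths above is not shown in the listing. A minimal sketch of what it might look like follows; the name close_and_throw comes from the constructor, but the body is an assumption about its behaviour based on the description in step 3.

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>
#include <unistd.h>

// Assumed implementation of the helper named in the constructor.
// It saves errno first, because close() may overwrite it, and then
// reports the original failure as a std::system_error.
[[noreturn]] inline void close_and_throw( int fd )
{
	assert( fd >= 0 );
	auto const saved = errno;  // error from the failed lseek or mmap call
	close( fd );               // result ignored; the earlier failure is reported
	throw std::system_error( std::error_code( saved, std::system_category() ) );
}
```

Marking the function [[noreturn]] is a design choice of this sketch: it lets the compiler see that the code after each call in the constructor is only reached on success.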

The other member function to define is the destructor. In POSIX, the mapping can be released with a single call to munmap.
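
As a sketch, the destructor could be written as shown at the end of the block below. So that the example compiles on its own, the class declaration is repeated together with a condensed constructor; the condensed constructor folds the error paths together and is an assumption of this sketch, not the listing above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <system_error>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

class mmfile
{
public:
	explicit mmfile( char const* filename );
	mmfile( mmfile const& ) = delete;
	~mmfile();

	uint8_t const* begin() const noexcept { return _data; }
	uint8_t const* end() const noexcept { return _data + _size; }

private:
	uint8_t* _data;
	size_t _size;
};

// Condensed constructor, repeated only to make the example self-contained.
mmfile::mmfile( char const* name )
{
	assert( name && *name );

	auto fd = open( name, O_RDONLY );
	if ( fd == -1 )
		throw std::system_error( errno, std::system_category() );

	auto len = lseek( fd, 0, SEEK_END );
	auto map = len == -1
		? MAP_FAILED
		: mmap( nullptr, static_cast<size_t>( len ), PROT_READ, MAP_PRIVATE, fd, 0 );
	auto const saved = errno;  // close() may overwrite errno
	close( fd );
	if ( map == MAP_FAILED )
		throw std::system_error( saved, std::system_category() );

	_data = static_cast<uint8_t*>( map );
	_size = static_cast<size_t>( len );
}

// The destructor: release the mapping with a single call to munmap.
mmfile::~mmfile()
{
	// With the address and length that mmap returned, munmap should not
	// fail, and a destructor should not throw, so the result is ignored.
	munmap( _data, _size );
}
```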

The above code works quite well on POSIX systems, but is not a complete solution. An implementation can be built for WIN32 [5], but requires a few more steps. Additionally, you might want to refer to the implementation of the MemoryBuffer class by LLVM, which can delegate to a memory-mapped buffer for files.

Advantages

There are two advantages to this approach. First, accessing the contents of the file is very easy. The file appears to the program as if it were a read-only array of bytes in memory. There is no need for calls to seek, no worries if records span multiple buffers for the read, etc. Wrap the data in a std::string_view, and go.
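
To illustrate the string_view idea, here is a small sketch. The helper names as_string_view and count_lines are hypothetical, and the code assumes C++17; the pointers would come from begin() and end() of the mapped file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: view a range of bytes as text without copying.
inline std::string_view as_string_view( uint8_t const* first, uint8_t const* last ) noexcept
{
	assert( first <= last );
	// Reading the bytes through char const* is permitted by the aliasing rules.
	return { reinterpret_cast<char const*>( first ),
	         static_cast<size_t>( last - first ) };
}

// Example use: count newline characters with no read buffers and no seeking.
inline size_t count_lines( std::string_view text ) noexcept
{
	size_t lines = 0;
	for ( char c : text ) {
		if ( c == '\n' ) {
			++lines;
		}
	}
	return lines;
}
```

With a mapped file f, the call would be count_lines( as_string_view( f.begin(), f.end() ) ). The view is only valid while the mapping exists, so it must not outlive the mmfile object.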

Second, the operating system becomes responsible for paging data into memory as necessary. This means that there is likely less copying, likely fewer system calls, and likely better memory management by the operating system. However, memory mapped files are not faster in all cases [6].

Disadvantages

Since we are using memory mapped files, the input needs to be an actual file. File descriptors also work with pipes, FIFOs, IO devices, sockets, plus other streams that I’m forgetting. If your program needs to work cleanly with all of these types of streams, then a memory mapped file may not be the correct approach.
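
One way to handle this is to check the type of the input up front and fall back to plain read calls when it is not a regular file. The sketch below uses fstat for that; the helper names is_regular_file and can_map are hypothetical.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: true if fd refers to a regular file, which is the
// case this article's mapping code is written for.
inline bool is_regular_file( int fd ) noexcept
{
	struct stat st;
	return fstat( fd, &st ) == 0 && S_ISREG( st.st_mode );
}

// Hypothetical convenience wrapper for a path.
inline bool can_map( char const* path ) noexcept
{
	assert( path && *path );
	// O_NONBLOCK keeps open() from waiting on a FIFO that has no writer.
	int fd = open( path, O_RDONLY | O_NONBLOCK );
	if ( fd == -1 ) {
		return false;
	}
	bool const regular = is_regular_file( fd );
	close( fd );
	return regular;
}
```

In a real program the check would more likely be made on the descriptor that is about to be mapped, rather than on a second open of the same path, so that the answer cannot change in between.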

There are also likely differences in the presence of concurrency, especially if the file is not local. Unfortunately, it is not specified whether writes to the file that occur after the mapping has been created are visible to the process or not. The same issue applies to accessing data with seek and read through file descriptors, but programmers are likely to see an implicit promise that their array of const bytes is immutable.

Writing

The implementation above is minimal, but memory mapped files essentially provide a std::string_view for reading files. With more effort, a writable version of the file could be provided with an interface matching a std::vector. When the capacity needs to be modified, the file length would be adjusted with ftruncate, and the mapping adjusted.
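
A rough sketch of the resize step might look like the following. The function remap_file is hypothetical, and it assumes the descriptor was opened with O_RDWR and that the mapping is shared, so that stores reach the file.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <system_error>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: set the file length to new_size and replace the
// mapping. Returns the new base address; pointers into the old mapping
// are invalid afterwards, much like iterators after a vector reallocates.
inline void* remap_file( int fd, void* old_map, size_t old_size, size_t new_size )
{
	assert( new_size > 0 );  // mmap rejects a length of zero
	// A writable shared mapping needs a descriptor opened for read and write.
	assert( ( fcntl( fd, F_GETFL ) & O_ACCMODE ) == O_RDWR );

	if ( ftruncate( fd, static_cast<off_t>( new_size ) ) == -1 )
		throw std::system_error( errno, std::system_category() );

	auto map = mmap( nullptr, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
	if ( map == MAP_FAILED )
		throw std::system_error( errno, std::system_category() );

	// Unmap the old region only after the new one exists.
	if ( old_map != nullptr )
		munmap( old_map, old_size );
	return map;
}
```

Unlike the read-only class, a writable version has to keep its file descriptor open so that the length can be adjusted later. Some systems offer a way to resize a mapping in place (mremap on Linux, for example), which this sketch deliberately avoids in order to stay within POSIX.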

Summary

Memory mapped files are not a common tool for reading files, but they provide a nice, simple API for reading. A minimal solution is also quite short. The main drawback compared to file descriptors is an inability to read data from the full range of sources supported. But, if your data is definitely on disk, your application code might be simpler.


  1. In C, this is fopen, fread, and fclose. In C++, this is construction of std::fstream, the read member function, and the destructor, although there are methods to open and close the file independent of object lifetime. In Go, this is os.Open, the method Read, and the method Close.
  2. There is a fifth declaration to explicitly delete the copy constructor, just to make sure there are no surprises.
  3. With a little work, this class could be derived from string_view. While that would provide a complete interface matching best practice, it’s an unnecessary complication here.
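  4. Since we are not writing, close shouldn’t fail with EIO because it is unable to flush the stream. Since there is no possibility of a bug…, EBADF is not possible. However, we technically should be checking if the close was interrupted (EINTR). Whether retrying is then safe depends on the platform: POSIX leaves the state of the descriptor unspecified, and on Linux the descriptor has already been released, so a retry there could close an unrelated file.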
  5. As an entry into the documentation for the WIN32 API, refer to the CreateFileMapping function.
  6. For more information on the performance comparison, there are two Stack Overflow articles (link, and link).
