---
date: 2015-04-05T00:00:00-05:00
title: "Memory mappings, core dumps, GDB and Linux"
tags: [free-software, linux, gdb, fedora-planet, en_us, english]
---
After spending the last weeks struggling with this, I decided to write a
blog post. First, what is “this” that you are talking about? The answer
is: the Linux kernel's concept of memory mapping. I found it utterly
confusing, beyond my expectations, and so I believe that a blog post is
the right way to (a) preserve and (b) share this knowledge. So, let's do
it!

First things first
------------------
First, I cannot begin this post without a few acknowledgements and
“thank you's”. The first goes to Oleg Nesterov (sorry, I could not find
his website), a Linux kernel guru who really helped me a lot through the
whole task. Another “thank you” goes to [Jan
Kratochvil](http://www.jankratochvil.net/), who also provided valuable
feedback by commenting on my GDB patch. Now, back to the point.

The task
--------
The task was requested
[here](https://sourceware.org/bugzilla/show_bug.cgi?id=16092): GDB
needed to respect the `/proc/<PID>/coredump_filter` file when generating
a coredump (i.e., when you use the `gcore` command).
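
To make the file a bit more concrete, here is a minimal Python sketch
that reads it and decodes the mask. The bit positions follow the
`core(5)` manpage; the class and function names are my own, chosen for
illustration, and none of this is GDB code.

```python
import enum
from pathlib import Path


class CoredumpFilter(enum.IntFlag):
    """Bits of /proc/<PID>/coredump_filter (positions from core(5);
    the member names are illustrative, not kernel identifiers)."""
    ANON_PRIVATE = 1 << 0
    ANON_SHARED = 1 << 1
    FILE_PRIVATE = 1 << 2
    FILE_SHARED = 1 << 3
    ELF_HEADERS = 1 << 4
    HUGETLB_PRIVATE = 1 << 5
    HUGETLB_SHARED = 1 << 6


def parse_coredump_filter(text: str) -> CoredumpFilter:
    """Decode the hexadecimal mask found in the file."""
    return CoredumpFilter(int(text.strip(), 16))


def read_coredump_filter(pid: int) -> CoredumpFilter:
    """Read and decode the filter of a running process (Linux only)."""
    path = Path(f"/proc/{pid}/coredump_filter")
    return parse_coredump_filter(path.read_text())
```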

Currently, GDB has its own coredump mechanism which, despite its
limitations and bugs, has been around for quite some time. However, and
maybe you don't know this, the Linux kernel has its own algorithm for
generating the corefile of a process. And unfortunately, GDB and Linux
were not really following the same standards here...

So, in the end, the task was about synchronizing GDB and Linux. To do
that, I first had to decipher the contents of the `/proc/<PID>/smaps`
file.

The `/proc/<PID>/smaps` file
----------------------------
This special file, generated by the Linux kernel when you read it,
contains detailed information about each memory mapping of a given
process. Some of the fields in this file are documented in the `proc(5)`
manpage, but others are missing there (asking for a patch!). Here is an
explanation of everything I needed:
- The first line of each memory mapping has the following format:
  `address perms offset dev inode pathname`. The fields here are:

    a) *address* is the address range, in the process's address space,
    that the mapping occupies. This part was already handled by GDB,
    so I did not have to worry about it.

    b) *perms* is a set of permissions (**r**ead, **w**rite,
    e**x**ecute, **s**hared, **p**rivate [COW -- copy-on-write])
    applied to the memory mapping. GDB was already dealing with
    `rwx` permissions, but I needed to include the `p` flag as well.
    I also made GDB ignore the mappings that did not have the `r`
    flag active, because it does not make sense to dump something
    that you cannot read.

    c) *offset* is the offset into the file, if the mapping is
    file-backed (see below). GDB already handled this correctly.

    d) *dev* is the device (major:minor) related to the file, if there
    is one. GDB already handled this correctly, though I now use
    this field for more things (continue reading).

    e) *inode* is the inode on the device above. The value of zero
    means that no inode is associated with the memory mapping.
    Nothing to do here.

    f) *pathname* is the file associated with this mapping, if there
    is one. This is one of the most important fields that I had to
    use, and one of the most complicated to understand completely.
    GDB now uses this to heuristically identify whether the mapping
    is anonymous or not.
- GDB is now also interested in the `Anonymous:` and `AnonHugePages:`
  fields of the `smaps` file. These fields report how much anonymous
  memory the mapping holds; if GDB finds that either value is greater
  than zero, it treats the mapping as anonymous.
- The last, but perhaps most important field, is the `VmFlags:` field.
  It contains a series of two-letter flags that provide very useful
  information about the mapping. The flags I needed are:

    a) `sh`: the mapping is shared (`VM_SHARED`)

    b) `dd`: this mapping should not be dumped in a corefile
    (`VM_DONTDUMP`)

    c) `ht`: this is a HugeTLB mapping (`VM_HUGETLB`)
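
The fields described above can be pulled out of an `smaps` dump with a
few lines of Python. This is only a sketch of the parsing, not what GDB
does internally (GDB is written in C); the `Mapping` class and
`parse_smaps` function are names I made up for the example, and every
field that GDB does not care about is simply skipped.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Header line: address perms offset dev inode [pathname]
HEADER_RE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+([0-9a-f]+)\s+"
    r"([0-9a-f]+:[0-9a-f]+)\s+(\d+)\s*(.*)$"
)


@dataclass
class Mapping:
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    pathname: str
    anonymous_kb: int = 0
    anon_huge_kb: int = 0
    vmflags: Optional[List[str]] = None  # None if no VmFlags line was seen


def parse_smaps(text: str) -> List[Mapping]:
    """Return one Mapping per header line found in the smaps text."""
    mappings: List[Mapping] = []
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            start, end, perms, offset, dev, inode, pathname = header.groups()
            mappings.append(Mapping(int(start, 16), int(end, 16), perms,
                                    int(offset, 16), dev, int(inode),
                                    pathname.strip()))
            continue
        if not mappings:
            continue  # ignore anything before the first header
        key, _, value = line.partition(":")
        current = mappings[-1]
        if key == "Anonymous":
            current.anonymous_kb = int(value.split()[0])
        elif key == "AnonHugePages":
            current.anon_huge_kb = int(value.split()[0])
        elif key == "VmFlags":
            current.vmflags = value.split()
    return mappings
```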
With that in hand, the next task was to determine whether a memory
mapping is anonymous or file-backed, private or shared.

Types of memory mappings
------------------------
There can be four types of memory mappings:

1. Anonymous private mapping
2. Anonymous shared mapping
3. File-backed private mapping
4. File-backed shared mapping

It should be possible to uniquely identify each mapping based on the
information provided by the `smaps` file; however, you will see that
this is not always the case. Below, I will explain how to determine each
of the four characteristics that define a mapping.

### `Anonymous`
A mapping is anonymous if one of these conditions applies:

1. The `pathname` associated with it is either `/dev/zero (deleted)`,
   `/SYSV%08x (deleted)`, or `<filename> (deleted)` (see below).
2. There is content in the `Anonymous:` or in the `AnonHugePages:`
   fields of the mapping in the `smaps` file.

A special explanation is needed for the `<filename> (deleted)` case. It
does not always identify an anonymous mapping; in fact, file-backed
mappings can carry the `(deleted)` suffix as well (say, when you are
running a program that uses shared libraries, and those libraries have
been removed by an update). However, we are trying to mimic the behavior
of the Linux kernel here, which checks whether the file has no hard
links associated with it (and therefore is truly deleted). Since GDB
cannot easily repeat that check, it treats every `(deleted)` mapping as
anonymous.

Although userspace could do a more extensive check (by `stat`ing the
file, for example), the Linux kernel certainly could give more
information about this.
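
As a rough illustration, the two conditions above fit in a small Python
function. The function name and its parameters are hypothetical (GDB's
real code is C and lives in `linux-tdep.c`), and the sketch encodes only
the two conditions listed in this section.

```python
import re

# "/SYSV%08x (deleted)": a System V shared memory segment, whose key is
# printed as eight hexadecimal digits.
SYSV_RE = re.compile(r"^/SYSV[0-9a-fA-F]{8} \(deleted\)$")


def is_anonymous(pathname: str, anonymous_kb: int = 0,
                 anon_huge_kb: int = 0) -> bool:
    """Heuristic: True if the mapping should be treated as anonymous."""
    # Condition 2: the mapping holds anonymous pages.
    if anonymous_kb > 0 or anon_huge_kb > 0:
        return True
    # Condition 1: the well-known special names...
    if pathname == "/dev/zero (deleted)" or SYSV_RE.match(pathname):
        return True
    # ...or any other file marked as deleted (may be a false positive).
    return pathname.endswith(" (deleted)")
```

With this sketch, `is_anonymous("/usr/lib/libfoo.so (deleted)")` is true
even if the library (a made-up name) was merely replaced by an update,
which is exactly the imprecision discussed above.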
### `File-backed`
A mapping is file-backed (i.e., not anonymous) if:

1. The `pathname` associated with it contains a `<filename>`, without
   the `(deleted)` part.

As has been explained above, a mapping whose `pathname` contains the
`(deleted)` string could still be file-backed, but we decide to consider
it anonymous.

It is also worth mentioning that a mapping can be simultaneously
anonymous and file-backed: this happens when the mapping contains a
valid `pathname` (without the `(deleted)` part), but **also** contains
`Anonymous:` or `AnonHugePages:` contents.
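
One way to see the "both at once" case for yourself is to map a file
privately and write to it: the written page is copied (COW) and should
show up under `Anonymous:` while the mapping keeps its pathname. The
Python sketch below is an experiment of mine, not part of the GDB patch;
`anonymous_kb_for` and `demo` are hypothetical names, and `demo` only
makes sense on Linux.

```python
import mmap
import os
import tempfile


def anonymous_kb_for(smaps_text: str, path: str) -> int:
    """Sum the Anonymous: values of every mapping whose pathname is path."""
    total = 0
    in_target = False
    for line in smaps_text.splitlines():
        fields = line.split(None, 5)
        if not fields:
            continue
        if not fields[0].endswith(":"):
            # Header line: address perms offset dev inode [pathname]
            pathname = fields[5].strip() if len(fields) > 5 else ""
            in_target = pathname == path
        elif in_target and fields[0] == "Anonymous:":
            total += int(fields[1])
    return total


def demo() -> int:
    """Privately map a temporary file, dirty one page, and return the
    Anonymous: total (in kB) reported for that file in our own smaps."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        m = mmap.mmap(f.fileno(), mmap.PAGESIZE, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[0] = 1  # triggers copy-on-write for the first page
            with open("/proc/self/smaps") as smaps:
                return anonymous_kb_for(smaps.read(),
                                        os.path.realpath(f.name))
        finally:
            m.close()
```

Calling `demo()` on a Linux machine should return a value greater than
zero, even though the mapping still has a regular pathname.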
### `Private`
A mapping is considered to be private (i.e., not shared) if:
1. In the absence of the `VmFlags` field (in the `smaps` file), its
permission field has the flag `p`.
2. If the `VmFlags` field is present, then the mapping is private if
we do not find the `sh` flag there.
### `Shared`
A mapping is shared (i.e., not private) if:
1. In the absence of `VmFlags` in the `smaps` file, the permission
field of the mapping does not have the `p` flag. Not having this
flag actually means `VM_MAYSHARE` and not necessarily `VM_SHARED`
(which is what we want), but it is the best approximation we have.
2. If the `VmFlags` field is present, then the mapping is shared if
we find the `sh` flag there.
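
The private/shared rules, including the fallback to the permission
field when `VmFlags` is missing, can be sketched in Python as follows.
The function names are hypothetical; `vmflags` is assumed to be the list
of two-letter flags from the `VmFlags:` line, or `None` when that line
is absent.

```python
from typing import Optional, Sequence


def is_shared(perms: str, vmflags: Optional[Sequence[str]] = None) -> bool:
    """True if the mapping is shared, preferring VmFlags over perms."""
    if vmflags is not None:
        # Precise answer: "sh" corresponds to VM_SHARED.
        return "sh" in vmflags
    # Approximation: no "p" really means VM_MAYSHARE, not VM_SHARED.
    return "p" not in perms


def is_private(perms: str, vmflags: Optional[Sequence[str]] = None) -> bool:
    """A mapping is private exactly when it is not shared."""
    return not is_shared(perms, vmflags)
```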

The patch
---------
With all that in mind, I hacked GDB to improve the coredump mechanism
for GNU/Linux operating systems. The main function that decides which
memory mappings are dumped on GNU/Linux is
[linux_find_memory_regions_full](http://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-tdep.c;h=4af1d01900256164a478a0159b0fcabe86d5549f;hb=HEAD#l1108);
the Linux kernel obviously uses its own function,
[vma_dump_size](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/binfmt_elf.c#n1229),
to do the same thing.
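
To give an idea of the kind of decision both functions make, here is a
deliberately simplified decision table in Python. It is not a
transcription of either function: it ignores the ELF headers bit and the
mappings that are both anonymous and file-backed, the bit positions come
from `core(5)`, and all the names are my own.

```python
# coredump_filter bits, as documented in core(5).
ANON_PRIVATE = 1 << 0
ANON_SHARED = 1 << 1
FILE_PRIVATE = 1 << 2
FILE_SHARED = 1 << 3
HUGETLB_PRIVATE = 1 << 5
HUGETLB_SHARED = 1 << 6


def should_dump(filter_mask: int, *, anonymous: bool, shared: bool,
                dontdump: bool = False, hugetlb: bool = False) -> bool:
    """Decide whether a mapping goes into the corefile (simplified)."""
    if dontdump:
        # The "dd" flag (VM_DONTDUMP) always wins.
        return False
    if hugetlb:
        # "ht" mappings have their own pair of filter bits.
        bit = HUGETLB_SHARED if shared else HUGETLB_PRIVATE
    elif shared:
        bit = ANON_SHARED if anonymous else FILE_SHARED
    else:
        bit = ANON_PRIVATE if anonymous else FILE_PRIVATE
    return bool(filter_mask & bit)
```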

Linux has one advantage: it is a kernel, and therefore has much more
knowledge about processes' internals than a userspace program. For
example, inside Linux it is trivial to check whether a file marked as
"`(deleted)`" in the `smaps` output has no hard links associated with it
(and therefore really is deleted); the same operation in userspace,
however, would require root access to inspect the contents of the
`/proc/<PID>/map_files/` directory.

The case described above, if you remember, is something that impacts the
ability to tell whether a mapping is anonymous or not. I am talking to
the Linux kernel guys to see if it is possible to export this
information directly via the `smaps` file, instead of relying on the
current heuristic.

While doing this work, we found some strange behaviors in the Linux
kernel. Oleg is working on them, along with other Linux hackers. From
our side, there is still room for improvement on this code. The first
thing I can think of is to improve the heuristics for finding anonymous
mappings. Another relatively easy thing to do would be to let the user
specify a value for `coredump_filter` on the command line, without
editing the `/proc` file. And of course, this code must be kept in sync
with its counterpart in the Linux kernel.

Upstream discussions and commit
-------------------------------
If you are interested, you can see the discussions that happened
upstream by going [to this
link](http://sourceware.org/ml/gdb-patches/2015-03/msg00816.html). This
is the fourth (and final) submission of the patch; you should be able to
find the other submissions [in the
archive](http://sourceware.org/ml/gdb-patches/2015-03/authors.html).
The final commit can be found [in the official
repository](http://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=df8411da087dc05481926f4c4a82deabc5bc3859).