Using Binary Diffing to Discover Windows Kernel Memory Disclosure Bugs

Posted by Mateusz Jurczyk of Google Project Zero

Patch diffing is a common technique of comparing two binary builds of the same code: a known-vulnerable one and one containing a security fix. It is often used to determine the technical details behind ambiguously-worded bulletins, and to establish the root causes, attack vectors and potential variants of the vulnerabilities in question. The approach has attracted plenty of research [1][2][3] and tooling development [4][5][6] over the years, and has been shown to be useful for identifying so-called 1-day bugs, which can be exploited against users who are slow to adopt the latest security patches. Overall, the risk of post-patch vulnerability exploitation is inevitable for software which can be freely reverse-engineered, and is thus accepted as a natural part of the ecosystem.

In a similar vein, binary diffing can be used to discover discrepancies between two or more versions of a single product, if they share the same core code and coexist on the market, but are serviced independently by the vendor. One example of such software is the Windows operating system, which currently has three versions under active support: Windows 7, 8 and 10 [7]. While Windows 7 still has a nearly 50% share of the desktop market at the time of this writing [8], Microsoft is known for introducing a number of structural security improvements and sometimes even ordinary bugfixes only to the most recent Windows platform. This creates a false sense of security for users of the older systems, and leaves them vulnerable to software flaws which can be detected merely by spotting subtle changes in the corresponding code in different versions of Windows.

In this blog post, we will show how a very simple form of binary diffing was effectively used to find instances of 0-day uninitialized kernel memory disclosure to user-mode programs. Bugs of this kind can be a useful link in local privilege escalation exploit chains (e.g. to bypass kernel ASLR), or just plainly expose sensitive data stored in the kernel address space. If you are not familiar with the bug class, we recommend checking the slides of the Bochspwn Reloaded talk given at the REcon and Black Hat USA conferences this year as prior reading [9].

Chasing memset calls

Most kernel information disclosures are caused by leaving parts of large memory regions uninitialized before copying them to user-mode, be they structures, unions, arrays or some combination of these constructs. This typically means that the kernel provides a ring-3 program with more output data than there is relevant information, for a number of possible reasons: compiler-inserted padding holes, unused structure/union fields, large fixed-size arrays used for variable-length content etc. In the end, these bugs are rarely fixed by switching to smaller buffers. More often than not, the original behavior is preserved, with the addition of one extra memset function call which pre-initializes the output memory area so it doesn’t contain any leftover stack/heap data. This makes such patches very easy to recognize during reverse engineering.
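To make the padding case concrete, here is a minimal user-mode sketch in Python using ctypes. The Output structure and its field names are hypothetical and are not taken from any Windows header; the point is only that initializing a structure field by field leaves the compiler-inserted padding untouched, while zeroing the whole region first (the memset pattern described above) does not.

```python
import ctypes


class Output(ctypes.Structure):
    # Hypothetical syscall output structure: a 1-byte field followed by a
    # 4-byte field, so alignment padding is inserted between them.
    _fields_ = [
        ("kind", ctypes.c_uint8),
        ("value", ctypes.c_uint32),
    ]


def padding_bytes(struct_type):
    """Return how many bytes of the structure are not covered by any field."""
    covered = sum(ctypes.sizeof(ftype) for _, ftype in struct_type._fields_)
    return ctypes.sizeof(struct_type) - covered


def fill_fields(buf, kind, value):
    """Initialize only the named fields in place, as field-by-field code does."""
    out = Output.from_buffer(buf)
    out.kind = kind
    out.value = value


# A reused memory slot full of stale data, then field-by-field initialization.
stale = bytearray(b"\xAA" * ctypes.sizeof(Output))
fill_fields(stale, 1, 0x11223344)
leaked = bytes(stale[1:4])  # the padding bytes still hold the old contents

# The fix pattern: zero the whole region first, then fill in the fields.
clean = bytearray(ctypes.sizeof(Output))
fill_fields(clean, 1, 0x11223344)
```

Copying `stale` out in full would hand the three padding bytes to the caller, which is the kind of leak the added memset calls prevent.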

When filing issue #1267 in the Project Zero bug tracker (Windows Kernel pool memory disclosure in win32k!NtGdiGetGlyphOutline, found by Bochspwn) and performing some cursory analysis, I realized that the bug was only present in Windows 7 and 8, while it had been internally fixed by Microsoft in Windows 10. The figure below shows the obvious difference between the vulnerable and fixed forms of the code, as decompiled by the Hex-Rays plugin and diffed by Diaphora:

Figure 1. A crucial difference in the implementation of win32k!NtGdiGetGlyphOutline in Windows 7 and 10

Considering how evident the patch was in Windows 10 (a completely new memset call in a top-level syscall handler), I suspected there could be other similar issues lurking in the older kernels that had been silently fixed by Microsoft in the more recent ones. To verify this, I decided to compare the number of memset calls in all top-level syscall handlers (i.e. functions starting with the Nt prefix, implemented by both the core kernel and the graphical subsystem) between Windows 7 and 10, and later between Windows 8.1 and 10. Since in principle this was a very simple analysis, an adequately simple approach could be used to get sufficient results, which is why I decided to perform the diffing against code listings generated by the IDA Pro disassembler.

When doing so, I quickly found out that each memory zeroing operation found in the kernel is compiled in one of three ways: with a direct call to the memset function, its inlined form implemented with the rep stosd x86 instruction, or an unrolled series of mov x86 instructions:

Figure 2. A direct memset function call to reset memory in nt!NtCreateJobObject (Windows 7)

Figure 3. Inlined memset code used to reset memory in nt!NtRequestPort (Windows 7)

Figure 4. A series of mov instructions used to reset memory in win32k!NtUserRealInternalGetMessage (Windows 8.1)

The two most common cases (memset calls and rep stosd) are both decompiled to regular invocations of memset() by the Hex-Rays decompiler:

Figures 5 and 6. A regular memset call is indistinguishable from an inlined rep stosd construct in the Hex-Rays view

Unfortunately, a sequence of mov’s with a zeroed-out register as the source operand is not recognized by Hex-Rays as a memset yet, but the number of such occurrences is relatively low, and hence can be neglected until we manually deal with any resulting false-positives later in the process. In the end, we decided to perform the diffing using decompiled .c files instead of regular assembly, just to make our life a bit easier.

A complete list of steps we followed to arrive at the final result is shown below. We repeated them twice, first for Windows 7/10 and then for Windows 8.1/10:

  1. Decompiled ntkrnlpa.exe and win32k.sys from Windows 7 and 8.1 to their .c counterparts with Hex-Rays, and did the same with ntoskrnl.exe, tm.sys, win32kbase.sys and win32kfull.sys from Windows 10.
  2. Extracted a list of kernel functions containing memset references (taking their quantity into account too), and sorted them alphabetically.
  3. Performed a regular textual diff against the two lists, and chose the functions which had more memset references on Windows 10.
  4. Filtered the output of the previous step against the list of functions present in the older kernels (7 or 8.1, again pulled from IDA Pro), to make sure that we didn’t include routines which were only introduced in the latest system.
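Steps 2 to 4 can be approximated with a short script. The sketch below is an illustration, not the tooling actually used for this research. It assumes that every function in the decompiled .c output starts with an unindented signature line and that body lines are indented; the regular expressions, function names and sample inputs are all made up for the example.

```python
import re
from collections import Counter

# Assumption: a function starts with an unindented line such as
# "int __stdcall NtFoo(int a1)" and the lines of its body are indented.
FUNC_HEADER = re.compile(r"^[A-Za-z_].*?\b([A-Za-z_]\w*)\s*\(.*\)\s*$")
MEMSET_CALL = re.compile(r"\bmemset\s*\(")


def memset_counts(pseudocode):
    """Return (function names, Counter of memset references per function)."""
    names, counts, current = set(), Counter(), None
    for line in pseudocode.splitlines():
        header = FUNC_HEADER.match(line)
        if header:
            current = header.group(1)
            names.add(current)
        elif current is not None:
            counts[current] += len(MEMSET_CALL.findall(line))
    return names, counts


def new_memsets(old_src, new_src):
    """Functions present in both builds with more memset references in the newer one."""
    old_names, old_counts = memset_counts(old_src)
    _, new_counts = memset_counts(new_src)
    return sorted(
        name
        for name, count in new_counts.items()
        if name in old_names and count > old_counts[name]
    )


# Tiny made-up inputs standing in for the decompiled kernels.
OLD_BUILD = """\
int __stdcall NtGdiExample(int a1)
{
  char buf[64];
  return Fill(buf, a1);
}
int __stdcall NtUserOther(int a1)
{
  memset(&v1, 0, 0x10u);
  return 0;
}
"""

NEW_BUILD = """\
int __stdcall NtGdiExample(int a1)
{
  char buf[64];
  memset(buf, 0, 0x40u);
  return Fill(buf, a1);
}
int __stdcall NtUserOther(int a1)
{
  memset(&v1, 0, 0x10u);
  return 0;
}
int __stdcall NtBrandNew(int a1)
{
  memset(&v2, 0, 8u);
  return 0;
}
"""

candidates = new_memsets(OLD_BUILD, NEW_BUILD)
```

Here only NtGdiExample is reported: NtUserOther has the same number of memset references in both builds, and NtBrandNew is dropped because it does not exist in the older one. Every candidate still needs manual review.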

In numbers, we ended up with the following results:

                    | ntoskrnl functions | ntoskrnl syscall handlers | win32k functions | win32k syscall handlers
Windows 7 vs. 10    |                    |                           |                  |
Windows 8.1 vs. 10  |                    |                           |                  |
Table 1. Number of old functions with new memset usage in Windows 10, relative to previous system editions

Quite intuitively, the Windows 7/10 comparison yielded more differences than the Windows 8.1/10 one, as the system progressively evolved from one version to the next. It’s also interesting to see that the graphical subsystem had fewer changes detected in general, but more than the core kernel specifically in the syscall handlers. Once we knew the candidates, we manually investigated each of them in detail, discovering two new vulnerabilities in the win32k!NtGdiGetFontResourceInfoInternalW and win32k!NtGdiEngCreatePalette system services. Both of them were addressed in the September Patch Tuesday, and since they have some unique characteristics, we will discuss each of them in the subsequent sections.

win32k!NtGdiGetFontResourceInfoInternalW (CVE-2017-8684)

The inconsistent memset which gave away the existence of the bug is as follows:

Figure 8. A new memset added in win32k!NtGdiGetFontResourceInfoInternalW in Windows 10

This was a stack-based kernel memory disclosure of about 0x5c (92) bytes. The structure of the function follows a common optimization scheme used in Windows, where a local buffer located on the stack is used for short syscall outputs, and the pool allocator is only invoked for larger ones. The relevant snippet of pseudocode is shown below:

Figure 9. Optimized memory usage found in the syscall handler

It’s interesting to note that even in the vulnerable form of the routine, memory disclosure was only possible when the first (stack) branch was taken, and thus only for requested buffer sizes of up to 0x5c bytes. That’s because the dynamic PALLOCMEM pool allocator does zero out the requested memory before returning it to the caller:

Figure 10. PALLOCMEM always resets allocated memory

Furthermore, the issue is also a great example of how another peculiar behavior in interacting with user-mode may contribute to the introduction of a security flaw (see slides 32-33 of the Bochspwn Reloaded deck). The code pattern at fault is as follows:

  1. Allocate a temporary output buffer based on a user-specified size (dubbed a4 in this case), as discussed above.
  2. Have the requested information written to the kernel buffer by calling an internal win32k!GetFontResourceInfoInternalW function.
  3. Write the contents of the entire temporary buffer back to ring-3, regardless of how much data was actually filled out by win32k!GetFontResourceInfoInternalW.

Here, the vulnerable win32k!NtGdiGetFontResourceInfoInternalW handler actually “knows” the length of meaningful data (it is even passed back to the user-mode caller through the 5th syscall parameter), but it still decides to copy the full amount of memory requested by the client, even though this is completely unnecessary for the correct functioning of the syscall:

Figure 11. There are v10 output bytes, but the function copies the full a4 buffer size.

The combination of a lack of buffer pre-initialization and allowing the copying of redundant bytes is what makes this an exploitable security bug. In the proof-of-concept program, we used an undocumented information class 5, which only writes to the first four bytes of the output buffer, leaving the remaining 88 uninitialized and ready to be disclosed to the attacker.
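The faulty pattern can be modelled in a few lines of Python. This is a simplified simulation, not the actual win32k code: get_info_model, its parameters and the stale_stack argument (standing in for whatever earlier calls left in the stack slot) are names invented for the example. Only the 0x5c threshold, the zeroed pool path and the 4-byte write come from the description above.

```python
STACK_BUF_SIZE = 0x5C  # size of the on-stack buffer mentioned in the text


def get_info_model(requested_size, info, stale_stack, zero_first):
    """Return the bytes a caller would receive for a request of requested_size."""
    if requested_size > STACK_BUF_SIZE:
        # Pool path: the allocator hands back zeroed memory.
        buf = bytearray(requested_size)
    else:
        # Stack path: the buffer starts out with leftover data.
        buf = bytearray(stale_stack[:requested_size])
        if zero_first:
            buf[:] = bytes(requested_size)  # models the added memset
    info = info[:requested_size]
    buf[:len(info)] = info  # only len(info) bytes are meaningful...
    return bytes(buf)       # ...but the whole buffer is copied back


stale = b"\xCC" * STACK_BUF_SIZE
four_bytes = b"\x01\x00\x00\x00"  # an information class that fills 4 bytes

vulnerable = get_info_model(0x5C, four_bytes, stale, zero_first=False)
fixed = get_info_model(0x5C, four_bytes, stale, zero_first=True)
pool_path = get_info_model(0x100, four_bytes, stale, zero_first=False)
```

In the vulnerable case the trailing 88 bytes of the result are the stale 0xCC filler, while in the other two cases they are zero.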

win32k!NtGdiEngCreatePalette (CVE-2017-8685)

In this case, the vulnerability was fixed in Windows 8 by introducing the following memset into the syscall handler, while still leaving Windows 7 exposed:

Figure 12. A new memset added in win32k!NtGdiEngCreatePalette in Windows 8

The system call in question is responsible for creating a kernel GDI palette object consisting of N 4-byte color entries, for a user-controlled N. Again, a memory usage optimization is employed by the implementation: if N is less than or equal to 256 (1024 bytes in total), these items are read from user-mode to a kernel stack buffer using win32k!bSafeReadBits; otherwise, they are just locked in ring-3 memory by calling win32k!bSecureBits. As you can guess, the memory region with the extra memset applied to it is the local buffer used to temporarily store a list of user-defined RGB colors, and it is later passed to win32k!EngCreatePalette to actually create the palette object. The question is, how do we have the buffer remain uninitialized but still passed for the creation of a non-empty palette? The answer lies in the implementation of the win32k!bSafeReadBits routine:

Figure 13. Function body of win32k!bSafeReadBits

As you can see in the decompiled listing above, the function completes successfully without performing any actual work if either the source or destination pointer is NULL. Here, the source address comes directly from the syscall’s 3rd argument, which doesn’t undergo any prior sanitization. This means that we can make the syscall think it has successfully captured an array of up to 256 elements from user-mode, while in reality the stack buffer isn’t written to at all. This is achieved with the following system call invocation in our proof-of-concept program:

HPALETTE hpal = (HPALETTE)SystemCall32(__NR_NtGdiEngCreatePalette, PAL_INDEXED, 256, NULL, 0.0f, 0.0f, 0.0f);

Once the syscall returns, we receive a handle to the palette which internally stores the leaked stack memory. In order to read it back to our program, one more call to the GetPaletteEntries API is needed. To reiterate the severity of the bug, its exploitation allows an attacker to disclose an entire 1 kB of uninitialized kernel stack memory, which is a very powerful primitive to have in one’s arsenal.
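The interaction between the early-success path and the uninitialized stack buffer can be sketched as follows. This is a toy model of the N <= 256 branch only: safe_read_bits and create_palette_model are invented names that mirror the behavior described above, and a None argument plays the role of the NULL pointer.

```python
def safe_read_bits(dst, src, count):
    """Copy count bytes, but report success without copying if a pointer is missing."""
    if dst is None or src is None:
        return True
    dst[:count] = src[:count]
    return True


def create_palette_model(n_colors, user_colors, stale_stack, zero_first=False):
    """Return the bytes that end up stored in the palette (stack-buffer path only)."""
    size = n_colors * 4
    buf = bytearray(stale_stack[:size])  # local buffer with leftover stack data
    if zero_first:
        buf[:] = bytes(size)             # models the memset added in Windows 8
    if not safe_read_bits(buf, user_colors, size):
        return None
    return bytes(buf)                    # later readable through the palette handle


stale = b"\x41" * 1024
leaked = create_palette_model(256, None, stale)
patched = create_palette_model(256, None, stale, zero_first=True)
normal = create_palette_model(2, b"\x10\x20\x30\x00\x40\x50\x60\x00", stale)
```

With a missing source pointer the palette ends up holding all 1024 stale bytes, unless the buffer is zeroed first.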

In addition to the memory disclosure itself, other interesting quirks can be observed in the nearby code area. If you look closely at the code of win32k!NtGdiEngCreatePalette in Windows 8.1 and 10, you will spot an interesting disparity between them: the stack array is fully reset in both cases, but this is achieved in different ways. On Windows 8.1, the function “manually” sets the first DWORD to 0 and then calls memset() on the remaining 0x3FC bytes, while Windows 10 just plainly memsets the whole 0x400-byte area. The reason for this is quite unclear, and even though the end result is the same, the discrepancy provokes the idea that not just the existence of memset calls can be compared across Windows versions, but also possibly the size operands of those calls.

Figure 14. Different code constructs used to zero out a 256-item array on Windows 8.1 and 10

On a last related note, the win32k!NtGdiEngCreatePalette syscall may also be quite useful for stack spraying purposes during kernel exploitation, as it allows programs to easily write 1024 controlled bytes to a continuous area of the stack. While the buffer size is smaller than what e.g. nt!NtMapUserPhysicalPages has to offer, the buffer itself ends at a higher offset relative to the stack frame of the top-level syscall handler, which can make an important difference in certain scenarios.


The aim of this blog post was to illustrate that security-relevant differences in concurrently supported branches of a single product may be used by malicious actors to pinpoint significant weaknesses or just regular bugs in the more dated versions of said software. Not only does it leave some customers exposed to attacks, but it also visibly reveals what the attack vectors are, which works directly against user security. This is especially true for bug classes with obvious fixes, such as kernel memory disclosure and the added memset calls. The “binary diffing” process discussed in this post was in fact pseudocode-level diffing that didn’t require much low-level expertise or knowledge of the operating system internals. It could have been easily used by non-advanced attackers to identify the three mentioned vulnerabilities (CVE-2017-8680, CVE-2017-8684, CVE-2017-8685) with very little effort. We hope that these were some of the very few instances of such “low hanging fruit” being accessible to researchers through diffing, and we encourage software vendors to make sure of it by applying security improvements consistently across all supported versions of their software.



Exploiting the Linux kernel via packet sockets

Guest blog post, posted by Andrey Konovalov


Lately I’ve been spending some time fuzzing network-related Linux kernel interfaces with syzkaller. Besides the recently discovered vulnerability in DCCP sockets, I also found another one, this time in packet sockets. This post describes how the bug was discovered and how we can exploit it to escalate privileges.

The bug itself (CVE-2017-7308) is a signedness issue, which leads to an exploitable heap-out-of-bounds write. It can be triggered by providing specific parameters to the PACKET_RX_RING option on an AF_PACKET socket with a TPACKET_V3 ring buffer version enabled. As a result, the following sanity check in the packet_set_ring() function in net/packet/af_packet.c can be bypassed, which later leads to an out-of-bounds access.
        if (po->tp_version >= TPACKET_V3 &&
            (int)(req->tp_block_size -
                  BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                goto out;
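The arithmetic behind the bypass can be replayed with a few lines of Python that emulate 32-bit C integers. This is a sketch under stated assumptions: BLK_HDR_LEN is taken to be 48 bytes, BLK_PLUS_PRIV is taken to add the 8-byte-aligned private area size to it, and the concrete block and private area sizes are illustrative values chosen for the example rather than ones quoted above.

```python
BLK_HDR_LEN = 48  # assumption: aligned size of the block descriptor header


def u32(x):
    """Truncate to an unsigned 32-bit value."""
    return x & 0xFFFFFFFF


def to_int32(x):
    """Reinterpret the low 32 bits as a signed int, like a C (int) cast."""
    x = u32(x)
    return x - (1 << 32) if x & 0x80000000 else x


def align8(x):
    """Round up to a multiple of 8 within 32 bits."""
    return u32((x + 7) & ~7)


def blk_plus_priv(sizeof_priv):
    """Assumed shape of BLK_PLUS_PRIV: header length plus aligned private size."""
    return BLK_HDR_LEN + align8(sizeof_priv)


def check_rejects(block_size, sizeof_priv):
    """True when the quoted check would take the 'goto out' path."""
    return to_int32(block_size - blk_plus_priv(sizeof_priv)) <= 0


# A private area larger than the block is rejected, as intended.
too_big = check_rejects(0x1000, 0x2000)

# With the high bit set in tp_sizeof_priv, the difference wraps around and
# the cast to int yields a large positive number, so the check passes.
wrapped = to_int32(0x1000 - blk_plus_priv(0x80001000))
bypassed = not check_rejects(0x1000, 0x80001000)
```

The cast to int is what turns an enormous unsigned private area size into an apparently valid positive remainder.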
The bug was introduced on Aug 19, 2011 in the commit f6fb8f10 (“af-packet: TPACKET_V3 flexible buffer implementation”) together with the TPACKET_V3 implementation. There was an attempt to fix it on Aug 15, 2016 in commit dc808110 (“packet: handle too big packets for PACKET_V3”) by adding additional checks, but this was not sufficient, as shown below. The bug was fixed in 2b6867c2 (“net/packet: fix overflow in check for priv area size”) on Mar 29, 2017.
The bug affects a kernel if it has AF_PACKET sockets enabled (CONFIG_PACKET=y), which is the case for many Linux kernel distributions. Exploitation requires the CAP_NET_RAW privilege to be able to create such sockets. However, it’s possible to do that from a user namespace if they are enabled (CONFIG_USER_NS=y) and accessible to unprivileged users.
Since packet sockets are a quite widely used kernel feature, this vulnerability affects a number of popular Linux kernel distributions including Ubuntu and Android. It should be noted that access to AF_PACKET sockets is expressly disallowed to any untrusted code within Android, although it is available to some privileged components. Updated Ubuntu kernels are already out; Android’s update is scheduled for July.


The bug was found with syzkaller, a coverage-guided syscall fuzzer, and KASAN, a dynamic memory error detector. I’m going to provide some details on how syzkaller works and how to use it for fuzzing some kernel interface in case someone decides to try this.
Let’s start with a quick overview of how the syzkaller fuzzer works. Syzkaller is able to generate random programs (sequences of syscalls) based on manually written template descriptions for each syscall. The fuzzer executes these programs and collects code coverage for each of them. Using the coverage information, syzkaller keeps a corpus of programs, which trigger different code paths in the kernel. Whenever a new program triggers a new code path (i.e. gives new coverage), syzkaller adds it to the corpus. Besides generating completely new programs, syzkaller is able to mutate the existing ones from the corpus.
Syzkaller is meant to be used together with dynamic bug detectors like KASAN (detects memory bugs like out-of-bounds and use-after-frees, available upstream since 4.0), KMSAN (detects uses of uninitialized memory, prototype was just released) or KTSAN (detects data races, prototype is available). The idea is that syzkaller stresses the kernel and executes various interesting code paths and the detectors detect and report bugs.
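The corpus logic described above can be illustrated with a toy loop. This is not syzkaller code: the fuzz function, the integer "syscalls" and the pair-based coverage metric are all invented for the sketch, which only shows the idea of keeping a program when it reaches coverage that has not been seen before, and of mixing fresh generation with mutation of corpus entries.

```python
import random


def fuzz(execute, generate, mutate, rounds, seed=0):
    """Keep every program that reaches coverage not seen before."""
    rng = random.Random(seed)
    corpus, seen = [], set()
    for _ in range(rounds):
        if corpus and rng.random() < 0.5:
            prog = mutate(rng.choice(corpus), rng)  # mutate an existing program
        else:
            prog = generate(rng)                    # or generate a new one
        coverage = execute(prog)
        if not coverage <= seen:                    # any new code path?
            seen |= coverage
            corpus.append(prog)
    return corpus, seen


# Toy target: a "program" is a list of call numbers, and "coverage" is the
# set of consecutive call pairs it exercises.
def toy_generate(rng):
    return [rng.randrange(4) for _ in range(3)]


def toy_mutate(prog, rng):
    out = list(prog)
    out[rng.randrange(len(out))] = rng.randrange(4)
    return out


def toy_execute(prog):
    return {(a, b) for a, b in zip(prog, prog[1:])}


corpus, seen = fuzz(toy_execute, toy_generate, toy_mutate, rounds=200)
```

In the real fuzzer the coverage comes from the kernel and the programs are sequences of described syscalls, but the feedback loop has the same shape.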
The usual workflow for finding bugs with syzkaller is as follows:
  1. Set up syzkaller and make sure it works. The README and wiki provide quite extensive information on how to do that.
  2. Write template descriptions for a particular kernel interface you want to test.
  3. Specify the syscalls that are used in this interface in the syzkaller config.
  4. Run syzkaller until it finds bugs. Usually this happens quite fast for interfaces that haven’t been tested with it previously.
Syzkaller uses its own declarative language to describe syscall templates. Check out sys/sys.txt for an example or sys/ for information on the syntax. Here’s an excerpt from the syzkaller descriptions for AF_PACKET sockets that I used to discover the bug:
resource sock_packet[sock]
define ETH_P_ALL_BE htons(ETH_P_ALL)
socket$packet(domain const[AF_PACKET], type flags[packet_socket_type], proto const[ETH_P_ALL_BE]) sock_packet
packet_socket_type = SOCK_RAW, SOCK_DGRAM
setsockopt$packet_rx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_RX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
setsockopt$packet_tx_ring(fd sock_packet, level const[SOL_PACKET], optname const[PACKET_TX_RING], optval ptr[in, tpacket_req_u], optlen len[optval])
tpacket_req {
    tp_block_size int32
    tp_block_nr int32
    tp_frame_size int32
    tp_frame_nr int32
}
tpacket_req3 {
    tp_block_size int32
    tp_block_nr int32
    tp_frame_size int32
    tp_frame_nr int32
    tp_retire_blk_tov int32
    tp_sizeof_priv int32
    tp_feature_req_word int32
}
tpacket_req_u [
    req tpacket_req
    req3 tpacket_req3
] [varlen]
The syntax is mostly self-explanatory. First, we declare a new type sock_packet. This type is inherited from an existing type sock. That way syzkaller will use syscalls which have arguments of type sock on sock_packet sockets as well.
After that, we declare a new syscall socket$packet. The part before the $ sign tells syzkaller what syscall it should use, and the part after the $ sign is used to differentiate between different kinds of the same syscall. This is particularly useful when dealing with syscalls like ioctl. The socket$packet syscall returns a sock_packet socket.
Then setsockopt$packet_rx_ring and setsockopt$packet_tx_ring are declared. These syscalls set the PACKET_RX_RING and PACKET_TX_RING socket options on a sock_packet socket. I’ll talk about these options in detail below. Both of them use the tpacket_req_u union as a socket option value. This union has two struct members, tpacket_req and tpacket_req3.
Once the descriptions are added, syzkaller can be instructed to fuzz packet-related syscalls specifically. This is what I provided in the syzkaller manager config:
"enable_syscalls": [
    "socket$packet", "socketpair$packet", "accept$packet", "accept4$packet", "bind$packet", "connect$packet", "sendto$packet", "recvfrom$packet", "getsockname$packet", "getpeername$packet", "listen", "setsockopt", "getsockopt", "syz_emit_ethernet"
]
After a few minutes of running syzkaller with these descriptions I started getting kernel crashes. Here’s one of the syzkaller programs that triggered the mentioned bug:
mmap(&(0x7f0000000000/0xc8f000)=nil, (0xc8f000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = socket$packet(0x11, 0x3, 0x300)
setsockopt$packet_int(r0, 0x107, 0xa, &(0x7f000061f000)=0x2, 0x4)
setsockopt$packet_rx_ring(r0, 0x107, 0x5, &(0x7f0000c8b000)=@req3={0x10000, 0x3, 0x10000, 0x3, 0x4, 0xfffffffffffffffe, 0x5}, 0x1c)
And here’s one of the KASAN reports. It should be noted that since the access is quite far past the block bounds, the allocation and deallocation stacks don’t correspond to the overflown object.
BUG: KASAN: slab-out-of-bounds in prb_close_block net/packet/af_packet.c:808
Write of size 4 at addr ffff880054b70010 by task syz-executor0/30839
CPU: 0 PID: 30839 Comm: syz-executor0 Not tainted 4.11.0-rc2+ #94
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x292/0x398 lib/dump_stack.c:52
print_address_description+0x73/0x280 mm/kasan/report.c:246
kasan_report_error mm/kasan/report.c:345 [inline]
kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:368
kasan_report mm/kasan/report.c:393 [inline]
__asan_report_store4_noabort+0x2c/0x30 mm/kasan/report.c:393
prb_close_block net/packet/af_packet.c:808 [inline]
prb_retire_current_block+0x6ed/0x820 net/packet/af_packet.c:970
__packet_lookup_frame_in_block net/packet/af_packet.c:1093 [inline]
packet_current_rx_frame net/packet/af_packet.c:1122 [inline]
tpacket_rcv+0x9c1/0x3750 net/packet/af_packet.c:2236
packet_rcv_fanout+0x527/0x810 net/packet/af_packet.c:1493
deliver_skb net/core/dev.c:1834 [inline]
__netif_receive_skb_core+0x1cff/0x3400 net/core/dev.c:4117
__netif_receive_skb+0x2a/0x170 net/core/dev.c:4244
netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4272
netif_receive_skb+0xae/0x3b0 net/core/dev.c:4296
tun_rx_batched.isra.39+0x5e5/0x8c0 drivers/net/tun.c:1155
tun_get_user+0x100d/0x2e20 drivers/net/tun.c:1327
tun_chr_write_iter+0xd8/0x190 drivers/net/tun.c:1353
call_write_iter include/linux/fs.h:1733 [inline]
new_sync_write fs/read_write.c:497 [inline]
__vfs_write+0x483/0x760 fs/read_write.c:510
vfs_write+0x187/0x530 fs/read_write.c:558
SYSC_write fs/read_write.c:605 [inline]
SyS_write+0xfb/0x230 fs/read_write.c:597
RIP: 0033:0x40b031
RSP: 002b:00007faacbc3cb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 000000000040b031
RDX: 000000000000002a RSI: 0000000020002fd6 RDI: 0000000000000015
RBP: 00000000006e2960 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000708000
R13: 000000000000002a R14: 0000000020002fd6 R15: 0000000000000000
Allocated by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:617
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
slab_post_alloc_hook mm/slab.h:456 [inline]
slab_alloc_node mm/slub.c:2720 [inline]
slab_alloc mm/slub.c:2728 [inline]
kmem_cache_alloc+0x1af/0x250 mm/slub.c:2733
getname_flags+0xcb/0x580 fs/namei.c:137
getname+0x19/0x20 fs/namei.c:208
do_sys_open+0x2ff/0x720 fs/open.c:1045
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064
Freed by task 30534:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:513
set_track mm/kasan/kasan.c:525 [inline]
kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:590
slab_free_hook mm/slub.c:1358 [inline]
slab_free_freelist_hook mm/slub.c:1381 [inline]
slab_free mm/slub.c:2963 [inline]
kmem_cache_free+0xb5/0x2d0 mm/slub.c:2985
putname+0xee/0x130 fs/namei.c:257
do_sys_open+0x336/0x720 fs/open.c:1060
SYSC_open fs/open.c:1069 [inline]
SyS_open+0x2d/0x40 fs/open.c:1064
Object at ffff880054b70040 belongs to cache names_cache of size 4096
The buggy address belongs to the page:
page:ffffea000152dc00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 0000000000000000 0000000000000000 0000000100070007
raw: ffffea0001549a20 ffffea0001b3cc20 ffff88003eb44f40 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff880054b6ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880054b6ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880054b70000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff880054b70080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880054b70100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
You can find more details about syzkaller in its repository and more details about KASAN in the kernel documentation. If you decide to try syzkaller or KASAN and run into any trouble, drop an email to syzkaller@googlegroups.com or to kasan-dev@googlegroups.com.

Introduction to AF_PACKET sockets

To better understand the bug, the vulnerability it leads to and how to exploit it, we need to understand what AF_PACKET sockets are and how they are implemented in the kernel.


AF_PACKET sockets allow users to send or receive packets on the device driver level. This, for example, lets them implement their own protocol on top of the physical layer or sniff packets including Ethernet and higher-level protocol headers. To create an AF_PACKET socket a process must have the CAP_NET_RAW capability in the user namespace that governs its network namespace. More details can be found in the packet sockets documentation. It should be noted that if a kernel has unprivileged user namespaces enabled, then an unprivileged user is able to create packet sockets.
To send and receive packets on a packet socket, a process can use the send and recv syscalls. However, packet sockets provide a way to do this faster by using a ring buffer that’s shared between the kernel and the userspace. A ring buffer can be created via the PACKET_TX_RING and PACKET_RX_RING socket options. The ring buffer can then be mmapped by the user and the packet data can then be read or written directly to it.
There are a few different variants of the way the ring buffer is handled by the kernel. This variant can be chosen by the user by using the PACKET_VERSION socket option. The difference between ring buffer versions can be found in the kernel documentation (search for “TPACKET versions”).
One of the widely known users of AF_PACKET sockets is the tcpdump utility. This is roughly what happens when tcpdump is used to sniff all packets on a particular interface:
# strace tcpdump -i eth0
socket(PF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65616, frame_nr=31}, 16) = 0
mmap(NULL, 4063232, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f73a6817000
This sequence of syscalls corresponds to the following actions:
  1. A socket is created via socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)).
  2. The socket is bound to the eth0 interface.
  3. The ring buffer version is set to TPACKET_V2 via the PACKET_VERSION socket option.
  4. A ring buffer is created via the PACKET_RX_RING socket option.
  5. The ring buffer is mmapped into userspace.
After that the kernel will start putting all packets coming through the eth0 interface in the ring buffer and tcpdump will read them from the mmapped region in the userspace.
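The numbers in the strace output above fit together in a simple way. As a minimal sketch (the struct and function names are hypothetical and only mirror the four fields visible in the PACKET_RX_RING call, they are not the kernel's or tcpdump's code), the following C snippet recomputes the frame count and the mmap length from the block and frame sizes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the four fields seen in the PACKET_RX_RING call. */
struct ring_geometry {
    unsigned int block_size;  /* bytes per block */
    unsigned int block_nr;    /* number of blocks */
    unsigned int frame_size;  /* bytes per frame (fixed in TPACKET_V1/V2) */
    unsigned int frame_nr;    /* total number of frames */
};

/* Fill in frame_nr so that it is consistent with the other three fields. */
void ring_geometry_fill(struct ring_geometry *g)
{
    assert(g != NULL);
    assert(g->frame_size != 0 && g->frame_size <= g->block_size);
    unsigned int frames_per_block = g->block_size / g->frame_size;
    g->frame_nr = frames_per_block * g->block_nr;
}

/* Length to pass to mmap() for a receive-only ring: all blocks back to back. */
size_t ring_mmap_length(const struct ring_geometry *g)
{
    assert(g != NULL);
    return (size_t)g->block_size * g->block_nr;
}
```

With block_size=131072, block_nr=31 and frame_size=65616 only one frame fits into each block, so frame_nr is 31 and the mapping is 31 * 131072 = 4063232 bytes long, which matches the mmap call in the trace.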

Ring buffers

Let’s see how to use ring buffers for packet sockets. For consistency, all of the kernel code snippets below come from Linux kernel 4.8. This is the version the latest Ubuntu 16.04.2 kernel is based on.
The existing documentation mostly focuses on the TPACKET_V1 and TPACKET_V2 ring buffer versions. Since the mentioned bug only affects the TPACKET_V3 version, I’m going to assume that we deal with that particular version for the rest of the post. I’m also going to mostly focus on PACKET_RX_RING and ignore PACKET_TX_RING.
A ring buffer is a memory region used to store packets. Each packet is stored in a separate frame. Frames are grouped into blocks. In TPACKET_V3 ring buffers the frame size is not fixed and can have an arbitrary value as long as a frame fits into a block.
To create a TPACKET_V3 ring buffer via the PACKET_RX_RING socket option a user must provide the exact parameters for the ring buffer. These parameters are passed to the setsockopt call via a pointer to a request struct called tpacket_req3, which is defined as:
struct tpacket_req3 {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
        unsigned int    tp_retire_blk_tov; /* timeout in msecs */
        unsigned int    tp_sizeof_priv; /* offset to private data area */
        unsigned int    tp_feature_req_word;
};
Here’s what each field means in the tpacket_req3 struct:
  1. tp_block_size – the size of each block.
  2. tp_block_nr – the number of blocks.
  3. tp_frame_size – the size of each frame, ignored for TPACKET_V3.
  4. tp_frame_nr – the number of frames, ignored for TPACKET_V3.
  5. tp_retire_blk_tov – timeout after which a block is retired, even if it’s not fully filled with data (see below).
  6. tp_sizeof_priv – the size of per-block private area. This area can be used by a user to store arbitrary information associated with each block.
  7. tp_feature_req_word – a set of flags (actually just one at the moment), which enables some additional functionality.
Each block has an associated header, which is stored at the very beginning of the memory area allocated for the block. The block header struct is called tpacket_block_desc and has a block_status field, which indicates whether the block is currently being used by the kernel or available to the user. The usual workflow is that the kernel stores packets into a block until it’s full and then sets block_status to TP_STATUS_USER. The user then reads required data from the block and releases it back to the kernel by setting block_status to TP_STATUS_KERNEL.
struct tpacket_hdr_v1 {
        __u32   block_status;
        __u32   num_pkts;
        __u32   offset_to_first_pkt;
        /* ... */
};

union tpacket_bd_header_u {
        struct tpacket_hdr_v1 bh1;
};

struct tpacket_block_desc {
        __u32 version;
        __u32 offset_to_priv;
        union tpacket_bd_header_u hdr;
};
Each frame also has an associated header described by the struct tpacket3_hdr. The tp_next_offset field points to the next frame within the same block.
struct tpacket3_hdr {
        __u32 tp_next_offset;
        /* ... */
};
When a block is fully filled with data (a new packet doesn’t fit into the remaining space), it’s closed and released to userspace or “retired” by the kernel. Since the user usually wants to see packets as soon as possible, the kernel can release a block even if it’s not filled with data completely. This is done by setting up a timer that retires the current block with a timeout controlled by the tp_retire_blk_tov parameter.
There’s also a way to specify a per-block private area, which the kernel won’t touch and the user can use to store any information associated with a block. The size of this area is passed via the tp_sizeof_priv parameter.
If you’d like to better understand how a userspace program can use a TPACKET_V3 ring buffer, you can read the example provided in the documentation (search for “TPACKET_V3 example”).
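The block and frame headers described above are enough to sketch what a userspace consumer does with a retired block. The snippet below is only an illustration under stated assumptions: it works on a plain byte buffer instead of a real mmapped ring, the byte offsets are derived from the struct excerpts shown earlier, the STATUS_* constants stand in for TP_STATUS_KERNEL and TP_STATUS_USER, and consume_block() and its callback type are hypothetical names. A real consumer would also need memory barriers around the status accesses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the fields shown in the struct excerpts above. */
enum {
    OFF_BLOCK_STATUS = 8,   /* tpacket_block_desc.hdr.bh1.block_status */
    OFF_NUM_PKTS     = 12,  /* tpacket_block_desc.hdr.bh1.num_pkts */
    OFF_FIRST_PKT    = 16,  /* tpacket_block_desc.hdr.bh1.offset_to_first_pkt */
    OFF_NEXT_OFFSET  = 0    /* tpacket3_hdr.tp_next_offset */
};

#define STATUS_KERNEL 0u  /* stand-in for TP_STATUS_KERNEL */
#define STATUS_USER   1u  /* stand-in for TP_STATUS_USER */

/* memcpy-based accessors avoid alignment and aliasing problems. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

typedef void (*frame_cb)(const unsigned char *frame, void *ctx);

/*
 * Visit every frame of a block that the kernel has released to the user,
 * then hand the block back. Returns the number of frames visited, or -1
 * if the block still belongs to the kernel.
 */
int consume_block(unsigned char *block, size_t block_size, frame_cb cb, void *ctx)
{
    assert(block != NULL && block_size >= OFF_FIRST_PKT + sizeof(uint32_t));

    if (!(load_u32(block + OFF_BLOCK_STATUS) & STATUS_USER))
        return -1;

    uint32_t num_pkts = load_u32(block + OFF_NUM_PKTS);
    size_t off = load_u32(block + OFF_FIRST_PKT);
    int visited = 0;

    for (uint32_t i = 0; i < num_pkts; i++) {
        if (off + sizeof(uint32_t) > block_size)
            break;  /* defensive: never follow an offset out of the block */
        const unsigned char *frame = block + off;
        if (cb)
            cb(frame, ctx);
        visited++;
        /* tp_next_offset is relative to the start of the current frame. */
        off += load_u32(frame + OFF_NEXT_OFFSET);
    }

    /* Release the block back to the kernel. */
    store_u32(block + OFF_BLOCK_STATUS, STATUS_KERNEL);
    return visited;
}
```

The loop is driven by num_pkts and not by a zero tp_next_offset, so it terminates even if the last frame header carries a stale offset.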

Implementation of AF_PACKET sockets

Let’s take a quick look at how some of this is implemented in the kernel.

Struct definitions

Whenever a packet socket is created, an associated packet_sock struct is allocated in the kernel:
struct packet_sock {
        struct sock             sk;
        /* ... */
        struct packet_ring_buffer       rx_ring;
        struct packet_ring_buffer       tx_ring;
        /* ... */
        enum tpacket_versions   tp_version;
        /* ... */
        int                     (*xmit)(struct sk_buff *skb);
        /* ... */
};
The tp_version field in this struct holds the ring buffer version, which in our case is set to TPACKET_V3 by a PACKET_VERSION setsockopt call. The rx_ring and tx_ring fields describe the receive and transmit ring buffers if they are created via PACKET_RX_RING and PACKET_TX_RING setsockopt calls. These two fields have type packet_ring_buffer, which is defined as:
struct packet_ring_buffer {
        struct pgv              *pg_vec;
        /* ... */
        struct tpacket_kbdq_core        prb_bdqc;
};
The pg_vec field is a pointer to an array of pgv structs, each of which holds a reference to a block. Blocks are actually allocated separately, not as one contiguous memory region.
struct pgv {
        char *buffer;
};
The prb_bdqc field is of type tpacket_kbdq_core and its fields describe the current state of the ring buffer:
struct tpacket_kbdq_core {
        /* ... */
        unsigned short  blk_sizeof_priv;
        /* ... */
        char            *nxt_offset;
        /* ... */
        struct timer_list retire_blk_timer;
};
The blk_sizeof_priv field contains the size of the per-block private area. The nxt_offset field points inside the currently active block and shows where the next packet should be saved. The retire_blk_timer field has type timer_list and describes the timer which retires the current block on timeout.
struct timer_list {
        /* ... */
        struct hlist_node       entry;
        unsigned long           expires;
        void                    (*function)(unsigned long);
        unsigned long           data;
        /* ... */
};

Ring buffer setup

The kernel uses the packet_setsockopt() function to handle setting socket options for packet sockets. When the PACKET_VERSION socket option is used, the kernel sets po->tp_version to the provided value.
With the PACKET_RX_RING socket option a receive ring buffer is created. Internally it’s done by the packet_set_ring() function. This function does a lot of things, so I’ll just show the important parts. First, packet_set_ring() performs a bunch of sanity checks on the provided ring buffer parameters:
                err = -EINVAL;
                if (unlikely((int)req->tp_block_size <= 0))
                        goto out;
                if (unlikely(!PAGE_ALIGNED(req->tp_block_size)))
                        goto out;
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
                if (unlikely(req->tp_frame_size < po->tp_hdrlen +
                                        po->tp_reserve))
                        goto out;
                if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
                        goto out;

                rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
                if (unlikely(rb->frames_per_block == 0))
                        goto out;
                if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
                                        req->tp_frame_nr))
                        goto out;
Then, it allocates the ring buffer blocks:
                err = -ENOMEM;
                order = get_order(req->tp_block_size);
                pg_vec = alloc_pg_vec(req, order);
                if (unlikely(!pg_vec))
                        goto out;
It should be noted that alloc_pg_vec() uses the kernel page allocator to allocate blocks (we’ll use this in the exploit):
static char *alloc_one_pg_vec_page(unsigned long order)
{
        /* ... */
        buffer = (char *) __get_free_pages(gfp_flags, order);
        if (buffer)
                return buffer;
        /* ... */
}

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
        /* ... */
        for (i = 0; i < block_nr; i++) {
                pg_vec[i].buffer = alloc_one_pg_vec_page(order);
                /* ... */
        }
        /* ... */
}
Finally, packet_set_ring() calls init_prb_bdqc(), which performs some additional steps to set up a TPACKET_V3 receive ring buffer specifically:
                switch (po->tp_version) {
                case TPACKET_V3:
                        /* ... */
                        if (!tx_ring)
                                init_prb_bdqc(po, rb, pg_vec, req_u);
                        break;
                default:
                        break;
                }
The init_prb_bdqc() function copies the provided ring buffer parameters to the prb_bdqc field of the ring buffer struct, calculates some other parameters based on them, sets up the block retire timer and calls prb_open_block() to initialize the first block:
static void init_prb_bdqc(struct packet_sock *po,
                        struct packet_ring_buffer *rb,
                        struct pgv *pg_vec,
                        union tpacket_req_u *req_u)
{
        struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
        struct tpacket_block_desc *pbd;
        /* ... */
        pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
        p1->pkblk_start = pg_vec[0].buffer;
        p1->kblk_size = req_u->req3.tp_block_size;
        /* ... */
        p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
        /* ... */
        p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
        prb_init_ft_ops(p1, req_u);
        prb_setup_retire_blk_timer(po);
        prb_open_block(p1, pbd);
}
One of the things the prb_open_block() function does is set the nxt_offset field of the tpacket_kbdq_core struct to point right after the per-block private area:
static void prb_open_block(struct tpacket_kbdq_core *pkc1,
        struct tpacket_block_desc *pbd1)
{
        /* ... */
        pkc1->pkblk_start = (char *)pbd1;
        pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
        /* ... */
}

Packet reception

Whenever a new packet is received, the kernel is supposed to save it into the ring buffer. The key function here is __packet_lookup_frame_in_block(), which does the following:
  1. Checks whether the currently active block has enough space for the packet.
  2. If yes, saves the packet to the current block and returns.
  3. If no, dispatches the next block and saves the packet there.
static void *__packet_lookup_frame_in_block(struct packet_sock *po,
                                            struct sk_buff *skb,
                                            int status,
                                            unsigned int len
                                            )
{
        struct tpacket_kbdq_core *pkc;
        struct tpacket_block_desc *pbd;
        char *curr, *end;

        pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
        pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
        /* ... */
        curr = pkc->nxt_offset;
        pkc->skb = skb;
        end = (char *)pbd + pkc->kblk_size;

        /* first try the current block */
        if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }

        /* Ok, close the current block */
        prb_retire_current_block(pkc, po, 0);

        /* Now, try to dispatch the next block */
        curr = (char *)prb_dispatch_next_block(pkc, po);
        if (curr) {
                pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
                prb_fill_curr_block(curr, pkc, pbd, len);
                return (void *)curr;
        }
        /* ... */
}



Let’s look closely at the following check from packet_set_ring():
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
This is supposed to ensure that the length of the block header together with the per-block private data is not bigger than the size of the block. This makes sense: otherwise there wouldn’t be enough space in the block for them, let alone the packet data.
However, it turns out this check can be bypassed. If req_u->req3.tp_sizeof_priv has its highest bit set, casting the expression to int results in a big positive value instead of a negative one. To illustrate this behavior:
A = req->tp_block_size = 4096 = 0x1000
B = req_u->req3.tp_sizeof_priv = (1 << 31) + 4096 = 0x80001000
BLK_PLUS_PRIV(B) = (1 << 31) + 4096 + 48 = 0x80001030
A - BLK_PLUS_PRIV(B) = 0x1000 - 0x80001030 = 0x7fffffd0
(int)0x7fffffd0 = 0x7fffffd0 > 0
Later, when req_u->req3.tp_sizeof_priv is copied to p1->blk_sizeof_priv in init_prb_bdqc() (see the snippet above), it’s truncated to its two lower bytes, since the type of the latter is unsigned short. So this bug basically allows us to set the blk_sizeof_priv of the tpacket_kbdq_core struct to an arbitrary value, bypassing all sanity checks.
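The arithmetic above can be replayed in userspace. The snippet below is a simplified 32-bit model of the check, not the kernel code itself: the function names are hypothetical, BLK_HDR_LEN is taken as the 48 bytes used in the calculation above, and the value 8 for V3_ALIGNMENT is an assumption. Converting an out-of-range unsigned value to int32_t is implementation-defined in C, and the model relies on the usual two's complement wraparound.

```c
#include <assert.h>
#include <stdint.h>

#define V3_ALIGNMENT 8u   /* assumed value of the kernel constant */
#define BLK_HDR_LEN  48u  /* block header length used in the arithmetic above */

/* 32-bit model of BLK_PLUS_PRIV(): header length plus aligned private area. */
uint32_t blk_plus_priv32(uint32_t sizeof_priv)
{
    uint32_t aligned = (sizeof_priv + (V3_ALIGNMENT - 1)) & ~(V3_ALIGNMENT - 1);
    return BLK_HDR_LEN + aligned;
}

/* Returns 1 if the pre-fix check rejects the request, 0 if it lets it through. */
int old_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
    uint32_t diff = block_size - blk_plus_priv32(sizeof_priv);
    return (int32_t)diff <= 0;
}

/* The value that ends up in the unsigned short blk_sizeof_priv field. */
uint16_t stored_sizeof_priv(uint32_t sizeof_priv)
{
    return (uint16_t)sizeof_priv;
}

/* The worked example from the text, expressed as executable checks. */
void check_worked_example(void)
{
    assert(blk_plus_priv32(0x80001000u) == 0x80001030u);
    assert(old_check_rejects(0x1000u, 0x80001000u) == 0);  /* bypassed */
    assert(stored_sizeof_priv(0x80001000u) == 0x1000u);    /* truncated */
}
```

A request with a private area that is merely too large, for example tp_sizeof_priv = 0x2000 with a 0x1000 byte block, is still rejected by this model; only values with the top bit set slip through.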


If we search through the net/packet/af_packet.c source looking for blk_sizeof_priv usage, we’ll find that it’s being used in the following two places.
The first one is in init_prb_bdqc() right after it gets assigned (see the code snippet above) to set max_frame_len. The value of p1->max_frame_len denotes the maximum size of a frame that can be saved into a block. Since we control p1->blk_sizeof_priv, we can make BLK_PLUS_PRIV(p1->blk_sizeof_priv) bigger than p1->kblk_size. This will result in p1->max_frame_len having a huge value, higher than the size of a block. This allows us to bypass the size check when a frame is being copied into a block, thus causing a kernel heap out-of-bounds write.
That’s not all. Another user of blk_sizeof_priv is prb_open_block(), which initializes a block (the code snippet is above as well). There pkc1->nxt_offset denotes the address where the kernel will write a new packet when it’s being received. The kernel doesn’t intend to overwrite the block header and per-block private data, so it makes this address point right after them. Since we control blk_sizeof_priv, we can control the lowest two bytes of nxt_offset. This allows us to control the offset of the out-of-bounds write.
To sum up, this bug leads to a kernel heap out-of-bounds write of controlled maximum size and controlled offset up to about 64k bytes.
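To make the two effects concrete, here is a small model of what init_prb_bdqc() and prb_open_block() compute from a chosen blk_sizeof_priv. It is a sketch under assumptions: max_frame_len is modeled as a 32-bit unsigned value, BLK_HDR_LEN is taken as 48 and V3_ALIGNMENT as 8, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define V3_ALIGNMENT 8u   /* assumed value of the kernel constant */
#define BLK_HDR_LEN  48u  /* block header length, as in the earlier arithmetic */

struct overflow_geometry {
    uint32_t max_frame_len;   /* kblk_size - BLK_PLUS_PRIV(...), may wrap around */
    uint32_t nxt_offset;      /* offset of the first frame from the block start */
    uint32_t bytes_past_end;  /* how far past the block the write starts, 0 if inside */
};

/* Model the values derived from the (already truncated) blk_sizeof_priv. */
struct overflow_geometry model_block(uint32_t kblk_size, uint16_t blk_sizeof_priv)
{
    assert(kblk_size > 0);

    uint32_t plus_priv = BLK_HDR_LEN +
        (((uint32_t)blk_sizeof_priv + (V3_ALIGNMENT - 1)) & ~(V3_ALIGNMENT - 1));

    struct overflow_geometry g;
    g.max_frame_len = kblk_size - plus_priv;  /* unsigned: wraps when plus_priv is larger */
    g.nxt_offset = plus_priv;
    g.bytes_past_end = plus_priv > kblk_size ? plus_priv - kblk_size : 0;
    return g;
}
```

For a 0x8000 byte block and blk_sizeof_priv = 0x8800 the model gives max_frame_len = 0xfffff7d0 and a first write that starts 0x830 bytes past the end of the block. With the largest possible value 0xffff the write starts at offset 0x10030, which is where the "about 64k" bound comes from.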


Let’s see how we can exploit this vulnerability. I’m going to be targeting x86-64 Ubuntu 16.04.2 with the 4.8.0-41-generic kernel with KASLR, SMEP and SMAP enabled. The Ubuntu kernel has user namespaces available to unprivileged users (CONFIG_USER_NS=y and no restrictions on its usage), so the bug can be exploited by an unprivileged user to gain root privileges. All of the exploitation steps below are performed from within a user namespace.
The Linux kernel has support for a few hardening features that make exploitation more difficult. KASLR (Kernel Address Space Layout Randomization) puts the kernel text at a random offset to make jumping to a particular fixed address useless. SMEP (Supervisor Mode Execution Protection) causes an oops whenever the kernel tries to execute code from the userspace memory and SMAP (Supervisor Mode Access Prevention) does the same whenever the kernel tries to access the userspace memory directly.

Shaping heap

The idea of the exploit is to use the heap out-of-bounds write to overwrite a function pointer in the memory adjacent to the overflown block. For that we need to specifically shape the heap, so some object with a triggerable function pointer is placed right after a ring buffer block. I chose the already mentioned packet_sock struct to be this object. We need to find a way to make the kernel allocate a ring buffer block and a packet_sock struct one next to the other.
As I mentioned above, ring buffer blocks are allocated with the kernel page allocator (buddy allocator). It allows allocating blocks of 2^n contiguous memory pages. The allocator keeps a freelist of such blocks for each n and returns the freelist head when a block is requested. If the freelist for some n is empty, it finds the first m > n for which the freelist is not empty and splits a block from it in halves until the required size is reached. Therefore, if we start repeatedly allocating blocks of size 2^n, at some point they will start coming from one high-order memory block being split and they will be adjacent to one another.
A packet_sock is allocated via the kmalloc() function by the slab allocator. The slab allocator is mostly used to allocate objects of a smaller-than-one-page size. It uses the page allocator to allocate a big block of memory and splits this block into smaller objects. The big blocks are called slabs, hence the name of the allocator. A set of slabs together with their current state and a set of operations like “allocate an object” and “free an object” is called a cache. The slab allocator creates a set of general purpose caches for objects of size 2^n. Whenever kmalloc(size) is called, the slab allocator rounds size up to the nearest power of 2 and uses the cache of that size.
Since the kernel uses kmalloc() all the time, if we try to allocate an object it will most likely come from one of the slabs already created during previous usage. However, if we start allocating objects of the same size, at some point the slab allocator will run out of slabs for this size and will have to allocate another one via the page allocator.
The size of a newly allocated slab depends on the size of objects this slab is meant for. The size of the packet_sock struct is ~1920 and 1024 < 1920 <= 2048, which means that it’ll be rounded to 2048 and the kmalloc-2048 cache will be used. It turns out that for this particular cache the SLUB allocator (which is the kind of slab allocator used in Ubuntu) uses slabs of size 0x8000. So whenever the allocator runs out of slabs for the kmalloc-2048 cache, it allocates 0x8000 bytes with the page allocator.
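The size-class reasoning can be written down as a couple of one-liners. This is a deliberately simplified model and not the allocator's real logic: it only implements the "round up to a power of two" rule stated above, it ignores any special cases of the real kmalloc caches, and the function names and the MODEL_MAX_SIZE bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Arbitrary upper bound for this model, to keep the loop below from overflowing. */
#define MODEL_MAX_SIZE ((size_t)1 << 20)

/* Round a requested size up to the next power of two (simplified kmalloc model). */
size_t kmalloc_cache_size(size_t size)
{
    assert(size > 0 && size <= MODEL_MAX_SIZE);
    size_t cache = 1;
    while (cache < size)
        cache <<= 1;
    return cache;
}

/* Number of objects of a given size class that fit into one slab. */
size_t objects_per_slab(size_t slab_bytes, size_t object_size)
{
    assert(object_size > 0);
    return slab_bytes / object_size;
}
```

Under this model a ~1920 byte packet_sock lands in the 2048 byte class, and a 0x8000 byte slab has room for 16 such objects, so creating a few dozen packet sockets is already enough to consume several slabs.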
Keeping all that in mind, this is how we can allocate a kmalloc-2048 slab next to a ring buffer block:
  1. Allocate a lot (512 worked for me) of objects of size 2048 to fill currently existing slabs in the kmalloc-2048 cache. To do that we can create a bunch of packet sockets to cause allocation of packet_sock structs.
  2. Allocate a lot (1024 worked for me) of page blocks of size 0x8000 to drain the page allocator freelists and cause some high-order page block to be split. To do that we can create another packet socket and attach a ring buffer with 1024 blocks of size 0x8000.
  3. Create a packet socket and attach a ring buffer with blocks of size 0x8000. The last one of these blocks (I’m using 2 blocks, the reason is explained below) is the one we’re going to overflow.
  4. Create a bunch of packet sockets to allocate packet_sock structs and cause an allocation of at least one new slab.
This way we can shape the heap so that the ring buffer block we’re going to overflow is immediately followed by a kmalloc-2048 slab holding packet_sock structs.
The exact number of allocations to drain freelists and shape the heap the way we want might be different for different setups and depend on the memory usage activity. The numbers above are for a mostly idle Ubuntu machine.

Controlling the overwrite

Above I explained that the bug results in a write of a controlled maximum size at a controlled offset out of the bounds of a ring buffer block. It turns out that not only can we control the maximum size and offset, we can also control the exact data (and its size) that’s being written. Since the data that’s being stored in a ring buffer block is the packet that’s passing through a particular network interface, we can manually send packets with arbitrary content on a raw socket through the loopback interface. If we’re doing that in an isolated network namespace, no external traffic will interfere.
There are a few caveats though.
First, it seems that the size of a packet must be at least 14 bytes (12 bytes for two mac addresses and 2 bytes for the EtherType apparently) for it to be passed to the packet socket layer. That means that we have to overwrite at least 14 bytes. The data in the packet itself can be arbitrary.
Then, the lowest 3 bits of nxt_offset always have the value of 2 due to the alignment. That means that we can’t start overwriting at an 8-byte aligned offset.
Besides that, when a packet is being received and saved into a block, the kernel updates some fields in the block and frame headers. If we point nxt_offset to some particular offset we want to overwrite, some data where the block and frame headers end up will probably be corrupted.
Another issue is that if we make nxt_offset point past the block end, the first block will be immediately closed when the first packet is being received, since the kernel will (correctly) decide that there’s no space left in the first block (see the __packet_lookup_frame_in_block() snippet). This is not really an issue, since we can create a ring buffer with 2 blocks. The first one will be closed, the second one will be overflown.

Executing code

Now, we need to figure out which function pointers to overwrite. There are a few function pointer fields in the packet_sock struct, but I ended up using the following two:
  1. packet_sock->xmit
  2. packet_sock->rx_ring.prb_bdqc.retire_blk_timer.function
The first one is called whenever a user tries to send a packet via a packet socket. The usual way to elevate privileges to root is to execute the commit_creds(prepare_kernel_cred(0)) payload in a process context. The xmit pointer is called from a process context, which means we can simply point it to an executable memory region that contains the payload.
To do that we need to put our payload to some executable memory region. One of the possible ways for that is to put the payload in the userspace, either by mmapping an executable memory page or by just defining a global function within our exploit program. However, SMEP & SMAP will prevent the kernel from accessing and executing user memory directly, so we need to deal with them first.
For that I used the retire_blk_timer field (the same field used by Philip Pettersson in his CVE-2016-8655 exploit). It contains a function pointer that’s triggered whenever the retire timer times out. During normal packet socket operation, retire_blk_timer.function points to prb_retire_rx_blk_timer_expired() and it’s called with retire_blk_timer.data as an argument, which contains the address of the packet_sock struct. Since we can overwrite the data field along with the function field, we get a very nice func(data) primitive.
The state of SMEP & SMAP on the current CPU core is controlled by the 20th and 21st bits of the CR4 register. To disable them we should zero out these two bits. For this we can use the func(data) primitive to call native_write_cr4(X), where X has the 20th and 21st bits set to 0. The exact value of X might depend on what other CPU features are enabled. On the machine where I tested the exploit, the value of CR4 is 0x1407f0 (only the SMEP bit is enabled since the CPU has no SMAP support), so I used X = 0x407f0. We can use the sched_setaffinity syscall to force the exploit program to be executed on one CPU core, thus making sure that the userspace payload will be executed on the same core where we disabled SMAP & SMEP.
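The value passed to native_write_cr4() is just the current CR4 value with two bits cleared. A minimal sketch of that computation, assuming a starting CR4 value of 0x1407f0 (SMEP set, SMAP clear) and using hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define CR4_SMEP_BIT 20  /* Supervisor Mode Execution Protection enable */
#define CR4_SMAP_BIT 21  /* Supervisor Mode Access Prevention enable */

/* Report whether a given bit is set in a CR4 value. */
int cr4_bit_is_set(uint64_t cr4, unsigned int bit)
{
    assert(bit < 64);
    return (int)((cr4 >> bit) & 1);
}

/* Return the CR4 value with both the SMEP and SMAP bits cleared. */
uint64_t cr4_without_smep_smap(uint64_t cr4)
{
    uint64_t mask = (UINT64_C(1) << CR4_SMEP_BIT) | (UINT64_C(1) << CR4_SMAP_BIT);
    return cr4 & ~mask;
}
```

For 0x1407f0 only bit 20 is set among the two, and clearing it yields 0x407f0. All other feature bits are preserved, which matters because writing a CR4 value that drops unrelated bits could change other CPU behavior.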
Putting this all together, here are the exploitation steps:
  1. Figure out the kernel text address to bypass KASLR (described below).
  2. Pad heap as described above.
  3. Disable SMEP & SMAP.
    1. Allocate a packet_sock after a ring buffer block.
    2. Schedule a block retire timer on the packet_sock by attaching a receive ring buffer to it.
    3. Overflow the block and overwrite the retire_blk_timer field. Make retire_blk_timer.function point to native_write_cr4 and make retire_blk_timer.data equal to the desired CR4 value.
    4. Wait for the timer to be executed, now we have SMEP & SMAP disabled on the current core.
  4. Get root privileges.
    1. Allocate another pair of a packet_sock and a ring buffer block.
    2. Overflow the block and overwrite the xmit field. Make xmit point to a commit_creds(prepare_kernel_cred(0)) payload located in userspace.
    3. Send a packet on the corresponding packet socket, xmit will get triggered and the current process will obtain root privileges.
The exploit code can be found here.
It should be noted that when we overwrite these two fields in the packet_sock structs, we’ll end up corrupting some of the fields before them (the kernel will write some values to the block and frame headers), which can lead to a kernel crash. However, as long as these other fields don’t get used by the kernel we should be good. I found that one field that causes crashes when we try to close all packet sockets after the exploit has finished is the mclist field, but simply zeroing it out helps.


KASLR bypass

I didn’t bother to come up with some elaborate KASLR bypass technique which exploits the same bug. Since Ubuntu doesn’t restrict dmesg by default, we can just grep the kernel syslog for the “Freeing SMP” string, which contains a kernel pointer that looks suspiciously similar to the kernel text address:
# Boot #1
$ dmesg | grep 'Freeing SMP'
[    0.012520] Freeing SMP alternatives memory: 32K (ffffffffa58ee000 - ffffffffa58f6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffffa4800000 T _text
# Boot #2
$ dmesg | grep 'Freeing SMP'
[    0.017487] Freeing SMP alternatives memory: 32K (ffffffff85aee000 - ffffffff85af6000)
$ sudo cat /proc/kallsyms | grep 'T _text'
ffffffff84a00000 T _text
By doing simple math we can calculate the kernel text address based on the one exposed through dmesg. This way of figuring out the kernel text location works only for some time after boot, as syslog only stores a fixed number of lines and starts dropping them at some point.
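In both boots shown above the leaked address is exactly 0x10ee000 bytes above _text. The sketch below turns that observation into code. It is an illustration and not the exploit's actual implementation: the function name is hypothetical, and the offset constant is derived only from the two boots above, so it applies to that particular kernel build.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Distance between the start of the "Freeing SMP alternatives memory" range
 * and _text in the two boots shown above; specific to that kernel build.
 */
#define SMP_ALT_TO_TEXT_OFFSET UINT64_C(0x10ee000)

/*
 * Parse the first hex address after '(' in a dmesg line and subtract the
 * fixed offset. Returns the guessed _text address, or 0 if parsing fails.
 */
uint64_t kernel_text_from_dmesg_line(const char *line)
{
    assert(line != NULL);

    const char *open = strchr(line, '(');
    if (open == NULL)
        return 0;

    char *end = NULL;
    unsigned long long start = strtoull(open + 1, &end, 16);
    if (end == open + 1 || start < SMP_ALT_TO_TEXT_OFFSET)
        return 0;

    return (uint64_t)start - SMP_ALT_TO_TEXT_OFFSET;
}
```

Feeding it the two "Freeing SMP" lines above gives ffffffffa4800000 and ffffffff84a00000, the same values that /proc/kallsyms reports for _text.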
There are a few Linux kernel hardening features that can be used to prevent this kind of information disclosure. The first one is called dmesg_restrict and it restricts the ability of unprivileged users to read the kernel syslog. It should be noted that even with dmesg restricted, the first user on Ubuntu can still read the syslog from /var/log/kern.log and /var/log/syslog since they belong to the adm group.
Another feature is called kptr_restrict and it doesn’t allow unprivileged users to see pointers printed by the kernel with the %pK format specifier. However, in 4.8 the free_reserved_area() function uses %p, so kptr_restrict doesn’t help in this case. In 4.10 free_reserved_area() was fixed not to print address ranges at all, but the change was not backported to older kernels.


Let’s take a look at the fix. The vulnerable code as it was before the fix is below. Remember that the user fully controls both tp_block_size and tp_sizeof_priv.
                if (po->tp_version >= TPACKET_V3 &&
                    (int)(req->tp_block_size -
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
                        goto out;
When thinking about a way to fix this, the first idea that comes to mind is that we can compare the two values as is without that weird conversion to int:
                if (po->tp_version >= TPACKET_V3 &&
                    req->tp_block_size <=
                          BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv))
                        goto out;
Funnily enough, this doesn’t actually help. The reason is that an overflow can happen while evaluating BLK_PLUS_PRIV in case tp_sizeof_priv is close to the unsigned int maximum value.
#define BLK_PLUS_PRIV(sz_of_priv) \
        (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))
One of the ways to fix this overflow is to cast tp_sizeof_priv to uint64 before passing it to BLK_PLUS_PRIV. That’s exactly what I did in the fix that was sent upstream.
                if (po->tp_version >= TPACKET_V3 &&
                    req->tp_block_size <=
                          BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
                        goto out;
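The difference between the three variants of the check is easiest to see by running them side by side. The following is a simplified userspace model and not the kernel code: the pre-fix and naive variants are modeled with 32-bit arithmetic throughout, BLK_HDR_LEN is taken as 48 and V3_ALIGNMENT as 8 (both assumptions), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define V3_ALIGNMENT 8u   /* assumed value of the kernel constant */
#define BLK_HDR_LEN  48u  /* block header length, as in the earlier arithmetic */

/* BLK_PLUS_PRIV() evaluated in 32 bits: the alignment step can wrap around. */
uint32_t blk_plus_priv32(uint32_t sizeof_priv)
{
    return BLK_HDR_LEN + ((sizeof_priv + (V3_ALIGNMENT - 1)) & ~(V3_ALIGNMENT - 1));
}

/* BLK_PLUS_PRIV() evaluated in 64 bits: no wraparound for 32-bit inputs. */
uint64_t blk_plus_priv64(uint64_t sizeof_priv)
{
    return BLK_HDR_LEN + ((sizeof_priv + (V3_ALIGNMENT - 1)) & ~UINT64_C(7));
}

/* Pre-fix check: subtraction followed by a cast to a signed 32-bit value. */
int original_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
    return (int32_t)(block_size - blk_plus_priv32(sizeof_priv)) <= 0;
}

/* Naive fix: direct comparison, but still with a 32-bit BLK_PLUS_PRIV(). */
int naive_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
    return block_size <= blk_plus_priv32(sizeof_priv);
}

/* Upstream-style fix: widen tp_sizeof_priv to 64 bits before the macro. */
int fixed_check_rejects(uint32_t block_size, uint32_t sizeof_priv)
{
    return (uint64_t)block_size <= blk_plus_priv64(sizeof_priv);
}

/* Two malicious inputs: top bit set, and a value close to UINT32_MAX. */
void check_fix_variants(void)
{
    assert(original_check_rejects(0x1000u, 0x80001000u) == 0);  /* bypassed */
    assert(naive_check_rejects(0x1000u, 0x80001000u) == 1);
    assert(naive_check_rejects(0x1000u, 0xffffffffu) == 0);     /* still bypassed */
    assert(fixed_check_rejects(0x1000u, 0x80001000u) == 1);
    assert(fixed_check_rejects(0x1000u, 0xffffffffu) == 1);
}
```

The naive comparison catches the original trigger value but not 0xffffffff, because aligning that value up to 8 wraps to 0 in 32 bits. Widening the operand first removes the wraparound, and legitimate requests such as a 0x1000 byte block with no private area are still accepted.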


Creating a packet socket requires the CAP_NET_RAW privilege, which can be acquired by an unprivileged user inside a user namespace. Unprivileged user namespaces expose a huge kernel attack surface, which has resulted in quite a few exploitable vulnerabilities (CVE-2017-7184, CVE-2016-8655, …). This kind of kernel vulnerability can be mitigated by completely disabling user namespaces or by disallowing their use by unprivileged users.
To disable user namespaces completely you can rebuild your kernel with CONFIG_USER_NS disabled. Restricting user namespace usage to privileged users only can be done by writing 0 to /proc/sys/kernel/unprivileged_userns_clone in Debian-based kernels. Since version 4.9 the upstream kernel has a similar /proc/sys/user/max_user_namespaces setting.


Right now the Linux kernel has a huge number of poorly tested (from a security standpoint) interfaces and a lot of them are enabled and exposed to unprivileged users in popular Linux distributions like Ubuntu. This is obviously not good and they need to be tested or restricted.
Syzkaller is an amazing tool that allows testing kernel interfaces via fuzzing. Even adding barebones descriptions for another syscall usually uncovers a number of bugs. We certainly need people writing syscall descriptions and fixing existing ones, since there’s a huge surface that’s still not covered and probably a ton of security bugs buried in the kernel. If you decide to contribute, we’ll be glad to see a pull request.

