What is BPF and why is it taking over Linux Performance Analysis?

Kyle Laracey
Kyle Laracey

Performance analysis often gets bottlenecked by lack of visibility. At MemSQL, we architected our database to easily observe its inner workings. Observability allows our engineers to easily identify components that need to be faster. Faster components mean our database’s performance skyrockets. These tools also enable support engineers to react quickly and precisely to customer needs. In the spirit of using the best available tools to which we have access, the performance team is currently evaluating next-generation tools just recently available in Linux.

The newest tool for observing the Linux operating system is the “Berkeley Packet Filter” (BPF). BPF allows users to run a small piece of code quickly and safely inside the operating system. Originally used for packet filtering, it has since been enhanced from its eponymous use-case to support dynamic tracing of the Linux operating system. For example, it is possible to write a small BPF program that prints every time a particular file was accessed by a user.

The power of the Berkeley Packet Filter, when used with Userland Statically Defined Tracepoints (USDT), expands beyond the operating system to the database. USDT probes are well defined locations in the database where BPF programs run, allowing engineers to ask questions previously unanswerable. For example, engineers can now examine the interactions between the database and the operating system by running BPF programs in each at the same time.

Adding a USDT static tracepoint is as easy as a single macro call, which declares the probe and its arguments. This probe fires when each query is executed and records the query string:

DTRACE_PROBE1(memsqld, querystart, query);

To use this USDT probe, we need to attach a BPF program to it. We write our program in C and use the BPF Compiler Collection to compile it to BPF and attach it to our probes. The following BPF script traces queries and records their latencies:

BPF_HASH(pid_to_start_hash, u32);
BPF_HISTOGRAM(latency);                                                            

// This function runs each time a query begins. It records the current time stamp 
// (`start_ts`) and save it in the `pid_to_start_ht` hash table                                                                                
int querystart(struct pt_regs *ctx)                     
{                                                                                   
    u64 start_ts = bpf_ktime_get_ns();                                              
    u32 pid = bpf_get_current_pid_tgid();
                                           
    pid_to_start_hash.update(&pid, &start_ts);                                          
    return 0;                                                                       
} 

// This function runs at the end of each query. Look up the saved start timestamp 
// (`start_ts`) for the current thread’s id (pid) using the hash table 
// (`pid_to_start_hash`) and record the elapsed time (`delta_ms`) in the latency 
// histogram.
int queryend(struct pt_regs *ctx)                       
{                                                                                                                                   
    u32 pid = bpf_get_current_pid_tgid();                                                                                                      
    u64 *start_tsp = pid_to_start_hash.lookup(&pid);

    // Edge case: this query began before we started tracing.                                          
    if (!start_tsp) return 0; 
                                                                                                                                      
    u64 delta_ms = (bpf_ktime_get_ns() - *start_tsp) / 1000 / 1000;
    // Take the log of the elapsed time to put into the logarithmic histogram.                 
    latency.increment(bpf_log2l(delta_ms));                                                                

    // Make sure to delete values from the hash table when they are no longer needed.                                                                                    
    pid_to_start_hash.delete(&pid);                                                             
    return 0;                                                                       
}

We run query_latency.py, a script that wraps the above BPF program using the BCC toolchain, and get a nice histogram of query latencies:

$ sudo ./query_latency.py /var/lib/memsql/master-3306/memsqld --histogram
Tracing queries. ^C to exit.
latency (ms):
     value               : count     distribution
         0 -> 1          : 9        |****************************************|
         2 -> 3          : 1        |****                                    |
         4 -> 7          : 1        |****                                    |
         8 -> 15         : 2        |*******                                 |
        16 -> 31         : 1        |****                                    |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 1        |****                                    |

Once engineers have the ability to trace when a thread is executing a MemSQL query, they can ask more interesting questions about how the database interacts with Linux. For example, engineers can investigate and determine how long queries are spending acquiring locks. With BPF, engineers are able to instrument the start and end of the queries as above, and additionally instrument the futex system call itself (used in Linux to acquire and release locks) to trace how long it takes acquire locks while executing our query:

futex latencies (ms) for 'select count(distinct sid_1g) where...'
     value               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 2        |****                                    |
         4 -> 7          : 2        |****                                    |
         8 -> 15         : 1        |**                                      |
        16 -> 31         : 5        |***********                             |
        32 -> 63         : 17       |****************************************|

What about how a query spends its time? On- and off-CPU flamegraphs are helpful, but they are too coarse for query investigations. We instrumented the kernel scheduler tracepoints to conditionally collect information for threads that queries run on. This tracing tells us how long the thread of a query spends in various states (waiting, running, blocked, I/O, and sleeping).

The power of BPF allows us to inspect our database at runtime and ask precise questions. Increased observability provided by BPF improves the rate of performance and optimizes customer interaction with the MemSQL database. Overall, BPF provides the observability necessary to build a transparent and easily accessible modern in-memory database.

Access scripts, documentation, and additional reference information on BCC and BPF here.

After joining the MemSQL performance team this summer, Kyle Laracey will be returning to Brown University in the fall. At Brown, he studies computer science, is a teaching assistant for CS167: Operating Systems, and is expected to graduate in May 2017.

 

memsql rainbow wave
On-Demand Webinar
See MemSQL in Action