The previous articles explained how to build applications using the OCaml-LLVM
bindings, and how to use the API to manipulate LLVM objects. That was the
“read-only” part of the tutorial, which can be used to analyze LLVM IR.
This part explains how to create LLVM IR: we write a simple application from
scratch, then see how to build and run it.
Modules
As in the previous tutorial, we need
to create a context and a module:
There are two actions that can be done on functions:
declare_function to give only a declaration of the prototype,
define_function to give both the declaration and the implementation.
In both cases, we need to give the signature (return type, number and type of
arguments) of the function.
This is pretty similar to C. We’ll use this to declare the function
int main(void).
The int type is a bit problematic in LLVM (and in C, but for other reasons):
integer types must have a known size in LLVM. While this does not change the
architecture-independent property …
In the previous tutorial, we’ve seen how to use ocamlbuild and make to build
a simple application. In this part, we’ll start exploring the API, and see how
to access values and attributes of LLVM objects.
The base of the code is the same as in part 1: it reads an existing LLVM bitcode
file, for example one generated by clang.
As in the previous part of the tutorial, knowing the LLVM C++ API is not required (but can help).
LLVM objects
The top-level container is a module (llmodule). The module contains global
variables, types and functions. Functions in turn contain basic blocks, and basic
blocks contain instructions.
Values
In the OCaml bindings, all objects (variables, functions, instructions) are
instances of the opaque type llvalue.
A value has a type, a name, a definition, a list of users, and other properties
such as attributes (for example visibility or linkage options) or aliases.
Each value has a type (lltype), which is a composite object to define the type
of a value and its arguments. To match the real type, it needs to be converted
to a TypeKind.t:
This is the first part of a tutorial series on how to use the OCaml bindings
for LLVM.
Why use the OCaml bindings? Because you can avoid using the C++ API, spending huge
amounts of time compiling Clang sources, then your plugin, then debugging the
segfaults again and again. The bindings are stable, cover most of the API, and
are quite simple to use, thanks to the Debian packages.
This tutorial was written on Debian Sid; things may differ but should
stay similar on other distributions.
The objectives of this first part are:
install the required packages
set up a build environment for ocamlbuild
build a simple application that reads an LLVM bitcode file and prints it
Installation
The required packages are:
llvm-3.5-dev
libllvm-3.5-ocaml-dev
the LLVM and OCaml compilers (llvm-3.5, ocaml)
optionally, clang
The current LLVM version is 3.6; however, the OCaml bindings are currently
disabled (see Debian bug #783919) because of changes in the required
dependencies.
Here are the materials for the talk PICON: Control Flow Integrity on LLVM IR,
given during SSTIC 2015. While SSTIC is a
French-speaking conference, I publish here in English because my other posts
are also in English.
Here is the summary, from the website:
Control flow integrity has been a well explored field of software security for
more than a decade. However, most of the proposed approaches are stalled in a
proof of concept state - when the implementation is publicly available - or have
been designed with a minimal performance overhead as their primary objective,
sacrificing security. Currently, none of the proposed approaches can be used to
fully protect real-world programs compiled with most common compilers (e.g. GCC,
Clang/LLVM). In this paper we describe a control flow integrity enforcement
mechanism whose main objective is security. Our approach is based on
compile-time code instrumentation, making the program communicate with its
external execution monitor. The program is terminated by the monitor as soon as
a control flow integrity violation is detected. Our approach is implemented as
an LLVM plugin and is working on LLVM’s Intermediate Representation.
The idea behind FORTIFY_SOURCE is relatively simple: there are cases
where the compiler can know the size of a buffer (if it’s a fixed-size
buffer on the stack, as in the example, or if the buffer just came from
a malloc() function call). With a known buffer size, functions that
operate on the buffer can check that the buffer will not overflow.
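The kind of check a fortified function performs can be sketched by hand. This is only an illustration of the idea, not the actual glibc implementation: checked_strcpy is a hypothetical helper, and the destination size is passed explicitly here, whereas with FORTIFY_SOURCE the compiler supplies it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): copy src into dst only
 * if it fits, including the terminating NUL byte.
 * Returns 0 on success, -1 if the copy would overflow dst. */
static int checked_strcpy(char *dst, size_t dst_size, const char *src)
{
    size_t needed = strlen(src) + 1;   /* bytes required, with the NUL */

    if (needed > dst_size)
        return -1;                     /* would overflow: refuse to copy */

    memcpy(dst, src, needed);
    return 0;
}
```

Called as checked_strcpy(buf, sizeof buf, input), the helper refuses any input longer than the buffer. The real fortified functions abort the program on overflow instead of returning an error code.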
Since recent versions (>= 4.0, maybe before), gcc (and ld) have some
nice security features. Debian has created a wrapper for the toolchain,
to make the use of these features easy.
To install the wrapper, run:
apt-get install hardening-wrapper
To enable the hardening features, you have to export the environment variable:
export DEB_BUILD_HARDENING=1
The features include additional checks for printf-like functions, stack
protector, using address-space layout randomization (ASLR), marking
ELF-sections as read-only after loading when possible, etc.
Please note that you must compile with *-O2* if you want the checks
to be effective.
This asks gcc to make additional checks on format strings, to prevent attacks.
The following code, for example:
printf(buf);
will result in a warning:
[home ~/harden] DEB_BUILD_HARDENING=1 make
gcc bad.c -o bad
bad.c: In function ‘main’:
bad.c:10: warning: format not a string literal and no format arguments
Why is this code vulnerable? Because the buffer (buf) could contain
format specifiers like %s. The printf function will interpret them and
fetch arguments that were never passed, reading whatever is on the stack.
This leaks memory contents and, with specifiers such as %n, can result in the
execution of arbitrary code.
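The usual fix is to pass the untrusted text as an argument to a constant format string, so that it is never interpreted as a format. A minimal sketch follows. format_untrusted is a hypothetical helper name, and snprintf is used instead of printf only so that the result can be inspected.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write untrusted text into out using a constant
 * "%s" format, so any '%' characters in the text are copied literally.
 * Returns the value of snprintf (length of the text, without the NUL). */
static int format_untrusted(char *out, size_t out_size, const char *untrusted)
{
    return snprintf(out, out_size, "%s", untrusted);
}
```

With this form, an input such as "%x %x" is copied verbatim instead of being expanded. The same change applied to the example above would be printf("%s", buf).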
ANSI C requires all static and global variables without an explicit
initializer to be initialized with 0 (§6.7.8 of the C99 standard). This means
you can rely on the following behavior:
#include <stdio.h>

/* No initializer: the standard guarantees the value 0. */
int global;

void function(void) {
    printf("%d\n", global);
}
This will print 0, and it is guaranteed by the standard.
However, the compiler does not emit code to do this. All you will be able to
see is that the variable is placed in the .bss section:
08049560 l O .bss 00000004 static_var.1279
08049564 g O .bss 00000004 global_var
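Symbols like the two above could come from a file such as the following sketch. The variable names are taken from the listing, and the numeric suffix on static_var suggests a function-local static. The function names read_static and print_both are assumptions made for the illustration.

```c
#include <stdio.h>

/* File-scope variable without initializer: zero-initialized,
 * typically placed in .bss as a global ("g") symbol. */
int global_var;

/* Hypothetical accessor, for illustration only. */
int read_static(void)
{
    /* Function-local static without initializer: also zero-initialized,
     * typically placed in .bss as a local ("l") symbol. */
    static int static_var;
    return static_var;
}

/* Hypothetical function printing both values. */
void print_both(void)
{
    printf("%d %d\n", global_var, read_static());
}
```

Both variables read as 0 even though no statement in the program ever assigns to them.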
It is the startup code added at link time (or the program loader) that initializes the variables.
The C compiler usually puts variables that are supposed to be
initialized with 0 in the .bss section instead of the .data section.
Unlike the .data section, the .bss section does not contain actual
data; it just specifies the size of all elements it contains. The C
compiler just *assumes* that the linker, loader, or the startup code
of the C library initializes this block of memory with 0. This is an
optimization; .data elements occupy space in the image (or ROM or flash
memory) and in RAM whereas .bss elements need to occupy RAM space only
if …