Undefined Behavior in 2017

Recently we’ve heard a few people imply that problems stemming from undefined behaviors (UB) in C and C++ are largely solved due to ubiquitous availability of dynamic checking tools such as ASan, UBSan, MSan, and TSan. We are here to state the obvious — that, despite the many excellent advances in tooling over the last few years, UB-related problems are far from solved — and to look at the current situation in detail.

Valgrind and most of the sanitizers are intended for debugging: emitting friendly diagnostics regarding undefined behaviors that are executed during testing. Tools like this are exceptionally useful and they have helped us progress from a world where almost every nontrivial C and C++ program executed a continuous stream of UB to a world where quite a few important programs seem to be largely UB-free in their most common configurations and use cases.

The problem with dynamic debugging tools is that they don’t do anything to help us to cope with the worst UBs: the ones that we didn’t know how to trigger during testing, but that someone else has figured out how to trigger in deployed software — while exploiting it. The problem reduces to doing good testing, which is hard. Tools like afl-fuzz are great but they barely begin to scratch the surface of large programs that process highly structured inputs.

One way to sidestep problems in testing is to use static UB-detection tools. These are steadily improving, but sound and precise static analysis is not necessarily any easier than achieving good test coverage. Of course the two techniques are attacking the same problem — identifying feasible paths in software — from opposite sides. This problem has always been extremely hard and probably always will be. We’ve written a lot elsewhere about finding UBs via static analysis; in this piece our focus is on dynamic tools.

The other way to work around problems in testing is to use UB mitigation tools: these turn UB into defined behavior in production C and C++, effectively gaining some of the benefits of a safe programming language. The challenge is in engineering mitigation tools that:

* don’t break our code in any corner cases,
* have very low overhead,
* don’t add effective attack surfaces, for example by requiring programs to be linked against a non-hardened runtime library,
* raise the bar for determined attackers (in contrast, debugging tools can afford to use heuristics that aren’t resistant to adversaries),
* compose with each other (in contrast, some debugging tools such as ASan and TSan are not compatible, necessitating two runs of the test suite for any project that wants to use both).

Before looking at some individual kinds of UB, let’s review our goals here. These apply to every C and C++ compiler.

Goal 1: Every UB (yes, all ~200 of them, we’ll give the list towards the end of this post) must either be documented as having some defined behavior, be diagnosed with a fatal compiler error, or else — as a last resort — have a sanitizer that detects that UB at runtime. This should not be controversial, it’s sort of a minimal requirement for developing C and C++ in the modern world where network packets and compiler optimizations are effectively hostile.

Goal 2: Every UB must either be documented as having some defined behavior, be diagnosed with a fatal compiler error, or else have an optional mitigation mechanism that meets the requirements above. This is more difficult; it necessitates, for example, production-grade memory safety. We like to think that this can be achieved in many execution environments. OS kernels and other maximally performance-critical code will need to resort to more difficult technologies such as formal methods.

blog.regehr.org/archives/1520


wtf?

If you don't write code at all, then you'll have no UB! ;^)

C is "this is why we can't have nice things: the language."

checked my project in ubsan and it passes. groooovy baby

Why didn't you post sourcemage.org/Comparison with Gentoo, faggot? Anyway, as a Gentoo user, I'll probably try it in a kvm, but the repos look tiny as fuck. I can't even find rtorrent, m8.

What about people use their brains instead of complaining about undefined behaviours? Most UBs happen in the form of atoi("i'm retarded") and the like.
Your code should NEVER run into any situation with undefined behaviour, which is the very reason undefined behaviours exist in the first place.

...

your argument is retarded but that ultimately doesnt matter. fact is the majority of software written in c has ub.


dont shitpost. do it faggot

Undefined behavior is how you do systems programming in C. C is a systems language, which is why there's this big emphasis on pointers. These compiler writers are destroying the only reason to use C in the first place. There are languages with no undefined behavior that can also be used for systems programming, like Ada and PL/I.


Any kind of systems programming in C is undefined behavior. You're using it as a dumbed down Pascal or crippled Java.

Fact is the majority of software is written by tards, no matter the language. Most C programs aren't properly layered and have a very high degree of coupling between modules, but that's not a problem of the language itself. As I said, use your brain when you program in C.


Except most C projects actually don't require you to rely upon UB, the fact that Linux or NT might do it doesn't mean all C programs need it to function properly.

literally how is this different than some bullshit like valgrind which OP already addressed?

found the C LARPer *protip: just because you've coded C, doesn't mean you understand it*

I actually program in C for a living, and I've done maintenance of actual corporate critical systems that were written in C89. And I'll tell you what, they're pajeet-tier, and I can tell you that the original koders didn't understand the slightest about UB. Yet they made a 200k LoC system that, to this day, moves tens of millions euros a day.

found the larper

It doesn't matter, even if you use your brain there is still room for lots of error that a human won't see.
And yes it is easy to write simple programs in C but it's a lot harder to write complex things and audit that stuff correctly.

That's nice, but pretty much every C project in existence is full of UB, because it's impractical to avoid it. In most cases, you just know all compilers in practice will be fine with your code. By strictly avoiding every possible UB, you put yourself at a disadvantage to the market for no gain.
Yes, by definition some code has to actually make money.

Threads and multiprocessing are actually rarely needed. For CPU-bound problems, most of the time higher-level tools are better: automatic loop vectorization (OpenMP for example), high-level concurrency primitives with immutable data (parallel "map" for example). Actually creating threads by hand is almost never needed if the system design is well thought out.

And for IO-bound tasks, the optimal way in all aspects is cooperative multitasking (aka coroutines) in a single thread, using non-blocking IO primitives. It performs better and it's easier to implement if you understand coroutines. It requires a language which can express coroutines in some form, so some of them (Java) are out of luck. BTW it's one of the reasons why Python code in practice can be faster than Java, even if the raw interpreter overhead is orders of magnitude higher.

I agree, however
OpenMP SIMD is great on paper and terrible in practice, it can't manage to do even the simplest things optimally. (at least, in the current state of gcc and clang compilers)

It was just one example, there are other high-level parallel primitives.
And it will only become better at generating safe-but-fast code over time. Unlike human brains.

And maybe you tried it the wrong way, because I've seen software actually using it (darktable for example) and it seems to be just fine.

Anyway, there's also OpenCL, which can execute on CPU as well (not only GPU). For embarrassingly parallel tasks, it's also a very good tool

Every time someone says "I know C but others don't, and they make UB mistakes but I don't", I make them take this test. It's just five questions.
hackernoon.com/so-you-think-you-know-c-8d4e2cd6f6a6

I don't think we're understanding each other. Of course, most C code has undefined behaviour, and of course it's impossible to avoid. I mean, even the evaluation order of function arguments is left open by the standard (strictly speaking that one is unspecified rather than undefined), so pretty much every program beyond a Hello World runs into this stuff.
What I'm saying, it's UB that doesn't actually matter and that you shouldn't be caring about, because it doesn't alter the program correctness, and if it does, then you have been programming like a retard.
For example, whether foo(a+b, b+c); executes a+b or b+c first shouldn't matter at all, and if the expressions you put in your parameters do need to be evaluated in a specific order (functions with side effects, prefix increments, etc.), you shouldn't be putting them there in the first place.
I remember that one blog who bitched about the undefined behaviour of a = 1;foo(++a, ++a, ++a)
If you ever run into such a situation, either you stop trying to write one liners, or your code is retarded, you yourself are retarded and you should go back to coding HTML5.
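For what it's worth, the C standard treats those two cases differently. Which of a+b and b+c gets evaluated first is unspecified: either order is allowed, and without side effects the result is the same. Modifying a several times inside one argument list, as in foo(++a, ++a, ++a), is undefined because the increments are unsequenced. A minimal sketch of the sequenced rewrite follows; foo3 and call_sequenced are made-up names for illustration, not anything from the blog being complained about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callee: stores its arguments so the caller can inspect them. */
void foo3(int x, int y, int z, int out[3])
{
    assert(out != NULL);
    out[0] = x;
    out[1] = y;
    out[2] = z;
}

/* Instead of foo3(++a, ++a, ++a, out), which modifies `a` several times
   without sequencing (undefined behavior), perform each increment in its
   own statement so every side effect is complete before the call. */
void call_sequenced(int a, int out[3])
{
    int first = ++a;   /* original a + 1 */
    int second = ++a;  /* original a + 2 */
    int third = ++a;   /* original a + 3 */
    foo3(first, second, third, out);
}
```

Calling call_sequenced(1, out) always passes 2, 3 and 4 in that order, whatever order the compiler picks for evaluating the arguments themselves.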
Using atoi()? Then you should be sure that the pointer you give it is in fact a null-terminated string that represents an integer. Defining behaviour of atoi() on an input it shouldn't get is stupid anyway.
Integer overflow? Why the fuck would you use it in a production codebase anyway besides, maybe, on some obscure embedded platform?
Division by zero? Check your goddamn variables if you know there's a risk such a division will happen.
Missing a return statement at the end of your non-void function? Hang thyself.
Smartass one-liners that can't be easily understood from looking at them? Jump off a bridge.
Relying on the value of uninitialized variables? Seriously, just stop.
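A rough sketch of what "check your variables" looks like for three of the cases above, assuming plain int is the type in play. atoi() has undefined behavior when the converted value doesn't fit in an int, so the sketch uses strtol(), which reports range errors through errno. The helper names parse_int, checked_add and checked_div are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parse a base-10 int. Returns false (leaving *out untouched) on empty
   input, trailing garbage, or a value outside the range of int. */
bool parse_int(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')   /* no digits, or junk after them */
        return false;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;
    *out = (int)v;
    return true;
}

/* Signed addition that tests for overflow before it can happen,
   rather than performing the addition and inspecting the result. */
bool checked_add(int a, int b, int *out)
{
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}

/* Integer division that refuses the two undefined cases: division by
   zero and INT_MIN / -1 (the quotient does not fit in an int). */
bool checked_div(int num, int den, int *out)
{
    assert(out != NULL);
    if (den == 0 || (num == INT_MIN && den == -1))
        return false;
    *out = num / den;
    return true;
}
```

Each helper returns false instead of running into the undefined case, so the caller decides what to do with bad input.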

UB is very much okay, your program's output shouldn't depend on their existence and rustards should keep writing CoCs instead of giving their opinion on a topic they don't understand, that's just the point I'm trying to make. Apologies if I don't make myself clear enough, in fact I think my statement that your program should never run into UB is simply wrong, what I really meant was, UB should never be relevant to you.

Just take a look at retard
It's 4 one-liners/syntax hells that you shouldn't write in any language ever, and the absolute stupidity of multiplying a character by an integer, which you shouldn't write in any language either.
The use of these besides showing the world you're a retard? Absolutely none.

Also LOL @ your page.
Literally someone whose field of expertise is far, far away from C gives his shit opinion on it based on what he's read on en.wikipedia.org/wiki/Undefined_Behavior

Read the article. He has 15 years of experience in C and learned about this when he worked on a nuclear power plant project. That website isn't even about a single author, and they mix in this guy with girls talking about Bitcoins.

...

Ok, but the relevant point stands still: it's not code you should EVER write in production, not even in Rust or Java, so using it as a basis for criticism of C is simply retarded.
Does C have a lot of disadvantages? Yes. Is UB on unreadable one-liners one of them? No.


It was obvious from the very beginning that the answers would be UB.

it was obvious from the very beginning that you are a meme spouting retard

Hey {{{prrpx}}}, why don't you go back to writing Fizzbuzzes in some toy language?

what makes you think im prrpx?

You should all read :
blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html

UB is bad

>hackernoon.com/so-you-think-you-know-c-8d4e2cd6f6a6
I THOUGHT WE WERE DONE WITH FRAMES BULLSHIT

What a waste of dubs. Once again,
Yeah, totally things you do every single day in any programming language.

wtf i am a #cmissile now

It is more common than you might think. There are even arch-specific bugs; Xen had one like that, I believe.
There is no reason to write applications in C anymore unless you are programming some embedded stuff.

yes it can

bump

Yet another anti-C meth rant by some guy who forgot to take his OCD medication. So this is what Holla Forums has become?

...

That, and faggots bumping fortnight-old threads because their daddy didn't beat them hard enough.

This kind of philosophy won't save you from UB. If you write to a static variable from a signal handler and it isn't a volatile sig_atomic_t (or a lock-free atomic), that's UB. There is no "le best practice" that would have avoided this; you simply needed to read a few specs.
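To make the signal handler case concrete, here is a minimal sketch of the form the C standard does allow: the handler only stores to an object of type volatile sig_atomic_t. The names got_signal, on_signal and install_and_raise are made up for illustration, and SIGINT is just a convenient signal to raise.

```c
#include <signal.h>

/* The kind of plain static object a handler may portably write to. */
static volatile sig_atomic_t got_signal = 0;

/* Handler does nothing except set the flag. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Installs the handler, raises SIGINT in the same thread, and returns
   the flag value (1 if the handler ran), or -1 if setup failed. */
int install_and_raise(void)
{
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    if (raise(SIGINT) != 0)
        return -1;
    return got_signal;
}
```

Declaring the flag as a plain static int instead would compile without complaint, which is the point: nothing but the spec tells you it's wrong.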
"Why would I ever use modular arithmetic"? OTOH The current meme is that signed overflow is okay despite it being UB because all machines now are 2s complement, and I think compilers now break your code if you depend on 2s complement overflow, but I don't really care to find out. The rest of your points are just rookie mistakes and obvious well known UB of C.

uhh, what I'm saying is all modern C programs _do_ rely on UB, and out of _necessity_. Like I said it's very hard to avoid UB, to the point that in order to do it you'd basically be writing in a different language than the rest of the C programmers in existence. You'd be writing all these extra functions to make sure your program conforms even though they do nothing of value on any of the platforms you target - they exist solely to satisfy the property of your program being "defined". Most people don't even know that it's UB to write to a plain (non-atomic) variable from one thread and read it from another thread without synchronization [protip: just because you'd instinctively put locks around this in a cargo cultish manner, doesn't mean you know of this UB], and that is __nothing__. The same problem exists in C#, Java, and Go BTW.

The likes of Rust or Cyclone provide a real solution to a real problem: maintaining memory-safety without overhead / unpredictability of GC. I'm not saying they are good solutions, but they are solutions.

Also, we need to keep 2 concepts here separate. UB is merely one thing by itself. Memory-safety is another. It just so happens that memory-safety violations in C are defined as UB. This is not typically the case in assembly for example aside from some obscure edge cases like undefined instructions. UB and preserving memory-safety are two completely separate non-trivial problems in C. Memory-safety is violated if you read from uninitialized memory anywhere, or you read/write outside the bounds of an array. Once you violate memory-safety, it's very possible that you have RCE, information disclosure, or similar vulns such as allowing an attacker to overwrite variables (such as user.is_admin) somewhere else in memory or in a previous stack frame.

We could easily code C in a way that it's obvious that memory-safety is preserved, but then there is literally no point in writing in C. It would just be the same as what Java does automatically (modulo GC), or what Rust or Cyclone do automatically. For example the C I just wrote this month, we get a giant array that takes milliseconds (which is just barely fast enough) to process via algorithm A. We don't do any bounds checking in there. Instead, we run a complex algorithm when _receiving_ the array to ensure that the subsequent thousands of runs of algorithm A will not corrupt memory. There are 10-20 invariants that algorithm A assumes about every cell in the array, and the initialization algorithm ensures they are all true before we even run algorithm A. This is the only way algorithm A is able to complete fast enough. This is literally what C and assembly are all about (aside from the simple stuff like SIMD or modifying some obscure register), and 99.999% of the LARPing retards and C apologists on the internet don't understand this.
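A toy sketch of that validate-once, run-unchecked pattern, under the assumption that the invariant is simply "every cell holds a valid index into the same array". The real thing described above has 10-20 invariants and isn't shown here; validate_links and follow_links are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Run once when the array is received: checks the (hypothetical)
   invariant that every cell is an index into the same array. */
bool validate_links(const size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (next[i] >= n)
            return false;
    return true;
}

/* Hot loop, run many times: follows `steps` links starting at `start`.
   It does no bounds checking, so it is only safe to call after
   validate_links(next, n) returned true and with start < n. */
size_t follow_links(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps-- > 0)
        i = next[i];
    return i;
}
```

The safety argument lives entirely in the comment on follow_links. Nothing in the language enforces it, so if the array is modified after validation the hot loop can read out of bounds.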

Now, the hardest part (which nobody actually cares about, because C is full of retards who haven't the slightest clue of anything) is that all kinds of UB in C could actually lead to violating memory-safety. Again I'm talking about subtle UB, not obvious things like writing to the 4097th element of a 4096 element array. Due to this fact (without even caring about compiler bugs), it's literally harder to preserve memory-safety in C than it is in any assembly language.

lol C programmers have UB in their brain

We're just rolling backwards at this point. A whole generation of programmers are too stupid to use C or autotools and instead of learning they cry about difficulty and create shitty alternatives that work worse than what we already had.

read

No, nemory (sic) safety is something only retarded pajeets need to be treated with baby gloves at the expense of performance.

Meanwhile we have retards like you who insist on making all your downstream users pay the cost of probing for whether the AIX FORTRAN 77 compiler supports arguments in the correct order because you're too arrogant to use a build system that isn't 45 years of layered dogshit

I think we all know what the real solution to this is.

We need to rewrite C.

...

Sorry, here:

It's called Jai

...

We need to rewrite C in Rust.*

actually keked

It's not perfect, but when you rage about it and come up with something worse that dies within a couple years it's clear that the problem was you.

no

all those ada shills have actually convinced me to learn ada instead of c
too bad i won't have as many indian friends to compliment my work after i'm done