Discussion:
Cyber Grand Challenge, prizes up to 2 million dollars! (DARPA)
Skybuck Flying
2015-09-23 11:09:19 UTC
(Click the little icon at the top left of the website for the menu):

Information about challenge:

http://www.cybergrandchallenge.com/site/index.html#about

https://cgc.darpa.mil/CGC_Rules_16_May_14_Version_2.pdf

Perhaps this will be a yearly contest.

There is a catch, though, to collecting the prizes: "The prize recipient shall
be a citizen, a permanent resident of the United States, or a US Entity."

Bye,
Skybuck.
Skybuck Flying
2015-09-23 11:25:02 UTC
Also very interesting read:

https://cgc.darpa.mil/CGC_FAQ.pdf

Just the list of common programming mistakes is already pretty interesting! ;) =D

Bye,
Skybuck.
Skybuck Flying
2015-09-23 11:37:18 UTC
Also very interesting read:

http://blog.trailofbits.com/2015/07/15/how-we-fared-in-the-cyber-grand-challenge/

"
How We Fared in the Cyber Grand Challenge
July 15, 2015 by Artem Dinaburg

The Cyber Grand Challenge qualifying event was held on June 3rd, at exactly
noon Eastern time. At that instant, our Cyber Reasoning System (CRS) was
given 131 purposely built insecure programs. During the following 24 hour
period, our CRS was able to identify vulnerabilities in 65 of those programs
and rewrite 94 of them to eliminate bugs built into their code. This proves,
without a doubt, that it is not only possible but achievable to automate the
actions of a talented software auditor.

Despite the success of our CRS at finding and patching vulnerabilities, we
did not qualify for the final event, to be held next year. There was a fatal
flaw that lowered our overall score to 9th, below the 7th place threshold
for qualification. In this blog post we’ll discuss how our CRS works, how it
performed against competitor systems, what doomed its score, and what we are
going to do next.

Cyber Grand Challenge Background

The goal of the Cyber Grand Challenge (CGC) is to combine the speed and
scale of automation with the reasoning capabilities of human experts.
Multiple teams create Cyber Reasoning Systems (CRSs) that autonomously
reason about arbitrary networked programs, prove the existence of flaws in
those programs, and automatically formulate effective defenses against those
flaws. How well these systems work is evaluated through head-to-head
tournament-style competition.

The competition has two main events: the qualifying event and the final
event. The qualifying event was held on June 3, 2015. The final event is set
to take place during August 2016. Only the top 7 competitors from the
qualifying event proceed to the final event.

During the qualifying event, each competitor was given the same 131
challenges, or purposely built vulnerable programs, each of which contained
at least one intentional vulnerability. For 24 hours, the competing CRSs
faced off against each other and were scored according to four criteria. The
full details are in the CGC Rules, but here’s a quick summary:

- The CRS had to work without human intervention. Any teams found to use
  human assistance were disqualified.
- The CRS had to patch bugs in challenges. Points were gained for every bug
  successfully patched. Challenges with no patched bugs received zero points.
- The CRS could prove bugs exist in challenges. The points from patched
  challenges were doubled if the CRS could generate an input that crashed
  the challenge.
- The patched challenges had to function and perform almost as well as the
  originals. Points were lost based on performance and functionality loss in
  the patched challenges.
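
Read literally, those four criteria combine roughly like the sketch below. This
is not the official CGC scoring formula (the real rules are in the linked PDF);
in particular, the min() used for the availability factor and the one-point base
per patched bug are assumptions made only for illustration.

# Rough sketch of per-challenge scoring under the rules summarized above.
# NOT the official scoring code: the availability formula is an assumption.
def challenge_score(bugs_patched: int, proved_crash: bool,
                    functionality: float, performance: float) -> float:
    """functionality and performance are assumed fractions in [0.0, 1.0]."""
    if bugs_patched == 0:
        return 0.0                      # no patched bugs: the challenge scores zero
    points = float(bugs_patched)        # points gained per successfully patched bug
    if proved_crash:
        points *= 2.0                   # doubled by an input that crashes the challenge
    availability = min(functionality, performance)  # assumed way of combining the two
    return points * availability        # penalized for functionality/performance loss

# Example: one bug patched and proven, full functionality, but high CPU overhead.
print(challenge_score(1, True, functionality=1.0, performance=0.55))  # -> 1.1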

A spreadsheet with all the qualifying event scores and other data used to
make the graphs in this post is available from DARPA (Trail of Bits is the
ninth place team). With the scoring in mind, let’s review the Trail of Bits
CRS architecture and the design decisions we made.

Preparation

We’re a small company with a distributed workforce, so we couldn’t
physically host a lot of servers. Naturally, we went with cloud computing to
do processing; specifically, Amazon EC2. Those who saw our tweets know we
used a lot of EC2 time. Most of that usage was purely out of caution.

We didn’t know how many challenges would be in the qualifying event — just
that it would be “more than 100.” We prepared for a thousand, with each
accompanied by multi-gigabyte network traffic captures. We were also
terrified of an EC2 region-wide failure, so we provisioned three different
CRS instances, one in each US-based EC2 region, affectionately named Biggie
(us-east-1), Tupac (us-west-2), and Dre (us-west-1).

It turns out that there were only 131 challenges and no gigantic network
captures in the qualifying event. During the qualifying event, all EC2
regions worked normally. We could have comfortably done the qualifying event
with 17 c4.8xlarge EC2 instances, but instead we used 297. Out of our
abundance of caution, we over-provisioned by a factor of ~17x.

Bug Finding

The Trail of Bits CRS was ranked second by the number of verified bugs found
(Figure 1). This result is impressive considering that we started with
nothing while several other teams already had existing bug finding systems
prior to CGC.

Figure 1: Teams in the qualifying event ranked by number of bugs found.
Orange bars signify finalists.

Our CRS used a multi-pronged strategy to find bugs (Figure 2). First, there
was fuzzing. Our fuzzer is implemented with a custom dynamic binary
translator (DBT) capable of running several 32-bit challenges in a single
64-bit address space. This is ideal for challenges that feature multiple
binaries communicating with one another. The fuzzer’s instrumentation and
mutation are separated, allowing for pluggable mutation strategies. The DBT
framework can also snapshot binaries at any point during execution. This
greatly improves fuzzing speed, since it’s possible to avoid replaying
previous inputs when exploring new input space.

Figure 2: Our bug finding architecture. It is a feedback-based architecture
that explores the state space of a program using fuzzing and symbolic
execution.
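
As a rough illustration of the snapshot idea (this is not the actual DBT
fuzzer; run_target() is a hypothetical hook, and os.fork() merely stands in for
the DBT's snapshot mechanism):

# Conceptual sketch of snapshot-based fuzzing: execute the shared input
# prefix once, snapshot, then explore many mutated suffixes from there.
import os
import random

def run_target(data: bytes) -> None:
    pass  # placeholder: drive the instrumented challenge binary with `data`

def fuzz_from_snapshot(prefix: bytes, trials: int = 100) -> None:
    run_target(prefix)                  # reach the interesting state once
    for _ in range(trials):
        pid = os.fork()                 # "snapshot": child inherits current state
        if pid == 0:
            suffix = bytes(random.randrange(256) for _ in range(16))
            run_target(suffix)          # explore only the new suffix
            os._exit(0)
        os.waitpid(pid, 0)              # parent resumes from the snapshot point

fuzz_from_snapshot(b"LOGIN admin\n")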

In addition to fuzzing, we had not one but two symbolic execution engines.
The first operated on the original unmodified binaries, and the second
operated on the translated LLVM from mcsema. Each symbolic execution engine
had its own strengths, and both contributed to bug finding.

The fuzzer and symbolic execution engines operate in a feedback loop
mediated by a system we call MinSet. The MinSet uses branch coverage to
maintain a minimum set of maximal coverage inputs. The inputs come from any
source capable of generating them: PCAPs, fuzzing, symbolic execution, etc.
Every tool gets original inputs from MinSet, and feeds any newly generated
inputs into MinSet. This feedback loop lets us explore the possible input
space with both fuzzers and symbolic execution in parallel. In practice this
is very effective. We log the provenance of our crashes, and most of them
look something like:

Network Capture ⇒ Fuzzer ⇒ SymEx1 ⇒ Fuzzer ⇒ Crash
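
A minimal sketch of the MinSet idea, assuming coverage is reported as a set of
branch edges per input (the class and method names are illustrative, not the
CRS's real interface):

# Minimal sketch of a MinSet-style corpus keeper. An input is retained
# only if it covers at least one branch edge no stored input has hit.
class MinSet:
    def __init__(self):
        self.covered = set()    # all branch edges seen so far
        self.corpus = []        # the minimal set of maximal-coverage inputs

    def add(self, input_bytes, edges):
        """edges: set of branch edges exercised by input_bytes."""
        if not (edges - self.covered):
            return False        # no new coverage: discard the input
        self.covered |= edges
        self.corpus.append(input_bytes)
        return True             # new coverage: hand it back to fuzzer/symex

minset = MinSet()
minset.add(b"GET /", {("main", "parse"), ("parse", "reply")})   # kept
minset.add(b"GET /a", {("main", "parse")})                      # discarded
minset.add(b"POST /", {("parse", "error")})                     # kept, new edge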

Some bugs can only be triggered when the input replays a previous nonce,
which would be different on every execution of the challenge. Our bug
finding system can produce inputs that contain variables based on program
outputs, enabling our CRS to handle such cases.
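
As a hypothetical illustration (the NONCE/AUTH protocol and the regex below are
invented, not the CRS's actual input format), an input containing a variable
bound to program output might be replayed like this:

# Hypothetical sketch of replaying an input that echoes back a nonce the
# program printed earlier in the same execution.
import re
import subprocess

def replay_with_nonce(target_cmd):
    proc = subprocess.Popen(target_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    banner = proc.stdout.readline()              # e.g. b"NONCE 8f3a\n"
    match = re.search(rb"NONCE (\w+)", banner)
    nonce = match.group(1) if match else b""
    proc.stdin.write(b"AUTH " + nonce + b"\n")   # variable bound to output
    proc.stdin.flush()
    return proc.stdout.readline()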

Additionally, our symbolic executors are able to identify which inputs
affect program state at the point of a crash. This is a key requirement for
the success of any team competing in the final as it enables the CRS to
create a more controlled crash.

Patching

Our CRS’s patching effectiveness, as measured by the security score, ranks
fourth (Figure 3).

Figure 3: Teams in the qualifying event ranked by patch effectiveness
(security score). Orange bars signify finalists.

Our CRS patches bugs by translating challenges into LLVM bitcode with
mcsema. Patches are applied to the LLVM bitcode, optimized, and then
converted back into executable code. The actual patching works by gracefully
terminating the challenge when invalid memory accesses are detected.
Patching the LLVM bitcode representation of challenges provides us with
enormous power and flexibility:

- We can easily validate any memory access and keep track of all memory
  allocations.
- Complex algorithms, such as dataflow tracking, dominator trees, dead store
  elimination, loop detection, etc., are very simple to implement using the
  LLVM compiler infrastructure.
- Our patching method can be used on real-world software, not just CGC
  challenges.
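
Conceptually, the generic patch behaves like the sketch below at runtime. The
real patch is LLVM bitcode instrumentation, not Python, and the
allocation-tracking details here are assumptions; the sketch only shows the
idea of checking each access and exiting gracefully on an invalid one.

# Conceptual model of the generic patch: track allocations, check every
# load/store, and terminate the challenge gracefully on an invalid access.
import sys

allocations = {}                         # base address -> size of live allocation

def on_allocate(base, size):
    allocations[base] = size

def on_free(base):
    allocations.pop(base, None)

def checked_access(addr, width):
    for base, size in allocations.items():
        if base <= addr and addr + width <= base + size:
            return                       # inside a live allocation: allow it
    sys.exit(0)                          # graceful exit instead of memory corruption

on_allocate(0x1000, 64)
checked_access(0x1000, 4)                # valid access, execution continues
checked_access(0x2000, 4)                # invalid access, exits cleanly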

We created two main patching strategies: generic patching and bug-based
patching. Generic patching is an exclusion-based strategy: it first assumes
that every memory access must be verified, and then excludes accesses that
are provably safe. The benefit of generic patching is that it patches all
possible invalid memory accesses in a challenge. Bug-based patching is an
inclusion-based strategy: it first assumes only one memory access (where the
CRS found a bug) must be verified, and then includes nearby accesses that
may be unsafe. Each patching strategy has multiple heuristics to determine
which accesses should be included or excluded from verification.
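
The difference between the two strategies can be summarized in a short sketch
(the function names and the looks_unsafe predicate are illustrative, not the
CRS's actual heuristics):

# Both strategies produce the set of memory accesses to instrument.
def generic_patching(all_accesses, provably_safe):
    """Exclusion-based: verify everything except accesses proven safe."""
    return set(all_accesses) - set(provably_safe)

def bug_based_patching(buggy_access, nearby_accesses, looks_unsafe):
    """Inclusion-based: start from the known bug, add suspicious neighbors."""
    return {buggy_access} | {a for a in nearby_accesses if looks_unsafe(a)}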

The inclusion and exclusion heuristics generate patched challenges with
different security/performance tradeoffs. The patched challenges generated
by these heuristics were tested for performance and security to determine
which heuristic performed best while still fixing the bug. For the
qualifying event, we evaluated both generic and bug-based patching, but
ultimately chose a generic-only patching strategy. Bug-based patching was
slightly more performant, but generic patching was more comprehensive and it
patched bugs that our CRS couldn’t find.

Functionality and Performance

Functionality and performance scores combine to create an availability
score. The availability score is used as a scaling factor for points gained
by patching and bug finding. This scaling factor only matters for
successfully patched challenges, since those are the only challenges that
can score points. The following graphs only consider functionality and
performance of successfully patched challenges.

Functionality

Out of the 94 challenges that our CRS successfully patched, 56 retained full
functionality, 30 retained partial functionality, and 8 were nonfunctional.
Of the top 10 teams in the qualifying event, our CRS ranks 5th in terms of
fully functional patched challenges (Figure 4). We suspect our patched
challenges lost functionality due to problems in mcsema, our x86 to LLVM
translator. We hope to verify and address these issues once DARPA
open-sources the qualifying event challenges.

Figure 4: The count of perfectly functional, partially functional, and
nonfunctional challenges submitted by each of the top 10 teams in the
qualifying event. Orange bars signify finalists.

Performance

The performance of patched challenges is how our CRS snatched defeat from
the jaws of victory. Of the top ten teams in the qualifying event, our CRS
placed last in terms of patched challenge performance (Figure 5).

Figure 5: Average and median performance scores of the top ten qualifying
event participants. Orange bars signify finalists.

Our CRS produces slow binaries for two reasons: technical and operational.
The technical reason is that the performance overhead of our patched challenges
is an artifact of our patching process, which translates challenges into LLVM
bitcode and then re-emits them as executable binaries. The operational
reason is that our patching was developed late and optimized for the wrong
performance measurements.

So, why did we optimize for the wrong performance measurements? The official
CGC performance measurement tools were kept secret, because the organizers
wanted to ensure that no one could cheat by gaming the performance
measurements. Therefore, we had to measure performance ourselves, and our
metrics showed that CPU overhead of our patched challenges was usually
negligible. The main flaw that we observed was that our patched challenges
used too much memory. Because of this, we spent time and effort optimizing
our patching to use less memory at the cost of more CPU time.

It turns out we optimized for the wrong thing, because our self-measurement
did not agree with the official measurement tools (Table 1). When
self-measuring, our worst-performing patching method had a median CPU
overhead of 33% and a median memory overhead of 69%. The official qualifying
event measured us at 76% CPU overhead and 28% memory overhead. Clearly, our
self-measurements were considerably different from official measurements.
Measurement                         | Median CPU Overhead | Median Memory Overhead
Worst Self-Measured Patching Method | 33%                 | 69%
Official Qualifying Event           | 76%                 | 28%

Table 1: Self-measured CPU and memory overhead versus the official qualifying
event CPU and memory overhead.

Our CRS measured its overall score with our own performance metrics. The
self-measured score of our CRS was 106, which would have put us in second
place. The real overall score was 21.36, putting us in ninth.

An important aspect of software development is choosing where to focus your
efforts, and we chose poorly. CGC participants had access to the official
measuring system during two scored events held during the year, one in
December 2014 and one in April 2015. We should have evaluated our patching
system thoroughly during both scored events. Unfortunately, our patching
wasn’t fully operational until after the second scored event, so we had no
way to verify the accuracy of our self-measurement. The performance penalty
of our patching isn’t a fundamental issue. Had we known how bad it was, we
would have fixed it. However, according to our own measurements the patching
was acceptable, so we focused our efforts elsewhere.

What’s Next?

According to the CGC FAQ (Question 46), teams are allowed to combine after
the qualifying event. We hope to join forces with another team that
qualified for the CGC final event, and use the best of both our technologies
to win. The technology behind our CRS will provide a significant advantage
to any team that partners with us. If you would like to discuss a potential
partnership for the CGC final, please contact us at ***@trailofbits.com.

If we cannot find a partner for the CGC final, we will focus our efforts on
adapting our CRS to automatically find and patch vulnerabilities in real
software. Our system is up to the task: it has already proven that it can
find bugs, and all of its core components were derived from software that
works on real Linux binaries. Several components even have Windows and
64-bit support, and adding support for other platforms is a possibility. If
you are interested in commercial applications of our technology, please get
in touch with us at ***@trailofbits.com.

Finally, we plan to contribute back fixes and updates to the open source
projects utilized in our CRS. We used numerous open source projects during
development, and have made several custom fixes and modifications. We look
forward to contributing these back to the community so that everyone
benefits from our improvements.
"

Bye,
Skybuck =D
