This post summarizes my study of the following paper. The figures and facts are taken from that paper and should not be used for other purposes.

Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. 2008. BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In Proceedings of the 17th conference on Security symposium (SS’08). USENIX Association, Berkeley, CA, USA, 139-154.

Machine learning has proved effective in many fields, notably NLP, OCR and recommendation systems, but its application to security has been far less satisfactory. Given that security-related data is growing at a rate of gigabytes per day in many organizations, there is an urgent need to get hands-on with so-called data-driven security. In my study of this subfield of applied machine learning, botnet detection has turned out to be a good research example.

A few words about Botnets

A botnet can be understood as a group of machines, either infected or deliberately enlisted, that a botmaster communicates with and controls in order to carry out malicious activities over the network. A botnet can do serious harm to a legitimate network or system, for example through DDoS attacks, spam, phishing, identity theft and information exfiltration. A botnet's structure is typically centralized or distributed (P2P), and its command-and-control (C&C) protocol is typically IRC or HTTP. Since most networks allow HTTP, HTTP-based P2P botnets are becoming more and more popular.

Fig.1 Possible Botnet structures. a centralized b

Previous work has developed many techniques for detecting botnets. However, these techniques either focus on a particular C&C protocol, structure or infection model, or cannot cope with changing C&C server addresses (e.g., fast-flux service networks). In this paper, the authors instead propose a general data-driven framework based on the intrinsic characteristics of botnets, namely:

  • who is talking to whom? (C-plane)
  • who is doing what? (A-plane)

The underlying assumption is that an identifiable botnet is always driven by a certain number of C&C servers and is intended to perform malicious activities against some assets. The characteristics of an identifiable botnet can therefore be summarized as its C&C patterns and its malicious activity patterns (see Fig.1). Because it inspects botnets by their behavior, the detection framework is largely independent of structure, protocol, infection model and so on.

C-plane

The BotMiner framework is thus divided into two parts: the C-plane and the A-plane. C stands for C&C: this plane examines the network flows between the botmaster and the bots, on the belief that these flows follow certain patterns. It logs the network flows in a format suitable for efficient storage and further analysis.
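To make the idea of a C-plane flow log more concrete, here is a minimal sketch of how logged flows could be grouped by who is talking to whom and turned into a few numeric features. The `Flow` record, its field names and the `aggregate_c_flows` helper are my own hypothetical names, not the paper's, and the sketch uses plain averages as a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    """One logged flow record. Field names are illustrative assumptions."""
    src: str
    dst: str
    dport: int
    proto: str
    duration: float  # seconds
    n_packets: int
    n_bytes: int


def aggregate_c_flows(flows, window_hours=24.0):
    """Group flows by (src, dst, dport, proto) and compute simple features.

    window_hours is the length of the observation window (an assumption).
    """
    groups = defaultdict(list)
    for f in flows:
        groups[(f.src, f.dst, f.dport, f.proto)].append(f)

    features = {}
    for key, fs in groups.items():
        packets = sum(f.n_packets for f in fs)
        total_bytes = sum(f.n_bytes for f in fs)
        seconds = sum(f.duration for f in fs)
        features[key] = {
            "flows_per_hour": len(fs) / window_hours,
            "packets_per_flow": packets / len(fs),
            "bytes_per_packet": total_bytes / packets if packets else 0.0,
            "bytes_per_second": total_bytes / seconds if seconds else 0.0,
        }
    return features


if __name__ == "__main__":
    log = [
        Flow("10.0.0.1", "1.2.3.4", 6667, "tcp", 10.0, 20, 1000),
        Flow("10.0.0.1", "1.2.3.4", 6667, "tcp", 10.0, 20, 1000),
        Flow("10.0.0.2", "1.2.3.4", 6667, "tcp", 5.0, 10, 400),
    ]
    for key, feats in aggregate_c_flows(log, window_hours=2.0).items():
        print(key, feats)
```

Each resulting feature dictionary could then be flattened into a vector and handed to a clustering algorithm.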

A-plane

The A-plane, on the other hand, focuses on the outbound traffic of the activities performed by the bots. Suspicious activities such as scanning, spamming, binary downloading and exploit attempts are very likely to follow certain patterns. To detect these malicious activities, the authors deployed a variety of IDS engines to identify the traffic patterns.
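As a rough illustration of how IDS output could be organized by who is doing what, the sketch below groups alerts first by activity type and then by the details of the activity (for example, the ports being scanned). The alert tuple layout and the `two_layer_groups` name are assumptions of mine, and exact matching of detail sets stands in for real clustering here.

```python
from collections import defaultdict


def two_layer_groups(alerts):
    """Group hosts by activity type, then by identical activity details.

    alerts: iterable of (host, activity_type, detail) tuples. This layout is a
    hypothetical simplification of IDS output, not an actual Snort format.
    """
    # Layer 1: per activity type, collect the set of details seen for each host.
    by_type = defaultdict(lambda: defaultdict(set))
    for host, activity, detail in alerts:
        by_type[activity][host].add(detail)

    # Layer 2: within one activity type, hosts with the same detail set
    # (e.g. the same scanned ports) end up in the same group.
    clusters = {}
    for activity, hosts in by_type.items():
        groups = defaultdict(set)
        for host, details in hosts.items():
            groups[frozenset(details)].add(host)
        clusters[activity] = dict(groups)
    return clusters


if __name__ == "__main__":
    alerts = [
        ("h1", "scan", 445),
        ("h2", "scan", 445),
        ("h3", "scan", 22),
        ("h1", "spam", "mx.example.org"),
    ]
    for activity, groups in two_layer_groups(alerts).items():
        for details, hosts in groups.items():
            print(activity, sorted(details, key=str), sorted(hosts))
```

Grouping hosts by overlapping (rather than identical) detail sets, as one would want for SMTP destinations, needs a similarity measure and is left out of this sketch.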

Importantly, neither the C-plane nor the A-plane alone is enough to detect botnets: either one on its own usually produces a high false positive rate. BotMiner combines the two planes and cross-correlates their outputs to produce the final results. The architecture of BotMiner is depicted in Fig.2.

Fig.2. BotMiner architecture

Learning traffic

As shown in Fig.2, the outbound traffic in the A-plane and the network flow data in the C-plane are filtered and preprocessed into feature vectors, as machine learning algorithms commonly require.

For the C-plane, network flows are aggregated according to source IP, destination IP, port number and protocol type, which together define who is talking to whom. Features are then built from each aggregate, for example the number of flows per hour, the number of packets per flow, the average number of bytes per packet and the average number of bytes per second. These characterize the communication pattern when clients talk to servers. A two-step clustering is then applied to the dataset, using X-means.

The A-plane also follows a two-layer clustering: the Snort output is first clustered by activity type, and then clustered further within each activity type. For instance, hosts scanning the same ports are placed in the same cluster, as are hosts whose SMTP destinations overlap. This defines who is doing what.

Finally, the clustering results of the two planes are cross-correlated to compute the final clusters, which identify the detected botnets. For the cross-plane correlation, a score is assigned to each host, and the score is expected to be higher when the host takes part in multiple malicious activities. If the host also belongs to at least one C-cluster, that is, a group of hosts sharing common network flow patterns, then we believe the host belongs to a botnet.
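The cross-plane scoring can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the paper's exact formula: every cluster gets equal weight, cluster overlap is measured with Jaccard similarity, and the names `host_score`, `likely_bots` and the `threshold` value are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def host_score(host, a_clusters, c_clusters):
    """Score one host from its A-cluster and C-cluster memberships.

    a_clusters: list of (activity_type, set_of_hosts)
    c_clusters: list of set_of_hosts
    All clusters are weighted equally (an assumption of this sketch).
    """
    mine = [(t, hosts) for t, hosts in a_clusters if host in hosts]
    score = 0.0
    # Reward membership in A-clusters of different activity types.
    for (t1, h1), (t2, h2) in combinations(mine, 2):
        if t1 != t2:
            score += jaccard(h1, h2)
    # Reward A-clusters that overlap with a C-cluster containing the host.
    for _, hosts in mine:
        for c in c_clusters:
            if host in c:
                score += jaccard(hosts, c)
    return score


def likely_bots(hosts, a_clusters, c_clusters, threshold=1.0):
    """Hosts in at least one C-cluster whose score exceeds the threshold."""
    return {
        h for h in hosts
        if any(h in c for c in c_clusters)
        and host_score(h, a_clusters, c_clusters) > threshold
    }


if __name__ == "__main__":
    a_clusters = [("scan", {"h1", "h2", "h3"}), ("spam", {"h1", "h2"})]
    c_clusters = [{"h1", "h2"}, {"h7", "h8"}]
    print(sorted(likely_bots({"h1", "h2", "h3"}, a_clusters, c_clusters)))
```

In this toy input, h3 scans but shares no C-cluster with anyone, so it is not flagged, while h1 and h2 share both activities and a communication pattern.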

Results

The results look good: BotMiner detects almost all of the botnets, as detailed in Table.1. Refer to the paper for further explanation of the observations.

Table.1 Botnet detection results using BotMiner
Botnet       #Bots  Detected?  Clustered Bots  Detection Rate  False Positive Clusters/Hosts  FP Rate
IRC-rbot     4      YES        4               100%            1/2                            0.003
IRC-sdbot    4      YES        4               100%            1/2                            0.003
IRC-spybot   4      YES        3               75%             1/2                            0.003
IRC-N        259    YES        258             99.6%           0                              0
HTTP-1       4      YES        4               100%            1/2                            0.003
HTTP-2       4      YES        4               100%            1/2                            0.003
P2P-Storm    13     YES        13              100%            0                              0
P2P-Nugache  82     YES        82              100%            0                              0

Conclusion

Looking back at this work on detecting botnets with learning-based techniques, the results suggest that the assumptions made about botnets are realistic and workable. As we have seen, the whole framework is grounded on two questions: who is talking to whom, and who is doing what. These characterize the fundamental behavior of malware instances and set them apart from normal ones. Although considerable feature engineering is still indispensable for efficiency and precision, the intrinsic property of a botnet is that its infected machines share similar communication patterns and, at the same time, perform the same set of suspicious activities.