Hindawi
Journal of Computer Networks and Communications
Volume 2019, Article ID 5758437, 10 pages
https://doi.org/10.1155/2019/5758437

Research Article
Using Burstiness for Network Applications Classification

Hussein Oudah,1,2 Bogdan Ghita,1 Taimur Bakhshi,3 Abdulrahman Alruban,1,4 and David J. Walker5

1Centre for Security, Communications and Network Research, University of Plymouth, Plymouth, UK
2Department of Mathematics and Computer Applications, Al-Muthanna University, Samawah, Iraq
3National University of Computer & Emerging Sciences, Lahore, Pakistan
4Department of Information Technology, Computer Sciences and Information Technology College, Majmaah University, Al-Majmaah, 11952, Saudi Arabia
5Centre for Robotics and Neural Systems, University of Plymouth, Plymouth, UK

Correspondence should be addressed to Bogdan Ghita; bogdan.ghita@plymouth.ac.uk

Received 29 March 2019; Revised 19 June 2019; Accepted 25 July 2019; Published 20 August 2019

Academic Editor: Djamel F. H. Sadok

Copyright © 2019 Hussein Oudah et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Network traffic classification is a vital task for service operators, network engineers, and security specialists to manage network traffic, design networks, and detect threats. Identifying the type/name of applications that generate traffic is a challenging task as encrypting traffic becomes the norm for Internet communication. Therefore, relying on conventional techniques such as deep packet inspection (DPI) or port numbers is not efficient anymore. This paper proposes a novel flow statistical-based set of features that may be used for classifying applications by leveraging machine learning algorithms to yield high accuracy in identifying the type of applications that generate the traffic.
The proposed features compute different timings between packets and flows. This work utilises tcptrace to extract features based on traffic burstiness and periods of inactivity (idle time) for the analysed traffic, followed by the C5.0 algorithm for determining the applications that generated it. The evaluation tests, performed on a set of real, uncontrolled traffic, indicated that the method has an accuracy of 79% in identifying the correct network application.

1. Introduction

In the context of the ever-increasing network activity and reliance on the Internet, monitoring and characterising web traffic is critical for network administrators for operational and security activities. A number of directions were explored, such as establishing what websites the users are interested in, how much traffic is generated by specific applications, and whether these applications or services can be controlled in terms of network resource demands [1]. The research community proposed a number of alternatives, with earlier studies focusing on port-based approaches and deep packet inspection. However, these methods fail to identify applications that use dynamic ports, share well-known ports such as port 80, or rely on encrypted methods such as SSL/TLS [2, 3]. In recent years, studies have focused on using the statistical features approach for identifying traffic associated with applications based on machine learning algorithms [4]. This method relies on the characteristics of IP flows, such as the number of packets in a flow and the size and duration of a flow, which reflect unique patterns for applications. The aforementioned method was considered flexible for emerging traffic as it utilises the network level (packet header), with promising results, rather than the application level (packet contents) [5].

For better classification decisions, the prior art proposed either new machine learning algorithms (MLA) or novel feature-based approaches. The majority of previous studies in the domain of traffic classification focused upon introducing new MLA, and little attention has been given to the extraction of new features. To this end, this study aims to explore the potential of a newly extracted feature subset and find out whether these features have a positive impact on the system performance. The new features need to be sufficiently discriminative in order to distinguish between applications. The proposed approach focuses on classifying web-based applications such as Facebook and YouTube, which use the network/Internet to manage requests from a client to an application server, rather than network-specific utilities/protocols such as SMTP and FTP. Identifying the optimal features for applications reduces the potentially large dimensionality and might be useful to improve the system performance [6]. The proposed features were extracted from real data, which were collected from a group of 20 users at the University of Plymouth for two months, using the tcptrace tool, and evaluated with the C5.0 algorithm [7]. The data were labelled based on DNS queries and IP addresses for each examined application, which was identified from our previous studies [8, 9].

The rest of the paper is organised as follows: Section 2 discusses the state-of-the-art traffic classification approaches in more detail to provide a review of the limitations of present techniques. Section 3 highlights the proposed method and analysis and introduces the feature set. Section 4 presents the results using the C5.0 algorithm, and Section 5 concludes the paper with a summary of achievements.

2. Related Work

As mentioned in the introductory section, the classification of applications has received significant interest from the research community in recent years. This section summarises previously proposed classification techniques and their limitations.

2.1. Traffic Classification Techniques. In the early days of the Internet, applications were easily identified based only upon port number [10]. IANA [11] assigned protocols to well-known transport layer ports; therefore, the identification process was merely based upon matching the port number in the packet header with the table containing the port applications. Due to the continuous growth of Internet applications, standard ports are no longer used; instead, applications have moved towards a web-based front-end or use dynamic ports [2]. Consequently, this method becomes inaccurate when identifying applications, and typical performance ranges between 30 and 70%, depending on the mix of traffic and the number of applications to be identified [12]. Following the improvement in processing power, deep packet inspection (DPI) [13] then became the preferred choice, as it identifies signatures of applications or protocols based on the content of the packets. DPI also became inefficient as most traffic nowadays is encrypted. Moreover, it breaches the privacy of the users and did not scale well from a computational perspective with the increase in core network speed [3, 14].

The research community therefore introduced two techniques to avoid these limitations, focusing on host behaviour and statistical methods. The former technique is based on the idea that hosts generate different communication patterns at the transport layer; by extracting these behavioural patterns, activities and applications can be classified. Although the method showed acceptable performance (over 90%) [15] and can detect the application type, it could not correctly identify the application names, classifying both Yahoo and Gmail as e-mail [16]. In contrast, high accuracy (over 95%) was achieved by applying the latter approach of statistical methods [17–20], using statistical features derived from the packet header, such as number of packets, packet size, packet interarrival time, and flow duration, with the aid of machine learning algorithms. The advantage of using ML algorithms is that they can be used in real-time environments to provide rapid application detection with high accuracy. For instance, the authors of [21] used the Naïve Bayes technique with statistical features to identify traffic. Other ML algorithms utilised were Bayesian neural networks, support vector machines, and decision trees [7, 22, 23]. In [7], the author used a C5.0 decision tree algorithm to classify seven applications with average accuracy over 99%. However, the process of feature selection, which must be flexible to the network circumstances, is a critical point in the construction of a classifier [6].

Given this classification, the approach outlined in this work, which is based on the statistical differences between interarrival times of packets and flows, strengthens the behavioural and statistical methods by considering arrival times of packets and flows as discriminating features among applications. The authors in [24] showed that there is variability (burstiness) in network traffic by using a measure called index of variability. The hypothesis that timing can be used to discriminate between applications was also put forward in [25], which highlighted that applications generate different behaviour based on statistical features relating to the timing of arriving packets. More details about burstiness were proposed in [26], which defines it in two levels. The first level was called a small-time scale flight (STF), which means that the interarrival times of packets occur within a predefined time T (a constant threshold in the range of 5–10 milliseconds). The second level is a large-time scale flight (LTF), defined as larger interarrival times of packets with a value of 40–1,000 milliseconds. A different number of bursts would be generated for each category based on the value of the threshold.

From an efficiency perspective, it should be noted that the statistical approach is appropriate for traffic classification as it can deal with encrypted traffic, which nowadays has become dominant, and it can adapt to real-time traffic. In our previous studies [8, 9], we proposed a set of attributes based on burstiness and idle time; six applications were used to evaluate our features with the aid of the C5.0 method, and the results showed high accuracy (over 97%) in identifying the six applications.

2.2. Splitting Traffic Based on DNS Requests. Given the diminishing success of port analysis and DPI, traffic classification can also be based on DNS analysis. The authors of [27, 28] focused on the volume and variety of DNS queries generated from both clients and servers, aiming to observe the effect of caching mechanisms on the client side. Other studies, such as [29, 30], exploited the DNS information to reveal malware activities. Furthermore, the authors in [31] used DNS queries to classify traffic by matching keywords in the domain names table with the collected flows of traffic. These labelled flows were categorised based on domain name similarity, and the aim was to break down the traffic volume. Similarly, the study [32] utilised DNS to tag flows by capturing the first packet of each flow and exploiting the domain name, which was separated into keywords to form vectors for each application. Also, they used the port number and transport application name as features to classify applications. They claimed that the provided DNS information could be useful to identify more than 30% of traffic. In [33, 34], the authors used DNS to label flows based on the keywords available after resolving IP addresses. Otherwise, the flows would be classified based on selected attributes and with the aid of machine learning to improve accuracy. Another study [35] collected a large volume of traffic from the University of Auckland and found that about 10% of the observed TCP traffic did not use DNS lookups, and neither did about 85% of the UDP traffic.

Using a similar scenario, the authors in [36] argued that traffic could be classified based on the IP address and hostname. Although the results showed that only up to 55% of web traffic could be identified with the proposed method, it achieved high accuracy in identifying applications such as WhatsApp, Twitter, and Dropbox. Based on long-term monitoring, the authors concluded that the IP addresses of servers associated with each application remain stable for short periods, but they change over long periods. The study recommended updating and checking the IP addresses frequently for the methods that rely on IP addresses as a key feature. Similarly, the authors in [37] proposed a method to label websites based on server IP addresses. Firstly, they collected data from different users working on the same website to ensure that the server IP address belongs to the same application, and then they built a ground truth of IP addresses for specific applications and used it to classify a mix of traffic flows. The method showed good results when considering DNS queries. Following the same line of research, the authors in [38] used server IP addresses to group traffic applications to study the user activities, and the authors in [39, 40] claimed that the IP address represents an informative feature. In [1], the authors claimed that traffic could be classified based on DNS, and they showed that a majority of traffic could be resolved, such as HTTP and HTTPS generated traffic, except P2P applications.

From previous studies, we conclude that DNS information and IP addresses could be active factors in classifying applications. We need to look into these attributes for each application and check if they are unique and robust when presented with the variable network environment. The next section focuses on the concept of burstiness and idle time and how burstiness-related features may be generated from tcptrace [41].

3. Proposed Method and Features

As concluded from the prior art, the statistical approach is a robust and reliable approach, allowing efficient network-based (packet header) rather than application-based (packet contents) analysis. Extracting the most discriminative features that characterise applications, without being biased by either user or network circumstances such as congestion and delay, remains the key to success in this approach.

This paper aims at identifying an additional set of features that can be used to discriminate between applications, based on the statistical differences between interarrival times of packets and flows. We focus in particular on burstiness, which defines closely spaced data exchanges, such as objects on the same page, and idle periods, which separate longer-term transactions, such as moving from one page to another when the user is browsing a website. The assumption that we make is that different applications produce different distributions of packet size, duration, distribution of the bursts, and idle time parameters. Consequently, Internet applications behave inherently differently, generating different amounts of data and creating various connections and timing patterns between the generated packets and flows, beyond the generic distribution of connections for overall traffic [42]. For instance, streaming a video on Netflix versus e-mail checking or using social media could lead to significantly different packet arrival patterns and hence a slightly different burstiness signature.

The following example explains the concept of burstiness and how it may be used to discriminate the behaviour of Internet applications. When a user is browsing an application, for instance, the BBC news website (https://www.bbc.com/news), the session would consist of some pages that the user chooses to visit. Within each page, the browser will be requesting and downloading the objects embedded in the page, some on the same site and some hosted on other sites. From a timing perspective, the download of objects on a page would appear as a burst of connections, followed by a period of inactivity (idle time) while the user reads the page until he/she decides to click on a link and load another page. Figure 1 shows how a group of packets forms a burst based on interpacket arrival time, and the periods of inactivity between bursts. This burstiness phenomenon can occur at the level of packets or flows. In this study, the burstiness concept is defined at two levels: the first level is in the context of packet analysis and the second level is in the context of flow analysis.

Figure 1: Definition of bursts and idle time.

3.1. Packet Analysis. In a packet-based analysis, the bursts and idle times are formed based on the interarrival times of packets during the connection between the client and server. This level was defined in [43] as a group of consecutive packets with shorter interarrival delays than the packets arriving before or after them. Given one of the two unidirectional data flows within a connection, a burst_threshold (T) is defined as the maximum time delay between the arrivals of two consecutive packets that belong to the same burst. Similarly, idle_threshold (I) is defined as the minimum interarrival time between groups of packets that identifies an idle period separating two consecutive data exchanges. In order to provide a meaningful description of the interactions, the analysis must establish the values for T and I and whether they should be constant or dynamic. A previous study [26] proposed two ranges for T, of 5 ms–10 ms and 40 ms–1 s. Another study [43] proposed two different scenarios for the value of T: in the first, T was dynamic and could take different values, while in the second, T was fixed, although no values were proposed. In order to get an image of the range of time values for the protocol interaction, Figure 2 shows the interpacket arrival time for five applications. Most distributions of the interpacket arrival time fall under 1 second, except for YouTube, which falls under 0.5 seconds. Accordingly, the burst_threshold was set to 1 second, while the idle threshold was set to 10 seconds. While the application does indeed exhibit a different signature in terms of packet arrival distribution, user behaviour may also influence this distribution, particularly in relation to long-term activity, as idle times are a factor of user behaviour too. The idle time could vary according to the behaviour of the user when he/she moves from one page to another. As shown in previous studies, the distribution of timing for user connections may be used as a discriminant among users [38, 44]. However, while users may introduce a level of noise in the distribution, a sufficiently large data sample would allow determining the benefits and limitations of the method. Since prior studies, such as [45], utilised idle time values typically ranging from 10 seconds to 5 minutes, the idle_threshold (I) was set to 10 seconds.

As shown in Figure 1, many features could be extracted from each flow and each direction, such as the total number of bursts in the direction a-b/b-a, the total number of packets within bursts for each direction, and the total size of bursts in bytes in each direction. The pseudocode in Algorithm 1 summarises the estimation of bursts and idle time between packets and for each flow. For each packet arrival, the interarrival time is compared with the two burstiness thresholds to determine whether the packet is part of a new burst or session. The possible features that could be extracted from the pseudocode are described in Table 1; each entry in the table is a pair of variables, one for the a-to-b direction and one for the b-to-a direction, where a refers to the client side and b refers to the server side.

3.2. Flow Analysis. The same concept was applied to calculate the burst and idle time between flows. The calculation measured the time differences between the start times of consecutive flows, taken from the first packet of each flow. The timestamp of the first packet of the first flow is subtracted from the timestamp of the first packet of the second flow; if the time difference is less than 1 s, then a burst is formed. Otherwise, if the time difference is more than 10 s, then the period is considered an idle time. Table 2 summarises burst-based features.

3.3. Conventional Analysis. In our experiment, the proposed features are compared with the previous ones to show the effects of the proposed method in distinguishing between applications. These features were calculated for each direction of flow as shown in Table 3.

3.4. Splitting Traffic. In our earlier studies [8, 9], data collection was based on controlled application usage, with users being given instructions of what to do, which applications to use, and for how long. The users were asked to browse these applications separately. Hence, the data were collected per application and dumped in labelled files. Accordingly, from each application file, the destination IP address was extracted and dumped in separate files. For this paper, real data traffic was collected and the DNS requests were used to identify which application was requested. We acknowledge that DNS may not be the most accurate method, but it allowed testing the accuracy of our method. In addition to the automatic allocation to applications, after each request, IP addresses were extracted for three seconds to update the IP address files; finally, traffic flows were matched against the IP address files and the matched traffic was stored in files based on each application. These two mechanisms served to label the data with the applications, which will be the input to the classifier, and to examine the proposed features.

4. Experimental Methodology

A high-level architecture of the proposed system is presented in Figure 3, which highlights the key components of the application identification scheme. Firstly, real traffic was collected from twenty hosts in an office environment for over two months. Afterwards, the packet trace was parsed to read the captured DNS requests, and the name was used to identify the application type, such as Google or BBC news. Also, the packet trace was analysed by tcptrace to generate flows with proposed and conventional features. Finally, the resulting flows were labelled based on IP address and then fed into a C5.0 classifier. The following subsections provide more detail on the methodology.

4.1. Data Collection. The raw data traffic was collected from a lab in the University of Plymouth between May and July 2018 from a group of 20 PhD students. The data were collected using a tcpdump tool via a network-based

Figure 2: The distribution of interpacket arrival times for five applications (Amazon, Skype, CNN, YouTube, and Instagram; x-axis: interpacket arrival time in msec).
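The burst and idle-time estimation described in Section 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' tcptrace implementation: the names `BurstStats` and `burst_idle_stats` are hypothetical, the 1 s and 10 s defaults are the thresholds chosen in Section 3.1, and two choices are made here for concreteness: every maximal run of closely spaced arrivals counts as a burst (including a run of a single packet and the final run), and burst duration is measured from the first to the last arrival of a run.

```python
from dataclasses import dataclass


@dataclass
class BurstStats:
    bursts: int = 0              # maximal runs of closely spaced arrivals
    idle_periods: int = 0        # gaps longer than the idle threshold
    idle_time: float = 0.0       # accumulated length of those gaps (seconds)
    burst_duration: float = 0.0  # accumulated first-to-last time of all bursts


def burst_idle_stats(timestamps, burst_threshold=1.0, idle_threshold=10.0):
    """Summarise bursts and idle periods for one direction of a connection.

    `timestamps` are arrival times in seconds. A gap shorter than
    `burst_threshold` keeps the arrival in the current burst; any longer gap
    closes the burst, and a gap above `idle_threshold` also counts as idle time.
    """
    stats = BurstStats()
    times = sorted(timestamps)
    if not times:
        return stats

    burst_start = previous = times[0]
    for current in times[1:]:
        gap = current - previous
        if gap >= burst_threshold:
            # The previous arrival closed a burst; the current one opens a new one.
            stats.bursts += 1
            stats.burst_duration += previous - burst_start
            burst_start = current
            if gap > idle_threshold:
                stats.idle_periods += 1
                stats.idle_time += gap
        previous = current

    # Assumption: the run still open at the end of the trace counts as a burst.
    stats.bursts += 1
    stats.burst_duration += previous - burst_start
    return stats


if __name__ == "__main__":
    # Three bursts separated by a 2.5 s pause and a 16.5 s idle period.
    print(burst_idle_stats([0.0, 0.25, 0.5, 3.0, 3.5, 20.0, 20.25]))
```

Because the function only needs a list of arrival times, the same sketch could be applied to each direction of a connection (packet analysis) or to the first-packet timestamps of successive flows (flow analysis, Section 3.2), which uses the same 1 s and 10 s thresholds.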
    burst_threshold = 1 s
    idle_threshold = 10 s
    initialise burst and idle time parameters
    while packets arriving do
        calculate interarrival_time
        if interarrival_time < burst_threshold
            current_burst++
            current_session++
        else
            burst_counter++
            current_burst = 1
            if interarrival_time > idle_threshold
                current_session = 1
                session_counter++
                idle_time += interarrival_time
            fi
        fi
    done

ALGORITHM 1: Estimation of packet bursts and idle time.

Table 1: Burstiness and idle time features amongst packets (for each direction).
Feature                      Description
Bursts                       Total number of bursts between packets for each direction
Packets-in-burst             Total number of packets in bursts for each direction
Burst-size                   Total bytes for bursts for each direction
Burst-size-b/Burst-size-a    The ratio between burst-size-b and burst-size-a
Burst-duration               The time duration of bursts for each direction
Burst-duration-a/Packet-a    The ratio between burst duration and number of all packets in bursts
Burst-duration-b/Packet-b    The ratio between burst duration and number of all packets in bursts
Idle-time                    The accumulation of inactive time for all packets
Idle-time-data               The accumulation of inactive time for only data packets

Table 2: Burstiness and idle time features for flow analysis.
Feature                      Description
Burst-no                     Total number of bursts between flows for each session
Flows-no                     Total number of flows within all bursts for each session
Packets-no                   Total number of packets within all bursts for each session
Packets-data-no              Total number of data packets within all bursts for each session
Burst-size                   The total size of all bursts in bytes for each session
Average-burst-size           The average size of all bursts for each session
Burst-duration               The total time duration for all bursts
Burst-duration/burst-no      The ratio between burst duration and the total number of bursts for each session
Burst-idle-time              The total inactive time between flows for each session

Table 3: Conventional features proposed by previous studies.
Features                          Description
Packets                           Total number of packets
Data_packets                      Total number of data packets
Flags_packets                     Total number of TCP flag packets
First_packet                      The size of the first packet
Flow_duration                     The time of the last packet minus the time of the first packet
Inter_arrival_time                The time duration of each direction divided by the total number of packets
Packets_b/packets_a               A ratio of received packets to transmitted packets
Data_packets_b/data_packets_a     A ratio of received data packets to transmitted data packets
Flags_packets_b/Flags_packets_a   A ratio of received flag packets to transmitted flag packets

Figure 3: Proposed traffic classification methodology.

method and were divided into 24 samples per day; each sample represented one hour of traffic in pcap format, which is suitable as input to the tcptrace tool; this division reduces the size and processing time of each sample.

4.2. DNS Enquiries. The collected data were packet-based and contain DNS queries; the content of the DNS requests is used to identify applications. In each DNS request packet, there is a keyword that refers to the requested application. For instance, the keywords for BBC news and YouTube are bbc.co.uk and youtube.com, respectively. A Python script is used to read the DNS request from the user and tag the time at which the application was requested. Monitoring of the traffic showed that the three seconds after each request usually belong to the same application. Therefore, the IP addresses seen during this period were tagged with the same requested application. This process partitions the traffic into many applications considering the specific time stamp of each request, which is essential in the next stage.

4.3. Feature Extraction (Packet Analysis). The collected Internet traffic was analysed using the tcptrace tool [41]. The tool takes packets as input and outputs flows, grouping packets that share the same five-tuple (source IP address, source port number, destination IP address, destination port number, and protocol). The concept of burstiness and idle time was implemented in this tool to generate the desired features; the script was written inside the tool to extract the packet features. Moreover, features that were proposed by other studies were extracted with the same tool. In this stage, three types of flows were distinguished: firstly, all flows were extracted with packet-based features; secondly, some flows were tagged with the time and name of applications; and finally, most flows were tagged as unknown flows.

4.4. IP Address Matching. The uncontrolled data were analysed, as shown in the previous subsection, into flows that contained known flows based on reading the DNS requests plus the three seconds after the requests. Therefore, the matching process started with reading the known flows to determine the application name and afterwards fetching the specified file of IP addresses for that application. Secondly, the unknown flows were matched against the specified file until the end of the flow trace and tagged as known flows. Finally, the known flows were dumped in separate files and labelled according to the application name. Based on previous studies, the IP files are subject to continuous change by the owners of applications for security reasons. Therefore, updating these addresses is essential, but it must be automatic and take place during the identification process. The results are nine applications, with details in Table 4. As seen from the results, the application most frequently used by the users was Gmail, against very low usage for Yahoo mail.

Table 4: Overall results for classification of the observed data.
Application       Flows        Duration (h)   Number of samples
BBC news          3,150        1.6            6
Facebook          98,210       33.1           287
Google            59,422       88.5           892
Yahoo mail        6,795        0.8            9
YouTube           66,500       76.5           714
Gmail             1,448,392    143            870
Amazon            23,975       6.6            34
Plymouth.ac.uk    24,225       42.5           286
Bing              10,324       17.2           110

The application identification accuracy of the proposed features versus traditional ones was evaluated using three feature sets. The first set included features suggested by previous studies as introduced in Table 3; the second set contained the burstiness and idle time features proposed by this paper, as presented in Tables 1 and 2, while the third set included the full list of inputs from the other two. A cross-validation technique was used for training and testing a model with three folds, in a 2/1 ratio, respectively. Cross validation is a statistical method that divides data into equal folds, with one fold used for validating the model and the others used for training it. In each new round, a different fold is used for validating the model so that the training can be shown to be effective across different datasets. During the process, each fold will be used for validation once. This technique is used to evaluate the performance of a machine learning model by testing the model on unseen data to avoid overfitting and underfitting problems. Here, the data were partitioned into three equal parts, and the model was trained on two parts of the data and tested on the remaining part. The process was repeated three times, and the error was calculated by taking the average of all errors.

The classification algorithm was applied to all three feature sets with six different boosting values (0, 10, 15, 20, 50, and 100). Boosting refers to algorithms that apply weak classifiers to build a strong classifier by combining the results. The algorithm gives all records the same weight and applies a sequence of iterations of classification; the misclassified records increase their weight, while the weight of the correctly classified records is reduced. Finally, a strong classifier is created by incorporating the individual ones with the best tuning for the parameters to avoid overfitting. The results in Table 5 indicate lower accuracy for set1 than for set2, as the burstiness features increase the efficiency of the classifier in discriminating between the different applications. Combining both sets showed considerable improvement in classification accuracy, peaking at 79.68% at boost 10.

Table 5: Average accuracies with different feature sets using 3-fold cross validation.
Boosting   0       10      15      20      50      100
Set1       47.77   56.56   58.05   58.54   60.30   60.31
Set2       49.30   58.75   60.21   61.11   64.23   65.51
Set3       52.55   79.73   73.99   67.78   68.10   67.13

The proposed features show the ability to better discriminate between the applications in comparison with the other parameters, which enhances the classifier capability. Table 6 compares the basic and burstiness features among the attributes most used by the C5.0 classifier. The burstiness attributes reported maximum usage in segregating the applications. This is another indicator that the classifier strongly relied on the proposed features because they provide high discrimination between applications.

Table 6: Attribute usage in C5.0 classifier.
Basic features usage (75-100)%: data_packets[min, max], flow_duration[mean, min], flags_packets[mean, min, max], inter_arrival_time_data[sd, min]
Burstiness features usage (75-100)%: burst_size_bytes[md, min, mean], burst_no[sd, min], idle_time_data[mean, min, sd], pkt_data_count[min, mean], pkt_count[min, sd], inter_arrival_time_burst_conns[min, sd], inter_arrival_time_burst[mean, max], burst_size_bytes_data[max, min, mean], burst_duration[sd, mean], burst_data_no[min]

4.5. Confusion Matrix. The accuracy, as presented in the previous section, represents only the ratio of correctly classified samples versus all samples. In order to further analyse the performance of the classifier, Table 7 presents a confusion matrix for the observed data, which is presented in Table 4. Rows of the matrix represent predicted samples of each application and columns represent the actual samples. For example, there are 289 actual Gmail samples (the sum of the first column), of which 198 are predicted correctly, while the remaining samples are misclassified and count as false positives for other applications. Conversely, the off-diagonal entries in the Gmail row are samples of other applications that were predicted as Gmail, i.e., false positives for Gmail and false negatives for their actual applications.

Table 7: Confusion Matrix results for optimal classifier3.
Applications   Gmail   Ymail   Amazon   BBC   Bing   Facebook   Google   UoP site   YouTube
Gmail          198     0       0        0     3      6          14       5          13
Ymail          0       4       0        0     0      0          0        0          0
Amazon         0       0       9        0     0      0          0        0          0
BBC            0       0       0        2     0      0          0        0          0
Bing           4       0       0        0     20     2          16       0          7
Facebook       14      0       0        0     5      82         0        0          8
Google         32      0       2        0     2      2          247      0          12
UoP site       11      0       0        0     0      3          0        90         1
YouTube        30      0       0        0     4      0          20       0          198

The overall performance of the classifier is high for all applications except for Google applications (i.e., Gmail, YouTube, and Google search engine). Out of the total tested samples, it was observed that Yahoo mail, Amazon, and BBC news recorded optimal accuracy and that the University of Plymouth website recorded the lowest rate of false negatives. The reason for obtaining high classification accuracy for these applications could be attributed to the fact that they have unique behaviours, distinct from the others. On the contrary, Google applications (Gmail, YouTube, and Google) performed the worst in terms of classification, as they belong to the same company and they were misclassified as each other. Overall, the accuracy of all applications was satisfactorily high.

5. Conclusion

This study proposed a novel set of features based on interarrival times between packets and flows, most specifically burstiness and idle time, for application identification. From the experimental results, the proposed features outperform the traditional features that were proposed by previous studies; furthermore, higher accuracy is achieved when combining both proposed and traditional features.

constructing the truth table for application membership of flows relied on IP addresses and DNS. Unfortunately, due to the underlying CDN hosting of different applications, this classification led to inaccurate results. Moreover, the data traffic was collected at the University of Plymouth, from managed computers owned by the university, and included many web-based services that introduced noise in the collected data. On the contrary, comparing results with previous studies that reported high accuracy, most had classified traffic either according to network protocols such as FTP, IMAP, and HTTP or as per the application class such as e-mail, P2P, and streaming. Protocol, port number, or class-based traffic is generally easy to identify, and hence, the reported accuracy is also usually high. However, reviewing the literature, very few studies, such as [46], have classified traffic according to modern applications (i.e., Facebook and Google services). Moreover, these studies relied on the DPI method for labelling traffic, in which they used a supervised approach for traffic classification. DPI had been considered trustworthy by such studies [47, 48] until 2009, when a study in [49] claimed that DPI libraries are unreliable. Nowadays, current applications are web-based and almost always encrypted; therefore, the DPI
A method cannot cope with modern services as it is based on modified tcptrace tool was used to extract the new features, matching payload patterns, IP address, and port number [46]. and a C5.0 classifier was used to detect applications based Future work will primarily consider larger datasets with on real data collection. Overall, accuracy was more than different types of applications and more end users in order to 79%; however, some applications resulted in low accuracies fully investigate the performance of the proposed work. such as Google, Gmail, and YouTube as they belong to the Moreover, future work will also focus on recognizing new same owner. One of the limitations of this work was that applications that emerge over time by applying the proposed Journal of Computer Networks and Communications 9 method. Finally, a more accurate approach for labelling the Computer, Communication and Control (IMCCC), pp. 508– traffic should also be incorporated to ensure the robustness of 511, Harbin, China, December 2012. the method. [13] A. Boukhtouta, S. A. Mokhov, N.-E. Lakhdari, M. Debbabi, and J. Paquet, “Network malware classification comparison Data Availability using DPI and flow packet headers,” Journal of Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 69–100, -e data used to support the findings of this study are in- 2016. cluded within the article. [14] T. Bujlow, V. Carela-Español, and P. Barlet-Ros, “In- dependent comparison of popular DPI tools for traffic clas- Conflicts of Interest sification,” Computer Networks, vol. 76, pp. 75–89, 2015.[15] A. Bashir, C. Huang, B. Nandy, and N. Seddigh, “Classifying -e authors declare that they have no conflicts of interest. P2P activity in netflow records: a case study on BitTorrent,” inProceedings of the 2013 IEEE International Conference on Communications (ICC), pp. 3018–3023, Budapest, Hungary, References June 2013. [1] I. N. Bermudez, M. Mellia, M. M. Munafo, R. Keralapura, and [16] B. Park, Y. Won, J. 
Chung, M.-S. Kim, and J. W.-K. Hong, A. Nucci, “DNS to the rescue: discerning content and services “Fine-grained traffic classification based on functional sepa- in a tangled web,” in Proceedings of the 12th ACM SIGCOMM ration,” International Journal of Network Management, Conference on Internet Measurement, (IMC’12), pp. 413–426, vol. 23, no. 5, pp. 350–381, 2013. Vienna, Austria, November 2012. [17] A. Vlăduţu, D. Comăneci, and C. Dobre, “Internet traffic [2] A. Moore and K. Papagiannaki, “Toward the accurate iden- classification based on flows’ statistical properties with ma- tification of network applications,” in Proceedings of the chine learning,” International Journal of Network Manage- Passive and Active Measurement Workshop, Boston, MA, ment, vol. 27, no. 3, article e1929, 2017. USA, March 2005. [18] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, “Traffic [3] M. Finsterbusch, C. Richter, E. Rocha, J.-A. Muller, and classification through simple statistical fingerprinting,” ACM K. Hanssgen, “A survey of payload-based traffic classification SIGCOMM Computer Communication Review, vol. 37, no. 1, approaches,” IEEE Communications Surveys & Tutorials, p. 5, 2007. vol. 16, no. 2, pp. 1135–1156, 2014. [19] R. Alshammari and A. N. Zincir-Heywood, “How robust can a [4] S. Valenti, D. Rossi, A. Dainotti, A. Pescapè, A. Finamore, and machine learning approach be for classifying encrypted M. Mellia, “Reviewing traffic classification,” Data Traffic VoIP?,” Journal of Network and Systems Management, vol. 23, Monitoring and Analysis, vol. 7754, pp. 123–147, 2013. no. 4, pp. 830–869, 2015. [5] Y.Wang, Y. Xiang, J. Zhang,W. Zhou, G.Wei, and L. T. Yang, [20] A. Ulliac and B. V. Ghita, “Non-intrusive identification of “Internet traffic classification using constrained clustering,” peer-to-peer traffic,” in Proceedings of the 2010 Bird In- IEEE Transactions on Parallel and Distributed Systems, vol. 25, ternational Conference on Communication Beory, Reliability, no. 
11, pp. 2932–2943, 2014. and Quality of Service, pp. 175–183, Athens, Greece, June [6] A. Hajjar, J. Khalife, and J. Dı́az-Verdejo, “Network traffic 2010. application identification based on message size analysis,” [21] A. W. Moore, D. Zuev, A. W. Moore, and D. Zuev, “Internet Journal of Network and Computer Applications, vol. 58, traffic classification using bayesian analysis techniques,” ACM pp. 130–143, 2015. SIGMETRICS Performance Evaluation Review, vol. 33, no. 1, [7] T. Bujlow, T. Riaz, and J. M. Pedersen, “A method for p. 50, 2005. classification of network traffic based on C5.0 machine [22] T. Auld, A. W. Moore, and S. F. Gull, “Bayesian neural learning algorithm,” in Proceedings of the International networks for internet traffic classification,” IEEE Transactions Conference on Computing, Networking and Communications on Neural Networks, vol. 18, no. 1, pp. 223–239, 2007. (ICNC), pp. 237–241, Maui, HI, USA, March 2012. [23] A. Este, F. Gringoli, and L. Salgarelli, “Support vector ma- [8] H. Oudah, B. Ghita, and T. Bakhshi, “Network application chines for TCP traffic classification,” Computer Networks, detection using traffic burstiness,” in Proceedings of the World vol. 53, no. 14, pp. 2476–2490, 2009. Congress on Internet Security (WorldCIS-2017), Cambridge, [24] G. Y. Lazarou, J. Baca, V. S. Frost, and J. B. Evans, “Describing UK, December 2017. network traffic using the index of variability,” IEEE/ACM [9] H. Oudah, B. Ghita, and T. Bakhshi, “A novel features set for Transactions on Networking, vol. 17, no. 5, pp. 1672–1683, internet traffic classification using burstiness,” in Proceedings 2009. of the (ICISSP 5th International Conference on Information [25] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of- Systems Security and Privacy), pp. 397–404, Prague, Czech service mapping for QoS: a statistical signature-based ap- Republic, July 2019. proach to IP traffic classification,” in Proceedings of the 4th [10] N. Al Khater and R. E. 
Overill, “Network traffic classification ACM SIGCOMM Conference on Internet Measurement—IMC techniques and challenges,” in Proceedings of the 2015 Tenth ’04, pp. 135–148, Boston, MA, USA, July 2004. International Conference on Digital Information Management [26] S. Shakkottai, N. Brownlee, and K. C. Claffy, “A study of (ICDIM), pp. 43–48, Jeju, South Korea, October 2015. burstiness in TCP flows,” Lecture Notes in Computer Science, [11] M. S. Joe Touch, E. Lear, A. Mankin et al., Service Name and vol. 3431, pp. 13–26, 2005. Transport Protocol Port Number Registry, IANA, Playa Vista, [27] R. Liston, S. Srinivasan, and E. Zegura, “Diversity in DNS CA, USA, 2016, http://www.iana.org/assignments/service- performance measures,” in Proceedings of the Second ACM names-port-numbers/service-names-port-numbers.xhtml. SIGCOMM Workshop on Internet Measurment—IMW ’02, [12] R. Zou, T. Xu, and H. Hou, “An enhanced Netflow data vol. 19, Marseille, France, November 2002. collection system,” in Proceedings of the 2012 Second In- [28] J. Jung, E. Sit, H. Balakrishnan, and R. M. Morris, “DNS ternational Conference on Instrumentation, Measurement, performance and effectiveness of caching,” in Proceedings of 10 Journal of Computer Networks and Communications the First ACM SIGCOMM Workshop on Internet Measure- [43] R. Krzanowski, “Burst (of packets) and burstiness,” in Pro- ment Workshop—IMW ’01, vol. 10, no. 5, pp. 589–603, San ceedings of the 66th IETF Meeting, Montreal, Quebec, Canada, Francisco, CA, USA, November 2001. July 2006. [29] D. Wessels, “Is your caching resolver polluting the internet?,” [44] T. Bakhshi and B. Ghita, “User traffic profiling,” in Proeedings in Proceedings of the ACM SIGCOMM Workshop on Network of the 2015 Internet Technologies and Applications (ITA), Troubleshooting Research, Beory and Operations Practice pp. 91–97, Wrexham, UK, September 2015. Meet Malfunctioning Reality—NetT ’04, pp. 271–276, Port- [45] R. Hofstede, P. Celeda, B. 
Trammell et al., “Flow monitoring land, OR, USA, September 2004. explained: from packet capture to data analysis with NetFlow [30] D. Whyte, E. Kranakis, and P. Van Oorschot, “DNS-based and IPFIX,” IEEE Communications Surveys & Tutorials, detection of scanning worms in an enterprise network,” in vol. 16, no. 4, pp. 2037–2064, 2014. Proceedings of the 12th Annual Network And Distributed [46] Z. Aouini, A. Kortebi, Y. Ghamri-Doudane, and I. L. Cherif, System Security Symposium, vol. 1, pp. 1–17, San Diego, CA, “Early classification of residential networks traffic using C5.0 USA, January 2005. machine learning algorithm,” in Proceedings of the Wireless [31] D. Plonka and P. Barford, “Flexible traffic and host profiling Days (WD), pp. 46–53, Dubai, UAE, April 2018. via DNS rendezvous,” in Proceedings of the SATIN, Ted- [47] L. Bernaille, R. Teixeira, L. Bernaille et al., Early Recognition of dington, UK, April 2011. Encrypted Applications to Cite Bis Version, Springer, Berlin, [32] P. Foremski, C. Callegari, and M. Pagano, “DNS-Class: im- Germany, 2007. mediate classification of IP flows using DNS,” International [48] R. Alshammari and A. N. Zincir-Heywood, “Machine Journal of Network Management, vol. 24, no. 4, pp. 272–288, learning based encrypted traffic classification: identifying SSH 2014. and Skype,” in Proceedings of the 2009 IEEE Symposium on [33] N. F. Huang, C. C. Li, C. H. Li, C. C. Chen, C. H. Chen, and Computational Intelligence for Security and Defense I. H. Hsu, “Application identification system for SDN QoS Applications, Ottawa, Canada, July 2009.[49] G. Maier, A. Feldmann, V. Paxson, and M. Allman, “On based on machine learning and DNS responses,” in Pro- dominant characteristics of residential broadband internet ceedings of the 2017 19th Asia-Pacific Network Operations and traffic,” in Proceedings of the 9th ACM SIGCOMM Confer- Management Symposium (APNOMS), pp. 407–410, Seoul, ence on Internet Measurement Conference—IMC ’09, p. 
90, South Korea, September 2017. Chicago, IL, USA, November 2009. [34] G. Mamidisetti and G. T. Varma, “Performance issues of internet protocol versions,” International Journal of Soft Computing and Engineering, vol. 3, no. 6, pp. 30–32, 2014. [35] M. Janbeglou, H. Naderi, and N. Brownlee, “Effectiveness of DNS-based security approaches in large-scale networks,” in Proceedings of the 28th International Conference on Advanced Information Networking and Applications Workshops, pp. 524–529, Victoria, Canada, May 2014. [36] M. Trevisan, I. Drago, M. Mellia, and M. M. Munafo, “To- wards web service classification using addresses and DNS,” in International Wireless Communications and Mobile Com- puting Conference (IWCMC), pp. 38–43, Paphos, Cyprus, September 2016. [37] L. M. Torres, E. Magana, M. Izal, and D. Morato, “A popu- larity-aware method for discovering server IP addresses re- lated to websites,” in Proceedings of the Global Information Infrastructure Symposium—GIIS 2013, Trento, Italy, October 2013. [38] T. Bakhshi and B. Ghita, “Traffic profiling: evaluating stability in multi-device user environments,” in Proceedings of the 30th International Conference on Advanced Information Net- working and Applications Workshops (WAINA), pp. 731–736, Crans-Montana, Switzerland, May 2016. [39] A. N. Mahmood, C. Leckie, and P. Udaya, “An efficient clustering scheme to exploit hierarchical data in network traffic analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 6, pp. 752–767, 2008. [40] D. Tammaro, S. Valenti, D. Rossi, and A. Pescapé, “Exploiting packet-sampling measurements for traffic characterization and classification,” International Journal of Network Man- agement, vol. 22, no. 6, pp. 451–476, 2012. [41] S. Ostermann, “tcptrace—Official Homepage,” 2016, http:// www.tcptrace.org/. [42] B. V. Ghita, S. M. Furnell, B. M. Lines, and E. C. Ifeachor, “Endpoint study of internet paths and web pages transfers,” Campus-Wide Information Systems, vol. 
20, no. 3, pp. 90–97, 2003. International Journal of Rotating Advances in Machinery Multimedia En Jougrnail onf eering The Scientific Journal ofWorld Journal Sensors Hindawi Hindawi Publishing Corporation Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 hwtwtpw:/./hwinwdwaw.hii.ncodmawi.com Volume 20183 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Journal of Control Science and Engineering Advances in Civil Engineering Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Submit your manuscripts at www.hindawi.com Journal of Journal of Electrical and Computer Robotics Engineering Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 VLSI Design Advances in OptoElectronics International Journal of Modelling & International Journal of Simulation Aerospace Navigation and Observation in Engineering Engineering Hindawi Hindawi Hindawi Hindawi Volume 2018 Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com www.hindawi.com Volume 2018 International Journal of International Journal of Antennas and Active and Passive Advances in Chemical Engineering Propagation Electronic Components Shock and Vibration Acoustics and Vibration Hindawi Hindawi Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018
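As a reading aid for the confusion matrix discussed in Section 4.5, the following minimal Python sketch recomputes per-application totals, recall, and precision from the values printed in Table 7, using the paper's convention that rows are predicted applications and columns are actual applications. The function and variable names are illustrative assumptions and are not part of the authors' toolchain, which is based on tcptrace and C5.0.

```python
# Values copied from Table 7. Convention used in the paper:
# rows = predicted application, columns = actual application.
APPS = ["Gmail", "Ymail", "Amazon", "BBC", "Bing",
        "Facebook", "Google", "UoP site", "YouTube"]

MATRIX = [
    [198, 0, 0, 0, 3, 6, 14, 5, 13],   # predicted Gmail
    [0, 4, 0, 0, 0, 0, 0, 0, 0],       # predicted Ymail
    [0, 0, 9, 0, 0, 0, 0, 0, 0],       # predicted Amazon
    [0, 0, 0, 2, 0, 0, 0, 0, 0],       # predicted BBC
    [4, 0, 0, 0, 20, 2, 16, 0, 7],     # predicted Bing
    [14, 0, 0, 0, 5, 82, 0, 0, 8],     # predicted Facebook
    [32, 0, 2, 0, 2, 2, 247, 0, 12],   # predicted Google
    [11, 0, 0, 0, 0, 3, 0, 90, 1],     # predicted UoP site
    [30, 0, 0, 0, 4, 0, 20, 0, 198],   # predicted YouTube
]


def per_class_metrics(matrix, labels):
    """Return actual/predicted totals, recall and precision per label."""
    size = len(labels)
    metrics = {}
    for i, name in enumerate(labels):
        correct = matrix[i][i]                              # diagonal entry
        predicted = sum(matrix[i])                          # row total
        actual = sum(matrix[r][i] for r in range(size))     # column total
        metrics[name] = {
            "actual": actual,
            "predicted": predicted,
            "recall": correct / actual if actual else 0.0,
            "precision": correct / predicted if predicted else 0.0,
        }
    return metrics


def overall_accuracy(matrix):
    """Ratio of correctly classified samples (diagonal) to all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for app, m in per_class_metrics(MATRIX, APPS).items():
        print(f"{app:9s} actual={m['actual']:4d} "
              f"recall={m['recall']:.3f} precision={m['precision']:.3f}")
    print(f"overall accuracy = {overall_accuracy(MATRIX):.4f}")
```

With these values, the Gmail column total is 289, matching the figure quoted in Section 4.5, and the diagonal sums to 850 of 1066 samples, i.e., an overall accuracy of about 79.7%, in line with the Set3 result at boost 10 in Table 5.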