Large Language Model (LLM) assisted End-to-End Network Health Management based on Multi-Scale Semanticization (2024)

Fengxiao Tang
Central South University
tangfengxiao@csu.edu.cn
  Xiaonan Wang
Xinjiang University
107552304984@stu.xju.edu.cn
  Xun Yuan
Central South University
yuan.xun@csu.edu.cn
  Linfeng Luo
Central South University
luolinfeng@csu.edu.cn
  Ming Zhao
Central South University
meanzhao@csu.edu.cn
  Nei Kato
Tohoku University
kato@it.is.tohoku.ac.jp

Abstract

Network device and system health management is the foundation of modern network operations and maintenance. Traditional health management methods, relying on expert identification or simple rule-based algorithms, struggle to cope with dynamic heterogeneous network (DHN) environments. Moreover, current state-of-the-art distributed anomaly detection methods, which utilize specific machine learning techniques, lack multi-scale adaptivity for heterogeneous device information, resulting in unsatisfactory diagnostic accuracy for DHNs. In this paper, we develop an LLM-assisted end-to-end intelligent network health management framework. The framework first employs a proposed Multi-Scale Semanticized Anomaly Detection Model (MSADM), which incorporates semantic rule trees with an attention mechanism, to address the multi-scale anomaly detection problem in DHNs. Secondly, a chain-of-thought-based large language model is embedded downstream to adaptively analyze the fault detection results and produce an analysis report with detailed fault information and optimization strategies. Experimental results show that the accuracy of our proposed MSADM for heterogeneous network entity anomaly detection reaches 91.31%.

1 Introduction

With the development of communication technology and unmanned control technology towards B5G/6G, dynamic heterogeneous networks (DHNs)[36] play an increasingly important role in many key areas such as emergency communication, transportation, and military administration[11]. As shown in Fig.1, DHNs consist of various types of communication devices such as base stations, drones, and mobile phones, which, having been deployed in harsh and dynamically changing environments for long periods[30], are prone to various anomalies and faults[33]. Therefore, to enhance the availability and reliability of DHNs, it is essential to perform timely health management to detect network anomalies and diagnose network faults[8].

Modern health management is a comprehensive analysis technique that not only presents and visualizes anomalous data but also uncovers the fault types and root causes behind the abnormal data across the whole network, so that a series of decisions can be made to mitigate the problem[9].

A typical health management life cycle includes at least three phases: (1) Anomaly Detection[19]: a monitor performs anomaly detection on multivariate time series data (e.g., packet loss, byte errors). (2) Fault Detection[17]: network managers (NMs) assess various aspects of the event and engage in several rounds of communication to pinpoint the cause of the anomaly. (3) Mitigation[1]: the NMs implement several actions to mitigate the incident and restore the health of the communication service. The accuracy of anomaly detection and fault detection is the foundation of the health management life cycle. However, the increasing variety and dynamicity of DHNs pose two key challenges for their health management[20]: (1) how to accurately infer faults from local information when global information is difficult to obtain in real time; and (2) how to accurately locate faults in heterogeneous devices that differ in information scale and fault mechanisms.


Traditional Bayesian health management methods are widely used in network fault detection; they establish connections between network anomalies and their root causes for performance diagnosis[2]. However, Bayesian methods rely on directed acyclic graphs that lack scalability, making them unsuitable for DHNs. Meanwhile, frequent changes in topology make it difficult for traditional distributed anomaly detection algorithms to detect local or minor anomalies in DHNs[13].


Recently, machine learning-based health management methods have been widely researched and are recognized as state-of-the-art algorithms for network fault detection[23, 15, 32, 35, 5]. However, these machine learning-based algorithms either rely on global network information or ignore the non-uniform Key Performance Indicators (KPIs) and state information of heterogeneous nodes. Moreover, these diagnostic algorithms do not cover the complete health management life cycle and still rely on NMs to perform manual troubleshooting to mitigate anomalies after detection, which not only fails to utilize anomaly data efficiently but also significantly increases the time and complexity of anomaly handling.

To address the above problems, we develop an LLM-assisted end-to-end intelligent network health management framework. In the framework, we first propose a Multi-Scale Semanticized Anomaly Detection Model (MSADM) to deal with the non-uniform KPI and state information problem, and then integrate an LLM to perform full-life-cycle, end-to-end health management.

Unlike existing models that can only handle specific faults of specific devices, MSADM combines multi-scale semantic rule trees with a Transformer to unify and standardize abnormal text reports according to the abnormality degrees of various nodes. Thus, MSADM can be deployed on heterogeneous entities to automatically identify abnormal communication entities and generate unified, standardized expressions of abnormal information.

As shown in Fig.1, to perform end-to-end health management, we integrate an LLM into the health management framework to cover the full life cycle and employ MSADM as the facilitating agent for the LLM. This strategic integration facilitates the collection and initial processing of abnormal data, thereby effectively preventing diagnostic errors caused by inconsistent data representations. This preliminary processing also significantly reduces the computational demands on the LLM. As shown in Fig.2, the effectiveness of this approach is evident in the detailed diagnostic results generated by the LLM. These results succinctly outline the abnormal status and potential causes for each network entity, underscoring the robust capability of our proposed health management scheme. The main contributions of this paper are summarized as follows:

  • We propose an end-to-end health management framework for DHNs. This framework manages network health through only local and neighboring information and covers the full stages of the health management life cycle, including anomaly detection, fault detection, and mitigation.

  • We propose a Multi-Scale Semanticized Anomaly Detection Model (MSADM) to deal with the non-uniform KPI and state information problem. This model standardizes abnormal information from various DHN devices, addressing the inefficiencies inherent in traditional distributed anomaly detection information sharing.

  • We incorporate an LLM into the network health management process to cover the full life cycle of end-to-end health management. By employing chain-of-thought prompting, the LLM not only analyzes abnormal situations but also offers mitigation solutions.

2 Background and Motivation

In this section, we first review the current research status of anomaly detection models. We then identify the shortcomings of existing methods for DHN health management. Finally, we explore the potential benefits of integrating semanticization into the health management process of wireless heterogeneous networks.

2.1 Related Work

The traditional anomaly detection algorithm detects anomalies by monitoring wireless measurement data and comparing it with established norms [29]. However, this approach overly depends on expert annotations and proves both time-consuming and labor-intensive. Concurrently, researchers also attempt to validate their findings using both simulated and actual data. Yet, these studies typically rely on a single KPI, such as the call drop rate, to classify anomalies, thereby constraining diagnostic precision to a degree [14]. The Bayesian-based classification method, extensively explored in [2][3], uses probability and graph theories to correlate network anomalies with their root causes. Despite its widespread application, the efficacy of this method significantly hinges on a substantial corpus of historical anomaly data since the causal graphs it generates demand extensive prior knowledge. Moreover, the Bayesian approach faces challenges in scalability and adaptability, struggling to perform well in dynamic, heterogeneous wireless network environments.

Machine learning, recognized as a powerful analytical tool, can effectively mine and perceive potential information in data and sharply detect subtle changes in network status and KPIs, thus enabling faster and more precise network anomaly detection[23]. Researchers propose a diagnostic method based on a supervised genetic fuzzy algorithm[15]. This method employs a genetic algorithm to learn a fuzzy rule base from a combined dataset of simulated and real data containing 72 records. Its accuracy heavily relies on the labeled training set. The Deep Transformer-based temporal anomaly detection model, TranAD[32], incorporates an attention sequence encoder and leverages broader temporal trend knowledge to swiftly conduct anomaly detection. DCdetector[35] masters the representation of abnormal samples using a dual attention mechanism and contrastive learning. While machine learning methods have advanced in feature learning and enhanced their generalization capabilities, they face challenges in wireless networks. Abnormalities are sporadic, and scarce abnormal samples make the models prone to overfitting. Moreover, modeling only the entire network fails to adapt to dynamic DHNs.

Although research on distributed anomaly detection solutions is extensive[5], practical applications suffer from inconsistent network entity feature representation, which weakens detection capability[25]. Moreover, using machine learning to model each device individually is both time-consuming and labor-intensive, and such models struggle to capture the interaction information of communication devices. In addition, existing distributed fault detection methods often treat abnormal situations as a whole, neglecting the specific abnormal representation of individual communication entities and thereby complicating the rapid detection of abnormal nodes by NMs.


2.2 Problem Statement and Our Objectives

Within DHNs, the diverse range of communication devices poses challenges for domain experts in gathering data encompassing all device anomaly types for model training. Furthermore, these models typically lack autonomous learning capabilities. Consequently, the emergence of new communication devices or technologies within the network often detracts from the detection efficacy of the model, leading to performance degradation.

In addition to the aforementioned shortcomings, existing anomaly detection research often emphasizes enhancing detection accuracy or model interpretability, while coverage of the entire health management life cycle is seldom taken into account. For anomalies detected by a model, the prevalent approach involves NMs extracting information and experience from satisfactorily resolved and archived cases (i.e., marked cases) to alleviate the anomalies[24]. Undoubtedly, this significantly diminishes the efficiency of anomaly mitigation.

We incorporate LLM into the health management life cycle, leveraging its reasoning capabilities to identify the root causes of abnormal situations, thereby furnishing NMs with end-to-end anomaly resolutions. Moreover, LLM’s learning capability enables rapid adaptation to new abnormal information from communication entities. To facilitate LLM in gathering anomaly information, we devised MSADM, deployed on communication entities to execute anomaly detection and information collection. Given the distributed deployment of MSADM, our scheme offers entity-level visibility, contrasting with prior distributed anomaly detection models. In the subsequent section, we will elaborate on our solution scheme in detail.

3 System Architecture

We introduce an end-to-end health management scheme for DHNs. Fig.3 displays the architecture of this scheme. An essential component of our solution is processing time-series data from various devices through a rule base to generate a status list with a uniform scale. We elaborate on the creation and use of the rule base in Section 3.1. Once the status list with unified scales is established, MSADM can pinpoint anomalies using a built-in rule-enhanced Transformer time-series classification model (Section 3.2) and create anomaly descriptions by integrating semantic rule trees (Section 3.3). Additionally, we have developed a statement processing structure equipped with prompts to support the LLM in analyzing these anomaly descriptions. This structure aids the LLM in identifying the causes of anomalies and devising mitigation strategies. The LLM's output acts as the anomaly report for the network, which NMs use to swiftly address anomalies and ensure network health (Section 3.4). Below, we introduce each part of our scheme in detail.

3.1 Construction of Rule Base

In this section, we use the packet loss rate (PLR) as an example to illustrate the shortcomings of existing distributed approaches. We compute a normal-distribution interval for the average PLR over a period $T$ across all devices. Next, we place the average value of each device into this interval; the resulting distribution appears in Fig.4. The distribution of PLR varies significantly across devices, and if such a dataset were used directly for model training, the model would struggle to adapt to this multi-scale anomalous behavior. Fig.5 shows the change in anomaly detection accuracy for different devices before and after using the rule base. Next, we describe in detail how the rule base is designed and used.

We analyze the KPIs[16] common to multiple devices within the simulated network and construct the rule base accordingly. A comprehensive list of KPI types and contents is detailed in Appendix A. For each device type, we analyze the collected data to ascertain the distribution of each KPI across various dimensions. Subsequently, we compare the actual KPI changes of these devices against their respective distributions to pinpoint anomalous statuses.

We represent the network background information within $T$ under normal conditions as $\mathcal{N}_{normal} = (N_f, E_f, T)$. $N_f$ denotes the attributes of the node itself, expressed as $N_f = \{f_{N1}, f_{N2}, \ldots, f_{Nn}\}$, while $E_f$ represents the attributes of the communication link and, similarly, is given by $E_f = \{f_{E1}, f_{E2}, \ldots, f_{En}\}$. $T$ indicates the period over which network information is recorded.

We collected a substantial number of $\mathcal{N}_{normal}$ records for homogeneous entities to enhance our analysis. For each KPI, we calculate its average value (Avg, $F_a$), fluctuation value (Jitter, $F_j$), variance (Variance, $F_v$), and trend (Trend, $F_t$). The average represents the center of the dataset and aids in understanding the general performance level. The fluctuation value represents the dispersion of values in the dataset, calculated as the average of the differences between adjacent data points. Variance, the average of the squared differences of each data point from the mean, measures the extent to which individual data points deviate from the mean. The trend describes how the data changes over time.
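As a concrete illustration, the first three evaluation dimensions can be computed directly from the sampled KPI values within $T$. The following is a minimal Python sketch of that computation, assuming evenly sampled values and interpreting the fluctuation value as the mean absolute difference of adjacent points; the function name is illustrative rather than taken from MSADM.

```python
import numpy as np

def kpi_dimensions(values):
    """Compute Avg (F_a), Jitter (F_j), and Variance (F_v) for one KPI
    series collected within a window T. The Trend (F_t) dimension is
    derived separately via extreme-point counting (see Formulas (1)-(3))."""
    v = np.asarray(values, dtype=float)
    avg = v.mean()                      # central level of the KPI
    jitter = np.abs(np.diff(v)).mean()  # average change between adjacent samples
    variance = v.var()                  # squared deviation from the mean
    return avg, jitter, variance
```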


We can readily compute the numerical distribution of the first three dimensions, thereby obtaining a set of intervals $Dist$ that characterizes the abnormality of the performance indicator. According to the distribution, an interval closer to the peak indicates that the dimensional data aligns more closely with normal data and should be considered more normal. Since the trend falls into categories such as rise, fall, and fluctuation, its calculation is different: we assess the instantaneous performance and overall trend of the network based on the number of extreme points obtained. The data within $T$ is subdivided into $n$ small periods $t$. By taking the average value within each $t$, the continuous time data is converted into discrete values $v = \{v_1, v_2, \ldots, v_n\}$.

To mitigate noise interference and facilitate smoother data processing, we introduce a threshold $h$ during the identification of maxima and minima. If a value and its adjacent value differ by no more than $h$, we do not classify it as an extreme point. The presence of multiple maxima and minima signifies a fluctuating trend. Conversely, a single minimum suggests a sudden drop, whereas a single maximum indicates a sudden rise. The threshold is derived from the distribution of fluctuation values among the $n$ discrete data points under normal conditions. This methodology enables us to ascertain the trend status of each performance indicator.

We apply Formula (1) to determine the number of maxima and minima in this set of discrete data, taking the trend of the PLR as an illustrative example. The formula is expressed as follows:

$$N_{extrema} = \sum_{i=2}^{n-1}\big(\phi_{max}(i) + \phi_{min}(i)\big), \qquad (1)$$

where the formula for determining the extreme point is as follows:

$$\phi_{max}(i) = \begin{cases} 1, & (v_i > v_{i-1}) \land (v_i > v_{i+1}) \land \big(\min(|v_i - v_{i-1}|,\, |v_i - v_{i+1}|) > h\big) \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

$$\phi_{min}(i) = \begin{cases} 2, & (v_i < v_{i-1}) \land (v_i < v_{i+1}) \land \big(\min(|v_i - v_{i-1}|,\, |v_i - v_{i+1}|) > h\big) \\ 0, & \text{otherwise}. \end{cases} \qquad (3)$$

As demonstrated in Formula (2), when a point $v_i$ is larger than both of its neighboring points and the absolute differences to both neighbors exceed $h$, we classify the point as a maximum. The determination of minima is shown in Formula (3).
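To make the trend rule concrete, the following is a minimal Python sketch of the threshold-filtered extreme-point counting of Formulas (1)-(3) and the trend labelling described above. The "stable" label for series without qualifying extrema is our assumption, since the text only names fluctuation, sudden drop, and sudden rise.

```python
def count_extrema(v, h):
    """Count maxima and minima in the discretized series v, keeping only
    points that differ from both neighbours by more than the threshold h
    (Formulas (1)-(3))."""
    n_max = n_min = 0
    for i in range(1, len(v) - 1):
        gap = min(abs(v[i] - v[i - 1]), abs(v[i] - v[i + 1]))
        if gap <= h:
            continue  # too close to its neighbours: treated as noise
        if v[i] > v[i - 1] and v[i] > v[i + 1]:
            n_max += 1
        elif v[i] < v[i - 1] and v[i] < v[i + 1]:
            n_min += 1
    return n_max, n_min

def classify_trend(v, h):
    """Map extreme-point counts to the trend categories used in the rule base."""
    n_max, n_min = count_extrema(v, h)
    if n_max + n_min > 1:
        return "fluctuation"
    if n_min == 1:
        return "sudden drop"
    if n_max == 1:
        return "sudden rise"
    return "stable"  # assumption: no qualifying extrema
```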

Algorithm 1 outlines the procedure for computing the four evaluation dimensions from our rule base and obtaining the KPI status list:

Algorithm 1: Obtaining the KPI status list

1: Input: performance indicator data list within time T: data;
2:        range of the indicator configuration file list: intervals;
3:        time T;
4: avg, jitter, variance ← getAttributeRate();
5: for i ← 0 to len(intervals) − 1 do
6:     if avg is in intervals[i] then
7:         status ← i;
8:         break;
9:     end if
10: end for
11: trend ← getTrend();
12: status.add(trend);
13: return status[4]
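For readers who prefer code, below is one way Algorithm 1 could be realized in Python. It reuses the kpi_dimensions and classify_trend sketches above and assumes the pseudocode's interval lookup (shown there only for Avg) is applied to each numeric dimension so that a four-entry status list results; the helper names are illustrative, not MSADM's actual API.

```python
def kpi_status(data, intervals, h):
    """Return the status list [avg_status, jitter_status, variance_status, trend]
    for one KPI, in the spirit of Algorithm 1.

    `intervals` is assumed to hold, per dimension, a list of (low, high) ranges
    ordered from most normal to most abnormal."""
    status = []
    for value, bins in zip(kpi_dimensions(data), intervals):
        idx = next((i for i, (lo, hi) in enumerate(bins) if lo <= value < hi),
                   len(bins))  # values outside every range get the worst status
        status.append(idx)
    status.append(classify_trend(list(data), h))
    return status
```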


We also explored the possibility of using a machine learning-based classification model to categorize data trends. However, if new features or wireless access technologies emerge in the future and affect the KPI performance data, the dataset would need to be recollected and relabeled to retrain such a model. In contrast, with the rule-based method, we only need to gather sufficient data and update the thresholds using the built-in script to refresh the rule base. Therefore, the rule base offers superior scalability and adaptability.

3.2 Anomaly Information Learning and Detection

We designed an anomaly detection architecture for KPI time-series data in MSADM. Fig.6 illustrates the structure of the anomaly detection model. In this framework, the time-series data first passes through a convolutional layer that captures time-series features within a specific segment, followed by a two-layer Transformer that fully perceives changes in the KPIs. To enhance the model's robustness, we embed the rule-filtered status list before the data enters the fully connected layers. Because our goal is for MSADM to recognize the anomaly type while performing anomaly detection, a four-layer fully connected network is employed: the first two layers sense the data associations, while the latter two layers handle the detection and classification tasks. The remainder of this section details the specific model design.

For anomaly detection tasks, certain sequence segments often carry more anomaly-related features. Convolutional Neural Networks (CNNs) improve classification accuracy by extracting local features from time series[10]. However, the order of elements and their interdependencies are essential for time series analysis. While CNNs excel at focusing on local features, their capability to model global dependencies is comparatively limited[22]. In time series classification tasks that require a global perspective, this limitation may reduce model accuracy.

The Transformer, via its self-attention mechanism, can process sequences of any length[21]. This feature efficiently captures global dependencies within sequences, effectively overcoming CNN’s limitations in global modeling.

After applying the rule-embedded Transformer, we obtain the attention output $a$. We incorporate the KPI status list obtained through rule filtering into the model's learning dimensions; this status list helps the model better distinguish between abnormal and normal situations. Therefore, before feeding data into the fully connected layers (FCL), we utilize the linear transformation function $f_1$ to combine the status representation $s$ with the attention output $a$. The interactive representation $I_{sa}$ of the KPI statuses with the attention output can be denoted as:

$$I_{sa} = f_1(W_1[s,\, a] + b), \qquad (4)$$

where $W_1$ and $b$ are trainable parameters, and $f_1$ is the activation function, for which we use ReLU.

The fully connected layers gradually transform the extracted features into classification probabilities that identify anomalies. The model goes beyond merely outputting these probabilities; it also specifies the anomaly type of the abnormal entity. Consequently, we split the final fully connected layer to obtain both the anomaly detection result and the anomaly type through distinct linear heads.

During training, given the dual tasks of classification and detection, we formulate the overall loss function as the sum of two cross-entropy losses. The loss (5) is as follows:

$$\text{loss} = -\sum_{i}^{n} y_{ci}\log(p_{ci}) - \sum_{i}^{n} y_{di}\log(p_{di}), \qquad (5)$$

where the log is applied to softmax outputs, $y_{ci}$ and $y_{di}$ are the ground-truth values, $p_{ci}$ and $p_{di}$ are the predicted values, and $n$ is the size of the output.
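The following PyTorch sketch shows one way the rule-enhanced detector described above could be wired together: a convolutional front end, a two-layer Transformer encoder, fusion of the status list with the attention output as in Eq. (4), split heads for detection and fault classification, and the summed cross-entropy loss of Eq. (5). The layer widths, number of attention heads, and mean pooling over time are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class MSADMDetector(nn.Module):
    """Conv front end + two-layer Transformer encoder + status fusion (Eq. 4)
    + split heads for anomaly detection and fault classification (Eq. 5)."""
    def __init__(self, n_kpis, n_status, n_fault_types, d_model=64):
        super().__init__()
        self.conv = nn.Conv1d(n_kpis, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(d_model + n_status, d_model)   # W1[s, a] + b
        self.shared = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model), nn.ReLU())
        self.det_head = nn.Linear(d_model, 2)                 # normal / abnormal
        self.cls_head = nn.Linear(d_model, n_fault_types)     # anomaly type

    def forward(self, x, status):
        # x: (batch, time, n_kpis); status: (batch, n_status), float-encoded
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)      # local temporal features
        a = self.encoder(h).mean(dim=1)                       # global attention output a
        i_sa = torch.relu(self.fuse(torch.cat([status, a], dim=-1)))  # Eq. (4)
        z = self.shared(i_sa)
        return self.det_head(z), self.cls_head(z)

def msadm_loss(det_logits, cls_logits, y_det, y_cls):
    """Sum of two cross-entropy terms, mirroring Eq. (5)."""
    ce = nn.CrossEntropyLoss()
    return ce(det_logits, y_det) + ce(cls_logits, y_cls)
```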


3.3 Semantic Rule Tree Structure


In Section 3.1, we obtained a list of statuses $S$ for the KPIs of anomalous network entities, filtered according to predefined rules. Utilizing these status lists, MSADM generates detailed anomaly reports for anomalous network entities via a semantic rule tree.

We explored logical semantics, distributed semantics, and hybrid semantics for NLG models, as well as a Knowledge Graph-based replication mechanism for sentence generation[4][18]. These models require large amounts of high-quality textual training data. However, since our method generates sentences from a status list, training such models becomes highly inefficient once a significant number of events accumulate, and the generated utterances are slow to produce and filled with superfluous information. Moreover, the dataset would need to be expanded and the model retrained whenever a new description of an anomaly manifestation arises.

Our goal is to generate timely, accurate, and concise sentences. Therefore, after careful consideration, we opted for a template-like approach to sentence generation. Given the limited variety of statuses in the status list, we select words that correspond to the possible statuses of each KPI evaluation dimension. Unlike traditional template-based approaches, we use a tree structure with a one-to-many configuration that effectively captures the abnormal statuses of KPIs under various evaluation metrics. This structure is not only flexible and extensible but also facilitates the future integration of new evaluation metrics and statuses. We employ this tree structure to generate sentences for each KPI, which are then compiled into a comprehensive anomaly report.

As shown in Fig.7, we maintain a vocabulary describing KPI performance metrics and status levels, together with a lexicalized tree adjoining grammar (LTAG) representing the lexicality of words. MSADM can take the evaluation dimensions of an arbitrary KPI as the root, connect syntactic trees to form the syntactic part of a sentence, and construct a sentence tree by positioning fixed vocabulary at the leaves. Meanwhile, to further speed up sentence generation, we prune the vocabulary and LTAGs beforehand, keeping only the words related to the current KPI.

The specific build process is as follows: MSADM traverses the sentence tree starting from the root, organized by KPI type with a list of evaluated dimensions and statuses. Each traversal from the root to the leaves yields a semanticized description corresponding to the current KPI statuses. Considering that the actual KPI data may be more precise than the status description, we incorporate a judgment step into the sentence generation process: when a KPI exhibits significant abnormalities, we add its actual values, such as the mean, variance, and jitter within the timeframe $T$, to enrich the information content of the sentence. The process is shown in Algorithm 2.

Algorithm 2: Sentence tree construction

1: Input: words WList; grammars GRs; KPIs K;
2: R ← pruneGrammar(GRs);
3: Ws ← pruneWList(WList, R);
4: sentenceT ← Tree();   /* init sentence tree */
5: for each k in K do
6:     for each w in Ws do
7:         isLexical, index ← lexicalRequirements(w, R);
8:         if isLexical == True then
9:             node ← Tree(w, R[index]);
10:            sentenceT.append(node);
11:        end if
12:    end for
13: end for
14: return sentenceT;
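As a rough illustration of this template-style generation, the sketch below builds one sentence per KPI from its four-entry status list and appends raw statistics only when the status is severe. The phrase table and severity threshold are placeholders of our own; the real system draws words from the pruned vocabulary and LTAGs rather than fixed strings.

```python
# Placeholder phrase table; MSADM selects words from its pruned vocabulary/LTAGs.
STATUS_WORDS = ["normal", "slightly abnormal", "abnormal", "severely abnormal"]

def describe_kpi(kpi_name, status, raw_stats, severe_level=2):
    """Build one sentence from a KPI's status list [avg, jitter, variance, trend],
    enriching it with actual values when the abnormality is significant."""
    avg_s, jit_s, var_s, trend = status
    parts = [
        f"the average of {kpi_name} is {STATUS_WORDS[min(avg_s, 3)]}",
        f"its jitter is {STATUS_WORDS[min(jit_s, 3)]}",
        f"its variance is {STATUS_WORDS[min(var_s, 3)]}",
        f"and it shows a {trend} trend",
    ]
    sentence = "Within this period, " + ", ".join(parts) + "."
    if max(avg_s, jit_s, var_s) >= severe_level:  # significant abnormality
        sentence += (f" (avg={raw_stats['avg']:.3f}, "
                     f"jitter={raw_stats['jitter']:.3f}, "
                     f"variance={raw_stats['variance']:.3f})")
    return sentence
```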

After compiling all abnormal sentence expressions from a node and considering the input constraints of the LLM, we strike a balance between the simplicity of the report and the completeness of the information. We then assess the need to further refine the entity information collected in the sentences based on the report’s length and the severity of the KPIs anomalies. We use regular expressions to optimize the report content while ensuring that essential and critical anomaly information is retained.

Table 1: Comparison of MSADM with baseline models.

Model            | Classification Accuracy | Detection Accuracy | Recall | FNR   | FPR   | Detection Time/ms
SR-CNN           | 59.36                   | 87.88              | 94.48  | 5.52  | 52.78 | 2.69
CL-MPPCA         | 69.69                   | 86.56              | 89.41  | 10.61 | 30.92 | 19.05
AnomalyBERT      | 66.53                   | 86.78              | 95.75  | 4.25  | 68.48 | 13.15
LSTM-transformer | 72.02                   | 88.87              | 96.10  | 3.89  | 55.74 | 25.21
MSADM            | 76.73                   | 91.31              | 96.28  | 4.72  | 33.15 | 19.89

3.4 Information Integration

The LLM’s powerful natural language processing capabilities allow it to deeply understand semantic information and derive meaningful features and patterns[6]. Simultaneously, LLM’s continuous learning ability enables it to adapt and respond effectively to evolving event types, showcasing remarkable scalability and rapid adaptability in complex scenarios[28].

In the information integration phase, we compile the abnormal reports of communication entities within the DHN and generate tailored prompt text that the LLM can understand.

LLMs often struggle with complex and in-depth reasoning due to their reliance on patterns in data rather than true understanding, leading to difficulties in consistently generating accurate, contextually appropriate responses that require deep domain knowledge or logical consistency[34]. In our integration process, we have bootstrapped the LLM to assist in generating anomaly reports that better align with the requirements of NMs, based on the life cycle of health management.

The structure of the prompt is illustrated in Fig.8. We provide the model with context, a question, and options. The context enables the LLM to comprehend the network anomaly information. The question addresses the needs of NMs, specifically the types of abnormalities that may have occurred and the associated mitigation plans. The options constrain the LLM's inference results to the specified types of anomalies, thereby enhancing the accuracy of the inferences; naturally, the options also include an "other" category.

Given that large models face input length limitations, and that the anomaly context must encompass all relevant information of abnormal entities within the local network at the time of the anomaly, a requirement that significantly exceeds the input capacity of existing models, the anomaly context cannot be directly embedded in the prompt text. We therefore collate the collected contextual information on entity anomalies, use the abnormal statuses to pinpoint KPIs exhibiting significant abnormalities within network entities, and describe such KPIs in detail, whereas KPIs exhibiting minor abnormalities are summarized in a consolidated manner. Furthermore, we incorporate the anomaly detection results obtained in Section 3.2 into the report, providing the LLM with additional dimensions of information to focus on.
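A minimal sketch of how such a prompt could be assembled from the per-entity MSADM reports is given below. The option list and the wording of the question are placeholders for illustration only, not the exact prompt or fault taxonomy used in the paper.

```python
# Placeholder option list; the actual anomaly categories are given in Appendix C.
FAULT_OPTIONS = ["application crash", "malicious traffic",
                 "hardware failure", "other"]

def build_prompt(entity_reports, detected_types):
    """Assemble a context/question/options prompt (Fig. 8) from MSADM reports."""
    context = "\n".join(f"- {report}" for report in entity_reports)
    hints = ", ".join(detected_types)
    question = ("Based on the anomaly context above and the preliminary detection "
                f"results ({hints}), reason step by step about the most likely root "
                "cause for each abnormal entity and propose mitigation steps.")
    options = "Choose each root cause from: " + "; ".join(FAULT_OPTIONS) + "."
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nOptions:\n{options}"
```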

4 Experimentation


We implemented MSADM using Python 3.7 and Torch 1.13.1. Due to resource constraints, we used eight RTX 4090 GPUs with 24 GB of memory on Ubuntu 22.04 for data simulation, model training, and testing. We implemented the techniques and algorithms according to the system architecture (Fig.6).

We employ NS-3[27] for network simulation. We simulated four different communication entities by varying the transmit power, bandwidth, movement speed, and other configurations (see Appendix B for the device parameters), categorized network anomalies into six distinct categories, and injected these anomalies into the simulation (see Appendix C for the anomaly types). Based on these devices, we built a heterogeneous network and captured the resulting KPI changes. We accumulated nearly 20,000 labeled data entries across seven network scenarios. We release an open-source demo and dataset of MSADM to illustrate this workflow (https://github.com/SmallFlame/MSADM).

We will evaluate our scheme from two perspectives to demonstrate its effectiveness. Firstly, we will illustrate the superior accuracy and efficiency of MSADM in anomaly detection models. Secondly, we will present the anomaly report, along with the diagnostic results and scheme descriptions provided by LLM, to verify the feasibility of our approach.

4.1 MSADM Evaluations

We surveyed several popular time series classification models that utilize various technologies. CL-MPPCA employs both neural networks and probabilistic clustering to enhance anomaly detection performance[31]. SR-CNN integrates SR and CNN models to boost the accuracy of time series anomaly detection[26]. AnomalyBERT, built on the Transformer architecture, is designed to discern temporal contexts and identify unnatural sequences[12]. LSTM-transformer introduces a hybrid architecture combining LSTM and Transformer, tailored for multi-task real-time prediction[7]. We compare these models with the anomaly detection module of MSADM, training all models on the same equipment and conducting a comprehensive comparison.

Fig.9 depicts the evolution of classification accuracy, detection accuracy, and the cross-entropy loss over increasing iterations. Notably, our model consistently achieves the highest accuracy, ultimately converging to 91.3%, an approximately 3% lead over the runner-up model, LSTM-transformer. Additionally, the cross-entropy loss of our model is substantially lower than that of the other models upon final convergence.

In Table 1, we conduct a comparative analysis between MSADM and the other models in terms of fault classification accuracy, anomaly detection accuracy, recall, false negative rate (FNR), false positive rate (FPR), and detection time. The results show that MSADM surpasses the other models on most performance indicators. It is worth mentioning that MSADM's detection time is marginally longer than that of the LSTM-based method without rule embedding, which is attributable to the initial rule-filtering step.

ROC curves plot the true positive rate (TPR) against the false positive rate (FPR) under different threshold settings[2]. To compare the robustness and reliability of the models, we plotted their ROC curves. As shown in Fig.10, the ROC curve of MSADM lies above those of the other models most of the time, and the AUC of MSADM is 0.1 higher than that of the currently popular LSTM-transformer architecture.
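For reproducibility, the curves and AUC values can be obtained from the detection scores with standard tooling; the snippet below is a generic scikit-learn sketch rather than the paper's actual plotting code.

```python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

def plot_roc(y_true, scores, label):
    """y_true: 0/1 anomaly labels; scores: the model's abnormal-class probability."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc(fpr, tpr):.2f})")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
```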

Due to the anomaly’s limited range of influence, enlarging the network size might result in overlooking the anomaly. Fig.11 illustrates the variation in model accuracy corresponding to changes in network size. In both scenarios with a small and large number of nodes, the MSADM model outperforms other models in both anomaly detection and classification accuracy.

Fig.12 illustrates the confusion matrix analysis of the anomaly detection results produced by MSADM on the test set. Fig.12 (a) primarily assesses the model's accuracy in identifying various anomalies; the results underscore the model's high accuracy across most anomaly-type classification tasks. Fig.12 (b) depicts the accuracy of anomaly detection. Our identification accuracy for abnormal samples reaches 95%, implying that we can analyze and collect information from almost all abnormal network entities within the network. As for normal samples that are incorrectly flagged, since we gauge the degree of abnormality when generating anomaly reports, the resulting minor abnormal information does not excessively consume reporting resources.

4.2 Semanticization Evaluations

In this section, we present the text generation component of MSADM to showcase the quality of our semantic generation. We also highlight segments of the LLM output to underscore the benefits of our thought prompts in guiding LLM reasoning. Due to space constraints, we display only a portion of the anomaly report and LLM output; the complete textual content is available in the appendices.


In the event of a node application crash, the node becomes unable to request and respond to data packets due to the application anomaly, yet it retains its functionality as a packet-forwarding relay. We use this scenario as an example to demonstrate the practicality of the generated statements.

The results are depicted in Fig.13, which shows part of the anomaly report generated by a single network entity when the anomaly occurs. This section includes descriptions of packet loss rates, bit error rates, and latencies, along with the anomalies diagnosed by the model. It is evident from the report that the PLR and bit error rate of the node are notably high, whereas the PLR and bit error rate of the communication link remain relatively unaffected, aligning with the real-world scenario. See Appendix D for the complete report.


We input the analyzed data from the collected reports into the LLM to generate the corresponding report and conclusions. The solution produced by the LLM appears in Fig.14. Guided by chain-of-thought-based prompts, the LLM assesses various factors that may have contributed to the anomaly, including software and hardware issues, as well as troubleshooting and resolution strategies. This exception report, enhanced by the LLM's insights, goes well beyond traditional operations and maintenance documentation by reducing the reliance on operator experience that often leads to incorrect exception handling. At the same time, the proposed solution enables NMs to rapidly mitigate anomalies and maintain network health. The comprehensive exception analysis report is detailed in Appendix E.

5 Discussion

We have illustrated the advantages of our scheme for assisting network operators with health management in DHNs. In this section, we explore potential future directions in conjunction with our scheme.

Modeling Stateful Behaviors: To better adapt to the diverse communicating entities in DHNs, we deliberately made trade-offs to enhance the model's scalability. Currently, we model the KPIs commonly owned by each entity. However, this approach overlooks the intricate interactions at higher layers, such as the transport protocols in use, network-layer TM mechanisms, and potential device interactions. A promising future direction is to leverage MSADM to model the stateful behavior of higher-level network participants (e.g., web servers, SQL servers), such as those at the application layer, and to integrate them with our scheme to form an anomaly detection solution for microservice-architecture networks.

Self-evolution of the LLM: In this article, we utilize the LLM to generate the final anomaly inference results. However, this process is one-way and provides no feedback to the large model itself. In the future, we posit that self-evolution methods can be employed to help the LLM learn, improve, and evolve from the experience it generates. Simultaneously, the evolved LLM can assist MSADM in augmenting and maintaining the semantic rule trees to enrich the vocabulary and enhance the quality of the generated sentences.

6 Conclusion

We introduce semantic expression into wireless networks for the first time and develop an LLM-assisted end-to-end health management scheme for DHNs. Our model automatically processes collected anomaly data, predicts anomaly categories, and offers mitigation options. To address the inability of algorithms that depend on expert input or basic rule-based systems to adapt to multi-device environments, we propose the MSADM. MSADM utilizes a predefined rule base to monitor the state of entity communication KPIs, conducts anomaly detection and classification through a rule-enhanced Transformer structure, and produces unified and standardized textual representations of anomalies using a semantic rule tree. Furthermore, the inclusion of a chain-of-thought-based LLM in the diagnostic process not only enhances fault detection but also generates detailed reports that pinpoint faults and recommend optimization strategies. Experiments demonstrate that MSADM surpasses current mainstream models in anomaly detection accuracy. Additionally, the experimentally generated anomaly reports and solutions highlight our approach’s potential to boost the efficiency and accuracy of intelligent operations and maintenance analysis in distributed networks.

References

  • [1]Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan.Recommending root-cause and mitigation steps for cloud incidents using large language models.In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1737–1749, 2023.
  • [2]Raquel Barco, Pedro Lázaro, Luis Díez, and Volker Wille.Continuous versus discrete model in autodiagnosis systems for wireless networks.IEEE Transactions on Mobile Computing, 7(6):673–681, 2008.
  • [3]Raquel Barco, Volker Wille, Luis Díez, and Matías Toril.Learning of model parameters for fault diagnosis in wireless networks.Wireless Networks, 16:255–271, 2010.
  • [4]Connor Baumler and Soumya Ray.Hybrid semantics for goal-directed natural language generation.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1936–1946, 2022.
  • [5]Francesca Boem, AlexanderJ Gallo, DavideM Raimondo, and Thomas Parisini.Distributed fault-tolerant control of large-scale systems: An active fault diagnosis approach.IEEE Transactions on Control of Network Systems, 7(1):288–301, 2019.
  • [6]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • [7]Kangjie Cao, Ting Zhang, and Jueqiao Huang.Advanced hybrid lstm-transformer architecture for real-time multi-task prediction in engineering systems.Scientific Reports, 14(1):4890, 2024.
  • [8]Xuehan Chen, Jingjing Tan, Litian Kang, Fengxiao Tang, Ming Zhao, and Nei Kato.Frequency selective surface towards 6g communication systems: A contemporary survey.IEEE Communications Surveys & Tutorials, pages 1–1, 2024.
  • [9]Yinfang Chen, Huaibing Xie, Minghua Ma, YuKang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, etal.Automatic root cause analysis via large language models for cloud incidents.In Proceedings of the Nineteenth European Conference on Computer Systems, pages 674–688, 2024.
  • [10]Jiezhu Cheng, Kaizhu Huang, and Zibin Zheng.Towards better forecasting by fusing near and distant future visions.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(04), pages 3593–3600, 2020.
  • [11]Samira Hayat, Evşen Yanmaz, and Raheeb Muzaffar.Survey on unmanned aerial vehicle networks for civil applications: A communications viewpoint.IEEE Communications Surveys & Tutorials, 18(4):2624–2661, 2016.
  • [12]Yungi Jeong, Eunseok Yang, JungHyun Ryu, Imseong Park, and Myungjoo Kang.Anomalybert: Self-supervised transformer for time series anomaly detection using data degradation scheme.arXiv preprint arXiv:2305.04468, 2023.
  • [13]Ruofan Jin, Bing Wang, Wei Wei, Xiaolan Zhang, Xian Chen, Yaakov Bar-Shalom, and Peter Willett.Detecting node failures in mobile wireless networks: A probabilistic approach.IEEE Transactions on Mobile Computing, 15(7):1647–1660, 2015.
  • [14]RanaM Khanafer, Beatriz Solana, Jordi Triola, Raquel Barco, Lars Moltsen, Zwi Altman, and Pedro Lazaro.Automated diagnosis for umts networks using bayesian network approach.IEEE Transactions on vehicular technology, 57(4):2451–2461, 2008.
  • [15]EmilJ Khatib, Raquel Barco, Ana Gómez-Andrades, and Inmaculada Serrano.Diagnosis based on genetic fuzzy algorithms for lte self-healing.IEEE Transactions on vehicular technology, 65(3):1639–1651, 2015.
  • [16]Slawomir Kukliński and Lechosław Tomaszewski.Key performance indicators for 5g network slicing.In 2019 IEEE conference on network softwarization (NetSoft), pages 464–471. IEEE, 2019.
  • [17]Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, and Zhe Wang.Correlating events with time series for incident diagnosis.In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2014.
  • [18]Ziyu Lyu, Yue Wu, Junjie Lai, Min Yang, Chengming Li, and Wei Zhou.Knowledge enhanced graph neural networks for explainable recommendation.IEEE Transactions on Knowledge and Data Engineering, page 1–1, Jan 2022.
  • [19]Minghua Ma, Shenglin Zhang, Junjie Chen, Jim Xu, Haozhe Li, Yongliang Lin, Xin Nie, Bo Zhou, Yong Wang, and Dan Pei.Jump-starting multivariate time series anomaly detection for online service systems.In USENIX Annual Technical Conference (USENIX ATC), 2021.
  • [20]Malgorzata Steinder and Adarshpal S. Sethi.A survey of fault localization techniques in computer networks.Science of Computer Programming, 53(2):165–194, 2004.
  • [21]Matthew Middlehurst, Patrick Schäfer, and Anthony Bagnall.Bake off redux: a review and experimental evaluation of recent time series classification algorithms.Data Mining and Knowledge Discovery, pages 1–74, 2024.
  • [22]Navid MohammadiFoumani, Lynn Miller, ChangWei Tan, GeoffreyI. Webb, Germain Forestier, and Mahsa Salehi.Deep learning for time series classification and extrinsic regression: A current survey.ACM Comput. Surv., 56(9), apr 2024.
  • [23]IsaacKofi Nti, JuanitaAhia Quarcoo, Justice Aning, and GodfredKusi Fosu.A mini-review of machine learning in big data analytics: Applications, challenges, and prospects.Big Data Mining and Analytics, 5(2):81–97, 2022.
  • [24]Gopika Premsankar, Mario DiFrancesco, and Tarik Taleb.Edge computing for the internet of things: A case study.IEEE Internet of Things Journal, 5(2):1275–1284, 2018.
  • [25]Bing Qian and Shun Lu.Detection of mobile network abnormality using deep learning models on massive network measurement data.Computer Networks, 201:108571, 2021.
  • [26]Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and QiZhang.Time-series anomaly detection service at microsoft.In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3009–3017, 2019.
  • [27]GeorgeF. Riley and ThomasR. Henderson.The ns-3 Network Simulator, pages 15–34.Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
  • [28]Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu AwalMd Shoeb, Abubakar Abid, Adam Fisch, AdamR Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, etal.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.arXiv preprint arXiv:2206.04615, 2022.
  • [29]Péter Szilágyi and Szabolcs Nováczki.An automatic detection and diagnosis framework for mobile communication systems.IEEE transactions on Network and Service Management, 9(2):184–197, 2012.
  • [30]Fengxiao Tang, Xuehan Chen, TiagoKoketsu Rodrigues, Ming Zhao, and Nei Kato.Survey on digital twin edge networks (diten) toward 6g.IEEE Open Journal of the Communications Society, 3:1360–1381, 2022.
  • [31]Shahroz Tariq, Sangyup Lee, Youjin Shin, MyeongShin Lee, Okchul Jung, Daewon Chung, and SimonS Woo.Detecting anomalies in space using multivariate convolutional lstm with mixtures of probabilistic pca.In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2123–2133, 2019.
  • [32]Shreshth Tuli, Giuliano Casale, and NicholasR Jennings.Tranad: Deep transformer networks for anomaly detection in multivariate time series data.arXiv preprint arXiv:2201.07284, 2022.
  • [33]Xianbin Wang, Jie Mei, Shuguang Cui, Cheng-Xiang Wang, and XueminSherman Shen.Realizing 6g: The operational goals, enabling technologies of future networks, and value-oriented intelligent multi-dimensional multiple access.IEEE Network, 37(1):10–17, Jan 2023.
  • [34]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, EdChi, QuocV Le, and Denny Zhou.Chain-of-thought prompting elicits reasoning in large language models.In S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems, volume35, pages 24824–24837. Curran Associates, Inc., 2022.
  • [35]Yiyuan Yang, Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun.Dcdetector: Dual attention contrastive representation learning for time series anomaly detection.In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3033–3045, 2023.
  • [36]Xun Yuan, Fengxiao Tang, Ming Zhao, and Nei Kato.Joint rate and coverage optimization for the thz/rf multi-band communications of space-air-ground integrated network in 6g.IEEE Transactions on Wireless Communications, pages 1–1, 2023.

Acknowledgments

Appendix A Evaluation of network attributes and performance metrics

We use KPIs from both communication nodes and communication links as rule-based filtering features and machine-learning features to detect and classify anomalies. The specific features considered are shown in Table 2 below.

Table 2: Network entities and the performance indicators collected for each.

Network Entity     Performance Indicators
Node Attributes    Packet Loss Rate
                   Bit Error Rate
                   Neighboring Nodes Number
                   Routing Table Number
                   Cache Size
Link Attributes    Packet Loss Rate
                   Bit Error Rate
                   Transmission Delay
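
To make the feature construction concrete, the following is a minimal Python sketch, assuming each KPI in Table 2 is collected as a fixed-length time series, of how one KPI window could be reduced to the mean level, fluctuation, and trend statistics that the semanticized reports describe. The identifier names, statistics, and the synthetic demo window are illustrative assumptions, not the exact rules used by MSADM.

    import numpy as np

    # KPIs from Table 2 (illustrative identifiers, not fixed field names).
    NODE_KPIS = ["packet_loss_rate", "bit_error_rate", "neighbor_count",
                 "routing_table_count", "cache_size"]
    LINK_KPIS = ["packet_loss_rate", "bit_error_rate", "transmission_delay"]

    def summarize_kpi(series):
        """Reduce one KPI time series to mean level, fluctuation, and trend."""
        series = np.asarray(series, dtype=float)
        slope = np.polyfit(np.arange(len(series)), series, deg=1)[0]
        return {
            "mean": float(series.mean()),        # average level of the window
            "fluctuation": float(series.std()),  # volatility of the window
            "trend": "up" if slope > 0 else ("down" if slope < 0 else "flat"),
        }

    def extract_features(samples):
        """Summarize every collected KPI of one node or link."""
        return {name: summarize_kpi(values) for name, values in samples.items()}

    # Example: a 150-sample packet-loss window (30 s sampled every 200 ms).
    window = np.clip(np.random.normal(0.44, 0.2, 150), 0.0, 1.0)
    print(extract_features({"packet_loss_rate": window}))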

Appendix B Network Node Parameters

Table 3: Configuration of the four device types in the simulated ad-hoc network.

Device Name    Transmitting Power   Bandwidth   Communication Protocol   Range   Speed
Mobile Phone   23 dBm               20 MHz      LTE                      200 m   10 m/s
Vehicle        30 dBm               10 MHz      802.11p                  200 m   20 m/s
UAV            20 dBm               5 MHz       802.11ac                 400 m   15 m/s
Base Station   43 dBm               100 MHz     LTE                      500 m   0 m/s (stationary)

On the ns-3 network simulation platform, we designed and configured four different device types to build a virtual ad-hoc network (see Table 3 for the specific device configurations). The network consists of 9 to 20 nodes. We set a data collection duration of 30 seconds with a collection period of 200 ms.
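
As a concrete illustration, the following is a small Python sketch of how the Table 3 parameters and the collection schedule could be expressed as a plain configuration object in front of the ns-3 scenario. The dictionary keys and variable names are hypothetical; this is not ns-3 API code.

    # Device parameters from Table 3 (key names are illustrative).
    DEVICES = {
        "mobile_phone": {"tx_power_dbm": 23, "bandwidth_mhz": 20,  "protocol": "LTE",
                         "range_m": 200, "speed_mps": 10},
        "vehicle":      {"tx_power_dbm": 30, "bandwidth_mhz": 10,  "protocol": "802.11p",
                         "range_m": 200, "speed_mps": 20},
        "uav":          {"tx_power_dbm": 20, "bandwidth_mhz": 5,   "protocol": "802.11ac",
                         "range_m": 400, "speed_mps": 15},
        "base_station": {"tx_power_dbm": 43, "bandwidth_mhz": 100, "protocol": "LTE",
                         "range_m": 500, "speed_mps": 0},  # stationary
    }

    COLLECTION_DURATION_S = 30.0  # total data collection per run
    COLLECTION_PERIOD_S = 0.2     # one KPI sample every 200 ms

    samples_per_run = int(COLLECTION_DURATION_S / COLLECTION_PERIOD_S)
    print(samples_per_run)        # 150 samples per KPI per run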

Appendix C Anomaly Categories

Table 4: Anomaly categories by protocol layer.

Application Layer
  Application Down: Application failures leave the node unable to request and respond to packets; however, it can still act as a relay for packet forwarding, so network connectivity is preserved.
  Malicious Traffic: The node sends and requests a large amount of data in a short period.
Transport Layer
  Network Congestion: Traffic in the network exceeds the processing capacity of network devices or links.
Data Link Layer
  Communication Obstacles: Obstacles block the line of sight between nodes, obstructing wireless transmission.
  Out-of-Range: Node mobility takes the node out of communication range.
Physical Layer
  Network Node Crash: The node loses all network communication capability due to hardware failure.
When using traditional machine learning techniques for fault detection, obtaining sufficient labeled negative samples is a particular concern. DHNs exhibit a wide range of anomaly types, so a careful classification of the common fault types is crucial. Table 4 shows our final classification of these anomaly types.
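
For illustration, a minimal Python sketch of this taxonomy follows, assuming one short label per fault class. The label strings (e.g., "appdown") mirror the wording of the anomaly reports in Appendix D, but the structure itself is a hypothetical convenience, not part of the framework's code.

    # Hypothetical fault taxonomy mirroring Table 4: one short label per class,
    # grouped by the protocol layer the fault originates from.
    FAULT_TAXONOMY = {
        "application": ["appdown", "malicious_traffic"],
        "transport": ["network_congestion"],
        "data_link": ["communication_obstacles", "out_of_range"],
        "physical": ["node_crash"],
    }

    # Flat label set used when organizing labeled (negative) training samples.
    ALL_FAULT_LABELS = [label for labels in FAULT_TAXONOMY.values() for label in labels]
    print(ALL_FAULT_LABELS)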

Appendix D Complete Anomaly Report

The remainder of this section shows an example anomaly report produced by our framework:

Current Network Context: The current node0 status is as follows: the packet loss rate shows a very high average value of 44.43%, with extremely volatile fluctuation, and a trend that fell sharply and then rose. The information about the communication links of the current node is as follows: the current node may have an appdown fault!

The current node1 status is as follows: the number of neighboring nodes is seriously above average, shows minor fluctuation, and has an upward trend; the number of routing table caches is seriously above average, shows minor fluctuation, and has an upward trend. The information about the communication links of the current node is as follows: the current node may have a malicious traffic fault!

The current node2 status is as follows: the number of neighboring nodes is seriously above average, shows minor fluctuation, and has an upward trend; the number of routing table caches is seriously above average, shows minor fluctuation, and has an upward trend. The information about the communication links of the current node is as follows: the current node may have an appdown fault!

Questions: According to the preceding description, if similar historical fault information exists, identify the fault type and provide a solution. If not, identify the current fault type and provide the optimal solution. Select a fault type from the options. The fault type mentioned above may not be correct; determine and confirm the fault according to the information in the context. If you have a different view on the fault, state the cause.

Options: Please select the anomaly type that best matches the context's performance from the following: a: Node Down; b: Malicious Traffic; c: Network Congestion; d: Communication Obstacles; e: Out-of-Range; f: Network Node Crash.
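
For clarity, the following is a minimal Python sketch of how the three parts of the report above (Context, Questions, Options) could be assembled into a single prompt string for the downstream chain-of-thought LLM. The function name and formatting details are illustrative assumptions rather than the framework's exact prompt builder.

    # Option labels copied from the Options section above.
    OPTIONS = {
        "a": "Node Down", "b": "Malicious Traffic", "c": "Network Congestion",
        "d": "Communication Obstacles", "e": "Out-of-Range", "f": "Network Node Crash",
    }

    QUESTION = (
        "According to the preceding description, if similar historical fault "
        "information exists, identify the fault type and provide a solution. "
        "If not, identify the current fault type and provide the optimal solution. "
        "Select a fault type from the options; the fault type mentioned above may "
        "not be correct, so confirm it against the context. If you have a different "
        "view on the fault, state the cause."
    )

    def build_prompt(network_context):
        """Concatenate the semanticized report, the question, and the options."""
        options_text = "; ".join(f"{key}: {name}" for key, name in OPTIONS.items())
        return (f"Current Network Context: {network_context}\n\n"
                f"Questions: {QUESTION}\n\n"
                f"Options: Please select the anomaly type that best matches the "
                f"context's performance from the following: {options_text}.")

    # Usage: build_prompt(report_text) returns the string sent to the LLM.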

Appendix E Complete LLM output

The remainder of this section shows the output of the LLM based on the anomaly report:

E.1 Node Analysis

Based on the provided network context, here’s the detailed analysis:

Node0:
Status: High packet loss rate averaging 44.43%.
Communication Link Info: Indicates a possible appdown fault.
Analysis: The high packet loss rate and the indication of an appdown fault suggest a significant issue with the application layer on Node0. This could be due to software errors, resource limitations, or misconfigurations.
Recommended Action: Investigate the application running on Node0, check for any software updates, review configurations, and allocate additional resources if necessary to mitigate packet loss.

Node1:
Status: The number of neighboring nodes and routing table caches is significantly above average, with minor fluctuations and an upward trend.
Communication Link Info: Indicates possible malicious activity.
Analysis: The increasing number of neighboring nodes and routing table caches, along with the indication of malicious activity, suggests a potential security breach or an attack targeting Node1.
Recommended Action: Conduct a thorough security audit, implement intrusion detection systems, and strengthen network security measures to detect and mitigate malicious activity targeting Node1.

Node2:
Status: Similar to Node1, with an increasing number of neighboring nodes and routing table caches showing minor fluctuations and an upward trend.
Communication Link Info: Also indicates a possible appdown fault.
Analysis: The pattern of increasing neighboring nodes and routing table caches, along with the appdown fault indication, points to an issue similar to Node0, possibly related to application failures or disruptions.
Recommended Action: Conduct a detailed investigation into the application running on Node2, review configurations, and address any software issues or resource constraints to prevent further disruptions.

E.2 Overall Network Analysis

The network is experiencing multiple issues across different nodes, including high packet loss rates, indications of malicious activity, and potential application failures. There might be underlying systemic issues affecting the stability and security of the network, such as insufficient resources, misconfigurations, or vulnerabilities. Addressing these issues promptly is crucial to prevent service disruptions, security breaches, and network performance degradation.

Recommended Actions:
  • Perform comprehensive network diagnostics to identify and address underlying issues.
  • Implement robust security measures to mitigate malicious activities, including firewalls, intrusion detection systems, and access controls.
  • Optimize resource allocation and configurations to improve network performance and reliability.
  • Regularly monitor and maintain the network to detect and prevent future issues.
  • If historical fault information exists, it should be analyzed to identify patterns and potential root causes, aiding in developing more effective long-term solutions.
