Project by Viktória Koncserová: Anomaly Detection with Real-Time Tracking of Fault Origin

In today's data-driven world, precise decision-making hinges on accurate data analysis. However, anomalies (data points that differ significantly from the rest of the dataset) can still arise and distort that analysis. Anomaly detection, which identifies these outliers, and root cause analysis, which investigates their origins, are essential for maintaining data reliability.

This semester project focuses on employing various anomaly detection methods and performing root cause analysis on two distinct datasets to ensure robust data quality and informed decision-making.

Identifying Outliers in Renewable Energy Production:

To identify anomalies from both real measured data and predicted data, and to determine whether these anomalies were caused by weather conditions or a failure of any solar panel, three different unsupervised Anomaly Detection methods were applied:

  1. K-Means Clustering: This method groups similar values together based on their proximity to each other. Points that lie far from the centre of their nearest cluster can then be treated as anomaly candidates.
  2. Isolation Forest Method: It works by repeatedly selecting a random feature and a random split value within that feature's range until each observation is isolated. Observations that require fewer random splits to be isolated are considered anomalies, as they are easier to separate from the rest of the data.
  3. Autoencoder: This method extracts significant patterns from the data by finding a low-dimensional representation and reconstructing the data back into the original space. Samples that are reconstructed poorly, i.e. with a large reconstruction error, are typically treated as anomalies.
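
The isolation idea behind the second method can be illustrated with a minimal sketch. It is a simplified, single-feature version written for illustration only, not the implementation used in the project: a real Isolation Forest also picks a random feature at each split and works on subsamples. The readings and the function names `isolation_depth` and `mean_isolation_depth` are hypothetical.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be separated any further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_isolation_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over many random split sequences."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical power readings; 0.2 is an injected outlier.
readings = [5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.8, 5.1, 0.2]
depth_outlier = mean_isolation_depth(readings, 0.2)
depth_inlier = mean_isolation_depth(readings, 5.1)
print(depth_outlier, depth_inlier)
```

The outlier is separated after far fewer splits on average than a typical reading, which is the property the method uses to score anomalies.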

Using a voting system that took the results of all three methods into account, we determined the final anomalies, shown in Fig. 1.
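
A voting step of this kind could look like the sketch below. The exact voting rule is not stated here, so the majority threshold (`min_votes=2` out of three detectors) is an assumption, and the flag lists are made-up illustrative values rather than project results.

```python
def majority_vote(*detector_flags, min_votes=2):
    """Flag a sample as anomalous when at least `min_votes` detectors agree."""
    return [sum(flags) >= min_votes for flags in zip(*detector_flags)]


# Hypothetical per-sample flags from the three detectors (True = anomaly).
kmeans_flags = [False, True, False, True, False]
iforest_flags = [False, True, True, False, False]
autoencoder_flags = [False, True, False, True, True]

final_flags = majority_vote(kmeans_flags, iforest_flags, autoencoder_flags)
print(final_flags)  # [False, True, False, True, False]
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms; setting `min_votes=3` would keep only anomalies on which all methods agree.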

Fig. 1: Final anomalies detected on the Solar Panel data

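Finally, in the root cause analysis using Pearson's correlation, we determined that all the anomalies appeared because of weather conditions, not because of a failure of the solar panels. The Pearson correlation helps identify the root cause of anomalies by quantifying the linear relationship between two variables, thus revealing patterns and potential sources of the anomalies. The formula for the Pearson correlation is:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $x_i$ and $y_i$ are the paired observations of the two variables, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of observations. The coefficient ranges from $-1$ (perfect negative linear relationship) to $+1$ (perfect positive linear relationship).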

By applying this method, we confirmed the weather conditions as the primary cause of the detected anomalies.
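
As a minimal sketch, the Pearson coefficient can be computed directly from its definition. The function name `pearson_r` and the irradiance and power values are hypothetical and serve only to show the calculation; they are not measurements from the project.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical paired observations: solar irradiance vs. produced power.
irradiance = [100, 300, 500, 700, 900]
power = [0.5, 1.6, 2.4, 3.6, 4.5]
r = pearson_r(irradiance, power)
print(round(r, 3))
```

A coefficient close to 1 between a weather variable and the production data during anomalous periods would point toward weather as the explanation, whereas a weak correlation would leave an equipment fault as a candidate.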

Battery Energy Storage System (BESS) Data:

A battery energy storage system (BESS) captures energy from both renewable and non-renewable sources, storing excess energy when generation exceeds demand and supplying electricity when needed. These systems consist of battery packs, power conversion systems, and control electronics. Despite their numerous benefits, BESS are susceptible to various anomalies that can impact their performance and safety.

One common concern is degradation, where the battery's capacity diminishes over time due to repeated charge and discharge cycles. Overheating poses another risk, often caused by high charge/discharge rates or inadequate cooling systems. A frequent failure in BESS involves the cooling fans, which play a critical role in maintaining optimal operating temperatures within the battery system, preventing overheating, and ensuring efficient performance. If the cooling fans malfunction or fail, the system's thermal management could be compromised, leading to increased temperatures and potential thermal runaway situations.

The project aimed to develop an automatic root cause analysis algorithm that uses a single binary anomaly flag together with data from multiple temperature sensors to identify which cooling fan malfunctioned.

Fig. 2: Average temperatures in the Modules when the fan error occurred

For this purpose we applied these methods:

  1. Biserial Correlation: This correlation method measures the relationship between two variables where one is continuous and the other is binary. For a naturally binary variable, as here, this is usually called the point-biserial correlation. In our case, the continuous variables were the Average Module Temperature values, and the binary variable was the anomaly flag.
  2. Pearson Correlation: Measures the linear correlation between two variables.
  3. Principal Component Analysis (PCA) and Support Vector Machine (SVM): PCA finds a low-rank representation of the data, which helps separate signal from noise and outliers. Principal components are ordered such that the first components capture most of the information present in the data. An SVM is then fitted in the space of the leading components to draw a decision boundary between faulty and normal values.
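
The first method can be sketched as follows: correlate each module's temperature series with the binary anomaly flag and rank the modules by the result. This is an illustration under assumptions, not the project's code; the module names, temperatures, and the function name `point_biserial` are hypothetical, and the fault is injected into `module_b` by construction.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous series and 0/1 labels."""
    group1 = [v for v, flag in zip(values, labels) if flag == 1]
    group0 = [v for v, flag in zip(values, labels) if flag == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    if n1 == 0 or n0 == 0:
        raise ValueError("both label values must occur at least once")
    mean_all = sum(values) / n
    # Population standard deviation of the whole series.
    std = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std == 0:
        return 0.0
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / std * math.sqrt(n1 * n0 / n ** 2)


# Hypothetical data: 1 marks the time steps at which the anomaly was reported.
anomaly = [0, 0, 0, 1, 1, 0, 0, 1]
module_temps = {
    "module_a": [30.1, 30.4, 29.9, 30.3, 30.2, 30.0, 30.5, 30.1],
    "module_b": [30.0, 30.2, 30.1, 36.5, 37.2, 30.3, 29.9, 36.8],
    "module_c": [30.2, 30.0, 30.3, 30.1, 30.4, 30.1, 30.2, 30.0],
}

scores = {name: point_biserial(temps, anomaly) for name, temps in module_temps.items()}
suspect = max(scores, key=scores.get)
print(suspect, round(scores[suspect], 3))
```

The module whose temperature rises most consistently while the anomaly flag is set receives the highest score and becomes the root cause candidate.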

For the PCA-based approach, we plotted the first two principal components together with the resulting decision boundary, shown in Fig. 3.

Fig. 3: Decision Boundary of Faulty and Normal Values with two principal components

As we can see in Fig. 3, most of the anomalous and normal values lie on the correct side of the decision boundary. Because two components capture only part of the variance in the data, the number of points on the wrong side of the boundary could likely be reduced by retaining more principal components.
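
A PCA plus SVM pipeline of this kind can be sketched with NumPy and scikit-learn. Everything in the sketch is an assumption made for illustration: the data is synthetic, the fault is injected into a hypothetical module index, and mapping the SVM weights back to the modules is only one possible way to read off a suspect module; how the project derived the faulty module from the PCA results is not described here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_modules, faulty_idx = 400, 6, 3  # hypothetical setup

# Synthetic data: roughly 20% of the samples are flagged as anomalous,
# and the faulty module runs about 6 degrees hotter during those samples.
labels = (rng.random(n_samples) < 0.2).astype(int)
temps = rng.normal(30.0, 1.0, size=(n_samples, n_modules))
temps[labels == 1, faulty_idx] += 6.0

# Project onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(temps)

# Fit a linear SVM in the two-dimensional component space.
svm = SVC(kernel="linear").fit(components, labels)
accuracy = svm.score(components, labels)

# Assumption: map the boundary normal back to the original module space and
# take the module with the largest absolute weight as the suspect.
module_weights = pca.components_.T @ svm.coef_[0]
suspect = int(np.argmax(np.abs(module_weights)))
print(suspect, round(accuracy, 3))
```

Raising `n_components` keeps more of the variance and may move additional points to the correct side of the boundary, at the cost of a boundary that can no longer be drawn in a single two-dimensional plot.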

Fig. 4: Results of the root cause analysis

Through the PCA-based analysis, we determined that the fan error occurred in Module 9. Biserial Correlation gave the same result, whereas Pearson's Correlation indicated a different faulty module. The exact results for all three analyses are shown in Fig. 4. Because two of the three methods identified Module 9, and a subsequent inspection by experts confirmed that this module was faulty, we can conclude that the root cause analysis was successful.