
Data Poisoning and Exactly Why Organizations Need to Take It Seriously

The Netflix docuseries Alexander: The Making of a God compels lovers of history to revisit some of the better-known facts about the Macedonian king who went on to rule a major part of the ancient world. Historians tell us Alexander, at the height of his conquests, was administered poison while drinking at a friend’s house. The poison ultimately proved fatal, but not before seriously impairing the physical and mental abilities of the all-conquering leader, whose army was poised to invade Arabia.

Data poisoning is not very different, in that it also involves ‘poisoning’ the training data sources in order to corrupt the results they produce.

What it is

The genesis of data poisoning can be traced to the evolution of Machine Learning (ML) models, which use training data to generate the results users desire. ML today powers a wide variety of areas including cybersecurity, social networks, search engines, and OTT streaming platforms, while also providing fertile ground for abuse at the hands of bad actors.

Data poisoning is simply the malicious polluting of training data and the algorithmic manipulation of an ML model, with the sole intention of altering the output that organizations will use in their decision-making. A form of adversarial machine learning, it can produce incorrect predictions that seriously impact business processes in organizations.

It generally takes the form of (4):

  • Label Flipping: where labels belonging to one class of data entries are intentionally reassigned to another class, causing the learning algorithm to make erroneous classifications (see the sketch after this list)
  • Outlier Injection: where unrelated or anomalous data points are introduced into the training data to distort the algorithm’s reading of it
  • Feature Manipulation: where characteristics of the data points are tweaked by introducing adversarial patterns that impair the model’s ability to generalize to new, unpoisoned data.
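
As a concrete illustration of the first technique, the minimal sketch below flips a fraction of training labels and compares the resulting model with one trained on clean data. It is a hypothetical example, not taken from the cited sources: the synthetic dataset, the logistic-regression model, and the 30% flip rate are all assumptions for illustration, and scikit-learn is assumed to be available.

```python
# Minimal, hypothetical sketch of a label-flipping attack (illustrative only;
# the dataset, model, and flip rate are assumptions). Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Build a small synthetic binary-classification dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(labels, fraction, rng):
    """Reassign a random fraction of training labels to the other class (the attack)."""
    poisoned = labels.copy()
    n_flip = int(fraction * len(poisoned))
    idx = rng.choice(len(poisoned), size=n_flip, replace=False)
    poisoned[idx] = 1 - poisoned[idx]  # point one class at the other
    return poisoned

rng = np.random.default_rng(0)
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(
    X_train, flip_labels(y_train, fraction=0.3, rng=rng)
)

# Accuracy on clean test data typically drops for the poisoned model.
print("clean   :", accuracy_score(y_test, clean_model.predict(X_test)))
print("poisoned:", accuracy_score(y_test, poisoned_model.predict(X_test)))
```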

When it happens

Data poisoning is not your traditional cyberattack, which generally aims to inflict damage on the compromised system quickly. Instead, it takes a long-term view, with the attacker persistently injecting malicious inputs into the training data until they are accepted as legitimate.

To understand how it works, it is necessary to remember that most ML models are developed by data scientists using historical data. However, these models are also equipped to accept further data inputs from the host organization in the course of operations.

Poisoning is generally inflicted at this latter stage, with the ‘feeding’ of misleading data that taints the automation process and thereby produces faulty results. TechRadar (1) cites the example of the well-known Amazon and Netflix recommendation engines, which could theoretically be made to produce skewed ratings and recommendations by ‘poisoning’ them with manipulated data feeds delivered via automated bots.
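
To make that scenario concrete, the toy sketch below shows how a system that keeps folding incoming ratings into its model, without any validation, can have its rankings inverted by a bot flood. This is a hypothetical illustration of the mechanism only; it is not a description of how Amazon’s or Netflix’s engines actually work, and the item names and rating volumes are invented.

```python
# Hypothetical toy illustration (not any real engine): a recommender that
# retrains on every incoming rating can be skewed by bot-submitted data.
from collections import defaultdict

class AverageRatingRecommender:
    """Toy recommender: ranks items by their running average rating."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def ingest(self, item, rating):
        # Operational data is folded into the model without validation --
        # exactly the opening a poisoning attack exploits.
        self.totals[item] += rating
        self.counts[item] += 1

    def score(self, item):
        return self.totals[item] / self.counts[item] if self.counts[item] else 0.0

rec = AverageRatingRecommender()
# Genuine usage: item "A" is well liked, item "B" is not.
for _ in range(500):
    rec.ingest("A", 5.0)
    rec.ingest("B", 2.0)
# Poisoning phase: automated bots persistently feed misleading ratings.
for _ in range(5000):
    rec.ingest("A", 1.0)
    rec.ingest("B", 5.0)
print(rec.score("A"), rec.score("B"))  # the ranking of A and B is now inverted
```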

The devastating consequences

The devastation caused by algorithmic manipulation can be quite severe. Prominent setbacks include degraded model performance metrics, erroneous decisions triggered by faulty predictions, and an organizational cybersecurity posture left more susceptible to future attacks. Cybersecurity experts recognize that it can be used for a huge variety of nefarious purposes including:

  • Disinformation
  • Phishing scams
  • Altering of public opinion
  • Promotion of unsolicited content, and
  • Discrediting individuals or brands

Even tech giants like Google (2) and Microsoft (3) have not been spared. Google went on record to state that advanced spammer groups made no fewer than four large-scale malicious attempts to skew its Gmail filter into categorizing spam emails as genuine. Microsoft’s Tay, the Twitter (now X) chatbot intended for casual conversation on the networking platform, had to be shut down just 16 hours after launch when threat actors fed offensive tweets into its algorithm.

Tainted learning models are generally regarded as compromised and of little further use after an attack. Data scientists confirm that sifting the injected ‘bad samples’ out of the training data is so laborious that retraining the model is usually the best option. Further, there is no guarantee of prediction accuracy once a model has been compromised. Thus, data poisoning not only disrupts the business process but also drives further costs, with investment needed in new learning models.

What organizations can do to curb the menace

Curbing the menace is not an easy task. Cybersecurity experts emphasize detection and prevention, because once poisoning has taken root, the learning model is badly compromised: remediation is laborious and time-consuming, and the model is deemed unreliable in the meantime. Organizations would be well advised to follow several prevention steps (1) (4) such as:

  • Hiring experienced data scientists and analysts
  • Sanitizing data inputs prior to the training process (see the sketch after this list)
  • Defining a robust business process flow for model training and retraining (in the event that a model is compromised)
  • Integrating well-established model regularization methods at the outset to limit the impact of poisoned samples
  • Tracking the origin and authenticity of all data points
  • Setting up stringent real-time processes for observing model behavior, breach attempts, data anomalies, and result patterns
  • Restricting the use of open-source data when creating training data
  • Instituting strong processes to ensure the integrity of sensitive data among employees and data handlers
  • Commissioning periodic independent third-party audits of model architecture, system checks, and vulnerabilities
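
As one way of approaching the input-sanitization step above, the sketch below screens each incoming training batch with an off-the-shelf anomaly detector before it reaches the model. The specific technique (an Isolation Forest with a 5% contamination assumption) is an illustration of the idea rather than a method prescribed by the cited sources, and scikit-learn is assumed to be available.

```python
# Sketch of one prevention step: sanitizing each incoming training batch with
# an anomaly detector before it reaches the model. The detector and the 5%
# contamination rate are illustrative assumptions. Assumes scikit-learn.
import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize_training_batch(X, y, contamination=0.05):
    """Drop points the anomaly detector flags as outliers before training."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    keep = detector.fit_predict(X) == 1  # 1 = inlier, -1 = flagged outlier
    return X[keep], y[keep]

# Usage: screen a batch of operational data before folding it into the training set.
X_batch = np.random.default_rng(0).normal(size=(200, 5))
y_batch = np.zeros(200)
X_clean, y_clean = sanitize_training_batch(X_batch, y_batch)
print(f"kept {len(X_clean)} of {len(X_batch)} points")
```

In practice the detector and its threshold would be tuned to the organization’s own data; the point is simply that suspect points are quarantined before they can influence training.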

Growing concern

Organizations that take data poisoning lightly, thinking they are safe in their utopian world, do so at their peril. To underline the dangers it poses, consider the case of ChatGPT, the generative AI tool currently used by an estimated 180 million people worldwide. OpenAI disclosed a March 2023 breach, confirming that it involved the exploitation of a bug in the open-source code (5) powering the platform.

Though the breach was quickly contained, the continued reliance of uninformed users on chatbots built with open-source libraries is a cause for concern. Bad actors are repeatedly targeting open-source components; SecurityIntelligence reports that attacks on open-source libraries have increased by 742% since 2019 (6).

Thankfully, concern is growing. Several major companies, including Amazon, Apple, Verizon, Citigroup, and JP Morgan, are restricting employee usage of ChatGPT. Many countries too have banned or restricted its use, Russia, China, Syria, and Italy among them. The US cybersecurity industry is watching to see how governmental regulation in the country pans out. Going forward, legislation, organizational measures, and user awareness will be pivotal in curbing the menace.

Final thoughts

One can only wonder how much further Alexander the Great would have gone in his conquests had he put in place preventive measures such as these to guard against his poisoning, as other historical figures through time did. For someone hailing from a civilization where poisons abounded, it was a monumental mistake.

The cybersecurity industry, still with immense growth potential, would do well to learn from this analogy.

References

ChatGPT Confirms Data Breach, Raising Security Concerns (securityintelligence.com)


Contact us at sales@aurorait.com or call 888-282-0696 to learn more about how Aurora can help your organization with IT, consulting, compliance, assessments, managed services, or cybersecurity needs.
