What’s at Stake?
As the AI wave reshapes the tech industry, companies are racing to harness its potential—whether to improve operational efficiency or to deliver shiny new features to their customers. For organizations lacking the resources to self-host massive models, this means feeding data into third-party APIs, whether for training or inference. This dependency introduces significant security and governance challenges as sensitive data leaves the organizational perimeter.
The data at risk can include:
Personally Identifiable Information (PII): Your customers' sensitive personal data.
Intellectual Property (IP): Proprietary materials, codebase, and knowledge unique to your organization.
Trade Secrets & Sensitive Company Information: Critical internal data and processes.
The level of caution required depends on the sensitivity of the data and the use case at hand. While anonymization and encryption may mitigate some risks, others demand stricter measures, such as terms of service that explicitly prohibit data storage or model training by the third-party API provider.
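To make the anonymization point concrete, here is a minimal sketch of redacting obvious PII before a prompt ever leaves your perimeter. The regex patterns and the `redact` helper are purely illustrative and nowhere near production-grade; dedicated tooling (e.g., Microsoft Presidio) covers far more entity types and languages.

```python
import re

# Illustrative patterns only -- real PII detection needs far more than a few
# regexes (names, addresses, free-text identifiers, etc.).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact(text: str) -> str:
    """Replace obvious PII with typed placeholders before sending text to an external API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

prompt = "Customer jane.doe@example.com (+31 6 1234 5678) reported a billing issue."
print(redact(prompt))
# -> Customer <EMAIL> (<PHONE>) reported a billing issue.
```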
Is the Threat Even Real?
When Things Went Wrong: Microsoft’s Example
Case Study: Microsoft’s 38TB Data Breach
In a glaring example of security lapses in AI, Microsoft’s AI research team accidentally exposed 38 terabytes of private data while sharing open-source training materials on GitHub. The breach included:
Disk backups from two employees’ workstations.
Secrets, private keys, passwords.
Over 30,000 internal Microsoft Teams messages.
The root cause? Misconfigured SAS (Shared Access Signature) tokens, an Azure Storage feature for sharing data. Instead of restricting access to specific files, the researchers inadvertently shared an entire storage account containing sensitive data.
This incident underscores the heightened risks organizations face when handling massive AI training datasets. As teams rush to deploy AI solutions, insufficient attention to data security can lead to severe consequences—even for tech giants. Source.
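As a practical takeaway, the sketch below shows what a narrowly scoped SAS token might look like with the `azure-storage-blob` Python SDK: limited to a single blob, read-only, and short-lived, rather than exposing a whole storage account. The account, container, and blob names are placeholders; your actual policy will depend on your setup.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Placeholder identifiers -- substitute your own storage account details.
ACCOUNT_NAME = "mytrainingdata"
ACCOUNT_KEY = "<account-key-from-a-secret-store>"
CONTAINER = "public-datasets"
BLOB = "dataset-v1.parquet"

# Scope the token to one blob, read-only, with a short expiry. The Microsoft
# incident stemmed from tokens that instead exposed an entire storage account
# with broad permissions and a distant expiry date.
sas_token = generate_blob_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    blob_name=BLOB,
    account_key=ACCOUNT_KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB}?{sas_token}"
print(url)
```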
Governments Getting Involved! The European AI Act
The European Commission proposed the Artificial Intelligence Act in April 2021, and it was formally adopted by the European Parliament and the Council of the European Union in 2024. This 144-page document lays down requirements and obligations for specific uses of AI, which matters to you if you operate under EU law. Here are some specific references from the EU AI Act regarding data governance, data leakage, data security, and cross-border data considerations:
- Data Governance:
  Clause/Page: Article 10, Pages 56–58
  High-risk AI systems must follow robust data governance practices, ensuring data relevance, representativeness, and freedom from bias or errors for their intended purposes. This includes practices related to data preparation, collection, and examination for bias.
- Data Security:
  Clause/Page: Recital 76, Page 21
  Cybersecurity measures are critical to protect AI systems against malicious attempts to alter their use or compromise security, including threats like data poisoning and adversarial attacks.
  Clause/Page: Article 78, Page 105
  Confidentiality of data processed for compliance purposes is emphasized, requiring robust cybersecurity measures and deletion of data once it's no longer needed.
- Data Leaving the EU:
  Clause/Page: Recital 69, Page 19
  The Act builds on GDPR principles, including data minimization and restrictions on cross-border data transfers. Techniques such as localized processing are encouraged to minimize data transmission between entities.
- Prevention of Data Leakage:
  Clause/Page: Article 10, Pages 56–58
  Special categories of personal data must not be transmitted, transferred, or accessed by unauthorized parties, and must be deleted once their purpose is fulfilled.
  Clause/Page: Article 59, Page 91
  In AI regulatory sandboxes, data sharing outside the sandbox is strictly controlled, with measures to prevent leakage.
You can think of it as a GDPR-style regulation aimed squarely at AI. If you operate in the EU, this makes deliberate decisions about your AI data security all the more important!
What Options Do You Have?
Here, we focus on one of the most frequent use cases of AI today: Large Language Models (LLMs). These models are particularly important from a security perspective because, unlike traditional ML models, they are hard and expensive to self-host, which makes handing your data to a third-party API a tempting option.
Enterprise API Offerings
Many companies, like OpenAI, provide special deals for enterprises with terms ensuring data security. They even offer solutions to keep your data within your Cloud provider, such as OpenAI’s integration with Azure and Amazon Bedrock’s offerings for AWS (which include models from Anthropic and Meta). These options can be slightly more expensive than standard API pricing but are far more cost-effective than self-hosting.
Data Control Options: APIs can be configured to prevent data retention or secondary training.
Localized Processing: Some platforms (e.g., OpenAI on Azure or Amazon Bedrock) offer options to keep data within your chosen cloud environment (see the sketch after this list).
Cost vs. Risk Trade-off: While enterprise options may be pricier than regular API pricing, they remain significantly more economical than self-hosting large-scale models.
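To make the localized-processing option concrete, here is a minimal sketch of invoking an Anthropic model through Amazon Bedrock with `boto3`, so prompts are processed inside your own AWS account and region rather than a separate vendor endpoint. The region and model ID are assumptions; use whatever is enabled in your account, and remember that retention and training guarantees come from your agreement with AWS, not from the code itself.

```python
import json

import boto3

# Assumed region and model ID -- use whatever is enabled in your AWS account.
client = boto3.client("bedrock-runtime", region_name="eu-west-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Summarize our Q3 incident report in three bullets."}
    ],
})

# The request is served inside your AWS environment; no separate vendor API key
# or external endpoint is involved.
response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=body,
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```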
Self-Hosting LLMs
If your use case involves highly sensitive data and you cannot trust third-party APIs, self-hosting is an option to consider. Self-hosting is also practical with smaller models, which are well suited to relatively straightforward tasks that don’t require the power of large models. This approach offers maximum control over your data but requires substantial resources (a minimal deployment sketch follows the list below):
Infrastructure Costs: Hosting large models is computationally and financially expensive and requires robust infrastructure.
Use Case Alignment: Self-hosting is ideal for smaller-scale tasks or use cases requiring stringent security, such as internal business processes.
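To give a sense of what self-hosting involves in practice, below is a rough sketch of deploying an open-weight model to a private endpoint with the SageMaker Python SDK and its Hugging Face LLM container. The model ID and instance type are assumptions chosen for illustration; most of the real effort goes into sizing, networking (e.g., VPC-only endpoints), and cost control.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Inside SageMaker this resolves the notebook's IAM role; elsewhere, pass a role ARN.
role = sagemaker.get_execution_role()

# Container image for serving Hugging Face LLMs (defaults to the latest supported version).
llm_image = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",  # example open-weight model
        "SM_NUM_GPUS": "1",
    },
)

# The endpoint runs in your own AWS account; prompts never leave your environment.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumed GPU instance type -- size for your workload
)

print(predictor.predict({"inputs": "Classify this support ticket: ..."}))
```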
> At DataChef, we developed Damavand, a turnkey solution for deploying LLM infrastructure on the cloud, whether for APIs or for self-hosted models such as those running on AWS SageMaker. Damavand simplifies the setup process, allowing you to focus on your AI initiatives.
Local LLMs for Code Security
One specific use case for self-hosted or local LLMs is code generation. Many engineers rely on tools like GitHub Copilot to enhance productivity. However, these tools transmit parts of your codebase to external APIs, which can pose significant risks if the code contains sensitive or proprietary information. One solution is to deploy a small, efficient LLM locally. Techniques like quantization and frameworks like Ollama make this straightforward, and you can find capable small coding-focused LLMs on HuggingFace.
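As a rough sketch of that setup: assuming Ollama is running locally on its default port and a quantized coding model (`codellama` here, as an example) has been pulled with `ollama pull codellama`, a completion request never leaves your machine.

```python
import json
from urllib.request import Request, urlopen

# Assumes a local Ollama server (default port 11434) with a coding model pulled,
# e.g. `ollama pull codellama`. Prompt and code stay on your machine.
payload = json.dumps({
    "model": "codellama",
    "prompt": "Write a Python function that validates an IBAN checksum.",
    "stream": False,
}).encode("utf-8")

request = Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urlopen(request) as response:
    result = json.loads(response.read())

print(result["response"])
```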