Airbnb Data Privacy and Security Engineering: Inspekt Data Classification Service and Angmar Secret Detection
The second article in Airbnb’s privacy and security series covers two systems. Inspekt automatically classifies personal and sensitive data across diverse data stores using regexes, Aho‑Corasick tries, machine‑learning models, and custom validators, and it measures the quality of each validator. Angmar scans code repositories for secrets through CI checks and pre‑commit hooks. Planned work broadens coverage to more APIs and data stores.
Welcome to the second article of the "Airbnb Data Privacy and Security Engineering" series, which explains how Airbnb builds powerful, automated, and scalable data privacy and security capabilities.
The article discusses the challenges of locating personal and sensitive data across the company’s ecosystem, including constantly evolving data, manual classification errors, increasing privacy regulations, and the risk of leaking keys in code repositories.
To address these challenges, Airbnb built a data classification tool called Inspekt, consisting of two services: a Task Creator that determines what to scan and creates tasks, and a Scanner that samples and scans data to detect personal and sensitive information.
Inspekt supports four scanning methods:
Regexes: pattern matching for fixed‑format data such as dates, emails, etc.
Tries (Aho‑Corasick): substring matching for non‑fixed patterns like names.
Machine‑learning models: multi‑task CNN, BERT‑NER, and custom models for complex or multilingual data.
Hard‑coded code: custom validators written in code, e.g., an IBAN validator.
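To make two of these methods concrete, here is a minimal Python sketch of a regex detector and a hard‑coded validator. It is not Inspekt's code: the email pattern, the function names, and the choice of Python are assumptions made for illustration. The IBAN check uses the standard mod‑97 checksum and does not verify country‑specific lengths.

```python
import re

# Hypothetical, deliberately simple email pattern (not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> list[str]:
    """Regex method: return every email-like substring in the text."""
    return EMAIL_RE.findall(text)


def is_valid_iban(candidate: str) -> bool:
    """Hard-coded method: structural check plus the mod-97 checksum."""
    iban = candidate.replace(" ", "").upper()
    # Country code, two check digits, then 11 to 30 alphanumeric characters.
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

The checksum is what separates a validator from a plain regex here: a random string that happens to have the right shape fails the mod‑97 test, which keeps false positives down.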
Validators are stored as JSON blobs. An example validator configuration that detects columns or content containing the keyword "birthdate" is shown below:
{
  "dataElementName": "birthdateKeyword",
  "scanningMethods": [
    {
      "methodName": "birthdate_content_regex",
      "methodType": "content_regex",
      "contentRegexConfig": {
        "allowList": ["birthdate"]
      }
    },
    {
      "methodName": "birthdate_colname_regex",
      "methodType": "colname_regex",
      "colNameRegexConfig": {
        "allowList": ["birthdate"]
      }
    }
  ],
  "evaluationExpression": "birthdate_content_regex || birthdate_colname_regex"
}
The Scanner runs in a Kubernetes‑based distributed system. Each worker pulls a task from an SQS queue, samples data from the target store (MySQL, Hive, Elasticsearch, S3, etc.), applies each validator, stores the matches in a database, and finally deletes the processed SQS message.
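The following sketch shows how a Scanner worker might apply the birthdate validator to sampled data. It is an illustration under stated assumptions, not Airbnb's implementation: an in‑memory deque stands in for the SQS queue, the task shape (`column`, `sample`) is hypothetical, and the expression evaluator handles only `||` and `&&` without parentheses.

```python
import json
import re
from collections import deque

# The birthdate validator, with each scanning method as its own JSON object.
VALIDATOR = json.loads("""
{
  "dataElementName": "birthdateKeyword",
  "scanningMethods": [
    {"methodName": "birthdate_content_regex", "methodType": "content_regex",
     "contentRegexConfig": {"allowList": ["birthdate"]}},
    {"methodName": "birthdate_colname_regex", "methodType": "colname_regex",
     "colNameRegexConfig": {"allowList": ["birthdate"]}}
  ],
  "evaluationExpression": "birthdate_content_regex || birthdate_colname_regex"
}
""")


def run_method(method: dict, column: str, sample: list[str]) -> bool:
    """Run one scanning method against a column name and its sampled values."""
    if method["methodType"] == "content_regex":
        patterns = method["contentRegexConfig"]["allowList"]
        return any(re.search(p, v, re.IGNORECASE) for p in patterns for v in sample)
    if method["methodType"] == "colname_regex":
        patterns = method["colNameRegexConfig"]["allowList"]
        return any(re.search(p, column, re.IGNORECASE) for p in patterns)
    raise ValueError(f"unsupported methodType: {method['methodType']}")


def evaluate(validator: dict, column: str, sample: list[str]) -> bool:
    """Combine method results: OR over '||' clauses, AND over '&&' terms."""
    results = {
        m["methodName"]: run_method(m, column, sample)
        for m in validator["scanningMethods"]
    }
    return any(
        all(results[term.strip()] for term in clause.split("&&"))
        for clause in validator["evaluationExpression"].split("||")
    )


def scan(tasks: deque, validator: dict) -> list[dict]:
    """Process tasks in order, removing each one only after it is handled."""
    findings = []
    while tasks:
        task = tasks[0]  # peek at the task without removing it yet
        if evaluate(validator, task["column"], task["sample"]):
            findings.append(
                {"column": task["column"], "dataElement": validator["dataElementName"]}
            )
        tasks.popleft()  # stand-in for deleting the processed SQS message
    return findings
```

Removing the task only after processing mirrors the delete‑last step in the workflow: if a worker crashes mid‑scan, the task is still in the queue for another attempt.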
Inspekt also includes a Quality Measurement Service that tracks the precision, recall, and accuracy of each validator. Ground‑truth data (positive and negative samples) is collected, labeled using AWS Ground Truth, and fed back to retrain the ML models, improving detection over time.
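For reference, the three metrics can be computed from labeled samples as in the sketch below. The `Metrics` class and `measure` function are hypothetical names, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    accuracy: float


def measure(labels: list[bool], predictions: list[bool]) -> Metrics:
    """Compare validator predictions against ground-truth labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label and pred)
    fp = sum(1 for label, pred in pairs if not label and pred)
    fn = sum(1 for label, pred in pairs if label and not pred)
    tn = sum(1 for label, pred in pairs if not label and not pred)
    return Metrics(
        # Of everything flagged, how much was truly sensitive.
        precision=tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything truly sensitive, how much was flagged.
        recall=tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of all samples classified correctly.
        accuracy=(tp + tn) / len(pairs) if pairs else 0.0,
    )
```

Negative samples matter as much as positive ones: without them, false positives cannot be counted, so precision and accuracy cannot be measured.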
Beyond data stores, Airbnb built Angmar to detect and protect secrets (API keys, vendor keys, DB credentials) in the codebase hosted on GitHub Enterprise. Angmar consists of a CI check that scans each push using the open‑source detect‑secrets library and a pre‑commit hook that blocks secret‑containing commits. Customizations include Airbnb‑specific secret types, path‑based filters to reduce false positives, deduplication logic, and an allow‑list for urgent overrides.
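The customizations listed above (path filters, deduplication, allow‑list) can be sketched with the standard library alone. This example does not call detect‑secrets and is not Angmar's code: the detector patterns, ignored paths, and function names are assumptions chosen for illustration, and the sample key used in the usage note is the placeholder from AWS documentation.

```python
import hashlib
import re
from fnmatch import fnmatch

# Hypothetical detectors; a real deployment would rely on a library's plugins.
DETECTORS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

# Hypothetical path filters for locations that tend to hold fake credentials.
IGNORED_PATHS = ["*/test/*", "*.md"]


def fingerprint(secret: str) -> str:
    """Hash the secret so it can be tracked without storing the raw value."""
    return hashlib.sha256(secret.encode()).hexdigest()


def scan_push(files: dict[str, str], allow_list: frozenset = frozenset()) -> list[dict]:
    """Scan pushed files (path -> content) and return deduplicated findings."""
    seen = set()
    findings = []
    for path, content in files.items():
        # Path-based filter: skip files matching any ignored pattern.
        if any(fnmatch(path, pattern) for pattern in IGNORED_PATHS):
            continue
        for line_no, line in enumerate(content.splitlines(), start=1):
            for kind, pattern in DETECTORS.items():
                for match in pattern.finditer(line):
                    fp = fingerprint(match.group())
                    # Skip repeats of the same secret and allow-listed overrides.
                    if fp in seen or fp in allow_list:
                        continue
                    seen.add(fp)
                    findings.append(
                        {"path": path, "line": line_no, "type": kind, "fingerprint": fp}
                    )
    return findings
```

A CI check built this way would fail the push when `scan_push` returns any findings, for example for a file containing `AKIAIOSFODNN7EXAMPLE`. Passing that key's fingerprint in `allow_list` models the urgent‑override path.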
Future work includes extending Inspekt to scan Thrift APIs, third‑party apps (Google Drive, Box), and additional data stores such as DynamoDB and Redis.
The article concludes that the described architecture enables large‑scale detection of personal and sensitive data, laying the groundwork for upcoming privacy and security use cases.
Airbnb Technology Team
Official account of the Airbnb Technology Team, sharing Airbnb's tech innovations and real-world implementations, building a world where home is everywhere through technology.