Airbnb’s Listing Attribute Extraction Platform (LAEP): End-to-End Structured Information Extraction Using Machine Learning and NLP
Airbnb’s Listing Attribute Extraction Platform (LAEP) uses a custom NER model, word‑embedding mapping, and a BERT‑based scorer to automatically pull, normalize, and validate structured attributes from hosts’ unstructured text, boosting coverage for downstream tools and enhancing guest‑host matching at scale.
Airbnb uses machine learning and natural language processing (NLP) to mine unstructured textual data for useful listing information, creating personalized experiences for guests.
Understanding and collecting structured data about listings is essential for matching guest needs (e.g., work‑friendly spaces, baby equipment). However, many attributes are not explicitly listed, leading to a mismatch between host descriptions and guest requirements.
To address this, Airbnb developed the Listing Attribute Extraction Platform (LAEP), an ML system that automatically extracts structured attributes from large‑scale unstructured text without requiring hosts to manually input every possible attribute.
LAEP extracts listing attributes, detects entities such as activities, facilities, and landmarks, and feeds the resulting structured data into downstream tools like the Attribute Prioritization System (APS) and the Listing Attribute Collection System (Eve), forming a comprehensive knowledge base.
Before LAEP, Airbnb relied on host‑edited pages, supplemental review flows (SRF), and third‑party integrations. These sources covered only a limited set of attributes and collected comparatively little data.
LAEP Architecture
LAEP consists of three main components:
Named Entity Recognition (NER): Identifies predefined entity categories (e.g., amenities, services, location features) in text. Airbnb built a custom NER model trained on 30,000 annotated examples from six data sources, using a CNN‑based architecture to label token spans.
Entity Mapping (EM): Maps detected phrases to standardized listing attributes in Airbnb’s attribute taxonomy. The process includes preprocessing (lower‑casing, lemmatization), projecting attributes into a word‑embedding space (fine‑tuned word2vec), and selecting the most similar attribute via cosine similarity, assigning a confidence score.
Entity Scoring (ES): Determines whether a mapped attribute actually exists in the listing. A fine‑tuned BERT classifier processes the detected phrase and surrounding context (±32 tokens) to output a discrete label (Yes, Unknown, No) and a confidence score.
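For NER, the model outputs tuples of <entity_label, start_index, end_index>. The model is evaluated on a 9:1 train‑test split using strict matching: a predicted span counts as correct only if its label and both boundaries match the annotation exactly. Precision, recall, and F1 are reported per category.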
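The strict-matching evaluation can be illustrated with a minimal sketch. It assumes all spans come from a single text and are represented as plain tuples; the function name `strict_match_scores` and the sample spans are hypothetical and not taken from Airbnb's implementation.

```python
def strict_match_scores(gold, predicted):
    """Per-category precision, recall, and F1 under strict matching.

    gold, predicted: iterables of (entity_label, start_index, end_index).
    A prediction is a true positive only if the whole tuple matches.
    """
    gold_set, pred_set = set(gold), set(predicted)
    labels = sorted({g[0] for g in gold_set} | {p[0] for p in pred_set})
    scores = {}
    for label in labels:
        g = {span for span in gold_set if span[0] == label}
        p = {span for span in pred_set if span[0] == label}
        tp = len(g & p)  # exact label + boundary matches
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up annotations: the second amenity span is off by one token,
# so strict matching counts it as both a false positive and a miss.
gold = [("amenity", 3, 5), ("amenity", 10, 11), ("location", 20, 23)]
predicted = [("amenity", 3, 5), ("amenity", 10, 12), ("location", 20, 23)]

for label, s in strict_match_scores(gold, predicted).items():
    print(label, s)
```

To score a whole test set rather than one text, each tuple would also need a document identifier so that spans from different texts cannot collide.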
Entity Mapping handles the many ways a single attribute can be described (e.g., over twelve variations for “key box”). Because the mapping relies on unsupervised learning, it reduces the labeling effort, and a confidence threshold filters out low‑certainty mappings.
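As a rough illustration of the mapping step, the sketch below averages word vectors for a phrase, compares it with each candidate attribute by cosine similarity, and discards matches below a threshold. Everything concrete here is an assumption for demonstration: the three-dimensional `TOY_EMBEDDINGS` are hand-made stand-ins for the fine‑tuned word2vec vectors, lemmatization is omitted, and the threshold of 0.9 is arbitrary.

```python
import math

# Hand-made toy vectors, NOT real word2vec output.
TOY_EMBEDDINGS = {
    "lock": [0.90, 0.10, 0.00],
    "key": [0.85, 0.15, 0.05],
    "box": [0.80, 0.20, 0.10],
    "swimming": [0.10, 0.95, 0.10],
    "pool": [0.00, 0.90, 0.20],
    "crib": [0.10, 0.10, 0.90],
}


def embed(phrase):
    """Lower-case the phrase and average the vectors of known words."""
    tokens = [t for t in phrase.lower().split() if t in TOY_EMBEDDINGS]
    if not tokens:
        return None
    dims = len(TOY_EMBEDDINGS[tokens[0]])
    return [sum(TOY_EMBEDDINGS[t][i] for t in tokens) / len(tokens)
            for i in range(dims)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_entity(phrase, attributes, threshold=0.9):
    """Return (attribute, confidence) for the closest attribute, or None
    if nothing is embeddable or the best score is below the threshold."""
    vec = embed(phrase)
    if vec is None:
        return None
    best_attr, best_score = None, -1.0
    for attr in attributes:
        attr_vec = embed(attr)
        if attr_vec is None:
            continue
        score = cosine(vec, attr_vec)
        if score > best_score:
            best_attr, best_score = attr, score
    if best_attr is None or best_score < threshold:
        return None
    return best_attr, best_score


ATTRIBUTES = ["key box", "swimming pool", "crib"]
print(map_entity("Lock Box", ATTRIBUTES))   # maps to "key box" with high confidence
print(map_entity("pool crib", ATTRIBUTES))  # ambiguous phrase, filtered by the threshold
```

Returning `None` for low-confidence matches mirrors the filtering role of the confidence threshold: an unmapped phrase is usually less harmful downstream than a wrong attribute.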
Entity Scoring employs a BERT‑based architecture. Input tokens are truncated to a maximum length of 512, and up to 32 words on each side of the detected phrase provide context (about 65 words in total when the phrase itself is counted). The [CLS] token embedding passes through a fully connected layer, dropout, and ReLU projection to produce class probabilities.
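The context-window step can be sketched as follows, assuming the text is already split into words and the detected phrase is given as a half-open `[start, end)` word span. The function name `context_window` and its signature are hypothetical; the subword tokenization and truncation to 512 tokens would happen later in the BERT tokenizer and are not shown.

```python
def context_window(words, start, end, window=32):
    """Return the detected phrase plus up to `window` words on each side.

    words: list of words in the listing text.
    start, end: half-open word span [start, end) of the detected phrase.
    """
    if not 0 <= start < end <= len(words):
        raise ValueError("span must satisfy 0 <= start < end <= len(words)")
    left = max(0, start - window)        # clip at the beginning of the text
    right = min(len(words), end + window)  # clip at the end of the text
    return words[left:right]


# Placeholder text of 100 dummy words with a one-word phrase at index 50.
words = [f"w{i}" for i in range(100)]
snippet = context_window(words, 50, 51)
print(len(snippet), snippet[0], snippet[-1])
```

For a one-word phrase in the middle of a long text, the window contains 32 + 1 + 32 = 65 words; near the start or end of the text it is simply shorter.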
Conclusion
The LAEP system provides an end‑to‑end pipeline that extracts, normalizes, and validates listing attributes from diverse textual sources while respecting privacy constraints. Deployed in downstream applications such as APS and Eve, LAEP enables Airbnb to better understand listings at scale, discover new attribute categories, and continuously improve host and guest experiences.
Airbnb Technology Team
Official account of the Airbnb Technology Team, sharing Airbnb's tech innovations and real-world implementations, building a world where home is everywhere through technology.