Airbnb’s Listing Attribute Extraction Platform (LAEP): End-to-End Structured Information Extraction Using Machine Learning and NLP

Airbnb’s Listing Attribute Extraction Platform (LAEP) uses a custom NER model, word‑embedding mapping, and a BERT‑based scorer to automatically pull, normalize, and validate structured attributes from hosts’ unstructured text, boosting coverage for downstream tools and enhancing guest‑host matching at scale.

Airbnb Technology Team

Jan 31, 2024

Airbnb uses machine learning and natural language processing (NLP) to mine unstructured textual data for useful listing information, creating personalized experiences for guests.

Understanding and collecting structured data about listings is essential for matching guest needs (e.g., work‑friendly spaces, baby equipment). However, many attributes are not explicitly listed, leading to a mismatch between host descriptions and guest requirements.

To address this, Airbnb developed the Listing Attribute Extraction Platform (LAEP), an ML system that automatically extracts structured attributes from large‑scale unstructured text without requiring hosts to manually input every possible attribute.

LAEP extracts listing attributes, detects entities such as activities, facilities, and landmarks, and feeds the resulting structured data into downstream tools like the Attribute Prioritization System (APS) and the Listing Attribute Collection System (Eve), forming a comprehensive knowledge base.

Before LAEP, Airbnb relied on host‑edited pages, supplemental review flows (SRF), and third‑party integrations. Each of these sources covered only a limited set of attributes and collected relatively little data.

LAEP Architecture

LAEP consists of three main components:

Named Entity Recognition (NER): Identifies predefined entity categories (e.g., amenities, services, location features) in text. Airbnb built a custom NER model trained on 30,000 annotated examples from six data sources, using a CNN‑based architecture to label token spans.

Entity Mapping (EM): Maps detected phrases to standardized listing attributes in Airbnb’s attribute taxonomy. The process includes preprocessing (lower‑casing, lemmatization), projecting attributes into a word‑embedding space (fine‑tuned word2vec), and selecting the most similar attribute via cosine similarity, assigning a confidence score.

Entity Scoring (ES): Determines whether a mapped attribute actually exists in the listing. A fine‑tuned BERT classifier processes the detected phrase and surrounding context (±32 tokens) to output a discrete label (Yes, Unknown, No) and a confidence score.
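The hand‑off between the three components can be sketched as plain data types. This is a minimal illustration, not Airbnb's implementation: the class names, the field names, the use of character offsets, and the run_pipeline helper are all assumptions made for the example, and each stage is passed in as a callable so that any model could sit behind it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Entity:
    """NER output: an entity label plus a span (character offsets assumed)."""
    label: str
    start: int  # inclusive
    end: int    # exclusive


@dataclass(frozen=True)
class Mapping:
    """Entity Mapping output: a standardized attribute and a confidence."""
    attribute: str
    confidence: float


@dataclass(frozen=True)
class Score:
    """Entity Scoring output: "Yes", "Unknown" or "No", plus a confidence."""
    label: str
    confidence: float


@dataclass(frozen=True)
class ExtractedAttribute:
    phrase: str
    entity: Entity
    mapping: Mapping
    score: Score


def run_pipeline(
    text: str,
    ner: Callable[[str], List[Entity]],
    mapper: Callable[[str], Optional[Mapping]],
    scorer: Callable[[str, Entity], Score],
) -> List[ExtractedAttribute]:
    """Chain the three stages; phrases the mapper rejects are dropped."""
    results = []
    for entity in ner(text):
        phrase = text[entity.start:entity.end]
        mapping = mapper(phrase)
        if mapping is None:  # hypothetical convention: None means "no confident mapping"
            continue
        results.append(
            ExtractedAttribute(phrase, entity, mapping, scorer(text, entity))
        )
    return results
```

Keeping the stages behind simple callables means each one can be evaluated or swapped independently, which mirrors the separation of NER, mapping, and scoring described above.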

For NER, the model outputs tuples of <entity_label, start_index, end_index>. Evaluation on a 9:1 train‑test split uses strict matching, reporting precision, recall, and F1 per category.
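Strict matching can be made concrete with a small scoring function. The sketch below assumes that a prediction counts as correct only when its label, start index, and end index all equal those of a gold annotation; the function name and the report layout are illustrative choices, not taken from the source.

```python
from typing import Dict, Iterable, Tuple

# (entity_label, start_index, end_index), as in the NER output described above
Span = Tuple[str, int, int]


def strict_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, Dict[str, float]]:
    """Per-category precision, recall and F1 under strict (exact span) matching."""
    gold_set, pred_set = set(gold), set(pred)
    labels = {span[0] for span in gold_set | pred_set}
    report = {}
    for label in sorted(labels):
        g = {s for s in gold_set if s[0] == label}
        p = {s for s in pred_set if s[0] == label}
        true_positives = len(g & p)  # exact label and boundary agreement
        precision = true_positives / len(p) if p else 0.0
        recall = true_positives / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


gold = [("Amenity", 0, 5), ("Amenity", 10, 15), ("Landmark", 20, 30)]
pred = [("Amenity", 0, 5), ("Amenity", 10, 14), ("Landmark", 20, 30)]
report = strict_prf(gold, pred)
```

In this toy example the second Amenity prediction is off by one position at its end index, so it counts as both a false positive and a false negative, and Amenity precision and recall each come out at 0.5. When scoring more than one document, a document identifier would need to be added to each span so that identical offsets in different texts are not confused.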

Entity Mapping handles the many ways a single attribute can be described (e.g., over twelve variations for “key box”). Unsupervised learning reduces the labeling effort, and a confidence threshold filters low‑certainty mappings.
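The mapping step can be illustrated with a toy nearest‑attribute lookup. Everything specific here is an assumption for the sake of the example: the three‑dimensional vectors are made up (a real system would load fine‑tuned word2vec vectors), phrases are embedded by averaging word vectors, lemmatization is omitted, and the 0.8 threshold is arbitrary.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]

# Made-up embeddings for illustration only.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "lock": [0.9, 0.1, 0.0],
    "box": [0.8, 0.2, 0.1],
    "key": [0.85, 0.15, 0.05],
    "pool": [0.0, 0.9, 0.3],
    "swimming": [0.1, 0.8, 0.4],
}


def embed(phrase: str, table: Dict[str, Vector]) -> Optional[Vector]:
    """Lower-case the phrase and average the vectors of its known words."""
    words = [w for w in phrase.lower().split() if w in table]
    if not words:
        return None
    dim = len(table[words[0]])
    return [sum(table[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_phrase(
    phrase: str,
    attributes: Sequence[str],
    table: Dict[str, Vector],
    threshold: float = 0.8,
) -> Optional[Tuple[str, float]]:
    """Return (attribute, similarity) for the closest attribute, or None
    when nothing reaches the confidence threshold."""
    phrase_vec = embed(phrase, table)
    if phrase_vec is None:
        return None
    best: Optional[Tuple[str, float]] = None
    for attribute in attributes:
        attr_vec = embed(attribute, table)
        if attr_vec is None:
            continue
        similarity = cosine(phrase_vec, attr_vec)
        if best is None or similarity > best[1]:
            best = (attribute, similarity)
    if best is None or best[1] < threshold:
        return None
    return best


taxonomy = ["key box", "swimming pool"]
match = map_phrase("Lock box", taxonomy, TOY_EMBEDDINGS)
```

With these toy vectors, "Lock box" lands on "key box" with a similarity close to 1, while a phrase made only of unknown words, or one whose best similarity falls under the threshold, returns None and is filtered out.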

Entity Scoring employs a BERT‑based architecture. The input is the detected phrase together with its surrounding context, a window of roughly 65 words (32 before and 32 after the phrase), truncated to a maximum length of 512 tokens. The [CLS] token embedding passes through a fully connected layer with dropout and ReLU, and the result is projected to class probabilities over the three labels (Yes, Unknown, No).
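Two mechanical pieces of the scoring step are easy to show without a model: cutting the context window around the phrase, and turning the classifier's raw outputs into a label and a confidence. The sketch below works on whitespace‑split words rather than BERT tokens, and it assumes a softmax over three logits ordered Yes, Unknown, No; the source only says that class probabilities are produced, so both the softmax and the label order are assumptions, as are the function names.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ("Yes", "Unknown", "No")  # assumed order of the classifier outputs


def context_window(words: Sequence[str], start: int, end: int, radius: int = 32) -> List[str]:
    """Return the phrase words[start:end] plus up to `radius` words on each side."""
    lo = max(0, start - radius)
    hi = min(len(words), end + radius)
    return list(words[lo:hi])


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits: Sequence[float]) -> Tuple[str, float]:
    """Pick the most probable label and report its probability as the confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


words = [f"w{i}" for i in range(100)]
window = context_window(words, 50, 51)   # one-word phrase at position 50
label, confidence = decide([2.0, 0.5, -1.0])
```

For a one‑word phrase in the middle of a long text, the window holds 32 + 1 + 32 = 65 words, which matches the figure quoted above; near the start or end of the text the window is simply shorter. A real implementation would then tokenize the window and truncate it to the 512‑token limit before passing it to the model.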

Conclusion

The LAEP system provides an end‑to‑end pipeline that extracts, normalizes, and validates listing attributes from diverse textual sources while respecting privacy constraints. Deployed in downstream applications such as APS and Eve, LAEP enables Airbnb to better understand listings at scale, discover new attribute categories, and continuously improve host and guest experiences.
