Evaluating Pilosa on Dense, Low‑Cardinality Data Using the NYC Taxi Dataset
This article examines whether Pilosa, a bitmap index originally built for sparse, high‑cardinality data, can also handle dense relational data efficiently. It benchmarks Pilosa on the billion‑row NYC taxi trip dataset and compares its query performance with that of other database systems.
Pilosa is an open‑source distributed bitmap index designed for fast queries on massive, sparse, high‑cardinality datasets such as user attributes.
While Pilosa excels at filtering millions of attribute combinations in milliseconds for sparse user‑segmentation data, the authors seek a dense, low‑cardinality dataset to evaluate its broader applicability.
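The idea behind a bitmap index can be shown with a minimal sketch. It is not Pilosa's implementation or API: the BitmapIndex class, its method names, and the sample rides are all hypothetical, and Python integers stand in for compressed bitmaps. Each (field, value) pair gets one bitmap with a bit per record, so a filter on several attributes becomes a bitwise AND followed by a bit count.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index (hypothetical): one integer bitset per (field, value)."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def set_bit(self, field, value, column):
        # Mark record number `column` as having `field == value`.
        self.bitmaps[(field, value)] |= 1 << column

    def bitmap(self, field, value):
        # Return the bitset for one attribute value (0 if never seen).
        return self.bitmaps.get((field, value), 0)


def count(bits):
    """Number of set bits, i.e. number of matching records."""
    return bin(bits).count("1")


# Made-up sample records: (payment_method, passenger_count).
rides = [("card", 1), ("cash", 2), ("card", 2), ("card", 1)]

index = BitmapIndex()
for column, (payment, passengers) in enumerate(rides):
    index.set_bit("payment_method", payment, column)
    index.set_bit("passenger_count", passengers, column)

card = index.bitmap("payment_method", "card")
single = index.bitmap("passenger_count", 1)

# Rides paid by card AND carrying one passenger: a single bitwise AND.
card_single = count(card & single)
```

With few distinct values and a billion records, as in the taxi data, each of these bitmaps is long and heavily populated, which is the opposite of the sparse case the index was designed around.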
The NYC taxi trip dataset, containing over a billion records with fields like pickup/drop‑off times, locations, distance, fare, payment method, and passenger count, is chosen as the test case.
Four benchmark queries are defined: (1) the number of trips per cab type, (2) the average fare per passenger count, (3) the number of trips per passenger count and year, and (4) the number of trips grouped by passenger count, year, and trip distance, ordered by year and then by descending trip count.
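As a rough illustration of the shape of the fourth query, here is a sketch in plain Python over a handful of made-up rows. The assumptions are mine, not the article's: the value counted is trips, distance is rounded to a whole number before grouping, and results are ordered by year and then by descending count. The function name query4 and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Made-up rows: (pickup_datetime, passenger_count, trip_distance).
trips = [
    ("2014-06-10 09:00:00", 1, 1.2),
    ("2014-07-11 10:30:00", 1, 0.9),
    ("2014-08-12 18:45:00", 2, 3.4),
    ("2015-01-05 07:20:00", 1, 1.1),
    ("2015-02-06 22:10:00", 3, 5.8),
    ("2015-03-07 23:55:00", 3, 6.2),
    ("2015-04-08 12:00:00", 3, 5.6),
]


def query4(rows):
    """Count trips per (passenger_count, year, rounded distance)."""
    groups = Counter()
    for pickup, passengers, distance in rows:
        year = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").year
        groups[(passengers, year, round(distance))] += 1
    result = [(p, y, d, n) for (p, y, d), n in groups.items()]
    # Order by year ascending, then by count descending.
    result.sort(key=lambda r: (r[1], -r[3]))
    return result


for row in query4(trips):
    print(row)
```

A scan like this touches every row. A bitmap index instead answers each group with an intersection of the bitmaps for that passenger count, year, and distance bucket, followed by a bit count.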
A three‑node Pilosa cluster is deployed on AWS c4.8xlarge instances, with an additional c4.8xlarge instance used for data loading. Data is ingested with the command pdk taxi -b 2000000 -c 50 -p <pilosa-host-ip>:10101 -f <pdk_repo_location>/usecase/taxi/greenAndYellowUrls.txt, which takes roughly 2 hours 50 minutes to complete.
Performance results show that query 1 benefits from Pilosa’s pre‑computed storage, so network latency dominates its response time. The remaining queries also perform exceptionally well: query 3 completes in 0.177 seconds, comparable to high‑end GPU or Xeon Phi configurations, and Pilosa often outperforms the heterogeneous hardware setups it is compared with.
The authors conclude that, despite Pilosa’s bitmap compression not being optimized for dense data, the promising results justify further research and optimization for broader use cases.
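Why density matters for bitmap compression can be seen in a back-of-the-envelope sketch. This is not Pilosa's storage format; the chunk size and the three encodings below (a list of 16-bit positions, a raw bitmap, and a list of runs) are assumptions chosen only to show how storage cost shifts between sparse and dense data.

```python
CHUNK_BITS = 65536  # assumed chunk size for this sketch


def array_bytes(positions):
    """Cost of storing each set bit as a 16-bit position."""
    return 2 * len(positions)


def bitmap_bytes():
    """Cost of storing the chunk as a raw, uncompressed bitmap."""
    return CHUNK_BITS // 8


def run_count(positions):
    """Number of maximal runs of consecutive set bits."""
    runs = 0
    previous = None
    for p in sorted(positions):
        if previous is None or p != previous + 1:
            runs += 1
        previous = p
    return runs


def run_bytes(positions):
    """Cost of storing each run as a (start, length) pair of 16-bit values."""
    return 4 * run_count(positions)


sparse = set(range(0, CHUNK_BITS, 1000))        # 66 scattered bits
dense = set(range(CHUNK_BITS)) - {100, 40000}   # almost every bit set

for name, bits in (("sparse", sparse), ("dense", dense)):
    print(name, array_bytes(bits), bitmap_bytes(), run_bytes(bits))
```

In this toy model the position list is cheapest for the sparse chunk, while the dense chunk is cheapest as a few long runs. An encoding tuned for sparse data therefore leaves room for improvement on dense data.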
High Availability Architecture
Official account for High Availability Architecture.