Can Pilosa Handle Dense Relational Data? A Deep Dive with NYC Taxi Dataset

Pilosa, originally built for sparse high‑cardinality user attributes, is evaluated on a dense, low‑cardinality NYC taxi dataset to see if it can serve as a general‑purpose index, with performance comparisons against Spark, PostgreSQL, Elasticsearch, and kdb+ across multiple query scenarios.

21CTO
Pilosa is an open‑source distributed in‑memory bitmap index that accelerates queries on massive datasets, making it suitable for user‑data queries and profiling; but can it also be used for non‑sparse relational data?

Pilosa was initially designed for analyzing user attribute data that is sparse and high‑cardinality. When we set out to build a company around Pilosa, the first step was to evaluate its effectiveness on other data types.

For user‑segmentation data, attributes are numerous yet extremely sparse because many attributes apply to only a few users, while most users have only a few hundred or thousand attributes. Pilosa handles this elegantly, filtering any combination of millions of attributes in a few milliseconds to find matching users.
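The core idea can be shown with a minimal sketch. It assumes one bitmap per attribute, with bit positions standing for user IDs, and uses plain Python integers as bitsets. The `BitmapIndex` class, the attribute names, and the user IDs are all hypothetical; this is not Pilosa's API or storage format.

```python
from functools import reduce


class BitmapIndex:
    """Toy bitmap index: one bitmap (a Python int) per attribute."""

    def __init__(self):
        # attribute name -> int used as a bitset of user IDs
        self.rows = {}

    def set_bit(self, attribute, user_id):
        """Record that user_id has the given attribute."""
        self.rows[attribute] = self.rows.get(attribute, 0) | (1 << user_id)

    def intersect(self, *attributes):
        """Return the sorted user IDs that have every listed attribute."""
        if not attributes:
            return []
        bitmaps = [self.rows.get(a, 0) for a in attributes]
        combined = reduce(lambda x, y: x & y, bitmaps)
        return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


index = BitmapIndex()
for attribute, user_id in [
    ("likes_jazz", 3), ("likes_jazz", 7), ("likes_jazz", 42),
    ("owns_bike", 7), ("owns_bike", 42), ("owns_bike", 99),
    ("lives_in_nyc", 7), ("lives_in_nyc", 42),
]:
    index.set_bit(attribute, user_id)

matching_users = index.intersect("likes_jazz", "owns_bike", "lives_in_nyc")
print(matching_users)
```

Filtering on a combination of attributes reduces to a bitwise AND across their bitmaps, which is why adding more attributes to a filter stays cheap.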

Pilosa illustration

While Pilosa works well for sparse user-segmentation data, becoming a general-purpose index requires that it also handle dense, low-cardinality data and support more than segmentation queries.

To test Pilosa we needed a dense, low-cardinality dataset that had also been explored with other solutions, so that performance could be compared. The New York City government's publicly available taxi-trip dataset, with more than a billion records, fits these criteria. It contains fields such as pickup and drop-off times, locations, distance, fare, payment method, and passenger count.

The dataset has already been analyzed in several published write-ups:

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance [1]

Kx 1.1 billion taxi ride benchmark highlights advantages of kdb+ architecture [2]

Should I Stay or Should I Go? NYC Taxis at the Airport [3]

Mark Litwintschik’s performance comparison series, which includes Spark, PostgreSQL, Elasticsearch, and others, is especially useful for us.

We executed the same four queries on Pilosa to compare its performance with other solutions. The queries are:

1. How many rides were there for each cab type?

2. What is the average fare for each passenger count?

3. How many rides were there per passenger count per year?

4. For each combination of passenger count, year, and trip distance, how many rides were there, sorted by year and then by ride count descending?
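Grouped counts like these map naturally onto bitmap operations when the fields are low-cardinality. The sketch below is one way to picture it, assuming one bitmap per field value with bit positions standing for trip IDs. The field names `passenger_count` and `year`, the helper functions, and the seven toy trips are hypothetical and are not taken from the benchmark or from Pilosa's API.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy trips: (trip_id, passenger_count, pickup_year)
trips = [
    (0, 1, 2013), (1, 1, 2013), (2, 2, 2013),
    (3, 1, 2014), (4, 2, 2014), (5, 2, 2014), (6, 1, 2014),
]

# field name -> field value -> bitmap of trip IDs (Python int as a bitset)
rows = defaultdict(lambda: defaultdict(int))
for trip_id, passenger_count, year in trips:
    rows["passenger_count"][passenger_count] |= 1 << trip_id
    rows["year"][year] |= 1 << trip_id


def popcount(bitmap):
    """Number of set bits, i.e. number of trips in the bitmap."""
    return bin(bitmap).count("1")


def grouped_counts(field_a, field_b):
    """Count trips for every (value_a, value_b) pair via AND + popcount."""
    counts = {}
    pairs = product(rows[field_a].items(), rows[field_b].items())
    for (value_a, bitmap_a), (value_b, bitmap_b) in pairs:
        n = popcount(bitmap_a & bitmap_b)
        if n:
            counts[(value_a, value_b)] = n
    return counts


rides_per_group = grouped_counts("passenger_count", "year")
for group, n in sorted(rides_per_group.items()):
    print(group, n)
```

Because each field has only a handful of distinct values, the number of intersections stays small even when every bitmap covers a very large number of trips.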

We deployed a three‑node Pilosa cluster on AWS c4.8xlarge instances and added an extra c4.8xlarge node for data loading. Data was loaded using the pdk tool with the following command:

pdk taxi -b 2000000 -c 50 -p <pilosa-host-ip>:10101 -f <pdk_repo_location>/usecase/taxi/greenAndYellowUrls.txt

The loading process took about 2 hours 50 minutes, including downloading CSV files from S3, parsing them, and ingesting the data into Pilosa.

Performance results

Note that each hardware/software combination differs, making direct comparisons difficult. Pilosa “cheated” on Query 1 because its storage format pre‑computes that result, so most of the query time is network latency.
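One way to picture a pre-computed result of this kind is a structure that keeps a running count of set bits per row as data is written, so that reading the totals needs no scan. The sketch below assumes that design for illustration only; the `CountedRows` class and the cab-type values are hypothetical, and the article does not describe Pilosa's actual mechanism.

```python
class CountedRows:
    """Toy field that keeps a running bit count per row (assumed design)."""

    def __init__(self):
        self.bitmaps = {}  # row label -> int used as a bitset of trip IDs
        self.counts = {}   # row label -> number of set bits

    def set_bit(self, row, column):
        """Set a bit and update the row's count only if the bit is new."""
        mask = 1 << column
        current = self.bitmaps.get(row, 0)
        if not current & mask:
            self.bitmaps[row] = current | mask
            self.counts[row] = self.counts.get(row, 0) + 1

    def count(self, row):
        """Constant-time lookup; no bitmap scan is needed."""
        return self.counts.get(row, 0)


cab_type = CountedRows()
for trip_id, cab in enumerate(["yellow", "green", "yellow", "yellow"]):
    cab_type.set_bit(cab, trip_id)
cab_type.set_bit("yellow", 0)  # duplicate write, must not change the count

rides_by_cab_type = {row: cab_type.count(row) for row in cab_type.bitmaps}
print(rides_by_cab_type)
```

With counts maintained at write time, a per-row total is a dictionary lookup, which is consistent with the query time being dominated by the network round trip.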

For the remaining queries, Pilosa performed exceptionally well, in some cases beating setups built on specialized hardware such as multiple GPUs. Query 3 completed in 0.177 seconds, comparable to the result obtained on Nvidia Pascal Titan X hardware. kdb+/q also performed strongly, but remember that the Xeon Phi 7210 provides 256 hardware threads and 16 GB of memory, which brings its bandwidth closer to that of a GPU, albeit at a higher cost (about $2400).

These results justify further investment in optimizing Pilosa for other workloads. Since Pilosa’s internal bitmap compression isn’t tuned for dense data, we have conducted additional research and achieved promising results, indicating substantial room for improvement.
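To illustrate why the encoding matters for dense data, the sketch below compares rough storage costs of three common ways to encode a 65,536-bit chunk: a sorted array of set positions, an uncompressed bitmap, and run-length encoding. The byte costs are simplified assumptions in the style of Roaring-like containers. This is not a description of Pilosa's on-disk format or of the optimization the authors researched.

```python
CHUNK_BITS = 65536  # size of one chunk of the bitmap, in bits


def count_runs(sorted_values):
    """Number of maximal runs of consecutive integers."""
    runs = 0
    previous = None
    for value in sorted_values:
        if previous is None or value != previous + 1:
            runs += 1
        previous = value
    return runs


def estimated_bytes(sorted_values):
    """Rough storage cost of each encoding for one chunk (assumed costs)."""
    return {
        "array": 2 * len(sorted_values),       # one 16-bit entry per set bit
        "bitmap": CHUNK_BITS // 8,             # fixed size, one bit per position
        "run": 4 * count_runs(sorted_values),  # 16-bit start + 16-bit length
    }


# Sparse chunk: one set bit every 1000 positions.
sparse = list(range(0, CHUNK_BITS, 1000))
# Dense chunk: almost every bit set, with a gap every 5000 positions.
dense = [v for v in range(CHUNK_BITS) if v % 5000 != 0]

print("sparse:", estimated_bytes(sparse))
print("dense: ", estimated_bytes(dense))
```

In this toy model the array encoding is the smallest option for the sparse chunk and the largest for the dense one, while run-length encoding shrinks the dense chunk to a few dozen bytes. That gap is the kind of headroom a dense-data-aware encoding could exploit.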

For more details, see the Pilosa source code:

https://github.com/pilosa/pilosa

Related links:

[1] Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

[2] Kx 1.1 billion taxi ride benchmark highlights advantages of kdb+ architecture

[3] Should I Stay or Should I Go? NYC Taxis at the Airport

NYC Taxi Trip Record Data

Pilosa Project Repository

Pilosa Transportation Use Cases
