
Can Pilosa Handle Dense Relational Data? A Deep Dive with NYC Taxi Dataset

Pilosa, originally built for sparse high‑cardinality user attributes, is evaluated on a dense, low‑cardinality NYC taxi dataset to see if it can serve as a general‑purpose index, with performance comparisons against Spark, PostgreSQL, Elasticsearch, and kdb+ across multiple query scenarios.

21CTO

Jun 11, 2017


Pilosa is an open‑source, distributed, in‑memory bitmap index that accelerates queries on massive datasets, which makes it a good fit for querying and profiling user data. But can it also be used for non‑sparse relational data?

Pilosa was initially designed for analyzing user attribute data that is sparse and high‑cardinality. When we set out to build a company around Pilosa, the first step was to evaluate its effectiveness on other data types.

For user‑segmentation data, attributes are numerous yet extremely sparse because many attributes apply to only a few users, while most users have only a few hundred or thousand attributes. Pilosa handles this elegantly, filtering any combination of millions of attributes in a few milliseconds to find matching users.
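The idea behind this kind of filtering can be sketched in a few lines. The following is a minimal illustration of a bitmap index, not Pilosa's implementation: each attribute is a bitset over user IDs (a plain Python integer here), and finding users with a combination of attributes is a bitwise AND. The class name, attribute names, and user IDs are all hypothetical.

```python
from functools import reduce


class BitmapIndex:
    """Toy bitmap index: one bitset per attribute, bit i set => user i has it."""

    def __init__(self):
        self.rows = {}  # attribute name -> int used as a bitset

    def set_bit(self, attribute, user_id):
        self.rows[attribute] = self.rows.get(attribute, 0) | (1 << user_id)

    def intersect(self, *attributes):
        """Return the sorted user IDs that have every listed attribute."""
        if not attributes:
            return []
        bitsets = [self.rows.get(a, 0) for a in attributes]
        combined = reduce(lambda a, b: a & b, bitsets)
        return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


# Hypothetical sample data: attribute -> users that carry it.
index = BitmapIndex()
for attribute, users in {
    "likes_jazz": [1, 4, 7],
    "owns_bike": [4, 7, 9],
    "lives_in_nyc": [2, 4, 9],
}.items():
    for user in users:
        index.set_bit(attribute, user)

segment = index.intersect("likes_jazz", "owns_bike")
print(segment)  # [4, 7]
```

A real index compresses the bitsets and shards them across nodes, but the query itself stays a chain of bitwise operations, which is why adding more attributes to the filter costs so little.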

Pilosa works well for sparse user‑segmentation data, but to become a general‑purpose index it must also handle dense, low‑cardinality data and support queries beyond segmentation.

To test Pilosa we needed a dense, low‑cardinality dataset that had also been explored with other solutions, so that we could compare performance. The billion‑record taxi‑trip dataset published as open data by the New York City government fits these criteria. It contains fields such as pickup and drop‑off times, locations, distance, fare, payment method, and passenger count, and it has already been analyzed in depth elsewhere, for example:

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance [1]

Kx 1.1 billion taxi ride benchmark highlights advantages of kdb+ architecture [2]

Should I Stay or Should I Go? NYC Taxis at the Airport [3]

Mark Litwintschik’s performance comparison series, which covers Spark, PostgreSQL, Elasticsearch, and others, was especially useful to us.

We ran the same four queries used in that series on Pilosa to compare its performance with the other solutions. The queries are:

1. How many trips were taken with each cab type?

2. What is the average total fare for each passenger count?

3. How many trips were taken each year for each passenger count?

4. For each combination of passenger count, year, and rounded trip distance, how many trips occurred, ordered by year and then by trip count descending?

We deployed a three‑node Pilosa cluster on AWS c4.8xlarge instances and added an extra c4.8xlarge node for data loading. Data was loaded using the pdk tool with the following command:

pdk taxi -b 2000000 -c 50 -p <pilosa-host-ip>:10101 -f <pdk_repo_location>/usecase/taxi/greenAndYellowUrls.txt

The loading process took about 2 hours 50 minutes, including downloading CSV files from S3, parsing them, and ingesting the data into Pilosa.

Note that the hardware and software differ in every setup, so the results cannot be compared directly. Pilosa also “cheated” on Query 1: its storage format pre‑computes that result, so most of the query time is network latency.
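To see why a pre‑computed result makes such a query nearly free, consider the sketch below. It is an illustration under assumptions, not Pilosa's actual storage format: the article only says the result is pre‑computed, and the `CountedField` class is hypothetical. The example field is cab type, which is what the first query in Litwintschik's benchmark groups on. The count per value is updated whenever a bit is set, so answering the query is a dictionary read rather than a scan.

```python
from collections import Counter


class CountedField:
    """Toy field that keeps a running count per value as records arrive."""

    def __init__(self):
        self.rows = {}           # value -> bitset of record IDs
        self.counts = Counter()  # value -> number of set bits, maintained on write

    def set_bit(self, value, record_id):
        mask = 1 << record_id
        current = self.rows.get(value, 0)
        if not current & mask:   # only count a record the first time it is set
            self.rows[value] = current | mask
            self.counts[value] += 1

    def count_by_value(self):
        # No scan over the data: the answer was built up during ingestion.
        return dict(self.counts)


# Hypothetical sample: five trips, identified by record IDs 0..4.
cab_type = CountedField()
for record_id, value in enumerate(["yellow", "green", "yellow", "yellow", "green"]):
    cab_type.set_bit(value, record_id)
cab_type.set_bit("yellow", 0)  # setting an existing bit does not change the count

print(cab_type.count_by_value())  # {'yellow': 3, 'green': 2}
```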

For the remaining queries, Pilosa performed exceptionally well, in some cases beating exotic hardware such as multi‑GPU setups. Query 3 completed in 0.177 seconds, comparable to the result on Nvidia Pascal Titan X GPUs. kdb+/q also performed strongly, but keep in mind that the Xeon Phi 7210 it ran on provides 256 hardware threads and 16 GB of memory, which brings its bandwidth closer to that of GPUs, albeit at a higher cost (about $2,400).
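A grouped count such as Query 3 (trips broken down by passenger count and year) maps naturally onto a bitmap index. The sketch below is one plausible way to express it, assuming each distinct value of a field is stored as a bitset over trip IDs; it is not the query Pilosa actually ran, and the sample data and helper names are hypothetical. Because both fields are low‑cardinality, the number of intersections stays small no matter how many trips there are.

```python
from itertools import product


def bitset(record_ids):
    """Build an int-backed bitset with the given record IDs set."""
    bits = 0
    for record_id in record_ids:
        bits |= 1 << record_id
    return bits


# Hypothetical sample of six trips (record IDs 0..5).
passenger_count = {1: bitset([0, 1, 3]), 2: bitset([2, 4, 5])}
pickup_year = {2014: bitset([0, 2, 3]), 2015: bitset([1, 4, 5])}


def grouped_counts(field_a, field_b):
    """Count records for every (value_a, value_b) pair via AND + popcount."""
    result = {}
    for (a, bits_a), (b, bits_b) in product(field_a.items(), field_b.items()):
        n = bin(bits_a & bits_b).count("1")
        if n:
            result[(a, b)] = n
    return result


counts = grouped_counts(passenger_count, pickup_year)
for key in sorted(counts):
    print(key, counts[key])
```

Each group costs one AND and one population count over compressed words, so the work grows with the number of value combinations rather than with a row‑by‑row scan.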

These results justify further investment in optimizing Pilosa for other workloads. Since Pilosa’s internal bitmap compression isn’t tuned for dense data, we have conducted additional research and achieved promising results, indicating substantial room for improvement.
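Some back‑of‑the‑envelope arithmetic shows why dense data stresses a bitmap format differently from sparse data. Pilosa's storage is based on Roaring bitmaps, and the sketch below uses the container sizes from the general Roaring format (sorted 16‑bit arrays, fixed 8 KiB bitmaps, and run‑length pairs). The article does not say which of these Pilosa used at the time or what the additional research involved, so treat this purely as an illustration; the function name is hypothetical.

```python
CHUNK_BITS = 65536  # Roaring splits the ID space into chunks of 2**16 values


def container_sizes(values):
    """Approximate payload bytes for one chunk under each container type."""
    values = sorted(set(values))
    runs = 0
    previous = None
    for v in values:
        if previous is None or v != previous + 1:
            runs += 1  # a new run starts whenever the sequence is broken
        previous = v
    return {
        "array": 2 * len(values),   # one 16-bit entry per set bit
        "bitmap": CHUNK_BITS // 8,  # fixed 8192 bytes regardless of content
        "run": 2 + 4 * runs,        # 16-bit run count + (start, length) pairs
    }


sparse = container_sizes([10, 500, 40000])  # three scattered bits
dense = container_sizes(range(60000))       # one long contiguous run

print(sparse)  # {'array': 6, 'bitmap': 8192, 'run': 14}
print(dense)   # {'array': 120000, 'bitmap': 8192, 'run': 6}
```

For the sparse chunk the array form is smallest, while for the dense chunk a run‑length form is orders of magnitude smaller than either alternative, which suggests how much headroom a format tuned for sparse data can leave on dense columns.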

For more details, see the Pilosa source code:

https://github.com/pilosa/pilosa
