Big Data 12 min read

Build a Serverless Spark Log Analysis Pipeline on Alibaba Cloud EMR

This guide walks you through using Alibaba Cloud EMR Serverless Spark to ingest OSS‑HDFS audit logs, create source, detail, and summary data‑warehouse tables with Spark SQL, orchestrate the tasks in a workflow, and schedule daily IP‑traffic analysis, complete with code snippets and UI screenshots.

Alibaba Cloud Big Data AI Platform

Jun 3, 2024

Build a Serverless Spark Log Analysis Pipeline on Alibaba Cloud EMR

Background

With the rapid growth of internet services, log data has become a valuable resource for monitoring system activity, user behavior, and business operations. Traditional analysis tools can no longer keep up with the volume, prompting the need for efficient collection, storage, and powerful, flexible analysis platforms. Apache Spark, designed for large‑scale data processing, is an ideal engine for high‑performance log analytics.

OSS‑HDFS Audit Log Overview

Alibaba Cloud OSS‑HDFS, powered by JindoFS, bridges OSS object storage with the HDFS ecosystem, providing high‑performance, compatible storage for Hadoop, Hive, Spark, Flink, and other big‑data frameworks. Enabling HDFS on an OSS bucket creates an / .sysinfo/auditlog directory that stores recent audit logs split by date.

Audit logs record every HDFS operation (read, write, delete, etc.) with details such as timestamp, user, IP, command, and target path. Sample entries illustrate delete, getfileinfo, and mkdir operations.

EMR Serverless Spark Workspace Overview

EMR Serverless Spark is a fully managed, serverless Spark platform built on the Spark Native Engine. It offers elastic, high‑efficiency compute without the need to manage infrastructure and is 100% Spark‑compatible.

EMR Serverless Spark Task Development

Source Data Layer (ODS)

Create an external CSV table that points to the audit‑log directory. The logs are tab‑delimited, so sep='\t' is specified. Use the built‑in variable ${ds} for the business date (T‑1).

DROP TABLE IF EXISTS s_oss_hdfs_audit_tmp;</code><code>CREATE TABLE s_oss_hdfs_audit_tmp (</code><code>  tm string,</code><code>  allowed string,</code><code>  ugi string,</code><code>  ip string,</code><code>  cmd string,</code><code>  src string,</code><code>  dst string,</code><code>  perm string,</code><code>  proto string</code><code>) USING csv</code><code>OPTIONS (</code><code>  path 'oss://<BUCKET_NAME>.<REGION_ID>.oss-dls.aliyuncs.com/.sysinfo/auditlog/${ds}',</code><code>  sep '\t'</code><code>);

Data Warehouse Detail Layer (DWD)

Transform the ODS table into a Parquet table partitioned by date, converting timestamps and extracting field values.

CREATE TABLE IF NOT EXISTS dwd_oss_hdfs_audit_di (</code><code>  access_time timestamp,</code><code>  allowed string,</code><code>  ugi string,</code><code>  ip string,</code><code>  cmd string,</code><code>  src string,</code><code>  dst string,</code><code>  perm string,</code><code>  proto string,</code><code>  dt string</code><code>) USING parquet PARTITIONED BY (dt);</code><code>INSERT OVERWRITE TABLE dwd_oss_hdfs_audit_di PARTITION (dt='${ds}')</code><code>SELECT</code><code>  to_timestamp(tm),</code><code>  split_part(allowed,'=',2),</code><code>  split_part(ugi,'=',2),</code><code>  split_part(ip,'=',2),</code><code>  split_part(cmd,'=',2),</code><code>  split_part(src,'=',2),</code><code>  split_part(dst,'=',2),</code><code>  split_part(perm,'=',2),</code><code>  split_part(proto,'=',2)</code><code>FROM s_oss_hdfs_audit_tmp;

Data Warehouse Summary Layer (DWS)

Aggregate the detail layer to find the top 20 IP addresses by request count for the previous day.

CREATE TABLE IF NOT EXISTS dws_oss_hdfs_ip_ana (</code><code>  ip string,</code><code>  cnt bigint,</code><code>  dt string</code><code>) USING parquet PARTITIONED BY (dt);</code><code>INSERT OVERWRITE TABLE dws_oss_hdfs_ip_ana PARTITION (dt='${ds}')</code><code>SELECT ip, count(*) cnt</code><code>FROM dwd_oss_hdfs_audit_di</code><code>WHERE dt='${ds}'</code><code>GROUP BY ip</code><code>ORDER BY cnt DESC</code><code>LIMIT 20;

EMR Serverless Spark Workflow Orchestration

Create a workflow that links the three SQL tasks (ODS, DWD, DWS) and schedule it to run daily at 00:05.

In the workflow editor, add the three nodes in order, set upstream dependencies (DWD depends on ODS, DWS depends on DWD), and publish the workflow.

After publishing, enable the schedule switch. The workflow will execute automatically each day, and you can also trigger a manual run.

To view the results, run a simple query on the summary table:

SELECT * FROM dws_oss_hdfs_ip_ana WHERE dt='${ds}';

The query returns the top 20 IP addresses that accessed OSS‑HDFS on the previous day.

Conclusion

This tutorial demonstrates the end‑to‑end process of building a log‑analysis application with Alibaba Cloud EMR Serverless Spark, covering data ingestion, warehouse construction, task orchestration, scheduling, and interactive querying.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL Data Warehouse Apache Spark EMR Serverless Spark

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.