Big Data 8 min read

Which Tech Giant Dominates Open‑Source on GitHub? 2017 Contribution Analysis

This article examines 2017 GitHub data to compare how major tech companies like Microsoft, Google, Amazon, and others contribute to open‑source projects, revealing surprising rankings, star counts, and the methodology behind the analysis using Google BigQuery and custom SQL queries.

ITPUB
ITPUB
ITPUB
Which Tech Giant Dominates Open‑Source on GitHub? 2017 Contribution Analysis

Background

Open‑source has become a major driver in the IT industry, and major tech companies now contribute large amounts of code. This analysis quantifies corporate contributions on GitHub for the year 2017.

Data source

All GitHub events for 2017 were obtained from the GitHub Archive ( https://www.githubarchive.org/). Only events from repositories that gained more than 20 stars were considered. Corporate affiliation was inferred from the email domain in commit records; users without a corporate email were excluded.

Processing

Google BigQuery ( https://cloud.google.com/bigquery/) was used to run a StandardSQL query that:

Extracts all events from the monthly tables `githubarchive.month.2017*`.

Identifies “WatchEvent” rows to compute the star count per repository and keeps repositories with >20 stars.

Joins those repositories with “PushEvent” rows, extracts the author email, and derives the domain (e.g., google.com).

Aggregates per domain the number of distinct contributors (“githubers”), the set of contributed repositories, and the total star count of those repositories.

Filters out personal email providers (gmail.com, yahoo.com, etc.) and keeps only domains with at least 30 contributors.

#standardSQL
WITH period AS (
  SELECT *
  FROM `githubarchive.month.2017*`
),
repo_stars AS (
  SELECT repo.id,
         COUNT(DISTINCT actor.login) AS stars,
         APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value AS repo_name
  FROM period
  WHERE type='WatchEvent'
  GROUP BY repo.id
  HAVING stars > 20
),
pushers_guess_emails_and_top_projects AS (
  SELECT *,
         REGEXP_EXTRACT(email, r'@(.*)') AS domain
  FROM (
    SELECT actor.id,
           APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value AS login,
           APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value AS email,
           COUNT(*) AS c,
           ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name, stars))) AS repos
    FROM period a
    JOIN repo_stars b ON a.repo.id = b.id
    WHERE type='PushEvent'
    GROUP BY actor.id
    HAVING c > 3
  )
)
SELECT *
FROM (
  SELECT domain,
         githubers,
         (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) AS repos_contributed_to,
         ARRAY(
           SELECT AS STRUCT
                  JSON_EXTRACT_SCALAR(repo, '$.repo_name') AS repo_name,
                  CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) AS stars,
                  COUNT(*) AS githubers_from_domain
           FROM UNNEST(repos) repo
           GROUP BY repo_name, stars
           HAVING githubers_from_domain > 1
           ORDER BY stars DESC
           LIMIT 3
         ) AS top,
         (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64))
          FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) AS sum_stars_projects_contributed_to,
         ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) AS repos
  FROM pushers_guess_emails_and_top_projects
  WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com','|'))
    AND domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru','|'))
  GROUP BY domain
  HAVING githubers > 30
)
ORDER BY githubers DESC;

Key findings

Microsoft: ~1,300 employees contributed to 825 repositories, accumulating about 260 k stars.

Google: ~900 active contributors pushed code to roughly 1,100 repositories, with a total of ~530 k stars (driven by TensorFlow, Angular, Kubernetes, etc.).

Amazon: only 134 contributors and 158 repositories, yielding ~27 k stars.

Star count is not proportional to contributor count; Google’s star impact is roughly double Microsoft’s despite only a 25 % larger repository set.

Other notable corporate contributors include Red Hat, IBM, Pivotal, Intel, and Facebook, each surpassing Amazon in star impact.

Additional companies such as Alibaba, Uber, Wix, GitHub, Apache, Tencent, Baidu, Apple, Mozilla, Oracle, Stanford, and MIT appear in lower‑ranked groups, with projects like Alibaba’s Ant Design.

Interpretation

Microsoft, once perceived as hostile to open‑source, now actively participates and holds platinum membership in the Linux Foundation and CNCF. Google leads in star popularity due to high‑impact projects. Amazon lags behind both in contributor count and star influence.

Reproducibility

The analysis pipeline can be reproduced by:

Downloading the 2017 GitHub Archive data.

Running the above StandardSQL query in BigQuery.

Adjusting the domain filter list to include or exclude additional email providers.

Full ranking tables are available in the original Medium article:

https://medium.freecodecamp.org/the-top-contributors-to-github-2017-be98ab854e87

.

SQLGitHubGoogle BigQuerytech giants
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.