A Step-by-Step Guide to Uncovering Digital Complexity with GitHub Innovation Graph Data

Introduction

Traditional economic measures, such as physical exports, patents, or scientific publications, have long been used to gauge the complexity of national economies. However, they miss a critical modern component: software. Code doesn't pass through customs; it travels via git pushes, cloud services, and package managers. This invisible productive knowledge has been called the "digital dark matter" of the economy. In a study published in Research Policy, researchers Sándor Juhász, Johannes Wachs, Jermain Kaminski, and César A. Hidalgo used data from the GitHub Innovation Graph to make it visible. They applied the Economic Complexity Index (ECI) to software production data and found that the resulting measure of digital complexity helps predict GDP, inequality, and emissions beyond what traditional indicators capture. This how-to guide walks you through their methodology so you can replicate and extend their work.

Source: github.blog

What You Need

Step 1: Access and Understand the GitHub Innovation Graph Data

The GitHub Innovation Graph provides quarterly data on developer activity aggregated by economy and programming language. For each economy (identified by IP address geolocation), the dataset includes the number of developers who pushed code in a given language during the quarter. Begin by downloading the latest release (Q4 2025 in the original study). Load the data into your analysis environment and inspect its structure: rows represent economy-language pairs, with a count column for developers.
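As a rough sketch of the loading step, the snippet below reads an economy-language table with pandas. The column names (num_pushers, language, iso2_code, year, quarter) are an assumption about the file layout and should be checked against the release you download; the sample rows are invented for illustration and are not real Innovation Graph values. In practice you would pass the path of the downloaded CSV instead of the inline sample.

```python
import io
import pandas as pd

# Invented sample rows in the ASSUMED column layout of the languages table.
SAMPLE_CSV = """num_pushers,language,iso2_code,year,quarter
1200,Python,DE,2023,1
800,JavaScript,DE,2023,1
150,Rust,DE,2023,1
900,Python,BR,2023,1
1100,JavaScript,BR,2023,1
300,Python,NA,2023,1
"""

REQUIRED_COLUMNS = {"num_pushers", "language", "iso2_code", "year", "quarter"}


def load_languages(source):
    """Read the economy-language table and check its basic structure."""
    # keep_default_na=False stops pandas from turning the ISO code "NA"
    # (Namibia) into a missing value; only empty cells count as missing.
    df = pd.read_csv(source, keep_default_na=False, na_values=[""])
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"unexpected layout, missing columns: {sorted(missing)}")
    df["num_pushers"] = df["num_pushers"].astype(int)
    return df


languages = load_languages(io.StringIO(SAMPLE_CSV))
print(languages.head())
print(languages.groupby(["year", "quarter"]).size())
```

The keep_default_na setting matters for any table keyed by two-letter economy codes, since the default parser would otherwise drop one economy without warning.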

Step 2: Prepare the Data for Analysis

Filter to an appropriate time window (e.g., a single year or quarter). Aggregate counts by summing across quarters if needed. Create a country-by-language matrix where each cell contains the number of developers in country i using language j. Keep the cells as raw counts at this stage: the RCA formula in Step 3 already divides by each country's total, which is what removes size bias, and normalizing beforehand would change the result. If a language has no recorded developers in a country, set the cell to 0.
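The filtering and pivoting described above could look like the following minimal sketch. It reuses the assumed column names (iso2_code, language, num_pushers, year, quarter); build_matrix is a hypothetical helper and the counts are invented.

```python
import pandas as pd

# Invented rows in the assumed long format (one row per economy, language, quarter).
example = pd.DataFrame(
    {
        "iso2_code": ["DE", "DE", "DE", "DE", "BR", "BR", "BR"],
        "language": ["Python", "Python", "JavaScript", "Rust", "Python", "JavaScript", "Rust"],
        "year": [2023, 2023, 2023, 2023, 2023, 2023, 2022],
        "quarter": [1, 2, 1, 1, 1, 1, 4],
        "num_pushers": [1200, 1300, 800, 150, 900, 1100, 120],
    }
)


def build_matrix(df, year):
    """Sum counts over the quarters of `year` and pivot to economies x languages."""
    window = df[df["year"] == year]
    # Pairs absent from the window are filled with 0.
    return window.pivot_table(
        index="iso2_code",
        columns="language",
        values="num_pushers",
        aggfunc="sum",
        fill_value=0,
    )


matrix = build_matrix(example, 2023)
print(matrix)
```

Summing quarterly counts can count the same developer more than once within a year; averaging across quarters (aggfunc="mean") is an alternative if that matters for your analysis.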

Step 3: Compute Revealed Comparative Advantage (RCA)

For each country–language pair, calculate the Revealed Comparative Advantage using the formula:

RCA_{ij} = (dev_{ij} / sum_j dev_{ij}) / (sum_i dev_{ij} / sum_{ij} dev_{ij})

This measures how concentrated a country is in a language relative to the global average. An RCA of 1 or more indicates specialization: the language's share of the country's developers is at least as large as its share of developers worldwide. Binarize the matrix: set values to 1 if RCA >= 1, else 0. This creates a binary matrix M where rows are countries and columns are languages.
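The RCA formula and the binarization can be sketched with NumPy as follows. The function names are hypothetical, the 2x2 counts are invented, and treating undefined ratios (empty rows or columns) as 0 is an assumption of this sketch rather than something the guide specifies.

```python
import numpy as np


def rca_matrix(counts):
    """RCA_ij = (dev_ij / sum_j dev_ij) / (sum_i dev_ij / sum_ij dev_ij)."""
    counts = np.asarray(counts, dtype=float)
    row_total = counts.sum(axis=1, keepdims=True)  # developers per country
    col_total = counts.sum(axis=0, keepdims=True)  # developers per language
    grand_total = counts.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        rca = (counts / row_total) / (col_total / grand_total)
    # Assumption: ratios that are undefined (0/0) are treated as no specialization.
    return np.nan_to_num(rca, nan=0.0, posinf=0.0, neginf=0.0)


def binarize(rca, threshold=1.0):
    """Binary matrix M: 1 where RCA >= threshold, else 0."""
    return (np.asarray(rca) >= threshold).astype(int)


# Two countries, two languages, with mirrored specializations.
counts = np.array([[80, 20], [20, 80]])
rca = rca_matrix(counts)
M = binarize(rca)
print(rca)
print(M)
```

In this toy case each language holds half of all developers worldwide, so a country with 80 percent of its developers in one language has an RCA of 0.8 / 0.5 = 1.6 there and 0.4 in the other.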

Step 4: Calculate the Economic Complexity Index (ECI) for Software

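Start from two quantities computed on the binary matrix M: diversity (the number of languages a country specializes in) and ubiquity (the number of countries specializing in a language). The Method of Reflections iterates between the two, repeatedly averaging the ubiquity of a country's languages and the diversity of a language's countries. In practice the same ranking is obtained directly as an eigenvector: the classic ECI is the eigenvector associated with the second-largest eigenvalue of the country-by-country matrix formed by normalizing M by diversity and ubiquity, standardized to zero mean and unit variance. Use a standard implementation (e.g., the economic_complexity Python library) or custom code. The resulting ECI values for each country capture its software complexity.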
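For readers who prefer custom code, here is a minimal sketch of the eigenvector formulation of ECI. It is one common way to compute the index, not necessarily the exact routine used in the study; eci_from_binary is a hypothetical name, and orienting the sign so that ECI correlates positively with diversity is a convention assumed here.

```python
import numpy as np


def eci_from_binary(M):
    """ECI from a binary country x language matrix via the second eigenvector."""
    M = np.asarray(M, dtype=float)
    diversity = M.sum(axis=1)  # languages per country
    ubiquity = M.sum(axis=0)   # countries per language
    if (diversity == 0).any() or (ubiquity == 0).any():
        raise ValueError("drop empty rows and columns before computing ECI")
    # M_tilde[c, c'] = sum_p M[c, p] * M[c', p] / (diversity[c] * ubiquity[p])
    M_tilde = (M / diversity[:, None]) @ (M / ubiquity[None, :]).T
    eigvals, eigvecs = np.linalg.eig(M_tilde)
    order = np.argsort(-eigvals.real)
    # The leading eigenvector is constant; the second one carries the ranking.
    second = eigvecs[:, order[1]].real
    eci = (second - second.mean()) / second.std()
    # Eigenvector sign is arbitrary; orient it to correlate with diversity.
    if np.corrcoef(eci, diversity)[0, 1] < 0:
        eci = -eci
    return eci


# A small nested example: country 0 specializes in everything, country 2 in one language.
M = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]])
print(eci_from_binary(M))
```

The rows of M_tilde sum to 1, which is why its largest eigenvalue is 1 with a constant eigenvector and the information sits in the second one. In a perfectly nested toy matrix like this, ECI simply tracks diversity; real data is less tidy.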

Step 5: Validate and Interpret the Software ECI

Compare your software-based ECI scores with traditional complexity measures (export, patent, or publication-based ECIs). The researchers found that software ECI correlates strongly with existing measures but also adds unique predictive power. Run regressions to test whether software complexity predicts macroeconomic outcomes such as GDP per capita, income inequality (Gini), or CO₂ emissions after controlling for traditional complexity. A statistically significant coefficient on software ECI in such a model suggests that digital production reveals economic capabilities not captured by physical goods or patents.
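The regression check can be sketched with plain NumPy as below. Everything here is illustrative: ols_with_se is a hypothetical helper, the variable names are placeholders, and the data are randomly generated toy numbers with known coefficients, not results from the study. A statistics package would normally be used for robust standard errors and full diagnostics.

```python
import numpy as np


def ols_with_se(y, X):
    """OLS with an intercept; returns (coefficients, classical standard errors)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se


# Toy data: software ECI is correlated with trade ECI but has its own effect.
rng = np.random.default_rng(0)
n = 200
eci_trade = rng.normal(size=n)
eci_software = 0.7 * eci_trade + 0.7 * rng.normal(size=n)
log_gdp = 9.0 + 0.5 * eci_trade + 0.3 * eci_software + 0.1 * rng.normal(size=n)

beta, se = ols_with_se(log_gdp, np.column_stack([eci_trade, eci_software]))
t_software = beta[2] / se[2]  # t-statistic for software ECI, controlling for trade ECI
print(beta, se, t_software)
```

The coefficient of interest is the one on software ECI with the traditional measure already in the model; if it stays clearly away from zero, software complexity carries information the traditional index does not.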

Step 6: Perform Further Analysis (Optional)

Explore temporal dynamics by calculating ECI for multiple quarters and examining how countries’ digital complexity evolves. Network analysis can also reveal which languages serve as hubs of knowledge diffusion. You might also segment by developer type (e.g., open source vs. private repositories) if the data allows.
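One possible starting point for the network analysis is a language-by-language proximity matrix built from the binary matrix M, in the spirit of product-space studies: two languages are close if countries that specialize in one tend to specialize in the other. This is an assumed choice for illustration, not a method the guide attributes to the study, and language_proximity is a hypothetical name.

```python
import numpy as np


def language_proximity(M):
    """phi[j, k] = (# countries specialized in both j and k) / max(ubiquity_j, ubiquity_k)."""
    M = np.asarray(M, dtype=float)
    cooccurrence = M.T @ M
    ubiquity = np.diag(cooccurrence).copy()
    denom = np.maximum(ubiquity[:, None], ubiquity[None, :])
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.where(denom > 0, cooccurrence / denom, 0.0)
    np.fill_diagonal(phi, 0.0)  # ignore self-links
    return phi


# Three countries (rows), three languages (columns).
M = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1]])
phi = language_proximity(M)
hub_score = phi.sum(axis=1)  # total proximity of each language to all others
print(phi)
print(hub_score)
```

Languages with a high summed proximity are candidates for hubs; repeating the calculation per quarter gives a simple view of how the network changes over time.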

Tips for Success

By following these steps, you can reveal the digital complexity hidden within global software production—and contribute to a richer understanding of national economies in the digital age. As the original researchers showed, code may be invisible to customs, but it is far from irrelevant.
