Environmental footprint analysis • whep

What is a footprint analysis?

A footprint analysis answers the question: “How much of an environmental pressure (land, water, emissions…) caused in country A is ultimately driven by consumption in country B?”

Modern supply chains are global. Spain might import soybeans from Brazil, feed them to pigs, and export ham to France. A footprint analysis traces these flows backwards through the entire supply chain, so that the environmental cost of growing those soybeans is attributed to the French consumer who ate the ham — not to the Brazilian farmer who grew the beans.

This is done by building a Multi-Regional Input-Output (MRIO) model, a mathematical framework that captures all inter-industry and inter-country linkages in one system of equations.

The three-step pipeline

The whep package implements this in three steps:

build_io_model() — Assemble the MRIO tables from supply-use data, bilateral trade, and commodity balance sheets.
compute_leontief_inverse() — Compute the Leontief inverse, which captures both direct and indirect supply chain linkages.
compute_footprint() — Multiply the Leontief inverse by environmental extension data to trace pressures through to final consumption.

The methodology follows the FABIO model (Bruckner et al., 2019), which builds physical MRIO tables for the global food system.

Concepts and terminology

Before diving into the code, here are the key concepts:

Supply-use tables and process groups

A supply table records how much of each product each process produces. A use table records how much of each product each process consumes as input. Together they describe the technology of the economy.

In whep, each process belongs to one of four process groups, which together cover all transformations in the food system:

Process group	What it does	Inputs	Outputs
`crop_production`	Grow crops	Seed (from CBS)	Harvested crop + residues (straw, etc.)
`husbandry`	Raise livestock	Feed (crops, grass, etc.)	Live animals (heads) + non-slaughter products (milk, eggs, wool…)
`slaughtering`	Slaughter animals	Live animals (heads)	Meat, offals, fats, hides and skins
`processing`	Derive secondary products	Primary commodities (tonnes)	Processed products (oil, flour, sugar, etc.)

Each process is uniquely identified by a (proc_group, proc_cbs_code) pair.

Mixed units in the supply-use system

Most flows are measured in tonnes, but live animals are measured in heads (number of animals). This matches the FABIO convention. Different rows of the Z matrix can have different units because each row’s entries are only compared to that row’s total output X[i]. The technical coefficients $A_{ij} = Z_{ij} / X_j$ are dimensionless ratios, so the system remains mathematically consistent.

For example, a slaughtering process for cattle:

Use: 1000 heads of cattle (heads)
Supply: 400 tonnes of beef + 50 tonnes of offals + 30 tonnes of hides (tonnes)

The cattle heads and beef tonnes live in different rows of Z, each with its own unit.

Bilateral trade

A bilateral trade matrix records, for each product, how many tonnes (or heads, for live animals) flow from each exporting country to each importing country. This is what connects domestic economies into a global system.

Commodity balance sheets (CBS)

A commodity balance sheet is an accounting identity for each product in each country:

$\text{Production} + \text{Imports} = \text{Exports} + \text{Food} + \text{Feed} + \text{Other uses} + \Delta\text{Stocks}$

The right-hand categories (food, feed, other uses) form the final demand — the end of the supply chain where products are consumed rather than being used as intermediate inputs.

For live animals (which FAO does not include in its CBS data), whep constructs synthetic CBS entries from primary production data:

Meat animals (cattle, pigs, poultry…): all production goes to processing (i.e. they enter the slaughtering process).
Draft animals (horses, camels, asses…): all production goes to other_uses.

The Z matrix (inter-industry flows)

Z is the core of the MRIO model. It is a square matrix where each row and column represents a (country, product) pair. Entry $Z_{ij}$ gives the physical flow from sector $i$ to sector $j$ .

For example, if row 5 is (Spain, wheat) and column 12 is (Spain, flour), then $Z_{5,12}$ is the amount of Spanish wheat used by Spanish flour mills.

The model constructs Z using the industry technology assumption:

$Z = U_{MR} \times T$

where:

$U_{MR}$ is the multi-regional use matrix. It starts from single-country use tables and expands them with trade shares, so that each input can originate from any country. For example, if Spain’s pig farming uses 1000 tonnes of soybeans, and 60% of Spain’s soybean supply comes from Brazil, then 600 tonnes are attributed to Brazil and 400 to domestic production.
$T$ is the row-normalized supply matrix (the transformation matrix). Each row sums to 1 and encodes how each process distributes its output across products. This converts process-level flows to product-level flows.

Trade shares

For each product, trade shares tell us where each country gets its supply from. If Spain consumes 1000 tonnes of soybeans, and 600 come from domestic production while 400 come from Brazil, then the trade share from Brazil to Spain is 0.4 and the domestic share is 0.6.

These shares are computed from the bilateral trade matrices combined with domestic production data from the CBS. They are used to expand the single-country use tables into a multi-regional system where each input can originate from any country.

The Y matrix (final demand)

Y is the final demand matrix. Rows are (country, product) pairs (same as Z), and columns are (country, demand category) pairs. Entry $Y_{ij}$ tells us how much of product $i$ goes to final demand category $j$ .

The demand categories are derived from the CBS columns present in the data, typically: food, other_uses, and stock_addition.

Like Z, the final demand is expanded with trade shares so that, for example, food consumption of wheat in France can partly originate from French wheat and partly from imported wheat.

The X vector (total output)

X is a vector where $X_i$ is the total output of sector $i$ . By construction:

$X = Z \cdot \mathbf{1} + Y \cdot \mathbf{1}$

meaning total output equals intermediate deliveries plus final demand.

FABIO adjustments

After the initial construction of Z and Y, the model applies several corrections following the FABIO methodology:

Leftover redistribution: When CBS reports processing or seed use but the supply-use table does not fully account for it, the unmatched quantity is redistributed to the food category in Y. This prevents demand from vanishing.
Loss endogenization (optional, via endogenize_losses = TRUE): If the CBS contains a losses column, losses are moved from Y to the diagonal of Z. This treats losses as self-use within each sector (the commodity is “consumed” by its own production process) rather than as final demand.
Diagonal rebalancing: When $\text{diag}(Z)_i \geq X_i$ (the diagonal accounts for all or more than total output, usually due to reporting errors), 80% of the diagonal value is moved to final demand and 20% is kept on the diagonal, proportionally distributed across the sector’s demand categories.
Negative output fixing: Any sectors with $X_i < 0$ (which can arise from trade data inconsistencies) are corrected by adjusting negative entries in Y.

The Leontief inverse (L)

The Leontief inverse $L = (I - A)^{-1}$ is the key analytical tool. The technical coefficients matrix $A$ is defined as $A_{ij} = Z_{ij} / X_j$ , representing how much input from sector $i$ is needed per unit of output from sector $j$ .

Before inversion, two corrections are applied:

Negative entries in $A$ are set to zero: these arise from data inconsistencies and would distort the inverse.
Column sums of $A$ are capped at 1: this ensures $(I - A)$ is invertible even when reported inputs exceed reported output.

The Leontief inverse captures the total requirements — both direct and indirect — needed to deliver one unit of final demand. Entry $L_{ij}$ tells us: “To deliver one extra unit of product $j$ to final consumers, how much total output does sector $i$ need to produce (including all the intermediate steps)?”

This is what makes footprint tracing possible. A consumer buying bread in France triggers not just French bakeries, but also French mills, French wheat farmers, Brazilian soybean exporters (if feed is involved), and so on. The Leontief inverse encodes all of these ripple effects in a single matrix.

Environmental extensions

An environmental extension is any pressure you want to trace through the supply chain. Common examples:

Land use (hectares): How many hectares of cropland does each sector use?
Water use (cubic meters): How much blue water is consumed?
Greenhouse gas emissions (tonnes CO2-eq): How much does each sector emit?
Nitrogen surplus (tonnes N): How much excess nitrogen enters the environment?

You provide these as a vector $e$ with one value per (country, product) sector.

The footprint

The footprint matrix is computed as:

$FP = \hat{f} \cdot L \times Y$

where $\hat{f}$ is a diagonal matrix of extension intensities $f_i = e_i / X_i$ (pressure per unit of output). Each entry $FP_{ij}$ tells us: “How much environmental pressure occurring in sector $i$ is driven by final demand category $j$ ?”

When fd_labels is provided, the footprint uses the FABIO diagonal approach: for each final demand column $j$ , the function computes $MP \cdot \text{diag}(y_j) \cdot G$ , where $G$ groups sectors by item. This decomposes the footprint by both the target item being consumed and the demand category (food, other uses, etc.).

The resulting tidy tibble decomposes this by origin country/product and target country/product, giving you a full bilateral footprint map.

Running the analysis

Step 1: Build the IO model

The simplest call uses default arguments — the function fetches all input data internally:

library(whep)

io <- build_io_model()

This downloads supply-use tables, bilateral trade, and CBS data, then builds the IO model for all available years. To restrict to specific years:

io <- build_io_model(years = c(2010, 2015))

If you want to inspect or pre-filter the input data, you can build them separately and pass them in:

supply_use <- build_supply_use()
bilateral_trade <- get_bilateral_trade()
cbs <- get_wide_cbs()

io <- build_io_model(supply_use, bilateral_trade, cbs, years = 2010)

To endogenize losses (move them from final demand to the diagonal of Z):

io <- build_io_model(endogenize_losses = TRUE)

The result is a tibble with one row per year. Each row contains:

Z: The inter-industry flow matrix (sparse).
Y: The final demand matrix (sparse).
X: The total output vector.
labels: A tibble mapping row/column indices to area_code and item_cbs_code.
fd_labels: A tibble mapping each Y column to its area_code (consuming country) and fd_col (demand category).

You can inspect a single year:

# Extract matrices for one year
z_mat <- io$Z[[1]]
y_mat <- io$Y[[1]]
x_vec <- io$X[[1]]
labels <- io$labels[[1]]
fd_labels <- io$fd_labels[[1]]
selected_year <- io$year[[1]]

# Dimensions: (n_countries * n_products) x (n_countries * n_products)
dim(z_mat)

# Label mapping
labels

# Final demand column labels
fd_labels

Step 2: Compute the Leontief inverse

For small systems (< 5000 sectors), you can pre-compute L:

l_inv <- compute_leontief_inverse(z_mat, x_vec)
dim(l_inv)

This inverts the $(I - A)$ matrix. Column sums of $A$ are internally capped at 1 to prevent singularity, and negative entries are zeroed.

For large systems

For systems larger than ~5000 sectors (e.g. 192 countries x 125+ products = 24 000 sectors), computing the explicit Leontief inverse requires too much memory (e.g. 4.8 GiB for 25 000 sectors). In those cases, pass z_mat directly to compute_footprint(), which solves $(I - A)x = Y$ without ever materialising the full L.

Step 3: Compute the footprint

Now provide your environmental extension vector. This must have one value per sector, in the same order as the labels. For land use, build a native land extension by combining the per-crop physical cropland (build_cropgrids_land_extension()) with grassland (build_grassland_land_extension()):

crop_land <- build_cropgrids_land_extension(source = "cropgrids_fallow")
grass_land <- build_grassland_land_extension() |>
  dplyr::select(year, area_code, item_cbs_code, impact_u)

land_use_data <- dplyr::bind_rows(crop_land, grass_land) |>
  dplyr::filter(year == selected_year) |>
  dplyr::select(area_code, item_cbs_code, hectares = impact_u)

# Match it to the IO model's label order
extensions <- labels |>
  dplyr::left_join(
    land_use_data,
    by = c("area_code", "item_cbs_code")
  ) |>
  dplyr::mutate(
    hectares = tidyr::replace_na(hectares, 0)
  ) |>
  dplyr::pull(hectares)

Dense path (pre-computed L, small systems)

If you pre-computed L:

footprint <- compute_footprint(
  l_inv, x_vec, y_mat, extensions, labels,
  fd_labels = fd_labels
)
footprint

Using Z directly (large systems)

Alternatively, pass the Z matrix and compute_footprint() will derive the Leontief inverse for you:

footprint <- compute_footprint(
  z_mat = z_mat,
  x_vec = x_vec,
  y_mat = y_mat,
  extensions = extensions,
  labels = labels,
  fd_labels = fd_labels
)
footprint

With and without fd_labels

When fd_labels is provided (recommended), the result includes target_item (the consumed product) and target_fd (the demand category like "food" or "other_uses"). This uses the FABIO diagonal decomposition approach.

When fd_labels is omitted, columns of Y are treated as individual sectors. This is appropriate only for square Y matrices and does not produce a target_fd column.

The result is a tidy tibble with columns:

Column	Meaning
`origin_area`	Country where the pressure physically occurs
`origin_item`	Product causing the pressure
`target_area`	Country whose consumption drives the pressure
`target_item`	Product being consumed
`target_fd`	Demand category (only with `fd_labels`)
`value`	Footprint in extension units (e.g. hectares)

Analysing the results

Because the output is a tidy tibble, you can use standard dplyr operations to answer questions:

Which country’s consumption drives the most land use?

footprint |>
  dplyr::summarise(
    total_land = sum(value), .by = target_area
  ) |>
  dplyr::arrange(dplyr::desc(total_land))

How much of Spain’s land use is driven by foreign consumption?

spain_code <- 203L

footprint |>
  dplyr::filter(origin_area == spain_code) |>
  dplyr::mutate(
    destination = ifelse(
      target_area == spain_code, "domestic", "exported"
    )
  ) |>
  dplyr::summarise(
    total_land = sum(value), .by = destination
  )

What are the top bilateral flows?

footprint |>
  dplyr::summarise(
    total = sum(value),
    .by = c(origin_area, target_area)
  ) |>
  dplyr::arrange(dplyr::desc(total)) |>
  head(20)

Which products drive the most land use in food consumption?

footprint |>
  dplyr::filter(target_fd == "food") |>
  dplyr::summarise(
    total_land = sum(value), .by = target_item
  ) |>
  add_item_cbs_name(code_column = "target_item") |>
  dplyr::arrange(dplyr::desc(total_land)) |>
  head(20)

A conservation check

A useful sanity check: the total footprint should equal the total extensions. All pressure must be attributed somewhere.

# These two numbers should be equal (up to floating-point tolerance)
sum(footprint$value)
sum(extensions)

Small differences can arise from negative-output corrections or column-sum capping in the technical coefficients matrix.

Processing multiple years

Since build_io_model() returns one row per year, you can loop over them:

io <- build_io_model(years = 2010:2013)

land <- dplyr::bind_rows(
  build_cropgrids_land_extension(source = "cropgrids_fallow"),
  build_grassland_land_extension() |>
    dplyr::select(year, area_code, item_cbs_code, impact_u)
)

all_footprints <- purrr::pmap_dfr(
  list(
    io$Z, io$X, io$Y, io$labels,
    io$fd_labels, io$year
  ),
  function(z, x, y, lab, fdl, yr) {
    ext <- lab |>
      dplyr::left_join(
        land |>
          dplyr::filter(year == yr) |>
          dplyr::select(area_code, item_cbs_code, impact_u),
        by = c("area_code", "item_cbs_code")
      ) |>
      dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |>
      dplyr::pull(impact_u)

    compute_footprint(
      z_mat = z, x_vec = x, y_mat = y,
      extensions = ext, labels = lab,
      fd_labels = fdl
    ) |>
      dplyr::mutate(year = yr)
  }
)

Parallel execution (optional)

Since each year is independent, you can parallelise using furrr:

library(furrr)
future::plan(multisession, workers = 4)

all_footprints <- furrr::future_pmap_dfr(
  list(
    io$Z, io$X, io$Y, io$labels,
    io$fd_labels, io$year
  ),
  function(z, x, y, lab, fdl, yr) {
    ext <- lab |>
      dplyr::left_join(
        land |>
          dplyr::filter(year == yr) |>
          dplyr::select(area_code, item_cbs_code, impact_u),
        by = c("area_code", "item_cbs_code")
      ) |>
      dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |>
      dplyr::pull(impact_u)

    compute_footprint(
      z_mat = z, x_vec = x, y_mat = y,
      extensions = ext, labels = lab,
      fd_labels = fdl
    ) |>
      dplyr::mutate(year = yr)
  }
)

Summary

Step	Function	Input	Output
1	`build_io_model()`	Supply-use, trade, CBS (all optional — fetched by default)	Z, Y, X, labels, fd_labels per year
2	`compute_leontief_inverse()`	Z, X	Leontief inverse L (small systems only)
3	`compute_footprint()`	L or Z, X, Y, extensions, labels, fd_labels	Tidy bilateral footprint

References

Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., & Borner, J. (2019). FABIO — The Construction of the Food and Agriculture Biomass Input-Output Model. Environmental Science & Technology, 53(19), 11302–11312. https://doi.org/10.1021/acs.est.8b03704