Height/weight/BMI Data

Weight, height, and BMI measurements harmonized from FinnGen registers (March 2026).

The file contains anthropometric measurements (weight, height, BMI) from FinnGen registers in longitudinal format, with one row per measurement event per person. Multiple data sources have been integrated, deduplicated by prioritizing higher-quality sources, and quality controlled using robust outlier detection methods.

Location: /finngen/library-red/finngen_R13/harmonized_data/weight_height/

File structure

Data

File | Description
Harmonized_weight_height_bmi_filtered.tsv | Weight, height, and BMI measurements harmonized from FinnGen registers

Data description

  • Number of persons: 479,097
  • Number of measurement events (post-QC): 7,504,253
  • Number of measurement events (pre-QC): 9,962,597
  • Measurements removed by QC: 2,458,344 (24.7%), the vast majority of which were duplicates across sources

Measurement breakdown

Measurement Type | Number of Events
WEIGHT | 3,190,090
HEIGHT | 2,083,191
BMI | 2,230,972

Source breakdown

Source | Number of Events
HILMO_AVOHILMO_extended | 3,666,682
EA3_FLD_physiological_measurement | 1,339,220
minimum_info | 1,074,080
EA3_PULMO_physiological_measurement | 568,155
EA3_PCOS_physiological_measurement | 430,861
Spirometry_raw | 222,899
Rheumatology_register | 112,684
MINIMUM_longitudinal | 85,619
KANTA_unmapped | 3,910
KANTA_mapped | 143

Data fields

Field | Description
FINNGENID | FinnGen individual identifier
MEASUREMENT_TYPE | Type of measurement: WEIGHT, HEIGHT, or BMI
MEASUREMENT_SOURCE | The source register/dataset of the measurement (see Data sources below)
AGE | Age at measurement (years)
VALUE | Measurement value (units: kg for weight, cm for height, kg/m² for BMI)
APPROX_DATE | Approximate date of measurement (YYYY-MM-DD format). May be NA for some sources.
SEX | Sex of the individual (male/female)
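As an illustration of the field layout, the following is a minimal sketch of reading the file with Python's standard library. It assumes the TSV has a header row using the field names above and that missing dates are written as NA; the sample rows, the IDs, and the names `Measurement` and `read_measurements` are invented for this example.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, TypedDict


class Measurement(TypedDict):
    FINNGENID: str
    MEASUREMENT_TYPE: str
    MEASUREMENT_SOURCE: str
    AGE: float
    VALUE: float
    APPROX_DATE: Optional[str]
    SEX: str


def read_measurements(handle: Iterable[str]) -> Iterator[Measurement]:
    """Yield one typed record per row of a tab-separated file with a header."""
    for row in csv.DictReader(handle, delimiter="\t"):
        date = row["APPROX_DATE"]
        yield Measurement(
            FINNGENID=row["FINNGENID"],
            MEASUREMENT_TYPE=row["MEASUREMENT_TYPE"],
            MEASUREMENT_SOURCE=row["MEASUREMENT_SOURCE"],
            AGE=float(row["AGE"]),
            VALUE=float(row["VALUE"]),
            # Assumption: missing dates appear as the literal string "NA".
            APPROX_DATE=None if date in ("", "NA") else date,
            SEX=row["SEX"],
        )


# Invented two-row sample standing in for the real file.
SAMPLE = (
    "FINNGENID\tMEASUREMENT_TYPE\tMEASUREMENT_SOURCE\tAGE\tVALUE\tAPPROX_DATE\tSEX\n"
    "FG_EXAMPLE_1\tWEIGHT\tHILMO_AVOHILMO_extended\t45.12\t72.5\t2015-03-02\tfemale\n"
    "FG_EXAMPLE_1\tHEIGHT\tminimum_info\t51.30\t168.0\tNA\tfemale\n"
)
records = list(read_measurements(io.StringIO(SAMPLE)))
```

For the real file, the handle would come from `open(path, newline="")`. With roughly 7.5 million rows, streaming records as above or using a dataframe library may be preferable to building one large list.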

Data sources

Data has been integrated from 10 different FinnGen data sources:

MEASUREMENT_SOURCE value | Register source | Description
HILMO_AVOHILMO_extended | Hospital discharge register (HILMO) and primary care register (AvoHILMO) | Extended anthropometric measurements extracted from structured fields in inpatient, outpatient, and primary care records
EA3_FLD_physiological_measurement | FinnGen expansion area 3 - Fatty Liver Disease (EA3 Projects) | Longitudinal weight, height, and BMI measurements collected during FLD research visits
minimum_info | FinnGen minimum dataset - baseline measurements | Weight, height, and BMI recorded at biobank recruitment (single timepoint per individual)
EA3_PULMO_physiological_measurement | FinnGen expansion area 3 - Pulmonary diseases (EA3 Projects) | Longitudinal measurements from pulmonary disease research visits. Height/weight swaps corrected, extreme outliers removed.
EA3_PCOS_physiological_measurement | FinnGen expansion area 3 - Polycystic Ovary Syndrome (EA3 Projects) | Measurements from women's health research visits. Records where height equals weight removed as data entry errors.
Spirometry_raw | Biobank spirometry tests (Spirometry Data) | Height and weight recorded during spirometry procedures from multiple biobanks
Rheumatology_register | Finnish rheumatology register (Rheumatology Register) | BMI measurements only from rheumatology patient visits
MINIMUM_longitudinal | FinnGen minimum dataset - longitudinal measurements (Minimum Longitudinal Data) | Multiple weight and height measurements over time from biobank records. Infant weights converted from grams to kg.
KANTA_unmapped | Kanta laboratory system - manually mapped entries (Kanta Lab Values) | Laboratory measurements not automatically mapped to OMOP codes, manually identified as anthropometric measurements
KANTA_mapped | Kanta laboratory system - OMOP-mapped entries (Kanta Lab Values) | Laboratory measurements automatically mapped to OMOP concept IDs for weight (3025315, 3013762), height (3036277, 3023540, 3019171), and BMI (3038553)

Methods

Data integration

  1. Source loading: All measurement sources were loaded and standardized to a common format (FINNGENID, MEASUREMENT_TYPE, AGE, VALUE, VALUE_UNIT, MEASUREMENT_SOURCE, APPROX_DATE)

  2. Unit standardization: Values were converted to standard units (weight: kg, height: cm, BMI: kg/m²)

  3. Data cleaning: Source-specific quality issues were resolved (e.g., correction of height/weight swaps in PULMO data, infant weight gram-to-kg conversion, removal of identical height/weight values in PCOS)

  4. Filtering to R13 participants: Only individuals present in the FinnGen R13 data freeze were included

Deduplication

When multiple sources provided the same measurement (same individual, measurement type, and age), measurements were deduplicated by prioritizing sources in this order:

  1. HILMO_AVOHILMO_extended

  2. KANTA_mapped

  3. KANTA_unmapped

  4. MINIMUM_longitudinal

  5. minimum_info

  6. Spirometry_raw

  7. Rheumatology_register

  8. EA3_FLD_physiological_measurement

  9. EA3_PCOS_physiological_measurement

  10. EA3_PULMO_physiological_measurement

Age was rounded to 2 decimal places for deduplication purposes.
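The priority rule can be sketched as follows. This is an illustrative reading of the description above, not the pipeline's actual code: records are assumed to be dictionaries keyed by the field names of the file, the function name `deduplicate` is hypothetical, and keeping the first-seen record when two rows share a source is an assumption.

```python
from typing import Dict, List, Tuple

# Priority order as listed above (earlier entries win).
SOURCE_PRIORITY = [
    "HILMO_AVOHILMO_extended",
    "KANTA_mapped",
    "KANTA_unmapped",
    "MINIMUM_longitudinal",
    "minimum_info",
    "Spirometry_raw",
    "Rheumatology_register",
    "EA3_FLD_physiological_measurement",
    "EA3_PCOS_physiological_measurement",
    "EA3_PULMO_physiological_measurement",
]
RANK = {name: rank for rank, name in enumerate(SOURCE_PRIORITY)}


def deduplicate(records: List[dict]) -> List[dict]:
    """Keep one record per (individual, measurement type, age rounded to 2 dp)."""
    best: Dict[Tuple[str, str, float], dict] = {}
    for rec in records:
        key = (rec["FINNGENID"], rec["MEASUREMENT_TYPE"], round(rec["AGE"], 2))
        kept = best.get(key)
        # Replace the kept record only if the new one has strictly higher priority.
        if kept is None or RANK[rec["MEASUREMENT_SOURCE"]] < RANK[kept["MEASUREMENT_SOURCE"]]:
            best[key] = rec
    return list(best.values())


# Invented example: the first two rows collapse to the same rounded age (45.12).
example = [
    {"FINNGENID": "A", "MEASUREMENT_TYPE": "WEIGHT", "AGE": 45.121,
     "MEASUREMENT_SOURCE": "minimum_info", "VALUE": 72.0},
    {"FINNGENID": "A", "MEASUREMENT_TYPE": "WEIGHT", "AGE": 45.124,
     "MEASUREMENT_SOURCE": "HILMO_AVOHILMO_extended", "VALUE": 72.4},
    {"FINNGENID": "A", "MEASUREMENT_TYPE": "WEIGHT", "AGE": 46.5,
     "MEASUREMENT_SOURCE": "Spirometry_raw", "VALUE": 73.0},
]
deduped = deduplicate(example)
```

In this example the minimum_info row is dropped in favor of the HILMO_AVOHILMO_extended row, while the measurement at age 46.5 is untouched because no other source reports it.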

Quality control

Population-level outlier detection:

  • Measurements were binned by age windows (granular 1-year bins for ages 0-20, broader bins for adults), sex, and measurement type

  • Within each bin, median and Median Absolute Deviation (MAD) were calculated

  • Sided MAD scores were computed separately for values above and below the median

  • Measurements with sided MAD score > 5 were flagged as potential outliers
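One possible reading of the sided MAD score is sketched below for the values of a single bin (one age window, sex, and measurement type). It assumes a separate MAD is computed from the values below and above the median, that no consistency constant is applied, and that binning has already been done; the function name `sided_mad_scores` and the sample heights are invented, and the actual implementation may differ in these details.

```python
from statistics import median
from typing import List

OUTLIER_THRESHOLD = 5  # threshold stated above


def sided_mad_scores(values: List[float]) -> List[float]:
    """Score each value by its distance from the median in units of its side's MAD."""
    med = median(values)
    # Absolute deviations on each side of the median (ties count on both sides).
    mad_below = median([med - v for v in values if v <= med])
    mad_above = median([v - med for v in values if v >= med])
    scores = []
    for v in values:
        if v == med:
            scores.append(0.0)
            continue
        mad = mad_above if v > med else mad_below
        # A zero MAD with a non-zero deviation is treated as infinitely extreme.
        scores.append(abs(v - med) / mad if mad > 0 else float("inf"))
    return scores


# Invented adult heights (cm) in one bin; 260 is an implausible entry.
heights = [160, 165, 168, 170, 172, 175, 180, 260]
scores = sided_mad_scores(heights)
flagged = [s > OUTLIER_THRESHOLD for s in scores]
```

Scoring each side separately keeps a long upper tail, which is common for weight and BMI, from inflating the tolerance applied to low values.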

Outlier rescue:

  • Individuals with ≥4 measurements where ≥90% were flagged as outliers were "rescued" (flags removed)

  • Rationale: these likely represent true biological extremes rather than measurement errors

Final filtering:

  • Measurements flagged as outliers (after rescue) were excluded from the final dataset
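The rescue and final filtering steps could look roughly like this. The sketch assumes each record already carries a boolean `outlier` flag and that the ≥4 and ≥90% criteria are evaluated across all of an individual's measurements; whether the original applies them per measurement type is not stated. The function name `rescue_and_filter` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def rescue_and_filter(records: List[dict], min_count: int = 4,
                      min_fraction: float = 0.9) -> List[dict]:
    """Drop flagged records unless the individual qualifies for rescue."""
    by_person: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        by_person[rec["FINNGENID"]].append(rec)

    kept = []
    for recs in by_person.values():
        n_flagged = sum(1 for r in recs if r["outlier"])
        # Rescue: enough measurements, and nearly all of them flagged.
        rescued = len(recs) >= min_count and n_flagged / len(recs) >= min_fraction
        kept.extend(r for r in recs if rescued or not r["outlier"])
    return kept


# Invented example: A is consistently extreme, B has one error, C has too few rows.
example = (
    [{"FINNGENID": "A", "outlier": True} for _ in range(4)]
    + [{"FINNGENID": "B", "outlier": i == 0} for i in range(5)]
    + [{"FINNGENID": "C", "outlier": True} for _ in range(2)]
)
final = rescue_and_filter(example)
```

Here all four of A's measurements survive, B loses only the single flagged value, and both of C's flagged measurements are excluded because C has fewer than four measurements.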

Validation

  • Summary statistics and distributions were generated for each source before and after QC

  • UpSet plots were created to visualize source overlaps

  • Plots were generated showing distributions by age, sex, and source

Data notes

Duplicate measurements

Duplicate measurements from the same source and timepoint have been removed. When multiple sources provide the same measurement, the highest-priority source (see deduplication order above) was retained.

Missing dates

APPROX_DATE is missing (NA) for measurements from: MINIMUM_longitudinal, minimum_info, EA3_FLD, EA3_PULMO, and EA3_DIAB sources.

Unit conversions

  • Heights originally in meters (PULMO EA3) were converted to centimeters

  • Infant weights >500 recorded before age 1 year (MINIMUM_longitudinal) were converted from grams to kilograms
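The two conversions can be expressed as small helper functions. This is a sketch only: the function names are hypothetical, and the height helper assumes the unit is known from a unit field such as VALUE_UNIT, since the text does not say how meter-valued heights were identified.

```python
def infant_weight_to_kg(value: float, age_years: float) -> float:
    """Treat weights above 500 recorded before age 1 as grams and convert to kg."""
    if age_years < 1 and value > 500:
        return value / 1000
    return value


def height_to_cm(value: float, unit: str) -> float:
    """Convert a height to centimeters given its unit ('m' or 'cm')."""
    if unit == "m":
        return value * 100
    if unit == "cm":
        return value
    raise ValueError(f"unexpected height unit: {unit!r}")


newborn_kg = infant_weight_to_kg(3500, 0.0)   # 3500 g at birth
infant_kg = infant_weight_to_kg(8.2, 0.5)     # already in kg, left unchanged
adult_cm = height_to_cm(1.75, "m")
```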

Data quality issues resolved

  1. PULMO EA3: Height and weight values were swapped in some records; these were automatically detected (height < weight) and corrected

  2. PCOS EA3: Measurements where height and weight were identical (data entry error) were removed

  3. PULMO EA3: Impossible values (height <100cm for age ≥10, weight <20kg for age >10, BMI <10 for age >2 or BMI >100) were removed after attempting height/weight swap correction
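The PULMO swap correction and impossible-value checks can be illustrated for a single height/weight pair. The sketch assumes BMI is derived from the corrected pair as weight in kg divided by the square of height in meters, which the text does not confirm; the function name `clean_pulmo_pair` is hypothetical, while the thresholds are those listed above.

```python
from typing import Optional, Tuple


def clean_pulmo_pair(height_cm: float, weight_kg: float,
                     age: float) -> Optional[Tuple[float, float]]:
    """Return the corrected (height, weight) pair, or None if it is impossible."""
    # Swap detection as described above: a height smaller than the weight.
    if height_cm < weight_kg:
        height_cm, weight_kg = weight_kg, height_cm
    if height_cm <= 0:
        return None
    bmi = weight_kg / (height_cm / 100) ** 2  # assumption: BMI derived here
    impossible = (
        (age >= 10 and height_cm < 100)
        or (age > 10 and weight_kg < 20)
        or (age > 2 and bmi < 10)
        or bmi > 100
    )
    return None if impossible else (height_cm, weight_kg)


swapped = clean_pulmo_pair(70, 175, 40)      # entered the wrong way round
unchanged = clean_pulmo_pair(175, 70, 40)
too_short = clean_pulmo_pair(90, 15, 30)     # adult height below 100 cm
```

Running the swap first means a swapped but otherwise valid pair is repaired rather than discarded by the threshold checks.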

Age precision

Age is recorded as a continuous variable (decimal years) calculated from birth date and measurement date when available.
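A decimal-year age of this kind can be computed as below. The divisor of 365.25 days per year is an assumption, as the exact convention is not documented here, and the function name `age_at_measurement` is hypothetical.

```python
from datetime import date


def age_at_measurement(birth: date, measured: date,
                       days_per_year: float = 365.25) -> float:
    """Age in decimal years between a birth date and a measurement date."""
    return (measured - birth).days / days_per_year


age = age_at_measurement(date(1980, 1, 1), date(2020, 1, 1))
age_rounded = round(age_at_measurement(date(2000, 6, 15), date(2001, 6, 15)), 2)
```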

Coverage

  • Most comprehensive coverage in adults (HILMO/AvoHILMO captures most healthcare visits)

  • Pediatric measurements primarily from MINIMUM longitudinal and specialized studies

  • Single timepoint measurements available for most individuals from biobank recruitment (minimum_info)

  • Longitudinal measurements available for individuals with multiple healthcare encounters or study visits

Limitations

  • Missing data is not random; measurements are more likely to be recorded for individuals with healthcare encounters

  • Different sources can have different measurement quality and protocols

  • Some sources (e.g., rheumatology register) only provide BMI, not separate weight and height
