Height/weight/BMI Data
Weight, height, and BMI measurements harmonized from FinnGen registers (March 2026).
The file contains anthropometric measurements (weight, height, BMI) from FinnGen registers in longitudinal format, with one row per measurement event per person. Multiple data sources have been integrated, deduplicated by prioritizing higher-quality sources, and quality controlled using robust outlier detection methods.
Location: /finngen/library-red/finngen_R13/harmonized_data/weight_height/
File structure
Data
Harmonized_weight_height_bmi_filtered.tsv: Weight, height, and BMI measurements harmonized from FinnGen registers
Data description
Number of persons: 479,097
Number of measurement events (pre-QC): 9,962,597
Number of measurement events (post-QC): 7,504,253
Measurements removed by QC (the vast majority are duplicates across sources): 2,458,344 (24.7%)
Measurement breakdown
WEIGHT: 3,190,090
HEIGHT: 2,083,191
BMI: 2,230,972
Source breakdown
HILMO_AVOHILMO_extended: 3,666,682
EA3_FLD_physiological_measurement: 1,339,220
minimum_info: 1,074,080
EA3_PULMO_physiological_measurement: 568,155
EA3_PCOS_physiological_measurement: 430,861
Spirometry_raw: 222,899
Rheumatology_register: 112,684
MINIMUM_longitudinal: 85,619
KANTA_unmapped: 3,910
KANTA_mapped: 143
Data fields
FINNGENID: FinnGen individual identifier
MEASUREMENT_TYPE: Type of measurement: WEIGHT, HEIGHT, or BMI
MEASUREMENT_SOURCE: The source register/dataset of the measurement (see Data sources below)
AGE: Age at measurement (years)
VALUE: Measurement value (units: kg for weight, cm for height, kg/m² for BMI)
APPROX_DATE: Approximate date of measurement (YYYY-MM-DD format). May be NA for some sources.
SEX: Sex of the individual (male/female)
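As a minimal sketch of how the file might be read, the example below parses tab-separated rows into typed records using only the Python standard library. It assumes the file has a header row with the field names listed above and that missing dates are written as the string NA; the sample rows and identifiers are synthetic and only there to make the example runnable.

```python
import csv
import io

# Synthetic rows for illustration only (not real FinnGen data).
# Assumption: the header matches the documented field names.
SAMPLE_TSV = (
    "FINNGENID\tMEASUREMENT_TYPE\tMEASUREMENT_SOURCE\tAGE\tVALUE\tAPPROX_DATE\tSEX\n"
    "FG0000001\tWEIGHT\tminimum_info\t45.20\t82.5\tNA\tmale\n"
    "FG0000001\tHEIGHT\tHILMO_AVOHILMO_extended\t46.10\t180.0\t2015-03-02\tmale\n"
)

def parse_rows(handle):
    """Yield one dict per measurement event, with AGE and VALUE as floats
    and missing dates (assumed to be coded as 'NA') as None."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["AGE"] = float(row["AGE"])
        row["VALUE"] = float(row["VALUE"])
        if row["APPROX_DATE"] in ("NA", ""):
            row["APPROX_DATE"] = None
        yield row

rows = list(parse_rows(io.StringIO(SAMPLE_TSV)))
```

To read the real file, the same function can be given a handle opened with `open(path, newline="")`, where `path` points to Harmonized_weight_height_bmi_filtered.tsv in the location given at the top of this page.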
Data sources
Data has been integrated from 10 different FinnGen data sources:
HILMO_AVOHILMO_extended
Hospital discharge register (HILMO) and primary care register (AvoHILMO)
Extended anthropometric measurements extracted from structured fields in inpatient, outpatient, and primary care records
EA3_FLD_physiological_measurement
FinnGen expansion area 3 - Fatty Liver Disease (EA3 Projects)
Longitudinal weight, height, and BMI measurements collected during FLD research visits
minimum_info
FinnGen minimum dataset - baseline measurements
Weight, height, and BMI recorded at biobank recruitment (single timepoint per individual)
EA3_PULMO_physiological_measurement
FinnGen expansion area 3 - Pulmonary diseases (EA3 Projects)
Longitudinal measurements from pulmonary disease research visits. Height/weight swaps corrected, extreme outliers removed.
EA3_PCOS_physiological_measurement
FinnGen expansion area 3 - Polycystic Ovary Syndrome (EA3 Projects)
Measurements from women's health research visits. Records where height equals weight removed as data entry errors.
Spirometry_raw
Biobank spirometry tests (Spirometry Data)
Height and weight recorded during spirometry procedures from multiple biobanks
Rheumatology_register
Finnish rheumatology register (Rheumatology Register)
BMI measurements only from rheumatology patient visits
MINIMUM_longitudinal
FinnGen minimum dataset - longitudinal measurements (Minimum Longitudinal Data)
Multiple weight and height measurements over time from biobank records. Infant weights converted from grams to kg.
KANTA_unmapped
Kanta laboratory system - manually mapped entries (Kanta Lab Values)
Laboratory measurements not automatically mapped to OMOP codes, manually identified as anthropometric measurements
KANTA_mapped
Kanta laboratory system - OMOP-mapped entries (Kanta Lab Values)
Laboratory measurements automatically mapped to OMOP concept IDs for weight (3025315, 3013762), height (3036277, 3023540, 3019171), and BMI (3038553)
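The OMOP concept IDs listed for KANTA_mapped can be expressed as a simple lookup table. This is an illustrative sketch only: the dictionary and function names are hypothetical, and the concept IDs are copied from the description above.

```python
# Concept IDs as listed for the KANTA_mapped source.
# The names OMOP_TO_MEASUREMENT_TYPE and measurement_type are hypothetical.
OMOP_TO_MEASUREMENT_TYPE = {
    3025315: "WEIGHT",
    3013762: "WEIGHT",
    3036277: "HEIGHT",
    3023540: "HEIGHT",
    3019171: "HEIGHT",
    3038553: "BMI",
}

def measurement_type(concept_id):
    """Return WEIGHT, HEIGHT, or BMI for a known concept ID, else None."""
    return OMOP_TO_MEASUREMENT_TYPE.get(concept_id)
```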
Methods
Data integration
Source loading: All measurement sources were loaded and standardized to a common format (FINNGENID, MEASUREMENT_TYPE, AGE, VALUE, VALUE_UNIT, MEASUREMENT_SOURCE, APPROX_DATE)
Unit standardization: Values were converted to standard units (weight: kg, height: cm, BMI: kg/m²)
Data cleaning: Source-specific quality issues were resolved (e.g., height/weight swaps in PULMO data, infant weight gram-to-kg conversion, removal of identical height/weight values in PCOS data)
Filtering to R13 participants: Only individuals present in the FinnGen R13 data freeze were included
Deduplication
When multiple sources provided the same measurement (same individual, measurement type, and age), measurements were deduplicated by prioritizing sources in this order:
HILMO_AVOHILMO_extended
KANTA_mapped
KANTA_unmapped
MINIMUM_longitudinal
minimum_info
Spirometry_raw
Rheumatology_register
EA3_FLD_physiological_measurement
EA3_PCOS_physiological_measurement
EA3_PULMO_physiological_measurement
Age was rounded to 2 decimal places for deduplication purposes.
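The deduplication rule can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: rows are assumed to be dictionaries keyed by the documented field names, the function name is hypothetical, and ties within the same source are resolved by keeping the first row seen.

```python
# Priority order as listed above (first entry = highest priority).
SOURCE_PRIORITY = [
    "HILMO_AVOHILMO_extended",
    "KANTA_mapped",
    "KANTA_unmapped",
    "MINIMUM_longitudinal",
    "minimum_info",
    "Spirometry_raw",
    "Rheumatology_register",
    "EA3_FLD_physiological_measurement",
    "EA3_PCOS_physiological_measurement",
    "EA3_PULMO_physiological_measurement",
]
RANK = {name: i for i, name in enumerate(SOURCE_PRIORITY)}

def deduplicate(rows):
    """Keep one row per (individual, measurement type, age rounded to 2 decimals),
    preferring the source with the highest priority."""
    best = {}
    for row in rows:
        key = (row["FINNGENID"], row["MEASUREMENT_TYPE"], round(row["AGE"], 2))
        kept = best.get(key)
        if kept is None or RANK[row["MEASUREMENT_SOURCE"]] < RANK[kept["MEASUREMENT_SOURCE"]]:
            best[key] = row
    return list(best.values())

# Synthetic example: two weights at the same rounded age, one height.
example = [
    {"FINNGENID": "FG0000001", "MEASUREMENT_TYPE": "WEIGHT",
     "MEASUREMENT_SOURCE": "minimum_info", "AGE": 45.204, "VALUE": 82.5},
    {"FINNGENID": "FG0000001", "MEASUREMENT_TYPE": "WEIGHT",
     "MEASUREMENT_SOURCE": "HILMO_AVOHILMO_extended", "AGE": 45.2, "VALUE": 82.0},
    {"FINNGENID": "FG0000001", "MEASUREMENT_TYPE": "HEIGHT",
     "MEASUREMENT_SOURCE": "Spirometry_raw", "AGE": 45.2, "VALUE": 180.0},
]
deduped = deduplicate(example)
```

In the synthetic example, the two weight rows share the same key after rounding, so only the HILMO_AVOHILMO_extended row is kept alongside the height row.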
Quality control
Population-level outlier detection:
Measurements were binned by age windows (granular 1-year bins for ages 0-20, broader bins for adults), sex, and measurement type
Within each bin, median and Median Absolute Deviation (MAD) were calculated
Sided MAD scores were computed separately for values above and below the median
Measurements with sided MAD score > 5 were flagged as potential outliers
Outlier rescue:
Individuals with ≥4 measurements where ≥90% were flagged as outliers were "rescued" (flags removed)
Rationale: these likely represent true biological extremes rather than measurement errors
Final filtering:
Measurements flagged as outliers (after rescue) were excluded from the final dataset
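The three QC steps can be sketched as below. Several details are assumptions because they are not specified here: the width of the adult age bins (10 years in this sketch), the exact bin boundary at age 20, the absence of a scaling constant on the MAD, the handling of a zero MAD, and the choice to apply the rescue rule across all of an individual's measurements rather than per measurement type. Function names are hypothetical and the example data are synthetic.

```python
from collections import defaultdict
from statistics import median

MAD_THRESHOLD = 5.0     # sided MAD score above this is flagged
MIN_MEASUREMENTS = 4    # rescue rule: at least 4 measurements
RESCUE_FRACTION = 0.9   # rescue rule: at least 90% flagged

def age_bin(age):
    # Assumption: 1-year bins below age 20, 10-year bins for adults.
    return int(age) if age < 20 else 20 + 10 * int((age - 20) // 10)

def sided_mad_scores(values):
    """Score each value by its distance from the median, divided by the MAD
    computed only from values on the same side of the median."""
    med = median(values)
    mad_low = median([med - v for v in values if v <= med])
    mad_high = median([v - med for v in values if v >= med])
    scores = []
    for v in values:
        dev = abs(v - med)
        mad = mad_low if v < med else mad_high
        # Assumption: a non-zero deviation with zero MAD counts as infinite.
        scores.append(0.0 if dev == 0 else (dev / mad if mad > 0 else float("inf")))
    return scores

def flag_outliers(rows):
    """Return one boolean per row: True if flagged within its age/sex/type bin."""
    bins = defaultdict(list)
    for i, r in enumerate(rows):
        bins[(age_bin(r["AGE"]), r["SEX"], r["MEASUREMENT_TYPE"])].append(i)
    flags = [False] * len(rows)
    for idx in bins.values():
        scores = sided_mad_scores([rows[i]["VALUE"] for i in idx])
        for i, score in zip(idx, scores):
            flags[i] = score > MAD_THRESHOLD
    return flags

def rescue(rows, flags):
    """Clear flags for individuals with enough measurements that are nearly all flagged."""
    per_person = defaultdict(list)
    for i, r in enumerate(rows):
        per_person[r["FINNGENID"]].append(i)
    rescued = list(flags)
    for idx in per_person.values():
        n_flagged = sum(flags[i] for i in idx)
        if len(idx) >= MIN_MEASUREMENTS and n_flagged / len(idx) >= RESCUE_FRACTION:
            for i in idx:
                rescued[i] = False
    return rescued

# Synthetic example: 41 typical weights, one isolated extreme value (X),
# and one individual (Y) whose four measurements are all extreme.
base = {"MEASUREMENT_TYPE": "WEIGHT", "SEX": "male"}
rows = [dict(base, FINNGENID=f"P{i}", AGE=45.0, VALUE=float(v))
        for i, v in enumerate(range(60, 101))]
rows.append(dict(base, FINNGENID="X", AGE=45.0, VALUE=300.0))
rows += [dict(base, FINNGENID="Y", AGE=a, VALUE=v)
         for a, v in [(41.0, 200.0), (43.0, 205.0), (45.0, 210.0), (47.0, 215.0)]]

flags = flag_outliers(rows)
final_flags = rescue(rows, flags)
kept = [r for r, flagged in zip(rows, final_flags) if not flagged]
```

In this example, the single extreme value for X stays flagged and is dropped, while all four measurements for Y are rescued because they are consistently extreme.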
Validation
Summary statistics and distributions were generated for each source before and after QC
UpSet plots were created to visualize source overlaps
Plots were generated showing distributions by age, sex, and source
Data notes
Duplicate measurements
Duplicate measurements from the same source and timepoint have been removed. When multiple sources provided the same measurement, the measurement from the highest-priority source (see deduplication order above) was retained.
Missing dates
APPROX_DATE is missing (NA) for measurements from: MINIMUM_longitudinal, minimum_info, EA3_FLD, EA3_PULMO, and EA3_DIAB sources.
Unit conversions
Heights originally in meters (PULMO EA3) were converted to centimeters
Infant weights >500 recorded before age 1 year (MINIMUM_longitudinal) were converted from grams to kilograms
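A minimal sketch of the two conversions follows. The infant weight rule uses the thresholds stated above. How heights in meters were identified is not described, so the cutoff used here (height values below 3 are treated as meters) is an assumption, and the function name is hypothetical.

```python
METERS_CUTOFF = 3.0  # Assumption: PULMO heights below this value are in meters.

def standardize_units(row):
    """Return a copy of the row with the documented unit conversions applied."""
    out = dict(row)
    source = out["MEASUREMENT_SOURCE"]
    mtype = out["MEASUREMENT_TYPE"]
    # Heights in meters (PULMO EA3) -> centimeters.
    if (source == "EA3_PULMO_physiological_measurement"
            and mtype == "HEIGHT" and out["VALUE"] < METERS_CUTOFF):
        out["VALUE"] = out["VALUE"] * 100
    # Infant weights >500 before age 1 (MINIMUM_longitudinal): grams -> kilograms.
    if (source == "MINIMUM_longitudinal"
            and mtype == "WEIGHT" and out["AGE"] < 1 and out["VALUE"] > 500):
        out["VALUE"] = out["VALUE"] / 1000
    return out

# Synthetic examples.
height = standardize_units({"MEASUREMENT_SOURCE": "EA3_PULMO_physiological_measurement",
                            "MEASUREMENT_TYPE": "HEIGHT", "AGE": 52.0, "VALUE": 1.75})
infant = standardize_units({"MEASUREMENT_SOURCE": "MINIMUM_longitudinal",
                            "MEASUREMENT_TYPE": "WEIGHT", "AGE": 0.1, "VALUE": 3500.0})
```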
Data quality issues resolved
PULMO EA3: Height and weight values were swapped in some records; these were automatically detected (height < weight) and corrected
PCOS EA3: Measurements where height and weight were identical (data entry error) were removed
PULMO EA3: Impossible values (height <100cm for age ≥10, weight <20kg for age >10, BMI <10 for age >2 or BMI >100) were removed after attempting height/weight swap correction
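The PULMO swap correction and impossible-value rules can be sketched as two small functions. This assumes height and weight from the same visit are available as a pair, and it reads the BMI rule as "BMI below 10 at age above 2, or BMI above 100 at any age"; both the function names and that reading are assumptions.

```python
def fix_swap(height_cm, weight_kg):
    """Swap the pair when height is smaller than weight, as described for PULMO EA3."""
    if height_cm < weight_kg:
        return weight_kg, height_cm
    return height_cm, weight_kg

def is_impossible(age, height_cm=None, weight_kg=None, bmi=None):
    """Apply the documented PULMO thresholds to whichever values are present."""
    if height_cm is not None and age >= 10 and height_cm < 100:
        return True
    if weight_kg is not None and age > 10 and weight_kg < 20:
        return True
    if bmi is not None and ((age > 2 and bmi < 10) or bmi > 100):
        return True
    return False

# Synthetic example: a swapped record is corrected first, then checked.
height_cm, weight_kg = fix_swap(80.0, 175.0)
keep = not is_impossible(40.0, height_cm=height_cm, weight_kg=weight_kg)
```

The swap is applied before the thresholds, matching the order described above, so that a correctable record is not discarded as impossible.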
Age precision
Age is recorded as a continuous variable (decimal years) calculated from birth date and measurement date when available.
Coverage
Most comprehensive coverage in adults (HILMO/AvoHILMO captures most healthcare visits)
Pediatric measurements primarily from MINIMUM longitudinal and specialized studies
Single timepoint measurements available for most individuals from biobank recruitment (minimum_info)
Longitudinal measurements available for individuals with multiple healthcare encounters or study visits
Limitations
Missing data is not random; measurements are more likely to be recorded for individuals with healthcare encounters
Different sources can have different measurement quality and protocols
Some sources (e.g., rheumatology register) only provide BMI, not separate weight and height