# Height/weight/BMI Data

This file contains anthropometric measurements (weight, height, and BMI) from FinnGen registers in longitudinal format, with one row per measurement event per person. Measurements from multiple data sources have been integrated, deduplicated by prioritizing higher-quality sources, and quality-controlled using robust, MAD-based outlier detection.

Location: /finngen/library-red/finngen\_R13/harmonized\_data/weight\_height/

### File structure

#### Data

| File                                          | Description                                                            |
| --------------------------------------------- | ---------------------------------------------------------------------- |
| Harmonized\_weight\_height\_bmi\_filtered.tsv | Weight, height, and BMI measurements harmonized from FinnGen registers |

### Data description

**Number of persons:** 479,097\
**Number of measurement events (post-QC):** 7,504,253\
**Number of measurement events (pre-QC):** 9,962,597\
**Measurement events removed by deduplication and QC (the vast majority were duplicates across sources):** 2,458,344 (24.7% of pre-QC events)

#### Measurement breakdown

| Measurement Type | Number of Events |
| ---------------- | ---------------- |
| WEIGHT           | 3,190,090        |
| HEIGHT           | 2,083,191        |
| BMI              | 2,230,972        |

#### Source breakdown

| Source                                 | Number of Events |
| -------------------------------------- | ---------------- |
| HILMO\_AVOHILMO\_extended              | 3,666,682        |
| EA3\_FLD\_physiological\_measurement   | 1,339,220        |
| minimum\_info                          | 1,074,080        |
| EA3\_PULMO\_physiological\_measurement | 568,155          |
| EA3\_PCOS\_physiological\_measurement  | 430,861          |
| Spirometry\_raw                        | 222,899          |
| Rheumatology\_register                 | 112,684          |
| MINIMUM\_longitudinal                  | 85,619           |
| KANTA\_unmapped                        | 3,910            |
| KANTA\_mapped                          | 143              |

### Data fields

| Field               | Description                                                                      |
| ------------------- | -------------------------------------------------------------------------------- |
| FINNGENID           | FinnGen individual identifier                                                    |
| MEASUREMENT\_TYPE   | Type of measurement: WEIGHT, HEIGHT, or BMI                                      |
| MEASUREMENT\_SOURCE | The source register/dataset of the measurement (see Data sources below)          |
| AGE                 | Age at measurement (years)                                                       |
| VALUE               | Measurement value (units: kg for weight, cm for height, kg/m² for BMI)           |
| APPROX\_DATE        | Approximate date of measurement (YYYY-MM-DD format). May be NA for some sources. |
| SEX                 | Sex of the individual (male/female)                                              |
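
As a minimal sketch of reading the file, assuming it is tab-separated with a header row that uses the field names above, the rows can be parsed with Python's standard library. The column order, the sample identifiers and values, and the helper name `read_measurements` are illustrative assumptions, not part of the dataset.

```python
import csv
import io

NUMERIC_FIELDS = ("AGE", "VALUE")


def read_measurements(handle):
    """Yield one dict per measurement event from a tab-separated handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for field in NUMERIC_FIELDS:
            row[field] = float(row[field])
        # APPROX_DATE may be NA for some sources; map it to None.
        if row["APPROX_DATE"] in ("NA", ""):
            row["APPROX_DATE"] = None
        yield row


# Made-up sample in the documented column layout (not real data).
sample = io.StringIO(
    "FINNGENID\tMEASUREMENT_TYPE\tMEASUREMENT_SOURCE\tAGE\tVALUE\tAPPROX_DATE\tSEX\n"
    "FG0000001\tWEIGHT\tminimum_info\t45.20\t82.5\tNA\tmale\n"
    "FG0000001\tHEIGHT\tminimum_info\t45.20\t180.0\tNA\tmale\n"
    "FG0000002\tBMI\tRheumatology_register\t61.75\t27.3\t2015-03-10\tfemale\n"
)
rows = list(read_measurements(sample))

# Keep only weight measurements (VALUE is in kg for WEIGHT rows).
weights = [r for r in rows if r["MEASUREMENT_TYPE"] == "WEIGHT"]
```

To read the real file, the same function could be given `open(path, newline="")`, where `path` points to the TSV in the location listed at the top of this page.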

### Data sources

Data has been integrated from 10 different FinnGen data sources:

| MEASUREMENT\_SOURCE value              | Register source                                                                                                                                                                                                                                    | Description                                                                                                                                           |
| -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| HILMO\_AVOHILMO\_extended              | Hospital discharge register (HILMO) and primary care register (AvoHILMO)                                                                                                                                                                           | Extended anthropometric measurements extracted from structured fields in inpatient, outpatient, and primary care records                              |
| EA3\_FLD\_physiological\_measurement   | FinnGen expansion area 3 - Fatty Liver Disease ([EA3 Projects](https://docs.finngen.fi/finngen-data-specifics/expansion-area-3-ea3-projects))                                                                                                      | Longitudinal weight, height, and BMI measurements collected during FLD research visits                                                                |
| minimum\_info                          | FinnGen minimum dataset - baseline measurements                                                                                                                                                                                                    | Weight, height, and BMI recorded at biobank recruitment (single timepoint per individual)                                                             |
| EA3\_PULMO\_physiological\_measurement | FinnGen expansion area 3 - Pulmonary diseases ([EA3 Projects](https://docs.finngen.fi/finngen-data-specifics/expansion-area-3-ea3-projects))                                                                                                       | Longitudinal measurements from pulmonary disease research visits. Height/weight swaps corrected, extreme outliers removed.                            |
| EA3\_PCOS\_physiological\_measurement  | FinnGen expansion area 3 - Polycystic Ovary Syndrome ([EA3 Projects](https://docs.finngen.fi/finngen-data-specifics/expansion-area-3-ea3-projects))                                                                                                | Measurements from women's health research visits. Records where height equals weight removed as data entry errors.                                    |
| Spirometry\_raw                        | Biobank spirometry tests ([Spirometry Data](https://docs.finngen.fi/finngen-data-specifics/disease-specific-task-force-data/spirometry-data-pulmonary-task-force))                                                                                 | Height and weight recorded during spirometry procedures from multiple biobanks                                                                        |
| Rheumatology\_register                 | Finnish rheumatology register ([Rheumatology Register](https://docs.finngen.fi/finngen-data-specifics/disease-specific-task-force-data/finnish-rheumatology-quality-register))                                                                     | BMI measurements only from rheumatology patient visits                                                                                                |
| MINIMUM\_longitudinal                  | FinnGen minimum dataset - longitudinal measurements ([Minimum Longitudinal Data](https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/minimum-longitudinal-data)) | Multiple weight and height measurements over time from biobank records. Infant weights converted from grams to kg.                                    |
| KANTA\_unmapped                        | Kanta laboratory system - manually mapped entries ([Kanta Lab Values](https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values))                     | Laboratory measurements not automatically mapped to OMOP codes, manually identified as anthropometric measurements                                    |
| KANTA\_mapped                          | Kanta laboratory system - OMOP-mapped entries ([Kanta Lab Values](https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values))                         | Laboratory measurements automatically mapped to OMOP concept IDs for weight (3025315, 3013762), height (3036277, 3023540, 3019171), and BMI (3038553) |

### Methods

#### Data integration

1. **Source loading:** All measurement sources were loaded and standardized to a common format (FINNGENID, MEASUREMENT\_TYPE, AGE, VALUE, VALUE\_UNIT, MEASUREMENT\_SOURCE, APPROX\_DATE).
2. **Unit standardization:** Values were converted to standard units (weight: kg, height: cm, BMI: kg/m²).
3. **Data cleaning:** Source-specific quality issues were resolved (e.g., height/weight swaps in PULMO data, infant weight gram-to-kg conversion, removal of identical height/weight values in PCOS data).
4. **Filtering to R13 participants:** Only individuals present in the FinnGen R13 data freeze were included.

#### Deduplication

When multiple sources provided the same measurement (same individual, measurement type, and age), measurements were deduplicated by prioritizing sources in this order:

1. HILMO\_AVOHILMO\_extended
2. KANTA\_mapped
3. KANTA\_unmapped
4. MINIMUM\_longitudinal
5. minimum\_info
6. Spirometry\_raw
7. Rheumatology\_register
8. EA3\_FLD\_physiological\_measurement
9. EA3\_PCOS\_physiological\_measurement
10. EA3\_PULMO\_physiological\_measurement

Age was rounded to 2 decimal places for deduplication purposes.
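
The priority rule above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: rows are assumed to be dicts keyed by the documented field names, and `deduplicate`, `SOURCE_PRIORITY`, and `RANK` are hypothetical names.

```python
# Priority order as listed above (index 0 = highest priority).
SOURCE_PRIORITY = [
    "HILMO_AVOHILMO_extended",
    "KANTA_mapped",
    "KANTA_unmapped",
    "MINIMUM_longitudinal",
    "minimum_info",
    "Spirometry_raw",
    "Rheumatology_register",
    "EA3_FLD_physiological_measurement",
    "EA3_PCOS_physiological_measurement",
    "EA3_PULMO_physiological_measurement",
]
RANK = {source: i for i, source in enumerate(SOURCE_PRIORITY)}


def deduplicate(rows):
    """Keep one row per (individual, measurement type, age rounded to 2 dp)."""
    best = {}
    for row in rows:
        key = (row["FINNGENID"], row["MEASUREMENT_TYPE"], round(row["AGE"], 2))
        kept = best.get(key)
        # A lower rank means a higher-priority source.
        if kept is None or RANK[row["MEASUREMENT_SOURCE"]] < RANK[kept["MEASUREMENT_SOURCE"]]:
            best[key] = row
    return list(best.values())


# Made-up example: the same weight reported by two sources at (nearly) the same age.
example = [
    {"FINNGENID": "FG1", "MEASUREMENT_TYPE": "WEIGHT", "AGE": 45.201,
     "VALUE": 82.0, "MEASUREMENT_SOURCE": "minimum_info"},
    {"FINNGENID": "FG1", "MEASUREMENT_TYPE": "WEIGHT", "AGE": 45.199,
     "VALUE": 82.5, "MEASUREMENT_SOURCE": "HILMO_AVOHILMO_extended"},
    {"FINNGENID": "FG1", "MEASUREMENT_TYPE": "HEIGHT", "AGE": 45.201,
     "VALUE": 180.0, "MEASUREMENT_SOURCE": "minimum_info"},
]
deduplicated = deduplicate(example)
```

In this example both weight rows round to age 45.2, so only the HILMO\_AVOHILMO\_extended row survives, while the height row is untouched because it has a different measurement type.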

#### Quality control

**Population-level outlier detection:**

* Measurements were binned by age windows (granular 1-year bins for ages 0-20, broader bins for adults), sex, and measurement type
* Within each bin, median and Median Absolute Deviation (MAD) were calculated
* **Sided MAD scores** were computed separately for values above and below the median
* Measurements with sided MAD score > 5 were flagged as potential outliers
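
The exact definition of the sided MAD score is not spelled out here, so the sketch below makes assumptions: the MAD is computed separately from the values at or below the median and from those at or above it (sometimes called a "double MAD"), no scaling constant is applied, and the score is the absolute deviation divided by the MAD of the matching side. The 10-year adult bin width in `age_bin` is also a hypothetical choice, since only "broader bins" is stated.

```python
from statistics import median


def age_bin(age, adult_width=10):
    """1-year bins below age 20; adult_width is an assumed value."""
    if age < 20:
        return (int(age), int(age) + 1)
    start = 20 + int((age - 20) // adult_width) * adult_width
    return (start, start + adult_width)


def sided_mad_scores(values):
    """Score each value against the MAD of its own side of the median."""
    m = median(values)
    lower_mad = median([m - v for v in values if v <= m])
    upper_mad = median([v - m for v in values if v >= m])
    scores = []
    for v in values:
        mad = lower_mad if v < m else upper_mad
        deviation = abs(v - m)
        if deviation == 0:
            scores.append(0.0)
        elif mad == 0:
            # Any deviation from a zero-spread side is treated as extreme.
            scores.append(float("inf"))
        else:
            scores.append(deviation / mad)
    return scores


def flag_outliers(values, threshold=5.0):
    """Flag values whose sided MAD score exceeds the threshold."""
    return [score > threshold for score in sided_mad_scores(values)]


# Made-up weights (kg) standing in for one age/sex/measurement-type bin.
weights_kg = [60, 65, 70, 75, 80, 85, 90, 400]
flags = flag_outliers(weights_kg)
```

In practice the scores would be computed once per group, with groups keyed by something like `(age_bin(AGE), SEX, MEASUREMENT_TYPE)`. Using separate MADs for each side keeps a skewed distribution, such as adult weight, from inflating the threshold on its short tail.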

**Outlier rescue:**

* Individuals with ≥4 measurements, of which ≥90% were flagged as outliers, were "rescued" (their outlier flags were removed)
* Rationale: these likely represent true biological extremes rather than measurement errors
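
A minimal sketch of the rescue rule follows. It assumes the counts are taken per individual across the rows passed in; whether the pipeline counts per individual or per individual and measurement type is not stated here. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def rescue_outliers(person_ids, flags, min_measurements=4, min_flagged_fraction=0.9):
    """Return a copy of flags with rescued individuals' flags cleared."""
    indices_by_person = defaultdict(list)
    for idx, person_id in enumerate(person_ids):
        indices_by_person[person_id].append(idx)

    rescued = list(flags)
    for indices in indices_by_person.values():
        n_total = len(indices)
        n_flagged = sum(1 for i in indices if flags[i])
        if n_total >= min_measurements and n_flagged / n_total >= min_flagged_fraction:
            # Consistently extreme values are treated as genuine, not as errors.
            for i in indices:
                rescued[i] = False
    return rescued


# Made-up flags: A is consistently extreme, B has one odd value, C has too few rows.
ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 2
outlier_flags = [True] * 4 + [True, False, False, False] + [True, True]
after_rescue = rescue_outliers(ids, outlier_flags)
```

Here individual A is rescued (4 of 4 flagged), B keeps its single flag (1 of 4 flagged), and C keeps both flags because it has fewer than 4 measurements.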

**Final filtering:**

* Measurements flagged as outliers (after rescue) were excluded from the final dataset

#### Validation

* Summary statistics and distributions were generated for each source before and after QC
* UpSet plots were created to visualize source overlaps
* Plots were generated showing distributions by age, sex, and source

### Data notes

#### Duplicate measurements

Duplicate measurements from the same source and timepoint have been removed. When multiple sources provided the same measurement, only the one from the highest-priority source (see the deduplication order above) was retained.

#### Missing dates

APPROX\_DATE is missing (NA) for measurements from: MINIMUM\_longitudinal, minimum\_info, EA3\_FLD, EA3\_PULMO, and EA3\_DIAB sources.

#### Unit conversions

* Heights originally in meters (PULMO EA3) were converted to centimeters
* Weight values above 500 recorded before 1 year of age (MINIMUM\_longitudinal) were treated as grams and converted to kilograms
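
The two conversions can be sketched as below. The function names are hypothetical, and how the pipeline identified which PULMO heights were recorded in meters is not described here, so `height_m_to_cm` only shows the conversion itself.

```python
def height_m_to_cm(height_m):
    """Convert a height recorded in meters to centimeters."""
    return height_m * 100.0


def infant_weight_to_kg(value, age_years):
    """Treat weights above 500 recorded before age 1 as grams."""
    if age_years < 1 and value > 500:
        return value / 1000.0
    return value


birth_weight_kg = infant_weight_to_kg(3500, 0.0)   # recorded in grams
later_weight_kg = infant_weight_to_kg(9.8, 0.9)    # already in kg, unchanged
adult_height_cm = height_m_to_cm(1.75)
```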

#### Data quality issues resolved

1. **PULMO EA3:** Height and weight values were swapped in some records; these were automatically detected (height < weight) and corrected
2. **PCOS EA3:** Measurements where height and weight were identical (data entry error) were removed
3. **PULMO EA3:** Impossible values (height < 100 cm for age ≥ 10, weight < 20 kg for age > 10, BMI < 10 for age > 2, or BMI > 100) were removed after attempting height/weight swap correction
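
The swap correction and the plausibility thresholds listed above can be sketched as follows. This assumes height is already in cm and weight in kg, that the BMI > 100 rule applies at any age, and that the BMI is passed in by the caller. Whether the pipeline used the recorded BMI or recomputed it is not stated here. All function names are hypothetical.

```python
def correct_swap(height_cm, weight_kg):
    """Swap the two values back when height is smaller than weight."""
    if height_cm < weight_kg:
        return weight_kg, height_cm
    return height_cm, weight_kg


def bmi_from(height_cm, weight_kg):
    """Standard BMI definition: weight (kg) divided by height (m) squared."""
    return weight_kg / (height_cm / 100.0) ** 2


def is_impossible(height_cm, weight_kg, bmi, age):
    """Apply the thresholds from the list above to one record."""
    return (
        (age >= 10 and height_cm < 100)
        or (age > 10 and weight_kg < 20)
        or (age > 2 and bmi < 10)
        or bmi > 100
    )


# Made-up record where height and weight were entered in each other's fields.
height_cm, weight_kg = correct_swap(80, 175)
bmi = bmi_from(height_cm, weight_kg)
keep_record = not is_impossible(height_cm, weight_kg, bmi, age=40)
```

Running the swap correction first matters: without it, the example record would have a height of 80 cm at age 40 and be discarded as impossible instead of being repaired.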

#### Age precision

Age is recorded as a continuous variable (decimal years) calculated from birth date and measurement date when available.

#### Coverage

* Most comprehensive coverage in adults (HILMO/AvoHILMO captures most healthcare visits)
* Pediatric measurements come primarily from MINIMUM\_longitudinal and specialized studies
* Single timepoint measurements available for most individuals from biobank recruitment (minimum\_info)
* Longitudinal measurements available for individuals with multiple healthcare encounters or study visits

#### Limitations

* Missing data is not random; measurements are more likely to be recorded for individuals with healthcare encounters
* Different sources can have different measurement quality and protocols
* Some sources (e.g., rheumatology register) only provide BMI, not separate weight and height
