# FAQ

## Q: Which years are covered in the data?

The data covers the years 2014 to 2025. However, not all lab data providers sent their data right from 2014. Most of the data comes online starting in 2018.

<figure><img src="/files/OhI11JWS2JIutT80fh6T" alt=""><figcaption></figcaption></figure>

## Q: What part of the FinnGen cohort also has data in the Kanta Lab dataset?

The Kanta Lab data was requested for all FinnGen participants. However, because the Kanta Lab data spans 2014 to 2025, only FinnGen participants who were alive during this period and have lab records in Kanta are in the dataset.

In total, there are 483k FinnGen participants with data in the Kanta Lab dataset, with an average of 533 tests per individual.

## Q: How do we know the reference range for each test?

Reference range is a free text string that depends on:

* age of the individual
* sex of the individual
* chemistry/detection method used by the particular lab
* units of the test
* date of the test (ranges may change over time or due to updated public health guidelines for tests such as cholesterol)

Unfortunately, only 1/3 of the events include a reference range specific to these combined values. Because so many tests lack this information, we have created an imputed range for each OMOP ID, computed from events where `MEASUREMENT_VALUE`, `MEASUREMENT_UNIT` and `TEST_OUTCOME` are all present. See the `TEST_OUTCOME_IMPUTED` question below for more about this.
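As a rough illustration of the selection step described above, the sketch below keeps only events where all three fields are present. The example rows, the unit string and the assumption that missing values appear as an empty string or `NA` are made up for the example; this is not the actual preprocessing code.

```python
# Hypothetical rows: column names follow the dataset, values are invented.
events = [
    {"MEASUREMENT_VALUE": "5.2", "MEASUREMENT_UNIT": "e9/l", "TEST_OUTCOME": "N"},
    {"MEASUREMENT_VALUE": "13.1", "MEASUREMENT_UNIT": "e9/l", "TEST_OUTCOME": "H"},
    {"MEASUREMENT_VALUE": "4.8", "MEASUREMENT_UNIT": "", "TEST_OUTCOME": "N"},
    {"MEASUREMENT_VALUE": "6.0", "MEASUREMENT_UNIT": "e9/l", "TEST_OUTCOME": "NA"},
]

REQUIRED = ("MEASUREMENT_VALUE", "MEASUREMENT_UNIT", "TEST_OUTCOME")
MISSING = {"", "NA"}  # assumption: how missing values are encoded


def is_complete(event):
    """Return True when value, unit and outcome are all present."""
    return all(event.get(column, "") not in MISSING for column in REQUIRED)


complete = [event for event in events if is_complete(event)]
print(f"{len(complete)} of {len(events)} events have value, unit and outcome")
```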

## Q: How reliable are the time measurements from `APPROX_EVENT_DATETIME` in the data?

The most common times for a test are 7:00 a.m. and 7:01 a.m. This is likely the time at which tests are ordered, while the tests themselves may be carried out throughout the morning. Other test times should be a more reliable indicator of when the test was actually taken.

While the date component of the `APPROX_EVENT_DATETIME` column [is randomized for privacy blurring](/finngen-data-specifics/finnish-health-registers-and-medical-coding/data-masking-blurring-of-visit-dates.md) (stable number of days per FINNGENID), the hour:minutes component comes from the raw data and is not randomized.
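Since only the time of day is taken directly from the raw data, one option is to work with the hour:minute part and flag the two most common times separately. The sketch below assumes timestamps formatted like `2019-03-14 07:00:00`; the format string and the helper names are assumptions, so adjust them to what you see in your copy of the data.

```python
from datetime import datetime

# Assumption: timestamps look like "2019-03-14 07:00:00".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
# The two most common times, which likely reflect when tests were ordered.
LIKELY_ORDER_TIMES = {(7, 0), (7, 1)}


def time_of_day(approx_event_datetime):
    """Return (hour, minute); the date part is blurred, the time part is not."""
    parsed = datetime.strptime(approx_event_datetime, TIME_FORMAT)
    return parsed.hour, parsed.minute


def is_likely_order_time(approx_event_datetime):
    """Flag timestamps that fall on one of the two most common times."""
    return time_of_day(approx_event_datetime) in LIKELY_ORDER_TIMES


for stamp in ["2019-03-14 07:00:00", "2019-03-14 07:01:00", "2019-03-14 13:45:00"]:
    print(stamp, is_likely_order_time(stamp))
```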

## Q: Which measurement value column should I use for my analysis?

The data contains two families of values:

* harmonized data
* extracted data

Harmonized data is shown in the column `MEASUREMENT_VALUE_HARMONIZED` and is the final product of the harmonization step, where source values and source units are taken into consideration in order to harmonize the values to a target unit. `MEASUREMENT_VALUE_EXTRACTED` values instead come from a different column where values are reported in a free text format (most often without a unit, which in any case is not considered). For these values the numerical part of the text is extracted and kept as is. The two columns are then merged into `MEASUREMENT_VALUE_MERGED`, a single numerical column that can, in principle, be used for analysis. The figure below shows the relative composition of the column by year and by harmonized/extracted origin. Extracted values were more prevalent at the beginning of the data history and now represent \~15% of the yearly values.

<figure><img src="/files/WHsrtUEMMXriEV9FR50v" alt=""><figcaption></figcaption></figure>

We have extensively analyzed the extracted value column statistically and found that the distribution of values is generally compatible with the target unit distribution. In some cases we even found a bias in how values were reported: for certain families of IDs where negative values are possible (e.g. `Base excess`), the harmonized column is populated (almost) exclusively with positive values and the extracted column (almost) exclusively with negative values. In most other scenarios the differences in distributions were due to sampling bias, as the extracted values were a tiny fraction of the harmonized ones and thus could not reproduce the same statistical features closely. For other types of mismatches (e.g. a mix of units) we resort to other methods (see [QC section in the Data description](/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values/data.md#qc)) to further harmonize the data and make it as standardized as possible.

Therefore, we ultimately recommend using the `MEASUREMENT_VALUE_MERGED` column, with the caveat that you should check the `QC_PASS` and `QC_NOTES` columns to see whether the ID in question required heavy work to make it usable. In such cases it may be better to decide case by case whether to include the extracted values at all.
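A minimal sketch of that case-by-case choice, assuming rows are available as dictionaries keyed by column name: the hypothetical helper below returns either the merged values or only the harmonized ones for a given OMOP ID, and collects the QC notes for inspection. The example rows, the ID `111`, the note text and the encoding of missing values are all invented for illustration.

```python
MISSING = {"", "NA"}  # assumption: how missing values are encoded

# Hypothetical rows for a single OMOP ID.
rows = [
    {"OMOP_CONCEPT_ID": "111", "MEASUREMENT_VALUE_HARMONIZED": "2.1",
     "MEASUREMENT_VALUE_MERGED": "2.1", "QC_NOTES": ""},
    {"OMOP_CONCEPT_ID": "111", "MEASUREMENT_VALUE_HARMONIZED": "NA",
     "MEASUREMENT_VALUE_MERGED": "-3.4", "QC_NOTES": "example note"},
    {"OMOP_CONCEPT_ID": "222", "MEASUREMENT_VALUE_HARMONIZED": "7.0",
     "MEASUREMENT_VALUE_MERGED": "7.0", "QC_NOTES": ""},
]


def values_for_analysis(rows, omop_id, include_extracted=True):
    """Return numeric values for one OMOP ID from the chosen column."""
    column = ("MEASUREMENT_VALUE_MERGED" if include_extracted
              else "MEASUREMENT_VALUE_HARMONIZED")
    return [float(row[column]) for row in rows
            if row["OMOP_CONCEPT_ID"] == omop_id and row[column] not in MISSING]


def qc_notes(rows, omop_id):
    """Collect the distinct non-empty QC notes recorded for one OMOP ID."""
    return {row["QC_NOTES"] for row in rows
            if row["OMOP_CONCEPT_ID"] == omop_id and row["QC_NOTES"] not in MISSING}


print(values_for_analysis(rows, "111"))
print(values_for_analysis(rows, "111", include_extracted=False))
print(qc_notes(rows, "111"))
```

Comparing the two value lists for an ID flagged in the QC columns gives a quick view of how much the extracted values contribute before deciding whether to keep them.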

## Q: Can I differentiate between the time a test is ordered vs. the time a test is taken?

No. The original data comes from different sources (different lab centers, different IT systems, etc.) and has undergone several data processing stages before reaching FinnGen. So, unfortunately, by the time the raw data reaches FinnGen we do not know what the time information refers to, and it is most likely inconsistent across the original data sources.

## Q: What is the meaning of the different `TEST_OUTCOME` and `TEST_OUTCOME_IMPUTED` values?

`TEST_OUTCOME` values may be recorded as:

(**N**)ormal

(**L**) & (**LL**) for low and very low

(**H**) & (**HH**) for high and very high

(**A**) & (**AA**) for abnormal and very abnormal. Note that if a test has a normal range of 10–20, (**A**)bnormal can be marked for results <10 as well as >20.

(**NA**) for a missing outcome. In our QC of the data, most measured values with an outcome listed as NA are in the normal range; however, there is no guarantee of that.

`TEST_OUTCOME_IMPUTED`

Unfortunately, the raw data did not include defined reference ranges for all the lab tests. To approximate the ranges, we looked at tests that have the trio of measurement value, measurement unit and outcome, and approximated a low and a high bound for each type of lab test. We then applied these "imputed" high and low values to score tests which did not have an outcome. This means that there can potentially be a mismatch between the two columns: the original column may reflect the patient's condition, where normality depends on the history of the individual, while the imputed one is a more general, statistical definition.
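The scoring step can be pictured with the sketch below. The bounds are hypothetical, and treating a value that sits exactly on a bound as normal is an assumption; the actual procedure lives in the preprocessing repository.

```python
def impute_outcome(value, low, high):
    """Score a value against imputed bounds.

    Assumption: values exactly on a bound are treated as normal.
    """
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"


# Hypothetical bounds for one type of lab test.
LOW, HIGH = 10.0, 20.0

for value in (8.5, 15.0, 20.0, 23.7):
    print(value, impute_outcome(value, LOW, HIGH))

# A recorded outcome can differ from the imputed one, e.g. when the clinician
# judged the result against the history of the individual.
recorded_outcome = "A"
print(recorded_outcome != impute_outcome(15.0, LOW, HIGH))
```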

Positive and negative are not concepts that exist in the raw data. In most cases you would want to look for N (normal) vs. A (abnormal), AA (very abnormal), H (high) or HH (very high) as analogous to negative vs. positive.\
\
Thresholds used to define H/L for the imputed column can be found [here](https://github.com/FINNGEN/kanta_lab_preprocessing/blob/master/core/data/abnormality_estimation.table.tsv).
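The N versus A/AA/H/HH comparison suggested above can be written as a small mapping. This is only a sketch: the helper name is hypothetical, and leaving L, LL and NA unmapped is a choice made for the example, since whether a low result counts as "positive" depends on the test.

```python
POSITIVE_LIKE = {"A", "AA", "H", "HH"}
NEGATIVE_LIKE = {"N"}


def as_positive(test_outcome):
    """Map an outcome code to True (positive-like), False (negative-like) or None.

    Codes outside the two sets (e.g. L, LL, NA) are left undecided.
    """
    if test_outcome in POSITIVE_LIKE:
        return True
    if test_outcome in NEGATIVE_LIKE:
        return False
    return None


for code in ("N", "A", "HH", "L", "NA"):
    print(code, as_positive(code))
```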

## Q: What is "vierimittaus"?

Vierimittaus refers to point-of-care testing (POCT), also called near-patient testing or bedside testing. It describes the sampling setting, i.e. whether the sample was taken at a laboratory, at the bedside or by the patients themselves.

## Q: When will the Kanta Lab dataset be updated?

The data is, by nature, updated yearly when new data is received. If we collect enough corrections/bugs/upgrades we might have a 2.0 release in the summer.

## Q: Is it possible to know which test center processed the sample and made the test measurement?

No. Unfortunately, the data that FinnGen received does not say which test center performed the test, so this information is not currently available to FinnGen users either. The column `CODING_SYSTEM_ORG` is somewhat related (we think it has to do with the uploader of the data), but it is mostly empty and seems to be relevant only for regional data (i.e. `TEST_ID_IS_NATIONAL=0`). In practice, if a lot of "strange" values share the same `CODING_SYSTEM_OID|ORG`, it might indicate a batch issue rather than a problem with a single value.
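One way to look for such a batch pattern is to count flagged rows per `CODING_SYSTEM_ORG` value. The sketch below is illustrative only: the rows, the organisation labels and the rule for what counts as "strange" are hypothetical and should be replaced with a rule that fits the test you are studying.

```python
from collections import Counter


def strange_counts_by_org(rows, is_strange):
    """Count rows flagged by `is_strange` for each CODING_SYSTEM_ORG value."""
    return Counter(row["CODING_SYSTEM_ORG"] for row in rows if is_strange(row))


# Hypothetical rows; an empty string stands for a missing organisation.
rows = [
    {"CODING_SYSTEM_ORG": "ORG_A", "MEASUREMENT_VALUE_MERGED": "9999"},
    {"CODING_SYSTEM_ORG": "ORG_A", "MEASUREMENT_VALUE_MERGED": "9999"},
    {"CODING_SYSTEM_ORG": "ORG_B", "MEASUREMENT_VALUE_MERGED": "5.1"},
    {"CODING_SYSTEM_ORG": "", "MEASUREMENT_VALUE_MERGED": "9999"},
]


def is_strange(row):
    # Hypothetical rule: implausibly large values for this made-up test.
    return float(row["MEASUREMENT_VALUE_MERGED"]) > 1000


counts = strange_counts_by_org(rows, is_strange)
print(counts.most_common())
```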

## Q: Why are some rows duplicates by `FINNGENID`, `APPROX_EVENT_DATETIME`, `OMOP_CONCEPT_ID` and `MEASUREMENT_VALUE`?

There is a tiny fraction of the data (<0.1%) where the same result is, technically speaking, uploaded in two different formats: once via the "regular" source data that gets harmonized and once through the measurement free text, which ultimately leads to duplicate entries in the merged column. This is part of a more general problem where some fields of two entries coincide but others (e.g. `TEST_OUTCOME`) may not, and we haven't yet come up with a final solution for these entries that would work for all use cases. For numerical entries specifically, it is relatively easy to check for duplicates on `FINNGENID`, datetime and merged value to see how much one's cohort is impacted.
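A minimal sketch of such a check, assuming the rows of a cohort are available as dictionaries keyed by column name. The example IDs and values are invented, and using `MEASUREMENT_VALUE_MERGED` as the value column is an assumption based on the description above.

```python
from collections import Counter

KEY_COLUMNS = (
    "FINNGENID",
    "APPROX_EVENT_DATETIME",
    "OMOP_CONCEPT_ID",
    "MEASUREMENT_VALUE_MERGED",
)


def duplicate_keys(rows):
    """Return each key that occurs more than once, with its number of rows."""
    counts = Counter(tuple(row[column] for column in KEY_COLUMNS) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: the first two share all four key columns.
rows = [
    {"FINNGENID": "FG_EXAMPLE_1", "APPROX_EVENT_DATETIME": "2019-03-14 07:00:00",
     "OMOP_CONCEPT_ID": "111", "MEASUREMENT_VALUE_MERGED": "5.2"},
    {"FINNGENID": "FG_EXAMPLE_1", "APPROX_EVENT_DATETIME": "2019-03-14 07:00:00",
     "OMOP_CONCEPT_ID": "111", "MEASUREMENT_VALUE_MERGED": "5.2"},
    {"FINNGENID": "FG_EXAMPLE_2", "APPROX_EVENT_DATETIME": "2020-01-02 09:30:00",
     "OMOP_CONCEPT_ID": "111", "MEASUREMENT_VALUE_MERGED": "6.0"},
]

duplicates = duplicate_keys(rows)
extra_rows = sum(n - 1 for n in duplicates.values())
print(f"{extra_rows} of {len(rows)} rows repeat an earlier key")
```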

## Q: Some of the units are unfamiliar to me, how do I know what they are?

A common one you will see is e9. In blood cell counts, "e9/L" means "10^9 per liter" or "billion per liter". For example, a white blood cell count of 5.2 e9/L means 5.2 billion cells per liter. This unit is part of the International System of Units (SI) and is widely used in many countries for standardizing laboratory results. It is equivalent to the older unit "G/L" (giga per liter) or "10^9/L". Similarly, e12/L means "10^12 per liter" or "trillion per liter".
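To make the arithmetic concrete, the sketch below converts a value reported in an `e<N>/L` unit into an absolute count per liter. The helper name and the accepted spelling of the unit are assumptions for the example; units in the data may be written differently.

```python
import re

# Matches units such as "e9/L" or "e12/l" (assumed spelling).
UNIT_PATTERN = re.compile(r"^e(\d+)/l$", re.IGNORECASE)


def count_per_liter(value, unit):
    """Convert a value in an e<N>/L unit to an absolute count per liter."""
    match = UNIT_PATTERN.match(unit.strip())
    if match is None:
        raise ValueError(f"not an e<N>/L unit: {unit!r}")
    exponent = int(match.group(1))
    return value * 10 ** exponent


# 5.2 e9/L is 5.2 billion cells per liter.
print(f"{count_per_liter(5.2, 'e9/L'):.3g} cells per liter")
print(f"{count_per_liter(4.5, 'e12/L'):.3g} cells per liter")
```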

The full list of units can be found here - <https://github.com/FINNGEN/kanta_lab_harmonisation_public/blob/main/MAPPING_TABLES/UNITSfi.usagi.csv>

## Q: I have found some negative measurement values, are they a mistake?

Actually, there are a number of lab tests that are expected to have negative values! Checking for negative values is something that our OHDSI Achilles QC testing suite examines and any negative values you see in the data have been checked and should be realistic. Here are some tests that are expected to have negative values:

* Calcium Balance: calcium balance studies can show negative values when there's more calcium excretion than intake.
* Acid-Base Balance: Tests like Base Excess (BE) can be negative, indicating metabolic acidosis.
* Iron Balance Studies: Like calcium, iron balance can be negative if more iron is lost than absorbed.
* Anion Gap: While usually positive, it can be negative in rare cases like bromide intoxication.

## Q: How was the Kanta data preprocessed before it got to FinnGen users?

The data is originally extracted from Kanta and sent to THL for pseudonymisation. THL then sends the pseudonymised data to FinnGen. At this stage, FinnGen runs a preprocessing step that includes row deduplication, column subsetting, data cleaning, harmonization with OMOP, and more. Extensive documentation about the preprocessing of the Kanta lab data by FinnGen can be found here: <https://github.com/FINNGEN/kanta_lab_preprocessing/tree/master>.
