Analysis
In addition to the core Kanta data, a second data set is released that is meant for analysis. This file focuses on numerical values and manipulates or removes entries for downstream analysis, for example by flagging or removing problematic values. It lacks some columns from the original data but also contains new ones added for analysis purposes. In this way, pure data munging and harmonization are kept separate from the numerical processing of the data for analysis.
Summary
The pipeline is available on GitHub (https://github.com/FINNGEN/kanta_lab_preprocessing/), where technical information on how the raw data was processed can be found.
A quick summary:
- The MEASUREMENT_FREE_TEXT column is manipulated to extract shareable information:
  - Where the original measurement value is missing and the free text is available, we attempt to extract numerical values from the free text if they match certain patterns. If, after some string manipulation, we are left with a pure number, we cast it from string to float and merge it with the original values into the MEASUREMENT_VALUE_EXTRACTED column. A boolean column, IS_VALUE_EXTRACTED, is added to record where a value was extracted.
  - The text is scanned for pos/neg substrings and, through a manual mapping, values are mapped to 1 (pos) or 0 (neg) in a new OUTCOME_POS_EXTRACTED column.
- QC is then applied to remove extracted values that are formatted as dates.
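The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the regular expressions, the string clean-up rules, the contents of the pos/neg mapping and the function names extract_value and extract_outcome are all hypothetical. The real rules are in the GitHub repository.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns and mapping, chosen for illustration only.
NUMBER_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")
DATE_RE = re.compile(r"\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})?|\d{4}-\d{2}-\d{2}")
POS_NEG_MAP = {"pos": 1, "neg": 0}


def extract_value(
    measurement_value: Optional[float], free_text: Optional[str]
) -> Tuple[Optional[float], bool]:
    """Return (MEASUREMENT_VALUE_EXTRACTED, IS_VALUE_EXTRACTED) for one row."""
    # An existing measurement value is kept as it is.
    if measurement_value is not None:
        return float(measurement_value), False
    if not free_text:
        return None, False
    raw = free_text.strip()
    # QC: text formatted as a date is never treated as a number.
    if DATE_RE.fullmatch(raw):
        return None, False
    # Assumed clean-up: decimal comma to dot, drop trailing dots.
    cleaned = raw.replace(",", ".").rstrip(".")
    # Only a pure number is cast from string to float.
    if NUMBER_RE.fullmatch(cleaned):
        return float(cleaned), True
    return None, False


def extract_outcome(free_text: Optional[str]) -> Optional[int]:
    """Return 1 (pos), 0 (neg) or None for OUTCOME_POS_EXTRACTED."""
    if not free_text:
        return None
    text = free_text.lower()
    hits = {value for key, value in POS_NEG_MAP.items() if key in text}
    # Ambiguous text (both substrings) or no match yields no outcome.
    return hits.pop() if len(hits) == 1 else None
```

In this sketch the date check runs on the raw text before the clean-up, so a string such as "12.5." is rejected rather than being turned into the number 12.5.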
Location
/finngen/library-red/finngen_R13/kanta_analysis_1.0/
As with the full munged data, there are two files:
finngen_R13_kanta_analysis_1.0.parquet
finngen_R13_kanta_analysis_1.0.txt.gz
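As a minimal sketch of how the compressed text file could be inspected with the Python standard library only, the helpers below stream rows without loading the whole file into memory. It assumes the .txt.gz file is tab-separated with a header row, which is not stated on this page; the function names iter_rows and preview are hypothetical.

```python
import csv
import gzip
from itertools import islice
from typing import Dict, Iterator, List


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the header names."""
    # Assumption: tab-separated text with a header line, UTF-8 encoded.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)


def preview(path: str, n: int = 5) -> List[Dict[str, str]]:
    """Return the first n rows as a list, reading no more than needed."""
    return list(islice(iter_rows(path), n))


# Example usage (path taken from the Location section):
# path = ("/finngen/library-red/finngen_R13/kanta_analysis_1.0/"
#         "finngen_R13_kanta_analysis_1.0.txt.gz")
# for row in preview(path):
#     print(row)
```

All values come back as strings, so numeric columns need to be cast before use. For larger analyses the parquet file is likely the more convenient choice, since it stores column types.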
Extraction Summary
The following table summarizes the extraction process.
| OMOP ID | N of extracted numerical values | Percentage of numerical values extracted | Percentage of extracted values that had NA in raw data measurement | N of extracted POS/NEG values | Percentage of POS/NEG extracted values | Percentage of extracted values that had NA in raw data outcome | Concept Name |
|---|---|---|---|---|---|---|---|
| 3026361 | 2095662 | 22.6799 | 100.0000 | 2 | 0.0000 | 100.0000 | Erythrocytes [#/volume] in Blood |
| 3018095 | 118284 | 22.3749 | 100.0000 | 67950 | 12.8536 | 6.2384 | Leukocytes [#/volume] in Urine |