Public health problems are hard to study because they require large, meaningful datasets that, for privacy reasons, cannot be used in the form in which they were collected. Data collection is planned through mobile phones, which give the general population an easy way to report information about their health: a user enters details in a mobile application, and the data are stored in a cloud-based backend. To address the problem of analyzing heterogeneous datasets, a statistical pipeline is built to harmonize data across cohorts. The code analyzes the data and applies statistical standardization techniques such as re-normalization, covariate identification, and dimensionality reduction. Geospatial features further support the analysis. Correlations between variables are calculated and, because multicollinearity in the dataset is high, principal component analysis (PCA) is performed. The resulting multidimensional dataset is then analyzed to cross-correlate blood metabolite measurements in racially diverse women and to find factors that contribute to breast cancer risk disparities. This will aid clinical translation by targeting novel biomarkers and pathways, and will facilitate the development of biosensor-based companion diagnostic tools for early detection and individualized treatment. This project also aims to code an R package that implements the data harmonization pipeline in a flexible and adaptive manner; to scale, automate, and containerize the package for deployment on the cloud; and to publish the package on CRAN as open source.
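As a rough illustration of two of the steps named above, re-normalization within cohorts and the correlation check that motivates PCA, the following minimal sketch may help. It is written in Python using only the standard library and is not the R package described here; the function names, the field names (`cohort`, `m1`, `m2`), and the toy values are all hypothetical and chosen only for demonstration.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize_within_cohort(records, cohort_key, variable):
    """Return z-scores of `variable`, computed separately inside each cohort.

    Centering and scaling per cohort removes cohort-level offsets and
    scale differences, one simple form of re-normalization.
    """
    by_cohort = {}
    for rec in records:
        by_cohort.setdefault(rec[cohort_key], []).append(rec[variable])
    # Population mean and standard deviation for each cohort.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_cohort.items()}
    z_scores = []
    for rec in records:
        mu, sd = stats[rec[cohort_key]]
        # A constant variable has sd == 0; map it to 0.0 to avoid dividing by zero.
        z_scores.append((rec[variable] - mu) / sd if sd > 0 else 0.0)
    return z_scores


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy, made-up measurements from two hypothetical cohorts on different scales.
records = [
    {"cohort": "A", "m1": 1.0, "m2": 2.1},
    {"cohort": "A", "m1": 2.0, "m2": 3.9},
    {"cohort": "A", "m1": 3.0, "m2": 6.0},
    {"cohort": "B", "m1": 10.0, "m2": 19.8},
    {"cohort": "B", "m1": 12.0, "m2": 24.1},
    {"cohort": "B", "m1": 14.0, "m2": 28.0},
]

z1 = standardize_within_cohort(records, "cohort", "m1")
z2 = standardize_within_cohort(records, "cohort", "m2")
r = pearson(z1, z2)  # a value near 1 signals strongly collinear variables
```

After standardization both cohorts share a common scale, so their records can be pooled. A correlation close to 1 between two harmonized variables is the kind of multicollinearity that would justify a dimensionality reduction step such as PCA.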