Integrated transcriptomics explored the cancer-promoting genes CDKN3 in esophageal squamous cell cancer

Background and objectives Each individual studies is limited to multi-factors and potentially lead to a significant difference of results among them. The present study aim to explore the critical genes related to the development of Esophageal squamous cell carcinoma (ESCC) by integrated transcriptomics and to investigate the clinical significance by experimental validation. Methods Datasets of protein-coding genes expression which involved in ESCC were downloaded from Gene Expression Omnibus (GEO) database. The “Robustrankaggreg” package in language was used for data integration, and the different expression genes (DEGs) were identified based the cut-off criteria as follows: adjust p-value < 0.05, |fold change (FC)| ≥ 1.5; The protein expression of seed gene in 184 cases of primary ESCC tissues and 50 tumor adjacent normal tissues (at least 5 cm away from the tumor, and defind as the controls) were detected by immunohistochemistry; The relationship between the expression level of seed genes and clinical parameter were analyze. Enumeration data were represented by frequency or percentage (%) and were tested by x2 test. The P value of less than 0.05 was considered statistically significant. Results A total of 244 DEGs were identified by comparing gene expression patterns between ESCC patients and the controls based on integrating dataset of GSE77861, GSE77861, GSE100942, GSE26886, GSE17351, GSE38129, GSE33426, GSE20347 and GSE23400; The Cyclin-dependent kinase inhibitor 3 (CDKN3) were identified the top 1 seed gene of top cluster by use of protein-protein Interaction network and plug-in Molecular Complex Detection; The level of CDKN3 mRNA was significantly increased in ESCC patients compared to controls; The positive expression rate of CDKN3 protein in ESCC tissue samples was 32 and 61.4% in control, respectively. The correlations between the expression level of CDKN3 and lymph node metastasis or clinical staging of ESCC patients are statistically significant. Conclusion Integrated transcriptomics is an efficient approach to system biology. By this procedure, our study improved the understanding of the transcriptome status of ESCC.


Introduction
Esophageal squamous cell carcinoma (ESCC) is a dominant malignant tumor, which accounts for mostly 90% of esophageal carcinoma [1]. Previous studies indicated that a synergistic contribution of pathological stages and genetic backgrounds on the progress of ESCC but the concrete molecular mechanism is elusive [2,3]. Currently, a number of sample data of cancer genomics are accessible on professional network and provides a huge of benefits for further bio-analysis of those cancers [4]. Each individual study, however, is limited to multi-factors such as sample sizes, batch effects, experimental conditions or so on, and potentially lead to a significant result difference among them. This problem implied that an effective in silico method to integrate those individual study could provide a more profound and valuable conclusion to screen the crucial genes of ESCC [5].
For this reason, In this study, robust rank aggregation (RRA) method was performed to integrate ESCC data from different public platforms to obtain different expression genes (DEGs) that were used to construct protein-protein interaction (PPI) and screen the hub genes. RRA method uses a probabilistic model for aggregation that is robust to noise and also facilitates the calculation of significance probabilities for all the elements in the final ranking. Then immunohistochemistry analysis were performed to further verify hub genes. The objective of this study to further explore new biomarkers of ESCC.

Data source
Gene expression profiles were obtained by a systematic retrieval on the GEO (http://www.ncbi.nlm.nih.gov/geo/) database with keywords. A total of 9 series (GSEs) with more than 3 cases of ESCC samples and matched normal controls, respectively, were downloaded for further study and their general information of each data sets were shown in Table 1.

Data preprocessing and integration of differentially expressed genes
The raw data of GEO Series (GSE) were preprocessed using R package "Affy", including background corrections, normalization, missing data imputation and calculation of gene expression. The R package "limma" [6] was utilized to screen and compare the preprocessed data of ESCC samples with matched controls samples using Bayes test. Corrected P value and absolute values of Fold Chang (|Log 2 FC|) from each data sets were obtained and formed matrix of 9 differential expression matrix. Besides, the R package "Robustrankaggreg" [5,7] was utilized to integrate the matrix based RRA method. Genes with |Fold Change| > 1.5 and P < 0.05 were considered to be DEGs.

Protein-protein interaction (PPI) network construction and module mining
DEGs were further analyzed by STRING (https://stringdb.org/) to predicts PPI network and a confidence score of 0.4 was set as the threshold value. Then the PPI network was visualized using Cytoscape (V3.5.1). And Molecular Complex Detection (MCODE) plug-in were performed the module analysis, which can finds gene modules (highly interconnected regions) in a network. Modules mean in a PPI network are often protein complexes and parts of pathways. Parameters setting: a degree cut-off > 5, k-core> 5 and the rest are default settings.

The verification of mRNA level of hub genes
The mRNA level of hub genes was tested via ESCC data from TCGA. Briefly, expression gene data of ESCC samples and collaterally clinic information were downloaded (http://xena.ucsc.edu/welcome-to-ucsc-xena/). The data set was based on IlluminaHiSeq_RNASeqV2 highthroughput RNA sequencing platform, and the expression values were all relative values normalized by computer programming language. The hub genes transcriptase sequencing data of 81 ESCC patients with Tumor adjacent normal tissues (at least 5 cm away from the tumor) were defined as the controls.

Immunohistochemistry staining
Paraffifin-embedded sections (4 μm) of ESCC and matched normal tissues, saved in our pathology department, were used for CDKN3 immunostaining (Abcam Group, Inc.;). After dewaxing, washing and incubating with the primary antibody (1:200) and secondary antibody in turn, the slides were coloured with DAB and then counterstained with hematoxylin and dehydrated and mounted. Two experienced pathologists were independently evaluated the immunostaining slides by recording the staining intensity of tumor cells and the rate of percentage of positive cells. Concrete criteria were previous article [8].

Statistical analysis
The SPSS 22.0 was used for statistical analysis and the Graphpad Prime 5 was used for drawing statistical pictures. Normal distribution data were indicated as the standard deviation of sample means and their groups were compared using t test. Skewness distribution data were indicated as inter quartile range and their groups were compared using Mann-Whitney test. Enumeration data were represented by frequency or percentage (%) and were tested by x 2 test. The P value of less than 0.05 was considered statistically significant.

DEGs screening
A total of 244 DEGs from 9 series of gene expression profiles were found after performing integrated analysis, of which 93 were upregulated and 151 were downregulated P < 0.05 and |Fold Change| > 1.5. The top 10 upregulated and downregulated DEGs are shown in Fig. 1.

PPI network construction and module mining
To explore the biological functions of DEGs, a PPI network included 194 nodes and 864 edges was established via STRING ( Fig. 2A). Then, modules with core significance were obtained via modules mining and analysis using MCODE app from cytoscape software. Results show that the module with the highest score (23.304) contain 24 nodes and 268 edges (Fig. 2B). Among which, the cyclin dependent kinase inhibitor 3 (CDKN3) was identified the seed gene with the highest degree compared to other genes, and was selected to further study.

Immunohistochemical analysis for CDKN3 protein
Immunohistochemical analysis was used to detect CDKN3 expression in 184 ESCC tissue and 50 matched normal tissues. We found that the rate of positive expression of CDKN3 protein in ESCC tissues (61.4%, 113/ 184) were higher than that in matched normal tissues (32%, 16/50) with statistically significance (x 2 = 13.75, p < 0.001) (Fig. 4A-D).

Correlation between between CDKN3 and ESCC patients
Correlation between the protein expression of CDKN3 and clinicopathological features of ESCC patients are shown in Table 2

Discussion
As the outputs of individual experiments can be rather noisy, it is essential to look for findings that are supported by several pieces of evidence to increase the signal and lessen the fraction of false positive findings. Current dominant in silico methods of integrated transcriptomics include: 1) to analysis each expression profile and make an intersection between each DEGs. 2) to remove batch effects via 'combat' function of sva package. The former method is supposed to be limited in batch effects according to our previous experience in other study [9]. However, the latter method cannot be conducted in cross-platform analysis due to its deep reliance on similar experiment backgrounds [10]. Data integration plays an important role in the analysis of high throughput data.
In this study, we performed RRA to integrate transcriptomics because this method is not only avoid the interference of cross-platform, but also enlarge the simple size. Our results indicated that there were 244 DEGs were screened via this method. Besides, many genes among DEGs such as MMP1 [11], MAGEA6 [12] and MAL [13] were closely associated with the progress of ESCC, which also implied the reliability of RRA.
The pathological mechanism of ESCC is complicated and involved a number of pathways and genes, which cause a deep restriction on traditional biological study. In this study, the PPI were constructed by DEGs to explore the crucial module of gene-gene interaction. The modules with the highest importance consist of 24 gene, of which, some genes such as FOXM1 [14] or DTL [15] were considered as crucial genes in ESCC. The Cyclin-dependent protein kinase (CDK), a central gene in module, encodes a cell cycle regulatory protein which is associated with multitumors [16]. Our results indicated that compared with control group, the mRNA level of CDKN3 is significantly higher. Besides, our immunohistochemical study indicated that there is an abnormal expression of CDKN3 protein in ESCC patients, which confirmed its association with the progress of ESCC. Meanwhile, recent studies suggested that CDKN3 was upregulated in ESCC cell lines. Functional assays revealed that CDKN3 knockdown with small interfering RNA decreased the ability of ESCC cells to proliferate, invade and migrate and suppressed G1/S transition. Further mechanistic analyses demonstrated that CDKN3 promoted cell proliferation and invasion by activating the AKT signaling pathway in ESCC cells [17,18].

Conclusions
In conclusion, our method is to explore the pathogenesis of ESCC and its candidate bio-markers of diagnose and prognosis at the molecule level. This study is also of instructive value for other cancer studies.