A bet­ter way to ex­plore TC­GA da­ta

If you work in cancer biomarker and target research, chances are you use data from The Cancer Genome Atlas (TCGA) to help you make discoveries. This comprehensive and coordinated effort helps accelerate our understanding of the molecular causes of cancer through genomic analyses, including large-scale genome sequencing. TCGA covers 33 types of cancer with multi-omics data, such as RNA-seq, DNA-seq, copy number, microRNA-seq, and others. Detailed analyses of individual TCGA datasets, as well as pan-cancer meta-analyses, have revealed new cancer subtypes with important therapeutic implications. Much of the value of TCGA lies in its metadata: TCGA samples include extensive clinical metadata for diverse cancers. However, inconsistent terminology and formatting limit the utility of these data for pan-cancer analyses.

TCGA data within QIAGEN OmicSoft OncoLand are rigorously curated by experts who apply extensive ontologies and formatting rules to maximize consistency. This allows researchers to more easily find and understand patient characteristics, discover related covariates, and explore patterns of clinical parameters across cancers in the context of multi-omics data. QIAGEN OmicSoft has established strict standards through the curation of over 600,000 disease-relevant ’omics samples. QIAGEN OmicSoft Lands provide access to uniformly processed datasets, in-depth metadata curation, and data exploration tools that enable quick insights from thousands of deeply curated ’omics studies across therapeutic areas. QIAGEN OmicSoft Lands centralize data from Gene Expression Omnibus (GEO), NCBI Sequence Read Archive (SRA), ArrayExpress, TCGA, Cancer Cell Line Encyclopedia (CCLE), Genotype-Tissue Expression (GTEx), Blueprint, International Cancer Genome Consortium (ICGC), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and others.


Figure 1. QIAGEN OmicSoft OncoLand collects and integrates datasets not only from TCGA but also from many other public data sources, such as Cancer Cell Line Encyclopedia (CCLE), Blueprint, Genotype-Tissue Expression (GTEx), TRAcking Cancer Evolution through therapy (TRACERx), and many more. Here you see the distribution of recently added samples grouped on the y-axis by DiseaseCategory and subgrouped by CancerType.

Omic­Soft’s cu­ra­tion process for TC­GA

QIAGEN curators invest extensive time and care in manual curation of public ’omics data. As an example, they recently spent over 1400 hours performing a comprehensive update of TCGA metadata within QIAGEN OmicSoft OncoLand, reviewing over 1200 source files. Clinical metadata are now comprehensively documented to clarify the meaning of fields in alignment with the latest OmicSoft curation standards. When TCGA metadata fields are redundant or unclear, new field names are used to clarify the meaning. In addition, new metadata from recent TCGA publications are matched to TCGA data to apply recent discoveries about molecularly defined cancer subtypes.


Fig­ure 2. Ex­am­ple of a QI­A­GEN Omic­Soft query and vi­su­al­iza­tion: BMP2 ex­pres­sion in tis­sue sam­ples from CCLE, GTEx, and TC­GA, us­ing the lat­est Hu­man.B38/Gen­code.V33 re­leas­es. The y-ax­is is pro­filed on Tis­sue­Cat­e­go­ry, Source­Land, and Tu­mor or Nor­mal.

At the core of the Omic­Soft cu­ra­tion process, cu­ra­tors ap­ply sci­en­tif­ic ex­per­tise, con­trolled vo­cab­u­lary, and stan­dard­ized for­mat­ting to all ap­plic­a­ble meta­da­ta, ei­ther as a Ful­ly Con­trolled Field (key clin­i­cal pa­ra­me­ters use terms from QI­A­GEN-de­fined on­tolo­gies) or a For­mat Con­trolled Field (where a QI­A­GEN Omic­Soft on­tol­ogy is not ap­plic­a­ble, terms are for­mat­ted con­sis­tent­ly to max­i­mize uni­for­mi­ty from se­mi-struc­tured da­ta). This means you can quick­ly and eas­i­ly find all ap­plic­a­ble sam­ples us­ing sim­pli­fied search cri­te­ria.

Uni­fi­ca­tion of re­lat­ed TC­GA meta­da­ta fields

Because TCGA data were submitted by dozens of labs, groups adopted inconsistent standards to represent the same data. Where possible, OmicSoft curators identified hundreds of columns containing the same information for various tumors and combined the data into unified columns to support pan-cancer and computational analyses. As an example, the cancer diagnosis of a first-degree family member with a history of cancer was captured in TCGA across five fields from four cancers; QIAGEN OmicSoft TCGA curation unites these into the single field “Family History [Cancer] [Type]”.
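To make the idea of field unification concrete, here is a minimal Python sketch of collapsing several equivalent source columns into one unified field. It is an illustration only, not QIAGEN OmicSoft's curation pipeline: the source column names and the `unify_fields` helper are hypothetical, since the actual TCGA field names are not listed here. Only the unified field name comes from the example above.

```python
from typing import Dict, List, Optional

# Hypothetical source column names, invented for illustration.
FAMILY_HISTORY_COLUMNS: List[str] = [
    "family_history_cancer_type",
    "relative_cancer_type",
    "first_degree_relative_cancer",
]

# Unified field name taken from the article's example.
UNIFIED_FIELD = "Family History [Cancer] [Type]"


def unify_fields(
    record: Dict[str, Optional[str]],
    source_columns: List[str],
    unified_name: str,
) -> Dict[str, Optional[str]]:
    """Return a copy of `record` with the source columns merged into one field.

    Empty and missing values are skipped, duplicates are dropped, and any
    remaining distinct values are joined with "; " in source-column order.
    """
    values: List[str] = []
    for column in source_columns:
        raw = record.get(column)
        if raw is None:
            continue
        text = str(raw).strip()
        if text and text not in values:
            values.append(text)

    # Keep every other field untouched and drop the redundant source columns.
    unified = {k: v for k, v in record.items() if k not in source_columns}
    unified[unified_name] = "; ".join(values) if values else None
    return unified


sample = {
    "sample_id": "S1",
    "relative_cancer_type": " Breast ",
    "family_history_cancer_type": None,
}
print(unify_fields(sample, FAMILY_HISTORY_COLUMNS, UNIFIED_FIELD))
```

In practice, deciding which columns truly carry the same information is the hard, expert-driven step; the merge itself is the easy part.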


Figure 3. QIAGEN OmicSoft provides over 100,000 statistical comparisons of the curated data. Here you can see the expression of DAZ2 grouped by TreatmentStatus+Response+TreatmentHistory. Samples included in the comparison (TreatmentHistory: ipilimumab vs. none, within TreatmentStatus: nivolumab and Response: none) are colored blue.

Syn­ony­mous terms and ty­po­graph­i­cal er­rors

QIAGEN OmicSoft manually applies extensive treatment ontologies to ensure proper and unambiguous labeling of samples with treatment terms. Because many groups submitted data, different conventions were used even for well-established terms, such as drug and radiation treatments, and occasional typos escaped submitter quality control checks. For example, over 20 different terms were used to describe treatment with doxorubicin!
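As a rough illustration of why synonyms and typos matter, the Python sketch below maps free-text treatment terms onto one canonical term. It is not OmicSoft's ontology or method: the tiny `SYNONYMS` table, the `normalize_term` helper, and the 0.85 similarity cutoff are all assumptions made for the example. Fuzzy matching uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib
from typing import Optional

# Tiny illustrative synonym table (assumed for this example only).
# Keys are lowercase variants; values are the canonical term.
SYNONYMS = {
    "doxorubicin": "doxorubicin",
    "adriamycin": "doxorubicin",          # brand name for doxorubicin
    "doxorubicin hcl": "doxorubicin",
    "doxorubicin hydrochloride": "doxorubicin",
}


def normalize_term(raw: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a free-text treatment term to a canonical term, or None.

    Exact matches (after lowercasing and collapsing whitespace) are tried
    first; otherwise the closest known variant above `cutoff` is used, which
    catches simple typos. Unrecognized terms return None so that a curator
    can review them instead of being guessed at.
    """
    key = " ".join(raw.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


for term in ["Doxorubicin", "ADRIAMYCIN ", "doxorubcin", "cisplatin"]:
    print(repr(term), "->", normalize_term(term))
```

Automatic matching like this only narrows the problem. A fuzzy match can also pair two different drugs with similar names, which is one reason manual review by curators remains important.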

Want to learn more about how you can boost your TC­GA ex­plo­ration to get quick­er and more mean­ing­ful in­sights?

Read our white pa­per to get the full de­tails of how QI­A­GEN Omic­Soft On­coLand helps boost TC­GA ex­plo­ration. Down­load our unique and com­pre­hen­sive meta­da­ta dic­tio­nary of clin­i­cal co­vari­ates to quick­ly dis­cov­er the mean­ing of over 1000 rel­e­vant fields for deep­er TC­GA da­ta ex­plo­ration across can­cers.

Learn more about the costs of free data in our industry report. Check out our infographic detailing the QIAGEN OmicSoft software tools for integrated ’omics data to see which solutions can help you transform your biomarker and target discovery.