您可以使用 BigQuery 對包含基因體片段間隔或重疊所述資料的變異基因段執行 JOIN
查詢。本頁面說明如何使用複雜的 JOIN
查詢取得基因名稱清單並執行下列操作:
- 尋找重疊基因的罕見 SNP。
- 在全基因體樣本的每個基因端尋找 100,000 個鹼基對。
本指南將列舉三個查詢的範例。每個查詢都示範 BigQuery 如何針對不同大小的基因體資料進行資源調度:
資料來自擁有近 90 億個資料列的 Tute Genomics 註解資料表與 Illumina Platinum Genomes 資料集。如果您對這些資料集並不熟悉,請參閱下列連結:
- Google Cloud 網誌中的「使用 Google Genomics 與 Tute 探索基因變異」一文
- 公開資料集相關章節中的 Illumina Platinum Genomes
查詢內嵌資料表
下列範例使用在查詢中定義的間隔資料表 (名為 intervals
),並顯示如何使用包含 Illumina Platinum Genomes 中變異基因段的資料表執行 JOIN
查詢:
前往 Google Cloud 控制台的「BigQuery」頁面。
按一下 [Compose query] (撰寫查詢)。
在「New query」欄位中,執行下列查詢:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- Define an inline table that uses five rows -- selected from silver-wall-555.TuteTable.hg19. intervals AS ( SELECT * FROM UNNEST ([ STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64> ('PRCC', '1', 156736274, 156771607, 156636274, 156871607), ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640), ('PAX8', '2', 113972574, 114037496, 113872574, 114137496), ('FHIT', '3', 59734036, 61238131, 59634036, 61338131), ('PPARG', '3', 12328349, 12476853, 12228349, 12576853) ])), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。
展開下列部分,查看查詢結果:
查詢結果
Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 156699757 T C intronic RRNAD1 0.002 2 4 1 156705390 C T intronic RRNAD1 8.0E-4 0 2 1 156714207 T C intronic HDGF 0.003 0 6 1 156714440 A C intronic HDGF 0.0068 0 12 1 156723870 C T 跨基因 HDGF、PRCC 0.006 1 2 1 156724456 C T 跨基因 HDGF、PRCC 0.002 2 4 1 156733988 C T 跨基因 HDGF、PRCC 0.001 1 2 1 156742258 T G intronic PRCC 0.001 2 4 1 156744826 T G intronic PRCC 0.002 0 8 1 156779764 G A intronic SH2D2A 0.001 2 4 1 156783454 A C intronic SH2D2A 0.0014 1 2 1 156786144 C T intronic NTRK1、SH2D2A 0.0031 2 4 1 156790510 A T intronic NTRK1 0.002 1 2 1 156815332 A C intronic INSRR、NTRK1 0.003 0 2 1 156830778 G A 外顯 NTRK1 0.0067 錯義 2 4 1 156842064 C T intronic NTRK1 0.0014 1 2 1 156843438 C A 外顯 NTRK1 0.0032 錯義 1 2 1 156845773 C T intronic NTRK1 0.001 2 4 1 156873318 T C intronic PEAR1 0.01 4 8 1 156922740 G A intronic ARHGEF11 0.007 1 2 1 156930100 C T intronic ARHGEF11 0.001 2 4 2 113901230 G A 跨基因 IL1RN、PSD4 0.0082 1 2 2 113953418 C A intronic PSD4 0.001 2 4 2 113967621 G C 跨基因 PSD4、PAX8 0.002 0 6 2 113967624 T C 跨基因 PSD4、PAX8 0.002 0 2 2 113980967 G A intronic PAX8 0.002 2 4 2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4 2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4 2 114061327 T C 跨基因 PAX8,CBWD2 0.001 2 4 2 114084018 A C 跨基因 PAX8,CBWD2 0.0045 0 4 2 114099037 G A 跨基因 PAX8,CBWD2 0.0051 1 2 2 114105670 A T 跨基因 PAX8,CBWD2 0.001 1 2 2 114111325 G T 跨基因 PAX8,CBWD2 0.001 1 2 3 12265797 C T 跨基因 SYN2,PPARG 0.0089 2 4 3 12277958 A G 跨基因 SYN2,PPARG 0.002 1 2 3 12296019 G A 跨基因 SYN2,PPARG 0.002 2 4 3 12316549 G C 跨基因 SYN2,PPARG 0.002 1 2 3 12335681 T G intronic PPARG 0.0092 2 4 3 12348795 T C intronic PPARG 0.0014 1 2 3 12353106 T C intronic PPARG 0.001 2 4 3 12403825 G A intronic PPARG 0.0051 2 4 3 12404394 G A intronic PPARG 0.001 1 2 3 12410289 G A intronic PPARG 0.008 2 4 3 12431381 C T intronic PPARG 0.0061 2 4 3 12447267 G A intronic PPARG 0.0089 2 4 3 12449379 C T intronic PPARG 0.0092 2 4 3 12450848 C A intronic PPARG 0.0092 2 4 3 12462847 T C intronic PPARG 0.002 1 2 3 12492797 G A 跨基因 PPARG,TSEN2 0.01 1 2 3 12503201 G A 跨基因 PPARG,TSEN2 0.0099 2 4 3 12530460 A G intronic TSEN2 0.0092 2 4 3 12531167 A G intronic TSEN2 0.0099 2 4 3 12557737 A G intronic TSEN2 0.001 2 4 3 59636143 A G 跨基因 C3orf67、FHIT 0.003 3 6 3 59645934 A C 跨基因 C3orf67、FHIT 0.004 1 2 3 59646893 G A 跨基因 C3orf67、FHIT 0.002 1 2 3 59697024 A G 跨基因 C3orf67、FHIT 0.0072 1 2 3 59701013 G A 跨基因 C3orf67、FHIT 0.004 2 4 3 59733945 A G 跨基因 C3orf67、FHIT 0.001 2 4 3 59747482 C T intronic FHIT 0.001 2 4 3 59750635 A G intronic FHIT 0.003 1 2 3 59757776 C T intronic FHIT 0.001 2 4 3 59770612 G A intronic FHIT 0.001 2 4 3 59804444 G C intronic FHIT 0.001 2 4 3 59819769 T C intronic FHIT 0.001 2 4 3 59884396 C T intronic FHIT 0.001 2 4 3 59960728 A C intronic FHIT 0.01 1 2 3 59970345 G A intronic FHIT 0.002 1 2 3 59972417 T A intronic FHIT 0.0072 0 2 3 60104328 C A intronic FHIT 0.01 2 4 3 60139062 G A intronic FHIT 0.01 0 2 3 60158066 C T intronic FHIT 0.001 1 2 3 60169285 C T intronic FHIT 0.005 1 2 3 60216185 T C intronic FHIT 0.002 1 2 3 60226380 G A intronic FHIT 0.007 2 4 3 60234539 C A intronic FHIT 0.002 1 2 3 60247464 A C intronic FHIT 0.004 2 4 3 60269926 A G intronic FHIT 0.007 2 4 3 60271228 G T intronic FHIT 0.007 2 4 3 60286972 T C intronic FHIT 0.001 2 4 3 60301412 C G intronic FHIT 0.001 1 2 3 60312251 C T intronic FHIT 0.0099 1 2 3 60317682 A G intronic FHIT 0.008 1 2 3 60328557 C G intronic FHIT 0.0043 2 4 3 60342562 C T intronic FHIT 0.006 1 2 3 60400033 G A intronic FHIT 0.004 2 4 3 60435819 C T intronic FHIT 0.006 2 4 3 60435820 G T intronic FHIT 0.004 1 2 3 60441288 T C intronic FHIT 0.006 2 4 3 60444465 C A intronic FHIT 0.01 1 2 3 60444575 C T intronic FHIT 0.001 1 2 3 60450581 T C intronic FHIT 0.01 1 2 3 60456571 G A intronic FHIT 0.001 2 4 3 60473568 C G intronic FHIT 0.001 1 2 3 60487557 T C intronic FHIT 0.001 1 2 3 60559705 A G intronic FHIT 0.002 2 4 3 60570764 T C intronic FHIT 0.008 2 4 3 60582100 C T intronic FHIT 0.001 1 2 3 60587192 G A intronic FHIT 0.004 1 2 3 60599869 G A intronic FHIT 0.0086 2 4 3 60603091 C T intronic FHIT 0.001 2 4 3 60603250 A T intronic FHIT 0.0099 1 2 3 60609831 T G intronic FHIT 0.001 2 4 3 60619756 G T intronic FHIT 0.0015 2 4 3 60680758 C T intronic FHIT 0.0089 2 4 3 60702243 G C intronic FHIT 0.001 2 4 3 60702532 A G intronic FHIT 0.001 1 2 3 60714328 A T intronic FHIT 0.004 1 2 3 60725297 G A intronic FHIT 0.001 1 2 3 60726640 G A intronic FHIT 0.01 2 4 3 60795144 A G intronic FHIT 0.001 2 4 3 60807171 A G intronic FHIT 0.001 1 2 3 60813868 T C intronic FHIT 0.001 1 2 3 60826546 C G intronic FHIT 0.0023 1 2 3 60837392 C T intronic FHIT 0.001 1 2 3 60846310 A G intronic FHIT 0.01 0 2 3 60850985 C T intronic FHIT 0.004 1 2 3 60852559 T C intronic FHIT 0.008 1 2 3 60871759 T C intronic FHIT 0.004 1 2 3 60884396 C T intronic FHIT 0.002 2 4 3 60897092 C A intronic FHIT 0.001 2 4 3 60940759 C T intronic FHIT 0.0089 1 2 3 60982595 A G intronic FHIT 0.003 2 4 3 60999283 G A intronic FHIT 0.001 1 2 3 61042977 A G intronic FHIT 0.001 2 4 3 61043349 T C intronic FHIT 0.001 2 4 3 61044789 A C intronic FHIT 0.001 2 4 3 61141621 G A intronic FHIT 0.003 1 2 3 61148655 G C intronic FHIT 0.001 2 4 3 61170747 C T intronic FHIT 0.003 1 2 3 61189473 C G intronic FHIT 0.0099 1 2 3 61190425 C T intronic FHIT 0.0023 2 4 3 61193853 C T intronic FHIT 0.0099 0 2 3 61194793 C T intronic FHIT 0.007 0 2 3 61194840 A G intronic FHIT 0.0099 0 2 3 61194886 T A intronic FHIT 0.0099 0 2 3 61201777 C T intronic FHIT 0.001 2 4 3 61202292 T C intronic FHIT 0.007 1 2 3 61232806 G C intronic FHIT 0.0099 1 2 3 61232910 C T intronic FHIT 0.0099 1 2 3 61235824 A T intronic FHIT 0.001 2 4 3 61283810 A C 跨基因 FHIT、PTPRG 0.0089 1 2 3 61293731 T A 跨基因 FHIT、PTPRG 0.0089 2 4 3 61296730 C T 跨基因 FHIT、PTPRG 0.001 1 2 3 61326341 C T 跨基因 FHIT、PTPRG 0.004 2 4 3 61326620 T C 跨基因 FHIT、PTPRG 0.01 1 2 3 61327649 G C 跨基因 FHIT、PTPRG 0.001 2 4 3 61330545 G C 跨基因 FHIT、PTPRG 0.001 2 4 3 61335803 G A 跨基因 FHIT、PTPRG 0.001 2 4 使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。
使用具體化資料表
處理大規模的大數據時,您可以具體化間隔資料表,並針對新資料表執行 JOIN
查詢。繼續閱讀本節的其餘內容前,請按照下列步驟建立資料集:
在 Google Cloud 控制台開啟「BigQuery」頁面。
在「Explorer」面板中,選取要建立資料集的專案。
展開
「動作」選項,然後按一下「建立資料集」。在「Create dataset」(建立資料集) 頁面:
- 在「Dataset ID」(資料集 ID) 中輸入
genomics
。 - 保留其他預設設定。
- 點選「建立資料集」。
- 在「Dataset ID」(資料集 ID) 中輸入
查詢具有特定基因的具體化資料表
下列步驟說明如何將包含 silver-wall-555:TuteTable.hg19 資料表中特定基因清單的新間隔資料表具體化。
如要建立間隔資料表,請按照下列步驟操作:
在 Google Cloud 控制台開啟「BigQuery」頁面。
按一下 [Compose query] (撰寫查詢)。
在「New query」欄位中,執行下列查詢。這項查詢會將部分 silver-wall-555:TuteTable.hg19 資料表具體化為新的
genomics.myIntervalTable
間隔資料表。#standardSQL CREATE TABLE `genomics.myIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start)-100000 AS region_start, MAX(`End`)+100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4', 'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3', 'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA', 'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11', 'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX') GROUP BY Chr, Gene );
按一下 [Run query] (執行查詢)。查詢會傳回以下結果:
This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
在「New query」欄位中,執行下列查詢:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.myIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。
展開下列部分,查看查詢結果:
查詢結果
Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 156699757 T C intronic RRNAD1 0.002 2 4 1 156705390 C T intronic RRNAD1 8.0E-4 0 2 1 156714207 T C intronic HDGF 0.003 0 6 1 156714440 A C intronic HDGF 0.0068 0 12 1 156723870 C T 跨基因 HDGF、PRCC 0.006 1 2 1 156724456 C T 跨基因 HDGF、PRCC 0.002 2 4 1 156733988 C T 跨基因 HDGF、PRCC 0.001 1 2 1 156742258 T G intronic PRCC 0.001 2 4 1 156744826 T G intronic PRCC 0.002 0 8 1 156779764 G A intronic SH2D2A 0.001 2 4 1 156783454 A C intronic SH2D2A 0.0014 1 2 1 156786144 C T intronic NTRK1、SH2D2A 0.0031 2 4 1 156790510 A T intronic NTRK1 0.002 1 2 1 156815332 A C intronic INSRR、NTRK1 0.003 0 2 1 156830778 G A 外顯 NTRK1 0.0067 錯義 2 4 1 156842064 C T intronic NTRK1 0.0014 1 2 1 156843438 C A 外顯 NTRK1 0.0032 錯義 1 2 1 156845773 C T intronic NTRK1 0.001 2 4 1 156873318 T C intronic PEAR1 0.01 4 8 1 156922740 G A intronic ARHGEF11 0.007 1 2 1 156930100 C T intronic ARHGEF11 0.001 2 4 2 113901230 G A 跨基因 IL1RN、PSD4 0.0082 1 2 2 113953418 C A intronic PSD4 0.001 2 4 2 113967621 G C 跨基因 PSD4、PAX8 0.002 0 6 2 113967624 T C 跨基因 PSD4、PAX8 0.002 0 2 2 113980967 G A intronic PAX8 0.002 2 4 2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4 2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4 2 114061327 T C 跨基因 PAX8,CBWD2 0.001 2 4 2 114084018 A C 跨基因 PAX8,CBWD2 0.0045 0 4 2 114099037 G A 跨基因 PAX8,CBWD2 0.0051 1 2 2 114105670 A T 跨基因 PAX8,CBWD2 0.001 1 2 2 114111325 G T 跨基因 PAX8,CBWD2 0.001 1 2 3 12265797 C T 跨基因 SYN2,PPARG 0.0089 2 4 3 12277958 A G 跨基因 SYN2,PPARG 0.002 1 2 3 12296019 G A 跨基因 SYN2,PPARG 0.002 2 4 3 12316549 G C 跨基因 SYN2,PPARG 0.002 1 2 3 12335681 T G intronic PPARG 0.0092 2 4 3 12348795 T C intronic PPARG 0.0014 1 2 3 12353106 T C intronic PPARG 0.001 2 4 3 12403825 G A intronic PPARG 0.0051 2 4 3 12404394 G A intronic PPARG 0.001 1 2 3 12410289 G A intronic PPARG 0.008 2 4 3 12431381 C T intronic PPARG 0.0061 2 4 3 12447267 G A intronic PPARG 0.0089 2 4 3 12449379 C T intronic PPARG 0.0092 2 4 3 12450848 C A intronic PPARG 0.0092 2 4 3 12462847 T C intronic PPARG 0.002 1 2 3 12492797 G A 跨基因 PPARG,TSEN2 0.01 1 2 3 12503201 G A 跨基因 PPARG,TSEN2 0.0099 2 4 3 12530460 A G intronic TSEN2 0.0092 2 4 3 12531167 A G intronic TSEN2 0.0099 2 4 3 12557737 A G intronic TSEN2 0.001 2 4 3 59636143 A G 跨基因 C3orf67、FHIT 0.003 3 6 3 59645934 A C 跨基因 C3orf67、FHIT 0.004 1 2 3 59646893 G A 跨基因 C3orf67、FHIT 0.002 1 2 3 59697024 A G 跨基因 C3orf67、FHIT 0.0072 1 2 3 59701013 G A 跨基因 C3orf67、FHIT 0.004 2 4 3 59733945 A G 跨基因 C3orf67、FHIT 0.001 2 4 3 59747482 C T intronic FHIT 0.001 2 4 3 59750635 A G intronic FHIT 0.003 1 2 3 59757776 C T intronic FHIT 0.001 2 4 3 59770612 G A intronic FHIT 0.001 2 4 3 59804444 G C intronic FHIT 0.001 2 4 3 59819769 T C intronic FHIT 0.001 2 4 3 59884396 C T intronic FHIT 0.001 2 4 3 59960728 A C intronic FHIT 0.01 1 2 3 59970345 G A intronic FHIT 0.002 1 2 3 59972417 T A intronic FHIT 0.0072 0 2 3 60104328 C A intronic FHIT 0.01 2 4 3 60139062 G A intronic FHIT 0.01 0 2 3 60158066 C T intronic FHIT 0.001 1 2 3 60169285 C T intronic FHIT 0.005 1 2 3 60216185 T C intronic FHIT 0.002 1 2 3 60226380 G A intronic FHIT 0.007 2 4 3 60234539 C A intronic FHIT 0.002 1 2 3 60247464 A C intronic FHIT 0.004 2 4 3 60269926 A G intronic FHIT 0.007 2 4 3 60271228 G T intronic FHIT 0.007 2 4 3 60286972 T C intronic FHIT 0.001 2 4 3 60301412 C G intronic FHIT 0.001 1 2 3 60312251 C T intronic FHIT 0.0099 1 2 3 60317682 A G intronic FHIT 0.008 1 2 3 60328557 C G intronic FHIT 0.0043 2 4 3 60342562 C T intronic FHIT 0.006 1 2 3 60400033 G A intronic FHIT 0.004 2 4 3 60435819 C T intronic FHIT 0.006 2 4 3 60435820 G T intronic FHIT 0.004 1 2 3 60441288 T C intronic FHIT 0.006 2 4 3 60444465 C A intronic FHIT 0.01 1 2 3 60444575 C T intronic FHIT 0.001 1 2 3 60450581 T C intronic FHIT 0.01 1 2 3 60456571 G A intronic FHIT 0.001 2 4 3 60473568 C G intronic FHIT 0.001 1 2 3 60487557 T C intronic FHIT 0.001 1 2 3 60559705 A G intronic FHIT 0.002 2 4 3 60570764 T C intronic FHIT 0.008 2 4 3 60582100 C T intronic FHIT 0.001 1 2 3 60587192 G A intronic FHIT 0.004 1 2 3 60599869 G A intronic FHIT 0.0086 2 4 3 60603091 C T intronic FHIT 0.001 2 4 3 60603250 A T intronic FHIT 0.0099 1 2 3 60609831 T G intronic FHIT 0.001 2 4 3 60619756 G T intronic FHIT 0.0015 2 4 3 60680758 C T intronic FHIT 0.0089 2 4 3 60702243 G C intronic FHIT 0.001 2 4 3 60702532 A G intronic FHIT 0.001 1 2 3 60714328 A T intronic FHIT 0.004 1 2 3 60725297 G A intronic FHIT 0.001 1 2 3 60726640 G A intronic FHIT 0.01 2 4 3 60795144 A G intronic FHIT 0.001 2 4 3 60807171 A G intronic FHIT 0.001 1 2 3 60813868 T C intronic FHIT 0.001 1 2 3 60826546 C G intronic FHIT 0.0023 1 2 3 60837392 C T intronic FHIT 0.001 1 2 3 60846310 A G intronic FHIT 0.01 0 2 3 60850985 C T intronic FHIT 0.004 1 2 3 60852559 T C intronic FHIT 0.008 1 2 3 60871759 T C intronic FHIT 0.004 1 2 3 60884396 C T intronic FHIT 0.002 2 4 3 60897092 C A intronic FHIT 0.001 2 4 3 60940759 C T intronic FHIT 0.0089 1 2 3 60982595 A G intronic FHIT 0.003 2 4 3 60999283 G A intronic FHIT 0.001 1 2 3 61042977 A G intronic FHIT 0.001 2 4 3 61043349 T C intronic FHIT 0.001 2 4 3 61044789 A C intronic FHIT 0.001 2 4 3 61141621 G A intronic FHIT 0.003 1 2 3 61148655 G C intronic FHIT 0.001 2 4 3 61170747 C T intronic FHIT 0.003 1 2 3 61189473 C G intronic FHIT 0.0099 1 2 3 61190425 C T intronic FHIT 0.0023 2 4 3 61193853 C T intronic FHIT 0.0099 0 2 3 61194793 C T intronic FHIT 0.007 0 2 3 61194840 A G intronic FHIT 0.0099 0 2 3 61194886 T A intronic FHIT 0.0099 0 2 3 61201777 C T intronic FHIT 0.001 2 4 3 61202292 T C intronic FHIT 0.007 1 2 3 61232806 G C intronic FHIT 0.0099 1 2 3 61232910 C T intronic FHIT 0.0099 1 2 3 61235824 A T intronic FHIT 0.001 2 4 3 61283810 A C 跨基因 FHIT、PTPRG 0.0089 1 2 3 61293731 T A 跨基因 FHIT、PTPRG 0.0089 2 4 3 61296730 C T 跨基因 FHIT、PTPRG 0.001 1 2 3 61326341 C T 跨基因 FHIT、PTPRG 0.004 2 4 3 61326620 T C 跨基因 FHIT、PTPRG 0.01 1 2 3 61327649 G C 跨基因 FHIT、PTPRG 0.001 2 4 3 61330545 G C 跨基因 FHIT、PTPRG 0.001 2 4 3 61335803 G A 跨基因 FHIT、PTPRG 0.001 2 4 使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。
查詢具有 250 個隨機基因的具體化資料表
下列範例顯示如何針對包含從 silver-wall-555:TuteTable.hg19 資料表隨機選取的 250 個基因的具體化資料表執行間隔 JOIN
。
如要建立間隔資料表,請按照下列步驟操作:
在 Google Cloud 控制台開啟「BigQuery」頁面。
按一下 [Compose query] (撰寫查詢)。
在「New query」(新查詢) 欄位中,執行下列查詢,將部分 silver-wall-555:TuteTable.hg19 資料表具體化為新的
genomics.randomGenesIntervalTable
間隔資料表。#standardSQL CREATE TABLE `genomics.randomGenesIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start) - 100000 AS region_start, MAX(`End`) + 100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250) GROUP BY Chr, Gene );
- 按一下 [Run query] (執行查詢)。查詢會傳回以下結果:
This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
在「New query」欄位中,執行下列查詢:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.randomGenesIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。
展開以下部分,查看截斷的查詢結果:
查詢結果
Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 2925355 C A 跨基因 TTC34、ACTRT2 0.001 2 4 1 2933170 G A 跨基因 TTC34、ACTRT2 0.0083 0 4 1 2944477 G A 跨基因 ACTRT2,LINC00982 0.003 4 6 1 2967591 A T 跨基因 ACTRT2,LINC00982 0.0092 1 2 1 2975255 T C 下游 LINC00982 0.0082 1 2 1 2977223 C T ncRNA_intronic LINC00982 0.0072 1 2 1 2978803 G C ncRNA_exonic LINC00982 0.002 4 6 1 3006466 G A intronic PRDM16 0.0098 1 2 1 3011333 G T intronic PRDM16 0.004 1 2 1 3019659 C T intronic PRDM16 0.0031 1 2 1 3036896 G A intronic PRDM16 0.001 1 2 1 3037388 G A intronic PRDM16 0.002 2 4 1 3041250 T G intronic PRDM16 0.006 2 4 1 3042502 A T intronic PRDM16 0.003 4 6 1 3053713 A C intronic PRDM16 0.002 1 2 1 3063109 C T intronic PRDM16 0.002 0 2 1 3063593 T C intronic PRDM16 0.003 1 2 1 3076439 C T intronic PRDM16 0.001 2 4 1 3078960 G A intronic PRDM16 0.007 2 4 1 3084268 A C intronic PRDM16 0.005 0 2 1 3084492 T C intronic PRDM16 0.0015 0 2 1 3084786 T C intronic PRDM16 0.0015 0 4 1 3111119 G A intronic PRDM16 0.003 1 2 1 3111643 C T intronic PRDM16 0.0041 1 2 1 3114807 G A intronic PRDM16 0.0041 1 2 1 3165530 C T intronic PRDM16 0.0089 1 2 1 3169325 G A intronic PRDM16 0.008 2 4 1 3179623 C T intronic PRDM16 0.003 2 4 1 3181097 C T intronic PRDM16 0.001 2 4 1 3194000 G C intronic PRDM16 0.005 2 4 1 3195769 T C intronic PRDM16 0.002 1 2 1 3197351 C T intronic PRDM16 0.0061 1 2 1 3224100 C A intronic PRDM16 0.003 2 4 1 3228644 G T intronic PRDM16 0.001 2 4 1 3234045 G A intronic PRDM16 0.002 1 2 1 3235971 G A intronic PRDM16 0.0089 1 2 1 3274115 C T intronic PRDM16 0.001 2 4 1 3291388 G A intronic PRDM16 0.002 2 4 1 3295658 A C intronic PRDM16 0.0068 0 6 1 3295937 A C intronic PRDM16 0.0068 0 2 1 3296205 T C intronic PRDM16 0.0083 0 2 1 3315690 G A intronic PRDM16 0.001 2 4 1 3329212 G A 外顯 PRDM16 0.0031 錯義 1 2 1 3331787 C T intronic PRDM16 0.0099 1 2 1 3370316 G C 上游 ARHGEF16 0.001 2 4 1 3379560 A G intronic ARHGEF16 0.0051 0 6 1 3391174 C T intronic ARHGEF16 0.006 1 2 1 3413873 G A 外顯 MEGF6 0.003 錯義 1 2 1 3416272 C T 外顯 MEGF6 0.0072 靜音 2 4 1 3417122 G A intronic MEGF6 0.0038 2 4 1 3436219 G A intronic MEGF6 0.0046 2 4 1 12907456 A G 外顯 HNRNPCL1,LOC649330 0.006 錯義 0 10 1 12907518 C A 外顯 HNRNPCL1,LOC649330 1.0E-4 錯義 0 10 1 12908499 G C intronic HNRNPCL1 0.0031 0 8 1 12931660 G C 跨基因 PRAMEF2、PRAMEF4 0.004 1 2 1 12937721 G T 跨基因 PRAMEF2、PRAMEF4 0.0038 0 2 1 12940827 G T intronic PRAMEF4 0.007 2 4 1 12942759 T G intronic PRAMEF4 0.0076 0 10 1 12942805 T G intronic PRAMEF4 0.0061 0 12 1 12942812 G A intronic PRAMEF4 0.0061 0 12 1 12942875 A G intronic PRAMEF4 0.0068 0 6 1 12942912 G C intronic PRAMEF4 2.0E-4 0 2 1 12942937 A T 外顯 PRAMEF4 0.0029 錯義 0 2 1 12942940 T G 外顯 PRAMEF4 0.0038 錯義 0 2 1 12943940 T C intronic PRAMEF4 0.0015 0 12 1 12944138 A G intronic PRAMEF4 8.0E-4 0 12 1 12944234 G A intronic PRAMEF4 0.0015 0 12 1 12944589 T G intronic PRAMEF4 0.003 0 4 1 12944845 A C intronic PRAMEF4 0.0014 0 6 1 12946439 T C 上游 PRAMEF4 0.0029 0 10 1 12946833 G A 上游 PRAMEF4 0.001 0 8 1 12946835 T A 上游 PRAMEF4 0.004 0 12 1 12995204 G T 跨基因 PRAMEF8、PRAMEF6 0.003 1 4 1 12997638 T C 下游 PRAMEF6、PRAMEF9 0.003 2 4 1 13007841 G C 上游 PRAMEF6 0.0043 0 8 1 13019228 T A 跨基因 PRAMEF6、LOC391003 0.0015 0 10 1 13038503 G A UTR3 LOC391003 0.0072 1 2 1 13051650 C T 跨基因 LOC391003,PRAMEF5 0.002 2 4 1 15706063 G A intronic FHAD1 0.0029 1 2 1 15713292 C T intronic FHAD1 0.001 1 2 1 15766541 G C intronic CTRC 0.001 1 2 1 15782601 T C 上游 CELA2A 0.0038 1 2 1 15828125 G A intronic CASP9 0.0014 2 4 1 15831037 G A intronic CASP9 0.0099 1 2 1 15840513 T G intronic CASP9 0.0043 2 4 1 15868742 G A intronic DNAJC16 0.001 1 2 1 15876704 G A intronic DNAJC16 0.001 1 2 1 15900342 C A intronic AGMAT 0.001 1 2 1 15906257 T C intronic AGMAT 8.0E-4 1 2 1 15911897 A G 上游 AGMAT 0.0043 2 4 1 22764178 C T 跨基因 WNT4、ZBTB40 0.001 2 4 1 22791939 C T intronic ZBTB40 0.0089 2 4 1 22874394 C G 跨基因 ZBTB40、EPHA8 0.007 1 2 1 22875103 C G 跨基因 ZBTB40、EPHA8 0.007 1 2 1 22906403 C T intronic EPHA8 0.008 2 4 1 22912956 G A intronic EPHA8 0.001 1 2 1 22917007 C T intronic EPHA8 0.001 2 4 1 22927240 G A 外顯 EPHA8 0.0013 錯義 2 4 1 22932265 G A 跨基因 EPHA8、MIR6127 0.0089 2 4 1 22944057 C T 跨基因 EPHA8、MIR6127 0.0089 2 4 1 22978799 A G 上游 C1QB 0.0099 2 4 1 35170588 C T 跨基因 C1orf94、GJB5 0.01 1 2 1 35172426 C T 跨基因 C1orf94、GJB5 0.008 1 2 1 35172447 G A 跨基因 C1orf94、GJB5 0.001 1 2 1 35175302 C T 跨基因 C1orf94、GJB5 0.008 1 2 1 35177410 A T 跨基因 C1orf94、GJB5 0.001 1 2 1 35178768 C T 跨基因 C1orf94、GJB5 0.0014 2 4 1 35179362 G A 跨基因 C1orf94、GJB5 0.0014 2 4 1 35186166 G A 跨基因 C1orf94、GJB5 0.0099 2 4 1 35186520 A C 跨基因 C1orf94、GJB5 0.002 2 4 1 35196361 G A 跨基因 C1orf94、GJB5 0.0099 2 4 1 35223545 C T 外顯 GJB5 0.001 靜音 1 2 1 35224029 G A UTR3 GJB5 0.003 1 2 1 35227895 T C UTR3 GJB4 5.0E-4 1 2 1 35230455 G T 跨基因 GJB4、GJB3 0.0043 1 2 1 35232954 T C 跨基因 GJB4、GJB3 0.003 1 2 1 35237986 G A 跨基因 GJB4、GJB3 0.0014 1 2 1 35245522 C T 跨基因 GJB4、GJB3 0.001 1 2 1 35256979 C T 跨基因 GJB3、GJA4 0.002 2 4 1 35263872 C T 跨基因 GJA4、SMIM12 5.0E-4 2 4 1 35323895 A C intronic SMIM12 0.0027 2 4 1 35369676 G A intronic DLGAP3 0.007 2 4 1 35371634 T A 上游 DLGAP3 0.0015 0 4 1 39253519 G A 跨基因 LINC01343,RRAGC 0.005 1 2 1 39288829 G A 跨基因 LINC01343,RRAGC 0.0051 1 2 1 39289832 A C 跨基因 LINC01343,RRAGC 0.002 0 2 1 39312638 G A intronic RRAGC 0.0038 2 4 1 39361372 G A intronic RHBDL2 0.005 1 2 1 39363826 T G intronic RHBDL2 0.0029 1 2 1 39367555 T C intronic RHBDL2 0.007 2 4 1 39369531 T C intronic RHBDL2 0.001 2 4 1 39370202 T C intronic RHBDL2 0.01 1 2 1 39449101 A G 跨基因 RHBDL2、AKIRIN1 0.001 2 4 1 39475057 G A 跨基因 AKIRIN1、NDUFS5 0.01 1 2 1 39485016 C T 跨基因 AKIRIN1、NDUFS5 0.001 2 4 1 39488137 A G 跨基因 AKIRIN1、NDUFS5 0.001 2 4 1 39499212 A C intronic NDUFS5 0.001 0 2 1 39500605 C G 下游 NDUFS5 0.002 0 10 1 46813814 T C intronic NSUN4 0.0014 1 2 1 46817258 A G intronic NSUN4 0.005 0 2 1 46843158 T C 跨基因 NSUN4,FAAH 0.001 1 2 1 46933509 A G 跨基因 LINC01398,DMBX1 0.002 1 2 1 46935021 G A 跨基因 LINC01398,DMBX1 0.004 2 4 1 46939253 T A 跨基因 LINC01398,DMBX1 0.004 2 4 1 46951788 C A 跨基因 LINC01398,DMBX1 0.002 2 4 1 46980864 G C 下游 DMBX1 0.003 1 2 1 46989657 T C 跨基因 DMBX1、MKNK1-AS1 0.007 1 2 1 46994678 C T 跨基因 DMBX1、MKNK1-AS1 0.002 1 2 1 46999438 T C 跨基因 DMBX1、MKNK1-AS1 0.002 1 2 1 92761505 A G intronic GLMN 0.001 2 4 1 92764270 G C intronic GLMN 0.001 2 4 1 92802210 G A intronic RPAP2 0.0072 1 2 1 92820663 T A intronic RPAP2 0.0058 1 2 1 92820664 G T intronic RPAP2 0.0058 1 2 1 92820953 G A intronic RPAP2 0.007 2 4 1 92824766 A G intronic RPAP2 0.0058 1 2 1 92849183 C A intronic RPAP2 0.01 2 4 1 92850696 C G intronic RPAP2 0.0023 1 2 1 92861357 T C 跨基因 RPAP2、GFI1 0.01 2 4 1 92877460 C G 跨基因 RPAP2、GFI1 0.002 1 2 1 92880643 A G 跨基因 RPAP2、GFI1 0.001 2 4 1 92911540 G A 跨基因 RPAP2、GFI1 0.004 2 4 1 92911721 A C 跨基因 RPAP2、GFI1 0.0031 0 8 1 92918277 C T 跨基因 RPAP2、GFI1 0.001 2 4 1 92950920 G A intronic GFI1 0.008 2 4 1 92964788 G A 跨基因 GFI1、EVI5 0.0023 1 2 1 92977480 C T UTR3 EVI5 0.002 1 2 1 92985213 C T intronic EVI5 0.001 2 4 1 92988342 C T intronic EVI5 0.008 2 4 1 92992283 G A intronic EVI5 0.01 2 4 1 92999760 C T intronic EVI5 0.003 1 2 1 93005149 G C intronic EVI5 0.003 0 4 1 93018543 A T intronic EVI5 0.01 2 4 1 93033744 C T intronic EVI5 0.001 2 4 1 111400296 G A 跨基因 KCNA3,CD53 0.0014 2 4 1 111411924 C T 跨基因 KCNA3,CD53 0.003 1 2 1 111441850 C G UTR3 CD53 0.003 2 4 1 111451527 C T 跨基因 CD53、LRIF1 0.008 2 4 1 111454082 C A 跨基因 CD53、LRIF1 0.001 2 4 1 111466506 A G 跨基因 CD53、LRIF1 0.001 2 4 1 111525974 G A 跨基因 LRIF1、DRAM2 0.002 2 4 1 111574573 G T 跨基因 LRIF1、DRAM2 0.0072 2 4 1 111574594 T A 跨基因 LRIF1、DRAM2 0.005 1 2 1 111574647 G A 跨基因 LRIF1、DRAM2 0.005 1 2 1 111591746 T A 跨基因 LRIF1、DRAM2 0.005 1 2 1 111601459 A G 跨基因 LRIF1、DRAM2 0.005 1 2 1 111604748 G C 跨基因 LRIF1、DRAM2 0.005 1 2 1 112191526 T G intronic RAP1A 0.001 2 4 1 112206765 A G intronic RAP1A 0.0043 1 2 1 112226517 G A intronic RAP1A 0.001 0 2 1 112263324 G T 跨基因 RAP1A、FAM212B 0.003 2 4 1 112264843 G A UTR3 FAM212B 0.001 1 2 1 112285810 C T ncRNA_intronic FAM212B-AS1 0.004 1 2 1 112304285 T C intronic DDX20 0.0043 1 2 1 112307213 A C intronic DDX20 0.0043 1 2 1 112309436 G T 外顯 DDX20 0.0 錯義 1 2 1 112317384 T C 跨基因 DDX20、KCND3 0.0014 1 2 1 112381367 C T intronic KCND3 0.002 1 2 1 112396571 G T ncRNA_exonic KCND3-IT1 0.001 1 2 1 113520038 G A 跨基因 SLC16A1-AS1,LOC100996251 0.0023 1 2 使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。