使用 BigQuery 執行間隔 JOIN

您可以使用 BigQuery 對包含基因體片段間隔或重疊所述資料的變異基因段執行 JOIN 查詢。本頁面說明如何使用複雜的 JOIN 查詢取得基因名稱清單並執行下列操作:

  • 尋找重疊基因的罕見 SNP。
  • 在全基因體樣本的每個基因端尋找 100,000 個鹼基對。

本指南將列舉三個查詢的範例。每個查詢都示範 BigQuery 如何針對不同大小的基因體資料進行資源調度:

資料來自擁有近 90 億個資料列的 Tute Genomics 註解資料表與 Illumina Platinum Genomes 資料集。如果您對這些資料集並不熟悉,請參閱下列連結:

查詢內嵌資料表

下列範例使用在查詢中定義的間隔資料表 (名為 intervals),並顯示如何使用包含 Illumina Platinum Genomes 中變異基因段的資料表執行 JOIN 查詢:

  1. 前往 Google Cloud 控制台的「BigQuery」頁面。

    前往「BigQuery」頁面

  2. 按一下 [Compose query] (撰寫查詢)

  3. 在「New query」欄位中,執行下列查詢:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- Define an inline table that uses five rows
      -- selected from silver-wall-555.TuteTable.hg19.
      intervals AS (
        SELECT * FROM UNNEST ([
        STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
        ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
        ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
        ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
        ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
        ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
      ])),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  4. 按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。

    展開下列部分,查看查詢結果:

    查詢結果

    Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronic RRNAD1 0.002 2 4
    1 156705390 C T intronic RRNAD1 8.0E-4 0 2
    1 156714207 T C intronic HDGF 0.003 0 6
    1 156714440 A C intronic HDGF 0.0068 0 12
    1 156723870 C T 跨基因 HDGF、PRCC 0.006 1 2
    1 156724456 C T 跨基因 HDGF、PRCC 0.002 2 4
    1 156733988 C T 跨基因 HDGF、PRCC 0.001 1 2
    1 156742258 T G intronic PRCC 0.001 2 4
    1 156744826 T G intronic PRCC 0.002 0 8
    1 156779764 G A intronic SH2D2A 0.001 2 4
    1 156783454 A C intronic SH2D2A 0.0014 1 2
    1 156786144 C T intronic NTRK1、SH2D2A 0.0031 2 4
    1 156790510 A T intronic NTRK1 0.002 1 2
    1 156815332 A C intronic INSRR、NTRK1 0.003 0 2
    1 156830778 G A 外顯 NTRK1 0.0067 錯義 2 4
    1 156842064 C T intronic NTRK1 0.0014 1 2
    1 156843438 C A 外顯 NTRK1 0.0032 錯義 1 2
    1 156845773 C T intronic NTRK1 0.001 2 4
    1 156873318 T C intronic PEAR1 0.01 4 8
    1 156922740 G A intronic ARHGEF11 0.007 1 2
    1 156930100 C T intronic ARHGEF11 0.001 2 4
    2 113901230 G A 跨基因 IL1RN、PSD4 0.0082 1 2
    2 113953418 C A intronic PSD4 0.001 2 4
    2 113967621 G C 跨基因 PSD4、PAX8 0.002 0 6
    2 113967624 T C 跨基因 PSD4、PAX8 0.002 0 2
    2 113980967 G A intronic PAX8 0.002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4
    2 114061327 T C 跨基因 PAX8,CBWD2 0.001 2 4
    2 114084018 A C 跨基因 PAX8,CBWD2 0.0045 0 4
    2 114099037 G A 跨基因 PAX8,CBWD2 0.0051 1 2
    2 114105670 A T 跨基因 PAX8,CBWD2 0.001 1 2
    2 114111325 G T 跨基因 PAX8,CBWD2 0.001 1 2
    3 12265797 C T 跨基因 SYN2,PPARG 0.0089 2 4
    3 12277958 A G 跨基因 SYN2,PPARG 0.002 1 2
    3 12296019 G A 跨基因 SYN2,PPARG 0.002 2 4
    3 12316549 G C 跨基因 SYN2,PPARG 0.002 1 2
    3 12335681 T G intronic PPARG 0.0092 2 4
    3 12348795 T C intronic PPARG 0.0014 1 2
    3 12353106 T C intronic PPARG 0.001 2 4
    3 12403825 G A intronic PPARG 0.0051 2 4
    3 12404394 G A intronic PPARG 0.001 1 2
    3 12410289 G A intronic PPARG 0.008 2 4
    3 12431381 C T intronic PPARG 0.0061 2 4
    3 12447267 G A intronic PPARG 0.0089 2 4
    3 12449379 C T intronic PPARG 0.0092 2 4
    3 12450848 C A intronic PPARG 0.0092 2 4
    3 12462847 T C intronic PPARG 0.002 1 2
    3 12492797 G A 跨基因 PPARG,TSEN2 0.01 1 2
    3 12503201 G A 跨基因 PPARG,TSEN2 0.0099 2 4
    3 12530460 A G intronic TSEN2 0.0092 2 4
    3 12531167 A G intronic TSEN2 0.0099 2 4
    3 12557737 A G intronic TSEN2 0.001 2 4
    3 59636143 A G 跨基因 C3orf67、FHIT 0.003 3 6
    3 59645934 A C 跨基因 C3orf67、FHIT 0.004 1 2
    3 59646893 G A 跨基因 C3orf67、FHIT 0.002 1 2
    3 59697024 A G 跨基因 C3orf67、FHIT 0.0072 1 2
    3 59701013 G A 跨基因 C3orf67、FHIT 0.004 2 4
    3 59733945 A G 跨基因 C3orf67、FHIT 0.001 2 4
    3 59747482 C T intronic FHIT 0.001 2 4
    3 59750635 A G intronic FHIT 0.003 1 2
    3 59757776 C T intronic FHIT 0.001 2 4
    3 59770612 G A intronic FHIT 0.001 2 4
    3 59804444 G C intronic FHIT 0.001 2 4
    3 59819769 T C intronic FHIT 0.001 2 4
    3 59884396 C T intronic FHIT 0.001 2 4
    3 59960728 A C intronic FHIT 0.01 1 2
    3 59970345 G A intronic FHIT 0.002 1 2
    3 59972417 T A intronic FHIT 0.0072 0 2
    3 60104328 C A intronic FHIT 0.01 2 4
    3 60139062 G A intronic FHIT 0.01 0 2
    3 60158066 C T intronic FHIT 0.001 1 2
    3 60169285 C T intronic FHIT 0.005 1 2
    3 60216185 T C intronic FHIT 0.002 1 2
    3 60226380 G A intronic FHIT 0.007 2 4
    3 60234539 C A intronic FHIT 0.002 1 2
    3 60247464 A C intronic FHIT 0.004 2 4
    3 60269926 A G intronic FHIT 0.007 2 4
    3 60271228 G T intronic FHIT 0.007 2 4
    3 60286972 T C intronic FHIT 0.001 2 4
    3 60301412 C G intronic FHIT 0.001 1 2
    3 60312251 C T intronic FHIT 0.0099 1 2
    3 60317682 A G intronic FHIT 0.008 1 2
    3 60328557 C G intronic FHIT 0.0043 2 4
    3 60342562 C T intronic FHIT 0.006 1 2
    3 60400033 G A intronic FHIT 0.004 2 4
    3 60435819 C T intronic FHIT 0.006 2 4
    3 60435820 G T intronic FHIT 0.004 1 2
    3 60441288 T C intronic FHIT 0.006 2 4
    3 60444465 C A intronic FHIT 0.01 1 2
    3 60444575 C T intronic FHIT 0.001 1 2
    3 60450581 T C intronic FHIT 0.01 1 2
    3 60456571 G A intronic FHIT 0.001 2 4
    3 60473568 C G intronic FHIT 0.001 1 2
    3 60487557 T C intronic FHIT 0.001 1 2
    3 60559705 A G intronic FHIT 0.002 2 4
    3 60570764 T C intronic FHIT 0.008 2 4
    3 60582100 C T intronic FHIT 0.001 1 2
    3 60587192 G A intronic FHIT 0.004 1 2
    3 60599869 G A intronic FHIT 0.0086 2 4
    3 60603091 C T intronic FHIT 0.001 2 4
    3 60603250 A T intronic FHIT 0.0099 1 2
    3 60609831 T G intronic FHIT 0.001 2 4
    3 60619756 G T intronic FHIT 0.0015 2 4
    3 60680758 C T intronic FHIT 0.0089 2 4
    3 60702243 G C intronic FHIT 0.001 2 4
    3 60702532 A G intronic FHIT 0.001 1 2
    3 60714328 A T intronic FHIT 0.004 1 2
    3 60725297 G A intronic FHIT 0.001 1 2
    3 60726640 G A intronic FHIT 0.01 2 4
    3 60795144 A G intronic FHIT 0.001 2 4
    3 60807171 A G intronic FHIT 0.001 1 2
    3 60813868 T C intronic FHIT 0.001 1 2
    3 60826546 C G intronic FHIT 0.0023 1 2
    3 60837392 C T intronic FHIT 0.001 1 2
    3 60846310 A G intronic FHIT 0.01 0 2
    3 60850985 C T intronic FHIT 0.004 1 2
    3 60852559 T C intronic FHIT 0.008 1 2
    3 60871759 T C intronic FHIT 0.004 1 2
    3 60884396 C T intronic FHIT 0.002 2 4
    3 60897092 C A intronic FHIT 0.001 2 4
    3 60940759 C T intronic FHIT 0.0089 1 2
    3 60982595 A G intronic FHIT 0.003 2 4
    3 60999283 G A intronic FHIT 0.001 1 2
    3 61042977 A G intronic FHIT 0.001 2 4
    3 61043349 T C intronic FHIT 0.001 2 4
    3 61044789 A C intronic FHIT 0.001 2 4
    3 61141621 G A intronic FHIT 0.003 1 2
    3 61148655 G C intronic FHIT 0.001 2 4
    3 61170747 C T intronic FHIT 0.003 1 2
    3 61189473 C G intronic FHIT 0.0099 1 2
    3 61190425 C T intronic FHIT 0.0023 2 4
    3 61193853 C T intronic FHIT 0.0099 0 2
    3 61194793 C T intronic FHIT 0.007 0 2
    3 61194840 A G intronic FHIT 0.0099 0 2
    3 61194886 T A intronic FHIT 0.0099 0 2
    3 61201777 C T intronic FHIT 0.001 2 4
    3 61202292 T C intronic FHIT 0.007 1 2
    3 61232806 G C intronic FHIT 0.0099 1 2
    3 61232910 C T intronic FHIT 0.0099 1 2
    3 61235824 A T intronic FHIT 0.001 2 4
    3 61283810 A C 跨基因 FHIT、PTPRG 0.0089 1 2
    3 61293731 T A 跨基因 FHIT、PTPRG 0.0089 2 4
    3 61296730 C T 跨基因 FHIT、PTPRG 0.001 1 2
    3 61326341 C T 跨基因 FHIT、PTPRG 0.004 2 4
    3 61326620 T C 跨基因 FHIT、PTPRG 0.01 1 2
    3 61327649 G C 跨基因 FHIT、PTPRG 0.001 2 4
    3 61330545 G C 跨基因 FHIT、PTPRG 0.001 2 4
    3 61335803 G A 跨基因 FHIT、PTPRG 0.001 2 4

    使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。

使用具體化資料表

處理大規模的大數據時,您可以具體化間隔資料表,並針對新資料表執行 JOIN 查詢。繼續閱讀本節的其餘內容前,請按照下列步驟建立資料集:

  1. 在 Google Cloud 控制台開啟「BigQuery」頁面。

    前往「BigQuery」頁面

  2. 在「Explorer」面板中,選取要建立資料集的專案。

  3. 展開 「動作」選項,然後按一下「建立資料集」

  4. 在「Create dataset」(建立資料集) 頁面:

    1. 在「Dataset ID」(資料集 ID) 中輸入 genomics
    2. 保留其他預設設定。
    3. 點選「建立資料集」

查詢具有特定基因的具體化資料表

下列步驟說明如何將包含 silver-wall-555:TuteTable.hg19 資料表中特定基因清單的新間隔資料表具體化。

  1. 如要建立間隔資料表,請按照下列步驟操作:

    1. 在 Google Cloud 控制台開啟「BigQuery」頁面。

      前往「BigQuery」頁面

    2. 按一下 [Compose query] (撰寫查詢)

    3. 在「New query」欄位中,執行下列查詢。這項查詢會將部分 silver-wall-555:TuteTable.hg19 資料表具體化為新的 genomics.myIntervalTable 間隔資料表。

      #standardSQL
      CREATE TABLE `genomics.myIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start)-100000 AS region_start,
        MAX(`End`)+100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4',
        'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3',
        'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA',
        'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11',
        'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX')
      GROUP BY
        Chr,
        Gene );
    4. 按一下 [Run query] (執行查詢)。查詢會傳回以下結果:

    This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
  2. 在「New query」欄位中,執行下列查詢:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.myIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  3. 按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。

    展開下列部分,查看查詢結果:

    查詢結果

    Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronic RRNAD1 0.002 2 4
    1 156705390 C T intronic RRNAD1 8.0E-4 0 2
    1 156714207 T C intronic HDGF 0.003 0 6
    1 156714440 A C intronic HDGF 0.0068 0 12
    1 156723870 C T 跨基因 HDGF、PRCC 0.006 1 2
    1 156724456 C T 跨基因 HDGF、PRCC 0.002 2 4
    1 156733988 C T 跨基因 HDGF、PRCC 0.001 1 2
    1 156742258 T G intronic PRCC 0.001 2 4
    1 156744826 T G intronic PRCC 0.002 0 8
    1 156779764 G A intronic SH2D2A 0.001 2 4
    1 156783454 A C intronic SH2D2A 0.0014 1 2
    1 156786144 C T intronic NTRK1、SH2D2A 0.0031 2 4
    1 156790510 A T intronic NTRK1 0.002 1 2
    1 156815332 A C intronic INSRR、NTRK1 0.003 0 2
    1 156830778 G A 外顯 NTRK1 0.0067 錯義 2 4
    1 156842064 C T intronic NTRK1 0.0014 1 2
    1 156843438 C A 外顯 NTRK1 0.0032 錯義 1 2
    1 156845773 C T intronic NTRK1 0.001 2 4
    1 156873318 T C intronic PEAR1 0.01 4 8
    1 156922740 G A intronic ARHGEF11 0.007 1 2
    1 156930100 C T intronic ARHGEF11 0.001 2 4
    2 113901230 G A 跨基因 IL1RN、PSD4 0.0082 1 2
    2 113953418 C A intronic PSD4 0.001 2 4
    2 113967621 G C 跨基因 PSD4、PAX8 0.002 0 6
    2 113967624 T C 跨基因 PSD4、PAX8 0.002 0 2
    2 113980967 G A intronic PAX8 0.002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4
    2 114061327 T C 跨基因 PAX8,CBWD2 0.001 2 4
    2 114084018 A C 跨基因 PAX8,CBWD2 0.0045 0 4
    2 114099037 G A 跨基因 PAX8,CBWD2 0.0051 1 2
    2 114105670 A T 跨基因 PAX8,CBWD2 0.001 1 2
    2 114111325 G T 跨基因 PAX8,CBWD2 0.001 1 2
    3 12265797 C T 跨基因 SYN2,PPARG 0.0089 2 4
    3 12277958 A G 跨基因 SYN2,PPARG 0.002 1 2
    3 12296019 G A 跨基因 SYN2,PPARG 0.002 2 4
    3 12316549 G C 跨基因 SYN2,PPARG 0.002 1 2
    3 12335681 T G intronic PPARG 0.0092 2 4
    3 12348795 T C intronic PPARG 0.0014 1 2
    3 12353106 T C intronic PPARG 0.001 2 4
    3 12403825 G A intronic PPARG 0.0051 2 4
    3 12404394 G A intronic PPARG 0.001 1 2
    3 12410289 G A intronic PPARG 0.008 2 4
    3 12431381 C T intronic PPARG 0.0061 2 4
    3 12447267 G A intronic PPARG 0.0089 2 4
    3 12449379 C T intronic PPARG 0.0092 2 4
    3 12450848 C A intronic PPARG 0.0092 2 4
    3 12462847 T C intronic PPARG 0.002 1 2
    3 12492797 G A 跨基因 PPARG,TSEN2 0.01 1 2
    3 12503201 G A 跨基因 PPARG,TSEN2 0.0099 2 4
    3 12530460 A G intronic TSEN2 0.0092 2 4
    3 12531167 A G intronic TSEN2 0.0099 2 4
    3 12557737 A G intronic TSEN2 0.001 2 4
    3 59636143 A G 跨基因 C3orf67、FHIT 0.003 3 6
    3 59645934 A C 跨基因 C3orf67、FHIT 0.004 1 2
    3 59646893 G A 跨基因 C3orf67、FHIT 0.002 1 2
    3 59697024 A G 跨基因 C3orf67、FHIT 0.0072 1 2
    3 59701013 G A 跨基因 C3orf67、FHIT 0.004 2 4
    3 59733945 A G 跨基因 C3orf67、FHIT 0.001 2 4
    3 59747482 C T intronic FHIT 0.001 2 4
    3 59750635 A G intronic FHIT 0.003 1 2
    3 59757776 C T intronic FHIT 0.001 2 4
    3 59770612 G A intronic FHIT 0.001 2 4
    3 59804444 G C intronic FHIT 0.001 2 4
    3 59819769 T C intronic FHIT 0.001 2 4
    3 59884396 C T intronic FHIT 0.001 2 4
    3 59960728 A C intronic FHIT 0.01 1 2
    3 59970345 G A intronic FHIT 0.002 1 2
    3 59972417 T A intronic FHIT 0.0072 0 2
    3 60104328 C A intronic FHIT 0.01 2 4
    3 60139062 G A intronic FHIT 0.01 0 2
    3 60158066 C T intronic FHIT 0.001 1 2
    3 60169285 C T intronic FHIT 0.005 1 2
    3 60216185 T C intronic FHIT 0.002 1 2
    3 60226380 G A intronic FHIT 0.007 2 4
    3 60234539 C A intronic FHIT 0.002 1 2
    3 60247464 A C intronic FHIT 0.004 2 4
    3 60269926 A G intronic FHIT 0.007 2 4
    3 60271228 G T intronic FHIT 0.007 2 4
    3 60286972 T C intronic FHIT 0.001 2 4
    3 60301412 C G intronic FHIT 0.001 1 2
    3 60312251 C T intronic FHIT 0.0099 1 2
    3 60317682 A G intronic FHIT 0.008 1 2
    3 60328557 C G intronic FHIT 0.0043 2 4
    3 60342562 C T intronic FHIT 0.006 1 2
    3 60400033 G A intronic FHIT 0.004 2 4
    3 60435819 C T intronic FHIT 0.006 2 4
    3 60435820 G T intronic FHIT 0.004 1 2
    3 60441288 T C intronic FHIT 0.006 2 4
    3 60444465 C A intronic FHIT 0.01 1 2
    3 60444575 C T intronic FHIT 0.001 1 2
    3 60450581 T C intronic FHIT 0.01 1 2
    3 60456571 G A intronic FHIT 0.001 2 4
    3 60473568 C G intronic FHIT 0.001 1 2
    3 60487557 T C intronic FHIT 0.001 1 2
    3 60559705 A G intronic FHIT 0.002 2 4
    3 60570764 T C intronic FHIT 0.008 2 4
    3 60582100 C T intronic FHIT 0.001 1 2
    3 60587192 G A intronic FHIT 0.004 1 2
    3 60599869 G A intronic FHIT 0.0086 2 4
    3 60603091 C T intronic FHIT 0.001 2 4
    3 60603250 A T intronic FHIT 0.0099 1 2
    3 60609831 T G intronic FHIT 0.001 2 4
    3 60619756 G T intronic FHIT 0.0015 2 4
    3 60680758 C T intronic FHIT 0.0089 2 4
    3 60702243 G C intronic FHIT 0.001 2 4
    3 60702532 A G intronic FHIT 0.001 1 2
    3 60714328 A T intronic FHIT 0.004 1 2
    3 60725297 G A intronic FHIT 0.001 1 2
    3 60726640 G A intronic FHIT 0.01 2 4
    3 60795144 A G intronic FHIT 0.001 2 4
    3 60807171 A G intronic FHIT 0.001 1 2
    3 60813868 T C intronic FHIT 0.001 1 2
    3 60826546 C G intronic FHIT 0.0023 1 2
    3 60837392 C T intronic FHIT 0.001 1 2
    3 60846310 A G intronic FHIT 0.01 0 2
    3 60850985 C T intronic FHIT 0.004 1 2
    3 60852559 T C intronic FHIT 0.008 1 2
    3 60871759 T C intronic FHIT 0.004 1 2
    3 60884396 C T intronic FHIT 0.002 2 4
    3 60897092 C A intronic FHIT 0.001 2 4
    3 60940759 C T intronic FHIT 0.0089 1 2
    3 60982595 A G intronic FHIT 0.003 2 4
    3 60999283 G A intronic FHIT 0.001 1 2
    3 61042977 A G intronic FHIT 0.001 2 4
    3 61043349 T C intronic FHIT 0.001 2 4
    3 61044789 A C intronic FHIT 0.001 2 4
    3 61141621 G A intronic FHIT 0.003 1 2
    3 61148655 G C intronic FHIT 0.001 2 4
    3 61170747 C T intronic FHIT 0.003 1 2
    3 61189473 C G intronic FHIT 0.0099 1 2
    3 61190425 C T intronic FHIT 0.0023 2 4
    3 61193853 C T intronic FHIT 0.0099 0 2
    3 61194793 C T intronic FHIT 0.007 0 2
    3 61194840 A G intronic FHIT 0.0099 0 2
    3 61194886 T A intronic FHIT 0.0099 0 2
    3 61201777 C T intronic FHIT 0.001 2 4
    3 61202292 T C intronic FHIT 0.007 1 2
    3 61232806 G C intronic FHIT 0.0099 1 2
    3 61232910 C T intronic FHIT 0.0099 1 2
    3 61235824 A T intronic FHIT 0.001 2 4
    3 61283810 A C 跨基因 FHIT、PTPRG 0.0089 1 2
    3 61293731 T A 跨基因 FHIT、PTPRG 0.0089 2 4
    3 61296730 C T 跨基因 FHIT、PTPRG 0.001 1 2
    3 61326341 C T 跨基因 FHIT、PTPRG 0.004 2 4
    3 61326620 T C 跨基因 FHIT、PTPRG 0.01 1 2
    3 61327649 G C 跨基因 FHIT、PTPRG 0.001 2 4
    3 61330545 G C 跨基因 FHIT、PTPRG 0.001 2 4
    3 61335803 G A 跨基因 FHIT、PTPRG 0.001 2 4

    使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。

查詢具有 250 個隨機基因的具體化資料表

下列範例顯示如何針對包含從 silver-wall-555:TuteTable.hg19 資料表隨機選取的 250 個基因的具體化資料表執行間隔 JOIN

  1. 如要建立間隔資料表,請按照下列步驟操作:

    1. 在 Google Cloud 控制台開啟「BigQuery」頁面。

      前往「BigQuery」頁面

    2. 按一下 [Compose query] (撰寫查詢)

    3. 在「New query」(新查詢) 欄位中,執行下列查詢,將部分 silver-wall-555:TuteTable.hg19 資料表具體化為新的 genomics.randomGenesIntervalTable 間隔資料表。

      #standardSQL
      CREATE TABLE `genomics.randomGenesIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start) - 100000 AS region_start,
        MAX(`End`) + 100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250)
      GROUP BY
        Chr,
        Gene );
      1. 按一下 [Run query] (執行查詢)。查詢會傳回以下結果:
      This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
  2. 在「New query」欄位中,執行下列查詢:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.randomGenesIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  3. 按一下 [Run query] (執行查詢)。查詢大約會執行十秒鐘的時間,並會處理約 334 GB 的資料。查詢的結果可識別同類群組內會重疊感興趣區域的罕見變異基因段。

    展開以下部分,查看截斷的查詢結果:

    查詢結果

    Chr 開始 參照 Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 2925355 C A 跨基因 TTC34、ACTRT2 0.001 2 4
    1 2933170 G A 跨基因 TTC34、ACTRT2 0.0083 0 4
    1 2944477 G A 跨基因 ACTRT2,LINC00982 0.003 4 6
    1 2967591 A T 跨基因 ACTRT2,LINC00982 0.0092 1 2
    1 2975255 T C 下游 LINC00982 0.0082 1 2
    1 2977223 C T ncRNA_intronic LINC00982 0.0072 1 2
    1 2978803 G C ncRNA_exonic LINC00982 0.002 4 6
    1 3006466 G A intronic PRDM16 0.0098 1 2
    1 3011333 G T intronic PRDM16 0.004 1 2
    1 3019659 C T intronic PRDM16 0.0031 1 2
    1 3036896 G A intronic PRDM16 0.001 1 2
    1 3037388 G A intronic PRDM16 0.002 2 4
    1 3041250 T G intronic PRDM16 0.006 2 4
    1 3042502 A T intronic PRDM16 0.003 4 6
    1 3053713 A C intronic PRDM16 0.002 1 2
    1 3063109 C T intronic PRDM16 0.002 0 2
    1 3063593 T C intronic PRDM16 0.003 1 2
    1 3076439 C T intronic PRDM16 0.001 2 4
    1 3078960 G A intronic PRDM16 0.007 2 4
    1 3084268 A C intronic PRDM16 0.005 0 2
    1 3084492 T C intronic PRDM16 0.0015 0 2
    1 3084786 T C intronic PRDM16 0.0015 0 4
    1 3111119 G A intronic PRDM16 0.003 1 2
    1 3111643 C T intronic PRDM16 0.0041 1 2
    1 3114807 G A intronic PRDM16 0.0041 1 2
    1 3165530 C T intronic PRDM16 0.0089 1 2
    1 3169325 G A intronic PRDM16 0.008 2 4
    1 3179623 C T intronic PRDM16 0.003 2 4
    1 3181097 C T intronic PRDM16 0.001 2 4
    1 3194000 G C intronic PRDM16 0.005 2 4
    1 3195769 T C intronic PRDM16 0.002 1 2
    1 3197351 C T intronic PRDM16 0.0061 1 2
    1 3224100 C A intronic PRDM16 0.003 2 4
    1 3228644 G T intronic PRDM16 0.001 2 4
    1 3234045 G A intronic PRDM16 0.002 1 2
    1 3235971 G A intronic PRDM16 0.0089 1 2
    1 3274115 C T intronic PRDM16 0.001 2 4
    1 3291388 G A intronic PRDM16 0.002 2 4
    1 3295658 A C intronic PRDM16 0.0068 0 6
    1 3295937 A C intronic PRDM16 0.0068 0 2
    1 3296205 T C intronic PRDM16 0.0083 0 2
    1 3315690 G A intronic PRDM16 0.001 2 4
    1 3329212 G A 外顯 PRDM16 0.0031 錯義 1 2
    1 3331787 C T intronic PRDM16 0.0099 1 2
    1 3370316 G C 上游 ARHGEF16 0.001 2 4
    1 3379560 A G intronic ARHGEF16 0.0051 0 6
    1 3391174 C T intronic ARHGEF16 0.006 1 2
    1 3413873 G A 外顯 MEGF6 0.003 錯義 1 2
    1 3416272 C T 外顯 MEGF6 0.0072 靜音 2 4
    1 3417122 G A intronic MEGF6 0.0038 2 4
    1 3436219 G A intronic MEGF6 0.0046 2 4
    1 12907456 A G 外顯 HNRNPCL1,LOC649330 0.006 錯義 0 10
    1 12907518 C A 外顯 HNRNPCL1,LOC649330 1.0E-4 錯義 0 10
    1 12908499 G C intronic HNRNPCL1 0.0031 0 8
    1 12931660 G C 跨基因 PRAMEF2、PRAMEF4 0.004 1 2
    1 12937721 G T 跨基因 PRAMEF2、PRAMEF4 0.0038 0 2
    1 12940827 G T intronic PRAMEF4 0.007 2 4
    1 12942759 T G intronic PRAMEF4 0.0076 0 10
    1 12942805 T G intronic PRAMEF4 0.0061 0 12
    1 12942812 G A intronic PRAMEF4 0.0061 0 12
    1 12942875 A G intronic PRAMEF4 0.0068 0 6
    1 12942912 G C intronic PRAMEF4 2.0E-4 0 2
    1 12942937 A T 外顯 PRAMEF4 0.0029 錯義 0 2
    1 12942940 T G 外顯 PRAMEF4 0.0038 錯義 0 2
    1 12943940 T C intronic PRAMEF4 0.0015 0 12
    1 12944138 A G intronic PRAMEF4 8.0E-4 0 12
    1 12944234 G A intronic PRAMEF4 0.0015 0 12
    1 12944589 T G intronic PRAMEF4 0.003 0 4
    1 12944845 A C intronic PRAMEF4 0.0014 0 6
    1 12946439 T C 上游 PRAMEF4 0.0029 0 10
    1 12946833 G A 上游 PRAMEF4 0.001 0 8
    1 12946835 T A 上游 PRAMEF4 0.004 0 12
    1 12995204 G T 跨基因 PRAMEF8、PRAMEF6 0.003 1 4
    1 12997638 T C 下游 PRAMEF6、PRAMEF9 0.003 2 4
    1 13007841 G C 上游 PRAMEF6 0.0043 0 8
    1 13019228 T A 跨基因 PRAMEF6、LOC391003 0.0015 0 10
    1 13038503 G A UTR3 LOC391003 0.0072 1 2
    1 13051650 C T 跨基因 LOC391003,PRAMEF5 0.002 2 4
    1 15706063 G A intronic FHAD1 0.0029 1 2
    1 15713292 C T intronic FHAD1 0.001 1 2
    1 15766541 G C intronic CTRC 0.001 1 2
    1 15782601 T C 上游 CELA2A 0.0038 1 2
    1 15828125 G A intronic CASP9 0.0014 2 4
    1 15831037 G A intronic CASP9 0.0099 1 2
    1 15840513 T G intronic CASP9 0.0043 2 4
    1 15868742 G A intronic DNAJC16 0.001 1 2
    1 15876704 G A intronic DNAJC16 0.001 1 2
    1 15900342 C A intronic AGMAT 0.001 1 2
    1 15906257 T C intronic AGMAT 8.0E-4 1 2
    1 15911897 A G 上游 AGMAT 0.0043 2 4
    1 22764178 C T 跨基因 WNT4、ZBTB40 0.001 2 4
    1 22791939 C T intronic ZBTB40 0.0089 2 4
    1 22874394 C G 跨基因 ZBTB40、EPHA8 0.007 1 2
    1 22875103 C G 跨基因 ZBTB40、EPHA8 0.007 1 2
    1 22906403 C T intronic EPHA8 0.008 2 4
    1 22912956 G A intronic EPHA8 0.001 1 2
    1 22917007 C T intronic EPHA8 0.001 2 4
    1 22927240 G A 外顯 EPHA8 0.0013 錯義 2 4
    1 22932265 G A 跨基因 EPHA8、MIR6127 0.0089 2 4
    1 22944057 C T 跨基因 EPHA8、MIR6127 0.0089 2 4
    1 22978799 A G 上游 C1QB 0.0099 2 4
    1 35170588 C T 跨基因 C1orf94、GJB5 0.01 1 2
    1 35172426 C T 跨基因 C1orf94、GJB5 0.008 1 2
    1 35172447 G A 跨基因 C1orf94、GJB5 0.001 1 2
    1 35175302 C T 跨基因 C1orf94、GJB5 0.008 1 2
    1 35177410 A T 跨基因 C1orf94、GJB5 0.001 1 2
    1 35178768 C T 跨基因 C1orf94、GJB5 0.0014 2 4
    1 35179362 G A 跨基因 C1orf94、GJB5 0.0014 2 4
    1 35186166 G A 跨基因 C1orf94、GJB5 0.0099 2 4
    1 35186520 A C 跨基因 C1orf94、GJB5 0.002 2 4
    1 35196361 G A 跨基因 C1orf94、GJB5 0.0099 2 4
    1 35223545 C T 外顯 GJB5 0.001 靜音 1 2
    1 35224029 G A UTR3 GJB5 0.003 1 2
    1 35227895 T C UTR3 GJB4 5.0E-4 1 2
    1 35230455 G T 跨基因 GJB4、GJB3 0.0043 1 2
    1 35232954 T C 跨基因 GJB4、GJB3 0.003 1 2
    1 35237986 G A 跨基因 GJB4、GJB3 0.0014 1 2
    1 35245522 C T 跨基因 GJB4、GJB3 0.001 1 2
    1 35256979 C T 跨基因 GJB3、GJA4 0.002 2 4
    1 35263872 C T 跨基因 GJA4、SMIM12 5.0E-4 2 4
    1 35323895 A C intronic SMIM12 0.0027 2 4
    1 35369676 G A intronic DLGAP3 0.007 2 4
    1 35371634 T A 上游 DLGAP3 0.0015 0 4
    1 39253519 G A 跨基因 LINC01343,RRAGC 0.005 1 2
    1 39288829 G A 跨基因 LINC01343,RRAGC 0.0051 1 2
    1 39289832 A C 跨基因 LINC01343,RRAGC 0.002 0 2
    1 39312638 G A intronic RRAGC 0.0038 2 4
    1 39361372 G A intronic RHBDL2 0.005 1 2
    1 39363826 T G intronic RHBDL2 0.0029 1 2
    1 39367555 T C intronic RHBDL2 0.007 2 4
    1 39369531 T C intronic RHBDL2 0.001 2 4
    1 39370202 T C intronic RHBDL2 0.01 1 2
    1 39449101 A G 跨基因 RHBDL2、AKIRIN1 0.001 2 4
    1 39475057 G A 跨基因 AKIRIN1、NDUFS5 0.01 1 2
    1 39485016 C T 跨基因 AKIRIN1、NDUFS5 0.001 2 4
    1 39488137 A G 跨基因 AKIRIN1、NDUFS5 0.001 2 4
    1 39499212 A C intronic NDUFS5 0.001 0 2
    1 39500605 C G 下游 NDUFS5 0.002 0 10
    1 46813814 T C intronic NSUN4 0.0014 1 2
    1 46817258 A G intronic NSUN4 0.005 0 2
    1 46843158 T C 跨基因 NSUN4,FAAH 0.001 1 2
    1 46933509 A G 跨基因 LINC01398,DMBX1 0.002 1 2
    1 46935021 G A 跨基因 LINC01398,DMBX1 0.004 2 4
    1 46939253 T A 跨基因 LINC01398,DMBX1 0.004 2 4
    1 46951788 C A 跨基因 LINC01398,DMBX1 0.002 2 4
    1 46980864 G C 下游 DMBX1 0.003 1 2
    1 46989657 T C 跨基因 DMBX1、MKNK1-AS1 0.007 1 2
    1 46994678 C T 跨基因 DMBX1、MKNK1-AS1 0.002 1 2
    1 46999438 T C 跨基因 DMBX1、MKNK1-AS1 0.002 1 2
    1 92761505 A G intronic GLMN 0.001 2 4
    1 92764270 G C intronic GLMN 0.001 2 4
    1 92802210 G A intronic RPAP2 0.0072 1 2
    1 92820663 T A intronic RPAP2 0.0058 1 2
    1 92820664 G T intronic RPAP2 0.0058 1 2
    1 92820953 G A intronic RPAP2 0.007 2 4
    1 92824766 A G intronic RPAP2 0.0058 1 2
    1 92849183 C A intronic RPAP2 0.01 2 4
    1 92850696 C G intronic RPAP2 0.0023 1 2
    1 92861357 T C 跨基因 RPAP2、GFI1 0.01 2 4
    1 92877460 C G 跨基因 RPAP2、GFI1 0.002 1 2
    1 92880643 A G 跨基因 RPAP2、GFI1 0.001 2 4
    1 92911540 G A 跨基因 RPAP2、GFI1 0.004 2 4
    1 92911721 A C 跨基因 RPAP2、GFI1 0.0031 0 8
    1 92918277 C T 跨基因 RPAP2、GFI1 0.001 2 4
    1 92950920 G A intronic GFI1 0.008 2 4
    1 92964788 G A 跨基因 GFI1、EVI5 0.0023 1 2
    1 92977480 C T UTR3 EVI5 0.002 1 2
    1 92985213 C T intronic EVI5 0.001 2 4
    1 92988342 C T intronic EVI5 0.008 2 4
    1 92992283 G A intronic EVI5 0.01 2 4
    1 92999760 C T intronic EVI5 0.003 1 2
    1 93005149 G C intronic EVI5 0.003 0 4
    1 93018543 A T intronic EVI5 0.01 2 4
    1 93033744 C T intronic EVI5 0.001 2 4
    1 111400296 G A 跨基因 KCNA3,CD53 0.0014 2 4
    1 111411924 C T 跨基因 KCNA3,CD53 0.003 1 2
    1 111441850 C G UTR3 CD53 0.003 2 4
    1 111451527 C T 跨基因 CD53、LRIF1 0.008 2 4
    1 111454082 C A 跨基因 CD53、LRIF1 0.001 2 4
    1 111466506 A G 跨基因 CD53、LRIF1 0.001 2 4
    1 111525974 G A 跨基因 LRIF1、DRAM2 0.002 2 4
    1 111574573 G T 跨基因 LRIF1、DRAM2 0.0072 2 4
    1 111574594 T A 跨基因 LRIF1、DRAM2 0.005 1 2
    1 111574647 G A 跨基因 LRIF1、DRAM2 0.005 1 2
    1 111591746 T A 跨基因 LRIF1、DRAM2 0.005 1 2
    1 111601459 A G 跨基因 LRIF1、DRAM2 0.005 1 2
    1 111604748 G C 跨基因 LRIF1、DRAM2 0.005 1 2
    1 112191526 T G intronic RAP1A 0.001 2 4
    1 112206765 A G intronic RAP1A 0.0043 1 2
    1 112226517 G A intronic RAP1A 0.001 0 2
    1 112263324 G T 跨基因 RAP1A、FAM212B 0.003 2 4
    1 112264843 G A UTR3 FAM212B 0.001 1 2
    1 112285810 C T ncRNA_intronic FAM212B-AS1 0.004 1 2
    1 112304285 T C intronic DDX20 0.0043 1 2
    1 112307213 A C intronic DDX20 0.0043 1 2
    1 112309436 G T 外顯 DDX20 0.0 錯義 1 2
    1 112317384 T C 跨基因 DDX20、KCND3 0.0014 1 2
    1 112381367 C T intronic KCND3 0.002 1 2
    1 112396571 G T ncRNA_exonic KCND3-IT1 0.001 1 2
    1 113520038 G A 跨基因 SLC16A1-AS1,LOC100996251 0.0023 1 2

    使用 1,000 個基因體階段 3 中資料的類似查詢,大約會執行 90 秒的時間,並會處理約 3.38 TB 的資料。