groupGenes - Group sequences by gene assignment
groupGenes will group rows by shared V and J gene assignments,
and optionally also by junction lengths. IGH:IGK/IGL, TRB:TRA, and TRD:TRG
paired single-cell BCR/TCR sequencing and unpaired bulk sequencing
(IGH, TRB, TRD chain only) are supported. In the case of ambiguous (multiple)
gene assignments, the grouping may be specified to be a union across all
ambiguous V and J gene pairs, analogous to single-linkage clustering
(i.e., allowing for chaining).
groupGenes( data, v_call = "v_call", j_call = "j_call", junc_len = NULL, cell_id = NULL, locus = "locus", only_heavy = TRUE, first = FALSE )
- data.frame containing sequence data.
- name of the column containing the heavy/long chain V-segment allele calls.
- name of the column containing the heavy/long chain J-segment allele calls.
- name of column containing the junction length.
NULLthen 1-stage partitioning is perform considering only the V and J genes is performed. See Details for further clarification.
- name of the column containing cell identifiers or barcodes.
If specified, grouping will be performed in single-cell mode
with the behavior governed by the
only_heavyarguments. If set to
NULLthen the bulk sequencing data is assumed.
- name of the column containing locus information.
Only applicable to single-cell data.
- use only the IGH (BCR) or TRB/TRD (TCR) sequences
for grouping. Only applicable to single-cell data.
TRUEonly the first call of the gene assignments is used. if
FALSEthe union of ambiguous gene assignments is used to group all sequences with any overlapping gene calls.
Returns a modified data.frame with disjoint union indices
in a new
junc_len is supplied, the grouping this
will have been based on V, J, and junction length simultaneously. However,
the output column name will remain
columns will be converted to type
character if they were of type
factor in the input
To invoke single-cell mode the
cell_id argument must be specified and the
column must be correct. Otherwise,
groupGenes will be run with bulk sequencing assumptions,
using all input sequences regardless of the values in the
Values in the
locus column must be one of
c("IGH", "IGI", "IGK", "IGL") for BCR
c("TRA", "TRB", "TRD", "TRG") for TCR sequences. Otherwise, the function returns an
error message and stops.
Under single-cell mode with paired chained sequences, there is a choice of whether
grouping should be done by (a) using IGH (BCR) or TRB/TRD (TCR) sequences only or
(b) using IGH plus IGK/IGL (BCR) or TRB/TRD plus TRA/TRG (TCR).
This is governed by the
junc_len will force
groupGenes to perform a 1-stage partitioning of the
sequences/cells based on V gene, J gene, and junction length simultaneously.
junc_len=NULL (no column specified), then
groupGenes performs only the first
stage of a 2-stage partitioning in which sequences/cells are partitioned in the first stage
based on V gene and J gene, and then in the second stage further splits the groups based on
junction length (the second stage must be performed independently, as this only returns the
first stage results).
In the input
columns, if present, must be of type
character (as opposed to
It is assumed that ambiguous gene assignments are separated by commas.
All rows containing
NA values in any of the
junc_len != NULL) columns will be removed. A warning will be issued when a row
NA is removed.
Expectations for single-cell data¶
Single-cell paired chain data assumptions:
- every row represents a sequence (chain).
- heavy/long and light/short chains of the same cell are linked by
- the value in
locuscolumn indicates whether the chain is the heavy/long or light/short chain.
- each cell possibly contains multiple heavy/long and/or light/short chains.
- every chain has its own V(D)J annotation, in which ambiguous V(D)J annotations, if any, are separated by a comma.
- A cell has 1 heavy chain and 2 light chains.
- There should be 3 rows corresponding to this cell.
- One of the light chains may have an ambiguous V annotation which looks like
"Homsap IGKV1-39*01 F,Homsap IGKV1D-39*01 F".
# Group by genes db <- groupGenes(ExampleDb) head(db$vj_group)
 "G28" "G50" "G36" "G36" "G86" "G61"