groupGenes - Group sequences by gene assignment


groupGenes will group rows by shared V and J gene assignments, and optionally also by junction lengths. Both VH:VL paired single-cell BCR-seq and unpaired bulk-seq (heavy chain-only) are supported. In the case of ambiguous (multiple) gene assignments, the grouping may be specified to be a union across all ambiguous V and J gene pairs, analagous to single-linkage clustering (i.e., allowing for chaining).


v_call = "v_call",
j_call = "j_call",
junc_len = NULL,
cell_id = NULL,
locus = NULL,
only_igh = TRUE,
first = FALSE


data.frame containing sequence data.
name of the column containing the heavy chain V-segment allele calls.
name of the column containing the heavy chain J-segment allele calls.
name of column containing the junction length. Optional.
name of the column containing cell IDs. Only applicable and required for single-cell mode.
name of the column containing locus information. Only applicable and required for single-cell mode.
use only heavy chain (IGH) sequences for grouping, disregarding light chains. Only applicable and required for single-cell mode. Default is TRUE.
if TRUE only the first call of the gene assignments is used. if FALSE the union of ambiguous gene assignments is used to group all sequences with any overlapping gene calls.


Returns a modified data.frame with disjoint union indices in a new vj_group column.

Note that if junc_len is supplied, the grouping this vj_group will have been based on V, J, and L simultaneously despite the column name being vj_group.

Note that the output v_call, j_call, cell_id, and locus columns will be converted to character if they were factor in the input data.


To invoke single-cell mode, both cell_id and locus must be supplied. Otherwise, the function will run under non-single-cell mode, using all input sequences regardless of the value in the locus column.

Under single-cell mode for VH:VL paired sequences, there is a choice of whether grouping should be done using only heavy chain (IGH) sequences only, or using both heavy chain (IGH) and light chain (IGK, IGL) sequences. This is governed by only_igh.

Values in the locus column must be one of "IGH", "IGK", and "IGL".

By supplying junc_len, the call amounts to a 1-stage partitioning of the sequences/cells based on V annotation, J annotation, and junction length simultaneously. Without supplying this columns, the call amounts to the first stage of a 2-stage partitioning, in which sequences/cells are partitioned in the first stage based on V annotation and J annotation, and then in the second stage further split based on junction length.

It is assumed that ambiguous gene assignments are separated by commas.

In the input data, the v_call, j_call, cell_id, and locus columns, if present, must be character, as opposed to factor.

All rows containing NA values in their any of the v_call, j_call, and, if specified, junc_len, columns will be removed. A warning will be issued when a row containing an NA is removed.

Expectation for single-cell input

For single-cell BCR data with VH:VL pairing, it is assumed that

  • every row represents a sequence (chain)
  • heavy and light chains of the same cell are linked by cell_id
  • value in locus column indicates whether the chain is heavy or light
  • each cell possibly contains multiple heavy and/or light chains
  • every chain has its own V(D)J annotation, in which ambiguous V(D)J annotations, if any, are separated by , (comma)

An example:

  • A cell has 1 heavy chain and 2 light chains
  • There should be 3 rows corresponding to this cell
  • One of the light chain has ambiguous V annotation, which looks like Homsap IGKV1-39*01 F,Homsap IGKV1D-39*01 F.


# Group by genes
db <- groupGenes(ExampleDb, v_call="v_call", j_call="j_call")