groupGenes - Group sequences by gene assignment
groupGenes will group rows by shared V and J gene assignments,
and optionally also by junction lengths.
Both VH:VL paired single-cell BCR-seq and unpaired bulk-seq (heavy chain-only)
In the case of ambiguous (multiple) gene assignments, the grouping may
be specified to be a union across all ambiguous V and J gene pairs,
analagous to single-linkage clustering (i.e., allowing for chaining).
groupGenes( data, v_call = "v_call", j_call = "j_call", junc_len = NULL, cell_id = NULL, locus = NULL, only_igh = TRUE, first = FALSE )
- data.frame containing sequence data.
- name of the column containing the heavy chain V-segment allele calls.
- name of the column containing the heavy chain J-segment allele calls.
- name of column containing the junction length. Optional.
- name of the column containing cell IDs. Only applicable and required for single-cell mode.
- name of the column containing locus information. Only applicable and required for single-cell mode.
- use only heavy chain (
IGH) sequences for grouping, disregarding light chains. Only applicable and required for single-cell mode. Default is
TRUEonly the first call of the gene assignments is used. if
FALSEthe union of ambiguous gene assignments is used to group all sequences with any overlapping gene calls.
Returns a modified data.frame with disjoint union indices
in a new
Note that if
junc_len is supplied, the grouping this
will have been based on V, J, and L simultaneously despite the column name
Note that the output
columns will be converted to
character if they were
factor in the
To invoke single-cell mode, both
locus must be supplied. Otherwise,
the function will run under non-single-cell mode, using all input sequences regardless of the
value in the
Under single-cell mode for VH:VL paired sequences, there is a choice of whether grouping
should be done using only heavy chain (
IGH) sequences only, or using both heavy chain
IGH) and light chain (
IGL) sequences. This is governed by
Values in the
locus column must be one of
junc_len, the call amounts to a 1-stage partitioning of the sequences/cells
based on V annotation, J annotation, and junction length simultaneously. Without supplying this
columns, the call amounts to the first stage of a 2-stage partitioning, in which sequences/cells
are partitioned in the first stage based on V annotation and J annotation, and then in the second
stage further split based on junction length.
It is assumed that ambiguous gene assignments are separated by commas.
In the input
columns, if present, must be
character, as opposed to
All rows containing
NA values in their any of the
junc_len, columns will be removed. A warning will be issued when a row
NA is removed.
Expectation for single-cell input¶
For single-cell BCR data with VH:VL pairing, it is assumed that
- every row represents a sequence (chain)
- heavy and light chains of the same cell are linked by
- value in
locuscolumn indicates whether the chain is heavy or light
- each cell possibly contains multiple heavy and/or light chains
- every chain has its own V(D)J annotation, in which ambiguous V(D)J
annotations, if any, are separated by
- A cell has 1 heavy chain and 2 light chains
- There should be 3 rows corresponding to this cell
- One of the light chain has ambiguous V annotation, which looks like
Homsap IGKV1-39*01 F,Homsap IGKV1D-39*01 F.
# Group by genes db <- groupGenes(ExampleDb, v_call="v_call", j_call="j_call")