Test data set

After installing IgDiscover, you should run it once on the small test data set that we provide, both to test your installation and to familiarize yourself with running the program.

  1. Download and unpack the test data set (version 0.5). To do this from the command line, use these commands:

    wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz
    tar xvf igdiscover-testdata-0.5.tar.gz
    

The test data set contains some paired-end reads from the human IgM heavy chain dataset ERR1760498 and a database of IGHV, IGHD and IGHJ sequences based on Ensembl annotations. You should use a database of higher quality for your own experiments.
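Before initializing the pipeline, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch and not part of IgDiscover: check_testdata is a hypothetical helper, and it only looks for the two paths that the igdiscover init command in this section refers to (the database directory and reads.1.fastq.gz inside igdiscover-testdata).

```shell
# Hypothetical helper: verify that the unpacked test data directory
# contains the database directory and the first reads file.
check_testdata() {
    dir=${1:-igdiscover-testdata}
    rc=0
    if [ ! -d "$dir/database" ]; then
        echo "missing directory: $dir/database" >&2
        rc=1
    fi
    if [ ! -f "$dir/reads.1.fastq.gz" ]; then
        echo "missing file: $dir/reads.1.fastq.gz" >&2
        rc=1
    fi
    return "$rc"
}

# Check the default location and report the outcome.
if check_testdata igdiscover-testdata; then
    echo "test data looks complete"
else
    echo "test data incomplete; re-run the download and tar commands"
fi
```

The helper returns a non-zero status when either path is missing, so it can also be used as a guard in a longer script.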

  1. Initialize the IgDiscover pipeline directory:

    igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest
    

    Here, discovertest is the name of the pipeline directory that will be created. Only the path to the first reads file needs to be given; the second file is found automatically. You may see a few messages of the form “Skipping ‘x’ because it contains the same sequence as ‘y’”, which you can ignore.

    The command prints a message telling you that the pipeline directory has been initialized, that you should edit the configuration file, and how to run IgDiscover after that.

  2. The generated igdiscover.yaml configuration file does not need to be edited for the test data set, but you may still want to read through it, as you will need to edit it for your own data. You can do this while the pipeline is running in the next step. The configuration is in YAML format. When editing the file, follow the way it is already structured.

  3. Run the analysis. To do so, change into the pipeline directory and run this command:

    cd discovertest && igdiscover run
    

    On this small dataset, running the pipeline should take no more than about 5 minutes.

  4. Finally, inspect the results in the discovertest/iteration-01 or discovertest/final directories. The discovered V genes and extra information are listed in discovertest/iteration-01/new_V_germline.tab. Discovered J genes are in discovertest/iteration-01/new_J.tab. There are also corresponding .fasta files with the sequences only.

    See the explanation of final result files.
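To get a quick overview of the results from the command line, something like the following can be used. This is a sketch under stated assumptions: count_fasta_records and preview_table are hypothetical helpers, not IgDiscover commands, and the file name new_V_germline.fasta is inferred from the statement above that each table has a corresponding .fasta file.

```shell
# Hypothetical helper: count FASTA records (header lines start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: show the first rows and the first four columns
# of a tab-separated file (default: 5 rows).
preview_table() {
    head -n "${2:-5}" "$1" | cut -f 1-4
}

# Run this from the directory that contains discovertest/.
results=discovertest/iteration-01
if [ -d "$results" ]; then
    preview_table "$results/new_V_germline.tab"
    echo "discovered V sequences: $(count_fasta_records "$results/new_V_germline.fasta")"
else
    echo "no results found in $results; has 'igdiscover run' finished?"
fi
```

Only standard tools (grep, head, cut) are used, so the snippet does not depend on the IgDiscover installation itself.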

Other test data sets

ENA project PRJEB15295 contains the data for our Nature Communications paper from 2016, in particular ERR1760498, which is the data for the human “H1” sample (multiplex PCR, IgM heavy chain).

Data used for testing TCR detection (human, RACE): SRR2905677 and SRR2905710.