Lederhosen is a set of tools for OTU clustering of rRNA amplicons using Robert Edgar's USEARCH. It is simple, robust, and fast. Lederhosen was designed from the beginning to handle lots of data from lots of samples, specifically data generated by multiplexed Illumina HiSeq/MiSeq sequencing.
No assumptions are made about the design of your experiment. Therefore, there are no tools for read pre-processing and data analysis or statistics. Insert reads, receive data.
Install Lederhosen by typing:
sudo gem install lederhosen
Check the installation by typing:
lederhosen
You should see some help text.
Bug me: @heyaudy (twitter)
Lederhosen is invoked by typing
Create a UDB database for usearch from TaxCollector. This is not strictly required, but it will make batch searching a lot faster.
lederhosen make_udb \
  --input=taxcollector.fa \
  --output=taxcollector.udb
Cluster reads using USEARCH. Output is a uc file.
lederhosen cluster \
  --input=trimmed/sequences.fasta \
  --identity=0.95 \
  --output=clusters_95.uc \
  --database=taxcollector.udb
The --dry-run parameter prints the usearch command to standard output.
This is useful if you want to run usearch on a cluster.
# with --dry-run, each call prints its usearch command, which is collected in jobs.sh
for reads_file in reads/*.fasta; do
  lederhosen cluster \
    --input="$reads_file" \
    --identity=0.95 \
    --output="$(basename "$reads_file" .fasta).95.uc" \
    --database=taxcollector.udb \
    --threads 1 \
    --dry-run
done > jobs.sh

# send jobs to queue system
cat jobs.sh | parallel -j 24 # run 24 parallel jobs
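Once the jobs have run, it can be handy to sanity-check a uc file before counting taxonomies. The sketch below is not part of Lederhosen. It assumes the usual 10-column, tab-separated USEARCH uc layout (record type in column 1, target label in column 10) and tallies hits per database sequence. The function name `tally_uc` and the sample records are made up for illustration.

```python
from collections import Counter


def tally_uc(lines):
    """Count hits per target label and the number of unmatched reads.

    Assumption: standard USEARCH .uc layout, tab-separated, where column 1
    is the record type ('H' = hit, 'N' = no hit) and column 10 is the label
    of the matched database sequence.
    """
    hits = Counter()
    unmatched = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not a complete .uc record
        record_type, target = fields[0], fields[9]
        if record_type == "H":
            hits[target] += 1
        elif record_type == "N":
            unmatched += 1
    return hits, unmatched


# Made-up records; real labels come from your reads and your database.
example = [
    "H\t0\t250\t97.2\t+\t0\t0\t250M\tread_1\tBacteria;Firmicutes",
    "H\t0\t250\t96.0\t+\t0\t0\t250M\tread_2\tBacteria;Firmicutes",
    "N\t*\t*\t*\t.\t*\t*\t*\tread_3\t*",
]
hits, unmatched = tally_uc(example)
print(hits.most_common(), unmatched)
```

To use it on a real file, pass an open file handle, for example `tally_uc(open("clusters_95.uc"))`.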
Before generating OTU tables, you must generate taxonomy counts tables.
A taxonomy count table looks something like this:
# taxonomy, number_of_reads
Bacteria;...;Akkermansia_muciniphila, 28
...
From there, you can generate OTU abundance matrices at the different levels of classification (domain, phylum, ..., genus, species).
lederhosen count_taxonomies \
  --input=clusters.uc \
  --output=clusters_taxonomies.txt
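To make the idea of "levels of classification" concrete, here is a minimal Python sketch (not Lederhosen's own code) that collapses a taxonomy count table to a chosen level. It assumes each row is `taxonomy, number_of_reads` with ranks separated by semicolons, and that the ranks run domain, phylum, class, order, family, genus, species. The text above only names the first two and last two, so the middle ranks are an assumption. The function name and the counts are made up.

```python
from collections import Counter

# Assumed rank order; the middle ranks are not spelled out in the README.
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def collapse_counts(lines, level):
    """Sum read counts over all taxonomies that share a prefix down to `level`."""
    depth = LEVELS.index(level) + 1
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        taxonomy, count = line.rsplit(",", 1)
        ranks = taxonomy.strip().split(";")
        totals[";".join(ranks[:depth])] += int(count)
    return totals


# Made-up counts for illustration only.
table = [
    "# taxonomy, number_of_reads",
    "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;Blautia_obeum, 28",
    "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Roseburia;Roseburia_intestinalis, 12",
    "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides_fragilis, 60",
]
by_phylum = collapse_counts(table, "phylum")
by_genus = collapse_counts(table, "genus")
print(by_phylum)
```

The same table yields two phylum-level entries but three genus-level entries, which is why the level you choose changes the shape of the downstream OTU table.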
If you did paired-end sequencing, you can generate strict taxonomy tables that only count a read pair when both reads have the same taxonomic description at a given taxonomic level. This takes advantage of the extra sequence length that pairs provide and also acts as a sort of chimera filter. You will, however, end up using fewer of your reads as the level goes from domain to species.
lederhosen count_taxonomies \
  --input=clusters.uc \
  --strict=genus \
  --output=clusters_taxonomies.strict.genus.txt
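The strict rule can be illustrated with a small Python sketch. This is an interpretation of the description, not Lederhosen's implementation: it assumes semicolon-separated lineages, the rank order domain through species, and that an agreeing pair is counted once. The function name `count_strict` and the example pairs are hypothetical.

```python
from collections import Counter

# Assumed rank order, as in a semicolon-separated lineage.
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def count_strict(pairs, level):
    """Count pairs whose two reads agree on every rank down to `level`.

    Returns (counts per agreed lineage, number of disagreeing pairs).
    """
    depth = LEVELS.index(level) + 1
    counts = Counter()
    discordant = 0
    for forward, reverse in pairs:
        a = forward.split(";")[:depth]
        b = reverse.split(";")[:depth]
        if len(a) == depth and a == b:
            counts[";".join(a)] += 1  # assumption: one count per agreeing pair
        else:
            discordant += 1
    return counts, discordant


blautia = "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;"
pairs = [
    (blautia + "Blautia_obeum", blautia + "Blautia_obeum"),
    (blautia + "Blautia_obeum", blautia + "Blautia_wexlerae"),
    (blautia + "Blautia_obeum",
     "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides_fragilis"),
]
for level in ("domain", "genus", "species"):
    counts, discordant = count_strict(pairs, level)
    print(level, sum(counts.values()), "pairs kept,", discordant, "dropped")
```

Running the loop shows the trade-off described above: every pair agrees at the domain level, but fewer survive as the required level gets more specific.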
Reads that do not have the same phylogeny at
level will become
Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
lederhosen otu_table \
  --files=clusters_taxonomies.strict.genus.*.txt \
  --output=my_poop_samples_genus_strict.95.txt \
  --level=genus
This will create the file my_poop_samples_genus_strict.95.txt containing the clusters as columns and the samples as rows.
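The row and column layout can be sketched in a few lines of Python. This is only an illustration of the samples-by-clusters shape, not a reproduction of the exact file Lederhosen writes: the `sample` header cell, the comma delimiter, and the counts are all assumptions.

```python
import csv
import io


def build_otu_table(sample_counts):
    """Turn {sample: {cluster: reads}} into a header plus one row per sample.

    Clusters missing from a sample get a count of 0.
    """
    clusters = sorted({c for counts in sample_counts.values() for c in counts})
    header = ["sample"] + clusters  # "sample" is a made-up header cell
    rows = [
        [sample] + [sample_counts[sample].get(c, 0) for c in clusters]
        for sample in sorted(sample_counts)
    ]
    return header, rows


# Made-up genus-level counts for two samples.
sample_counts = {
    "sample_a": {"Bacteroides": 120, "Blautia": 30},
    "sample_b": {"Blautia": 75, "Roseburia": 5},
}
header, rows = build_otu_table(sample_counts)

buffer = io.StringIO()
csv.writer(buffer).writerows([header] + rows)
print(buffer.getvalue())
```

Filling absent clusters with zeros keeps every row the same length, which is what most statistics packages expect from an abundance matrix.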
You now will apply advanced data mining and statistical techniques to this table to make interesting biological inferences and cure diseases.
Sometimes, clustering high-throughput reads at stringent identities can create many small clusters. In fact, these clusters represent the vast majority (>99%) of the created clusters but only a small minority (<1%) of the reads. In other words, 1% of the reads have 99% of the clusters.
If you want to filter out these small clusters, which are an inseparable mix of sequencing error and actual biodiversity, you can do so with otu_filter:
lederhosen otu_filter \
  --input=table.csv \
  --output=filtered.csv \
  --reads=50 \
  --samples=10
This will remove any clusters that do not appear in at least 10 samples with at least 50 reads. The read counts
for filtered clusters will be moved to the
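The filtering rule can be sketched in Python as follows. This is one reading of the description, not Lederhosen's code: a cluster is kept if it has at least `min_reads` reads in each of at least `min_samples` samples, and the reads of dropped clusters are pooled into one extra column per sample. The column name `pooled_filtered` is a placeholder chosen for this example, and the table values are made up.

```python
def filter_otus(table, min_reads, min_samples, pooled="pooled_filtered"):
    """Drop rare clusters from {sample: {cluster: reads}} and pool their reads.

    `pooled` is a hypothetical column name used only in this sketch.
    """
    clusters = {c for counts in table.values() for c in counts}
    keep = {
        c for c in clusters
        if sum(1 for counts in table.values() if counts.get(c, 0) >= min_reads)
        >= min_samples
    }
    filtered = {}
    for sample, counts in table.items():
        row = {c: n for c, n in counts.items() if c in keep}
        # Reads from dropped clusters are pooled so row totals are unchanged.
        row[pooled] = sum(n for c, n in counts.items() if c not in keep)
        filtered[sample] = row
    return filtered


# Made-up table: cluster A is abundant in two samples, B in only one.
table = {
    "s1": {"A": 120, "B": 3},
    "s2": {"A": 80, "B": 60},
    "s3": {"A": 10, "B": 0},
}
result = filter_otus(table, min_reads=50, min_samples=2)
print(result)
```

Pooling rather than discarding the reads keeps each sample's total read count intact, so relative abundances computed afterwards still sum to one.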
You can get the representative sequences for each cluster using get_reps.
This will extract the representative sequence from the database you ran usearch with.
Make sure you use the same database that you used when running usearch.
lederhosen get_reps \
  --input=clusters.uc \
  --database=taxcollector.fa \
  --output=representatives.fasta
You can get the representatives from more than one cluster file using a glob:
lederhosen get_reps \
  --input=*.uc \
  --database=taxcollector.fa \
  --output=representatives.fasta
lederhosen separate_unclassified \
  --uc-file=my_results.uc \
  --reads=reads_that_were_used_to_generate_results.fasta \
  --output=unclassified_reads.fasta
separate_unclassified has support for strict pairing:
lederhosen separate_unclassified \
  --uc-file=my_results.uc \
  --reads=reads_that_were_used_to_generate_results.fasta \
  --strict=phylum \
  --output=unclassified_reads.fasta
Please cite this GitHub repo (https://github.com/audy/lederhosen) with the version you used (type lederhosen version) unless I publish a paper. Then cite that.