Module Detail Information

Name: Estimate Library Complexity
Type: Module
Short URL:
Description: Attempts to estimate library complexity from sequence alone. It does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads whose first N bases are identical to one another for duplicates. Reads are considered duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).

Reads of poor quality are filtered out to provide a more accurate estimate. The filter removes reads with any no-calls in the first N bases, or with a mean base quality below MIN_MEAN_QUALITY across either the first or second read. The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the library-size calculation.

Finally, because there is no alignment with which to screen out technical reads, one further filter is applied: after all reads have been examined, a histogram of [#reads in duplicate set -> #of duplicate sets] is built, and all bins that contain exactly one duplicate set are removed as outliers before the library size is estimated.
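The three steps described above (the ungapped duplicate test, the singleton-bin outlier removal, and the final library-size estimate) can be sketched as follows. This is an illustrative Python sketch, not Picard's actual Java implementation; in particular, the use of a Lander-Waterman-style equation for the final estimate is an assumption about the estimator.

```python
import math

def is_duplicate(read_a: str, read_b: str, max_diff_rate: float = 0.03) -> bool:
    """Ungapped comparison: reads count as duplicates when the mismatch
    rate over the compared length is <= max_diff_rate (MAX_DIFF_RATE)."""
    n = min(len(read_a), len(read_b))
    mismatches = sum(1 for i in range(n) if read_a[i] != read_b[i])
    return mismatches / n <= max_diff_rate

def drop_singleton_bins(histogram: dict) -> dict:
    """Given [#reads in duplicate set -> #of duplicate sets], drop bins
    that contain exactly one duplicate set (treated as outliers)."""
    return {size: count for size, count in histogram.items() if count != 1}

def estimate_library_size(total_pairs: int, unique_pairs: int) -> float:
    """Assumed Lander-Waterman-style estimator: solve
    unique = X * (1 - exp(-total / X)) for library size X by bisection.
    The left side of the equation is monotone increasing in X."""
    f = lambda x: x * (1.0 - math.exp(-total_pairs / x)) - unique_pairs
    lo, hi = float(unique_pairs), 1e12
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, observing 4,323 unique pairs out of 10,000 total pairs yields an estimated library size of roughly 5,000, since a library of 5,000 molecules sampled 10,000 times is expected to show 5000 * (1 - e^-2) ≈ 4,323 distinct molecules.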
Input Parameters:
 - Jar File
 - Temp Location
 - Verbosity
 - Quiet
 - Validation Stringency
 - Compression Level
 - Max Records in RAM
 - Create Index
 - Create MD5 File
 - Input
 - Minimum Identical Bases
 - Maximum Diff Rate
 - Minimum Mean Quality
 - Read Name Regex
 - Optical Duplicate Pixel Distance
Output Parameters:
 - Output
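Outside of this module wrapper, the parameters above map onto Picard's command-line options for the EstimateLibraryComplexity tool. A sketch of such an invocation is shown below; the file paths and option values are placeholders, and exact option names and defaults depend on your Picard version.

```shell
# Hypothetical invocation of Picard EstimateLibraryComplexity;
# paths and parameter values are placeholders, not recommendations.
java -jar picard.jar EstimateLibraryComplexity \
    INPUT=sample.bam \
    OUTPUT=sample.library_complexity_metrics.txt \
    MIN_IDENTICAL_BASES=5 \
    MAX_DIFF_RATE=0.03 \
    MIN_MEAN_QUALITY=20 \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=100
```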
File size: 16.65 KB