An unusual outbreak of Escherichia coli O104:H4 was reported in 2011, centered on northern Germany but with cases throughout Europe and beyond. The causative agent was found to be a particularly aggressive form of E. coli that caused haemolytic uremic syndrome in 25% of infected patients. This outbreak provided the first opportunity for the scientific community to apply high-throughput whole genome sequencing to investigate a bacterial outbreak of unknown origin. Both during and after the outbreak, multiple research groups and public health agencies examined the whole genomes of several E. coli O104:H4 isolates collected during and prior to the outbreak. Here we present a comparative analysis of all E. coli O104:H4 genomes that are currently publicly available.
A total of 58 E. coli O104:H4 genomes, sourced from PATRIC, were compared, including four finished genomes and 54 draft or unassembled genomes. RedDog was used to map short read sequences generated from these genomes to reference genomes 55989 (GenBank NC_011748) and 2011C-3493 (GenBank NC_018658) and to plasmids pAA-EA11, pESL-EA11 and pG-EA11 (GenBank NC_018666, NC_018659 and NC_018660); to identify single nucleotide polymorphisms (SNPs); and to construct a maximum likelihood phylogenetic tree. This analysis identified three clades of O104:H4, one of which was further divided into three sub-clades. Two of these sub-clades include pre-outbreak isolates from either Georgia or Tunisia, while all outbreak isolates fall within the third sub-clade. Five post-outbreak isolates from patients in France resolved into either the ‘Georgian’ sub-clade (two isolates) or the ‘Tunisian’ sub-clade (three isolates). RedDog also generates gene content matrices, summarizing the presence and absence of genes among the input genome set. These matrices revealed several variations in the accessory genome of E. coli O104:H4, including plasmid genes involved in aggregative adherence and a Shiga-toxin gene, both important virulence factors in this strain.
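The idea of a gene content matrix can be illustrated with a minimal Python sketch. It is not RedDog's implementation: the function names, the input layout (the fraction of each gene covered by mapped reads, per isolate), the 0.95 presence threshold, and the isolate and gene labels are all assumptions made for illustration, and the numbers are invented.

```python
from typing import Dict, List

# Hypothetical input: isolate -> {gene -> fraction of gene length covered by reads}
Coverage = Dict[str, Dict[str, float]]


def gene_content_matrix(coverage: Coverage,
                        min_coverage: float = 0.95) -> Dict[str, Dict[str, int]]:
    """Call each gene present (1) or absent (0) in each isolate.

    The 0.95 threshold is an assumed cut-off, not a value taken from the text.
    Genes missing from an isolate's record are treated as having zero coverage.
    """
    genes = sorted({g for per_gene in coverage.values() for g in per_gene})
    return {
        isolate: {g: int(per_gene.get(g, 0.0) >= min_coverage) for g in genes}
        for isolate, per_gene in coverage.items()
    }


def variable_genes(matrix: Dict[str, Dict[str, int]]) -> List[str]:
    """Return genes that are present in some isolates and absent in others."""
    isolates = list(matrix)
    if not isolates:
        return []
    genes = sorted(matrix[isolates[0]])
    return [g for g in genes if len({matrix[i][g] for i in isolates}) > 1]


# Invented example data with placeholder labels.
example: Coverage = {
    "isolate_A": {"gene_1": 1.00, "gene_2": 0.99, "gene_3": 0.10},
    "isolate_B": {"gene_1": 0.98, "gene_2": 0.00, "gene_3": 0.97},
    "isolate_C": {"gene_1": 0.96, "gene_2": 0.97},
}

matrix = gene_content_matrix(example)
accessory = variable_genes(matrix)
```

In this sketch, genes returned by `variable_genes` correspond to candidate accessory-genome differences of the kind described above, while genes called present in every isolate would not be reported.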