Abstract
I worked together with the wonderful French artist Davide Balula to visualize his genome for an art exhibition.
I was responsible for preparing and processing the sequencing data and for coming up with a technique that allowed me to design and create PDF files spanning tens of thousands of pages.
Definitely check out his other works too; I personally was deeply fascinated by the mimed sculptures.
Contents
- How it all began
- The project idea
- Here come the problems
- Original hypothesis
- Breakthrough
- First printing test
- Processing and assembling sequence data
- Processing pipeline step by step
- Conclusion
How it all began
Some time ago, fellow redditor charrcheese posted the following image where they made their computer "draw" DNA sequences.
I'm a real sucker for computer generated art. Naturally, I had to play around with this to create something of my own, so I went to the comment section and asked if the source code was available.
Long story short, I sat down and wrote this page to scratch this new itch of mine. You can change colors, and play around with different methods for "drawing" the sequences.
After all of this was over, I got contacted by contemporary artist Davide Balula.
The project idea
Davide imagined printing the DNA-sequences of his chromosomes in book form, potentially accompanying them with visualizations for his next exhibition in Paris.
I knew from my own work with large amounts of data that this wouldn't be easy. Prior to Davide's message, I'd already looked up the Human Genome Project over at NCBI for a personal project of mine.
Here come the problems
If you've ever tried to open or copy/paste a file that's over a couple hundred megabytes or gigabytes, you might have noticed that sometimes this happens:
Well, figuratively speaking of course.
Naturally this is not news, and there are many programs (gedit, 010 Editor, ...) for viewing and editing huge files - but those didn't suit my needs.
In my case, I needed to be able to design the layout of the text for printing.
This is where huge files will get on your nerves. Word processors like LibreOffice, Word and Pages are simply not designed to support this much data.
They use too much RAM, load forever, eventually get unresponsive and even if you manage to open something big, try switching the font size now.
Yeah, not gonna happen.
Original hypothesis
I have worked with LaTeX before and I thought that I could easily design a page and \include my text. For those who don't know what I'm talking about, Wikipedia has your back:
LaTeX is a document preparation system. When writing, the writer uses plain text as opposed to the formatted text found in WYSIWYG ("what you see is what you get") word processors like Microsoft Word, LibreOffice Writer and Apple Pages.
In theory, this sounds exactly like something that I need. I don't have the graphical overhead of the word processors and I can generate arbitrary documents.
I did some digging and found the seqsplit and dnaseq packages, which enable me to do what I need to do.
seqsplit – Split long sequences of characters in a neutral way
When one needs to type long sequences of letters (such as in base-sequences in genes) [...]
dnaseq – Format DNA base sequences
Defines a means of specifying sequences of bases. The bases may be numbered (per line) and you may specify that subsequences be coloured. [...]
Standard procedure, I thought - Write your document, include the necessary packages, compile it - and then...
LaTeX hugs your RAM very tightly and dies miserably.
Uggghh oh well, that sucks.
I did some research and found that luatex, unlike pdflatex, allocates memory dynamically (i.e. it doesn't crash on big input).
Me: "This is awesome."
Narrator: "It really wasn't."
The processing time, even for small amounts of data, is insane. That is a big no-no.
While technically lualatex works, it figuratively takes forever. I needed a better solution.
Stackexchange had some useful comments which sounded great at first:
1. Use Enscript in conjunction with ps2pdf
Enscript
[...] converts ASCII files to PostScript [...]
ps2pdf
[...] converts PostScript files to Portable Document Format (PDF) files.
2. Use text2pdf, Pandoc or unoconv which all do similar things except that this caught my attention:
unoconv
a command line tool to convert any document format that LibreOffice can import to any document format that LibreOffice can export
LibreOffice? Yeah.
3. Use LibreOffice
Wait. Didn't I rule out this word processor?
I tried out the different solutions and they didn't work like I wanted them to, so I kept scrolling and found out that LibreOffice itself supports running without a graphical user interface!
Me: "This is awesome."
Narrator: "It was. Kinda."
First progress
LibreOffice has a very neat command line option for converting a file into a different format, in my case: PDF.
Usage is very easy:
$ libreoffice --convert-to pdf [your_file]
Here I timed the conversion of the 21st chromosome:
$ time libreoffice --convert-to pdf chromosome_21.fna
convert /tmp/chromosome_21.fna -> /tmp/chromosome_21.pdf using filter : writer_pdf_Export
libreoffice --convert-to pdf chromosome_21.fna
125.04s user
1.50s system
99% cpu
2:06.78 total
Only around 2 minutes for a 46MB file! That's what I'm talking about.
Here is the first page of the output PDF:
The output PDF is 9897 pages long. Could be worse. Let's run a longer sequence:
$ time libreoffice --convert-to pdf chromosome_01.fna
convert /tmp/chromosome_01.fna -> /tmp/chromosome_01.pdf using filter : writer_pdf_Export
libreoffice --convert-to pdf chromosome_01.fna
2328.58s user
8.64s system
99% cpu
39:01.99 total
39 minutes for a 242MB file results in 52903 pages.
From a time-cost perspective, this is a huge win.
These PDFs will cost real money though, so ~63000 pages for only 2/24 chromosomes doesn't fly. The inner margins and the font size are way too generous.
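To put that into perspective, here's a rough back-of-the-envelope extrapolation (my own estimate, not from the actual project: it assumes the pages-per-megabyte ratio of the chromosome 1 run stays constant and a whole-genome FASTA of roughly 3,100 MB):
$ awk 'BEGIN { printf "~%d pages\n", (52903 / 242) * 3100 }'
~677683 pages
Well over half a million pages at the default layout, so shrinking the margins and font size was non-negotiable.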
I need to be able to change the layout, which unfortunately the libreoffice command does not directly support.
What it does support, however, is converting .odt files, its native document format (similar to .docx for Microsoft Word).
.odt files contain, besides the content itself, layout information!
Breakthrough
Because I've been watching videos by the excellent IT security engineer Gynvael Coldwind, I had learned quite a bit about the ZIP file format, which led me to this thought process:
Create a test.odt document containing a couple of test strings ("Test" and "danielbiegler.de"):
Document formats are pretty much always just zip files, which you can extract.
$ unzip -l test.odt
Archive: test.odt
Length Date Time Name
--------- ---------- ----- ----
39 2018-08-15 14:52 mimetype
353 2018-08-15 14:52 Thumbnails/thumbnail.png
0 2018-08-15 14:52 Configurations2/accelerator/
0 2018-08-15 14:52 Configurations2/popupmenu/
0 2018-08-15 14:52 Configurations2/toolpanel/
0 2018-08-15 14:52 Configurations2/menubar/
0 2018-08-15 14:52 Configurations2/images/Bitmaps/
0 2018-08-15 14:52 Configurations2/toolbar/
0 2018-08-15 14:52 Configurations2/floater/
0 2018-08-15 14:52 Configurations2/statusbar/
0 2018-08-15 14:52 Configurations2/progressbar/
3532 2018-08-15 14:52 content.xml
971 2018-08-15 14:52 meta.xml
899 2018-08-15 14:52 manifest.rdf
10834 2018-08-15 14:52 settings.xml
11157 2018-08-15 14:52 styles.xml
978 2018-08-15 14:52 META-INF/manifest.xml
--------- -------
28763 17 files
I personally have never done this before, but content.xml sounds interesting.
Let's have a look:
<?xml version="1.0" encoding="UTF-8"?>
... many lines omitted for your viewing pleasure ...
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>
<text:p text:style-name="P1">Test</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P2">danielbiegler.de</text:p>
</office:text></office:body></office:document-content>
Aha!
There at the very bottom are our test strings. How about we change Test to does this work?
...
<text:p text:style-name="P1">does this work?</text:p><text:p text:style-name="P1"/><text:p
text:style-name="P2">danielbiegler.de</text:p></office:text></office:body></office:document-content>
Re-zip the content.xml into our test.odt:
$ zip test.odt content.xml
updating: content.xml
zip warning: Local Entry CRC does not match CD: content.xml
(deflated 75%)
And finally, try to open it:
Yes, it works.
This opens up the whole layout-power of the word processor without needing to open the large sequence data.
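In other words, the plan looks roughly like this (a minimal sketch with assumed file names; the exact commands follow later in the post):
# 1. pull the layout XML out of the (small) master document
$ unzip master.odt content.xml
# 2. swap the text inside content.xml for the sequence data
# 3. put the modified XML back into a copy of the document
$ cp master.odt out.odt
$ zip out.odt content.xml
# 4. let LibreOffice handle layout and PDF conversion headlessly
$ libreoffice --convert-to pdf out.odt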
Bottleneck
While researching all of this, I was wondering why the processing takes so unbelievably long and it turns out that line-breaks and word-wrapping play a huge role. Here are some stats for y'all:
Here you can see that in the beginning, text with line breaks (orange) has a small advantage over text without line breaks (blue).
But what happens when we increase the character count? After all, our data is much bigger than a couple hundred thousand characters:
Whoopsie daisy.
Now that is pretty much the definition of a bottleneck, oh my..
Our data is a lot larger than a million characters, which means that, to be able to compile all the PDFs in a realistic amount of time, line breaks are mandatory.
Font Kerning
In typography, kerning is the process of adjusting the spacing between characters in a proportional font, usually to achieve a visually pleasing result.
Theoretically, in our case, kerning would be useful to condense the sequence a bit more, saving paper.
However, this comes with the tradeoff that some lines in the book take up more space than others. Some lines won't reach the end of the page, whereas others will need to wrap.
Remember, wrapping is bad - real bad.
To set the line breaks properly ourselves, for higher performance, we'd need to calculate the width of each line - and at that point we'd just be doing what the word processor does anyway.
To solve this, we can simply use a monospaced font i.e. a font with fixed spacing. Example:
With this, we know beforehand how many characters can fit on a single line. This allows us to optimally set line breaks! See here:
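As a quick aside, with a monospaced font you can estimate the characters per line from the page setup alone. A quick sketch (all numbers are assumptions for illustration: an A4 page width of 595pt, 30pt margins on each side, a 4pt font and the typical Courier-style advance width of 0.6em; the real value depends on the actual font and page setup):
$ awk -v page=595 -v margin=30 -v size=4 'BEGIN { print int((page - 2*margin) / (size * 0.6)) " characters per line" }'
222 characters per line
With a number like that, fold (used later in the pipeline) can pre-break the lines so the word processor never has to wrap anything itself.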
First printing test
For now it was important to figure out how small you can make the font while still having legible text. Here is an example of an earlier test Davide made:
The answer is around 4pt for font size. As you can see there's still very generous padding. We'll fix that later.
Now comes a little rant; stay strong. You'll be rewarded with the technical details afterwards!
Processing and assembling sequence data
Not gonna lie, this step was infuriating and frustrating to say the very least.
I do not have a background in bioinformatics but I consider myself to be capable of learning new things quickly and adapting to new situations quite easily. This step involves using software, so as a developer myself I thought, how hard can that be?
Writing software taught me to navigate in muddy areas; I assess and filter information quickly to achieve what I want.
Oh my sweet summer child.. sigh
As software devs we have the luxury of huge communities where pretty much every day-to-day topic can be found quickly and concisely. In the vast majority of cases you have a problem that somebody else had already encountered and fixed. This leads to very easily searchable answers.
But let's say you begin to dabble with very specific equipment and/or very specific functionality. At some point it gets harder and harder to find useful information because the target audience shrinks drastically. It sounds a bit counterintuitive, because having something specific should enable you to find specific things, right?
The number of relevant search results is already drastically reduced, but now you also have to find the specific wording of the question or discussion, otherwise the result you need will be buried under junk. Think about the last time you needed to visit the second or third page of a Google search. For me personally it's so rare that I genuinely do not remember, at all. I only remember being thoroughly disappointed every time, because the results get extremely irrelevant really quickly.
Combine this with being absolutely new to, and ignorant of, a highly specific and complicated area of higher studies. Information starts to feel like reading Mandarin Chinese as a foreigner (me).
Just as an example for this article, I randomly clicked a question that was on Biostars (a bioinformatics forum) and an answer reads like this, and I quote:
"We need to calculate the probability of three independent events happened together. The event is the choice of the genotype at the locus (SNP). The probability is estimated by the genotype frequency. Assuming the Hardy-Weinberg equilibrium is true, the frequency of genotype is p^2 for homozygous and 2pq for heterozygous genotypes, where p and q are the allele frequencies. Therefore, the frequency of AA-CC-AG profile would be" ...
Uhhh, yeah..
People who understood that paragraph probably cringe at me quoting this, but reading text where you only understand 50% of the words is an incredibly humbling experience. You should give it a try.
I knew it wouldn't be easy to digest all the info so I took my time to research different terminology and abbreviations, step by step, trying to inch forward.
After the lab finished processing and shipped the data to us, I began investigating how to proceed with the data I had. The goal was to generate sequences, in FASTA format, for chromosomes 1 through 22, X and Y.
I'm not going to detail every method that I tried, but what's important is that after some time I figured out that what I wanted to do is called a "consensus sequence".
There's a set of utilities, called SAMtools, that you can use to work with DNA sequence data.
After trying out a myriad of things and fixing several errors along the way, I was still left with a message telling me the following:
[E::faidx_adjust_position] The sequence "chrM" was not found
Uhh, the what now? At this point I had poured many hours into this whole ordeal and got scared that maybe the whole thing simply would not work.
But again, continuous trials, errors and coffee led me, for some reason, to inspect the header information of the BAM file.
$ samtools view -H data.bam
It featured some output that I was not interested in... except that at the very end a particular directory path caught my eye:
/...
└─ DNA/
   └─ DNA_Human_WES/
      └─ DNA_Human_WES_2016b/
         └─ Database/
            └─ hg19/
               └─ fa/
                  └─ hg19.fasta
Some gears in my head started churning and it struck me like lightning. I have seen hg19 before!
I was really lucky not to have missed this detail, because in reality, when I checked out the headers, it looked like this:
I mean, just look at it.
I felt such a big relief when I accidentally spotted that.
When I first started playing with human sequences I found out that you can download two different reference assemblies from NCBI: GRCh37 and GRCh38. Right now, while writing this, I can't even recall why I knew what hg19 meant, but I specifically remembered that hg19 is kind of a "synonym" for GRCh37 and NOT GRCh38.
I had been trying to generate the consensus sequence with the GRCh38 assembly instead of the hg19 one. This was finally the breakthrough that I so desperately needed.
The last error I got was because of chrM (mitochondrial DNA), which I found out is newer notation that wasn't used when hg19 was created (plus some additional differences).
With that roadblock finally out of the way, I could create some PDFs.
Processing pipeline step by step
A little heads-up:
This part is very technical, and if you don't care about the specifics you can skip ahead.
Since I encountered quite a few problems, I tried compiling bcftools myself to get the newest possible version. You can get the source very conveniently from GitHub.
After compiling, run the binary and check the version via
$ ./bcftools --version
bcftools 1.9-210-gf1f261b
Using htslib 1.9-279-g49058f4
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Alright seems to work.
In order to create the consensus sequence, we first need to call the variants.
This means we identify variants in the sequence data, i.e. positions where the aligned reads differ from the reference genome, and write them to a VCF file.
Here are the commands:
$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa -Oz -o calls.vcf.gz
This takes quite a while and produces the .vcf file that we need for the consensus sequence. Like I said earlier, make sure you have the proper reference!
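A small note from my side that may or may not apply to your setup: bcftools consensus generally expects the compressed VCF to be indexed. If the next step complains about a missing index, creating one first should do the trick:
$ bcftools index calls.vcf.gz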
So now we are able to construct the consensus sequence via:
$ bcftools consensus -f reference.fa calls.vcf.gz > consensus.fa
This was a big first step. Next we need to massage the data into the shape we need.
Massaging in this context means:
- Cut a chromosome out of the consensus sequence
- Remove comments and unneeded symbols
- Fold the lines to a fixed length so we avoid automatic word-wrapping (remember: performance!)
- Add some markup so we can insert the sequence data into the LibreOffice document
1. Find out where the different chromosomes start and end so that you can carve them out of the consensus sequence.
Take a look at the first three lines of the consensus sequence:
$ head -n 3 consensus.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
All the sections are preceded by a comment featuring the greater-than character.
This makes it trivial to get the starting line number of every section via grep!
$ cat consensus.fa | grep --line-number "^>" | head -n 25 > comment_lines.txt
That output looks like this:
1:>chr1
4153951:>chr2
8207074:>chr3
11507294:>chr4
# ... and so forth
This enables us to precisely cut out each individual chromosome via stream editors like sed or awk. I used sed for this one.
For example, the following means we want to print the lines from line number 1 to 4153951 (inclusive), then quit execution when we reach line 4153952. Quitting early is just a little performance trick to speed up processing of large files.
$ sed -n "1,4153951p;4153952q" consensus.fa
We just cut out the first chromosome - easy!
2. Remove comments and unneeded symbols
On our mission to reduce the output size, Davide and I decided to strip unknown DNA bases from the output. Also, since the ranges from sed are inclusive, we need to omit the comment lines. (We could have done that in the previous step, but it really doesn't matter.)
The following means we want to omit all lines starting with a greater-than character i.e. all the comment lines.
$ grep -v '^>'
We follow this up with the handy tr for deleting unneeded symbols, namely the newline, N and n characters:
$ tr -d '\nNn'
3. Fold the lines to a fixed length so we avoid automatic word-wrapping
Giving LibreOffice pre-broken lines, so it never has to wrap anything itself, is of the essence; luckily there's the fold command, which lets us specify how long each line should be!
The following means we want each line to be 123 characters wide:
$ fold -w 123
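If you want to sanity-check the folding, a quick one-liner (my own addition, assuming the folded text was written to a file such as chr1_folded.txt, a name I made up for illustration) prints the longest line length:
$ awk '{ if (length > max) max = length } END { print max }' chr1_folded.txt
It should never print more than the chosen width, 123 in this case.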
4. Add some markup so we can insert the sequence data into the LibreOffice document
We need to insert the sequence data into a LibreOffice document file, because that is how we are going to convert to PDF. Remember from earlier, this is what text looks like inside the document file:
...
<text:p text:style-name="P1">
does this work?</text:p><text:p text:style-name="P1"/>
<text:p text:style-name="P2">
danielbiegler.de</text:p></office:text></office:body></office:document-content>
So we need to prepend <text:p text:style-name="P2"> and append </text:p> to every single line of the sequence. That way LibreOffice knows which style (font size, font family, etc.) to apply to each line.
That's super easy with sed:
$ sed 's/^/<text:p text:style-name="P2">/' | sed 's/$/<\/text:p>/'
And now everything, properly in order, inside a little helper script:
#!/bin/bash
if [ $# -lt 5 ]; then
    echo "You need 5 arguments: start_line, end_line, /path/genome.fna, fold-width, /path/output"
    exit 1
fi
# $1: Start line, inclusive
# $2: End line, inclusive
# $3: Path to the DNA data in FASTA file format
# $4: Fold width to properly fit the page
# $5: Path to output content file
# cut out section | ignore comments | remove unneeded symbols | fold | mark line start | mark line end | write to file
sed -n "$1,$2p;$(($2+1))q" "$3" | grep -v '^>' | tr -d '\nNn' | fold -w "$4" | sed 's/^/<text:p text:style-name="P2">/' | sed 's/$/<\/text:p>/' > "$5"
Now it is super convenient to prepare our data for further processing; here's how to prepare the first chromosome:
$ ./prepare_lines.sh 1 4153951 consensus.fa 123 chr1_prepared.xml
Easy as pie!
Writing another helper script that prepares all of the chromosomes is highly recommended, because you'll need to regenerate them every time the style of the document changes:
#!/bin/bash
./prepare_lines.sh 1 4153951 consensus.fa 123 chr1_prepared.xml
./prepare_lines.sh 4153951 8207074 consensus.fa 123 chr2_prepared.xml
./prepare_lines.sh 8207074 11507294 consensus.fa 123 chr3_prepared.xml
# ... and so forth ...
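If you'd rather not hard-code the line numbers, here's a sketch of how the ranges could be derived automatically from the comment_lines.txt file we generated earlier (my own addition, untested against the real data; it assumes the file format shown above and that consensus.fa is in the current directory):
#!/bin/bash
# Derive start/end lines for every chromosome from comment_lines.txt
# and run prepare_lines.sh for each of them.
total_lines=$(wc -l < consensus.fa)

starts=()
names=()
while IFS=: read -r line name; do
    starts+=("$line")
    names+=("${name#>}")             # strip the leading ">" from e.g. ">chr1"
done < comment_lines.txt

for i in "${!starts[@]}"; do
    start="${starts[$i]}"
    if [ $((i + 1)) -lt ${#starts[@]} ]; then
        end="${starts[$((i + 1))]}"  # header line of the next chromosome
    else
        end="$total_lines"           # last chromosome runs to the end of the file
    fi
    ./prepare_lines.sh "$start" "$end" consensus.fa 123 "${names[$i]}_prepared.xml"
done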
After preparing all the chromosomes we are finally in a position to create some PDFs!
Remember from earlier:
- Retrieve the content.xml from our master.odt
- Insert the sequence into the content.xml file
- Rezip the content.xml into a document
- Convert the document to PDF
1. Retrieve the content.xml from our master.odt
We know that we can simply unzip the master.odt file since we already experimented with that earlier, but what I found out is that you can unzip to stdout with the -p option! That way we don't need to store a temporary file. Neat.
# unpack to stdout
$ unzip -p master.odt content.xml
2. Insert the sequence into the content.xml file
This step is a little tricky. We use regular expressions to replace every line of text inside the document with a single REPLACEME. After that we replace that REPLACEME with the file contents of our prepared text. This allowed easier, visual sanity-checks while I developed the script.
# replace text with REPLACEME on separate line
| perl -p -e 's/<text:p.*?(?=<\/o)/\nREPLACEME\n/'
# replace with file contents
| sed -e "/REPLACEME/{r chr1_prepared.xml" -e 'd}' > content.xml
3. Rezip the content.xml into a document
Now we have a content.xml file that holds our prepared sequence. Simply rezip it into a document.
Since I wanted to preserve the master.odt, I decided to make a temporary copy of it and zip the new content into that one instead.
So first make a copy
$ cp master.odt temp_doc.odt
And now zip our prepared content into that:
$ zip temp_doc.odt content.xml
4. Convert the document to PDF
At this point we have a full featured and complete document at temp_doc.odt which we can convert to pdf using
$ libreoffice --convert-to "pdf" temp_doc.odt
I put these steps into a helper script, which looks like this:
#!/bin/bash
# 1. /path/master.odt
# 2. /path/sequence_content.txt
# 3. /path/output.odt
if [ $# -lt 3 ]; then
    echo "You need 3 arguments: /path/master.odt, /path/sequence_content.txt, /path/output.odt"
    exit 1
fi
echo "Creating content.xml"
# unpack to stdout | replace text with REPLACEME on separate line | replace with file contents
unzip -p "$1" content.xml | perl -p -e 's/<text:p.*?(?=<\/o)/\nREPLACEME\n/' | sed -e "/REPLACEME/{r $2" -e 'd}' > content.xml
echo "Copying $1 to $3"
cp "$1" "$3"
echo "Adding content.xml to $3"
zip "$3" content.xml
echo "Remove content.xml"
rm content.xml
echo "Creating PDF. This can take a while"
time libreoffice --convert-to "pdf" "$3"
echo "Remove $3"
rm "$3"
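Assuming we saved the script as, say, create_pdf.sh (the name is mine, for illustration), converting the first chromosome then looks like this:
$ ./create_pdf.sh master.odt chr1_prepared.xml chr1.odt
LibreOffice names the resulting PDF after the input document, so this produces chr1.pdf, and the script cleans up the temporary chr1.odt afterwards.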
Here's how the current version looks for the Y chromosome. Note the repeating patterns in our DNA:
This post will be updated at a later time. Hopefully we can add some graphical visualisations using my site and hold the actual books in our hands! :)
Conclusion
This project taught me that in order to achieve your goal you have to persevere and improve iteratively. A good solution doesn't get good on the first try; step-by-step fine tuning is key. Getting from ~53,000 pages down to ~5,000 was the sum of many individual improvements.
Patience is another big thing. I wish I could show you how frustrated I felt with all the bioinformatics mess, oh boy. Trying to stay cool headed in a constantly frustrating environment is a damn life lesson if I've ever seen one.
Thanks for your time, I hope you enjoyed this. Let me know.
What do you think?