Monday, October 1, 2012

How to convert ods to csv

It was big surprise for me that there are so many options to convert proprietary XLS to CSV, but there are nothing available to convert open source ODS to CSV easily. Frankly speaking I have found some scripts, but I failed to make them to do what I need. Some of them were too old with outdated libraries. Some scripts were just macros intended to use the OpenOffice and LibreOffice cores, but for me it was not a good idea because of the speed and memory consumption. This situation really strange, maybe I overlook something. Once I read one guy complaining that ODS format is too complicated to make conversion library. He suggested to use XLS to have possibility easily export any data table to CSV. So far I had not so many data and I converted them just through LibreOffice interface manually. But that was really boring if you have to do this several times per hour. Let's find a more or less simple python solution.

 An ODS file is a ZIP archive containing several files and folders. The data are contained in a file content.xml. To unpack data from the file we will use zipfile library.
import zipfile
xml = zipfile.ZipFile(filepath).read('content.xml') #filepath is a full path to ods file

So in xml variable we have data in XML format. We will use pyquery and lxml libraries to parse XML.
from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector
d = pq(xml, parser='xml')
ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'}
selr = CSSSelector('table|table-row', namespaces=ns)
selc = CSSSelector('table|table-cell', namespaces=ns)
rowxs = pq(selr(d[0]))
data = []
for ir,rowx in enumerate(rowxs):
    cells = pq(selc(rowx))
    if cells.text():
        data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))])

Now we have to save our data in a CSV file.
import os,csv
root,ext=os.path.splitext(filepath)
with open(''.join([root,'.csv']),'wb') as f:
    for row in data:
        dw = csv.writer(f)
        dw.writerow(row)

That's it! Easy - no? Not so easy. The pyquery library does not always parse ODS xml in a good way. There is a problem: special denotation is used to show that there are several empty cells in a table or there are several cells with the same data. I had to use rather dirty hack to overcome this problem. I replace all these occurrences by multiple simple xml tags before parsing.
import re
def rep_repl(match):
    return '<table:table-cell>%s' %match.group(2) * int(match.group(1))
def repl_empt(match):
    n = int(match.group(1))
    pat = '<table:table-cell/>'
    return pat*n if (n<100) else pat
p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)')
p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>')
xml = re.sub(p_repl, rep_repl, xml)
xml = re.sub(p_empt, repl_empt, xml)
I limited maximum number of empty cells by 100. More than one hundred cells is truncated to one. I have to do this as there can be several thousands of empty useless sells at the end of the rows and empty rows with thousands empty cells. Of course, it is possible to add two more regular expressions for this cases, but I don't like such hacks and maybe I will find some other more beautiful way to get rid of regex completely. Maybe you know better solutions? You are welcome to leave your comments!

 So the whole function csv2ods to make conversion is the following:
import sys,zipfile,re,os,csv
from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector

def ods2csv(filepath):

    xml = zipfile.ZipFile(filepath).read('content.xml')

    def rep_repl(match):
        return '<table:table-cell>%s' %match.group(2) * int(match.group(1))
    def repl_empt(match):
        n = int(match.group(1))
        pat = '<table:table-cell/>'
        return pat*n if (n<100) else pat

    p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)')
    p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>')
    xml = re.sub(p_repl, rep_repl, xml)
    xml = re.sub(p_empt, repl_empt, xml)

    d = pq(xml, parser='xml')
    ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'}
    selr = CSSSelector('table|table-row', namespaces=ns)
    selc = CSSSelector('table|table-cell', namespaces=ns)
    rowxs = pq(selr(d[0]))
    data = []
    for ir,rowx in enumerate(rowxs):
        cells = pq(selc(rowx))
        if cells.text():
            data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))])

    root,ext=os.path.splitext(filepath)
    with open(''.join([root,'.csv']),'wb') as f:
        for row in data:
            dw = csv.writer(f)
            dw.writerow(row)

ods2csv(os.path.expanduser('~/foo.ods')) #example
Script is tested for the python v2.7.2.