It was big surprise for me that there are so many options to convert proprietary XLS to CSV, but there are nothing available to convert open source ODS to CSV easily. Frankly speaking I have found some scripts, but I failed to make them to do what I need. Some of them were too old with outdated libraries. Some scripts were just macros intended to use the OpenOffice and LibreOffice cores, but for me it was not a good idea because of the speed and memory consumption. This situation really strange, maybe I overlook something. Once I read one guy complaining that ODS format is too complicated to make conversion library. He suggested to use XLS to have possibility easily export any data table to CSV. So far I had not so many data and I converted them just through LibreOffice interface manually. But that was really boring if you have to do this several times per hour. Let's find a more or less simple python solution.
An ODS file is a ZIP archive containing several files and folders. The data are contained in a file content.xml. To unpack data from the file we will use zipfile library.
So in xml variable we have data in XML format. We will use pyquery and lxml libraries to parse XML.
Now we have to save our data in a CSV file.
That's it! Easy - no? Not so easy. The pyquery library does not always parse ODS xml in a good way. There is a problem: special denotation is used to show that there are several empty cells in a table or there are several cells with the same data. I had to use rather dirty hack to overcome this problem. I replace all these occurrences by multiple simple xml tags before parsing.
So the whole function csv2ods to make conversion is the following:
An ODS file is a ZIP archive containing several files and folders. The data are contained in a file content.xml. To unpack data from the file we will use zipfile library.
import zipfile xml = zipfile.ZipFile(filepath).read('content.xml') #filepath is a full path to ods file
So in xml variable we have data in XML format. We will use pyquery and lxml libraries to parse XML.
from pyquery import PyQuery as pq from lxml.cssselect import CSSSelector d = pq(xml, parser='xml') ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'} selr = CSSSelector('table|table-row', namespaces=ns) selc = CSSSelector('table|table-cell', namespaces=ns) rowxs = pq(selr(d[0])) data = [] for ir,rowx in enumerate(rowxs): cells = pq(selc(rowx)) if cells.text(): data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))])
Now we have to save our data in a CSV file.
import os,csv root,ext=os.path.splitext(filepath) with open(''.join([root,'.csv']),'wb') as f: for row in data: dw = csv.writer(f) dw.writerow(row)
That's it! Easy - no? Not so easy. The pyquery library does not always parse ODS xml in a good way. There is a problem: special denotation is used to show that there are several empty cells in a table or there are several cells with the same data. I had to use rather dirty hack to overcome this problem. I replace all these occurrences by multiple simple xml tags before parsing.
import re def rep_repl(match): return '<table:table-cell>%s' %match.group(2) * int(match.group(1)) def repl_empt(match): n = int(match.group(1)) pat = '<table:table-cell/>' return pat*n if (n<100) else pat p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)') p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>') xml = re.sub(p_repl, rep_repl, xml) xml = re.sub(p_empt, repl_empt, xml)I limited maximum number of empty cells by 100. More than one hundred cells is truncated to one. I have to do this as there can be several thousands of empty useless sells at the end of the rows and empty rows with thousands empty cells. Of course, it is possible to add two more regular expressions for this cases, but I don't like such hacks and maybe I will find some other more beautiful way to get rid of regex completely. Maybe you know better solutions? You are welcome to leave your comments!
So the whole function csv2ods to make conversion is the following:
import sys,zipfile,re,os,csv from pyquery import PyQuery as pq from lxml.cssselect import CSSSelector def ods2csv(filepath): xml = zipfile.ZipFile(filepath).read('content.xml') def rep_repl(match): return '<table:table-cell>%s' %match.group(2) * int(match.group(1)) def repl_empt(match): n = int(match.group(1)) pat = '<table:table-cell/>' return pat*n if (n<100) else pat p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)') p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>') xml = re.sub(p_repl, rep_repl, xml) xml = re.sub(p_empt, repl_empt, xml) d = pq(xml, parser='xml') ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'} selr = CSSSelector('table|table-row', namespaces=ns) selc = CSSSelector('table|table-cell', namespaces=ns) rowxs = pq(selr(d[0])) data = [] for ir,rowx in enumerate(rowxs): cells = pq(selc(rowx)) if cells.text(): data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))]) root,ext=os.path.splitext(filepath) with open(''.join([root,'.csv']),'wb') as f: for row in data: dw = csv.writer(f) dw.writerow(row) ods2csv(os.path.expanduser('~/foo.ods')) #exampleScript is tested for the python v2.7.2.