Text files with Python

By einar

October 23, 2006

Finally I cleaned up my code enough to post it here. It’s probably still ugly, but not as ugly as when I first wrote it. It’s all about manipulating text files or, to be precise, tab-delimited files. All the snippets are published under the GNU GPL v2 (not that I think anyone would use them, but just in case…).

I started with one file listing Entrez Gene identifiers for a number of genes coming out of a statistical analysis, along with their SAM scores, which show whether they were differentially expressed or not. I needed to add information such as gene name and symbol, Gene Ontology, and Uniprot/SwissProt IDs, leaving out the statistical parameters. At first I tried to use the EUtils package of Biopython, but due to both my lack of skill and the total lack of documentation, I dropped the idea and moved to a different plan.

First, I used DAVID to obtain all the annotation data I needed. There are some columns that are redundant, so I decided to remove them as well. Once I had the original files (with the SAM data) and the DAVID results, I could start:

import sys
import csv
import re
import tempfile

I used csv to handle delimited files easily (tab-delimited, in my case), tempfile to handle temporary files securely, sys to get command line arguments and re to do some regular expression matching and substitution. Basically, I had a series of functions. The first one reads the SAM data and encodes them (1, up-regulation; 0, not differentially expressed; -1, down-regulation), creating a dictionary keyed by gene identifier:

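def getSAMFlag(file):
    # "file" is an iterable of rows, e.g. a csv.reader
    sam_dict = {}
    for row in file:
        # Skip the header line
        if row[0] == "locuslink":
            continue
        # Column 22 holds the SAM score; encode its sign as a flag
        if row[22] != "NA":
            if float(row[22]) > 0:
                row[22] = "1"
            elif float(row[22]) < 0:
                row[22] = "-1"
            else:
                row[22] = "0"
        # Store the flag (or the untouched "NA") under the gene ID
        sam_dict[row[0]] = row[22]
    return sam_dict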
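To make the encoding concrete, here is a minimal sketch of the same idea run on a few made-up rows. The names encode_sam_score and build_sam_dict are hypothetical helpers, not part of the original script, and the column positions are passed as parameters so that the toy rows do not need 23 columns. It also assumes that rows with an "NA" score are kept in the dictionary as "NA", which is one reading of the original indentation.

```python
def encode_sam_score(value):
    """Map a raw SAM score string to "1", "-1" or "0"; leave "NA" untouched."""
    if value == "NA":
        return value
    score = float(value)
    if score > 0:
        return "1"
    if score < 0:
        return "-1"
    return "0"


def build_sam_dict(rows, id_col=0, score_col=22, header="locuslink"):
    """Return {gene_id: flag} for every non-header row."""
    flags = {}
    for row in rows:
        if row[id_col] == header:
            continue
        flags[row[id_col]] = encode_sam_score(row[score_col])
    return flags


# Made-up rows with only two columns, so the score sits in column 1.
rows = [
    ["locuslink", "score"],
    ["1017", "2.31"],
    ["7157", "-1.8"],
    ["672", "0"],
    ["4609", "NA"],
]
sam_flags = build_sam_dict(rows, score_col=1)
```

Splitting the sign test from the loop makes the encoding easy to check on its own, independently of the file layout.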

I also had to prepare the file coming out of DAVID, stripping the useless fields. As csv.reader gives an iterator that yields one row per iteration, it turned out to be quite easy:

def displayColumns(file, dest, cards=0):
    file_csv = csv.reader(file, dialect="ncbi")
    dest_csv = csv.writer(dest, dialect="ncbi")
    for row in file_csv:
        if cards == 1:
            # Header row: reorder the columns but do not build a link
            if row[4] == "GENE_SYMBOL":
                row = row[0:2] + row[4:8] + row[3:4]
                dest_csv.writerow(row)
                continue
            geneCardsURL = "< URL removed >"
            preURL = '<a href="'
            postURL = '">'
            endURL = '</a>'
            row[4] = re.sub(", ", "", row[4])  # Remove the comma and the space from the end of the column
            row[4] = preURL + geneCardsURL + row[4] + postURL + row[4] + endURL
        # Mark missing annotations explicitly
        if row[3] == "":
            row[3] = "N/A"
        # Keep columns 0-1 and 4-7, then move column 3 to the end
        row = row[0:2] + row[4:8] + row[3:4]
        dest_csv.writerow(row)
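The two row operations in displayColumns, reordering columns by slicing and wrapping the gene symbol in a link, can be tried in isolation. The sketch below is only an illustration: link_symbol and reorder are hypothetical names, BASE_URL is a placeholder and not the GeneCards URL that was removed from the post, and the pattern ", $" is a slightly stricter variant that only strips a trailing comma and space, whereas the original re.sub call removes every ", " in the field.

```python
import re

# Placeholder address for illustration only, NOT the real GeneCards URL.
BASE_URL = "https://example.org/lookup?symbol="


def link_symbol(symbol, base_url=BASE_URL):
    """Strip a trailing ", " and wrap the symbol in an HTML anchor."""
    clean = re.sub(r", $", "", symbol)
    return '<a href="%s%s">%s</a>' % (base_url, clean, clean)


def reorder(row):
    """Keep columns 0-1 and 4-7, then move column 3 to the end."""
    return row[0:2] + row[4:8] + row[3:4]


# Dummy row whose values are just the column positions.
row = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
reordered = reorder(row)
link = link_symbol("CDK2, ")
```

Using row[3:4] rather than row[3] keeps the result a list, so the three slices can be concatenated with +.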

The optional “cards” parameter creates an HTML link to the GeneCards database so that gene symbols can be looked up there. I removed the URL just for formatting purposes (and if the snippet shows “a xhref”, that is the blog software mangling “a href”), but it’s easy to find by querying the site by gene symbol. This code creates a new table with Entrez Gene ID, Gene Name, Gene Symbol, chromosome and cytoband, and also the GO Cellular Component level 3 (adding “N/A” if there is no annotation). The re.sub call removes the comma followed by a space that is present at the end of the Gene Symbol annotation.

Once I had all of this, I wrote a function to write the SAM results into this new table:

def writeSAM(file, data_file, dest):
    file_csv = csv.reader(file, dialect="ncbi")
    dest_csv = csv.writer(dest, dialect="ncbi")
    data_file_csv = csv.reader(data_file, dialect="ncbi")
    # Map each gene ID to its encoded SAM flag
    sam_dict = getSAMFlag(data_file_csv)
    for row in file_csv:
        # Header row: add a title for the new column
        if row[0] == "ENTREZ_GENE_ID":
            row.append("Flag SAM")
        # Membership test on the dictionary itself, no separate key list needed
        if row[0] in sam_dict:
            row.append(sam_dict[row[0]])
        dest_csv.writerow(row)
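The merge step can be checked without touching real files by feeding it in-memory text. This is a sketch under a few assumptions: it targets Python 3, append_flags is a hypothetical generator rather than the original function, the sample rows are invented, and the tab-delimited settings are passed directly to csv.reader and csv.writer instead of going through a registered dialect.

```python
import csv
import io


def append_flags(rows, flags, id_header="ENTREZ_GENE_ID", flag_header="Flag SAM"):
    """Yield each row with the column title or the matching flag appended."""
    for row in rows:
        if row[0] == id_header:
            yield row + [flag_header]
        elif row[0] in flags:
            yield row + [flags[row[0]]]
        else:
            # No SAM result for this gene: the row is written unchanged.
            yield row


# Invented annotation table and flags, for illustration only.
annotated = io.StringIO("ENTREZ_GENE_ID\tGENE_NAME\n1017\tcdk2\n9999\tunknown\n")
flags = {"1017": "1"}

reader = csv.reader(annotated, delimiter="\t", quoting=csv.QUOTE_NONE)
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerows(append_flags(reader, flags))
result = out.getvalue()
```

Note that a gene without a SAM result ends up with one column fewer than the others, which is also what the original loop does.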

The check for ENTREZ_GENE_ID is used to add a header (“Flag SAM”) to the new column. There is not much to say about the main program, except to point out how easy it is to create temporary files:

temp = tempfile.NamedTemporaryFile()
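As a rough illustration of how such a temporary file can sit between two processing steps, the sketch below writes one row and reads it back. It assumes Python 3, where the file has to be opened in text mode (mode="w+", newline="") for the csv module; the row content is made up.

```python
import csv
import tempfile

# The file is deleted automatically when the "with" block ends.
with tempfile.NamedTemporaryFile(mode="w+", newline="") as temp:
    writer = csv.writer(temp, delimiter="\t", lineterminator="\n")
    writer.writerow(["1017", "CDK2"])
    # Rewind before handing the same file to the next step.
    temp.seek(0)
    rows = list(csv.reader(temp, delimiter="\t"))
```

Rewinding with seek(0) is the step that is easiest to forget: without it the reader starts at the end of the file and sees no rows.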

And last but not least, the class definition of the dialect “ncbi” that I used to parse the text:

class ncbi:
    # Tab-delimited text, one record per line, no quoting
    delimiter = '\t'
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_NONE

The dialect is made available by name through the csv.register_dialect function, after instantiating the class:

ncbi = ncbi()
# register_dialect returns None, so there is nothing useful to assign
csv.register_dialect("ncbi", ncbi)
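For comparison, here is a sketch of the same idea written as a subclass of csv.Dialect and registered as a class, assuming Python 3. The names NcbiDialect and "ncbi_sketch" are hypothetical and chosen so as not to clash with the dialect above. Setting quotechar to None is a choice made for this sketch, not something taken from the original script: with QUOTE_NONE it lets literal double quotes pass through untouched in both directions.

```python
import csv
import io


class NcbiDialect(csv.Dialect):
    """Tab-delimited text with no quoting or escaping."""
    delimiter = "\t"
    quotechar = None
    escapechar = None
    doublequote = False
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_NONE


csv.register_dialect("ncbi_sketch", NcbiDialect)

# Round trip: write one row, then parse it again with the same dialect.
buf = io.StringIO()
csv.writer(buf, dialect="ncbi_sketch").writerow(["1017", 'say "hi"'])
text = buf.getvalue()
parsed = next(csv.reader(io.StringIO(text), dialect="ncbi_sketch"))
```

Once registered, the dialect can be selected by name in any csv.reader or csv.writer call, which is how the functions above use "ncbi".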

My programming style is probably bad, and I should point out that the code above is not in the order in which it appears in the script (obviously). In any case, if you have suggestions for improvement, let me know.
