Replace letters in column to generate new column

Issue

I have a tab-delimited file that contains two columns (ref and alt). I want to create new columns by replacing the letter in the ref column with the letter from the alt column. However, I don't want any replacement for rows where alt is empty, or where alt is a string longer than one letter, such as TTGA.

The following is my input file:

ref alt
T   C
C   
T   A,C
G   TTGA
C   

Expected output

ref alt         
T   C   C   T   T
C       C   C   C
T   A,C T   A   C
G   TTGA    G   G   G
C       C   C   C

Explanation of the output:

  1. In the ref column there is a T (first column, second row), and the adjacent alt column contains C (second column, second row). So I paste the ref column as a new column (see the 3rd column) and replace T with C from the alt column.

  2. There is a C in the first column, third row, and the adjacent alt column is empty. There is nothing to replace, so the ref value is kept as it is in the new columns.

  3. There is a T in the ref column (first column, 4th row), and the adjacent alt column contains A,C (second column, 4th row). So I paste the ref column as a new column and replace T with A (4th column), then paste the ref column again and replace T with C (5th column, 4th row).

  4. There is a G in the first column, 5th row, and the adjacent alt column contains TTGA (length is more than 1). No replacement is made, so the ref value is kept as it is in the new columns.

  5. There is a C in the first column, 6th row, but the adjacent alt column has nothing to replace with, so the ref value is kept as it is in the new columns.

Solution

An Awk solution

File ./replaceletters.awk:

#! /usr/bin/awk -f

BEGIN {
    FS = OFS = "\t"
}
# First line
NR == 1 {
    print $1,$2
    next
}
# Case: Only one column:
#  A -> A; empty; A; A; A
NF == 1 {
    print $1,"",$1,$1,$1
    next
}
# Case: Two columns, one letter in the second column:
#  A; B -> A; B; B; A; A
NF == 2 && length($2) == 1 {
    print $1,$2,$2,$1,$1
    next
}
# Case: Two columns, two comma-separated letters in the second column:
#  A; B,C -> A; B,C; A; B; C
NF == 2 && $2 ~ /^.,.$/ {
    C1 = C2 = $2
    gsub(/,.*$/, "", C1)
    gsub(/^.*,/, "", C2)
    print $1,$2,$1,C1,C2
    next
}
# Case: Other cases with two columns
#  A; X -> A; X; A; A; A
NF == 2 {
    print $1,$2,$1,$1,$1
    next
}

Make the script executable:

chmod 755 ./replaceletters.awk
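
The script sets FS to a tab, so the input must contain real tab characters rather than spaces. A minimal sketch for creating the sample input, assuming the file name input01.txt and assuming that rows with an empty alt still end with a tab (the script also copes with rows that have no trailing tab, through its NF == 1 rule):

```shell
# Write the sample input with literal tabs between ref and alt.
# Assumption: rows with an empty alt keep their trailing tab.
printf 'ref\talt\n'  >  input01.txt
printf 'T\tC\n'      >> input01.txt
printf 'C\t\n'       >> input01.txt
printf 'T\tA,C\n'    >> input01.txt
printf 'G\tTTGA\n'   >> input01.txt
printf 'C\t\n'       >> input01.txt
```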

Run it like this:

./replaceletters.awk input01.txt

Output:

ref alt
T   C   C   T   T
C       C   C   C
T   A,C T   A   C
G   TTGA    G   G   G
C       C   C   C
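
The same rules can also be written more compactly with awk's split() function instead of separate patterns. The following is only a sketch, not the answer's script: the function name expand_alt is hypothetical, and unlike the script above it also prints a fallback row for lines with more than two fields. On the sample input it should give the same rows as shown above.

```shell
# Hypothetical helper name; wraps a compact awk program.
expand_alt() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # Header line: copied unchanged
    NR == 1 { print $1, $2; next }
    {
        # Split alt on commas; n is 0 when alt is empty
        n = split($2, a, ",")
        if (n == 1 && length(a[1]) == 1) {
            # One letter: A; B -> A; B; B; A; A
            print $1, $2, a[1], $1, $1
        } else if (n == 2 && length(a[1]) == 1 && length(a[2]) == 1) {
            # Two letters: A; B,C -> A; B,C; A; B; C
            print $1, $2, $1, a[1], a[2]
        } else {
            # Empty or longer alt: no replacement
            print $1, $2, $1, $1, $1
        }
    }' "$@"
}

# Usage on the sample data, read from standard input
printf 'ref\talt\nT\tC\nC\t\nT\tA,C\nG\tTTGA\nC\t\n' | expand_alt
```

With a file argument it can be called as expand_alt input01.txt.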

Answered By – Arnaud Valmary

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
