Awk column value in file 1 is in the range of two columns in file 2

Issue

I modified the question based on the comments.

I would like to match two files: if $4 in file 1 is in the range of $3 and $4 in file 2, I would like to print file 1 with $6 in file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on $4 of file 1).

File 1:

1   rs537182016 0   8674590 A   C
1   rs575272151 0   69244805    G   C
1   rs544419019 0   69244469    G   C
1   rs354682    0   1268900 G   C

File 2:

18  16  8674587 8784575 +   ABAT
10349   17  69148007    69244815    -   ABCA10
23461   17  69244435    69327182    -   ABCA5

Output:

1   rs537182016 0   8674590 A   C ABAT
1   rs575272151 0   69244805    G   C ABCA10
1   rs544419019 0   69244469    G   C ABCA10
1   rs354682    0   1268900 G   C NA

I tried the following based on previous answers, but it did not work. The output is an empty file.

awk 'FNR == NR { val[$1] = $4 }
     FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
                     print $1, val[$1], $6
               }' file1 file2 > file3

Solution

Assumptions:

  • in the case of multiple matches OP has stated we only use the 'first' match; OP hasn’t (yet) defined 'first' so I’m going to assume it means the order in which lines appear in file2 (aka the line number)

One awk idea:

awk '
FNR==NR { min[++n]=$3                                # save 1st file values in arrays; use line number as index
          max[n]=$4
          col6[n]=$6
          next
        }
        { for (i=1;i<=n;i++)                         # loop through 1st file entries
              if (min[i] <= $4 && $4 <= max[i]) {    # if we find a range match then ...
                 print $0, col6[i]                   # print current line plus column #6 from 1st file and then ...
                 next                                # skip to next line of input; this keeps us from matching on additional entries from 1st file
              }
          print $0, "NA"                             # if we got here there was no match so print current line plus "NA"
        }
' file2 file1

NOTE: make note of the order of the input files; the first answer (below) was based on an input of file1 file2; this answer requires the order of the input files to be flipped, ie, file2 file1

This generates:

1   rs537182016 0   8674590 A   C ABAT
1   rs575272151 0   69244805    G   C ABCA10
1   rs544419019 0   69244469    G   C ABCA10
1   rs354682    0   1268900 G   C NA



NOTE: following is based on OP’s original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid …

Assumptions:

  • in file1 both rs575272151 / 69244805 and rs544419019 / 69244469 match 2 different (overlapping) ranges from file2 but OP has only showed one set of matches in the expected output; from this I’m going to assume …
  • once a match is found for an entry from file1, remove said entry from any additional matching; this will eliminate multiple matches for file1 entries
  • once a match is found for a line from file2 then stop looking for matches for that line (ie, go to the next intput line from file2); this will eliminate multiple-matches for file2
  • OP has not provided any details on how to determine which mulit-match to keep so we’ll use the first match we find

One awk idea:

awk '
FNR==NR { val[$2]=$4; next }
        { for (i in val)                            # loop through list of entries from 1st file ...
              if ($3 <= val[i] && val[i] <= $4) {   # looking for a range match and if found ... 
                 print $0,i                         # print current line plus 2nd field from 1st file and then ...
                 delete val[i]                      # remove 1st file entry from further matches and ... 
                 next                               # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
              }
        }
' file1 file2

This generates:

18  16  8674587 8784575 +   ABAT rs537182016
10349   17  69148007    69244815    -   ABCA10 rs575272151
23461   17  69244435    69327182    -   ABCA5 rs544419019

NOTES:

  • the for (i in val) construct is not guaranteed to process the array entries in a consistent manner; net result is that in the instance where there are multiple matches we simply match on the ‘first’ array entry provided by awk; if this ‘random’ nature of the for (i in val) is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
  • for this particular case we actually generate the same output as expected by OP, but the assignments of rs575272151 and rs544419019 could just as easily be reversed (due to the nature of the for (i in val) construct)

Answered By – markp-fuso

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published