# Awk column value in file 1 is in the range of two columns in file 2

## Issue

I modified the question based on the comments.

I would like to match two files: if \$4 in file 1 is in the range of \$3 and \$4 in file 2, I would like to print file 1 with \$6 in file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on \$4 of file 1).

File 1:

``````1   rs537182016 0   8674590 A   C
1   rs575272151 0   69244805    G   C
1   rs544419019 0   69244469    G   C
1   rs354682    0   1268900 G   C
``````

File 2:

``````18  16  8674587 8784575 +   ABAT
10349   17  69148007    69244815    -   ABCA10
23461   17  69244435    69327182    -   ABCA5
``````

Output:

``````1   rs537182016 0   8674590 A   C ABAT
1   rs575272151 0   69244805    G   C ABCA10
1   rs544419019 0   69244469    G   C ABCA10
1   rs354682    0   1268900 G   C NA
``````

I tried the following based on previous answers, but it did not work. The output is an empty file.

``````awk 'FNR == NR { val[\$1] = \$4 }
FNR != NR { if (\$1 in val && val[\$1] >= \$3 && val[\$1] <= \$4)
print \$1, val[\$1], \$6
}' file1 file2 > file3
``````

## Solution

Assumptions:

• in the case of multiple matches OP has stated we only use the `'first'` match; OP hasn’t (yet) defined `'first'` so I’m going to assume it means the order in which lines appear in `file2` (aka the line number)

One `awk` idea:

``````awk '
FNR==NR { min[++n]=\$3                                # save 1st file values in arrays; use line number as index
max[n]=\$4
col6[n]=\$6
next
}
{ for (i=1;i<=n;i++)                         # loop through 1st file entries
if (min[i] <= \$4 && \$4 <= max[i]) {    # if we find a range match then ...
print \$0, col6[i]                   # print current line plus column #6 from 1st file and then ...
next                                # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print \$0, "NA"                             # if we got here there was no match so print current line plus "NA"
}
' file2 file1
``````

NOTE: make note of the order of the input files; the first answer (below) was based on an input of `file1 file2`; this answer requires the order of the input files to be flipped, ie, `file2 file1`

This generates:

``````1   rs537182016 0   8674590 A   C ABAT
1   rs575272151 0   69244805    G   C ABCA10
1   rs544419019 0   69244469    G   C ABCA10
1   rs354682    0   1268900 G   C NA
``````

NOTE: following is based on OP’s original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid …

Assumptions:

• in `file1` both `rs575272151 / 69244805` and `rs544419019 / 69244469` match 2 different (overlapping) ranges from `file2` but OP has only showed one set of matches in the expected output; from this I’m going to assume …
• once a match is found for an entry from `file1`, remove said entry from any additional matching; this will eliminate multiple matches for `file1` entries
• once a match is found for a line from `file2` then stop looking for matches for that line (ie, go to the next intput line from `file2`); this will eliminate multiple-matches for `file2`
• OP has not provided any details on how to determine which mulit-match to keep so we’ll use the first match we find

One `awk` idea:

``````awk '
FNR==NR { val[\$2]=\$4; next }
{ for (i in val)                            # loop through list of entries from 1st file ...
if (\$3 <= val[i] && val[i] <= \$4) {   # looking for a range match and if found ...
print \$0,i                         # print current line plus 2nd field from 1st file and then ...
delete val[i]                      # remove 1st file entry from further matches and ...
next                               # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
``````

This generates:

``````18  16  8674587 8784575 +   ABAT rs537182016
10349   17  69148007    69244815    -   ABCA10 rs575272151
23461   17  69244435    69327182    -   ABCA5 rs544419019
``````

NOTES:

• the `for (i in val)` construct is not guaranteed to process the array entries in a consistent manner; net result is that in the instance where there are multiple matches we simply match on the ‘first’ array entry provided by `awk`; if this ‘random’ nature of the `for (i in val)` is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
• for this particular case we actually generate the same output as expected by OP, but the assignments of `rs575272151` and `rs544419019` could just as easily be reversed (due to the nature of the `for (i in val)` construct)