Issue
I have a text file with a little over 10,000 lines. It contains a list of programming languages whose names I need to extract. It looks like this:
P+ – "Experience with Remote Procedure Calls in a Real-Time Control System", B. Carpenter et al, Soft Prac & Exp 14(9):901-907 (Sep 1984).
P4 – Rusty Lusk [email protected] A macro/subroutine package
for parallel programming, using monitors on shared memory machines,
message passing on distributed memory machines. Implemented as a
subroutine library for C and Fortran. An enhancement of the "Argonne
macros", PARMACS. ftp://info.mcs.anl.gov/pub/p4t1.2.tar.Z info:
[email protected]
PABC – Intermediate language recognized by the Parallel ABC machine,
used in the implementation of Concurrent Clean. "The PABC Simulator",
E.G.J.M.H. NM-^Zecker, TR 89-19, U Nijmegen 1989.
I only need to extract the name of each language and nothing else. Since some names have more than one word, I tried using the "-" as a separator, but I can't work out how to do this properly.
First I tried:
awk '{ print $1 }' RS="\n\n" ORS= language.TXT
or awk '{ print $1 }' RS= ORS="\n\n" language.TXT
But the only output is the very first word of the file:
The
I also did:
$ awk -F "-" '{ print $1 }' language.TXT
This does give me each name, but since it also processes every line of the description, it outputs something like this (compare with the example above):
+ System", B. Carpenter et al, Soft Prac & Exp 14(9):901 P4 parallel programming, using monitors on shared memory machines, message passing on distributed memory machines. Implemented as a subroutine library for C and Fortran. An enhancement of the "Argonne macros", PARMACS. ftp://info.mcs.anl.gov/pub/p4t1.2.tar.Z info: [email protected] PABC in the implementation of Concurrent Clean. "The PABC Simulator", E.G.J.M.H. NM
What would be the proper way to do this using awk's "paragraph mode"?
As a note, I am using gawk.
Solution
Using any awk in any shell on every Unix box, this is how to use awk's paragraph mode:
$ awk -v RS= '{print $1}' file
P+
P4
PABC
The above assumes that none of the strings you want to output can contain blanks, since you don't include any of those in your sample input. If those strings can contain spaces, then this might be what you need instead, provided they can't contain <blank>-:
$ awk -v RS= -F'(^|\n) *| +-' '{print $2}' file
P+
P4
PABC
If they can contain <blank>- then you need to tell us how to recognize them in the input.
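As an alternative sketch for the multi-word case, the name can be cut at the first " - " of each paragraph with index() and substr() instead of a regular-expression field separator. This assumes that entries are separated by blank lines, that the separator is an ASCII hyphen with one space on each side, and that no name contains " - " itself. The function name extract_names and the sample entries are made up for illustration.

```shell
# Hypothetical helper: print the text before the first " - " of every
# blank-line-separated paragraph. Reads the named files, or stdin if none.
extract_names() {
    awk -v RS= '{
        sub(/^[ \t]+/, "")            # drop any indentation before the name
        i = index($0, " - ")          # position of the first " - " in the paragraph
        if (i > 0) print substr($0, 1, i - 1)
    }' "$@"
}

# Made-up sample input with one multi-word name.
printf 'P+ - "Experience with Remote Procedure Calls"\n\nP4 - A macro package\nfor parallel programming.\n\nConcurrent Clean - A lazy language.\n' |
    extract_names
```

On the real file this would be run as extract_names language.TXT. Paragraphs that contain no " - " at all are skipped silently rather than printed.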
Answered By – Ed Morton
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0