How to extract the first segment of each paragraph from a txt file using awk?

Issue

I have a text file of a little over 10,000 lines. It contains a list of programming languages that I need to extract, in this format:

P+ – "Experience with Remote Procedure Calls in a Real-Time Control System", B. Carpenter et al, Soft Prac & Exp 14(9):901-907 (Sep 1984).

P4 – Rusty Lusk lusk@anta.mcs.anl.gov. A macro/subroutine package
for parallel programming, using monitors on shared memory machines,
message passing on distributed memory machines. Implemented as a
subroutine library for C and Fortran. An enhancement of the "Argonne
macros", PARMACS. ftp://info.mcs.anl.gov/pub/p4t1.2.tar.Z info:
p4@mcs.anl.gov

PABC – Intermediate language recognized by the Parallel ABC machine,
used in the implementation of Concurrent Clean. "The PABC Simulator",
E.G.J.M.H. NM-^Zecker, TR 89-19, U Nijmegen 1989.

I only need to extract the name of each language and nothing else. Some names have more than one word, so I tried using "-" as the separator, but I can't work out how to do this properly.
First I tried:

awk '{ print $1 }' RS="\n\n" ORS= language.TXT

or:

awk '{ print $1 }' RS= ORS="\n\n" language.TXT

But the only output is the very first word of the file:

The

I also did:

 $ awk -F "-" '{ print $1 }' language.TXT

This does give me each name, but since it also processes every line of the description, the output looks something like this (compare with the example above):

+ 
System", B. Carpenter et al, Soft Prac & Exp 14(9):901

P4 
parallel programming, using monitors on shared memory machines, message
passing on distributed memory machines.  Implemented as a subroutine
library for C and Fortran.  An enhancement of the "Argonne macros",
PARMACS.
ftp://info.mcs.anl.gov/pub/p4t1.2.tar.Z
info: p4@mcs.anl.gov

PABC 
in the implementation of Concurrent Clean.  "The PABC Simulator",
E.G.J.M.H. NM

What would be the proper way to do this using awk's "paragraph mode"?

Note: I am using gawk.

Solution

Using any awk in any shell on every Unix box, this is how to use awk's paragraph mode:

$ awk -v RS= '{print $1}' file
P+
P4
PABC

The above assumes that none of the strings you want to output can contain blanks, since your sample input doesn't include any. If those strings can contain spaces but can't contain <blank>-, then this might be what you need instead:

$ awk -v RS= -F'(^|\n) *| +-' '{print $2}' file
P+
P4
PABC

If they can contain <blank>- then you need to tell us how to recognize them in the input.
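
If the separator is reliably <blank>-<blank>, another option is to stay in paragraph mode and cut each record at the first occurrence of that string with index() and substr(). This is a different technique from the two commands above, shown only as a sketch: the sample data is made up ("Foo Bar" is a hypothetical multi-word name) and it assumes the separator is an ASCII hyphen rather than a typographic dash.

```shell
#!/bin/sh
# Hypothetical sample data: "Foo Bar" is a made-up multi-word name, and the
# separator is assumed to be an ASCII hyphen with a blank on each side.
sample() {
cat <<'EOF'
P+ - "Experience with Remote Procedure Calls in a Real-Time Control System"

Foo Bar - A made-up entry whose description
wraps onto a second line.

PABC - Intermediate language recognized by the Parallel ABC machine,
used in the implementation of Concurrent Clean.
EOF
}

# Paragraph mode with default field splitting: only the first word survives.
first_words=$(sample | awk -v RS= '{ print $1 }')

# Paragraph mode, cutting each record at the first " - ".
# Falls back to the first word when a record has no " - " at all.
full_names=$(sample | awk -v RS= '{
    i = index($0, " - ")
    print (i ? substr($0, 1, i - 1) : $1)
}')

printf '%s\n' "$first_words"
printf '%s\n' "$full_names"
```

The first command prints P+, Foo and PABC, so the multi-word name is truncated. The second prints P+, Foo Bar and PABC. If a name itself can contain <blank>-<blank>, this sketch cuts it short in the same way, so the caveat above still applies.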

Answered By – Ed Morton

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
