Email html to csv file

Issue

I have one email with html format and need to download it and need to make one csv semicolon field separator result to a new file.

Example of the email received:

Content-Type: text/html; charset=UTF-8
<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st= yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">DH</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">Mon</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
(UTC)</th></tr><tr><th style=3D"padding: 8px;">Arael Amarel</th><th style=
=3D"padding: 8px;">30549214</th><th style=3D"padding: 8px;">099981496</th><=
th style=3D"padding: 8px;">43</th><th style=3D"padding: 8px;">-</th><th sty=
le=3D"padding: 8px;">2019-07-11T10:06:34.311Z</th></tr><tr><th style=3D"pad=
ding: 8px;background-color: #dddddd">MATIN TARDEI</th><th style=3D"padding=
: 8px;background-color: #dddddd">45159820</th><th style=3D"padding: 8px;bac=
kground-color: #dddddd">094432451</th><th style=3D"padding: 
8px;background-=
color: #dddddd">32</th><th style=3D"padding: 8px;background-color: #dddddd"=
-</th><th style=3D"padding: 8px;background-color: #dddddd">2019-07- 
11T10:2=
8:41.198Z</th></tr>

Needed csv output:

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

If I open this mail on Client there make the table all ok, but I think it´s there a problem of format with procmail if I put in .html file this content (saved by procmail) of procmail and open it it´s make impossible to process the content if I look this content all the end of line are marked with a "=" which means a lot of problems, furtermore they are some several problems in the alignment of the table and other stuff which make it a nightmare to process the content to extract.

I had made one procmailrc with the filter to convert the html format to plain

procmailrc file:

MAILDIR=/new/mail/htmlconvert
:0
* ^Content-Type: text/html.*;
{
:0c
$MAILDIR/converted/
:0fwb
| `which html2text`
:0fwh
| `which formail` -i "Content-Type: text/plain; charset=UTF-8"
}

This is a try number 1, didn’t work the converter uses I tough html2text converter if I use html2text directly from the file originated de result is:

html2text

===============================================================================
 1px solid #dddddd;border-collapse: collapse;text-align: left;">
px;background-color: #cce6ff">NAME
px;background-color: #cce6ff">CI
= px;background-color: #cce6ff">DH
px;backgro= und-color: #cce6ff">FG
px;background-color: #c= ce6ff">Mon
px;background-color: #cce6ff">DATE= (UTC)
px;">Arael Amarel
px;">30549214
px;">099981496
<= th style=3D"padding: 8px;">43
px;">-
px;">2019-07-11T10:06:34.311Z
px;background-color: #dddddd">MATIN TARDEI
 8px;background-color: #dddddd">45159820
px;bac= kground-color: #dddddd">094432451
px;background-= color: #dddddd">32
px;background-color: #dddddd"= >-
px;background-color: #dddddd">2019-07-11T10:2= 8:41.198Z
px;">

Already tried lynx -dump -force-html to the file and the result isn’t nothing good to reach the format csv output.

html2text -nobs (file)

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

Update:
I have applied the solution of tripleee to the procmailrc, however the format of the mail is still the same of the original source, the qprint didn’t change the format with this change. However have tried to make it directly to the file and works fine.
The actual solution:

qprint -d -n <1563019338.1197_0.localhost.localdomain |
html2text -style pretty |
awk '/^-------------------------------------------------------------------------------/{p=1}p'

The – line is the separator of the body of the mail and the before content, this shows out:

-------------------------------------------------------------------------------

NAME         CI       CD   FG  HJ DATE (UTC)
Yaiaa Fereeira        52104575 097325303 20    -     2019-07-12T10:46:24.716Z
Gabtiel Aosta Sclavi   42445135 098322361 42    -     2019-07-12T11:07:36.110Z

Need now to make this content to the csv out, I thought it will be more easy to the first part but want to automate it to the procmail to do it with the mail download.

The result of procmail changing the procmailrc is the mail with the body still having the "=" as line end, but in the header have:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8 

Update
The email result source with qprint in the procrc

Return-Path: 
Delivered-To: 
Return-path: 
Envelope-to: 
Delivery-date: Sat, 13 Jul 2019 08:03:48 -0300
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Jul 2019 11:03:02 +0000 (UTC)
From: 
Mime-Version: 1.0
To: 
Message-ID: 
Subject:Fri Jul 12 2019
X-Spam-Flag: NO

<b>Fri Jul 12 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">NAME</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">CD</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">HJ</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
 (UTC)</th></tr><tr><th style=3D"padding: 8px;">Yaiaa Fereeira</th><th st=
yle=3D"padding: 8px;">52104575</th><th style=3D"padding: 8px;">097325303</t=
h><th style=3D"padding: 8px;">20</th><th style=3D"padding: 8px;">-</th><th =
style=3D"padding: 8px;">2019-07-12T10:46:24.716Z</th></tr>

I have the log in the stdin because procmail can`t write logfile as you can see in this log detail:

1 message for [email protected] at aaa.com (25330 octets).
reading message [email protected]@aaa.com:1 of 1 (25330 octets)........................procmail: Error while writing to "/info/in/log"
procmail: [20191] Mon Jul 15 08:55:34 2019
procmail: Assigning "FORMAIL=/usr/bin/formail"
procmail: Assigning "QPRINT=/usr/local/bin/qprint"
procmail: Match on "^Content-Type: text/html;"
procmail: Assigning "LASTFOLDER=converted/new/1563191734.20191_0.localhost.localdomain"
 Subject: Sun Jul 14 2019
  Folder: converted/new/1563191734.20191_0.localhost.localdomain          24985
procmail: Executing " qprint -d -n | html2text -nobs "
procmail: Executing " formail -I "Content-Type: text/html; charset=UTF-8"
procmail: Skipped "Mail"
procmail: Skipped "/"
From [email protected]  Mon Jul 15 08:55:34 2019
 Subject: Sun Jul 14 2019
  Folder: **Bounced**                                                     24985
fetchmail: MDA returned nonzero status 73
 not flushed

Solution

The sample in your post does not look like a valid email body at all. I’m guessing it’s a body part within a MIME message with Content-type: text/html (as vaguely indicated) and Content-transfer-encoding: quoted-printabe. The latter is what introduces the = escapes which you regard as problematic. Decoding them is actually fairly trivial, but how exactly to do that from Procmail depends on the overall composition of the containing message, and the utilities available to you. Unfortunately, Procmail itself has no idea about MIME structures, so you’ll have to rely on external tools.

As an aside the `which ...` commands in your recipe are completely redundant. For which to work, the utilities which you are looking for need to be in your PATH … which means Procmail can find them without which.

If something is not in Procmail’s default PATH, simply update PATH near the top of your .procmailrc file. This should also remove the need to use variables like $FORMAIL etc. Just use formail and make sure it’s available on Procmail’s PATH.

For your recipe to work, the MIME structure needs to be a single-part message. If that is indeed the case, and your html2text is otherwise correct, the only fix you need is to decode the content-transfer-encoding before piping through that. Assuming you have qprint, and with the superfluous which calls removed, that leaves

:0
* ^Content-Type: text/html.*;
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | qprint -d | html2text
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

If in fact the MIME body structure is more complex, perhaps edit your question to include the actual email source instead of your current ad-lib paraphrase.

In other words, and in some more detail, if your input message looks like

From: sender <[email protected]>
To: you <[email protected]>
Subject: HTML table
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

then the recipe above should basically work. But on the other hand, if your actual message is more like

From: sender <[email protected]>
To: you <[email protected]>
Subject: HTML table
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=0xdeadbeef

This is a multi-part MIME message.

--0xdeadbeef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

--0xdeadbeef--

then the first condition will not match (the headers don’t contain Content-type: text/html), but the actions inside the block will also need to be updated in several places because the MIME wrapping around the HTML body part needs to be unwrapped or somehow otherwise restructured. Here is a really quick and dirty attempt at solving this.

:0
* ^Content-Type: multipart/mixed
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | perl -0777 -pe 's/=([0-9A-F]{2})/ chr(oct("0x$1"))/ge; \
    s/=\n//g; \
    s%</table>.*%%s; \
    s%.*<table[^<>]*>%%s; \
    s%<tr[^<>]*><t[dh][^<>]*>%\n%g; \
    s%<t[dh][^<>]*>%;%g; \
    s%</t[rdh]>%%g; \
    s%^\n+%%;'
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

With minor adaptations, it should work for the single-part variation, too. But you should realize that the Perl script is a really rough cut, not a proper HTML parser.

The f flag causes Procmail to replace the input message with the output from the pipeline. The formail call is then necessary because the original MIME headers are no longer correct after you have replaced the original content with content of a different type and with a different encoding. If you just want to extract the CSV data into an external file instead, the latter can be skipped and the former can be simplified to just

:0
* ^Content-type: text/html
{
  :0c
  converted/
  :0b  # no w flag necessary either once we drop f
  | qprint -d | html2text >>result.csv
}

where again we assume a single-part MIME message as input. Whether to overwrite the output file instead of appending (or perhaps write to a different CSV file each time) will depend on your specific use case, and how often you expect to receive these messages.


(Not in particular an endorsement of qprint; there are many comparable utilities, but nothing particularly ubiquitous. It’s unfortunate that the GNU Coreutils maintainers steadfastly refuse to include a similar utility.)

Answered By – tripleee

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published