How to remove escape codes from strings after scraping a website

Issue

I am trying to learn data science with Python on Simplilearn. In the Matplotlib section, they scrape a race results page (the URL is in the code below).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url="https://www.hubertiming.com/results/2018MLK" #OPEN LINK
html=urlopen(url)
soup=BeautifulSoup(html,"lxml")
title = soup.title
print (title)
print(title.text)
links = soup.find_all('a',href=True)
for link in links:
    print (link['href'])
data =[]
allrows=soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data=data[4:]
print(data[-2:])

And this is the result:

[['190', '2087', '\r\n\r\n                    LEESHA POSEY\r\n\r\n                ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n                    112 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    36 of 37\r\n\r\n                ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n                    ZULMA OCHOA\r\n\r\n                ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n                    113 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    37 of 37\r\n\r\n                ', '0:00', '1:43:27']]

How can I get rid of the \r\n\r\n? I already tried the "replace" function, but it says "'list' object has no attribute 'replace'", and I cannot use strip either.

Solution

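You have a 2D list (a list of lists), so string methods such as replace() and strip() have to be called on each string inside it, not on the list itself.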
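To see where the error in the question comes from, here is a small sketch. The variable names (rows, error_message, cleaned_cell) are made up for illustration, and the data is a shortened copy of one scraped row.

```python
# Shortened copy of one scraped row (illustrative data, not the full table).
rows = [['190', '\r\n\r\n    LEESHA POSEY\r\n\r\n  ']]

# A list has no string methods, so calling replace() on it fails.
try:
    rows.replace('\r\n', '')
except AttributeError as exc:
    error_message = str(exc)

# Each cell is a str, so string methods work once you index down to the cell.
cleaned_cell = rows[0][1].strip()

print(error_message)
print(cleaned_cell)
```

The call only works at the innermost level, which is why the fix has to loop over both the rows and the cells.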

What we are leveraging:
  1. List comprehension
  2. strip() method
  3. That's it 🙂

Use the code below:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)

Output:

[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
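Note that strip() only trims whitespace at the two ends of a string. If a cell ever had line breaks in the middle as well, one option is to split on whitespace and join the pieces back together. This is only a sketch: normalize_cell is a hypothetical helper, and the sample row is invented to show a name broken across lines (the scraped data in the question does not contain such a case).

```python
def normalize_cell(cell):
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace, including \r and \n.
    return " ".join(cell.split())

# Invented sample row: the name is broken across two lines on purpose.
rows = [['190', '\r\n\r\n   LEESHA\r\n   POSEY\r\n\r\n  ', 'F 40-54']]

normalized = [[normalize_cell(cell) for cell in row] for row in rows]
print(normalized)
```

If trimming the ends is enough, you can also avoid the second pass entirely by cleaning each cell while scraping, i.e. dataRow.append(cell.text.strip()) in the loop from the question.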

Answered By – abhira0

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
