Importing text files with Tensorflow in Google Colab from Github

Issue

I am trying to load and preprocess text documents in Google Colab using TensorFlow, but I seem to be having problems importing my text files from GitHub.

Based on Example 2 of the tutorial, I ran:

import pathlib
from tensorflow.keras import utils

DIRECTORY_URL = 'Github URL'
FILE_NAMES = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

However, I’m getting these odd results:

Sentence:  b'            name-with-owner="YWtlNzAwL1B5dGhvbg=="'
Label: 3
Sentence:  b'        </tr>'
Label: 0
Sentence:  b'        </tr>'
Label: 0

Am I correct in assuming that TensorFlow is reading in the raw text files from GitHub? I also tried the first example in the above tutorial by uploading a zipped folder of the text files, extracting it via TensorFlow in Colab, and then reading the files.

data_url = 'GithubUrlWithZip.7z'

dataset_dir = utils.get_file(
    origin=data_url,
    extract=True,
    cache_dir=None,
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
text_dir = dataset_dir/"datasets"
list(text_dir.iterdir())
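(Editor's note: one thing to watch here is that get_file's extract=True only understands .zip and .tar-style archives, via Python's zipfile and tarfile modules; a .7z archive is downloaded but left unextracted. What extract=True does for a .zip can be sketched offline like this, with illustrative paths:

import pathlib
import zipfile

# Build a small zip like the one uploaded to GitHub (illustrative content).
src = pathlib.Path("datasets")
src.mkdir(exist_ok=True)
(src / "file1.txt").write_text("some training text")
with zipfile.ZipFile("datasets.zip", "w") as zf:
    zf.write(src / "file1.txt")  # stored as datasets/file1.txt

# Roughly what extract=True does for .zip archives:
with zipfile.ZipFile("datasets.zip") as zf:
    zf.extractall("extracted")

print((pathlib.Path("extracted") / "datasets" / "file1.txt").read_text())

So for this workflow the archive should be a .zip or .tar.gz, not a .7z.)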

but the text files did not read well when I checked.

sample_file = text_dir/"file1.txt"

with open(sample_file) as f:
  print(f.read())

The file reads as follows:

<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQ..." rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" />
  ... (many more GitHub asset stylesheet links trimmed for brevity) ...

I also tried putting the files in my Google Drive, but for some reason I had issues reading the text files from there. In the future, if I’m working with larger and more numerous text datasets, I presume I won’t be uploading them to Google Drive and reading them locally from Colab, so I want to learn how to properly import text files from GitHub or another source.

The tutorials seem to keep their files in Google Cloud Storage (I was using this Google Cloud link), but when I tried uploading my files there, it was extremely slow.

Is there another method generally used for ML models like this or with text-based work?

Solution

It looks like you might be downloading GitHub’s web page for each file instead of the actual "raw" file.

You can get raw files using this kind of URL: https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/README.md

Note the raw.githubusercontent.com host at the start of the URL.

Here is the original URL for comparison: https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/README.md

Without the raw URL, you download GitHub’s HTML page for the file rather than the file’s contents, which is exactly what your output above shows.
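(Editor's note: if you have many files, the blob URL can be rewritten to its raw equivalent with a small helper. A sketch, assuming the standard github.com/<user>/<repo>/blob/<branch>/<path> layout:

def to_raw_url(blob_url):
    """Rewrite a github.com 'blob' URL to its raw.githubusercontent.com form."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))

url = "https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/README.md"
print(to_raw_url(url))
# https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/README.md

The resulting raw URL can then be passed as origin to tf.keras.utils.get_file as in the question's first snippet.)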

[Image: GitHub README page with the Raw button highlighted]

Answered By – Daniel Bourke

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
