Why does the line count differ between two different ways of loading text?

Issue

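import pathlib

file_path = 'vocab.txt'

# Count 1: read the whole file as one string and split it with str.splitlines
vocab = pathlib.Path(file_path).read_text().splitlines()
print(len(vocab))

# Count 2: iterate over the open file object, one line at a time
count = 0
with open(file_path, 'r', encoding='utf8') as f:
  for line in f:
    count += 1

print(count)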

The two counts are 2122 and 2120. Shouldn’t they be the same?

Solution

The documentation for str.splitlines states that the line boundaries this method recognizes are a superset of "universal newlines":

This method splits on the following line boundaries. In particular,
the boundaries are a superset of universal newlines.

Representation    Description
\n                Line Feed
\r                Carriage Return
\r\n              Carriage Return + Line Feed
\v or \x0b        Line Tabulation
\f or \x0c        Form Feed
\x1c              File Separator
\x1d              Group Separator
\x1e              Record Separator
\x85              Next Line (C1 Control Code)
\u2028            Line Separator
\u2029            Paragraph Separator
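
As a small illustration of the table above, here is a sketch using a made-up sample string (not taken from the asker's vocab.txt). It shows str.splitlines breaking on a form feed and on U+2028, neither of which is a universal newline:

```python
# Hypothetical sample text: two real '\n' newlines, plus a form feed
# and a Unicode line separator embedded in the middle.
sample = "alpha\nbeta\x0cgamma\u2028delta\n"

pieces = sample.splitlines()
print(pieces)       # ['alpha', 'beta', 'gamma', 'delta']
print(len(pieces))  # 4, although the text contains only two '\n' characters
```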

Iterating over a text file, on the other hand, will by default use the universal-newlines approach to interpret line delimiters. From the docs:

When reading input from the stream, if newline is None, universal
newlines mode is enabled. Lines in the input can end in '\n', '\r', or
'\r\n', and these are translated into '\n' before being returned to
the caller. If newline is '', universal newlines mode is enabled, but
line endings are returned to the caller untranslated. If newline has
any of the other legal values, input lines are only terminated by the
given string, and the line ending is returned to the caller
untranslated.
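
To see both behaviours side by side, and to find which lines cause the discrepancy, something like the following sketch may help. The file name and contents are invented for the demonstration; for a real file you would point `path` at it and drop the write step.

```python
import tempfile
from pathlib import Path

# Hypothetical contents: three '\n'-terminated lines, two of which hide
# an extra boundary (form feed, U+2028) that only str.splitlines honours.
content = "a\nb\x0cc\nd\u2028e\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab_demo.txt"  # assumed name, for illustration only
    path.write_text(content, encoding="utf8")

    # Count 1: split the whole text with str.splitlines.
    splitlines_count = len(path.read_text(encoding="utf8").splitlines())

    # Count 2: iterate over the file (universal newlines), and record any
    # line that str.splitlines would break into more than one piece.
    iterated_count = 0
    suspects = []
    with open(path, "r", encoding="utf8") as f:
        for lineno, line in enumerate(f, start=1):
            iterated_count += 1
            if len(line.splitlines()) > 1:
                suspects.append((lineno, line))

print(splitlines_count, iterated_count)
for lineno, line in suspects:
    print(lineno, repr(line))  # repr makes the hidden characters visible
```

If the two counts should agree, one option is to use the same method for both, for example by iterating over the file in each case, or by splitting the text on '\n' only and discarding a trailing empty string.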

Answered By – juanpa.arrivillaga

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
