Python – Why does re.findall return a list of tuples when the pattern only contains one group

Tags: findall, python, regex

Say I have a string s containing letters and two delimiters 1 and 2. I want to split the string in the following way:

  • if a substring t falls between 1 and 2, return t
  • otherwise, return each character

So if s = 'ab1cd2efg1hij2k', the expected output is ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k'].

I tried to use regular expressions:

import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(1([a-z]+)2|[a-z])', s )

[('a', ''),
 ('b', ''),
 ('1cd2', 'cd'),
 ('e', ''),
 ('f', ''),
 ('g', ''),
 ('1hij2', 'hij'),
 ('k', '')]

From there I can do [ x[x[-1]!=''] for x in re.findall( r'(1([a-z]+)2|[a-z])', s ) ] to get my answer, but I still don't understand the output. The documentation says that findall returns a list of tuples if the pattern has more than one group. However, my pattern seems to contain only one group. Any explanation is welcome.

Best Solution

Your pattern has two groups. The first is the outer group, which wraps the whole pattern:

(1([a-z]+)2|[a-z])

and the second is the smaller group nested inside the first:

([a-z]+)
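
To check this yourself, here is a small sketch (the variable names are just for illustration). Groups are numbered by the position of their opening parenthesis, and a compiled pattern reports the count in its groups attribute. The three findall calls show how the return shape changes with zero, one, and two groups:

```python
import re

# Every "(" that is not "(?:" or another "(?..." form opens a capturing group.
pattern = re.compile(r'(1([a-z]+)2|[a-z])')
group_count = pattern.groups  # number of capturing groups in the pattern

s = 'ab1cd2'

# No groups: findall returns the whole matches as strings.
no_groups = re.findall(r'1[a-z]+2|[a-z]', s)

# Exactly one group: findall returns that group's text as strings.
one_group = re.findall(r'(1[a-z]+2|[a-z])', s)

# Two groups: findall returns one tuple per match, one slot per group.
# A group that did not take part in the match shows up as ''.
two_groups = re.findall(r'(1([a-z]+)2|[a-z])', s)

print(group_count)
print(no_groups)
print(one_group)
print(two_groups)
```

The last call is the shape from the question: the nested parentheses count as a second group, so each match becomes a 2-tuple.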

Here is a solution that gives you the expected result, although it is really ugly and there is probably a better way that I can't figure out:

import re
s = 'ab1cd2efg1hij2k'
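# Group 1 is the whole match, group 2 the text between 1 and 2, group 3 a single letter.
a = re.findall( r'((?:1)([a-z]+)(?:2)|([a-z]))', s )
# Keep the last non-empty entry of each tuple.
a = [tuple(j for j in i if j)[-1] for i in a]

>>> print(a)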
['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']
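
For what it's worth, two possible alternatives avoid the tuple post-processing. This is only a sketch, assuming the input always looks like the example (lowercase letters with well-formed 1...2 pairs); the names via_finditer and via_lookaround are made up for illustration. The first keeps a single capturing group and uses re.finditer to fall back to the whole match when that group did not participate. The second uses lookarounds, so the pattern has no capturing groups and findall returns plain strings:

```python
import re

s = 'ab1cd2efg1hij2k'

# Option 1: one capturing group plus finditer.
# m.group(1) is None when the single-letter branch matched,
# so "or" falls back to the whole match, m.group(0).
via_finditer = [
    m.group(1) or m.group(0)
    for m in re.finditer(r'1([a-z]+)2|[a-z]', s)
]

# Option 2: lookarounds instead of groups.
# (?<=1) requires a preceding "1" and (?=2) a following "2",
# but neither delimiter becomes part of the match.
via_lookaround = re.findall(r'(?<=1)[a-z]+(?=2)|[a-z]', s)

print(via_finditer)
print(via_lookaround)
```

Both lists should equal the expected output from the question. The lookaround version is shorter, while the finditer version is probably easier to extend if the delimiters get more complicated.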