Remove in-text citation numbers but not decimal numbers without referencing groups? (regex)

I wrote a small Python program that applies regex changes and converts my PDF textbook into audio files to listen to while I drive. It occurred to me that I could instead use the PDF reader Librera Reader, which has built-in TTS and regex replacement, to do this more flexibly while also being able to read along easily. However, Librera Reader can't use a group reference in the replacement text.

This is the substitution I had been using:

([a-zA-Z|\)|%][\.|\,|a-z|\)])\d+(?:[-,]\d+)*

Here is a simplified version that does most of the work for the purpose of this question:

([a-zA-Z][\.])\d+

Replaced with:

\1

Is there a way to use regex to match a number that follows a letter and a period, without using a group reference in the replacement and without matching a number-period-number string such as 1.5, so that I could make the following conversion?

test words.7 Also 1.5 is a number that can test.9

test words. Also 1.5 is a number that can test.
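
For reference, here is a minimal sketch of how the simplified substitution behaves in Python's `re` module, where a group reference in the replacement is available. The variable names are only illustrative, and Librera Reader's own regex engine may differ from Python's.

```python
import re

text = "test words.7 Also 1.5 is a number that can test.9"

# Simplified pattern from the question: a letter, a period, then digits.
# The letter and period are captured so they can be put back with \1.
pattern = r"([a-zA-Z][\.])\d+"

result = re.sub(pattern, r"\1", text)
print(result)
```

The decimal 1.5 is left alone because the character before its period is a digit, not a letter.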

1 answer

  • answered 2020-08-03 09:11 Wiktor Stribiżew

    I understand you used | inside [...] to "better" visually separate parts of the character class, but you also made | part of the class that now matches a literal pipe. You need to remove these pipes.
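
A small Python sketch of the pipe issue, assuming Python's `re` module behaves like the engine in question for character classes (the test string `x|5` is made up for illustration):

```python
import re

# Pattern as written in the question: the pipes sit inside [...],
# so each class also accepts a literal "|".
with_pipes = r"([a-zA-Z|\)|%][\.|\,|a-z|\)])\d+(?:[-,]\d+)*"

# Same classes with the pipes removed.
without_pipes = r"([a-zA-Z)%][.,a-z)])\d+(?:[-,]\d+)*"

sample = "x|5"

pipe_match = re.search(with_pipes, sample)
clean_match = re.search(without_pipes, sample)

print(pipe_match)   # a match object: "|" is accepted as the second character
print(clean_match)  # None: "|" is no longer part of either class
```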

    To solve the current problem, you may turn the capturing group into a positive lookbehind because the pattern is of known length (only two chars before the number (range) you want to remove).

    You may use

    (?<=[a-zA-Z)%][.,a-z)])\d+(?:[-,]\d+)*
    


    The (?<=[a-zA-Z)%][.,a-z)]) positive lookbehind matches a location that is immediately preceded by

    • [a-zA-Z)%] - an ASCII letter, ) or %, and then
    • [.,a-z)] - a period, a comma, a lowercase ASCII letter, or ).
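
The lookbehind version can be checked with a short Python sketch. This assumes the target engine supports fixed-width lookbehind the way Python's `re` does. The second sample string is invented here to exercise the range part of the pattern.

```python
import re

# The lookbehind asserts the two preceding characters without consuming them,
# so the replacement can be an empty string and no group reference is needed.
pattern = r"(?<=[a-zA-Z)%][.,a-z)])\d+(?:[-,]\d+)*"

sample_from_question = "test words.7 Also 1.5 is a number that can test.9"
sample_with_range = "results (p < 0.05).3,5-7 Next"  # hypothetical example

cleaned_question = re.sub(pattern, "", sample_from_question)
cleaned_range = re.sub(pattern, "", sample_with_range)

print(cleaned_question)
print(cleaned_range)
```

Because only the digits are matched, the letter and period stay in the text, and decimals such as 1.5 and 0.05 survive because a digit or a period precedes them where the first class expects a letter, ) or %.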