Useful Perl Scripts With Regular Expressions Replace On Multiple Files

By Matthew Drouin - 2003-12-18

Replace On Multiple Files

We are going to use File::Find, a Perl module, to process all the files in a directory and its subdirectories. The module works on Unix, Windows, and Mac OS machines, but Mac users will want to consult the File::Find documentation for the few issues Macs have with it and their workarounds.

The code below traverses directories but does not follow symbolic links. If the directory you run it on contains real subdirectories, the script will process every file in them, and in their subdirectories in turn, but anything reached only through a symbolic link is skipped. You can make find follow symbolic links with the follow option, but you will want to read the documentation on it first.

#!/usr/bin/perl

use File::Find;
use strict;
use warnings;

my $directory = "/home/directory";

find (\&process, $directory);

sub process
{
    my @outLines;  # Data we are going to output
    my $line;      # Data we are reading line by line

    #  print "processing $_ / $File::Find::name\n";

    # Only parse regular files that end in .html; the -f test skips
    # directories that happen to have a matching name
    if ( -f $_ && $File::Find::name =~ /\.html$/ ) {

        # find has already changed into the file's directory, so open
        # the file by its bare name in $_
        open ( my $in, '<', $_ ) or
            die "Cannot open $File::Find::name for reading: $!";

        print "\n" . $File::Find::name . "\n";
        while ( $line = <$in> ) {
            # Strip any attributes from the opening body tag
            $line =~ s/<body\b[^>]*>/<body>/i;
            push(@outLines, $line);
        }
        close ( $in );

        open ( my $out, '>', $_ ) or
            die "Cannot open $File::Find::name for writing: $!";

        print {$out} @outLines;
        close ( $out ) or
            die "Cannot close $File::Find::name: $!";

        undef( @outLines );
    }
}

In the code above we start with use File::Find; which gives us the find function. We then declare my $directory and set it to the path of the directory we want to process. The last thing the main part of the code does is call find. Its first argument is a reference to the processing function (that is what \&process means), a callback that find calls once for every file and directory found under the main directory. The remaining arguments are the directory or directories we want to search.
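As a sketch of the calling conventions just described: find accepts a list of directories after its first argument, and that first argument may be a hash reference of options instead of a bare code reference. The example below uses the documented wanted and follow keys. The directory names and the collect_html and find_html helpers are hypothetical, and follow only makes sense on systems that support symbolic links.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;

# Hypothetical directories; replace with real paths.
my @directories = ("/home/directory", "/home/other");

my @found;    # Full paths of the .html files seen during the walk

# Callback: remember regular files whose names end in .html.
sub collect_html
{
    push(@found, $File::Find::name) if -f $_ && /\.html$/;
}

# Walk every directory passed in and return the .html files found.
sub find_html
{
    my @dirs = grep { -d $_ } @_;    # ignore directories that do not exist
    @found = ();

    # Options-hash form of the call. follow => 1 asks find to follow
    # symbolic links; leave it out to get the default behaviour.
    find({ wanted => \&collect_html, follow => 1 }, @dirs) if @dirs;

    my @sorted = sort @found;
    return @sorted;
}

print "$_\n" for find_html(@directories);
```

With follow switched on, find has to remember what it has already visited, and by default it dies if a directory is about to be processed a second time, so try it on a copy of your files first.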

The most complex part of this script is the processing subroutine, which is called for each file and directory found within the main directory. The simple form of find used here does not filter by file type, so our processing code is handed directories as well, and if we try to open a directory, at least on Windows, the script will crash. We also do not want to run the body tag replacement on image files or other binary files: we have no reason to change them, and we could certainly mess them up.
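Because find calls the callback for every entry, the filtering has to happen inside the callback. A minimal sketch of such a filter, using Perl's file test operators, might look like this. The wanted_file name is hypothetical and not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for entries the script should edit.
# Expects a path that can be reached from the current directory,
# which is what $_ is inside a find callback.
sub wanted_file
{
    my ($path) = @_;

    return 0 if -d $path;                  # skip directories
    return 0 unless -f $path;              # skip anything that is not a plain file
    return 0 unless $path =~ /\.html$/;    # skip images and other file types
    return 1;
}
```

Inside a callback it would be used as the first statement, for example return unless wanted_file($_);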

Since we only want to parse HTML documents and change their body tags, we add an if statement that says: if the file name ends with .html, let's parse the file. Knowing we have an HTML file, we open it and read it line by line, looking for the body tag on each line. When we find a body tag we replace it and keep searching through the rest of the file. To be more efficient we could stop searching after the first replacement, but I will leave that little modification up to you.
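One way that modification could look is sketched below: a flag records that a body tag has been replaced, and the remaining lines are then copied without running the regular expression. The strip_body_attributes name and the $done flag are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Takes the lines of one HTML file and returns a copy in which the
# first opening body tag has lost its attributes. Once a tag has been
# replaced, the remaining lines are passed through untouched.
sub strip_body_attributes
{
    my @lines = @_;    # work on a copy of the caller's lines
    my $done  = 0;     # set to 1 after the first replacement

    for my $line (@lines) {
        next if $done;
        $done = 1 if $line =~ s/<body\b[^>]*>/<body>/i;
    }
    return @lines;
}

my @html = ("<html>\n", "<BODY bgcolor=\"#ffffff\">\n", "<p>text</p>\n");
print strip_body_attributes(@html);
```

Like the script itself, this works one line at a time, so a body tag whose attributes are spread over several lines will not be matched.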

The next thing we do is close the file and reopen it in write mode, which truncates it. We then write every line back out, whether or not we made a change. Finally we clean up by closing the output file and, just to be tidy, calling undef on @outLines, the array that holds all the data we are going to write out.
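If rewriting untouched files is a concern (it updates their modification times, for instance), the read and write steps could be arranged roughly as follows. This is only a sketch: the rewrite_if_changed name and the $changed flag are hypothetical, and the pattern here requires at least one character after body so that a bare body tag does not count as a change.

```perl
use strict;
use warnings;

# Hypothetical variant of the read/modify/write step. Returns 1 if the
# file was rewritten and 0 if it was left untouched.
sub rewrite_if_changed
{
    my ($path) = @_;
    my @outLines;        # Data we are going to output
    my $changed = 0;     # Becomes 1 once a body tag has been replaced

    open(my $in, '<', $path) or die "Cannot open $path for reading: $!";
    while ( my $line = <$in> ) {
        $changed = 1 if $line =~ s/<body\b[^>]+>/<body>/i;
        push(@outLines, $line);
    }
    close($in);

    return 0 unless $changed;    # nothing was replaced, keep the file as it is

    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} @outLines;
    close($out) or die "Cannot close $path: $!";
    return 1;
}
```

Inside the find callback this would be called with $_ as the path, after the check that the entry is an .html file.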
