Introduction: Coding Connections in Perl

"Connessione" - "Make the break-through connection" is one of the key points from the book Work Like Davinci by Michael Gelb

In this Instructable, you will learn about the process I followed to create a reusable module in Perl. This module encapsulates the details for evaluating pairwise associations in Perl. These evaluations can be used for decision support, finding common connections, as well as unique or duplicate connection detection. We will first go through simple models of approaches, then build a module, test it, discuss further development, and finish off with how to share it with the broader Perl community. This Instructable was created as a submission to the Coded Creations Competition, so favs & votes are appreciated.

Background:

In working on applications developed in Perl, I have frequently had need to make observations based on associations to help ensure data integrity. Perl's associative arrays are well suited to making key value pairs. However, I found that I needed to find connections based values. The built in reverse function isn't up to the job when associations are not unique. The following is a walk through of how I developed a reusable Perl module to inspect key value pairs.

Components Required:

  • Perl - Perl was pre-configured on my Linux machine. For other machines, you can download Perl for free from perl.org
  • Your favorite text editor
  • Terminal / command line interface

Step 1: Pairwise Connections in Perl - Methods

Consider the following example:
Telephone books provide an alphabetically ordered list of names, allowing the reader to find what phone number is assigned to a person. There is also such a thing as a "reverse directory" which allows one to start with the phone number and get the name of the person associated with that number. (Handy when you want to find out who is interrupting your dinner!).

Now the problem, let's say you are working with a data source that does not guarantee 1-1 associations. An example of this would be telephone party lines. A problem emerges when more than one person is associated with a number. In Perl, if you used an associative array for quick search, a reversed version might lose information. The images here illustrate how using the reverse function on a hash can lead to loss of information. In one example, we have an associative array of phone numbers. It looks like we lost track of "Max". In the other example, we have fictitious catalog of internet connected devices. In this case we lost track of "consoles". Other obvious indications of multiple devices using the same address aside, the loss of information will make evaluating with a plain associative array problematic.

On a larger scale, records are typically composed of several fields. A phone book, also includes street addresses. The address field is useful for some look-ups but not for others. The address field would be an example of an associated value that is not indexed. So, one could select the needed fields for evaluation and a unique id field that relates it back to the larger record. These selected field can help with support decision making, and the the unique id field can then enable you to select from other fields. For example, got the cellphone, but want to try the land line.

I've invested time and effort to make working in Perl with this type of data easier and I'd like to share my solution. A Perl "module" is perfect for this. Perl provides "modules" that can be used to encapsulate details of an implementation for reuse. In the world of Perl you'll find a vast library of modules provided by CPAN, and any developer can make their own module with marginal effort and even offer it up to others. I like to build a reusable module for two reasons. First I want to encapsulate the detail of implementation. Second I want to make consistent reuse of the implementation. To use a module one simply includes it in their code using the keyword use.

The things I want to implement with this module include:

  • Initialization & member access / assert
  • Print & Debug
  • Add Key Value Pair
  • Remove Key Value Pair
  • Does Key exist
  • List All Keys
  • Invert Key Value Pair

Step 2: Simple Model

Perl essentially has 3 primitive data types. Stings, indexed arrays and associative arrays.
Very early on, I discovered it was easier for me to work with pointers to these structures, and de-reference where needed. Associative arrays lend themselves well to key value pairings. However, this is not the end of this consideration, as a key may map to more than one value. Depending on the application, strictly defining the value of the association may clobber a prior value. So let's see how we might model multiple values that map to a key.

Consider the following approaches:

Code:

#!/usr/bin/perl <br>use strict; 				# Just do it!
use warnings; 
use Data::Dumper;  <br>
my $key1 = 'city1';
my $key2 = 'city2';
my $key3 = 'city3';

my $val1 = 'zip1';
my $val2 = 'zip2';
my $val3 = 'zip3';
my $val4 = 'zip1'; 


my @assoc = ($val1,$val3,$val4);	# Make a quick list Notice the repeated value
my $hashref;				# Initialize the structure
#approach 1 hash of string-tokens
foreach my $value (@assoc) {
  if ($hashref->{$key1}) {
    my $current_value = $hashref->{$key1};
    $hashref->{$key1} = $current_value . "," . $value;
  }else{
    $hashref->{$key1} =  $value;
  }
}
# print Dumper $hashref;
#approach 2 key-assoc-cnt hash
foreach my $value (@assoc) {
  if ($hashref->{$key2}->{$value}) {
    $hashref->{$key2}->{$value} ++;
  }else{
    $hashref->{$key2}->{$value} = 1;
  }
}
# print Dumper $hashref;    
#approach 3 hash - array associations
foreach my $value (@assoc) {
  if ($hashref->{$key3}) {
    push( @{$hashref->{$key3}} , $value);	#switch order push vs unshift
  }else{
    $hashref->{$key3} =  [ $value ];
  }
}
print Dumper $hashref;

Output:

#perl association_approaches.pl<br>$VAR1 = {
          'city2' => {
                       'zip3' => 1,
                       'zip1' => 2
                     },
          'city1' => 'zip1,zip3,zip1',
          'city3' => [
                       'zip1',
                       'zip3',
                       'zip1'
                     ]
        };

In each of these approaches, the behavior is determined by the existence / definition of the key. Additionally, the methods leverage Perls "autovivification" concept. Each key will build up a series "tokens." In the first approach, the value of a key is directly used as a string. If more than one token is required, it is added to the string separated by an arbitrary character. In this case it would be a comma, but more generally, it may be selected to enable tokenizing with the split function. In the second approach, the value is kept separated by making it another layer of association. Ultimately, the hash will need to be assigned a value. It could be set to be any number of values. The value selected is initialized to 1 and incremented in subsequent registrations. The third approach looks at using indexed arrays for each of the values.

The first approach was used in an implementation, where I was under a bit more of a time crunch. It shows its upbringing as something used to handle csv files. I found that as tokens became more complicated it became harder and harder to pick a reliable token separator. When I had time to go back and evaluate alternative approaches I found that I liked the second approach. Where as the third approach was cumbersome for my application, the second approach offered the ability to create a unique list with the side benefit of counting the number of times the pair was defined. In my application the order of input was not important, as long as being able to iterate over the entire set of keys. applications where ordering is more significant, using arrays makes more sense.

Now let's see what we can do with approach number 2.

# original hash from approach 2 
$VAR1 = {
        'city2' => {<br>                    'zip3' => 1,
                    'zip1' => 2
                   }
}
# Extending from approach 2 above
# Reverse it
my $revhash = {};
foreach my $key (keys(%$hashref)){
foreach my $val (keys(%{$hashref->{$key}})){
  $revhash->{$val}->{$key} = $hashref->{$key}->{$val};
}
}
print Dumper $revhash;

Output:

#perl revit.pl

$VAR1 = {
          'zip3' => {
                      'city2' => 1
                    },
          'zip1' => {
                      'city2' => 2
                    }
        };
#

Notice how the keys and values have swapped, and that the link count remains the same.

Step 3: Building the Module

So let's see what we learned in the simple model and combine it with the methods list to create a reusable module. The pairwise.pm file attached contains the executable code.

The structure of a module is as follows :

cat PACKAGENAME.pm

#!/usr/bin/perl   
use strict;<br>package  PACKAGENAME;

<all that module code>

1; 
# optionally 
# documentation  
# commented out test cases 

Let's flesh out the module methods described in step 1.

#initialization & member methods

sub new { 
  my $self = {}; 
  $self->{'map'} = {};        #empty hash 
  $self->{'sort'} = 'dnc';    #sort order
  $self->{'sep'} = ',';       #separator character default comma
  $self->{'debug'} = '';      #debug tokens 
  bless($self); 
  return $self; 
}
sub debug {
    my ( $self, $val ) = @_; 
    $self->{'debug'} = $val if defined($val); 
    return $self->{'debug'}; 
}
sub reset { 
    my ( $self, $val ) = @_; 
    $self->{'map'} = {}; 
    return ; 
}
#Helper method to override separator for map2str
sub separator { 
    my ( $self, $val ) = @_; 
    $self->{'sep'} = $val if defined($val); 
    return $self->{'sep'};     
}

# Print Methods

sub map2str {
  my ($self ) = @_;
  my $str  = '';
  my $hashref = $self->{'map'};
  foreach my $key (keys(%$hashref)){
  #print "key=$key\n";
  foreach my $val (keys(%{$hashref->{$key}})){
     # line format= key,cnt,val\n
       $str .= $key;
       $str .= $self->{'sep'};
       $str .= $self->{map}->{$key}->{$val};
       $str .= $self->{'sep'};
       $str .= $val;
       $str .= "\n";  
  }
  } 
  if ($self->{'debug'}  =~ m/map2str/) {
    $str .= "DEBUG" . "\n";
    $str .= Dumper $self->{'map'};
  } 
  return $str; 
}

# Association Manipulation Methods

#Add Key Value Pair
sub addkvp { 
  my ($self , $key , $val) = @_;
  my $res = '1';
  if ( defined($val) && defined($key)) {
   $self->{'map'}->{$key}->{$val} ++ ;
  } else {
   $self->{'map'}->{$key}->{$val} = '1';
   $res = '0';
  } 
  return  $res; 
}

#assert the whole map if you already have itsub assertkvps { 
  my ($self , $map) = @_;
  my $res = '1';
  if ( defined($map) ) {
   $self->{'map'} = $map ;
  } else {
   $res = '0';
  } 
  return  $res; 
}
#Remove Key Value Pair
sub removekey { 
  my ($self, $key ) = @_; 
  my $is_reg = 0 ; 
  if (exists $self->{map}->{$key}) {  
   $is_reg = 1 ; 
   delete $self->{map}->{$key}; 
  } 
  return $is_reg; 
}
#Invert Key Value Pair<br>sub invertkvp { 
  my ($self) = @_; 
  my $hashref = $self->{map}; 
  my $revhash = {};
  foreach my $key (keys(%$hashref)){
  foreach my $val (keys(%{$hashref->{$key}})){
    $revhash->{$val}->{$key} = $hashref->{$key}->{$val};
  }
  }
  $self->{map} = $revhash; 
  return; 
}

#Association Evaluation Methods:

# Does Key exist returns 0/2 Boolean based on key's existance
sub is_registered { 
  my ($self, $key ) = @_; 
  my $is_reg = 0 ; 
  if (exists $self->{map}->{$key}) {  
    $is_reg = 1 ; 
  } 
  return $is_reg; 
}
# List All Keys returns arrayref
sub getkeys { 
    my ($self, $key ) = @_; 
    #my @reg;
    my @reg  = keys(%{ $self->{map} }); 
    return \@reg; 
}
# get key value returns arrayref
sub getvalue { 
    my ($self, $key ) = @_; 
    my @reg;
    if (exists $self->{'map'}->{$key}) {         
        @reg  = keys(%{ $self->{'map'}->{$key} });
    } else { print "no\n";} 
    return \@reg; 
} <br>
#get the map back as a hash 
sub gethash { 
    my ($self) = @_; 
    my $reg = $self->{map}; 
    return $reg; 
}
#non-uniq Key Value Pair returns hashref of keys with more then 1 value
sub nonuniqkvp { 
  my ($self) = @_; 
  my $hashref = $self->{map}; 
  my $nonuniq = {};
  foreach my $key (keys(%$hashref)){
  foreach my $val (keys(%{$hashref->{$key}})){
    if ($hashref->{$key}->{$val} gt 1) {
      $nonuniq->{$key}->{$val} = $hashref->{$key}->{$val};
    }
  }
  } 
  return $nonuniq; 
}
#non-uniq Key Value Pair returns a hashref of kvp with link count greater then 1
sub nonuniqkeys { 
  my ($self) = @_; 
  my $hashref = $self->{map}; 
  my $nonuniq = {};
  foreach my $key (keys(%$hashref)){
    if (scalar(keys(%{$hashref->{$key}}) gt '1')) {
      $nonuniq->{$key} = $hashref->{$key};
    }
  } 
  return $nonuniq; 
}

Don't forget to close out the package with "1;"

1;

Next let's exercise the package with test case examples.

Step 4: Examples and Test Cases

Let's exercise the module, with some test cases.

To use the module we need to

  • add the location of the module to the @INC path list
    • This is done after at the first line with -I ./
    • ./ assumes the module is in the same directory as the script
  • add the line "use pairwise;"
    • note the package name, filename of the module were chosen to match

The teststubs_pairwise.pl contains a running version of these examples:

#!/usr/bin/perl -I ./ <br>use strict; 
use Data::Dumper;  
use pairwise;
##################
teststub1();
teststub2();
teststub3();
teststub4();
##################

#teststubs listed here

Test case 1: Who's with me.

Donuts are serious business if you want your fair share.
Nothing says we take donuts seriously like automating a call to action.
In the first example we will build a contact association map & and create a call to action.

Code

sub teststub1 {<br>     print "teststub1\n";
  my $kvp1 = pairwise->new();
     $kvp1->addkvp('Chad'  ,'locust team');
     $kvp1->addkvp('Mike'  ,'locust team');
     $kvp1->addkvp('Dave'  ,'locust team');
     $kvp1->addkvp('Jon'   ,'locust team');
     $kvp1->addkvp('Chad'  ,'volunteer');
     $kvp1->addkvp('Jon'   ,'volunteer');
     $kvp1->addkvp('Steve' ,'Citizen Forester');
     $kvp1->addkvp('Dave' ,'Music Fan');
     # print Dumper $kvp1;
     print $kvp1->map2str() . "\n";
     $kvp1->invertkvp();
     print $kvp1->map2str() . "\n";
     my $locust_team = $kvp1->getvalue('locust team');
     print "locust team Assemble! Attn:" . join (' ', @$locust_team) . "\n";
return;
}

Output (Those donuts don't stand a chance!)

teststub1
Dave,1,Music Fan
Dave,1,locust team
Jon,1,locust team
Jon,1,volunteer
Mike,1,locust team
Chad,1,locust team
Chad,1,volunteer
Steve,1,Citizen Forester

Music Fan,1,Dave
locust team,1,Dave
locust team,1,Mike
locust team,1,Jon
locust team,1,Chad
Citizen Forester,1,Steve
volunteer,1,Jon
volunteer,1,Chad

locust team Assemble! Attn:Dave Mike Jon Chad

Test Case 2: 1 potato 2 potato ....

Can we find out if there is more than 1 potato

Code

sub teststub2 {<br>     print "teststub2\n";
  my $kvp1 = pairwise->new();
     $kvp1->addkvp('potato1'  ,'red');
     $kvp1->addkvp('potato2'  ,'blue');
     $kvp1->addkvp('potato1'  ,'blue');
     $kvp1->addkvp('potato1'   ,'red');
     $kvp1->addkvp('potato2'  ,'blue');
     $kvp1->addkvp('potato3'   ,'red');
     $kvp1->addkvp('potato4'   ,'red');
     $kvp1->addkvp('potato4'   ,'red');
     #print $kvp1->map2str() . "\n";
     $kvp1->invertkvp();
     #print $kvp1->map2str() . "\n";
     #Mr Potato impersonators...
     my $redpotatos = $kvp1->getvalue('red');
     if (scalar(@$redpotatos) gt '1') {
    print "More then 1 red potato here \n";
     } else {
    print "Nothing is here ... move along \n";
     }
}

Output (Notice the size of the array provides an indication of link strength)

teststub2
More than 1 red potato here

Test Case 3 : The Hunt for the Red Potato

In logic there is a concept of .Modus Tollens. MT boils down "denying the consequent."

Said another way, if p implies q and you don't have q then you don't have p
We will use this as part of a process of elimination to find the true ---uhhh--- red potato.

Code

sub teststub3 {
     print "teststub3\n";
  my $kvp1 = pairwise->new();
     $kvp1->addkvp('potato1'  ,'blue');
     $kvp1->addkvp('potato2'  ,'red');
     $kvp1->addkvp('potato1'  ,'red');
     $kvp1->addkvp('potato1'   ,'blue');
     $kvp1->addkvp('potato2'  ,'red');
     $kvp1->addkvp('potato3'   ,'blue');
     $kvp1->addkvp('potato4'   ,'blue');
     $kvp1->addkvp('potato4'   ,'blue');

     print "starting with : \n";
     print $kvp1->map2str() . "\n";

     $kvp1->invertkvp();
     my $bluelist = $kvp1->getvalue('blue');
     
     foreach my $tot (@$bluelist){
    $kvp1->removekey($tot);
        print "removing $tot from consideration\n";
     }
     print "Modus Tolens NOT blue : \n";
     print $kvp1->map2str() . "\n";  
}

Output (Notice despite multiple associative links, the potatoes with blue associations were removed in 1 step)

More then 1 red potato here <br>teststub3
starting with : 
potato4,2,blue
potato2,2,red
potato3,1,blue
potato1,2,blue
potato1,1,red
removing potato4 from consideration
removing potato3 from consideration
removing potato1 from consideration
Modus Tolens NOT blue : 
potato2,2,red

Test Case 4 : Histograms & Finding intersections

So if values represent a coordinate for a specific key, it can be possible to determine intersections

Code:

sub teststub4 {<br>     print "teststub4 finding interesection of defined points\n";
  my $kvp1 = pairwise->new();
     $kvp1->addkvp('wire1'  ,'2,1');
     $kvp1->addkvp('wire1'  ,'2,2');
     $kvp1->addkvp('wire1'  ,'2,3');
     $kvp1->addkvp('wire1'  ,'2,4');
     $kvp1->addkvp('wire1'  ,'2,4');

     $kvp1->addkvp('wire2'  ,'4,2');
     $kvp1->addkvp('wire2'  ,'3,2');
     $kvp1->addkvp('wire2'  ,'2,2');
     $kvp1->addkvp('wire2'  ,'1,2');
     print "starting with :\n";
     print $kvp1->map2str() . "\n"; 
     $kvp1->invertkvp();

     my $multi = $kvp1->nonuniqkeys();
     print "nonuniqkeys returns a hash of keys with multiple associations / overlaps\n" . Dumper $multi;

     $multi = $kvp1->nonuniqkvp();
     print "nonuniqkeys returns a hash of kvp defined  multiple times \n" . Dumper $multi;
   
}

Output: (Notice that the field contains commas and the separator is by default a comma. This illustrates how tokenizing can be made more complicated. The seperator function can be used to change what the map2str method joins the fields on.. Alternatively, one could strategically take advantage of the situation. In either case, consistent repeatable implementation allows the developer to understand and work with the behavior. )

teststub4 finding intersection of defined points<br>starting with :
wire1,1,2,3
wire1,1,2,2
wire1,2,2,4
wire1,1,2,1
wire2,1,3,2
wire2,1,1,2
wire2,1,4,2
wire2,1,2,2

nonuniqkeys returns a hash of keys with multiple associations / overlaps
$VAR1 = {
          '2,2' => {
                     'wire1' => 1,
                     'wire2' => 1
                   }
        };
nonuniqkeys returns a hash of kvp difined  multiple times 
$VAR1 = {
          '2,4' => {
                     'wire1' => 2
                   }
        };

Step 5: Reader Exploration

This module has plenty of room to grow. I have listed some of my ongoing development / improvement notes

  • I created 1 debug method as part of the tbl2str method.
    The method uses a regular expression to match the function name.
    If a match is found the method will add diagnostic information to its output.
    • This allows for debugging without hacking up the module with print statements, and you still get to skip using the debugger :P .
    • What other methods would have useful debug output? -- each time an item is added/removed?
    • Is this the best way to print Debug info? would it interfere with normal operation
  • Should the inversion of the key value pairs be persistent, that is to say replace the original hash, or should it be returned and asserted separately?
    • I typically have needed to re-invert.
    • What problems would come up if I kept both kvp inversions?
  • The keen eye will have noticed that the new method contained a field called "sort." The idea of this fields is to define a preference for how things are printed out... One might enhance the module using different sort methods
  • I wrote this to solve a very specific problem with data integrity in a variety of inputs.
    • What other uses can you think of?

    • Might this map to higher order connections? IE using more than 2 table features, or key-abstraction-value chain

    Step 6: Sending My Module to CPAN

    CPAN provides a massive library for developers.

    At the time of this writing they claim the following stats.

    The Comprehensive Perl Archive Network (CPAN) currently
    has 149,808 Perl modules in 31,669 distributions, written by 12,135 authors, mirrored on 251 servers.
    The archive has been online since October 1995 and is constantly growing.
    - http://www.cpan.org/index.html

    Something about that looks generated by calculation, but I digress.

    Originally, I hesitated to put my code up to CPAN, mostly due to concerns about naming and time commitment to maintain the module. However, I'd rather lean towards sharing my effort and if it becomes too much, it looks like I can hand off the primary ownership. Let's save some folks a bit of their midnight oil

    Stay Tuned

    Because CPAN Uses the "PAUSE" mechanism to facilitate contributions and ensure quality, my code is not yet available in CPAN. Stay tuned for more on how the submission process works out for me. (As a first time submitter, I don't know if I will get through the approval process before the Instructables Coding Contest takes place. If not perhaps that's something for a future Instructable.) Essentially, I would like to share this solution with anyone who needs to solve the similar problem, so feel free to comment or contact me with questions.

    Coded Creations

    Participated in the
    Coded Creations