Kitz ADSL Broadband Information
adsl spacer  
Support this site
Home Broadband ISPs Tech Routers Wiki Forum
 
     
   Compare ISP   Rate your ISP
   Glossary   Glossary
 
Please login or register.

Login with username, password and session length
Advanced search  

News:

Author Topic: Hash functions  (Read 2061 times)

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Hash functions
« on: January 28, 2017, 04:56:50 AM »

Can your average very cheap and outrageously fast (non-cryptographic) hash function on a string of bytes or machine words be made better-behaved by adding or XORing in i, the offset to or index number of the current machine word being fetched, into the hash algorithm as you go? That is

Code: [Select]
  for ( size_t i=0; i < length; i++)
      {
      hash = hash + i; /* extra step */
      hash = some_fn( hash, data[ i ] );
      }

or, more generally,

Code: [Select]
  for ( size_t i=0; i < length; i++)
      {
      hash = some_fn( hash, data[ i ], i );
      }

instead of

Code: [Select]
for ( size_t i=0; i < length; i++)
      {
      hash = some_fn( hash, data[ i ] );
      }

Since the first case is only something that has an effect more like XORing /adding in a long constant string, I am doubtful that it is any better than simply XORing / adding the total length, which of course is vastly cheaper. The general case 2, is a completely different matter. Am I correct in my thinking?
« Last Edit: January 28, 2017, 04:00:24 PM by Weaver »
Logged

burakkucat

  • Respected
  • Senior Kitizen
  • *
  • Posts: 38300
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Hash functions
« Reply #1 on: January 28, 2017, 03:30:25 PM »

Your first two code snippets are identical . . .  ???
Logged
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.

Please consider making a donation to support the running of this site.

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Re: Hash functions
« Reply #2 on: January 28, 2017, 04:01:07 PM »

Oops, cut and paste failure somehow. Hopefully now fixed.
Logged

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Re: Hash functions
« Reply #3 on: February 01, 2017, 04:16:14 AM »

Actually, initialising the hash to equal the string length, or something related to it, is a very cheap alternative, and perhaps no worse in some situations. But if the inner some_fn() is non-linear anyway then perhaps options 2 or 1 are better, because where strings differ in that two bytes are swapped then the hashes will have a chance of coming out different.
« Last Edit: February 01, 2017, 04:32:00 AM by Weaver »
Logged

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Re: Hash functions
« Reply #4 on: February 01, 2017, 04:23:38 AM »

I should have said that all I want from the hash is that its output has a good distribution for use in a hash table, nothing more.
Logged

burakkucat

  • Respected
  • Senior Kitizen
  • *
  • Posts: 38300
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Hash functions
« Reply #5 on: February 01, 2017, 05:27:30 PM »

Might I suggest that you may find the answer by running simulations with the proposed code and then analysing the output (especially from "corner cases")?  :-\
Logged
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.

Please consider making a donation to support the running of this site.

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5369
Re: Hash functions
« Reply #6 on: February 01, 2017, 06:48:43 PM »

In the dim and distant past, I've had to dream up a few hash table lookup algorithms.   Optimisation was always paramount, else why bother hashing?

I can't remember (probably never knew) the academically optimal solutions, but I did usually give some thought to 'how random is the data I am hashing'?  For example, when hashing 48 bit MAC addresses, one might consider whether the upper bits that identify the vendor may change less frequently than the lower bits. In which case, optimisation might be a case of halfing the workload by simply ignoring the upper bits.

Hasten to add, above is just an example to illustrate a point, and it may be a flawed example.   That would hopefully not detract from the point I am trying to make. :D
Logged

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Re: Hash functions
« Reply #7 on: February 02, 2017, 08:26:16 PM »

Burakkucat is quite right about doing a simulation. I wouldn't know how to choose the data - indecision - as I simply had this in mind as a very general theoretical question rather than having a particular use case in mind. But that is indeed a poor excuse.

I read something about the cases
     xx xx 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
vs  xx xx 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
and the desire to not have them hash to the same value.

And also the cases
       xx xx xx aa xx xx bb xx xx xx
vs    xx xx xx bb xx xx aa xx xx xx

and
        xx xx xx aa xx xx
vs
        xx xx aa xx xx xx

where bytes are moved around or swapped, it is just positions that vary. My solution of simply xoring in the index position is not strong enough, so something more costly (what) would be needed to fix that.
Logged

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5369
Re: Hash functions
« Reply #8 on: February 02, 2017, 11:29:51 PM »

Unless you have some knowledge of the nature of the data, why should swapped bytes be worth special treatment.

Surely...

        xx xx xx aa xx xx
vs
        xx xx aa xx xx xx

Is no more or less likely, in a truly random world, than...

        xx xx xx aa xx xx
vs
        xx xx bc xx xx xx



Logged

Weaver

  • Senior Kitizen
  • ******
  • Posts: 11459
  • Retd s/w dev; A&A; 4x7km ADSL2 lines; Firebrick
Re: Hash functions
« Reply #9 on: February 02, 2017, 11:38:11 PM »

No, the article I read suggested that it is undesirable that swapped bytes give the same hash, and that the length be irrelevant in a stream of all zeros. I like the idea of including the length as part of the data, it for a hash table I am not sure the first 'requirement' is significant.

I wrote some code in D and in C and compiled both with GCC for x86-64 and inspected generated code in order to examine the likely cost of including an index in as part of an extremely cheap and fairly nasty typical sting hash. It may have a small effect because of register pressure if the index is thrown away by the compiler and a pointer is compared with an end address, rather than using a (possibly scaled) indexed addressing mode.
Logged
 

anything