Generating and Analyzing
Prime Numbers


Preface

Feedback

If you wish to comment on this material in the form of additions, detected errors or questions about the code, send an email to primes1e12 "at" earthlink.net.

I am not a mathematician. This project represents a programming challenge to me. I'll be happy to discuss programming, but please direct questions on deep mathematics to more qualified authorities.

Don Kostuch
October, 2018

Getting Started

I recently took an interest in prime numbers after watching an episode of "Infinite Series" on YouTube. I was (and am) fascinated by Goldbach's Conjecture:

All even numbers (greater than 2) are the sum of two prime numbers.

No counter example has been found for values less than 4E18.

I view this not so much as an opportunity to uncover new worlds of mathematics, as a challenge to produce software that runs efficiently and correctly on a typical home PC.


Requirements and Challenges

My initial work focused on finding a program structure that ran with reasonable performance.

After a few tests I ran out of prime numbers for testing. Although a Google search of "Prime Numbers" produces 65 million hits, lists of more than a few thousand primes is scarce.

I found a site that does provide primes up to one trillion (1E12), consisting of over 3700 files. This would have been enough to keep me busy and satisfied for a while. Unfortunately there were errors scattered among the 3700--files missing or flawed.

My focus switched to creating a collection of primes suitable for my tests. I downloaded several prime generators that proved accurate but slow. The repair of the flawed files would take several months, running on five computers.

I concentrated on producing 64 bit primes (the native limit on my PC) and devising a way to efficiently represent and store them. This would allow me to get back to Goldbach and leave a useful package of software for others to experiment with.

The initial sections describe the fundamental elements of the programs and reasons behind their design. If their purpose is not clear at first, read on and carefully study the top level programs and classes to see how they are used.

I use Visual Studio 12 C++ for these examples. A free version is available from Microsoft.com. Nearly all the code is "Standard Vanilla" C++, with heavy use of the Standard Template Library.

To run these programs on machines other than the one running the Visual Studio compiler, check the following configuration setting:

  Configuration Properties
    C/C++
      Code Generation
        Runtime Library

  Change  "Multi-threaded DLL (/MD)"

  To      "Multi-threaded (/MT)".

This will considerably expand the size of the executable, but it will not require any possibly absent resources on the new target machine.

Some advice to simplify debugging:

Avoid complex, deeply nested expressions. You may be unable to check the value of intermediate expressions and may also confuse the debugger.

Use exceptions to detect and report

Programming errors
Invariant violations
Normal "exceptional" conditions such as end of file.

Pack lots of information about the exceptional condition into the exception object's string. It is only created and used infrequently.

In a few places I use Windows or DOS interfaces, which are encapsulated in functions or classes to hide the details from the main client code. The description of the code should make conversion to other systems relatively easy.




Objectives

Performance and Simplicity

The following designs are meant to be "easy" to explain and debug and provide reasonable performance and memory usage for an interesting collection of primes (at least up to a trillion). The code works up to the largest 64 bit prime.

I have not broken the "square root" limit, but it is at beyond my current horizon.

A file containing a sequence of primes is named by non-prime values preceding the first prime and following the last prime. This avoids ambiguity in a sequence of files or other groups. The last prime in a group will not be duplicated at the beginning of the next group, since the specifiers are not prime.


Internal Prime Representation

For computational purposes I represent primes as unsigned 64 bit (unsigned long long) integers, the largest native type on PCs. The limit is 18,446,744,073,709,551,615 (18 Quintillion, 1.8E19). This is small potatoes compared to the REALLY LARGE primes, but it is enough to keep me and my PC occupied for the foreseeable future.


Primality Verification

I have compared the output of my prime generator with numerous available prime lists and generators. I have not detected any errors, but neither have I tried all 18 Quintillion possibilities.


File Representation


ASCI Text

Publicly available prime lists usually use ASCI characters, with values separated by tab or newline characters. They require no special software. C++ handles them with the class istream and the extraction operator, operator>>(). The existence and location of any given prime candidate requires either a sequential scan or a non-trivial binary search (the values are not at a predicable location).

The list of primes up to a trillion (1E12) occupies 454 GB. The repetition of digit patterns allows this to be compressed to 61 GB.




PrimeC -- Prime Compact
Using Compressed Bit Mapping

Since all primes beyond 5 end in 1, 3, 7 or 9, a sequence of 20 potential primes can be mapped into one 8 bit byte. Each bit represents a particular odd number and the bit specifies if that number is prime.

The location of the byte and bit specifying any given number is easily computed and facilitates direct access to determine its primality. The representation can be efficiently copied between a file and a corresponding byte array of unsigned char.

This representation has no upper bound on the value of the primes nor the size of a file. Since every odd candidate is represented, the size of the file is fixed for any interval of the same size.

The last 1334 64 bit primes span a range of 51558--one prime for 45 non-primes. In this range the average prime uses under 2.5 bytes, compared to 20 characters (bytes), or 8 bytes as a binary integer.

The approximate number of non-primes per prime is ln(x).

Decimal
Digits
Non-primes
per prime
Bits
per Prime
PrimeC Bits
1022329
20446418
408812836
8018425672

As the size of primes grows, the length of the representation grows at the same rate as the prime density. The primeC representation is thus always less than the ordinary binary representation.

The compressed decimal character representation is about 1/9 the expanded size, and thus comparable in size to primeC, but cannot be used directly.

A convenient file size is for a range 10 Billion (1E10) numbers, which 500MB.

The internal details of primec are presented in the sections on classes PrimeCVector, PrimeCArray, PrimeCFileWriter, PrimeCFileReader, and DirectoryPrimeC.




Non-Class Header Files


Prime.hpp

The file Prime.hpp provides a common set of identifiers used in the prime related classes and programs.

Besides reporting runtime errors, exceptions are used to report normal ending conditions: end of file, end of sequence, etc.




Classes


StopWatch

A StopWatch is used to measure elapsed time for processes or events. Its primary use is for progress reports to update the user on elapsed time and estimated time to completion.


StopWatch Header


StopWatch Code



MicroWatch


A MicroWatch is similar to a StopWatch, but with microsecond resolution. It is useful for performance analysis.

MicroWatch Header


Progress


A Progress object is used to release a message to the user to assure the program is running properly. The constructor argument determines the time interval between messages.

The volatile bool release specifies whether the time for a report has expired. volatile tells the compiler to avoid using a cached copy of its value and to always access the original variable. No special access restrictions are necessary since the occurrence (or not) of a report is not crucial to the correctness of the program.

release is initialized to false. The first report will occur after the specified wait interval.

go() is called by the client code to determine if the progress message should be issued. Eventually work() will set release = true, causing go() to return true.

go() immediately sets release = false causing go() to return false on subsequent calls until release is set true again.

work() sleeps for the specified wait interval, and then sets release = true and goes back to sleep.

If the interval between calls to go() is longer than the Progress wait interval, the timing will be delayed, but the program will continue otherwise normally.

If a non-zero value is specified for the second constructor argument, a report will will be issued (true returned) every time go() is called that number of times. independently of the wait interval


Progress Header



Directory

A Directory object collects the file names in a directory into an STL vector of strings and allows the client code to process the list with native vector functions, especially iterators.

The load() function uses a DOS system() call to collect the file names--crude but simple. It can be easily modified to work in non-Windows environments.


Directory Header



Directory Code



DirectoryStream

DirectoryStream uses a Directory to combine a collection of txt files into one virtual stream. When the client code reaches end of file the DirectoryStream automatically closes the file and opens the next file without interruption.


DirectoryStream Header

Here is a sample client code snippet. get() will throw an EOF exception only when the last file in the directory reaches end of file.

   DirectoryStream<Prime> primeStream( directoryName ) ;
// ...
   while ( true )
   {
     Prime v = primeStream.get() ;   // Get a txt prime
//   ...
   }
// ...

Class DirectoryPrimeC performs a similar function for primeC files.


CandidateGenerator


The CandidateGenerator emits a sequence of possible primes, culling out many non-primes with simple tests. The primality of the candidate is verified by a PrimeCVector or PrimeVector.

The culling algorithm is:

Ignore all even numbers.

Ignore numbers ending in 5.

Small values (2, 3, 5, and 7) are handled as special cases.

This immediately reduces the sieving by 60%.


CandidateGenerator Header



CandidateGenerator Code



PrimeVector


The PrimeVector determines the primality of a number in the function isPrime(), by computing the residue, using primes up to the square root of the candidate number, until a zero result is obtained. The list of primes is stored in the inherited vector class, and can thus be used by client code as a vector.

The square root of the largest 64 bit prime is 32 bits, so the PrimeVector stores only 32 bit primes (up to 4294967291).

The primary vector services are:

Sequential access to primes for sieving and client use.
Fast binary lookup for isPrime().

The PrimeVector initializes itself, using a CandidateGenerator and the PrimeVector's own previously collected primes. This allows the user to create a collection of prime files from scratch.

The process is initiated by "seeding" the PrimeVector with the first 10 primes and then extending the list. The verification process requires only primes up to the square root of the new candidate. No other data is required.

An attempt to append (push_back()) an element to an already full vector causes it to double its reserved space. If this occurs when the last value is appended, the total memory footprint during the resize operation is three times the final requirement and the vector ends up twice as large as necessary. If the final size is accurately known, the resize is avoided by reserving the needed space when the vector is created. There are 203280221 32 bit primes so the constructor performs this allocation.

PrimeVector Header



PrimeVector Code



PrimeGenerator

The PrimeGenerator uses a CandidateGenerator and a PrimeVector to produce primes in a given range.

To generate primes near the end of the 64 bit representation, the PrimeVector must contain all (203,280,221) primes up to 2**32. The memory footprint is about 900MB, large, but manageable in an ordinary PC. Execution time becomes the limiting factor at this level.

For earlier prime values, the PrimeVector needs only a few thousand primes.


PrimeGenerator Header



PrimeGenerator Code


Class PrimeGeneratorC performs a similar function with these differences. It requires a previously created primec file containing the 32 bit primes. It uses less RAM and loads much faster, but does not produce primes from scratch, like the PrimeGenerator.


PrimeC

PrimeC is a compact (c in PrimeC) representation of primes with these characteristics:

The storage (or file space) required for each prime is much less than characters or binary--about the same the compressed (zipped) representation.

The most commonly used function, isPrime(), is fast and efficient.


PrimeCVector


PrimeCVector is an abstract class that provides the common algorithms for navigating primeC data:

  // Location of v, or exception if v not prime 
  PrimeCVector::iterator find( const PrimeRange v ) ;
 
  // Return v or next value ending in 1,3,7,9 
  Prime nextPossible( const PrimeRange v ) ;
 
  // Return iterator to v or subsequent prime.
  PrimeCVector::iterator findNextPrime( const PrimeRange v ) ;
 
  bool isPrime( const PrimeRange v )  ;

PrimeCVector::iterator

PrimeCVector::iterator provides the basic sequential access:

                          // Throws exception on error
  Prime  operator*() ;    // Return current prime target
                          //  stored in iterator
   
  void operator++( int ) ;   // Move to next prime and cache it

  void operator--( int ) ;   // Move to previous prime and cache it

PrimeCVector Header



PrimeCVector Code



PrimeCArray

PrimeCArray is a concrete class derived from PrimeCVector, and implemented as an unsigned char array. It provides fast access, avoiding external file references, but is constrained to available internal storage. The collection of primes up to 2^32 is 209MB.

PrimeCArray Header



PrimeCArray Code



PrimeCFileWriter


PrimeCFileWriter implements the PrimeCVector interface to create new primec files. After the file has been created, access the prime data with PrimeCFileReader (for large sequences) or PrimeCArray for sequences that will fit in RAM. A range of 1E10 is 488MB.

See BuildPrimeC for sample client code.


PrimeCFileWriter Header



PrimeCFileWriter Code



PrimeCFileReader


BuildPrimeC provides access to primec data with these services:

            // Locate prime equal to or greater than v
  PrimeCVector::iterator  findPrime( const PrimeRange v ) ;

            // Return first prime and subsequent primes
  Prime getPrime() ;   

PrimeCFileReader Header


PrimeCFileReader Code



DirectoryPrimeC

DirectoryPrimeC provides sequential access to primes from the combination of all the primec files in a directory, without interruption in the client code. Here is a sample of client code which compares primec data to txt primes.


  // ...

  DirectoryStream	<Prime> psin( txtDirectoryName ) ;
  DirectoryPrimeC        pcin( PrimeCdirectoryName ) ;

  while ( true )         // Exit on EOF exception (end of either directory)
  {                      // Both get() bridge to next file, if any
    sv = psin.get() ;    // Get a txt prime   
    cv = pcin.get() ;    // Get a compact prime   
     
    if ( sv != cv )      // Prime values don't match?
    {
     cout_cerr << "Mismatch: " << sv 
               << " , " << cv << endl ;
     pause() ;
  }
  //...

DirectoryPrimeC Header


PrimeGeneratorC

PrimeGeneratorC is distinguished from PrimeGenerator only in that it is not self-generating, it requires a previously created list of primes to initialize it for testing of new prime candidates.


PrimeGeneratorC Header



PrimeGeneratorC Code



Programs

This is a set of programs to produce and test various files.


Goldbach


Synopsis

The Goldbach program, the original impetus of this project, uses the constructed files and classes to verify the hypothesis for the range of available primes.


Operation


The (DOS) command

  Goldbach 4 1000 smallList  FloatDir
will verify that the even integers from 4 to 1000 can be formed from the sum of two primes.

smallList is a primec file that is copied into a primec array used to check the primality of the smaller member of a Goldbach pair. FloatDir contains a collection of primes covering the range of even number to be tested.

The core logic is:

Select an even candidate.
Find the next smaller prime.
Subtract the prime from the candidate.
If the difference is prime, the prime pair is found.
If not, find the next smaller prime and continue.

Two classes are the heart of the program:

PrimeCArray

An array of "small" (beginning at 2) primes, used to determine if the difference is prime. It is initialized at the beginning to the program and is not changed.

PrimeFloatVector

The term "Float" refers to the range of the vector which "floats" upward to encompass the prime values needed by the Goldbach algorithm as the candidate even values increase beyond the original/current upper bound of the vector.

PrimeFloatVector is a vector of primes with a variable beginning, used to provide the first element of the Goldbach pair. The program begins by subtracting the next smaller prime from the even candidate. When the even candidate exceeds the largest prime in the vector, the bottom half of the vector is discarded and the top half is filled with succeeding primes from the FloatDir. The vector is large enough to supply a prime element for the pair without "underflowing".


PrimeFloatVector Header



PrimeFloatVector Code



Goldbach Code



BuildTxtPrime


Synopsis

BuildTxtPrime produces a txt file of primes in the range specified by the command line parameters. The numbers are separated by newlines.


Operation

The (DOS) command

BuildTxtPrime 2 1000

will produce a txt file named "000000000002_to_000000001000.txt" containing the primes in the interval 2 to 1000, separated by newlines.

The first parameter must be 2 or larger.

Short lists of small numbers will run quickly. The running time expands both with the length of the interval and the value of the primes generated. One of my desktops produced primes near 1.8E19 in a range of 100000 in just over 12 hours.


BuildTxtPrime Code



BuildPrime4


Synopsis


BuildPrime4 constructs three files containing all 32 bit primes:

Text
Prime4, 32 bit binary
PrimeC, Compressed; see PrimeCVector for details.


BuildPrime4 Operation

   BuildPrime4

creates the three file:

   000000000000_to_004294967292.primec
   0_to_4294967291.prime4   (32 bit binary)
   0_to_4294967291.txt      (ASCI text)

BuildPrime4 Code



RunTxtPrimeGen


Synopsis


RunTxtPrimeGen runs BuildTxtPrime in segments to avoid total loss of results in the event of computer or power failure. Near 1.8E19, generating primes in a range of 100000 takes about 12 hours. The resulting files (about 50KB) can be catenated with the DOS copy command.


RunTxPrimeGen Operation

   RunTxtPrimeGen start end range

Start and end specify the overall range to be generated.

range specifies the range for each individual file.



RunTxPrimeGen Code



Summary

References

Prime Lists

www.primos.mat.br/indexen.html

The site contains various lists of primes up 1E12 (one trillion). The primes are represented as ASCI characters with a tab character separator. The complete series is over 3000 files containing 10 million (1E7) primes per file. The two problems in using these files are:

The location of the breaks between files is "random", every 10,000,000 prime values.

The file names are unrelated to the content:
"Ate_100G_part1.txt" contains 2 to 179424673.


https://en.wikipedia.org/wiki/List_of_prime_numbers


The first 1000 primes, followed by lists of notable types of prime numbers.


http://mathforum.org/dr.math/faq/faq.prime.num.html

Dr. Math on primes covering among other subjects:
    Finding Prime Numbers
    Frequency of Primes
    Is Zero Prime or Composite?
    Large Prime Numbers
    No Largest Prime Number
    p, p+8, p+22 Not Prime
    Primality Test
    Prime Factorization
    Prime Number Formula
    Prime Number Information
    Prime Number Theorems
    Prime Numbers 20-30
    Relative Primes
    Why is 1 Not Considered Prime?
    Writing a Program to Generate Primes

    Finding Primes and Proving Primality
    The First 10,000 Primes
    The Great Internet Mersenne Prime Search
    How Many Primes Are There?
    The Largest Known Primes
    Mersenne Prime - Susan Stepney
    Mersenne Primes - Jon Vinopal
    Mersenne Primes: History, Theorems and Lists
    Prime Number (Eric Weisstein's World of Mathematics)
    Prime Numbers and Factors
    The Prime Pages - Chris Caldwell
    73939133 - Prime Numbers - Amazing Number Facts
    Prime Theorem of the Century
    Students' Mersenne Prime Page
    Why do people find these primes? - Chris Caldwell

Google

The query "prime number list" produces 480,000,000 hits.