Creating an application for checking lottery numbers – Design, Modelling and Code creation

In Brazil we have a lottery that pays millions of reais / dollars twice a week.  This value is paid for the winner that got all 6 numbers (out of 60) correctly.  We can buy a ticket with the numbers we choose (or random numbers) for up to 8 draws, what means that we can win at any draw within the next 30 days or so.

But they also pay money if you got 4 or 5 numbers drawn.  It can be a tedious process to check if you got any number right, since there are 15 combinations for 4 numbers, 6 combinations for 5 numbers and a single combination for all the 6 numbers, with a total of 22 paying combinations.

In the first version of my code, I am checking all those 22 combinations automatically for a user that plays only 6 numbers (yes, we can play more numbers, up to 15, with increasing costs).  But how to be sure that this is done in an efficient manner?

I started with the process that I usually start any application:

  • What I want to be done?
  • What interfaces will I have to work with the user?  (e.g. command line, graphical interface, web interface, etc.)
  • What interfaces will I have to work with the data? (e.g. one single app accessing the data, more than one app, web app, external access, just internal access, etc.)
  • Where do I want things done? (e.g. and the frontend code, at backend code, at database, in the server, etc.)
  • etc.

This makes me think and define some simple rules that will be the foundations of my API and that will guide my code.

Most of the time, I do not have answers to all those questions and need to play on the safe side.  The usual answers — and something I have to be very careful with since it is already a bias — are:

  • <something>
  • Web Interface
  • A web app that might have external parties consuming the informaiton from it
  • Database, browser, backend — in that order.

With that in mind, I started the development of the database based on the data that the lottery provides and that I thought would be useful for some statistical analysis.

This would be analyzing the customer’s requirements (myself, in this case) and modelling the database to be able to represent that information in a manner that will be able to answer the basic questions from the customer.

The second step would be analyzing indices and other database specific questions that will allow me to deal with the volume of data, the type of queries that will be performed, the number of concurrent accesses, the type of concurrent queries, etc.  It is the work of a DBA to plan that.

The third step is creating the code.  The first question is: where?

As I wanted to try it out fast, and the situation is not all that complicated, I could code it inside the database itself (PostgreSQL allows me to code very complex things inside of it, in many different programming languages, so this is one of the reasons why I must be careful to not have everything inside the database itself).  The next question becomes: in which language?

The fastest language is plain SQL, then plpgsql, and then plpython (that is Python embedded on the database). My first thought, out of convenience was Python.  Then, I do not needed a full set of it to justify requiring plpython to run a simple query, so I moved to plpgsql.  But, this meant that I was not generating the combinations dynamically, so why not trying to go to plain SQL?  And that is what I ended up doing.

The following code helped me to get all possible combinations for 4 and 5 numbers out of 6 possible numbers:

>>> a=itertools.combinations([1, 2, 3, 4, 5, 6], 4)
>>> for x in a:
...    print x
...

and

>>> a=itertools.combinations([1, 2, 3, 4, 5, 6], 5)
>>> for x in a:
...    print x
...

This I copied onto my SQL code, applied some magic text replacement commands and ended up with my SQL function in PostgreSQL:

CREATE OR REPLACE FUNCTION f_v_is_possible_result(
       i_dezena1 INTEGER,
       i_dezena2 INTEGER,
       i_dezena3 INTEGER,
       i_dezena4 INTEGER,
       i_dezena5 INTEGER,
       i_dezena6 INTEGER)
       RETURNS SETOF INTEGER
AS $_$
SELECT contest_number
       FROM megasena
       WHERE
-- Checking four numbers
        ARRAY[$1, $2, $3, $4] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $3, $5] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $3, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $4, $5] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $4, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $3, $4, $5] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $3, $4, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $3, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$2, $3, $4, $5] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$2, $3, $4, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$2, $3, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$2, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$3, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
-- Checking five numbers
        ARRAY[$1, $2, $3, $4, $5] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $3, $4, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $3, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $2, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$1, $3, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
        ARRAY[$2, $3, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6] OR
-- Checking six numbers
        ARRAY[$1, $2, $3, $4, $5, $6] <@ ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6]
       ORDER BY contest_number;
$_$ LANGUAGE SQL;

In this particular case, I exploited some advantages from PostgreSQL and worked with array comparisons to check any of the possible combinations, since the number of conditions was small.  If I had been working with all 15 numbers that one can play, then I would have to dynamically generate that in the code.

The “<@” operator working with arrays means “is contained in”, so I check that the combinations from the left side are contained in the array of drawn numbers.  “dezena1” to “dezena6” are the drawn numbers — their column names, actually –, already ordered in ascending order (even though with arrays on PostgreSQL this ordering does not matter, the only requirement is that all elements from the left side are available on the right side to be a successful match).

This approach, as I mentioned before, would not work with all the 15 numbers since there are 1365 combinations for 4 numbers out of 15, 3003 combinations for 5 numbers out of 15 and 5005 combinations for 6 numbers out of 15, adding up to 9373 possible combinations.  Even with magic text replacement, this would mean a lot of work and a huge code for a SQL function.  Not the best idea at all…  In this case, I would go with a Python function.

With that being done, now I can move on to the next step: the web interface building.

As with the latest applications, I will be using ExtJS and grids.  The numbers from my bet will be the filters and this function above will help me limiting the number of lines I will be showing on the grid.

Around version 2 or 3, I will be changing it to allow up to 15 numbers to be played, with a minimum of 6 numbers.  But, then, it is another project, with another design phase, decision making, etc.  I will try remembering to post about it here when that time comes (and if I get there, since I never play more than 6 numbers anyway…).

Why this post?  Because the process is the same for every application and because this is a somewhat common problem.  Using arrays and their special operators shows up as an elegant solution to a problem.

Happy hacking and wish me luck with my bets. 🙂

 

UPDATE1:. I won a lesser prize.  Found out with the above code. About US$ 300, last week.

UPDATE2:. After a series of interactions with a colleague, I changed the above code to the one below:

CREATE OR REPLACE FUNCTION f_v_is_possible_result(
       i_dezena1 INTEGER,
       i_dezena2 INTEGER,
       i_dezena3 INTEGER,
       i_dezena4 INTEGER,
       i_dezena5 INTEGER,
       i_dezena6 INTEGER,
       i_matches INTEGER = 4,
       OUT contest_number INTEGER,
       OUT numbers INTEGER[],
       OUT quantity INTEGER)
       RETURNS SETOF RECORD
AS $_$
WITH numbers AS (SELECT contest_number, ARRAY_INTERSECT (ARRAY[$1, $2, $3, $4, $5, $6],
                                                   ARRAY[dezena1, dezena2, dezena3, dezena4, dezena5, dezena6]) AS intersect
                 FROM megasena 
                 ORDER BY contest_number)
     SELECT numbers.contest_number,
            numbers.intersect AS numbers,
            ARRAY_LENGTH(numbers.intersect, 1) AS quantity
     FROM numbers
     WHERE ARRAY_LENGTH(numbers.intersect, 1) >= $7;
$_$ STABLE LANGUAGE SQL;

The “WITH” clause will allow me to process the information more than once while preventing the query from being repeated at every time I use it.

The “i_matches” parameter allow me to show cases where there was no prize won or filter out specific types of prizes (4, 5 or 6 numbers matching).

The new code also shows me which numbers matched.  Here’s the new output:

megasena=# select * FROM f_v_is_possible_result(03, 11, 20, 49, 51, 59);
 contest_number |   numbers    | quantity 
----------------+--------------+------------
           1335 | {3,20,51,49} |          4
(1 row)

megasena=# select * FROM f_v_is_possible_result(03, 11, 20, 49, 51, 59, 3);
 contest_number |   numbers    | quantity 
----------------+--------------+------------
             56 | {20,51,59}   |          3
            217 | {3,51,49}    |          3
            296 | {3,20,51}    |          3
            548 | {51,49,59}   |          3
            658 | {11,3,20}    |          3
            857 | {3,49,59}    |          3
           1034 | {3,20,49}    |          3
           1035 | {11,51,59}   |          3
           1057 | {3,49,59}    |          3
           1192 | {20,49,59}   |          3
           1247 | {11,3,51}    |          3
           1335 | {3,20,51,49} |          4
(12 rows)

megasena=#

My prize was on contest number 1335.

Numbers, precision and PostgreSQL

It is interesting how numbers can create problems while developing systems.

While we can’t write one third (1/3) in decimal without resorting to infinite representation of the number through periodic repetitions (I don’t know if the term “periodic number” is correct in English) there is the same issue with other base systems, for example, it is impossible to write one tenth (0.1 or 1/10) in binary without periodic numbers.

A second issue is not the numeric conversion itself, but the manner in which this number will be stored at the database server. There are different standards that try to solve the conversion issue (BCD representation, for example) but even then, people (mainly developers and DBAs) need to be aware of them and what are the implications on their usage.

Other problems happen when making comparisons. Due to the innacuracy of the float type and the hardware where the software is run, assuming that you can get x = 0.05 to work all the time is something problematic (specially when 0.05 becomes 0.0500000001…). The recommended approach, then, is using intervals, so instead of x = 0.05 we would write the equivalent to 0.4999 < x < 0.5001 (or any other variant that takes into account the precision you can get on your numeric system and platform(s)).

Do you want a whole new beast to fight with? Name it rounding. Yes. It might appear to be a simple operation, but then ask for someone on the Maths field to round some numbers and ask the same for someone in the Physics field. You will get different results. And different explanation for the processes. (And if you start including error coefficients, error propagation, etc. then this beast will really become a nightmare…)

Back to my small tiny world (with not so many beasts to fight…), the reason that led me to write about this was some discussion on numbers, their representation on screen and their storage on database systems. It is amazing how a lot of developers lack some formality on their understanding of what they do. It is like they are only code replicators or writers. There is, of course, a vast number of them that understand what they do, but I feel that the majority don’t know what they are doing. Specially with more high level languages.

On the database side, PostgreSQL’s numeric types is a manual page that is largely ignored (forgotten?) by its users. It is nice that it also brings up attention and some awareness about the numeric issues and their representation /storage on systems.

I have submitted a comment on that page asking for more clarification on how to calculate NUMERIC and DECIMAL storage requirements… The current information is just:

The actual storage requirement is two bytes for each group of four
decimal digits, plus five to eight bytes overhead.

Which is helpful, but could be improved. Maybe some examples could make it a lot clearer.