{"id":20339054,"url":"https://github.com/szczyglis-dev/python-lottery-dataset-analyze","last_synced_at":"2025-06-29T13:02:15.031Z","repository":{"id":152686772,"uuid":"518145539","full_name":"szczyglis-dev/python-lottery-dataset-analyze","owner":"szczyglis-dev","description":"[Python] A Jupyter notebook illustrating methods for analyzing a historical lottery results dataset. The example demonstrates assessing linear relationships between variables, incorporating astronomical data, and visualizing number distributions.","archived":false,"fork":false,"pushed_at":"2024-08-26T16:34:37.000Z","size":284,"stargazers_count":9,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-11T23:15:11.202Z","etag":null,"topics":["analyze-data","astronomy","csv","data-science","datasets","jupyter","linear-regression","lottery-draw","notebook-jupyter","plot","predictive-modeling","probability-distribution","python","random","relationship","skyfield"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szczyglis-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-07-26T16:54:22.000Z","updated_at":"2025-03-19T09:14:32.000Z","dependencies_parsed_at":"2025-04-11T23:12:48.037Z","dependency_job_id":"4acb6eec-9f4a-4579-80db-c047d308d8eb","html_url":"https://github.com/szczyglis-dev/python-lottery-dataset-analyze","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/szczyglis-dev/python-lottery-dataset-analyze","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szczyglis-dev%2Fpython-lottery-dataset-analyze","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szczyglis-dev%2Fpython-lottery-dataset-analyze/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szczyglis-dev%2Fpython-lottery-dataset-analyze/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szczyglis-dev%2Fpython-lottery-dataset-analyze/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/szczyglis-dev","download_url":"https://codeload.github.com/szczyglis-dev/python-lottery-dataset-analyze/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szczyglis-dev%2Fpython-lottery-dataset-analyze/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262598137,"owners_count":23334665,"icon_url":"https://github.com/github.png","versi
on":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analyze-data","astronomy","csv","data-science","datasets","jupyter","linear-regression","lottery-draw","notebook-jupyter","plot","predictive-modeling","probability-distribution","python","random","relationship","skyfield"],"created_at":"2024-11-14T21:15:16.132Z","updated_at":"2025-06-29T13:02:14.840Z","avatar_url":"https://github.com/szczyglis-dev.png","language":"Jupyter Notebook","funding_links":["https://www.buymeacoffee.com/szczyglis"],"categories":[],"sub_categories":[],"readme":"Release: **1.0.1** | build: **2024.08.26** | Jupyter Notebook, Data Science, Python: **\u003e=3.7**\n\n# Python Lottery Dataset Analyze\n\n**This Jupyter notebook illustrates several methods for analyzing a dataset containing historical results from various lotteries. The example demonstrates how to analyze the linear relationships between individual fields by extending the dataset with astronomical data. Additionally, it shows how to visualize the distribution of numbers in specific positions. The `skyfield` package is used to calculate the distances between celestial bodies, and this data is subsequently appended to the dataset.**\n\n## Requirements\n\n- Python 3\n- Jupyter Notebook\n\n**Required packages**\n\n- pandas\n- numpy\n- matplotlib\n- seaborn\n- scipy\n- skyfield\n\n**Screenshot of the Final Result**\n\n![lot_dataset](https://user-images.githubusercontent.com/61396542/181068160-e7971313-7dc1-45d5-86b0-f1755d61ebdc.png)\n\n## Usage step by step\n\n**1. 
Configuration, Initialization, and Module Import**\n\nWe will use historical drawing results for several popular number lotteries in Poland as input data. The draw results will be downloaded to `CSV` files and saved in a local directory on the disk. \n\nThe block below includes configurations for each of these lotteries, such as the names of the columns that will be used later in the `DataFrame` object created from the dataset, number ranges, and the format in which the individual drawing dates are saved. At the end of the block, you should specify the name of the lottery to be analyzed. This block will also load astronomical data for several celestial bodies, which will then be used to extend the dataset with the distances between these celestial bodies at the time of each draw. This data will be used to test the correlation between these events/variables.\n\n```python\nimport os\nimport math\nfrom datetime import datetime\nimport requests\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.stats import linregress\nfrom skyfield.api import load\n\n# define URLs with lotteries historical results in CSV\ncsv_urls = {\n    'lotto': 'https://www.wynikilotto.net.pl/download/lotto.csv',\n    'lotto_plus': 'https://www.wynikilotto.net.pl/download/lotto_plus.csv',\n    'eurojackpot': 'https://www.wynikilotto.net.pl/download/eurojackpot.csv',\n    'minilotto': 'https://www.wynikilotto.net.pl/download/mini_lotto.csv',\n    'multi': 'https://www.wynikilotto.net.pl/download/multi_multi.csv'\n}\n\n# [CSV config]\n# header - list with CSV column names\n#   idx - number of record\n#   date - date field\n#   time - time/hour field\n#   n(x) - primary number(x) field\n#   m(x) - secondary number(x) field\n# n_range - list with primary numbers range [from, to]\n# m_range - list with secondary numbers range [from, to]\n# n_count - number of primary numbers\n# m_count - number of secondary numbers\n# date_format - date field string 
format\n\ncsv_config = {\n    'lotto': {        \n        'header': ['idx', 'date', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6'],\n        'n_range': [1, 49],\n        'm_range': [],\n        'n_count': 6,\n        'm_count': 0,\n        'date_format': '%d.%m.%Y'\n    },\n    'lotto_plus': {\n        'header': ['idx', 'date', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6'],\n        'n_range': [1, 49],\n        'm_range': [],\n        'n_count': 6,\n        'm_count': 0,\n        'date_format': '%d.%m.%Y'\n    },\n    'eurojackpot': {\n        'header': ['idx', 'date', 'n1', 'n2', 'n3', 'n4', 'n5', 'm1', 'm2'],\n        'n_range': [1, 50],\n        'm_range': [1, 12],\n        'n_count': 5,\n        'm_count': 2,\n        'date_format': '%d.%m.%Y'\n    },\n    'minilotto': {\n        'header': ['idx', 'date', 'n1', 'n2', 'n3', 'n4', 'n5'],\n        'n_range': [1, 42],\n        'm_range': [],\n        'n_count': 5,\n        'm_count': 0,\n        'date_format': '%d.%m.%Y'\n    },\n    'multi': {\n        'header': ['idx', 'date', 'time', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', \n                    'n10', 'n11', 'n12', 'n13', 'n14', 'n15', 'n16', 'n17', 'n18', 'n19', 'n20', 'm1'],\n        'n_range': [1, 80],\n        'm_range': [1, 80],\n        'n_count': 20,\n        'm_count': 1,\n        'date_format': '%d.%m.%Y'\n    }\n}\n\n# specify download dir for CSV files\ncsv_dir = os.path.join(os.getcwd(), 'csv')\n\n# choose lottery\nlottery = 'lotto'\n\n# init astronomical data\nplanets = load('de421.bsp')\nearth, moon, sun, mars = planets['earth'], planets['moon'], planets['sun'], planets['mars']  \n```\n\n**2. Function Definitions**\n\nThe following cell defines the functions that will be used in subsequent blocks. 
These functions include downloading and saving the dataset to CSV files and extending the downloaded dataset with new values, which will then be used for further analysis.\n\n```python\n# create directory for CSV download if it does not exist\ndef csv_dir_create(csv_dir):\n    if not os.path.exists(csv_dir):\n        os.makedirs(csv_dir)\n        \n\n# download and save CSV dataset\ndef csv_update(csv_urls, csv_dir):\n    for k, url in csv_urls.items():\n        r = requests.get(url, allow_redirects=True)\n        name = k + '.csv'\n        fname = os.path.join(csv_dir, name)\n        # use a context manager so the file is always closed after writing\n        with open(fname, 'wb') as f:\n            f.write(r.content)\n        print('Downloaded: ' + fname)\n        \n        \n# load CSV dataset        \ndef csv_load(name, header, csv_dir):\n    file = os.path.join(csv_dir, name+'.csv')\n    return pd.read_csv(file, header=None, names=header)\n\n\n# save dataframe to CSV file\ndef csv_save(df, name, csv_dir):\n    file = os.path.join(csv_dir, name+'.csv')\n    df.to_csv(file, index=False) \n    \n\n# return a single date part (strftime code given in `part`) as an integer\ndef df_append_date_part(part, row, dt_format):\n    dt = datetime.strptime(row.date, dt_format)\n    return int(dt.strftime(part))\n\n\n# get year, month and day as integers\ndef df_get_date_parts(row, dt_format):\n    dt = datetime.strptime(row.date, dt_format)\n    y = int(dt.strftime('%Y'))\n    m = int(dt.strftime('%m'))\n    d = int(dt.strftime('%d'))\n    return y, m, d\n\n\n# return the distance (in au) between two bodies at 09:00 UTC on the draw date\ndef df_append_astro_distance(obj1, obj2, row, dt_format):\n    ts = load.timescale() \n    y, m, d = df_get_date_parts(row, dt_format)\n    t = ts.utc(y, m, d, 9, 0) \n    return obj1.at(t).observe(obj2).apparent().distance().au\n\n\n# return the decade bucket of a number (0 for 0-9, 1 for 10-19, ..., 9 for 90-99)\ndef df_append_range(row, num_idx):\n    j = 10\n    while j \u003c= 100:\n        if row[num_idx] \u003e= (j - 10) and row[num_idx] \u003c j:\n            return int((j - 10)/10)\n        j += 10\n```\n\n**3. 
Download CSV Files with Datasets**\n\nThe cell below will download CSV files containing the datasets. **Tip:** to avoid downloading new data and instead use the already downloaded files, this block should be commented out after the data has been downloaded.\n\n```python\nprint('Downloading datasets....')\n      \ncsv_dir_create(csv_dir)\ncsv_update(csv_urls, csv_dir)\n\n```\n\n**Output:**\n```\nDownloading datasets....\nDownloaded: ./csv/lotto.csv\nDownloaded: ./csv/lotto_plus.csv\nDownloaded: ./csv/eurojackpot.csv\nDownloaded: ./csv/minilotto.csv\nDownloaded: ./csv/multi.csv\n```\n\n**4. Extend the Dataset with Additional Fields**\n\nThe following code expands the dataset with new fields. Numerical values for the saved draw dates will be added, such as year, month, day, day of the week, and day of the year. Additionally, the distances to individual celestial bodies (Earth-Moon, Earth-Sun, Earth-Mars) at the time of each draw will be calculated and appended to the dataset. The dataset will also include fields that define the range of numbers at a given position.\n\n```python\n# get CSV config for selected lottery\ncfg = csv_config[lottery]\ndt_format = cfg['date_format']\nheader = cfg['header']\n\n# load CSV dataset and create Data Frame from it\ndf = csv_load(lottery, header, csv_dir)\n\n# append date parts as integers\ndf['year'] = df.apply(lambda row: df_append_date_part('%Y', row, dt_format), axis=1)\ndf['month'] = df.apply(lambda row: df_append_date_part('%m', row, dt_format), axis=1)\ndf['day'] = df.apply(lambda row: df_append_date_part('%d', row, dt_format), axis=1)\ndf['day_of_week'] = df.apply(lambda row: df_append_date_part('%w', row, dt_format), axis=1)\ndf['day_of_year'] = df.apply(lambda row: df_append_date_part('%j', row, dt_format), axis=1)\n\n# append distances from earth to moon, sun \u0026 mars\ndf['dist_moon_au'] = df.apply(lambda row: df_append_astro_distance(earth, moon, row, dt_format), axis=1)\ndf['dist_sun_au'] = df.apply(lambda row: 
df_append_astro_distance(earth, sun, row, dt_format), axis=1)\ndf['dist_mars_au'] = df.apply(lambda row: df_append_astro_distance(earth, mars, row, dt_format), axis=1)\n\n# append the decade range of each number at positions n1-n(i) to the matching n(i)r field\nlimit = cfg['n_count']\nif limit \u003e 0:\n    for i in range(1, limit+1):\n        range_field = 'n' + str(i) + 'r'\n        num_field = 'n' + str(i)           \n        df[range_field] = df.apply(lambda row: df_append_range(row, num_field), axis=1)\n\n# append the decade range of each number at positions m1-m(i) to the matching m(i)r field\nlimit = cfg['m_count']\nif limit \u003e 0:\n    for i in range(1, limit+1):\n        range_field = 'm' + str(i) + 'r'\n        num_field = 'm' + str(i)           \n        df[range_field] = df.apply(lambda row: df_append_range(row, num_field), axis=1)\n\n# save extended dataset with appended extra data\ncsv_save(df, lottery + '_extended', csv_dir)\n\n#df = df.iloc[4424:,:] # optionally truncate the dataset to a period in time\n```\n\n**5. Linear Regression Relationship Calculation**\n\nThe code below calculates the linear correlation (the `r` value returned by `linregress`) between pairs of fields, such as the distances between celestial bodies and the drawn numbers, and between the individual lottery numbers themselves. The correlation between the distances of celestial bodies and the numbers is expected to stay close to 0, indicating that these variables are not linearly related. 
A slightly higher correlation may appear when attempting to correlate the numbers drawn in the same draw with each other.\n\n```python\n# define relationship pairs to check\nrelations = {\n    'sun_n1': ['dist_sun_au', 'n1'],\n    'moon_n1': ['dist_moon_au', 'n1'],\n    'mars_n1': ['dist_mars_au', 'n1'],\n    'sun_n1r': ['dist_sun_au', 'n1r'],\n    'moon_n1r': ['dist_moon_au', 'n1r'],\n    'day_n1': ['day', 'n1'],\n    'month_n1': ['month', 'n1'],\n    'year_n1': ['year', 'n1'],\n    'dweek_n1': ['day_of_week', 'n1'],\n    'dweek_n1r': ['day_of_week', 'n1r'],\n    'dyear_n1': ['day_of_year', 'n1'],\n    'dyear_n1r': ['day_of_year', 'n1r'],\n    'n1_n2': ['n1', 'n2'],\n    'n2_n3': ['n2', 'n3'],\n    'n3_n4': ['n3', 'n4'],\n    'n4_n5': ['n4', 'n5'],\n    'n5_n6': ['n5', 'n6']\n}\n\n# calculate linear regression relationship between fields\nprint(\"[R] - Linear Regression Relationship:\\n\")\nrX = []\nrY = []\nfor name, item in relations.items():\n    x1 = item[0]\n    x2 = item[1]\n    if x1 in df and x2 in df:\n        slope, intercept, r, p, std_err = linregress(df[x1], df[x2])        \n        rX.append(name)\n        rY.append(r)\n        print(x1+' \u003e '+x2+': ' + str(r))\n        \nrY, rX = zip(*sorted(zip(rY, rX)))\n    \n# display relationships on plot\nfig = plt.figure()\nfig.set_figheight(5)\nfig.set_figwidth(15)\nax = fig.add_subplot(1, 1, 1)\nax.set_title('Relationship')\nax.set_ylabel('Pair')\nax.set_xlabel('R value')\nax.barh(rX, rY)\nplt.show()\n```\n\n**Output:**\n\n![relations](https://user-images.githubusercontent.com/61396542/181093223-68f3f239-8386-4502-8d74-d4c84ce5c8e2.png)\n\n**6. 
Distribution of the Frequency of Numbers in Specific Positions**\n\nThe following cell displays the frequency distribution of the drawn numbers, categorized by their positions.\n\n```python\nprint('Distribution of the Frequency of Numbers in Specific Positions:')\n\nnum_of_numbers = cfg['n_count']\ncols = 3\nrows = math.ceil(num_of_numbers/cols)\nrow = 0\ncol = 0\nf, ax = plt.subplots(rows, cols, figsize=(25, 15))\nfor i in range(1, num_of_numbers+1):\n    idx = 'n' + str(i)\n    data = df[idx].to_numpy()\n    sns.histplot(data, kde=True, ax=ax[row][col])\n    ax[row][col].set_title('Position: '+idx)\n    ax[row][col].set_xlabel('Number')\n    ax[row][col].set_ylabel('Count')\n    ax[row][col].axvline(x=data.mean(), color='red')\n    if col \u003e= (cols - 1):\n        col = 0\n        row+=1\n    else:\n        col+= 1\n        \nplt.show()\n```\n**Output:**\n\n\n![lot_nums](https://user-images.githubusercontent.com/61396542/181068072-fe838cab-af6e-41c8-a9dd-35c9316eedd7.png)\n\n\n**7. 
Display the Dataset**\n\nThe cell below displays the extended dataset that was prepared in the previous steps.\n\n```python\n# display dataset\nprint(\"[DATASET]\\n\")\nprint(df.to_string())\n```\n\n**Output:**\n\n![lot_dataset](https://user-images.githubusercontent.com/61396542/181068160-e7971313-7dc1-45d5-86b0-f1755d61ebdc.png)\n\n\n## Changelog\n\n**1.0.0** - First release (2022-07-26)\n\n**1.0.1** - Updated documentation (2024-08-26)\n\n--- \n**Notebook is free to use, but if you like it, you can support my work by buying me a coffee ;)**\n\nhttps://www.buymeacoffee.com/szczyglis\n\n**Enjoy!**\n\nMIT License | 2022 Marcin 'szczyglis' Szczygliński\n\nhttps://github.com/szczyglis-dev/python-lottery-dataset-analyze\n\nContact: szczyglis@protonmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszczyglis-dev%2Fpython-lottery-dataset-analyze","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fszczyglis-dev%2Fpython-lottery-dataset-analyze","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszczyglis-dev%2Fpython-lottery-dataset-analyze/lists"}