{"id":20048499,"url":"https://github.com/lefteris-souflas/redis-mongodb-assignment","last_synced_at":"2026-05-08T06:05:35.874Z","repository":{"id":229509898,"uuid":"776918818","full_name":"Lefteris-Souflas/Redis-MongoDB-Assignment","owner":"Lefteris-Souflas","description":"Analyzing classified ads data from the used motorcycles market. Tasks involve utilizing Redis Bitmaps for analytics on seller actions and MongoDB for analyzing bike listings. Includes data installation, cleaning, and analysis.","archived":false,"fork":false,"pushed_at":"2024-04-17T20:00:35.000Z","size":816,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-12T20:33:35.876Z","etag":null,"topics":["big-data-processing","bitmap","json","mongo-database","r","redis","redis-vs-rdbms-comparison"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lefteris-Souflas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-24T19:55:18.000Z","updated_at":"2024-04-17T20:01:18.000Z","dependencies_parsed_at":"2024-03-24T21:22:23.104Z","dependency_job_id":"c0469270-67fc-4c6e-8adf-7db7f8562745","html_url":"https://github.com/Lefteris-Souflas/Redis-MongoDB-Assignment","commit_stats":null,"previous_names":["codeninjatech/redis-mongodb-assignment","lefteris-souflas/redis-mongodb-assignment"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lefteris-Souflas%2FRedis-MongoD
B-Assignment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lefteris-Souflas%2FRedis-MongoDB-Assignment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lefteris-Souflas%2FRedis-MongoDB-Assignment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lefteris-Souflas%2FRedis-MongoDB-Assignment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lefteris-Souflas","download_url":"https://codeload.github.com/Lefteris-Souflas/Redis-MongoDB-Assignment/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241478061,"owners_count":19969214,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data-processing","bitmap","json","mongo-database","r","redis","redis-vs-rdbms-comparison"],"created_at":"2024-11-13T11:44:20.191Z","updated_at":"2026-05-08T06:05:30.830Z","avatar_url":"https://github.com/Lefteris-Souflas.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Redis \u0026 MongoDB Assignment\n\nAssignment 2 for the Big Data Systems \u0026 Architectures Course of AUEB's MSc in Business Analytics\n\n## Instructions\n\nYou are going to use REDIS and MongoDB to perform an analysis on data related to classified ads from the used motorcycles market.\n\n1. Install REDIS and MongoDB on your workstations. Version 4 of REDIS for Windows is available [here](https://github.com/tporadowski/redis/releases). 
If you have an older version, make sure that you upgrade since some of the commands needed for the assignment are not supported by older versions. The installation process is straightforward.\n2. Download the BIKES_DATASET.zip dataset from [here](https://drive.google.com/open?id=1m4W6anTDphWRnHDwsh-hlexOGrAkMrSq).\n3. Download the RECORDED_ACTIONS.zip dataset from [here](https://drive.google.com/open?id=1wyL8nQKDEu6rdr9BH6CgBwGnPnvRT8cJ).\n4. Do the tasks listed in the “TASKS” section.\n\n## Scenario\n\nYou are a data analyst at a consulting firm and you have access to a dataset of ~30K classified ads from the used motorcycles market. You also have access to some seller related actions that have been tracked in the previous months. You are asked to create a number of programs/queries for the tasks listed in the “TASKS” section.\n\n## Assignment Notes\n\n- You may work on any programming language of your choice. However, working with R is recommended, since the material uploaded on Moodle uses R in order to demonstrate Redis’ usage.\n- Assignment should be done in groups of two.\n- The dataset is in JSON format. It needs cleaning. You don’t need to follow the guidelines provided below. You may do the cleaning any way you like.\n- In your deliverable, you should include (along with your code) a report justifying the steps you took in order to perform the tasks. The report should be VERY brief.\n- Your code should be fully commented.\n- Optional tasks will have no effect on your final grade. However it’s strongly recommended that you at least try these out in order to understand the actual benefit of the tools/technologies that you are using.\n- You don’t have to follow the tips provided in the tasks. You can do it any way you prefer. 
However, they may come in handy.\n\n## Tasks\n\n### Task 1\n\nIn this task you are going to use the “recorded actions” dataset in order to generate some analytics with REDIS.\n\nAt the end of each month, the classifieds provider sends a personalized e-mail to some of the sellers with a number of suggestions on how they could improve their listings. Some e-mails may have been sent two or three times in the same month due to a technical issue. Not all users open these e-mails. However, we keep track of the e-mails that have been read by their recipients. Apart from that you are also given access to a dataset containing all the user ids along with a flag on whether they performed at least one modification on their listing for each month.\n\nIn brief, the datasets are the following:\n-\temails_sent.csv “Sets of EmailID, UserID, MonthID and EmailOpened”\n-\tmodified_listings.csv “Sets of UserID, MonthID, ModifiedListing”\n\nThe first dataset contains User IDs that have received an e-mail at least once. The second dataset contains all the User IDs of the classifieds provider and a flag that indicates whether the user performed a modification on his/her listing. Both datasets contain entries for the months January, February and March.\n\nYou are asked to answer a number of questions using REDIS Bitmaps. A Bitmap is the data structure that immediately pops in your head when the need is to map Boolean information for a huge domain into a compact representation. REDIS, being an in-memory data structure server, provides support for bit manipulation operations. However, there isn’t a special data structure for Bitmaps in REDIS. Rather, bit level operations are supported on the basic REDIS structure: Strings. Now, the maximum length for REDIS strings is 512 MB. Thus, the largest domain that REDIS can map as a Bitmap is 2^32 (512 MB = 2^29 bytes = 2^32 bits).\n\nBitmaps examples:\n\nLet’s take the following bitmap as an example. Each bit corresponds to a client. 
Our company has 8 clients in total. The value of 1 means that the client purchased something from our online store in August:\n\nAugustSales:\n\n| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |\n|---|---|---|---|---|---|---|---|\n\n-\tClients at the 0,3,5,6,7 positions did not purchase anything. \n-\tClients at the 1,2,4 positions did at least one transaction in August.\n\nLet’s add another bitmap to the example. It contains the September sales of the same company for the exact same clients: \n\nSeptemberSales:\n| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |\n|---|---|---|---|---|---|---|---|\n\n-\tClients at the 2,3,6,7 positions did not purchase anything. \n-\tClients at the 0,1,4,5 positions did at least one transaction in September.\n\nIn order to create a Bitmap in REDIS you may use the SETBIT command. The syntax of SETBIT is: \n\n\u003e SETBIT key offset value\n\nIn order to create the SeptemberSales Bitmap we should enter the following commands:\n\n\u003e SETBIT SeptemberSales 0 1\n\n\u003e SETBIT SeptemberSales 1 1\n\n\u003e SETBIT SeptemberSales 4 1\n\n\u003e SETBIT SeptemberSales 5 1\n\nHaving these Bitmaps at hand, makes it very easy for us to calculate things like:\n-\tWhich clients ordered at least once for two months in a row?\n-\tWhich clients have not placed any orders within these two months?\n\nThis can be achieved with the use of bit-wise logical operations.\n\nFor example, in order to find out the clients that ordered at least once every month, we could perform an “AND” bitwise operation:\n\nAugustSales AND SeptemberSales:\n| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |\n|---|---|---|---|---|---|---|---|\n\nIn REDIS the following bitwise operations are supported:\n\n| BITWISE OPERATION | PERFORMANCE |\n| :---: | :---: |\n| AND | A bitwise AND performs the logical AND operation on each pair of the corresponding bits. If both bits are 1, the bit in the resulting binary representation is 1 (1 \u0026 1 = 1); otherwise, the result is 0 (1 \u0026 0 = 0 and 0 \u0026 0 = 0). 
For example: 0101 AND 0011 = 0001 |\n| OR | A bitwise OR performs the logical inclusive OR operation on each pair of corresponding bits. The result is 0 if both bits are 0; otherwise, the result is 1. For example: 0101 OR 0011 = 0111 |\n| XOR | A bitwise XOR performs the logical exclusive OR operation on each pair of corresponding bits. The result is 1 if only the first bit is 1 or only the second bit is 1, but will be 0 if both are 0 or both are 1. For example: 0101 XOR 0011 = 0110 |\n| NOT | The bitwise NOT, performs logical negation on each bit. Bits that are 0 become 1, and those that are 1 become 0. For example: NOT 0111 =\u003e 1000 |\n\nThese operations are performed with the BITOP command. The results of each command are written in a new key.\n\nThe syntax is as follows:\n\n| BITWISE OPERATION |\tSYNTAX | EXAMPLE |\n| :---: | :---: | :---: |\n| AND |\tBITOP AND destkey srckey1 srckey2 | BITOP AND results AugustSales SeptemberSales |\n| OR |\tBITOP OR destkey srckey1 srckey2 | BITOP OR results AugustSales SeptemberSales |\n| XOR |\tBITOP XOR destkey srckey1 srckey2 |\tBITOP XOR results AugustSales SeptemberSales |\n| NOT |\tBITOP NOT destkey srckey | BITOP NOT results AugustSales |\n\nIn all these examples, the results will be written in the key “results”.\n\nIn order to count the number of “1”s in a key, we may use the BITCOUNT command. So, in order to count the number of clients that ordered in August, we would do: \n\n\u003e BITCOUNT AugustSales\n\nNow that you are familiar with all the theory and tools that you need to work with Bitmaps in REDIS, let’s proceed with your assignment.\n\nGeneral Note: Some users may have received more than one e-mail in the same month. If a client opened at least one of the e-mails that she/he received in the same month then we will classify this client as having opened this month’s newsletter.\n\nProvide answers for the following questions:\n\n1. How many users modified their listing on January? 
\nTip: Create a BITMAP called “ModificationsJanuary” and use “SETBIT -\u003e 1” for each user that modified their listing. Use BITCOUNT to calculate the answer.\n2.\tHow many users did NOT modify their listing in January?\nTip: Use “BITOP NOT” to perform inversion on the “ModificationsJanuary” BITMAP and use BITCOUNT to calculate the answer. Combine the results with the answer to 1.1. Do these numbers add up to the total number of users? Even if they don’t, an explanation of why this happens will give you the full grade. Keep in mind that all BITOP operations happen at byte-level increments.\n3.\tHow many users received at least one e-mail per month (at least one e-mail in January and at least one e-mail in February and at least one e-mail in March)?\nTip: Create three BITMAPs “EmailsJanuary”, “EmailsFebruary” and “EmailsMarch”. Fill these with “SETBIT” and use “BITOP AND” followed by “BITCOUNT” in order to calculate the answer.\n4.\tHow many users received an e-mail in January and March but NOT in February?\nTip: Perform “BITOP AND” on “EmailsJanuary” and “EmailsMarch”. Perform an inversion of “EmailsFebruary” and use “BITOP AND” as well.\n5.\tHow many users received an e-mail in January that they did not open but they updated their listing anyway?\nTip: Create a new BITMAP “EmailsOpenedJanuary”.\n6.\tHow many users received an e-mail in January that they did not open but they updated their listing anyway in January OR they received an e-mail in February that they did not open but they updated their listing anyway in February OR they received an e-mail in March that they did not open but they updated their listing anyway in March?\nTip: Create two new BITMAPs “EmailsOpenedFebruary” and “EmailsOpenedMarch”. Do the same thing you did in 1.5 and calculate the answer using “BITOP OR”.\n7.\tDoes it make any sense to keep sending e-mails with recommendations to sellers? Does this strategy really work? 
How would you describe this in terms a business person would understand?\nTip: You may use the findings of the previous questions or calculate anything else you want in order to justify your answer. \n8.\t(Optional Task) Do the previous subtasks again by using any type of relational or non-relational database. Compare the complexity of the solutions. Then benchmark the query execution time for the dataset that you have. At last, boost the number of entries to 1 billion rows (create your own dummy entries). Perform the benchmark again.\n\n### Task 2\n\nIn this task you are going to use the “bikes” dataset in order to generate some analytics with MongoDB.\n\n1.\tAdd your data to MongoDB.\n- [x] **Tip 1**: You are free to structure your data whatever way you see fit. Before deciding on that, read the other tasks below. These will help you in order to decide on the data cleaning actions that you need to perform. You are allowed to perform any data cleaning actions you like. Please document all the actions that you performed (briefly) along with the reasoning behind any of your actions. The dataset is not clean. You might need to remove entries (or edit them) in order to maintain a clean database.\n- [x] **Tip 2**: You will need to read all the files from R, do some cleaning and then add the data to MongoDB. When dealing with files split in that many folders, there are two options on how you read these files. The first (simpler) option is to write some code that will recursively read each folder, discover the files and bring them to memory. If you choose this route, every time you execute your code, the files have to be re-discovered. This takes time. Another option is to generate a list with all the paths of the files and use this file as an index whenever you need to do some kind of manipulation (read/write) to these files. This approach will be much faster. 
In order to build that file, open the folder through a terminal and run the following command:\n  - Windows PowerShell: `dir -Recurse -Name -File \u003e files_list.txt`\n  - Unix Terminal: `find * | grep json \u003e files_list.txt`\n  - Windows CMD: `dir /a-D /S /B \u003e files_list.txt`\n\n  Now, the only thing you have to do through your code is read the “files_list.txt” line by line and load the file that is in the path contained in each row. No time will be spent in order to discover the files in case you want to re-run your code. In this assignment, you have a total of ~30K files. In a real-life scenario this number could have been several million. In this case, the second option would most probably be your only option.\n- [x] **Tip 3**: You will need to work on your data prior to writing to the database. Code samples of working with MongoDB through R are available in the “mongo.r” file. Use this as a reference along with the documentation of the package used.\n\n2.\tHow many bikes are there for sale?\n\n3.\tWhat is the average price of a motorcycle (give a number)? What is the number of listings that were used in order to calculate this average (give a number as well)? Is the number of listings used the same as the answer in 2.2? Why?\n\n4.\tWhat are the maximum and minimum prices of a motorcycle currently available in the market?\n- [x] **Tip**: The numbers should make sense.\n\n5.\tHow many listings have a price that is identified as negotiable?\n- [x] **Tip**: Search for the word “Negotiable” in the ad.\n\n6.\t(Optional) For each Brand, what percentage of its listings is listed as negotiable?\n\n7.\t(Optional) What is the motorcycle brand with the highest average price?\n\n8.\t(Optional) What are the TOP 10 models with the highest average age? (Round age to one decimal place)\n- [x] **Tip**: Calculate age based on registration date. You don’t need to take into account the months (only years). Group by model, calculate AVG Age and then Sort. Keep the TOP 10. 
In case of draws, treat it the same way you would in a real-life scenario. \n\n9.\t(Optional) How many bikes have “ABS” as an extra? \n\n10.\t(Optional) What is the average Mileage of bikes that have “ABS” AND “Led lights” as an extra?\n\n11.\t(Optional) What are the TOP 3 colors per bike category?\n\n12.\t(Optional) Identify a set of ads that you consider “Best Deals”. \n- [x] **Tip**: Describe “why” in a manner that a business person would understand. Justify your decision with actual data. Even though it’s not really needed, you are free to use external data sources.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flefteris-souflas%2Fredis-mongodb-assignment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flefteris-souflas%2Fredis-mongodb-assignment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flefteris-souflas%2Fredis-mongodb-assignment/lists"}